🧪 Experiment Campaign: copilot-opt
Workflow file: .github/workflows/copilot-opt.md
Selected dimension: engine_variant
Triggered by: ab-testing-advisor on 2026-06-15
Background
copilot-opt is a weekly, 5-phase Copilot session analysis workflow that parses events.jsonl logs, detects performance bottlenecks, runs mandatory PR cross-analysis, and creates exactly 3 evidence-backed optimization issues per run. It currently runs exclusively on the copilot engine (copilot-sdk: true, strict: true). With 9 consecutive successful runs averaging 28.7 min against a 30-minute agent timeout, the engine choice directly constrains how much analysis completes within budget — testing claude may reveal faster completion or deeper recommendation quality for this compute-heavy workflow.
Hypothesis
- H0 (null): Engine choice (copilot vs claude) does not affect run_duration_minutes or issue_creation_success_rate.
- H1 (alternative): The claude engine reduces mean run duration by ≥20% (from 28.7 min to ≤23 min) while maintaining a 100% issue creation success rate and equal or greater recommendation depth.
Experiment Configuration
Add the following experiments: block to the workflow frontmatter:
experiments:
  engine_variant:
    variants: [copilot, claude]
    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
    hypothesis: "H0: no change in run_duration_minutes or issue_creation_success_rate. H1: claude reduces mean run duration by >=20% while maintaining 100% issue creation rate"
    metric: run_duration_minutes
    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
    guardrail_metrics:
      - name: issue_creation_success_rate
        direction: min
        threshold: 1.0
    min_samples: 10
    weight: [50, 50]
    start_date: "2026-06-16"
    analysis_type: t_test
    tags: [engine_comparison, cost_efficiency, session_analysis]
    notify:
      issue: #aw_campaign
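Before compiling, the block above can be sanity-checked for internal consistency. The following is a minimal sketch, not part of gh-aw: the function `validate_experiment` and its rules (weights are percentages summing to 100, every guardrail names a tracked metric) are assumptions made for illustration; only the field names come from the configuration itself.

```python
# Hypothetical pre-compile check. Field names mirror the experiments: block;
# the validation rules are assumptions, not documented gh-aw behavior.

def validate_experiment(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    errors = []
    variants = cfg.get("variants", [])
    weights = cfg.get("weight", [])

    if len(variants) < 2:
        errors.append("an experiment needs at least two variants")
    if len(weights) != len(variants):
        errors.append("weight must have one entry per variant")
    elif sum(weights) != 100:  # assumption: weights are percentages
        errors.append("weights must sum to 100")
    if cfg.get("min_samples", 0) < 1:
        errors.append("min_samples must be a positive integer")

    # A guardrail should refer to a metric the experiment actually tracks.
    known_metrics = {cfg.get("metric"), *cfg.get("secondary_metrics", [])}
    for guardrail in cfg.get("guardrail_metrics", []):
        name = guardrail.get("name")
        if guardrail.get("direction") not in ("min", "max"):
            errors.append(f"guardrail {name!r}: direction must be min or max")
        if name not in known_metrics:
            errors.append(f"guardrail {name!r} is not a tracked metric")
    return errors


# The engine_variant experiment, transcribed as a plain dict.
engine_variant = {
    "variants": ["copilot", "claude"],
    "metric": "run_duration_minutes",
    "secondary_metrics": [
        "credit_spend",
        "issue_creation_success_rate",
        "recommendation_depth",
    ],
    "guardrail_metrics": [
        {"name": "issue_creation_success_rate", "direction": "min", "threshold": 1.0}
    ],
    "min_samples": 10,
    "weight": [50, 50],
}

problems = validate_experiment(engine_variant)
print(problems or "config looks consistent")
```

Returning a list of messages rather than raising on the first problem lets all inconsistencies be reported in a single pass.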
Variant descriptions:
copilot: Baseline — current engine with copilot-sdk: true, all existing behavior unchanged.
claude: Alternative — switches to the claude engine; a single variant-context note is injected at Phase 1 to flag that Copilot-SDK-specific session-enrichment steps do not apply, so the agent relies on standard log parsing.
Workflow Changes Required
View before/after diff
1. Frontmatter — add experiments: block and make engine: conditional
+experiments:
+  engine_variant:
+    variants: [copilot, claude]
+    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
+    metric: run_duration_minutes
+    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
+    guardrail_metrics:
+      - name: issue_creation_success_rate
+        direction: min
+        threshold: 1.0
+    min_samples: 10
+    weight: [50, 50]
+    start_date: "2026-06-16"
+    analysis_type: t_test
+    tags: [engine_comparison, cost_efficiency, session_analysis]
+    notify:
+      issue: <campaign-issue-number>
 engine:
-  id: copilot
-  copilot-sdk: true
+{{#if experiments.engine_variant == "claude" }}
+  id: claude
+{{else}}
+  id: copilot
+  copilot-sdk: true
+{{/if}}
2. Markdown body — add variant context note at the start of Phase 1
## Phase 1 — Ingestion and Normalization
+{{#if experiments.engine_variant == "claude" }}
+> **Variant note**: Running as `claude` engine. Copilot-SDK-specific session enrichment
+> steps are not applicable; rely on standard log parsing. Apply extended chain-of-thought
+> reasoning when inferring latency patterns from sparse or malformed event data.
+{{/if}}
+
1. For each in-scope session, locate one of:
Note on engine: frontmatter conditionals: The {{#if}} blocks above are processed by gh aw compile at compile time. If the compiler does not yet support Handlebars inside the YAML frontmatter block, the engine: switching should be implemented as a compiler-side feature that maps experiments.engine_variant to the engine.id field, following the pattern of other compile-time experiment expansions.
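The compiler-side fallback described in this note amounts to a lookup from the assigned variant to an engine: block. A minimal sketch of that mapping follows; `ENGINE_BY_VARIANT`, `resolve_engine`, and `apply_variant` are hypothetical names, and the real compiler may structure this differently. Only the engine settings themselves are taken from the diff.

```python
# Hypothetical compile-time mapping from experiment variant to engine: block.
# Engine settings come from the diff; the functions are illustrative only.

ENGINE_BY_VARIANT = {
    "copilot": {"id": "copilot", "copilot-sdk": True},  # baseline, unchanged
    "claude": {"id": "claude"},                         # no Copilot SDK options
}


def resolve_engine(variant: str) -> dict:
    """Return a fresh engine config for the variant, rejecting unknown names."""
    try:
        return dict(ENGINE_BY_VARIANT[variant])
    except KeyError:
        raise ValueError(f"unknown engine_variant: {variant!r}") from None


def apply_variant(frontmatter: dict, variant: str) -> dict:
    """Return a copy of the frontmatter with engine: replaced for the variant."""
    updated = dict(frontmatter)
    updated["engine"] = resolve_engine(variant)
    return updated


baseline = {"engine": {"id": "copilot", "copilot-sdk": True}, "strict": True}
print(apply_variant(baseline, "claude")["engine"])
```

Failing loudly on an unknown variant keeps a typo in the variants list from silently falling back to the baseline engine and contaminating the comparison.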
Success Metrics
| Metric | Type | Target |
| --- | --- | --- |
| run_duration_minutes | Primary | ≤23.0 min for claude vs 28.7 min baseline |
| credit_spend | Secondary | Directional reduction signal |
| recommendation_depth (chars/issue body) | Secondary | Must not decrease vs baseline |
| issue_creation_success_rate | Guardrail | Must remain = 1.0 (all 3 issues created every run) |
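The targets in this table can be expressed as a simple pass/fail check over per-run records. The sketch below is illustrative only: the record fields (`issues_created`, `issue_body_chars`) and the sample values are made up for the example and are not real run data or a documented gh-aw schema.

```python
from statistics import mean

TARGET_MEAN_MIN = 23.0  # primary target for the claude variant
ISSUES_PER_RUN = 3      # every run must create exactly 3 issues


def summarize(runs):
    """Aggregate per-run records into the metrics tracked by the campaign."""
    return {
        "mean_duration": mean(r["run_duration_minutes"] for r in runs),
        "success_rate": sum(r["issues_created"] == ISSUES_PER_RUN for r in runs) / len(runs),
        "mean_depth": mean(r["issue_body_chars"] for r in runs),
    }


def evaluate(baseline_runs, candidate_runs):
    """Check the candidate against the guardrail, primary, and depth targets."""
    base, cand = summarize(baseline_runs), summarize(candidate_runs)
    return {
        "guardrail_ok": cand["success_rate"] == 1.0,
        "primary_ok": cand["mean_duration"] <= TARGET_MEAN_MIN,
        "depth_ok": cand["mean_depth"] >= base["mean_depth"],
    }


# Illustrative records only (hypothetical field names and values).
copilot_runs = [
    {"run_duration_minutes": 29.0, "issues_created": 3, "issue_body_chars": 4000},
    {"run_duration_minutes": 27.0, "issues_created": 3, "issue_body_chars": 4200},
]
claude_runs = [
    {"run_duration_minutes": 22.0, "issues_created": 3, "issue_body_chars": 4300},
    {"run_duration_minutes": 23.5, "issues_created": 3, "issue_body_chars": 4100},
]

print(evaluate(copilot_runs, claude_runs))
```

This check only compares point estimates against the targets; whether a difference in duration is statistically meaningful is a separate question for the t-test.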
Statistical Design
View sample size calculation
- Variants: copilot (baseline), claude
- Assignment: Round-robin via gh-aw experiments runtime (cache-based)
- Historical baseline: 9 runs, mean = 28.7 min, σ ≈ 7 min (range 15.5–41.0 min), 9/9 success
- Minimum detectable effect: 20% = 5.7 min
- Sample size (two-sample Welch's t-test, α = 0.05, power = 80%):
  - $n = 2\sigma^2 (z_{0.975} + z_{0.80})^2 / \delta^2 = 2 \times 49 \times 7.85 / 32.5 \approx 24$ per variant, so ≈48 runs in total (this formula gives the per-group size, not the combined total)
- Minimum runs per variant: 10 (the configured min_samples, a pragmatic floor given the weekly cadence; at n = 10 per variant the detectable effect at 80% power is roughly 8.8 min, or about 31%, rather than 20%)
- Run frequency: ~1/week; 50/50 split → ~0.5 runs/variant/week
- Expected experiment duration: ~20 weeks to reach 10 runs per variant (~5 months, targeting ~November 2026); reaching 24 per variant would take ~48 weeks
- Analysis approach: Welch's two-sample t-test for run_duration_minutes; two-proportion z-test for issue_creation_success_rate
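The sample-size arithmetic and the Welch statistic can be reproduced with the Python standard library. This is a sketch under the normal-approximation formula quoted in the list (which yields the size of each group); the function names are illustrative, and σ = 7 min and δ = 5.7 min are the historical estimates stated above.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, variance


def z_sum(alpha=0.05, power=0.80):
    """z_{1-alpha/2} + z_{power} for a two-sided test."""
    std = NormalDist()
    return std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)


def n_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Runs needed in EACH group to detect a mean difference of delta."""
    return ceil(2 * sigma**2 * z_sum(alpha, power) ** 2 / delta**2)


def detectable_effect(sigma, n, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with n runs per group."""
    return sigma * z_sum(alpha, power) * sqrt(2 / n)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


sigma, delta = 7.0, 5.7
print(n_per_variant(sigma, delta))             # 24 runs per variant
print(round(detectable_effect(sigma, 10), 1))  # 8.8 min at 10 runs per variant
```

Turning the statistic from `welch_t` into a p-value requires the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full Welch test if SciPy is available.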
Implementation Steps
- Add the experiments: section to frontmatter
- Make engine: conditional with {{#if experiments.engine_variant == "claude" }}...{{else}}...{{/if}} (value-comparison form; never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
- Add the Phase 1 variant note inside {{#if experiments.engine_variant == "claude" }}
- Run gh aw compile copilot-opt to regenerate lock file
- /tmp/gh-aw/agent/experiments/state.json
- daily-experiment-report workflow artifacts
References
- .github/workflows/copilot-opt.md
Generated by 🧪 Daily A/B Testing Advisor · 511.1 AIC · ⌖ 22.6 AIC · ⊞ 22.4K · ◷