[ab-advisor] Experiment campaign for copilot-opt: A/B test engine_variant #39370

### 🧪 Experiment Campaign: copilot-opt

**Workflow file**: `.github/workflows/copilot-opt.md`
**Selected dimension**: `engine_variant`
**Triggered by**: `ab-testing-advisor` on 2026-06-15

---

### Background

`copilot-opt` is a weekly, 5-phase Copilot session analysis workflow. It parses `events.jsonl` logs, detects performance bottlenecks, runs a mandatory PR cross-analysis, and creates exactly 3 evidence-backed optimization issues per run. It currently runs exclusively on the `copilot` engine (`copilot-sdk: true`, `strict: true`). Its 9 consecutive successful runs averaged 28.7 min against a 30-minute agent timeout, so the engine choice directly constrains how much analysis completes within budget. Testing `claude` may reveal faster completion or higher-quality recommendations for this compute-heavy workflow.

### Hypothesis

- **H0 (null)**: Engine choice (`copilot` vs `claude`) does not affect `run_duration_minutes` or `issue_creation_success_rate`.
- **H1 (alternative)**: The `claude` engine reduces mean run duration by ≥20% (from 28.7 min to ≤23 min) while maintaining a 100% issue creation success rate and equal or greater recommendation depth.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  engine_variant:
    variants: [copilot, claude]
    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
    hypothesis: "H0: no change in run_duration_minutes or issue_creation_success_rate. H1: claude reduces mean run duration by >=20% while maintaining 100% issue creation rate"
    metric: run_duration_minutes
    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
    guardrail_metrics:
      - name: issue_creation_success_rate
        direction: min
        threshold: 1.0
    min_samples: 10
    weight: [50, 50]
    start_date: "2026-06-16"
    analysis_type: t_test
    tags: [engine_comparison, cost_efficiency, session_analysis]
    notify:
      issue: "#aw_campaign"
    issue: "#aw_campaign"
```

**Variant descriptions**:
- `copilot`: Baseline — current engine with `copilot-sdk: true`, all existing behavior unchanged.
- `claude`: Alternative — switches to the `claude` engine; a single variant-context note is injected at Phase 1 to flag that Copilot-SDK-specific session-enrichment steps do not apply, so the agent relies on standard log parsing.

### Workflow Changes Required

<details><summary>View before/after diff</summary>

**1. Frontmatter — add `experiments:` block and make `engine:` conditional**

```diff
+experiments:
+  engine_variant:
+    variants: [copilot, claude]
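+    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
+    hypothesis: "H0: no change in run_duration_minutes or issue_creation_success_rate. H1: claude reduces mean run duration by >=20% while maintaining 100% issue creation rate"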
+    metric: run_duration_minutes
+    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
+    guardrail_metrics:
+      - name: issue_creation_success_rate
+        direction: min
+        threshold: 1.0
+    min_samples: 10
+    weight: [50, 50]
+    start_date: "2026-06-16"
+    analysis_type: t_test
+    tags: [engine_comparison, cost_efficiency, session_analysis]
+    notify:
+      issue: <campaign-issue-number>
+    issue: <campaign-issue-number>
 engine:
-  id: copilot
-  copilot-sdk: true
+{{#if experiments.engine_variant == "claude" }}
+  id: claude
+{{else}}
+  id: copilot
+  copilot-sdk: true
+{{/if}}
```

**2. Markdown body — add variant context note at the start of Phase 1**

```diff
 ## Phase 1 — Ingestion and Normalization
 
+{{#if experiments.engine_variant == "claude" }}
+> **Variant note**: Running as `claude` engine. Copilot-SDK-specific session enrichment
+> steps are not applicable; rely on standard log parsing. Apply extended chain-of-thought
+> reasoning when inferring latency patterns from sparse or malformed event data.
+{{/if}}
+
 1. For each in-scope session, locate one of:
```

> **Note on `engine:` frontmatter conditionals**: The `{{#if}}` blocks above are processed by `gh aw compile` at compile time. If the compiler does not yet support Handlebars inside the YAML frontmatter block, the `engine:` switching should be implemented as a compiler-side feature that maps `experiments.engine_variant` to the `engine.id` field, following the pattern of other compile-time experiment expansions.
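
To make the compiler-side alternative in the note concrete, here is a minimal Python sketch of mapping `experiments.engine_variant` to the `engine` block. It is an illustration only, not how `gh aw compile` is implemented: `ENGINE_BY_VARIANT` and `resolve_engine` are hypothetical names, and the frontmatter is assumed to be already parsed into a plain dictionary.

```python
from copy import deepcopy

# Hypothetical variant-to-engine table, taken from the diff above.
ENGINE_BY_VARIANT = {
    "copilot": {"id": "copilot", "copilot-sdk": True},
    "claude": {"id": "claude"},
}


def resolve_engine(frontmatter: dict, variant: str) -> dict:
    """Return a copy of the frontmatter with `engine` set for the given variant."""
    experiment = frontmatter.get("experiments", {}).get("engine_variant")
    if experiment is None:
        raise KeyError("frontmatter has no experiments.engine_variant block")
    if variant not in experiment["variants"]:
        raise ValueError(f"unknown variant: {variant!r}")
    resolved = deepcopy(frontmatter)  # leave the caller's frontmatter untouched
    resolved["engine"] = dict(ENGINE_BY_VARIANT[variant])
    return resolved


frontmatter = {
    "experiments": {"engine_variant": {"variants": ["copilot", "claude"]}},
    "engine": {"id": "copilot", "copilot-sdk": True},
}
print(resolve_engine(frontmatter, "claude")["engine"])
```

Resolving the engine from a lookup table keeps the YAML frontmatter free of template syntax, which is the concern the note raises.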

</details>

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| `run_duration_minutes` | Primary | ≤23.0 min for `claude` vs 28.7 min baseline |
| `credit_spend` | Secondary | Directional reduction signal |
| `recommendation_depth` (chars/issue body) | Secondary | Must not decrease vs baseline |
| `issue_creation_success_rate` | Guardrail | Must remain = 1.0 (all 3 issues created every run) |
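
The targets in the table can be checked descriptively once per-run records exist. The sketch below is one possible way to do that in Python. The `Run` record, its field names, and the sample values are assumptions for illustration, not the format of the experiment artifact, and the result is a plain threshold comparison rather than a significance test.

```python
from dataclasses import dataclass
from statistics import mean

EXPECTED_ISSUES = 3         # the workflow creates exactly 3 issues per run
DURATION_TARGET_MIN = 23.0  # primary target for the claude variant


@dataclass
class Run:
    # Hypothetical per-run record; field names are illustrative.
    variant: str
    run_duration_minutes: float
    credit_spend: float
    issues_created: int
    recommendation_depth: float  # characters per issue body


def summarize(runs, variant):
    """Average each metric over the runs assigned to one variant."""
    rows = [r for r in runs if r.variant == variant]
    if not rows:
        raise ValueError(f"no runs recorded for variant {variant!r}")
    return {
        "run_duration_minutes": mean(r.run_duration_minutes for r in rows),
        "credit_spend": mean(r.credit_spend for r in rows),
        "recommendation_depth": mean(r.recommendation_depth for r in rows),
        # A run counts as a success only if all expected issues were created.
        "issue_creation_success_rate": mean(
            1.0 if r.issues_created == EXPECTED_ISSUES else 0.0 for r in rows
        ),
    }


def evaluate(runs):
    """Compare the claude variant against the table's targets and the baseline."""
    base = summarize(runs, "copilot")
    cand = summarize(runs, "claude")
    return {
        "primary_met": cand["run_duration_minutes"] <= DURATION_TARGET_MIN,
        "credit_spend_reduced": cand["credit_spend"] < base["credit_spend"],
        "depth_maintained": cand["recommendation_depth"] >= base["recommendation_depth"],
        "guardrail_ok": cand["issue_creation_success_rate"] == 1.0,
    }


# Made-up sample values, for illustration only.
sample = [
    Run("copilot", 28.7, 10.0, 3, 1800.0),
    Run("claude", 22.5, 9.0, 3, 1900.0),
]
print(evaluate(sample))
```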

### Statistical Design

<details><summary>View sample size calculation</summary>

- **Variants**: `copilot` (baseline), `claude`
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Historical baseline**: 9 runs, mean = 28.7 min, σ ≈ 7 min (range 15.5–41.0 min), 9/9 success
- **Minimum detectable effect**: 20% = 5.7 min
- **Sample size** (two-sample Welch's t-test, α = 0.05, power = 80%):
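  - $n = \frac{2\sigma^2 (z_{0.975} + z_{0.80})^2}{\delta^2} = \frac{2 \times 49 \times 7.84}{32.5} \approx 24$ runs **per variant** (≈48 total); this formula gives the size of each group, not the combined total
- **Minimum runs per variant**: 10 (the configured `min_samples`, a pragmatic floor given the weekly cadence; at n = 10 per variant the same formula gives 80% power only for differences of about 8.8 min, roughly 31% of baseline)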
- **Run frequency**: ~1/week; 50/50 split → ~0.5 runs/variant/week
- **Expected experiment duration**: ~20 weeks (~5 months, targeting ~November 2026)
- **Analysis approach**: Welch's two-sample t-test for `run_duration_minutes`; two-proportion z-test for `issue_creation_success_rate`
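
The sample-size arithmetic can be reproduced with the Python standard library. This is a sketch under the normal approximation, using the inputs listed above (σ ≈ 7 min, a 20% effect on a 28.7 min baseline, α = 0.05, power = 80%). The function names are illustrative, and the formula is read in its usual sense, where n is the number of runs in each group.

```python
import math
from statistics import NormalDist


def per_variant_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Runs needed in EACH variant for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


def detectable_effect(sigma, n_per_variant, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with the given runs per variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return sigma * (z_alpha + z_beta) * math.sqrt(2 / n_per_variant)


sigma = 7.0           # historical standard deviation, minutes
delta = 0.20 * 28.7   # 20% of the 28.7 min baseline, about 5.7 min

n = per_variant_sample_size(sigma, delta)
print(n)                                         # 24 runs per variant
print(2 * n)                                     # 48 weeks at ~1 run/week split 50/50
print(round(detectable_effect(sigma, 10), 1))    # 8.8 min at 10 runs per variant
```

The second function shows the trade-off of stopping at 10 runs per variant: the experiment finishes in about 20 weeks, but only effects of roughly 8.8 min or more would be detected with 80% power.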

</details>

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter
- [ ] Make `engine:` conditional with `{{#if experiments.engine_variant == "claude" }}...{{else}}...{{/if}}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- [ ] Add variant context note in Phase 1 body using `{{#if experiments.engine_variant == "claude" }}`
- [ ] Run `gh aw compile copilot-opt` to regenerate lock file
- [ ] Monitor experiment artifact uploaded per run to `/tmp/gh-aw/agent/experiments/state.json`
- [ ] After 10 runs per variant, analyze via `daily-experiment-report` workflow artifacts
- [ ] Document findings and promote winning variant
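
For the analysis step, Welch's t statistic and its degrees of freedom can be computed from the two duration samples with the standard library alone. The sketch below assumes the durations have already been extracted into two lists; the sample values are made up, and the `welch_t` helper is illustrative rather than part of `gh-aw`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least 2 observations")
    va = variance(a) / na  # squared standard error of mean(a)
    vb = variance(b) / nb  # squared standard error of mean(b)
    se2 = va + vb
    if se2 == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2**2 / (va**2 / (na - 1) + vb**2 / (nb - 1))
    return t, df


# Made-up run durations in minutes, for illustration only.
copilot_minutes = [30.0, 28.0, 32.0, 26.0]
claude_minutes = [22.0, 24.0, 20.0, 26.0]

t, df = welch_t(copilot_minutes, claude_minutes)
print(f"t = {t:.3f}, df = {df:.1f}")
```

Turning `t` and `df` into a p-value needs the t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.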

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/copilot-opt.md`







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/27546492220) · 511.1 AIC · ⌖ 22.6 AIC · ⊞ 22.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires  on Jun 29, 2026, 4:49 AM UTC-08:00
