[ab-advisor] Experiment campaign for copilot-opt: A/B test engine_variant #39370

@github-actions

🧪 Experiment Campaign: copilot-opt

Workflow file: .github/workflows/copilot-opt.md
Selected dimension: engine_variant
Triggered by: ab-testing-advisor on 2026-06-15


Background

copilot-opt is a weekly, 5-phase Copilot session analysis workflow that parses events.jsonl logs, detects performance bottlenecks, runs mandatory PR cross-analysis, and creates exactly 3 evidence-backed optimization issues per run. It currently runs exclusively on the copilot engine (copilot-sdk: true, strict: true). The last 9 consecutive runs all succeeded, averaging 28.7 min against a 30-minute agent timeout, which leaves about 1.3 min of headroom on an average run. Engine choice therefore directly constrains how much analysis completes within budget, and testing claude may reveal faster completion or deeper recommendations for this compute-heavy workflow.

Hypothesis

  • H0 (null): Engine choice (copilot vs claude) does not affect run_duration_minutes or issue_creation_success_rate.
  • H1 (alternative): The claude engine reduces mean run duration by ≥20% (from 28.7 min to ≤23 min) while maintaining a 100% issue creation success rate and equal or greater recommendation depth.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  engine_variant:
    variants: [copilot, claude]
    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
    hypothesis: "H0: no change in run_duration_minutes or issue_creation_success_rate. H1: claude reduces mean run duration by >=20% while maintaining 100% issue creation rate"
    metric: run_duration_minutes
    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
    guardrail_metrics:
      - name: issue_creation_success_rate
        direction: min
        threshold: 1.0
    min_samples: 10
    weight: [50, 50]
    start_date: "2026-06-16"
    analysis_type: t_test
    tags: [engine_comparison, cost_efficiency, session_analysis]
    notify:
      issue: "#aw_campaign"
    issue: "#aw_campaign"

Variant descriptions:

  • copilot: Baseline — current engine with copilot-sdk: true, all existing behavior unchanged.
  • claude: Alternative — switches to the claude engine; a single variant-context note is injected at Phase 1 to flag that Copilot-SDK-specific session-enrichment steps do not apply, so the agent relies on standard log parsing.

Workflow Changes Required

View before/after diff

1. Frontmatter — add experiments: block and make engine: conditional

+experiments:
+  engine_variant:
+    variants: [copilot, claude]
+    description: "Compare copilot and claude engines on multi-phase session analysis efficiency and recommendation quality"
+    hypothesis: "H0: no change in run_duration_minutes or issue_creation_success_rate. H1: claude reduces mean run duration by >=20% while maintaining 100% issue creation rate"
+    metric: run_duration_minutes
+    secondary_metrics: [credit_spend, issue_creation_success_rate, recommendation_depth]
+    guardrail_metrics:
+      - name: issue_creation_success_rate
+        direction: min
+        threshold: 1.0
+    min_samples: 10
+    weight: [50, 50]
+    start_date: "2026-06-16"
+    analysis_type: t_test
+    tags: [engine_comparison, cost_efficiency, session_analysis]
+    notify:
+      issue: <campaign-issue-number>
+    issue: <campaign-issue-number>
 engine:
-  id: copilot
-  copilot-sdk: true
+{{#if experiments.engine_variant == "claude" }}
+  id: claude
+{{else}}
+  id: copilot
+  copilot-sdk: true
+{{/if}}

2. Markdown body — add variant context note at the start of Phase 1

 ## Phase 1 — Ingestion and Normalization
 
+{{#if experiments.engine_variant == "claude" }}
+> **Variant note**: Running as `claude` engine. Copilot-SDK-specific session enrichment
+> steps are not applicable; rely on standard log parsing. Apply extended chain-of-thought
+> reasoning when inferring latency patterns from sparse or malformed event data.
+{{/if}}
+
 1. For each in-scope session, locate one of:

Note on conditionals in the engine: frontmatter: The {{#if}} blocks above are meant to be processed by gh aw compile at compile time. If the compiler does not yet support Handlebars inside the YAML frontmatter block, implement the engine: switch as a compiler-side feature that maps experiments.engine_variant to the engine.id field, following the pattern of other compile-time experiment expansions.
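The compiler-side fallback described above could look roughly like the sketch below. It is only an illustration in Python: `resolve_engine` and `ALLOWED_VARIANTS` are hypothetical names, not part of gh aw, and the real compiler may structure this differently. The returned mappings mirror the two branches of the conditional in the diff.

```python
# Hypothetical sketch: map an experiment variant to the engine: frontmatter block.
# Neither the function name nor the validation step comes from gh aw itself.
ALLOWED_VARIANTS = ("copilot", "claude")


def resolve_engine(variant: str) -> dict:
    """Return the engine mapping for a given engine_variant value."""
    if variant not in ALLOWED_VARIANTS:
        raise ValueError(f"unknown engine_variant: {variant!r}")
    if variant == "claude":
        # Alternative branch: no copilot-sdk key, matching the diff above.
        return {"id": "claude"}
    # Baseline branch: existing behavior unchanged.
    return {"id": "copilot", "copilot-sdk": True}


if __name__ == "__main__":
    for name in ALLOWED_VARIANTS:
        print(name, "->", resolve_engine(name))
```

Rejecting unknown variants explicitly, rather than silently falling back to the baseline as the `{{else}}` branch would, is a design choice made here for illustration so that a typo in the variant name fails at compile time.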

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| run_duration_minutes | Primary | ≤23.0 min for claude vs 28.7 min baseline |
| credit_spend | Secondary | Directional reduction signal |
| recommendation_depth (chars/issue body) | Secondary | Must not decrease vs baseline |
| issue_creation_success_rate | Guardrail | Must remain = 1.0 (all 3 issues created every run) |
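As a rough illustration of how these targets might be checked once per-variant summaries exist, here is a minimal Python sketch. The `VariantSummary` type, its field names, and the `evaluate` function are assumptions made for this example; the actual report workflow may expose different data. credit_spend is omitted because its target is directional only.

```python
from dataclasses import dataclass


@dataclass
class VariantSummary:
    """Hypothetical per-variant aggregate; field names are assumptions."""
    mean_duration_min: float
    mean_depth_chars: float
    issue_success_rate: float


def evaluate(baseline: VariantSummary, candidate: VariantSummary,
             duration_target: float = 23.0) -> dict:
    """Compare a candidate variant against the targets in the table above."""
    return {
        # Primary: mean run duration at or below the 23.0 min target.
        "primary_met": candidate.mean_duration_min <= duration_target,
        # Secondary: recommendation depth must not decrease vs baseline.
        "depth_held": candidate.mean_depth_chars >= baseline.mean_depth_chars,
        # Guardrail: every run must create all 3 issues (rate of 1.0).
        "guardrail_held": candidate.issue_success_rate >= 1.0,
    }


if __name__ == "__main__":
    # Depth and candidate values below are placeholders, not measured results.
    base = VariantSummary(28.7, 1000.0, 1.0)
    cand = VariantSummary(22.0, 1100.0, 1.0)
    print(evaluate(base, cand))
```

These threshold checks are descriptive only; they do not replace the significance tests listed under Statistical Design.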

Statistical Design

View sample size calculation
  • Variants: copilot (baseline), claude
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Historical baseline: 9 runs, mean = 28.7 min, σ ≈ 7 min (range 15.5–41.0 min), 9/9 success
  • Minimum detectable effect: 20% = 5.7 min
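  • Sample size (normal approximation for a two-sample comparison of means, two-sided α = 0.05, power = 80%):
    • $n = 2\sigma^2 (z_{0.975} + z_{0.80})^2 / \delta^2 = 2 \times 49 \times 7.84 / 32.5 \approx 24$ per variant → about 48 total
  • Minimum runs per variant: 10 (a pragmatic floor given the weekly cadence; this is below the ≈24 per variant computed above, so at n = 10 the design has 80% power only for a difference of roughly 8.8 min, about 31%)
  • Run frequency: ~1/week; 50/50 split → ~0.5 runs/variant/week
  • Expected experiment duration: ~20 weeks to reach 10 runs per variant (~5 months, targeting ~November 2026); reaching 24 per variant would take ~48 weeks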
  • Analysis approach: Welch's two-sample t-test for run_duration_minutes; two-proportion z-test for issue_creation_success_rate
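The sample-size arithmetic can be reproduced with a short Python sketch using only the standard library. It assumes the usual normal-approximation formula for comparing two means with equal variances, which yields the number of runs per variant. The function names are illustrative, and σ = 7 min is the rough estimate from the 9 historical runs.

```python
import math
from statistics import NormalDist


def runs_per_variant(sigma: float, delta: float,
                     alpha: float = 0.05, power: float = 0.80) -> float:
    """Runs needed in EACH variant to detect a mean difference of delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


def detectable_difference(sigma: float, n: int,
                          alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest mean difference detectable with n runs per variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return sigma * (z_alpha + z_beta) * math.sqrt(2 / n)


if __name__ == "__main__":
    baseline_mean = 28.7          # min, from the historical baseline
    sigma = 7.0                   # min, approximate
    delta = 0.20 * baseline_mean  # 20% minimum detectable effect
    print(math.ceil(runs_per_variant(sigma, delta)))   # runs per variant
    print(round(detectable_difference(sigma, 10), 1))  # min, at n = 10
```

Because σ is estimated from only 9 runs, both outputs should be read as rough planning figures rather than precise requirements.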

Implementation Steps

  • Add experiments: section to frontmatter
  • Make engine: conditional with {{#if experiments.engine_variant == "claude" }}...{{else}}...{{/if}} (value-comparison form — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
  • Add variant context note in Phase 1 body using {{#if experiments.engine_variant == "claude" }}
  • Run gh aw compile copilot-opt to regenerate lock file
  • Monitor experiment artifact uploaded per run to /tmp/gh-aw/agent/experiments/state.json
  • After 10 runs per variant, analyze via daily-experiment-report workflow artifacts
  • Document findings and promote winning variant
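For the analysis step, the Welch statistic and its degrees of freedom can be computed with the standard library alone. This is a minimal sketch, not the daily-experiment-report implementation; `welch_t` is an illustrative name and the durations in the usage example are made-up placeholders, not experiment results.

```python
import math
from statistics import mean, variance


def welch_t(a: list, b: list) -> tuple:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Per-group squared standard errors, using sample variances (n - 1).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Placeholder run durations in minutes, for illustration only.
    copilot_runs = [29.0, 31.5, 24.0, 35.0, 27.5]
    claude_runs = [22.0, 25.5, 19.0, 27.0, 23.5]
    t, df = welch_t(copilot_runs, claude_runs)
    print(f"t = {t:.3f}, df = {df:.2f}")
```

The p-value requires the t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.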

Generated by 🧪 Daily A/B Testing Advisor

  • expires on Jun 29, 2026, 4:49 AM UTC-08:00
