🔬 Infrastructure Improvement: Experiments Schema, Reporting & Audit
Triggered by: ab-testing-advisor on 2026-06-15
Parent campaign: #39370 (copilot-opt engine_variant)
Background
During the copilot-opt engine_variant campaign design, a field-presence-checker audit of pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs found that analysis_type and tags are fully implemented end-to-end, but notify is only partially implemented: the field is parsed, stored in GH_AW_EXPERIMENT_SPEC, and rendered in the step summary — but no outbound action is ever taken. Significance alerts are never actually posted to the configured discussion or issue.
Field-presence audit results
| Field | Go struct | JSON marshal | CJS display | Outbound action | Status |
|---|---|---|---|---|---|
| analysis_type | ✅ lines 195–196 | ✅ json.Marshal(cfg) | ✅ step summary | n/a | Complete |
| tags | ✅ lines 198–199 | ✅ json.Marshal(cfg) | ✅ step summary | n/a | Complete |
| notify | ✅ lines 201–214 | ✅ json.Marshal(cfg) | ✅ step summary | ❌ never fires | Partial |
notify.issue and notify.discussion are parsed by extractOneExperimentConfig() into ExperimentNotify{}, marshalled into GH_AW_EXPERIMENT_SPEC, and read in pick_experiment.cjs (cfg.notify?.discussion / cfg.notify?.issue) — but only to append informational text to the step summary. No API call or comment-posting request is made anywhere in either file.
Area 1: Complete the notify Field Implementation
Problem: Operators who declare notify: { issue: 1234 } in their experiment frontmatter expect a significance alert to be posted when n >= min_samples. Currently this never happens; the field silently does nothing beyond displaying the target in the step summary.
Proposed completion (three concrete steps):
- State-file signal: when pick_experiment.cjs determines that a variant has reached min_samples, write a notify_targets array to /tmp/gh-aw/agent/experiments/state.json:
    {
      "experiment": "engine_variant",
      "variant_counts": { "copilot": 10, "claude": 10 },
      "min_samples_reached": true,
      "notify_targets": [{ "type": "issue", "number": 1234 }]
    }
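- Post-run notification step: in the compiled lock file, add a post-run step that reads the state file and posts a comment to each issue listed in notify_targets:
    - name: Notify experiment target
      if: always()
      env:
        GH_TOKEN: ${{ github.token }}  # gh needs a token; assumes none is set at job level
        STATE: /tmp/gh-aw/agent/experiments/state.json
      run: |
        # One "type:number" pair per target; empty when the file or field is missing.
        TARGETS=$(jq -r '.notify_targets // [] | .[] | "\(.type):\(.number)"' "$STATE" 2>/dev/null)
        # Fall back to a generic message when notify_message is absent.
        BODY=$(jq -r '.notify_message // "Experiment reached min_samples."' "$STATE" 2>/dev/null)
        for t in $TARGETS; do
          TYPE="${t%%:*}"; NUM="${t##*:}"
          if [ "$TYPE" = "issue" ]; then
            gh issue comment "$NUM" --body "$BODY"
          fi
        done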
- Test coverage: add a fixture test to actions/setup/js/pick_experiment.test.cjs that asserts notify_targets is populated in state.json once min_samples is reached for each variant.
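The state-file signal and the assertion the fixture test would make can be sketched as follows. This is a minimal illustration, not the code in pick_experiment.cjs: the function name buildState is hypothetical, reading the threshold from cfg.min_samples is an assumption, and so is the rule that every variant must reach the threshold before any target is listed.

```javascript
// Hypothetical helper. Only cfg.notify.issue and cfg.notify.discussion are
// named in the audit; cfg.min_samples and the function name are assumptions.
function buildState(experimentName, cfg, variantCounts) {
  const counts = Object.values(variantCounts);
  // Assumption: the threshold must be met by every variant, not just one.
  const reached =
    counts.length > 0 && counts.every((n) => n >= cfg.min_samples);

  const notifyTargets = [];
  if (reached) {
    if (cfg.notify?.issue != null) {
      notifyTargets.push({ type: "issue", number: cfg.notify.issue });
    }
    if (cfg.notify?.discussion != null) {
      notifyTargets.push({ type: "discussion", number: cfg.notify.discussion });
    }
  }

  // Same shape as the proposed state.json.
  return {
    experiment: experimentName,
    variant_counts: variantCounts,
    min_samples_reached: reached,
    notify_targets: notifyTargets,
  };
}

const cfg = { min_samples: 10, notify: { issue: 1234 } };
const below = buildState("engine_variant", cfg, { copilot: 10, claude: 9 });
const atThreshold = buildState("engine_variant", cfg, { copilot: 10, claude: 10 });

console.log(below.notify_targets.length); // 0
console.log(JSON.stringify(atThreshold.notify_targets)); // [{"type":"issue","number":1234}]
```

Keeping the function free of file I/O means a fixture test can assert on the returned object directly; the caller would serialize it to state.json.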
Area 2: Reporting & Dashboards
Proposed enhancements to daily-experiment-report
The existing daily-experiment-report.md provides a starting point but lacks cross-run aggregation and statistical significance detection. A complete analytics pipeline needs:
1. Cross-run artifact aggregation
    # Collect state.json artifacts from all runs of an instrumented workflow
    gh run list --workflow=copilot-opt.lock.yml --json databaseId | jq -r '.[].databaseId' | while read -r id; do
      # Runs without an experiments-state artifact are skipped silently
      gh run download "$id" --name experiments-state --dir "/tmp/runs/$id" 2>/dev/null
    done
    # Merge per-experiment across runs
    jq -s '[.[] | select(.experiment == "engine_variant")]' /tmp/runs/*/state.json
2. Running statistics with ASCII comparison table
    Experiment: copilot-opt / engine_variant
    ┌─────────┬──────┬──────────┬─────────┐
    │ Variant │ n    │ Mean     │ Std Dev │
    ├─────────┼──────┼──────────┼─────────┤
    │ copilot │ 7    │ 28.3 min │ 6.8 min │
    │ claude  │ 6    │ 22.1 min │ 5.2 min │
    └─────────┴──────┴──────────┴─────────┘
    p-value: 0.041 ✅ (significant at α=0.05)
    Declared winner: claude
3. Significance detection: Apply the declared analysis_type (t_test, mann_whitney, proportion_test, bayesian_ab) when n >= min_samples and report the p-value (bayesian_ab yields a posterior probability rather than a p-value). Use a Python/scipy step in the report workflow.
4. Discussion post: When significant, post results to the experiment's notify target (directly enables Area 1 completion).
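The per-variant statistics in item 2 and the test statistic behind a t_test in item 3 can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, the sample values are invented, and Welch's unequal-variance form is one reasonable reading of t_test rather than something the source specifies. The sketch stops at the t statistic and degrees of freedom; turning them into a p-value needs the Student's t distribution, which the proposed scipy step would supply (for example scipy.stats.ttest_ind with equal_var=False).

```javascript
// Hypothetical helpers; the sample values further down are invented.

/** n, mean, and sample standard deviation (n - 1 denominator). */
function summarize(samples) {
  const n = samples.length;
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    n > 1 ? samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : NaN;
  return { n, mean, variance, stdDev: Math.sqrt(variance) };
}

/** Welch's t statistic and Welch-Satterthwaite degrees of freedom. */
function welchT(a, b) {
  const sa = summarize(a);
  const sb = summarize(b);
  const va = sa.variance / sa.n; // squared standard error of each mean
  const vb = sb.variance / sb.n;
  const t = (sa.mean - sb.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (sa.n - 1) + vb ** 2 / (sb.n - 1));
  return { t, df };
}

// Made-up durations in minutes, one entry per run.
const control = [2, 4, 6];
const treatment = [1, 2, 3];
const { t, df } = welchT(control, treatment);

console.log(summarize(control).mean, summarize(control).stdDev); // 4 2
console.log(t.toFixed(3), df.toFixed(3)); // 1.549 2.941
```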
Area 3: Audit & Logs Integration
Proposed OTEL and gh aw audit enhancements
1. OTEL span attributes: In pick_experiment.cjs, export the selected variant as OTEL attributes using the existing OTLP setup (shared/otlp.md):
span.setAttribute("experiment.name", experimentName);
span.setAttribute("experiment.variant", selectedVariant);
span.setAttribute("experiment.run_count", runCount);
2. gh aw audit surfacing: Add an "Active Experiments" section to gh aw audit output:
    Active Experiments
    copilot-opt / engine_variant
    copilot: 7 runs | claude: 6 runs | started: 2026-06-16
    status: collecting (min_samples=10; copilot needs 3 more, claude needs 4 more)
3. Variant filtering: Support gh aw audit --experiment copilot-opt/engine_variant --variant claude to filter run logs by assigned variant — enabling side-by-side failure-mode comparison between control and treatment.
4. Step summary enrichment: Prefix the step title in the compiled lock file with the variant name so GitHub Actions' run list displays assignments without requiring artifact downloads:
[engine_variant=claude] Copilot Opt — run #42
5. Structured event log: Emit a structured JSON line to stdout so gh aw audit --json surfaces assignments:
{"event":"experiment.assigned","experiment":"engine_variant","variant":"claude","run_id":"42","timestamp":"2026-06-16T09:00:00Z"}
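Items 3 and 5 fit together: once assignments are emitted as JSON lines, variant filtering reduces to parsing those lines and matching on two fields. The sketch below illustrates that idea only. Both function names are hypothetical, and nothing here reflects how gh aw audit is actually implemented.

```javascript
// Hypothetical helpers; neither name exists in the repository.

/** Build the single-line JSON event described in item 5. */
function formatAssignmentEvent(experiment, variant, runId, now = new Date()) {
  return JSON.stringify({
    event: "experiment.assigned",
    experiment,
    variant,
    run_id: String(runId),
    // Drop milliseconds so the timestamp matches the example format.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
  });
}

/** Keep only experiment.assigned events that match the optional filters. */
function filterAssignments(logLines, { experiment, variant } = {}) {
  const matches = [];
  for (const line of logLines) {
    let parsed;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // ordinary log output, not a JSON line
    }
    if (!parsed || parsed.event !== "experiment.assigned") continue;
    if (experiment && parsed.experiment !== experiment) continue;
    if (variant && parsed.variant !== variant) continue;
    matches.push(parsed);
  }
  return matches;
}

const lines = [
  "Starting run",
  formatAssignmentEvent("engine_variant", "claude", 42, new Date("2026-06-16T09:00:00Z")),
  formatAssignmentEvent("engine_variant", "copilot", 43, new Date("2026-06-16T10:00:00Z")),
];
const claudeRuns = filterAssignments(lines, {
  experiment: "engine_variant",
  variant: "claude",
});
console.log(claudeRuns.map((e) => e.run_id)); // [ '42' ]
```

Non-JSON lines are skipped rather than treated as errors, since the event shares stdout with regular log output.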
Implementation Steps
- Implement the notify outbound action in pick_experiment.cjs: write notify_targets to state.json when min_samples is reached
- Add a fixture test for the notify field in pick_experiment.test.cjs
- Extend daily-experiment-report.md to aggregate cross-run state.json artifacts
- Add significance detection per the declared analysis_type
- Add experiment.name / experiment.variant OTEL span attributes to pick_experiment.cjs
- Add an "Active Experiments" section to gh aw audit output
- Add an --experiment / --variant filter to gh aw audit
- Emit the experiment.assigned event to the stdout log
References
- pkg/workflow/compiler_experiments.go
- actions/setup/js/pick_experiment.cjs
- actions/setup/js/pick_experiment.test.cjs
- .github/workflows/daily-experiment-report.md
Related to [ab-advisor] Experiment campaign for copilot-opt: A/B test engine_variant #39370
Generated by 🧪 Daily A/B Testing Advisor · 511.1 AIC · ⌖ 22.6 AIC · ⊞ 22.4K · ◷