[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #39371

🔬 Infrastructure Improvement: Experiments Schema, Reporting & Audit

Triggered by: ab-testing-advisor on 2026-06-15
Parent campaign: #39370 (copilot-opt engine_variant)


Background

During the copilot-opt engine_variant campaign design, a field-presence-checker audit of pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs found that analysis_type and tags are fully implemented end-to-end, but notify is only partially implemented: the field is parsed, stored in GH_AW_EXPERIMENT_SPEC, and rendered in the step summary — but no outbound action is ever taken. Significance alerts are never actually posted to the configured discussion or issue.

Field-presence audit results
| Field | Go struct | JSON marshal | CJS display | Outbound action | Status |
|---|---|---|---|---|---|
| analysis_type | ✅ lines 195–196 | json.Marshal(cfg) | ✅ step summary | n/a | Complete |
| tags | ✅ lines 198–199 | json.Marshal(cfg) | ✅ step summary | n/a | Complete |
| notify | ✅ lines 201–214 | json.Marshal(cfg) | ✅ step summary | ❌ never fires | Partial |

notify.issue and notify.discussion are parsed by extractOneExperimentConfig() into ExperimentNotify{}, marshalled into GH_AW_EXPERIMENT_SPEC, and read in pick_experiment.cjs (cfg.notify?.discussion / cfg.notify?.issue) — but only to append informational text to the step summary. No API call or comment-posting request is made anywhere in either file.


Area 1: Complete the notify Field Implementation

Problem: Operators who declare notify: { issue: 1234 } in their experiment frontmatter expect a significance alert to be posted when n >= min_samples. Currently this never happens; the field silently does nothing beyond displaying the target in the step summary.

Proposed completion (three concrete steps):

  1. State-file signal: When pick_experiment.cjs determines that every variant has reached min_samples, write a notify_targets array to /tmp/gh-aw/agent/experiments/state.json:

    {
      "experiment": "engine_variant",
      "variant_counts": { "copilot": 10, "claude": 10 },
      "min_samples_reached": true,
      "notify_targets": [{ "type": "issue", "number": 1234 }]
    }
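  2. Post-run notification step: In the compiled lock file, add a post-run step that reads state.json and posts a comment to each issue target once notify_targets is populated. The comment body is read from a notify_message field, which pick_experiment.cjs must also write to state.json. Discussion targets are not handled by this snippet:

    - name: Notify experiment target
      if: always()
      env:
        GH_TOKEN: ${{ github.token }}  # gh needs a token in the environment
      run: |
        STATE=/tmp/gh-aw/agent/experiments/state.json
        # Nothing to do if the experiment step never wrote its state file.
        [ -f "$STATE" ] || exit 0
        # Emit one "type:number" entry per target (none if the key is absent).
        TARGETS=$(jq -r '.notify_targets // [] | .[] | "\(.type):\(.number)"' "$STATE" 2>/dev/null)
        for t in $TARGETS; do
          TYPE="${t%%:*}"; NUM="${t##*:}"
          if [ "$TYPE" = "issue" ]; then
            gh issue comment "$NUM" --body "$(jq -r '.notify_message' "$STATE")"
          fi
        done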
  3. Test coverage — Add a fixture test to actions/setup/js/pick_experiment.test.cjs that asserts notify_targets is populated in state.json once min_samples is reached for each variant.
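
Step 1 could be sketched roughly as follows. This is a minimal illustration, not the actual pick_experiment.cjs code: the helper name buildState, the cfg.min_samples field name, and the rule that every variant must reach min_samples are assumptions. Only the output fields mirror the state.json example above.

```javascript
// Hypothetical helper (not part of pick_experiment.cjs today).
// Assumes cfg.notify may carry `issue` and/or `discussion` numbers and that
// "min_samples reached" means every variant has at least cfg.min_samples runs.
function buildState(experiment, cfg, variantCounts) {
  const counts = Object.values(variantCounts);
  const reached =
    counts.length > 0 && counts.every((n) => n >= cfg.min_samples);

  // Targets are only listed once the threshold is met, so the post-run
  // step has nothing to post before then.
  const targets = [];
  if (reached) {
    if (cfg.notify?.issue != null) {
      targets.push({ type: "issue", number: cfg.notify.issue });
    }
    if (cfg.notify?.discussion != null) {
      targets.push({ type: "discussion", number: cfg.notify.discussion });
    }
  }

  return {
    experiment,
    variant_counts: variantCounts,
    min_samples_reached: reached,
    notify_targets: targets,
  };
}

const state = buildState(
  "engine_variant",
  { min_samples: 10, notify: { issue: 1234 } },
  { copilot: 10, claude: 10 }
);
console.log(JSON.stringify(state, null, 2));
```

Keeping the decision in a pure function like this would let the fixture test in step 3 assert on the returned object without touching the filesystem.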


Area 2: Reporting & Dashboards

Proposed enhancements to daily-experiment-report

The existing daily-experiment-report.md provides a starting point but lacks cross-run aggregation and statistical significance detection. A complete analytics pipeline needs:

1. Cross-run artifact aggregation

# Collect state.json artifacts from all runs of an instrumented workflow.
# Note: gh run list returns only the 20 most recent runs by default; pass --limit to widen the window.
gh run list --workflow=copilot-opt.lock.yml --json databaseId \
  | jq -r '.[].databaseId' \
  | while read -r id; do
      gh run download "$id" --name experiments-state --dir "/tmp/runs/$id" 2>/dev/null
    done
# Merge per-experiment across runs
jq -s '[.[] | select(.experiment == "engine_variant")]' /tmp/runs/*/state.json

2. Running statistics with ASCII comparison table

Experiment: copilot-opt / engine_variant
┌─────────┬──────┬──────────┬─────────┐
│ Variant │  n   │ Mean     │ Std Dev │
├─────────┼──────┼──────────┼─────────┤
│ copilot │  7   │ 28.3 min │ 6.8 min │
│ claude  │  6   │ 22.1 min │ 5.2 min │
└─────────┴──────┴──────────┴─────────┘
p-value: 0.041 ✅ (significant at α=0.05)
Declared winner: claude
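
The n, mean, and standard deviation columns can be maintained incrementally as runs arrive. The sketch below uses Welford's online algorithm, which is one common choice and not necessarily what the report workflow would adopt. The run records and the duration_min field name are hypothetical.

```javascript
// Welford's online algorithm: updates count, mean, and the sum of squared
// deviations (m2) one observation at a time, so raw samples need not be kept.
function createStats() {
  return { n: 0, mean: 0, m2: 0 };
}

function addSample(stats, x) {
  stats.n += 1;
  const delta = x - stats.mean;
  stats.mean += delta / stats.n;
  stats.m2 += delta * (x - stats.mean);
  return stats;
}

// Sample standard deviation (n - 1 denominator); NaN until n >= 2.
function stdDev(stats) {
  return stats.n > 1 ? Math.sqrt(stats.m2 / (stats.n - 1)) : NaN;
}

// Hypothetical per-run records; `duration_min` is an assumed metric name.
const runs = [
  { variant: "copilot", duration_min: 30 },
  { variant: "copilot", duration_min: 26 },
  { variant: "claude", duration_min: 21 },
  { variant: "claude", duration_min: 23 },
  { variant: "claude", duration_min: 25 },
];

// Group samples by variant and fold each one into that variant's stats.
const byVariant = {};
for (const run of runs) {
  byVariant[run.variant] = byVariant[run.variant] || createStats();
  addSample(byVariant[run.variant], run.duration_min);
}

for (const [variant, s] of Object.entries(byVariant)) {
  console.log(
    `${variant}: n=${s.n} mean=${s.mean.toFixed(1)} sd=${stdDev(s).toFixed(1)}`
  );
}
```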

3. Significance detection: Apply the declared analysis_type (t_test, mann_whitney, proportion_test, bayesian_ab) once n >= min_samples and report the resulting p-value. Use a Python/scipy step in the report workflow.
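
To make the gating and the t_test case concrete, here is a small sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from per-variant summary statistics. Treating t_test as Welch's unequal-variance test is an assumption, as are the helper names and the sample numbers. Converting t and df into a p-value is left to the scipy step the issue proposes (for example scipy.stats.ttest_ind with equal_var=False).

```javascript
// Welch's t statistic and Welch-Satterthwaite degrees of freedom from
// summary statistics: n, mean, and sample standard deviation per variant.
function welchT(a, b) {
  const va = (a.sd * a.sd) / a.n; // squared standard error, variant a
  const vb = (b.sd * b.sd) / b.n; // squared standard error, variant b
  const t = (a.mean - b.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1));
  return { t, df };
}

// Only test once both variants have at least minSamples observations.
function readyForTest(a, b, minSamples) {
  return a.n >= minSamples && b.n >= minSamples;
}

// Hypothetical summary statistics, not taken from any real run.
const copilot = { n: 12, mean: 28.0, sd: 6.0 };
const claude = { n: 12, mean: 22.0, sd: 5.0 };

if (readyForTest(copilot, claude, 10)) {
  const { t, df } = welchT(copilot, claude);
  console.log(`t=${t.toFixed(2)} df=${df.toFixed(1)}`);
} else {
  console.log("collecting: min_samples not yet reached");
}
```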

4. Discussion post: When significant, post results to the experiment's notify target (directly enables Area 1 completion).


Area 3: Audit & Logs Integration

Proposed OTEL and gh aw audit enhancements

1. OTEL span attributes: In pick_experiment.cjs, export the selected variant as OTEL attributes using the existing OTLP setup (shared/otlp.md):

span.setAttribute("experiment.name", experimentName);
span.setAttribute("experiment.variant", selectedVariant);
span.setAttribute("experiment.run_count", runCount);

2. gh aw audit surfacing: Add an "Active Experiments" section to gh aw audit output:

Active Experiments
  copilot-opt / engine_variant
    copilot: 7 runs  |  claude: 6 runs  |  started: 2026-06-16
    status: collecting (min_samples=10; copilot needs 3 more, claude needs 4 more)
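
The status line could be derived from the per-variant counts along these lines. This is a sketch only: the helper name and the "ready" status label are assumptions, and gh aw audit itself may well compute this elsewhere.

```javascript
// Hypothetical helper: derives experiment status from per-variant run counts.
function experimentStatus(variantCounts, minSamples) {
  const remaining = {};
  for (const [variant, n] of Object.entries(variantCounts)) {
    // Runs still needed before this variant reaches min_samples.
    remaining[variant] = Math.max(0, minSamples - n);
  }
  const maxRemaining = Math.max(0, ...Object.values(remaining));
  return {
    status: maxRemaining === 0 ? "ready" : "collecting",
    remaining,
  };
}

const summary = experimentStatus({ copilot: 7, claude: 6 }, 10);
console.log(summary.status, JSON.stringify(summary.remaining));
```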

3. Variant filtering: Support gh aw audit --experiment copilot-opt/engine_variant --variant claude to filter run logs by assigned variant — enabling side-by-side failure-mode comparison between control and treatment.

4. Step summary enrichment: Prefix the step title in the compiled lock file with the variant name so GitHub Actions' run list displays assignments without requiring artifact downloads:

[engine_variant=claude] Copilot Opt — run #42

5. Structured event log: Emit a structured JSON line to stdout so gh aw audit --json surfaces assignments:

{"event":"experiment.assigned","experiment":"engine_variant","variant":"claude","run_id":"42","timestamp":"2026-06-16T09:00:00Z"}
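
A minimal sketch of how pick_experiment.cjs might produce that line. The helper name and its parameters are hypothetical; only the field names and their order follow the example above.

```javascript
// Hypothetical helper: builds the single-line JSON event for stdout.
// run_id is stringified and the timestamp drops milliseconds so the output
// matches the shape of the example event.
function formatAssignmentEvent(experiment, variant, runId, now = new Date()) {
  return JSON.stringify({
    event: "experiment.assigned",
    experiment,
    variant,
    run_id: String(runId),
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
  });
}

console.log(formatAssignmentEvent("engine_variant", "claude", 42));
```

Emitting exactly one JSON object per line keeps the event easy to pick out of mixed log output with a line-oriented parser.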

Implementation Steps

  • Complete notify outbound action in pick_experiment.cjs: write notify_targets to state.json when min_samples reached
  • Add post-run notification step to compiled lock file template
  • Add end-to-end test for notify field in pick_experiment.test.cjs
  • Update daily-experiment-report.md to aggregate cross-run state.json artifacts
  • Implement per-variant significance detection using declared analysis_type
  • Add experiment.name / experiment.variant OTEL span attributes to pick_experiment.cjs
  • Surface active experiments in gh aw audit output
  • Add --experiment / --variant filter to gh aw audit
  • Add structured experiment.assigned event to stdout log

References

Generated by 🧪 Daily A/B Testing Advisor · 511.1 AIC · ⌖ 22.6 AIC · ⊞ 22.4K ·

  • expires on Jun 29, 2026, 4:49 AM UTC-08:00
