[ab-advisor] Improve experiment infrastructure: schema, reporting & audit #39371

### 🔬 Infrastructure Improvement: Experiments Schema, Reporting & Audit

**Triggered by**: `ab-testing-advisor` on 2026-06-15
**Parent campaign**: #39370 (copilot-opt engine_variant)

---

### Background

During the `copilot-opt` engine_variant campaign design, a `field-presence-checker` audit of `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs` found that `analysis_type` and `tags` are **fully implemented end-to-end**, but `notify` is only **partially implemented**: the field is parsed, stored in `GH_AW_EXPERIMENT_SPEC`, and rendered in the step summary — but **no outbound action is ever taken**. Significance alerts are never actually posted to the configured discussion or issue.

<details><summary>Field-presence audit results</summary>

| Field | Go struct | JSON marshal | CJS display | Outbound action | Status |
|---|---|---|---|---|---|
| `analysis_type` | ✅ lines 195–196 | ✅ `json.Marshal(cfg)` | ✅ step summary | n/a | **Complete** |
| `tags` | ✅ lines 198–199 | ✅ `json.Marshal(cfg)` | ✅ step summary | n/a | **Complete** |
| `notify` | ✅ lines 201–214 | ✅ `json.Marshal(cfg)` | ✅ step summary | ❌ never fires | **Partial** |

`notify.issue` and `notify.discussion` are parsed by `extractOneExperimentConfig()` into `ExperimentNotify{}`, marshalled into `GH_AW_EXPERIMENT_SPEC`, and read in `pick_experiment.cjs` (`cfg.notify?.discussion` / `cfg.notify?.issue`) — but only to append informational text to the step summary. No API call or comment-posting request is made anywhere in either file.

</details>

---

### Area 1: Complete the `notify` Field Implementation

**Problem**: Operators who declare `notify: { issue: 1234 }` in their experiment frontmatter expect a significance alert to be posted when `n >= min_samples`. Currently this never happens; the field silently does nothing beyond displaying the target in the step summary.

**Proposed completion** (three concrete steps):

1. **State-file signal** — When `pick_experiment.cjs` determines that a variant has reached `min_samples`, write a `notify_targets` array to `/tmp/gh-aw/agent/experiments/state.json`:
   ```json
   {
     "experiment": "engine_variant",
     "variant_counts": { "copilot": 10, "claude": 10 },
     "min_samples_reached": true,
     "notify_targets": [{ "type": "issue", "number": 1234 }]
   }
   ```

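2. **Post-run notification step** — In the compiled lock file, add a post-run step that reads the state file and posts a comment to each `notify_targets` entry. The step needs a token with `issues: write` permission, and `notify_message` is assumed to be written to `state.json` alongside `notify_targets`:
   ```yaml
   - name: Notify experiment target
     if: always()
     env:
       GH_TOKEN: ${{ github.token }}        # gh CLI reads its token from GH_TOKEN
       GH_REPO: ${{ github.repository }}    # lets gh run without a checkout
     run: |
       STATE=/tmp/gh-aw/agent/experiments/state.json
       [ -f "$STATE" ] || exit 0            # nothing to do if no state was written
       # Fall back to a generic message if notify_message is absent
       BODY=$(jq -r '.notify_message // "Experiment reached min_samples."' "$STATE")
       jq -r '.notify_targets // [] | .[] | "\(.type):\(.number)"' "$STATE" |
       while IFS=: read -r TYPE NUM; do
         # Only issue targets are handled here; discussion targets are skipped
         if [ "$TYPE" = "issue" ]; then
           gh issue comment "$NUM" --body "$BODY"
         fi
       done
   ```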

3. **Test coverage** — Add a fixture test to `actions/setup/js/pick_experiment.test.cjs` that asserts `notify_targets` is populated in `state.json` once `min_samples` is reached for each variant.
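
The state-file signal in step 1 could be produced by a small helper along these lines. This is a minimal sketch, not the existing code: `buildNotifyState` and its parameters are hypothetical names, and it assumes `cfg` carries `min_samples` and the parsed `notify` object, and that "reached" means every variant has at least `min_samples` runs.

```javascript
// Hypothetical helper; the name and signature are illustrative and are not
// taken from pick_experiment.cjs.
function buildNotifyState(experiment, cfg, variantCounts) {
  const counts = Object.values(variantCounts);
  // Assumption: the threshold is met only when every variant has min_samples runs.
  const reached = counts.length > 0 && counts.every((n) => n >= cfg.min_samples);
  const targets = [];
  if (reached) {
    if (cfg.notify?.issue != null) {
      targets.push({ type: "issue", number: cfg.notify.issue });
    }
    if (cfg.notify?.discussion != null) {
      targets.push({ type: "discussion", number: cfg.notify.discussion });
    }
  }
  return {
    experiment,
    variant_counts: variantCounts,
    min_samples_reached: reached,
    notify_targets: targets,
  };
}

const state = buildNotifyState(
  "engine_variant",
  { min_samples: 10, notify: { issue: 1234 } },
  { copilot: 10, claude: 10 }
);
console.log(JSON.stringify(state, null, 2));
```

A fixture test for step 3 could call the helper once below the threshold and once at it, asserting that `notify_targets` is empty in the first case and populated in the second.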

---

### Area 2: Reporting & Dashboards

<details><summary>Proposed enhancements to daily-experiment-report</summary>

The existing `daily-experiment-report.md` provides a starting point but lacks cross-run aggregation and statistical significance detection. A complete analytics pipeline needs:

**1. Cross-run artifact aggregation**
```bash
# Collect state.json artifacts from all runs of an instrumented workflow
gh run list --workflow=copilot-opt.lock.yml --json databaseId \
  | jq -r '.[].databaseId' \
  | while read -r id; do
      gh run download "$id" --name experiments-state --dir "/tmp/runs/$id" 2>/dev/null
    done
# Merge per-experiment across runs
jq -s '[.[] | select(.experiment == "engine_variant")]' /tmp/runs/*/state.json
```

**2. Running statistics with ASCII comparison table**
```
Experiment: copilot-opt / engine_variant
┌─────────┬──────┬──────────┬─────────┐
│ Variant │  n   │ Mean     │ Std Dev │
├─────────┼──────┼──────────┼─────────┤
│ copilot │  7   │ 28.3 min │ 6.8 min │
│ claude  │  6   │ 22.1 min │ 5.2 min │
└─────────┴──────┴──────────┴─────────┘
p-value: 0.041 ✅ (significant at α=0.05)
Declared winner: claude
```
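
The per-variant mean and standard deviation in a table like this can be maintained incrementally as runs arrive, without keeping every sample. The sketch below uses Welford's online algorithm. The `RunningStats` class is a hypothetical name and the durations are made-up values for illustration, not measurements from the campaign.

```javascript
// Welford's online algorithm: update the count, mean, and sum of squared
// deviations one sample at a time.
class RunningStats {
  constructor() {
    this.n = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the mean
  }

  push(x) {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  // Sample standard deviation (n - 1 denominator); NaN until two samples exist.
  get stdDev() {
    return this.n > 1 ? Math.sqrt(this.m2 / (this.n - 1)) : NaN;
  }
}

// Made-up durations in minutes, for illustration only.
const stats = new RunningStats();
for (const minutes of [20, 24, 28]) stats.push(minutes);
console.log(stats.n, stats.mean, stats.stdDev); // 3 24 4
```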

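**3. Significance detection**: Once every variant has `n >= min_samples`, apply the declared `analysis_type` (`t_test`, `mann_whitney`, `proportion_test`, `bayesian_ab`) and report the resulting p-value (or, for `bayesian_ab`, a posterior probability, since that method does not yield a p-value). Run the analysis in a Python/scipy step in the report workflow.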
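
For the `t_test` case, the gating and the test statistic can be sketched from per-variant summary statistics. This is an illustration under assumptions: `welchT` and `maybeTest` are hypothetical names, Welch's unequal-variance form is one reasonable choice rather than a confirmed requirement, and the inputs are made up. Turning `t` and `df` into a p-value needs the t-distribution CDF, which the scipy step would supply (for example `scipy.stats.ttest_ind` with `equal_var=False`).

```javascript
// Welch's t statistic and Welch-Satterthwaite degrees of freedom,
// computed from summary statistics { n, mean, sd } for each variant.
function welchT(a, b) {
  const va = (a.sd * a.sd) / a.n; // squared standard error of mean a
  const vb = (b.sd * b.sd) / b.n; // squared standard error of mean b
  const t = (a.mean - b.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / ((va * va) / (a.n - 1) + (vb * vb) / (b.n - 1));
  return { t, df };
}

// Hypothetical gate: run the test only once both variants reach min_samples.
function maybeTest(a, b, minSamples) {
  if (a.n < minSamples || b.n < minSamples) return null;
  return welchT(a, b);
}

// Made-up summary statistics, for illustration only.
const control = { n: 10, mean: 30, sd: 5 };
const treatment = { n: 10, mean: 25, sd: 5 };
console.log(maybeTest(control, treatment, 10));
console.log(maybeTest({ ...control, n: 7 }, treatment, 10)); // null: still collecting
```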

**4. Discussion post**: When significant, post results to the experiment's `notify` target (directly enables Area 1 completion).

</details>

---

### Area 3: Audit & Logs Integration

<details><summary>Proposed OTEL and gh aw audit enhancements</summary>

**1. OTEL span attributes**: In `pick_experiment.cjs`, export the selected variant as OTEL attributes using the existing OTLP setup (`shared/otlp.md`):
```js
span.setAttribute("experiment.name", experimentName);
span.setAttribute("experiment.variant", selectedVariant);
span.setAttribute("experiment.run_count", runCount);
```

**2. `gh aw audit` surfacing**: Add an "Active Experiments" section to `gh aw audit` output:
```
Active Experiments
  copilot-opt / engine_variant
    copilot: 7 runs  |  claude: 6 runs  |  started: 2026-06-16
    status: collecting (min_samples=10; copilot needs 3 more, claude needs 4 more)
```
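
The status line can be derived directly from the per-variant counts and `min_samples`. A minimal sketch follows. `collectionStatus` is a hypothetical helper name, and the output wording is illustrative rather than a fixed `gh aw audit` format.

```javascript
// Hypothetical helper: summarize how many more runs each variant needs.
function collectionStatus(counts, minSamples) {
  const remaining = Object.entries(counts)
    .map(([variant, n]) => [variant, Math.max(0, minSamples - n)])
    .filter(([, need]) => need > 0);
  if (remaining.length === 0) {
    return `ready (min_samples=${minSamples})`;
  }
  const detail = remaining.map(([v, need]) => `${v} needs ${need}`).join(", ");
  return `collecting (min_samples=${minSamples}; ${detail})`;
}

console.log(collectionStatus({ copilot: 7, claude: 6 }, 10));
// collecting (min_samples=10; copilot needs 3, claude needs 4)
```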

**3. Variant filtering**: Support `gh aw audit --experiment copilot-opt/engine_variant --variant claude` to filter run logs by assigned variant — enabling side-by-side failure-mode comparison between control and treatment.

**4. Step summary enrichment**: Prefix the step title in the compiled lock file with the variant name so GitHub Actions' run list displays assignments without requiring artifact downloads:
```
[engine_variant=claude] Copilot Opt — run #42
```

**5. Structured event log**: Emit a structured JSON line to stdout so `gh aw audit --json` surfaces assignments:
```json
{"event":"experiment.assigned","experiment":"engine_variant","variant":"claude","run_id":"42","timestamp":"2026-06-16T09:00:00Z"}
```
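
Emitting that line from `pick_experiment.cjs` could look like the sketch below. The function name `experimentAssignedEvent` is hypothetical. In a real workflow the run ID would presumably come from the `GITHUB_RUN_ID` environment variable that GitHub Actions sets; it is passed in explicitly here so the example is reproducible.

```javascript
// Hypothetical helper: build one JSON line describing a variant assignment.
function experimentAssignedEvent(experiment, variant, runId, now = new Date()) {
  return JSON.stringify({
    event: "experiment.assigned",
    experiment,
    variant,
    run_id: String(runId),
    // toISOString() includes milliseconds; drop them to match the format above.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
  });
}

// One event per line on stdout keeps the log easy to grep and parse.
console.log(
  experimentAssignedEvent("engine_variant", "claude", 42, new Date("2026-06-16T09:00:00Z"))
);
```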

</details>

---

### Implementation Steps

- [ ] Complete `notify` outbound action in `pick_experiment.cjs`: write `notify_targets` to state.json when `min_samples` reached
- [ ] Add post-run notification step to compiled lock file template
- [ ] Add end-to-end test for `notify` field in `pick_experiment.test.cjs`
- [ ] Update `daily-experiment-report.md` to aggregate cross-run `state.json` artifacts
- [ ] Implement per-variant significance detection using declared `analysis_type`
- [ ] Add `experiment.name` / `experiment.variant` OTEL span attributes to `pick_experiment.cjs`
- [ ] Surface active experiments in `gh aw audit` output
- [ ] Add `--experiment` / `--variant` filter to `gh aw audit`
- [ ] Add structured `experiment.assigned` event to stdout log

### References

- `pkg/workflow/compiler_experiments.go`
- `actions/setup/js/pick_experiment.cjs`
- `actions/setup/js/pick_experiment.test.cjs`
- `.github/workflows/daily-experiment-report.md`
- [Experiment Infrastructure Docs](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)

Related to #39370







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/27546492220) · 511.1 AIC · ⌖ 22.6 AIC · ⊞ 22.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on Jun 29, 2026, 4:49 AM UTC-08:00
