refactor(review-panel): true fan-out + binary verdict + label automation#1022
refactor(review-panel): true fan-out + binary verdict + label automation#1022danielmeppiel merged 7 commits intomainfrom
Conversation
Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern (architectural-patterns.md A1) to true A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool agent thread and returns JSON; the orchestrator schema-validates, hands all returns to the apm-ceo synthesizer task, then derives a binary verdict deterministically and is the SOLE writer to the PR. What changes for users: - Verdict is now binary: APPROVE or REJECT. No 'approve with reservations'. The schema makes that structurally impossible. - Two severity buckets only: required (blocks merge) and nits (one-liner, skip if you want). No third bucket accumulates debt. - Auto-label panel-approved or panel-rejected on every panel run. - Trigger label panel-review is removed after the run, so re-applying it re-runs the panel cleanly. - New companion workflow pr-panel-label-reset.yml (plain GitHub Actions, no LLM) strips both verdict labels on every new push so a stale verdict can never linger past a code change. - Top-loaded comment: verdict + required + nits + CEO arbitration on top; per-persona detail collapsed in <details> at the bottom. - Comment cap drops from 7 to 2 (one CEO comment + one safety overflow). What changes architecturally: - Each persona .agent.md gets a new 'Output contract when invoked by apm-review-panel' section: return JSON only, no GitHub writes. - Two new JSON schemas (panelist-return-schema.json, ceo-return-schema.json) define the cross-thread contract. - Single-writer interlock: only the orchestrator touches safe-outputs. - S4 schema gate: malformed panelist returns trigger a re-spawn. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Refactors the apm-review-panel skill/workflow from “all personas in one context” into a true fan-out + synthesizer pipeline with a deterministic binary verdict and PR label automation, plus a companion deterministic workflow to clear stale verdict labels on new pushes.
Changes:
- Adds JSON schemas for panelist + CEO return contracts and rewrites the skill/template to enforce binary REQUIRED vs NIT findings and a deterministic APPROVE/REJECT verdict.
- Tightens
pr-review-panelsafe-outputs (max comments reduced, adds label add/remove allow-lists) and regenerates the compiled lock workflow. - Introduces a deterministic
pr-panel-label-reset.ymlworkflow to remove verdict labels onsynchronize, and adds eval artifacts for trigger/content scenarios.
Show a summary per file
| File | Description |
|---|---|
| apm.lock.yaml | Updates deployed file hashes for regenerated .github/** agent/skill artifacts. |
| .github/workflows/pr-review-panel.md | Tightens safe-outputs constraints and documents the new single-writer + label automation model. |
| .github/workflows/pr-review-panel.lock.yml | Regenerated compiled workflow reflecting updated safe-outputs tools/config. |
| .github/workflows/pr-panel-label-reset.yml | New deterministic workflow to remove verdict labels on PR updates. |
| .github/skills/apm-review-panel/evals/trigger-evals.json | Adds trigger eval dataset for dispatch-description discrimination. |
| .github/skills/apm-review-panel/evals/content-eval-clean-pr.md | Adds a “clean PR” content eval scenario expecting APPROVE. |
| .github/skills/apm-review-panel/evals/content-eval-rejected-pr.md | Adds a “rejected PR” content eval scenario expecting REJECT. |
| .github/skills/apm-review-panel/evals/README.md | Documents eval categories and how to run them. |
| .github/skills/apm-review-panel/assets/verdict-template.md | Rewrites the single-comment verdict template to be top-loaded + binary. |
| .github/skills/apm-review-panel/assets/panelist-return-schema.json | New JSON schema for panelist outputs (required[]/nits[], optional active). |
| .github/skills/apm-review-panel/assets/ceo-return-schema.json | New JSON schema for CEO synthesizer output (arbitration-only). |
| .github/skills/apm-review-panel/SKILL.md | Rewrites orchestration to parallel task fan-out + schema gate + deterministic verdict + single-writer behavior. |
| .github/agents/supply-chain-security-expert.agent.md | Adds “Output contract when invoked by apm-review-panel” JSON-return rules. |
| .github/agents/python-architect.agent.md | Adds panelist JSON output contract section. |
| .github/agents/oss-growth-hacker.agent.md | Adds panelist JSON output contract section. |
| .github/agents/devx-ux-expert.agent.md | Adds panelist JSON output contract section. |
| .github/agents/cli-logging-expert.agent.md | Adds panelist JSON output contract section. |
| .github/agents/auth-expert.agent.md | Adds panelist JSON output contract section (including active:false shortcut). |
| .github/agents/apm-ceo.agent.md | Adds CEO synthesizer JSON output contract section (arbitration-only). |
| .apm/skills/apm-review-panel/evals/trigger-evals.json | Mirrors eval dataset in .apm/ source skill tree. |
| .apm/skills/apm-review-panel/evals/content-eval-clean-pr.md | Mirrors “clean PR” content eval in .apm/. |
| .apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md | Mirrors “rejected PR” content eval in .apm/. |
| .apm/skills/apm-review-panel/evals/README.md | Mirrors eval documentation in .apm/. |
| .apm/skills/apm-review-panel/assets/verdict-template.md | Mirrors verdict template rewrite in .apm/. |
| .apm/skills/apm-review-panel/assets/panelist-return-schema.json | Mirrors panelist schema in .apm/. |
| .apm/skills/apm-review-panel/assets/ceo-return-schema.json | Mirrors CEO schema in .apm/. |
| .apm/skills/apm-review-panel/SKILL.md | Mirrors skill rewrite in .apm/. |
| .apm/agents/supply-chain-security-expert.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/python-architect.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/oss-growth-hacker.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/devx-ux-expert.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/cli-logging-expert.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/auth-expert.agent.md | Mirrors panelist contract section in .apm/. |
| .apm/agents/apm-ceo.agent.md | Mirrors CEO synthesizer contract section in .apm/. |
Copilot's findings
Comments suppressed due to low confidence (2)
.github/workflows/pr-panel-label-reset.yml:41
- The comment says label removal will "silently succeed" when the label is absent, but the command output is still printed because of
2>&1without redirecting to/dev/null. If you want truly quiet best-effort behavior, suppressghoutput (or capture it) and only echo your own message.
.github/workflows/pr-panel-label-reset.yml:23 gh pr edit --remove-label/--add-labeloperates on issue labels; with onlypull-requests: writethis workflow may get 403s depending on token permission mapping. To avoid flaky behavior, addissues: write(and keepcontents: read) since label mutations are an Issues permission.
- Files reviewed: 34/34 changed files
- Comments generated: 8
| "active": { | ||
| "type": "boolean", | ||
| "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.", | ||
| "default": true | ||
| }, | ||
| "inactive_reason": { | ||
| "type": "string", | ||
| "description": "Required when active=false. One sentence citing the touched files." | ||
| }, | ||
| "required": { | ||
| "type": "array", | ||
| "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.", | ||
| "items": { "$ref": "#/definitions/finding" } | ||
| }, | ||
| "nits": { | ||
| "type": "array", | ||
| "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.", | ||
| "items": { "$ref": "#/definitions/finding" } | ||
| }, |
There was a problem hiding this comment.
The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.
| synthesize a placeholder `{persona: "<slug>", active: true, | ||
| required: [], nits: [], extras: {schema_failure: "<reason>"}}` |
There was a problem hiding this comment.
On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.
| synthesize a placeholder `{persona: "<slug>", active: true, | |
| required: [], nits: [], extras: {schema_failure: "<reason>"}}` | |
| synthesize a placeholder that preserves `active: true` but adds | |
| one REQUIRED finding so the final verdict becomes REJECT under | |
| step 6, for example: | |
| `{persona: "<slug>", active: true, required: ["Schema failure: | |
| panelist did not produce valid JSON after retries (<reason>)"], | |
| nits: [], extras: {schema_failure: "<reason>"}}` |
| 1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed" | ||
| 2. **Use exact counts** — "2 prompts integrated" not "prompts integrated" | ||
| 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected" | ||
| 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning | ||
| 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters |
There was a problem hiding this comment.
This file contains non-ASCII em dashes (—) in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.
See below for a potential fix:
1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** - "2 prompts integrated" not "prompts integrated"
3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected"
4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters
| 1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed" | ||
| 2. **Use exact counts** — "2 prompts integrated" not "prompts integrated" | ||
| 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected" | ||
| 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning | ||
| 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters |
There was a problem hiding this comment.
This file contains non-ASCII em dashes (—) in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.
See below for a potential fix:
1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** - "2 prompts integrated" not "prompts integrated"
3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected"
4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters
| on: | ||
| pull_request: | ||
| types: [synchronize] | ||
|
|
There was a problem hiding this comment.
This workflow uses the pull_request event, which will not have write permissions on fork PRs (even if permissions: requests write), so verdict labels may remain stale for the same fork PRs that pr-review-panel.md explicitly supports via pull_request_target. Consider switching to pull_request_target (no checkout, just gh pr edit) to make label reset reliable for forks as well.
This issue also appears in the following locations of the same file:
- line 21
- line 36
| synthesize a placeholder `{persona: "<slug>", active: true, | ||
| required: [], nits: [], extras: {schema_failure: "<reason>"}}` | ||
| and surface the failure in the CEO arbitration prompt. |
There was a problem hiding this comment.
On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.
| synthesize a placeholder `{persona: "<slug>", active: true, | |
| required: [], nits: [], extras: {schema_failure: "<reason>"}}` | |
| and surface the failure in the CEO arbitration prompt. | |
| synthesize a placeholder that REMAINS schema-valid but includes a | |
| single REQUIRED finding so the final verdict is forced to REJECT. | |
| Use: | |
| `{persona: "<slug>", active: true, required: [{"severity": | |
| "required", "title": "Panelist schema failure", "body": | |
| "This panelist did not return valid JSON after two retry attempts. | |
| Re-run the review panel for a clean result. Reason: <reason>"}], | |
| nits: [], extras: {schema_failure: "<reason>"}}` | |
| and surface the failure in the CEO arbitration prompt and final | |
| comment. Do NOT synthesize empty `required` / `nits` placeholders | |
| for a mandatory panelist failure. |
| add-labels: | ||
| allowed: [panel-approved, panel-rejected] | ||
| max: 1 | ||
| # Trigger label cleanup. Removed after the run so re-applying | ||
| # `panel-review` re-triggers the panel cleanly. | ||
| remove-labels: | ||
| allowed: [panel-review] | ||
| max: 1 |
There was a problem hiding this comment.
safe-outputs can add exactly one verdict label, but it cannot remove an already-present opposite verdict label. If the panel is re-run on the same commit (re-apply panel-review without a new push), the PR can end up with BOTH panel-approved and panel-rejected, breaking the intended XOR invariant. Consider allowing remove-labels to also remove both verdict labels (and have the orchestrator remove both before adding the computed one).
| "active": { | ||
| "type": "boolean", | ||
| "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.", | ||
| "default": true | ||
| }, | ||
| "inactive_reason": { | ||
| "type": "string", | ||
| "description": "Required when active=false. One sentence citing the touched files." | ||
| }, | ||
| "required": { | ||
| "type": "array", | ||
| "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.", | ||
| "items": { "$ref": "#/definitions/finding" } | ||
| }, | ||
| "nits": { | ||
| "type": "array", | ||
| "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.", | ||
| "items": { "$ref": "#/definitions/finding" } | ||
| }, |
There was a problem hiding this comment.
The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.
Genesis Step 8 evals gate, deterministic slice. Validates the parts
of the panel that do NOT require an LLM:
- JSON schema validation (S4 gate) for panelist + CEO returns
- Verdict computation per orchestrator rule (APPROVE iff sum(required)==0)
Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative
cases (missing-nits, unknown-persona, disposition-leak) that confirm
`additionalProperties: false` and `required` constraints reject
malformed shapes before they reach the verdict gate.
Run: uv run --with jsonschema python3 \
.apm/skills/apm-review-panel/evals/run-verdict-harness.py
Result: ALL PASS.
Does NOT replace the option B branch-pin end-to-end test (which is
required to prove an actual LLM panelist returns well-formed JSON);
documented in evals/README.md.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
LLM dispatcher self-evaluation against trigger-evals.json (16 queries, 60/40 train/val split). Result: 16/16 correct. Validation split: - should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS) - should-NOT: 3/3 = 1.00 (gate >= 0.5: PASS) Caveat recorded in the result file: this is a single-LLM judgment; canonical evals would average over multiple dispatcher models. Real LLM judgment, not hand-waving. Content evals (with-skill vs without-skill) still require either real 6-persona fan-out via task tool OR option B branch-pin gh-aw run. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DO NOT MERGE WITHOUT REVERTING. Temporary pin so the gh-aw workflow loads the refactored panel skill and persona contracts from this branch instead of microsoft/apm#main, enabling pre-merge end-to-end validation per option B. To revert: change packages back to microsoft/apm#main, recompile, push. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
End-to-end gh-aw run of the refactored panel against PR #931 succeeded on first try with all 10 acceptance criteria passing: - Verdict header literal: '## APM Review Panel Verdict: REJECT' - Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona - All 6 personas spawned and reported back - auth-expert correctly inactive with cited reason - Verdict deterministic from required[] count (3 -> REJECT) - panel-rejected label applied - Comment count: 1 (single-writer interlock held) - Per-persona detail in collapsed <details> block - Python Architect class diagram passed through extras - All 7 gh-aw jobs SUCCESS Bonus: panel surfaced a real regression in PR #931 that prior reviewers missed (proposed docstring silently drops gemini routing). Three independent panelists converged with zero dissent. Workflow run: github.com/microsoft/apm/actions/runs/25069734881 Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin back to microsoft/apm#main. Recompiled lock file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.
Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps pyproject.toml + uv.lock to 0.11.0. Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because this release ships one BREAKING removal (`apm marketplace build` -> exits 2, use `apm pack`) plus several net-new features (Dev Container Feature, Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack` unification, multi-org `apps[]`). Strict semver in 0.x: minor for features-with-break, patch only for bugfixes. Milestone admin (done out-of-band): - Renamed milestone #8 `0.10.1` -> `0.11.0` - Created milestone #9 `0.12.0` as next-up bucket - Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0` - 6 closed items stay in `0.11.0` PRs shipping in 0.11.0 (22 commits since v0.10.0): User-facing features: - #1042/#722 `apm pack` unifies bundle + marketplace.json (BREAKING: `apm marketplace build` removed) - #1038 `marketplace:` block in apm.yml + `apm marketplace migrate` - #803 /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives - #861 Dev Container Feature `ghcr.io/microsoft/apm/apm-cli` - #982/#984 shared/apm.md `apps:` array for cross-org private packages - #820 `target:` in apm.yml validates at parse time - #1032 `apm marketplace add` honors manifest.name (Claude Code parity) - #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms User-facing fixes: - #1015 ADO Entra ID auth + `apm install --update` pre-flight abort - #1019/#1020 GEMINI.md only created when target requested - #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms - #1018 POSIX paths in auto-discovery output (Windows compat) - #996 drop stray 'specify' from generated file footer Maintainer tooling: - #1043 NOTICE.md per CELA template - #1045/#1044 NOTICE drift gate + license-policy gate in CI - #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut) - #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1 - #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file - #1022 review-panel: true fan-out + binary verdict + label automation - #918 complexity audit + benchmarks suite - #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder) Files changed: - pyproject.toml: 0.10.0 -> 0.11.0 - uv.lock: regenerated (version field only) - CHANGELOG.md: [Unreleased] promoted to [0.11.0] - 2026-04-29 NOTICE drift check passes against the bumped lockfile. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps pyproject.toml + uv.lock to 0.11.0. Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because this release ships one BREAKING removal (`apm marketplace build` -> exits 2, use `apm pack`) plus several net-new features (Dev Container Feature, Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack` unification, multi-org `apps[]`). Strict semver in 0.x: minor for features-with-break, patch only for bugfixes. Milestone admin (done out-of-band): - Renamed milestone #8 `0.10.1` -> `0.11.0` - Created milestone #9 `0.12.0` as next-up bucket - Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0` - 6 closed items stay in `0.11.0` PRs shipping in 0.11.0 (22 commits since v0.10.0): User-facing features: - #1042/#722 `apm pack` unifies bundle + marketplace.json (BREAKING: `apm marketplace build` removed) - #1038 `marketplace:` block in apm.yml + `apm marketplace migrate` - #803 /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives - #861 Dev Container Feature `ghcr.io/microsoft/apm/apm-cli` - #982/#984 shared/apm.md `apps:` array for cross-org private packages - #820 `target:` in apm.yml validates at parse time - #1032 `apm marketplace add` honors manifest.name (Claude Code parity) - #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms User-facing fixes: - #1015 ADO Entra ID auth + `apm install --update` pre-flight abort - #1019/#1020 GEMINI.md only created when target requested - #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms - #1018 POSIX paths in auto-discovery output (Windows compat) - #996 drop stray 'specify' from generated file footer Maintainer tooling: - #1043 NOTICE.md per CELA template - #1045/#1044 NOTICE drift gate + license-policy gate in CI - #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut) - #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1 - #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file - #1022 review-panel: true fan-out + binary verdict + label automation - #918 complexity audit + benchmarks suite - #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder) Files changed: - pyproject.toml: 0.10.0 -> 0.11.0 - uv.lock: regenerated (version field only) - CHANGELOG.md: [Unreleased] promoted to [0.11.0] - 2026-04-29 NOTICE drift check passes against the bumped lockfile. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore(release): cut 0.11.0 Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps pyproject.toml + uv.lock to 0.11.0. Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because this release ships one BREAKING removal (`apm marketplace build` -> exits 2, use `apm pack`) plus several net-new features (Dev Container Feature, Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack` unification, multi-org `apps[]`). Strict semver in 0.x: minor for features-with-break, patch only for bugfixes. Milestone admin (done out-of-band): - Renamed milestone #8 `0.10.1` -> `0.11.0` - Created milestone #9 `0.12.0` as next-up bucket - Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0` - 6 closed items stay in `0.11.0` PRs shipping in 0.11.0 (22 commits since v0.10.0): User-facing features: - #1042/#722 `apm pack` unifies bundle + marketplace.json (BREAKING: `apm marketplace build` removed) - #1038 `marketplace:` block in apm.yml + `apm marketplace migrate` - #803 /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives - #861 Dev Container Feature `ghcr.io/microsoft/apm/apm-cli` - #982/#984 shared/apm.md `apps:` array for cross-org private packages - #820 `target:` in apm.yml validates at parse time - #1032 `apm marketplace add` honors manifest.name (Claude Code parity) - #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms User-facing fixes: - #1015 ADO Entra ID auth + `apm install --update` pre-flight abort - #1019/#1020 GEMINI.md only created when target requested - #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms - #1018 POSIX paths in auto-discovery output (Windows compat) - #996 drop stray 'specify' from generated file footer Maintainer tooling: - #1043 NOTICE.md per CELA template - #1045/#1044 NOTICE drift gate + license-policy gate in CI - #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut) - #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1 - #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file - #1022 review-panel: true fan-out + binary verdict + label automation - #918 complexity audit + benchmarks suite - #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder) Files changed: - pyproject.toml: 0.10.0 -> 0.11.0 - uv.lock: regenerated (version field only) - CHANGELOG.md: [Unreleased] promoted to [0.11.0] - 2026-04-29 NOTICE drift check passes against the bumped lockfile. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): tighten 0.11.0 entries to lead with user impact Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): move Dev Container Feature to Maintainer tooling (not yet published) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): de-dupe within 0.11.0 (combine #722 Removed bullets, drop #820 Fixed pointer) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TL;DR
Refactor
apm-review-panelfrom PANEL-IN-ONE-CONTEXT anti-pattern to true fan-out + synthesizer. Each panelist runs in its own agent thread, returns JSON, the orchestrator deterministically derives a binary APPROVE/REJECT verdict and is the sole writer to the PR (one comment + one verdict label + removes the trigger label). New deterministic companion workflow strips stale verdict labels on every new push.Problem
The current panel forces one agent to play 5–7 personas in a single loop. Per genesis
architectural-patterns.mdA1, this is the PANEL-IN-ONE-CONTEXT anti-pattern: later lenses inherit attention drift from earlier ones, and the verdict template bakes in fake-medium debt via anOptional follow-upssection. Decisions are buried under per-persona prose, there is no machine-actionable outcome, and the panel can drift toward "approve with reservations" — a wishful state that lets debt accumulate.Approach
Apply the canonical genesis composition for ≥3 independent lenses with a consequential side effect:
tasktool.APPROVE iff sum(required) == 0), and the new label-reset workflow is plain GitHub Actions (no LLM judgement, no agent container).safe-outputs.required[]andnits[]exist. The third bucket that produced "consider" / "optional" debt is gone.Architecture
Change set summary
.apm/skills/apm-review-panel/assets/panelist-return-schema.json{required, nits}. No disposition..apm/skills/apm-review-panel/assets/ceo-return-schema.json.github/workflows/pr-panel-label-reset.ymlsynchronize. No LLM..apm/skills/apm-review-panel/evals/.apm/skills/apm-review-panel/SKILL.md.apm/skills/apm-review-panel/assets/verdict-template.md<details>..github/workflows/pr-review-panel.mdadd-comment.max: 7→2; newadd-labelsallow-list[panel-approved, panel-rejected]; newremove-labelsallow-list[panel-review]..apm/agents/{python-architect, cli-logging-expert, devx-ux-expert, supply-chain-security-expert, oss-growth-hacker, auth-expert, apm-ceo}.agent.md.github/skills/apm-review-panel/**,.github/agents/**,.github/workflows/pr-review-panel.lock.ymlapm install --target copilot+gh aw compile pr-review-panel.Trade-offs (genesis
pattern-tradeoffs.md)ghtoolset. The "subagents must not write" rule lives in each.agent.mdcontract; theadd-comment.max: 2cap is the fail-soft. Same trust model as today's "don't emit per-persona comments" rule.Validation
SKILL.mdline budget ≤ 500python3 -m json.tool)apm install --target copilotregenerates.github/mirrorsgh aw compile pr-review-panelproduces lock fileadd_comment(max:2),add_labels(allowed:[panel-approved,panel-rejected],max:1),remove_labels(allowed:[panel-review],max:1)Eval trace
Deterministic verdict harness (NEW, observed result)
evals/run-verdict-harness.pyvalidates the parts of the panel that do NOT require an LLM: JSON schema validation (S4 gate) and verdict computation. Five cases, all pass.Proves: schemas reject malformed shapes (missing fields, unknown personas, leaked verdict fields), and the verdict math is correct (APPROVE iff
sum(len(p.required) for active p) == 0). Does not prove: that a real LLM panelist will return well-formed JSON in practice -- that requires the option B branch-pin test below.Trigger evals (dispatch description discrimination)
.apm/skills/apm-review-panel/evals/trigger-evals.json— 16 queries, 60/40 train/val split.Should-trigger queries (panel description should match):
multi-persona expert panel review on a labelled pull requestmulti-personapanel...verdictcross-cutting reviewShould-NOT-trigger queries (description must NOT over-match):
Manual evaluation result: dispatch description (~700 chars) names the trigger nouns (
pull request,panel review,cross-cutting review,architecture,CLI logging,DevX UX,supply-chain security), the binary outcome (one comment plus exactly one verdict label), and the persona set, while staying narrow enough to reject single-persona / non-PR queries. Validation split (3+3): all 6 classified correctly.Content evals (with-skill vs without-skill delta)
Two scenarios stored as markdown specs the panel can be exercised against:
evals/content-eval-clean-pr.mdrequired:[], auth-expert returnsactive:false, labelpanel-approvedapplied,panel-reviewremovedevals/content-eval-rejected-pr.mdrequiredfinding (Strategy pattern violation), CLI Logging + DevX UX return 1 nit each, labelpanel-rejectedappliedWith-skill vs without-skill delta:
The structure IS the value. End-to-end content evals will run on the first real PR after merge (this PR is the first such opportunity); the trace will be appended to
evals/README.md.How to test
Tier 1: deterministic harness (already run, see above)
uv run --with jsonschema python3 .apm/skills/apm-review-panel/evals/run-verdict-harness.py->RESULT: ALL PASS. Re-run on any change to schemas, the SKILL.md verdict rule, or persona contracts.Tier 2: option B branch-pin end-to-end test -- DONE, PASS
Executed 2026-04-28 against PR #931 (unrelated docstring PR).
Workflow run: https://github.com/microsoft/apm/actions/runs/25069734881 -- all 7 jobs SUCCESS.
Verdict label applied:
panel-rejected(later removed -- PR #931 was a real PR not a throwaway).Full result file:
evals/results-e2e-pr931-2026-04-28.md.All 10 acceptance criteria passed on first try:
## APM Review Panel Verdict: <X>REJECT)[persona] summary+ Why + Suggested fix<details>block at bottomrequired[]count (3 > 0 -> REJECT)extrasto detail blockBonus signal: the panel caught a real regression in PR #931 that prior reviewers missed -- the proposed docstring silently drops
geminirouting that exists in the implementation. Three independent panelists converged with zero dissent. CEO produced a strategically nuanced "rebase-and-revise" recommendation, and the Growth Hacker side-channeled a contributor-funnel observation (PR is from an external contributor; recommend a warm rejection that models a low-friction contribution loop). The fan-out captured signals a single-loop reviewer would have missed.The temp branch-pin commit (
29bf7aca) was reverted in4b6f9deabefore requesting merge.Tier 2b: label-reset workflow (companion deterministic check) -- DONE, PASS
Probed
pr-panel-label-reset.ymlend-to-end via temp PR #1025(closed). Applied
panel-rejectedmanually, pushed a second commit,the
pull_request: synchronizeevent fired the reset workflow, andthe label was stripped within 13s.
Run: https://github.com/microsoft/apm/actions/runs/25076314935
panel-rejectedremovedlabels: [])panel-approvedhandled gracefullypanel-approveduses identical code in the same loop -- one path proves both.Side finding: the
panel-approvedrepo label did not existbefore this test. Created post-test (green, with description). The
panel-rejectedcolor and description were also refreshed forparity. See
evals/results-label-reset-2026-04-28.md.Tier 3: dogfood after merge (the real test)
panel-reviewlabel to a non-trivial PR.evals/README.md.Honest limit
The deterministic harness proves the math + the schema gate; it does not prove an LLM panelist returns well-formed JSON. Tier 2 is the only way to validate that pre-merge. Tier 3 is the canonical proof.
Files at a glance