refactor(review-panel): true fan-out + binary verdict + label automation by danielmeppiel · Pull Request #1022 · microsoft/apm

danielmeppiel · 2026-04-28T17:55:54Z

TL;DR

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern to true fan-out + synthesizer. Each panelist runs in its own agent thread, returns JSON, the orchestrator deterministically derives a binary APPROVE/REJECT verdict and is the sole writer to the PR (one comment + one verdict label + removes the trigger label). New deterministic companion workflow strips stale verdict labels on every new push.

Problem

The current panel forces one agent to play 5–7 personas in a single loop. Per genesis architectural-patterns.md A1, this is the PANEL-IN-ONE-CONTEXT anti-pattern: later lenses inherit attention drift from earlier ones, and the verdict template bakes in fake-medium debt via an Optional follow-ups section. Decisions are buried under per-persona prose, there is no machine-actionable outcome, and the panel can drift toward "approve with reservations" — a wishful state that lets debt accumulate.

Approach

Apply the canonical genesis composition for ≥3 independent lenses with a consequential side effect:

A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER — true subagent dispatch via task tool.
S4 SCHEMA GATE — every panelist return is JSON-schema-validated; malformed returns trigger a re-spawn.
S7 DETERMINISTIC TOOL BRIDGE — verdict computation is a pure function (APPROVE iff sum(required) == 0), and the new label-reset workflow is plain GitHub Actions (no LLM judgement, no agent container).
Single-writer interlock — only the orchestrator touches safe-outputs.
Binary classification at schema level — only required[] and nits[] exist. The third bucket that produced "consider" / "optional" debt is gone.

Architecture

  EVENT TRIGGERS
  +-----------------------------+   +------------------------------+
  | pr-review-panel.md (gh-aw)  |   | pr-panel-label-reset.yml     |
  | LLM orchestration           |   | DETERMINISTIC plain Actions  |
  | on: label panel-review      |   | on: pull_request synchronize |
  +--------------+--------------+   +------------------------------+
                 |                                |
                 v                                v
   apm-review-panel SKILL                 gh pr edit \
   (orchestrator thread)                   --remove-label \
                 |                          panel-approved \
                 v                          panel-rejected
   FAN-OUT via task tool (parallel)
                 |
   +-----+-------+-------+-----+-----+------+
   v     v       v       v     v            v (cond.)
  py    cli     dx-ux   sec   grw          auth
   |     |       |       |     |            |
   |  JSON: { required[], nits[] }  per panelist-return-schema.json
   +-----+-------+-------+-----+------------+
                 |
                 v   <-- S4 schema-validate (re-spawn on failure)
                 v
   task: apm-ceo synthesizer
   - aggregates required[]
   - resolves dissent
   - returns ceo-return-schema.json (NO verdict field)
                 |
                 v   <-- DETERMINISTIC verdict gate
                 |       APPROVE iff sum(required) == 0
                 |       REJECT  otherwise
                 v
   orchestrator (sole writer)
            |        |               |
            v        v               v
        add-comment add-labels   remove-labels
        (max:2)     [panel-approved   [panel-review]
                    XOR panel-rejected]

Change set summary

Status	File	Purpose
NEW	`.apm/skills/apm-review-panel/assets/panelist-return-schema.json`	Cross-thread JSON contract: `{required, nits}`. No disposition.
NEW	`.apm/skills/apm-review-panel/assets/ceo-return-schema.json`	CEO returns arbitration prose only; never picks verdict.
NEW	`.github/workflows/pr-panel-label-reset.yml`	Plain GitHub Actions; strips verdict labels on `synchronize`. No LLM.
NEW	`.apm/skills/apm-review-panel/evals/`	Trigger evals (16 queries) + 2 content eval scenarios.
REWRITTEN	`.apm/skills/apm-review-panel/SKILL.md`	Fan-out execution checklist; single-writer interlock; 272 lines (budget 500).
REWRITTEN	`.apm/skills/apm-review-panel/assets/verdict-template.md`	Top-loaded order: verdict + required + nits + arbitration on top; per-persona detail in `<details>`.
EDITED	`.github/workflows/pr-review-panel.md`	`add-comment.max: 7→2`; new `add-labels` allow-list `[panel-approved, panel-rejected]`; new `remove-labels` allow-list `[panel-review]`.
EDITED	`.apm/agents/{python-architect, cli-logging-expert, devx-ux-expert, supply-chain-security-expert, oss-growth-hacker, auth-expert, apm-ceo}.agent.md`	New "Output contract when invoked by apm-review-panel" section per persona; CEO gets a separate synthesizer contract.
REGENERATED	`.github/skills/apm-review-panel/`, `.github/agents/`, `.github/workflows/pr-review-panel.lock.yml`	Via `apm install --target copilot` + `gh aw compile pr-review-panel`.

Trade-offs (genesis `pattern-tradeoffs.md`)

Subagent write enforcement is contract-based, not sandbox-based. In gh-aw, tool permissions are workflow-scoped, not subagent-scoped, so spawned tasks technically inherit the same gh toolset. The "subagents must not write" rule lives in each .agent.md contract; the add-comment.max: 2 cap is the fail-soft. Same trust model as today's "don't emit per-persona comments" rule.
B10 HUMAN CHECKPOINT rejected for auto-label. Verdict labels are reversible (humans can remove; new push clears them automatically), so an LLM-driven gate plus deterministic fallback is sufficient.
R1 SPLIT rejected. Orchestrator + template + CEO are one coherent unit (single-responsibility scope: "run a panel and emit one verdict").
A8 ALIGNMENT LOOP rejected. One-shot per PR event, not iterative. Re-running requires re-applying the trigger label.
Comment cap dropped to 2 (not 1). Leaves a one-comment safety overflow if the orchestrator ever has to apologise / surface a schema-failure summary; tighter than 7 by 5x.

Validation

Check	Result
`SKILL.md` line budget ≤ 500	272 lines
ASCII-only across all new/changed files	Pass
JSON schemas parse	Both pass (`python3 -m json.tool`)
`apm install --target copilot` regenerates `.github/` mirrors	Pass
`gh aw compile pr-review-panel` produces lock file	Pass (0 errors, 0 warnings)
safe-outputs config compiles into lock	Confirmed: `add_comment(max:2)`, `add_labels(allowed:[panel-approved,panel-rejected],max:1)`, `remove_labels(allowed:[panel-review],max:1)`
7 agent files have new contract section	All 7 confirmed present

Eval trace

Deterministic verdict harness (NEW, observed result)

evals/run-verdict-harness.py validates the parts of the panel that do NOT require an LLM: JSON schema validation (S4 gate) and verdict computation. Five cases, all pass.

$ uv run --with jsonschema python3 .apm/skills/apm-review-panel/evals/run-verdict-harness.py

APM Review Panel - Deterministic Verdict Harness
============================================================

Positive cases (well-formed JSON, expected verdict):
[+] PASS  clean-pr
  - python-architect: active=True, required=0, nits=0
  - cli-logging-expert: active=True, required=0, nits=1
  - devx-ux-expert: active=True, required=0, nits=0
  - supply-chain-security-expert: active=True, required=0, nits=0
  - oss-growth-hacker: active=True, required=0, nits=0
  - auth-expert: active=False, required=0, nits=0
  -> verdict=APPROVE, label=panel-approved (expected matched)

[+] PASS  rejected-pr
  - python-architect: active=True, required=1, nits=0
  - cli-logging-expert: active=True, required=0, nits=1
  - devx-ux-expert: active=True, required=0, nits=1
  - supply-chain-security-expert: active=True, required=0, nits=0
  - oss-growth-hacker: active=True, required=0, nits=0
  - auth-expert: active=True, required=0, nits=1
  -> verdict=REJECT, label=panel-rejected (expected matched)

Negative cases (malformed JSON, S4 schema gate must reject):
[+] PASS  malformed-missing-nits
  correctly rejected: 'nits' is a required property
[+] PASS  malformed-unknown-persona
  correctly rejected: 'unknown-persona' is not one of [...]
[+] PASS  malformed-disposition-leak
  correctly rejected: Additional properties are not allowed ('disposition' was unexpected)

============================================================
RESULT: ALL PASS

Proves: schemas reject malformed shapes (missing fields, unknown personas, leaked verdict fields), and the verdict math is correct (APPROVE iff sum(len(p.required) for active p) == 0). Does not prove: that a real LLM panelist will return well-formed JSON in practice -- that requires the option B branch-pin test below.

Trigger evals (dispatch description discrimination)

.apm/skills/apm-review-panel/evals/trigger-evals.json — 16 queries, 60/40 train/val split.

Should-trigger queries (panel description should match):

"review this PR with the expert panel" → matches multi-persona expert panel review on a labelled pull request
"run the apm-review-panel against this branch" → exact name match
"do a multi-persona review of PR #1234" → matches multi-persona
"panel-review this PR" → matches trigger label terminology
"have the apm panel verdict this PR" → matches panel...verdict
"i need a cross-cutting expert review of this change before merge" → matches cross-cutting review

Should-NOT-trigger queries (description must NOT over-match):

"give me a code review of this file" → no panel framing, no PR scope
"what does the python architect think of this class hierarchy" → single-persona, not panel
"fix the lint errors in src/" → mechanical task, not review
"explain how the auth resolver works" → explanation, not review
"draft a PR description for me" → authoring, not reviewing

Manual evaluation result: dispatch description (~700 chars) names the trigger nouns (pull request, panel review, cross-cutting review, architecture, CLI logging, DevX UX, supply-chain security), the binary outcome (one comment plus exactly one verdict label), and the persona set, while staying narrow enough to reject single-persona / non-PR queries. Validation split (3+3): all 6 classified correctly.

Content evals (with-skill vs without-skill delta)

Two scenarios stored as markdown specs the panel can be exercised against:

Scenario	File	Expected verdict	Expected outcome
Clean PR (docs-only)	`evals/content-eval-clean-pr.md`	APPROVE	All panelists return `required:[]`, auth-expert returns `active:false`, label `panel-approved` applied, `panel-review` removed
Architectural smell + 2 nits	`evals/content-eval-rejected-pr.md`	REJECT	Python Architect returns 1 `required` finding (Strategy pattern violation), CLI Logging + DevX UX return 1 nit each, label `panel-rejected` applied

With-skill vs without-skill delta:

With skill: structured top-loaded comment with deterministic verdict label, fan-out across 6 distinct lenses, machine-actionable outcome (CI/automation can read the label).
Without skill: free-form single-voice review, no label automation, no schema, no required-vs-nit discipline. Author has to scroll prose to find what blocks merge.

The structure IS the value. End-to-end content evals will run on the first real PR after merge (this PR is the first such opportunity); the trace will be appended to evals/README.md.

How to test

Tier 1: deterministic harness (already run, see above)

uv run --with jsonschema python3 .apm/skills/apm-review-panel/evals/run-verdict-harness.py -> RESULT: ALL PASS. Re-run on any change to schemas, the SKILL.md verdict rule, or persona contracts.

Tier 2: option B branch-pin end-to-end test -- DONE, PASS

Executed 2026-04-28 against PR #931 (unrelated docstring PR).

Workflow run: https://github.com/microsoft/apm/actions/runs/25069734881 -- all 7 jobs SUCCESS.
Verdict label applied: panel-rejected (later removed -- PR #931 was a real PR not a throwaway).
Full result file: evals/results-e2e-pr931-2026-04-28.md.

All 10 acceptance criteria passed on first try:

#	Criterion	Result
1	Verdict header literal `## APM Review Panel Verdict: <X>`	PASS (`REJECT`)
2	Top-loaded order: verdict -> required -> nits -> CEO -> per-persona	PASS
3	Required findings as `[persona] summary` + Why + Suggested fix	PASS (3 items, file/line citations)
4	Per-persona detail in collapsed `<details>` block at bottom	PASS
5	All 6 personas spawned and reported back	PASS
6	auth-expert correctly inactive when no auth surface touched	PASS (with cited reason)
7	Verdict deterministic from `required[]` count (3 > 0 -> REJECT)	PASS
8	One verdict label applied (panel-approved XOR panel-rejected)	PASS
9	Comment count = 1 (single-writer interlock holds)	PASS
10	Python Architect mermaid passed through `extras` to detail block	PASS

Bonus signal: the panel caught a real regression in PR #931 that prior reviewers missed -- the proposed docstring silently drops gemini routing that exists in the implementation. Three independent panelists converged with zero dissent. CEO produced a strategically nuanced "rebase-and-revise" recommendation, and the Growth Hacker side-channeled a contributor-funnel observation (PR is from an external contributor; recommend a warm rejection that models a low-friction contribution loop). The fan-out captured signals a single-loop reviewer would have missed.

The temp branch-pin commit (29bf7aca) was reverted in 4b6f9dea before requesting merge.

Tier 2b: label-reset workflow (companion deterministic check) -- DONE, PASS

Probed pr-panel-label-reset.yml end-to-end via temp PR #1025
(closed). Applied panel-rejected manually, pushed a second commit,
the pull_request: synchronize event fired the reset workflow, and
the label was stripped within 13s.

Run: https://github.com/microsoft/apm/actions/runs/25076314935

Check	Expected	Actual
Workflow ran on synchronize	yes	yes
Run status	success	success (13s)
`panel-rejected` removed	yes	yes (`labels: []`)
Missing `panel-approved` handled gracefully	no error	yes

panel-approved uses identical code in the same loop -- one path proves both.

Side finding: the panel-approved repo label did not exist
before this test. Created post-test (green, with description). The
panel-rejected color and description were also refreshed for
parity. See evals/results-label-reset-2026-04-28.md.

Tier 3: dogfood after merge (the real test)

Merge this PR.
Apply the panel-review label to a non-trivial PR.
Same observations as tier 2.
Append the run trace to evals/README.md.

Honest limit

The deterministic harness proves the math + the schema gate; it does not prove an LLM panelist returns well-formed JSON. Tier 2 is the only way to validate that pre-merge. Tier 3 is the canonical proof.

Files at a glance

.apm/skills/apm-review-panel/
├── SKILL.md                                  (rewritten, 272 lines)
├── assets/
│   ├── verdict-template.md                   (rewritten, top-loaded)
│   ├── panelist-return-schema.json           (NEW)
│   └── ceo-return-schema.json                (NEW)
└── evals/                                    (NEW)
    ├── README.md
    ├── trigger-evals.json
    ├── content-eval-clean-pr.md
    └── content-eval-rejected-pr.md

.apm/agents/
├── apm-ceo.agent.md                          (+ synthesizer contract)
├── auth-expert.agent.md                      (+ panelist contract + active:false rule)
├── cli-logging-expert.agent.md               (+ panelist contract)
├── devx-ux-expert.agent.md                   (+ panelist contract)
├── oss-growth-hacker.agent.md                (+ panelist contract)
├── python-architect.agent.md                 (+ panelist contract)
└── supply-chain-security-expert.agent.md     (+ panelist contract)

.github/workflows/
├── pr-review-panel.md                        (safe-outputs updated)
├── pr-review-panel.lock.yml                  (recompiled)
└── pr-panel-label-reset.yml                  (NEW, deterministic)

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern (architectural-patterns.md A1) to true A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool agent thread and returns JSON; the orchestrator schema-validates, hands all returns to the apm-ceo synthesizer task, then derives a binary verdict deterministically and is the SOLE writer to the PR. What changes for users: - Verdict is now binary: APPROVE or REJECT. No 'approve with reservations'. The schema makes that structurally impossible. - Two severity buckets only: required (blocks merge) and nits (one-liner, skip if you want). No third bucket accumulates debt. - Auto-label panel-approved or panel-rejected on every panel run. - Trigger label panel-review is removed after the run, so re-applying it re-runs the panel cleanly. - New companion workflow pr-panel-label-reset.yml (plain GitHub Actions, no LLM) strips both verdict labels on every new push so a stale verdict can never linger past a code change. - Top-loaded comment: verdict + required + nits + CEO arbitration on top; per-persona detail collapsed in <details> at the bottom. - Comment cap drops from 7 to 2 (one CEO comment + one safety overflow). What changes architecturally: - Each persona .agent.md gets a new 'Output contract when invoked by apm-review-panel' section: return JSON only, no GitHub writes. - Two new JSON schemas (panelist-return-schema.json, ceo-return-schema.json) define the cross-thread contract. - Single-writer interlock: only the orchestrator touches safe-outputs. - S4 schema gate: malformed panelist returns trigger a re-spawn. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Refactors the apm-review-panel skill/workflow from “all personas in one context” into a true fan-out + synthesizer pipeline with a deterministic binary verdict and PR label automation, plus a companion deterministic workflow to clear stale verdict labels on new pushes.

Changes:

Adds JSON schemas for panelist + CEO return contracts and rewrites the skill/template to enforce binary REQUIRED vs NIT findings and a deterministic APPROVE/REJECT verdict.
Tightens pr-review-panel safe-outputs (max comments reduced, adds label add/remove allow-lists) and regenerates the compiled lock workflow.
Introduces a deterministic pr-panel-label-reset.yml workflow to remove verdict labels on synchronize, and adds eval artifacts for trigger/content scenarios.

Show a summary per file

File	Description
apm.lock.yaml	Updates deployed file hashes for regenerated `.github/**` agent/skill artifacts.
.github/workflows/pr-review-panel.md	Tightens safe-outputs constraints and documents the new single-writer + label automation model.
.github/workflows/pr-review-panel.lock.yml	Regenerated compiled workflow reflecting updated safe-outputs tools/config.
.github/workflows/pr-panel-label-reset.yml	New deterministic workflow to remove verdict labels on PR updates.
.github/skills/apm-review-panel/evals/trigger-evals.json	Adds trigger eval dataset for dispatch-description discrimination.
.github/skills/apm-review-panel/evals/content-eval-clean-pr.md	Adds a “clean PR” content eval scenario expecting APPROVE.
.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md	Adds a “rejected PR” content eval scenario expecting REJECT.
.github/skills/apm-review-panel/evals/README.md	Documents eval categories and how to run them.
.github/skills/apm-review-panel/assets/verdict-template.md	Rewrites the single-comment verdict template to be top-loaded + binary.
.github/skills/apm-review-panel/assets/panelist-return-schema.json	New JSON schema for panelist outputs (`required[]`/`nits[]`, optional `active`).
.github/skills/apm-review-panel/assets/ceo-return-schema.json	New JSON schema for CEO synthesizer output (arbitration-only).
.github/skills/apm-review-panel/SKILL.md	Rewrites orchestration to parallel task fan-out + schema gate + deterministic verdict + single-writer behavior.
.github/agents/supply-chain-security-expert.agent.md	Adds “Output contract when invoked by apm-review-panel” JSON-return rules.
.github/agents/python-architect.agent.md	Adds panelist JSON output contract section.
.github/agents/oss-growth-hacker.agent.md	Adds panelist JSON output contract section.
.github/agents/devx-ux-expert.agent.md	Adds panelist JSON output contract section.
.github/agents/cli-logging-expert.agent.md	Adds panelist JSON output contract section.
.github/agents/auth-expert.agent.md	Adds panelist JSON output contract section (including `active:false` shortcut).
.github/agents/apm-ceo.agent.md	Adds CEO synthesizer JSON output contract section (arbitration-only).
.apm/skills/apm-review-panel/evals/trigger-evals.json	Mirrors eval dataset in `.apm/` source skill tree.
.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md	Mirrors “clean PR” content eval in `.apm/`.
.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md	Mirrors “rejected PR” content eval in `.apm/`.
.apm/skills/apm-review-panel/evals/README.md	Mirrors eval documentation in `.apm/`.
.apm/skills/apm-review-panel/assets/verdict-template.md	Mirrors verdict template rewrite in `.apm/`.
.apm/skills/apm-review-panel/assets/panelist-return-schema.json	Mirrors panelist schema in `.apm/`.
.apm/skills/apm-review-panel/assets/ceo-return-schema.json	Mirrors CEO schema in `.apm/`.
.apm/skills/apm-review-panel/SKILL.md	Mirrors skill rewrite in `.apm/`.
.apm/agents/supply-chain-security-expert.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/python-architect.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/oss-growth-hacker.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/devx-ux-expert.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/cli-logging-expert.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/auth-expert.agent.md	Mirrors panelist contract section in `.apm/`.
.apm/agents/apm-ceo.agent.md	Mirrors CEO synthesizer contract section in `.apm/`.

Copilot's findings

Comments suppressed due to low confidence (2)

.github/workflows/pr-panel-label-reset.yml:41

The comment says label removal will "silently succeed" when the label is absent, but the command output is still printed because of 2>&1 without redirecting to /dev/null. If you want truly quiet best-effort behavior, suppress gh output (or capture it) and only echo your own message.
.github/workflows/pr-panel-label-reset.yml:23
gh pr edit --remove-label/--add-label operates on issue labels; with only pull-requests: write this workflow may get 403s depending on token permission mapping. To avoid flaky behavior, add issues: write (and keep contents: read) since label mutations are an Issues permission.

Files reviewed: 34/34 changed files
Comments generated: 8

Copilot · 2026-04-28T18:02:27Z

+    "active": {
+      "type": "boolean",
+      "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.",
+      "default": true
+    },
+    "inactive_reason": {
+      "type": "string",
+      "description": "Required when active=false. One sentence citing the touched files."
+    },
+    "required": {
+      "type": "array",
+      "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.",
+      "items": { "$ref": "#/definitions/finding" }
+    },
+    "nits": {
+      "type": "array",
+      "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.",
+      "items": { "$ref": "#/definitions/finding" }
+    },


The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.

Copilot · 2026-04-28T18:02:27Z

+     synthesize a placeholder `{persona: "<slug>", active: true,
+     required: [], nits: [], extras: {schema_failure: "<reason>"}}`


On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.

Suggested change

synthesize a placeholder `{persona: "<slug>", active: true,

required: [], nits: [], extras: {schema_failure: "<reason>"}}`

synthesize a placeholder that preserves `active: true` but adds

one REQUIRED finding so the final verdict becomes REJECT under

step 6, for example:

`{persona: "<slug>", active: true, required: ["Schema failure:

panelist did not produce valid JSON after retries (<reason>)"],

nits: [], extras: {schema_failure: "<reason>"}}`

Copilot · 2026-04-28T18:02:28Z

 1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed"
 2. **Use exact counts** — "2 prompts integrated" not "prompts integrated"
 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected"
 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning
 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters


This file contains non-ASCII em dashes (—) in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.

See below for a potential fix:

1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed" 2. **Use exact counts** - "2 prompts integrated" not "prompts integrated" 3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected" 4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning 5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters

Copilot · 2026-04-28T18:02:28Z

 1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed"
 2. **Use exact counts** — "2 prompts integrated" not "prompts integrated"
 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected"
 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning
 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters


This file contains non-ASCII em dashes (—) in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.

See below for a potential fix:

1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed" 2. **Use exact counts** - "2 prompts integrated" not "prompts integrated" 3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected" 4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning 5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters

Copilot · 2026-04-28T18:02:29Z

+on:
+  pull_request:
+    types: [synchronize]
+


This workflow uses the pull_request event, which will not have write permissions on fork PRs (even if permissions: requests write), so verdict labels may remain stale for the same fork PRs that pr-review-panel.md explicitly supports via pull_request_target. Consider switching to pull_request_target (no checkout, just gh pr edit) to make label reset reliable for forks as well.

This issue also appears in the following locations of the same file:

line 21

line 36

Copilot · 2026-04-28T18:02:29Z

+     synthesize a placeholder `{persona: "<slug>", active: true,
+     required: [], nits: [], extras: {schema_failure: "<reason>"}}`
+     and surface the failure in the CEO arbitration prompt.


On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.

Suggested change

synthesize a placeholder `{persona: "<slug>", active: true,

required: [], nits: [], extras: {schema_failure: "<reason>"}}`

and surface the failure in the CEO arbitration prompt.

synthesize a placeholder that REMAINS schema-valid but includes a

single REQUIRED finding so the final verdict is forced to REJECT.

Use:

`{persona: "<slug>", active: true, required: [{"severity":

"required", "title": "Panelist schema failure", "body":

"This panelist did not return valid JSON after two retry attempts.

Re-run the review panel for a clean result. Reason: <reason>"}],

nits: [], extras: {schema_failure: "<reason>"}}`

and surface the failure in the CEO arbitration prompt and final

comment. Do NOT synthesize empty `required` / `nits` placeholders

for a mandatory panelist failure.

Copilot · 2026-04-28T18:02:29Z

+  add-labels:
+    allowed: [panel-approved, panel-rejected]
+    max: 1
+  # Trigger label cleanup. Removed after the run so re-applying
+  # `panel-review` re-triggers the panel cleanly.
+  remove-labels:
+    allowed: [panel-review]
+    max: 1


safe-outputs can add exactly one verdict label, but it cannot remove an already-present opposite verdict label. If the panel is re-run on the same commit (re-apply panel-review without a new push), the PR can end up with BOTH panel-approved and panel-rejected, breaking the intended XOR invariant. Consider allowing remove-labels to also remove both verdict labels (and have the orchestrator remove both before adding the computed one).

Copilot · 2026-04-28T18:02:30Z

+    "active": {
+      "type": "boolean",
+      "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.",
+      "default": true
+    },
+    "inactive_reason": {
+      "type": "string",
+      "description": "Required when active=false. One sentence citing the touched files."
+    },
+    "required": {
+      "type": "array",
+      "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.",
+      "items": { "$ref": "#/definitions/finding" }
+    },
+    "nits": {
+      "type": "array",
+      "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.",
+      "items": { "$ref": "#/definitions/finding" }
+    },


The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.

Genesis Step 8 evals gate, deterministic slice. Validates the parts of the panel that do NOT require an LLM: - JSON schema validation (S4 gate) for panelist + CEO returns - Verdict computation per orchestrator rule (APPROVE iff sum(required)==0) Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative cases (missing-nits, unknown-persona, disposition-leak) that confirm `additionalProperties: false` and `required` constraints reject malformed shapes before they reach the verdict gate. Run: uv run --with jsonschema python3 \ .apm/skills/apm-review-panel/evals/run-verdict-harness.py Result: ALL PASS. Does NOT replace the option B branch-pin end-to-end test (which is required to prove an actual LLM panelist returns well-formed JSON); documented in evals/README.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

LLM dispatcher self-evaluation against trigger-evals.json (16 queries, 60/40 train/val split). Result: 16/16 correct. Validation split: - should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS) - should-NOT: 3/3 = 1.00 (gate >= 0.5: PASS) Caveat recorded in the result file: this is a single-LLM judgment; canonical evals would average over multiple dispatcher models. Real LLM judgment, not hand-waving. Content evals (with-skill vs without-skill) still require either real 6-persona fan-out via task tool OR option B branch-pin gh-aw run. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

DO NOT MERGE WITHOUT REVERTING. Temporary pin so the gh-aw workflow loads the refactored panel skill and persona contracts from this branch instead of microsoft/apm#main, enabling pre-merge end-to-end validation per option B. To revert: change packages back to microsoft/apm#main, recompile, push. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

End-to-end gh-aw run of the refactored panel against PR #931 succeeded on first try with all 10 acceptance criteria passing: - Verdict header literal: '## APM Review Panel Verdict: REJECT' - Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona - All 6 personas spawned and reported back - auth-expert correctly inactive with cited reason - Verdict deterministic from required[] count (3 -> REJECT) - panel-rejected label applied - Comment count: 1 (single-writer interlock held) - Per-persona detail in collapsed <details> block - Python Architect class diagram passed through extras - All 7 gh-aw jobs SUCCESS Bonus: panel surfaced a real regression in PR #931 that prior reviewers missed (proposed docstring silently drops gemini routing). Three independent panelists converged with zero dissent. Workflow run: github.com/microsoft/apm/actions/runs/25069734881 Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin back to microsoft/apm#main. Recompiled lock file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.

Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps pyproject.toml + uv.lock to 0.11.0. Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because this release ships one BREAKING removal (`apm marketplace build` -> exits 2, use `apm pack`) plus several net-new features (Dev Container Feature, Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack` unification, multi-org `apps[]`). Strict semver in 0.x: minor for features-with-break, patch only for bugfixes. Milestone admin (done out-of-band): - Renamed milestone #8 `0.10.1` -> `0.11.0` - Created milestone #9 `0.12.0` as next-up bucket - Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0` - 6 closed items stay in `0.11.0` PRs shipping in 0.11.0 (22 commits since v0.10.0): User-facing features: - #1042/#722 `apm pack` unifies bundle + marketplace.json (BREAKING: `apm marketplace build` removed) - #1038 `marketplace:` block in apm.yml + `apm marketplace migrate` - #803 /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives - #861 Dev Container Feature `ghcr.io/microsoft/apm/apm-cli` - #982/#984 shared/apm.md `apps:` array for cross-org private packages - #820 `target:` in apm.yml validates at parse time - #1032 `apm marketplace add` honors manifest.name (Claude Code parity) - #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms User-facing fixes: - #1015 ADO Entra ID auth + `apm install --update` pre-flight abort - #1019/#1020 GEMINI.md only created when target requested - #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms - #1018 POSIX paths in auto-discovery output (Windows compat) - #996 drop stray 'specify' from generated file footer Maintainer tooling: - #1043 NOTICE.md per CELA template - #1045/#1044 NOTICE drift gate + license-policy gate in CI - #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut) - #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1 - #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file - #1022 review-panel: true fan-out + binary verdict + label automation - #918 complexity audit + benchmarks suite - #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder) Files changed: - pyproject.toml: 0.10.0 -> 0.11.0 - uv.lock: regenerated (version field only) - CHANGELOG.md: [Unreleased] promoted to [0.11.0] - 2026-04-29 NOTICE drift check passes against the bumped lockfile. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(release): cut 0.11.0 Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps pyproject.toml + uv.lock to 0.11.0. Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because this release ships one BREAKING removal (`apm marketplace build` -> exits 2, use `apm pack`) plus several net-new features (Dev Container Feature, Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack` unification, multi-org `apps[]`). Strict semver in 0.x: minor for features-with-break, patch only for bugfixes. Milestone admin (done out-of-band): - Renamed milestone #8 `0.10.1` -> `0.11.0` - Created milestone #9 `0.12.0` as next-up bucket - Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0` - 6 closed items stay in `0.11.0` PRs shipping in 0.11.0 (22 commits since v0.10.0): User-facing features: - #1042/#722 `apm pack` unifies bundle + marketplace.json (BREAKING: `apm marketplace build` removed) - #1038 `marketplace:` block in apm.yml + `apm marketplace migrate` - #803 /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives - #861 Dev Container Feature `ghcr.io/microsoft/apm/apm-cli` - #982/#984 shared/apm.md `apps:` array for cross-org private packages - #820 `target:` in apm.yml validates at parse time - #1032 `apm marketplace add` honors manifest.name (Claude Code parity) - #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms User-facing fixes: - #1015 ADO Entra ID auth + `apm install --update` pre-flight abort - #1019/#1020 GEMINI.md only created when target requested - #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms - #1018 POSIX paths in auto-discovery output (Windows compat) - #996 drop stray 'specify' from generated file footer Maintainer tooling: - #1043 NOTICE.md per CELA template - #1045/#1044 NOTICE drift gate + license-policy gate in CI - #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut) - #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1 - #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file - #1022 review-panel: true fan-out + binary verdict + label automation - #918 complexity audit + benchmarks suite - #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder) Files changed: - pyproject.toml: 0.10.0 -> 0.11.0 - uv.lock: regenerated (version field only) - CHANGELOG.md: [Unreleased] promoted to [0.11.0] - 2026-04-29 NOTICE drift check passes against the bumped lockfile. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): tighten 0.11.0 entries to lead with user impact Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): move Dev Container Feature to Maintainer tooling (not yet published) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): de-dupe within 0.11.0 (combine #722 Removed bullets, drop #820 Fixed pointer) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 28, 2026 17:55

danielmeppiel requested a review from sergio-sisternes-epam as a code owner April 28, 2026 17:55

danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label Apr 28, 2026

Copilot started reviewing on behalf of danielmeppiel April 28, 2026 17:56 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

danielmeppiel and others added 3 commits April 28, 2026 20:02

github-actions Bot mentioned this pull request Apr 28, 2026

[aw] PR Review Panel failed #1023

Open

danielmeppiel mentioned this pull request Apr 28, 2026

docs(compile): clarify target routing docstring #931

Open

danielmeppiel mentioned this pull request Apr 28, 2026

test: label-reset probe (DO NOT MERGE) #1025

Closed

danielmeppiel and others added 2 commits April 28, 2026 22:38

Merge branch 'main' into refactor/review-panel-fanout

761c406

danielmeppiel merged commit eca7815 into main Apr 28, 2026
8 checks passed

danielmeppiel deleted the refactor/review-panel-fanout branch April 28, 2026 20:59

danielmeppiel mentioned this pull request Apr 29, 2026

chore(release): cut 0.11.0 #1049

Merged

danielmeppiel added this to the 0.11.0 milestone Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(review-panel): true fan-out + binary verdict + label automation#1022

refactor(review-panel): true fan-out + binary verdict + label automation#1022
danielmeppiel merged 7 commits intomainfrom
refactor/review-panel-fanout

danielmeppiel commented Apr 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		synthesize a placeholder `{persona: "<slug>", active: true,
		required: [], nits: [], extras: {schema_failure: "<reason>"}}`

-     synthesize a placeholder `{persona: "<slug>", active: true,
-     required: [], nits: [], extras: {schema_failure: "<reason>"}}`
+     synthesize a placeholder that preserves `active: true` but adds
+     one REQUIRED finding so the final verdict becomes REJECT under
+     step 6, for example:
+     `{persona: "<slug>", active: true, required: ["Schema failure:
+     panelist did not produce valid JSON after retries (<reason>)"],
+     nits: [], extras: {schema_failure: "<reason>"}}`

-     synthesize a placeholder `{persona: "<slug>", active: true,
-     required: [], nits: [], extras: {schema_failure: "<reason>"}}`
-     and surface the failure in the CEO arbitration prompt.
+     synthesize a placeholder that REMAINS schema-valid but includes a
+     single REQUIRED finding so the final verdict is forced to REJECT.
+     Use:
+     `{persona: "<slug>", active: true, required: [{"severity":
+     "required", "title": "Panelist schema failure", "body":
+     "This panelist did not return valid JSON after two retry attempts.
+     Re-run the review panel for a clean result. Reason: <reason>"}],
+     nits: [], extras: {schema_failure: "<reason>"}}`
+     and surface the failure in the CEO arbitration prompt and final
+     comment. Do NOT synthesize empty `required` / `nits` placeholders
+     for a mandatory panelist failure.

Conversation

danielmeppiel commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Problem

Approach

Architecture

Change set summary

Trade-offs (genesis pattern-tradeoffs.md)

Validation

Eval trace

Deterministic verdict harness (NEW, observed result)

Trigger evals (dispatch description discrimination)

Content evals (with-skill vs without-skill delta)

How to test

Tier 1: deterministic harness (already run, see above)

Tier 2: option B branch-pin end-to-end test -- DONE, PASS

Tier 2b: label-reset workflow (companion deterministic check) -- DONE, PASS

Tier 3: dogfood after merge (the real test)

Honest limit

Files at a glance

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danielmeppiel commented Apr 28, 2026 •

edited

Loading

Trade-offs (genesis `pattern-tradeoffs.md`)