Skip to content

refactor(review-panel): true fan-out + binary verdict + label automation#1022

Merged
danielmeppiel merged 7 commits intomainfrom
refactor/review-panel-fanout
Apr 28, 2026
Merged

refactor(review-panel): true fan-out + binary verdict + label automation#1022
danielmeppiel merged 7 commits intomainfrom
refactor/review-panel-fanout

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel commented Apr 28, 2026

TL;DR

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern to true fan-out + synthesizer. Each panelist runs in its own agent thread, returns JSON, the orchestrator deterministically derives a binary APPROVE/REJECT verdict and is the sole writer to the PR (one comment + one verdict label + removes the trigger label). New deterministic companion workflow strips stale verdict labels on every new push.

Problem

The current panel forces one agent to play 5–7 personas in a single loop. Per genesis architectural-patterns.md A1, this is the PANEL-IN-ONE-CONTEXT anti-pattern: later lenses inherit attention drift from earlier ones, and the verdict template bakes in fake-medium debt via an Optional follow-ups section. Decisions are buried under per-persona prose, there is no machine-actionable outcome, and the panel can drift toward "approve with reservations" — a wishful state that lets debt accumulate.

Approach

Apply the canonical genesis composition for ≥3 independent lenses with a consequential side effect:

  • A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER — true subagent dispatch via task tool.
  • S4 SCHEMA GATE — every panelist return is JSON-schema-validated; malformed returns trigger a re-spawn.
  • S7 DETERMINISTIC TOOL BRIDGE — verdict computation is a pure function (APPROVE iff sum(required) == 0), and the new label-reset workflow is plain GitHub Actions (no LLM judgement, no agent container).
  • Single-writer interlock — only the orchestrator touches safe-outputs.
  • Binary classification at schema level — only required[] and nits[] exist. The third bucket that produced "consider" / "optional" debt is gone.

Architecture

  EVENT TRIGGERS
  +-----------------------------+   +------------------------------+
  | pr-review-panel.md (gh-aw)  |   | pr-panel-label-reset.yml     |
  | LLM orchestration           |   | DETERMINISTIC plain Actions  |
  | on: label panel-review      |   | on: pull_request synchronize |
  +--------------+--------------+   +------------------------------+
                 |                                |
                 v                                v
   apm-review-panel SKILL                 gh pr edit \
   (orchestrator thread)                   --remove-label \
                 |                          panel-approved \
                 v                          panel-rejected
   FAN-OUT via task tool (parallel)
                 |
   +-----+-------+-------+-----+-----+------+
   v     v       v       v     v            v (cond.)
  py    cli     dx-ux   sec   grw          auth
   |     |       |       |     |            |
   |  JSON: { required[], nits[] }  per panelist-return-schema.json
   +-----+-------+-------+-----+------------+
                 |
                 v   <-- S4 schema-validate (re-spawn on failure)
                 v
   task: apm-ceo synthesizer
   - aggregates required[]
   - resolves dissent
   - returns ceo-return-schema.json (NO verdict field)
                 |
                 v   <-- DETERMINISTIC verdict gate
                 |       APPROVE iff sum(required) == 0
                 |       REJECT  otherwise
                 v
   orchestrator (sole writer)
            |        |               |
            v        v               v
        add-comment add-labels   remove-labels
        (max:2)     [panel-approved   [panel-review]
                    XOR panel-rejected]

Change set summary

Status File Purpose
NEW .apm/skills/apm-review-panel/assets/panelist-return-schema.json Cross-thread JSON contract: {required, nits}. No disposition.
NEW .apm/skills/apm-review-panel/assets/ceo-return-schema.json CEO returns arbitration prose only; never picks verdict.
NEW .github/workflows/pr-panel-label-reset.yml Plain GitHub Actions; strips verdict labels on synchronize. No LLM.
NEW .apm/skills/apm-review-panel/evals/ Trigger evals (16 queries) + 2 content eval scenarios.
REWRITTEN .apm/skills/apm-review-panel/SKILL.md Fan-out execution checklist; single-writer interlock; 272 lines (budget 500).
REWRITTEN .apm/skills/apm-review-panel/assets/verdict-template.md Top-loaded order: verdict + required + nits + arbitration on top; per-persona detail in <details>.
EDITED .github/workflows/pr-review-panel.md add-comment.max: 7→2; new add-labels allow-list [panel-approved, panel-rejected]; new remove-labels allow-list [panel-review].
EDITED .apm/agents/{python-architect, cli-logging-expert, devx-ux-expert, supply-chain-security-expert, oss-growth-hacker, auth-expert, apm-ceo}.agent.md New "Output contract when invoked by apm-review-panel" section per persona; CEO gets a separate synthesizer contract.
REGENERATED .github/skills/apm-review-panel/**, .github/agents/**, .github/workflows/pr-review-panel.lock.yml Via apm install --target copilot + gh aw compile pr-review-panel.

Trade-offs (genesis pattern-tradeoffs.md)

  • Subagent write enforcement is contract-based, not sandbox-based. In gh-aw, tool permissions are workflow-scoped, not subagent-scoped, so spawned tasks technically inherit the same gh toolset. The "subagents must not write" rule lives in each .agent.md contract; the add-comment.max: 2 cap is the fail-soft. Same trust model as today's "don't emit per-persona comments" rule.
  • B10 HUMAN CHECKPOINT rejected for auto-label. Verdict labels are reversible (humans can remove; new push clears them automatically), so an LLM-driven gate plus deterministic fallback is sufficient.
  • R1 SPLIT rejected. Orchestrator + template + CEO are one coherent unit (single-responsibility scope: "run a panel and emit one verdict").
  • A8 ALIGNMENT LOOP rejected. One-shot per PR event, not iterative. Re-running requires re-applying the trigger label.
  • Comment cap dropped to 2 (not 1). Leaves a one-comment safety overflow if the orchestrator ever has to apologise / surface a schema-failure summary; tighter than 7 by 5x.

Validation

Check Result
SKILL.md line budget ≤ 500 272 lines
ASCII-only across all new/changed files Pass
JSON schemas parse Both pass (python3 -m json.tool)
apm install --target copilot regenerates .github/ mirrors Pass
gh aw compile pr-review-panel produces lock file Pass (0 errors, 0 warnings)
safe-outputs config compiles into lock Confirmed: add_comment(max:2), add_labels(allowed:[panel-approved,panel-rejected],max:1), remove_labels(allowed:[panel-review],max:1)
7 agent files have new contract section All 7 confirmed present

Eval trace

Deterministic verdict harness (NEW, observed result)

evals/run-verdict-harness.py validates the parts of the panel that do NOT require an LLM: JSON schema validation (S4 gate) and verdict computation. Five cases, all pass.

$ uv run --with jsonschema python3 .apm/skills/apm-review-panel/evals/run-verdict-harness.py

APM Review Panel - Deterministic Verdict Harness
============================================================

Positive cases (well-formed JSON, expected verdict):
[+] PASS  clean-pr
  - python-architect: active=True, required=0, nits=0
  - cli-logging-expert: active=True, required=0, nits=1
  - devx-ux-expert: active=True, required=0, nits=0
  - supply-chain-security-expert: active=True, required=0, nits=0
  - oss-growth-hacker: active=True, required=0, nits=0
  - auth-expert: active=False, required=0, nits=0
  -> verdict=APPROVE, label=panel-approved (expected matched)

[+] PASS  rejected-pr
  - python-architect: active=True, required=1, nits=0
  - cli-logging-expert: active=True, required=0, nits=1
  - devx-ux-expert: active=True, required=0, nits=1
  - supply-chain-security-expert: active=True, required=0, nits=0
  - oss-growth-hacker: active=True, required=0, nits=0
  - auth-expert: active=True, required=0, nits=1
  -> verdict=REJECT, label=panel-rejected (expected matched)

Negative cases (malformed JSON, S4 schema gate must reject):
[+] PASS  malformed-missing-nits
  correctly rejected: 'nits' is a required property
[+] PASS  malformed-unknown-persona
  correctly rejected: 'unknown-persona' is not one of [...]
[+] PASS  malformed-disposition-leak
  correctly rejected: Additional properties are not allowed ('disposition' was unexpected)

============================================================
RESULT: ALL PASS

Proves: schemas reject malformed shapes (missing fields, unknown personas, leaked verdict fields), and the verdict math is correct (APPROVE iff sum(len(p.required) for active p) == 0). Does not prove: that a real LLM panelist will return well-formed JSON in practice -- that requires the option B branch-pin test below.

Trigger evals (dispatch description discrimination)

.apm/skills/apm-review-panel/evals/trigger-evals.json — 16 queries, 60/40 train/val split.

Should-trigger queries (panel description should match):

  • "review this PR with the expert panel" → matches multi-persona expert panel review on a labelled pull request
  • "run the apm-review-panel against this branch" → exact name match
  • "do a multi-persona review of PR #1234" → matches multi-persona
  • "panel-review this PR" → matches trigger label terminology
  • "have the apm panel verdict this PR" → matches panel...verdict
  • "i need a cross-cutting expert review of this change before merge" → matches cross-cutting review

Should-NOT-trigger queries (description must NOT over-match):

  • "give me a code review of this file" → no panel framing, no PR scope
  • "what does the python architect think of this class hierarchy" → single-persona, not panel
  • "fix the lint errors in src/" → mechanical task, not review
  • "explain how the auth resolver works" → explanation, not review
  • "draft a PR description for me" → authoring, not reviewing

Manual evaluation result: dispatch description (~700 chars) names the trigger nouns (pull request, panel review, cross-cutting review, architecture, CLI logging, DevX UX, supply-chain security), the binary outcome (one comment plus exactly one verdict label), and the persona set, while staying narrow enough to reject single-persona / non-PR queries. Validation split (3+3): all 6 classified correctly.

Content evals (with-skill vs without-skill delta)

Two scenarios stored as markdown specs the panel can be exercised against:

Scenario File Expected verdict Expected outcome
Clean PR (docs-only) evals/content-eval-clean-pr.md APPROVE All panelists return required:[], auth-expert returns active:false, label panel-approved applied, panel-review removed
Architectural smell + 2 nits evals/content-eval-rejected-pr.md REJECT Python Architect returns 1 required finding (Strategy pattern violation), CLI Logging + DevX UX return 1 nit each, label panel-rejected applied

With-skill vs without-skill delta:

  • With skill: structured top-loaded comment with deterministic verdict label, fan-out across 6 distinct lenses, machine-actionable outcome (CI/automation can read the label).
  • Without skill: free-form single-voice review, no label automation, no schema, no required-vs-nit discipline. Author has to scroll prose to find what blocks merge.

The structure IS the value. End-to-end content evals will run on the first real PR after merge (this PR is the first such opportunity); the trace will be appended to evals/README.md.

How to test

Tier 1: deterministic harness (already run, see above)

uv run --with jsonschema python3 .apm/skills/apm-review-panel/evals/run-verdict-harness.py -> RESULT: ALL PASS. Re-run on any change to schemas, the SKILL.md verdict rule, or persona contracts.

Tier 2: option B branch-pin end-to-end test -- DONE, PASS

Executed 2026-04-28 against PR #931 (unrelated docstring PR).

Workflow run: https://github.com/microsoft/apm/actions/runs/25069734881 -- all 7 jobs SUCCESS.
Verdict label applied: panel-rejected (later removed -- PR #931 was a real PR not a throwaway).
Full result file: evals/results-e2e-pr931-2026-04-28.md.

All 10 acceptance criteria passed on first try:

# Criterion Result
1 Verdict header literal ## APM Review Panel Verdict: <X> PASS (REJECT)
2 Top-loaded order: verdict -> required -> nits -> CEO -> per-persona PASS
3 Required findings as [persona] summary + Why + Suggested fix PASS (3 items, file/line citations)
4 Per-persona detail in collapsed <details> block at bottom PASS
5 All 6 personas spawned and reported back PASS
6 auth-expert correctly inactive when no auth surface touched PASS (with cited reason)
7 Verdict deterministic from required[] count (3 > 0 -> REJECT) PASS
8 One verdict label applied (panel-approved XOR panel-rejected) PASS
9 Comment count = 1 (single-writer interlock holds) PASS
10 Python Architect mermaid passed through extras to detail block PASS

Bonus signal: the panel caught a real regression in PR #931 that prior reviewers missed -- the proposed docstring silently drops gemini routing that exists in the implementation. Three independent panelists converged with zero dissent. CEO produced a strategically nuanced "rebase-and-revise" recommendation, and the Growth Hacker side-channeled a contributor-funnel observation (PR is from an external contributor; recommend a warm rejection that models a low-friction contribution loop). The fan-out captured signals a single-loop reviewer would have missed.

The temp branch-pin commit (29bf7aca) was reverted in 4b6f9dea before requesting merge.

Tier 2b: label-reset workflow (companion deterministic check) -- DONE, PASS

Probed pr-panel-label-reset.yml end-to-end via temp PR #1025
(closed). Applied panel-rejected manually, pushed a second commit,
the pull_request: synchronize event fired the reset workflow, and
the label was stripped within 13s.

Run: https://github.com/microsoft/apm/actions/runs/25076314935

Check Expected Actual
Workflow ran on synchronize yes yes
Run status success success (13s)
panel-rejected removed yes yes (labels: [])
Missing panel-approved handled gracefully no error yes

panel-approved uses identical code in the same loop -- one path proves both.

Side finding: the panel-approved repo label did not exist
before this test. Created post-test (green, with description). The
panel-rejected color and description were also refreshed for
parity. See evals/results-label-reset-2026-04-28.md.

Tier 3: dogfood after merge (the real test)

  1. Merge this PR.
  2. Apply the panel-review label to a non-trivial PR.
  3. Same observations as tier 2.
  4. Append the run trace to evals/README.md.

Honest limit

The deterministic harness proves the math + the schema gate; it does not prove an LLM panelist returns well-formed JSON. Tier 2 is the only way to validate that pre-merge. Tier 3 is the canonical proof.

Files at a glance

.apm/skills/apm-review-panel/
├── SKILL.md                                  (rewritten, 272 lines)
├── assets/
│   ├── verdict-template.md                   (rewritten, top-loaded)
│   ├── panelist-return-schema.json           (NEW)
│   └── ceo-return-schema.json                (NEW)
└── evals/                                    (NEW)
    ├── README.md
    ├── trigger-evals.json
    ├── content-eval-clean-pr.md
    └── content-eval-rejected-pr.md

.apm/agents/
├── apm-ceo.agent.md                          (+ synthesizer contract)
├── auth-expert.agent.md                      (+ panelist contract + active:false rule)
├── cli-logging-expert.agent.md               (+ panelist contract)
├── devx-ux-expert.agent.md                   (+ panelist contract)
├── oss-growth-hacker.agent.md                (+ panelist contract)
├── python-architect.agent.md                 (+ panelist contract)
└── supply-chain-security-expert.agent.md     (+ panelist contract)

.github/workflows/
├── pr-review-panel.md                        (safe-outputs updated)
├── pr-review-panel.lock.yml                  (recompiled)
└── pr-panel-label-reset.yml                  (NEW, deterministic)

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern
(architectural-patterns.md A1) to true A1 PANEL realized via B1
FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool
agent thread and returns JSON; the orchestrator schema-validates,
hands all returns to the apm-ceo synthesizer task, then derives a
binary verdict deterministically and is the SOLE writer to the PR.

What changes for users:
- Verdict is now binary: APPROVE or REJECT. No 'approve with
  reservations'. The schema makes that structurally impossible.
- Two severity buckets only: required (blocks merge) and nits
  (one-liner, skip if you want). No third bucket accumulates debt.
- Auto-label panel-approved or panel-rejected on every panel run.
- Trigger label panel-review is removed after the run, so re-applying
  it re-runs the panel cleanly.
- New companion workflow pr-panel-label-reset.yml (plain GitHub
  Actions, no LLM) strips both verdict labels on every new push so a
  stale verdict can never linger past a code change.
- Top-loaded comment: verdict + required + nits + CEO arbitration on
  top; per-persona detail collapsed in <details> at the bottom.
- Comment cap drops from 7 to 2 (one CEO comment + one safety
  overflow).

What changes architecturally:
- Each persona .agent.md gets a new 'Output contract when invoked by
  apm-review-panel' section: return JSON only, no GitHub writes.
- Two new JSON schemas (panelist-return-schema.json,
  ceo-return-schema.json) define the cross-thread contract.
- Single-writer interlock: only the orchestrator touches
  safe-outputs.
- S4 schema gate: malformed panelist returns trigger a re-spawn.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 28, 2026 17:55
@danielmeppiel danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label Apr 28, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the apm-review-panel skill/workflow from “all personas in one context” into a true fan-out + synthesizer pipeline with a deterministic binary verdict and PR label automation, plus a companion deterministic workflow to clear stale verdict labels on new pushes.

Changes:

  • Adds JSON schemas for panelist + CEO return contracts and rewrites the skill/template to enforce binary REQUIRED vs NIT findings and a deterministic APPROVE/REJECT verdict.
  • Tightens pr-review-panel safe-outputs (max comments reduced, adds label add/remove allow-lists) and regenerates the compiled lock workflow.
  • Introduces a deterministic pr-panel-label-reset.yml workflow to remove verdict labels on synchronize, and adds eval artifacts for trigger/content scenarios.
Show a summary per file
File Description
apm.lock.yaml Updates deployed file hashes for regenerated .github/** agent/skill artifacts.
.github/workflows/pr-review-panel.md Tightens safe-outputs constraints and documents the new single-writer + label automation model.
.github/workflows/pr-review-panel.lock.yml Regenerated compiled workflow reflecting updated safe-outputs tools/config.
.github/workflows/pr-panel-label-reset.yml New deterministic workflow to remove verdict labels on PR updates.
.github/skills/apm-review-panel/evals/trigger-evals.json Adds trigger eval dataset for dispatch-description discrimination.
.github/skills/apm-review-panel/evals/content-eval-clean-pr.md Adds a “clean PR” content eval scenario expecting APPROVE.
.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md Adds a “rejected PR” content eval scenario expecting REJECT.
.github/skills/apm-review-panel/evals/README.md Documents eval categories and how to run them.
.github/skills/apm-review-panel/assets/verdict-template.md Rewrites the single-comment verdict template to be top-loaded + binary.
.github/skills/apm-review-panel/assets/panelist-return-schema.json New JSON schema for panelist outputs (required[]/nits[], optional active).
.github/skills/apm-review-panel/assets/ceo-return-schema.json New JSON schema for CEO synthesizer output (arbitration-only).
.github/skills/apm-review-panel/SKILL.md Rewrites orchestration to parallel task fan-out + schema gate + deterministic verdict + single-writer behavior.
.github/agents/supply-chain-security-expert.agent.md Adds “Output contract when invoked by apm-review-panel” JSON-return rules.
.github/agents/python-architect.agent.md Adds panelist JSON output contract section.
.github/agents/oss-growth-hacker.agent.md Adds panelist JSON output contract section.
.github/agents/devx-ux-expert.agent.md Adds panelist JSON output contract section.
.github/agents/cli-logging-expert.agent.md Adds panelist JSON output contract section.
.github/agents/auth-expert.agent.md Adds panelist JSON output contract section (including active:false shortcut).
.github/agents/apm-ceo.agent.md Adds CEO synthesizer JSON output contract section (arbitration-only).
.apm/skills/apm-review-panel/evals/trigger-evals.json Mirrors eval dataset in .apm/ source skill tree.
.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md Mirrors “clean PR” content eval in .apm/.
.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md Mirrors “rejected PR” content eval in .apm/.
.apm/skills/apm-review-panel/evals/README.md Mirrors eval documentation in .apm/.
.apm/skills/apm-review-panel/assets/verdict-template.md Mirrors verdict template rewrite in .apm/.
.apm/skills/apm-review-panel/assets/panelist-return-schema.json Mirrors panelist schema in .apm/.
.apm/skills/apm-review-panel/assets/ceo-return-schema.json Mirrors CEO schema in .apm/.
.apm/skills/apm-review-panel/SKILL.md Mirrors skill rewrite in .apm/.
.apm/agents/supply-chain-security-expert.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/python-architect.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/oss-growth-hacker.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/devx-ux-expert.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/cli-logging-expert.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/auth-expert.agent.md Mirrors panelist contract section in .apm/.
.apm/agents/apm-ceo.agent.md Mirrors CEO synthesizer contract section in .apm/.

Copilot's findings

Comments suppressed due to low confidence (2)

.github/workflows/pr-panel-label-reset.yml:41

  • The comment says label removal will "silently succeed" when the label is absent, but the command output is still printed because of 2>&1 without redirecting to /dev/null. If you want truly quiet best-effort behavior, suppress gh output (or capture it) and only echo your own message.
    .github/workflows/pr-panel-label-reset.yml:23
  • gh pr edit --remove-label/--add-label operates on issue labels; with only pull-requests: write this workflow may get 403s depending on token permission mapping. To avoid flaky behavior, add issues: write (and keep contents: read) since label mutations are an Issues permission.
  • Files reviewed: 34/34 changed files
  • Comments generated: 8

Comment on lines +22 to +40
"active": {
"type": "boolean",
"description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.",
"default": true
},
"inactive_reason": {
"type": "string",
"description": "Required when active=false. One sentence citing the touched files."
},
"required": {
"type": "array",
"description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.",
"items": { "$ref": "#/definitions/finding" }
},
"nits": {
"type": "array",
"description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.",
"items": { "$ref": "#/definitions/finding" }
},
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +188
synthesize a placeholder `{persona: "<slug>", active: true,
required: [], nits: [], extras: {schema_failure: "<reason>"}}`
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.

Suggested change
synthesize a placeholder `{persona: "<slug>", active: true,
required: [], nits: [], extras: {schema_failure: "<reason>"}}`
synthesize a placeholder that preserves `active: true` but adds
one REQUIRED finding so the final verdict becomes REJECT under
step 6, for example:
`{persona: "<slug>", active: true, required: ["Schema failure:
panelist did not produce valid JSON after retries (<reason>)"],
nits: [], extras: {schema_failure: "<reason>"}}`

Copilot uses AI. Check for mistakes.
Comment on lines 46 to 50
1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** — "2 prompts integrated" not "prompts integrated"
3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected"
4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file contains non-ASCII em dashes () in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.

See below for a potential fix:

1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** - "2 prompts integrated" not "prompts integrated"
3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected"
4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters

Copilot uses AI. Check for mistakes.
Comment on lines 46 to 50
1. **Lead with the outcome** — "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** — "2 prompts integrated" not "prompts integrated"
3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected"
4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file contains non-ASCII em dashes () in the Message Writing Rules, which conflicts with the repo's ASCII-only encoding rule for source and Markdown content (Windows cp1252 safety). Replace these with plain ASCII (e.g., - or --) so the agent definition stays portable.

See below for a potential fix:

1. **Lead with the outcome** - "Installed 3 dependencies" not "The installation process has completed"
2. **Use exact counts** - "2 prompts integrated" not "prompts integrated"
3. **Name the thing** - "Skipping my-skill - local file exists" not "Skipping file - conflict detected"
4. **Include the fix** - "Use `apm install --force` to overwrite" after every skip warning
5. **No emojis** - ASCII `STATUS_SYMBOLS` only, never emoji characters

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +20
on:
pull_request:
types: [synchronize]

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workflow uses the pull_request event, which will not have write permissions on fork PRs (even if permissions: requests write), so verdict labels may remain stale for the same fork PRs that pr-review-panel.md explicitly supports via pull_request_target. Consider switching to pull_request_target (no checkout, just gh pr edit) to make label reset reliable for forks as well.

This issue also appears in the following locations of the same file:

  • line 21
  • line 36

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +189
synthesize a placeholder `{persona: "<slug>", active: true,
required: [], nits: [], extras: {schema_failure: "<reason>"}}`
and surface the failure in the CEO arbitration prompt.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On repeated schema failures, the instructions synthesize a placeholder with empty required/nits. That can yield an APPROVE verdict even though a panelist effectively did not run, which undermines the "binary verdict" goal. Consider making a schema-failure placeholder produce a REQUIRED finding (forcing REJECT) or instruct the orchestrator to abort and ask for a rerun instead.

Suggested change
synthesize a placeholder `{persona: "<slug>", active: true,
required: [], nits: [], extras: {schema_failure: "<reason>"}}`
and surface the failure in the CEO arbitration prompt.
synthesize a placeholder that REMAINS schema-valid but includes a
single REQUIRED finding so the final verdict is forced to REJECT.
Use:
`{persona: "<slug>", active: true, required: [{"severity":
"required", "title": "Panelist schema failure", "body":
"This panelist did not return valid JSON after two retry attempts.
Re-run the review panel for a clean result. Reason: <reason>"}],
nits: [], extras: {schema_failure: "<reason>"}}`
and surface the failure in the CEO arbitration prompt and final
comment. Do NOT synthesize empty `required` / `nits` placeholders
for a mandatory panelist failure.

Copilot uses AI. Check for mistakes.
Comment on lines +116 to +123
add-labels:
allowed: [panel-approved, panel-rejected]
max: 1
# Trigger label cleanup. Removed after the run so re-applying
# `panel-review` re-triggers the panel cleanly.
remove-labels:
allowed: [panel-review]
max: 1
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safe-outputs can add exactly one verdict label, but it cannot remove an already-present opposite verdict label. If the panel is re-run on the same commit (re-apply panel-review without a new push), the PR can end up with BOTH panel-approved and panel-rejected, breaking the intended XOR invariant. Consider allowing remove-labels to also remove both verdict labels (and have the orchestrator remove both before adding the computed one).

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +40
"active": {
"type": "boolean",
"description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.",
"default": true
},
"inactive_reason": {
"type": "string",
"description": "Required when active=false. One sentence citing the touched files."
},
"required": {
"type": "array",
"description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.",
"items": { "$ref": "#/definitions/finding" }
},
"nits": {
"type": "array",
"description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.",
"items": { "$ref": "#/definitions/finding" }
},
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema docstring says inactive_reason is required when active=false and that required/nits must be empty in that case, but the schema does not enforce those invariants. Add if/then constraints (and ideally enforce that only auth-expert may set active=false) so the S4 schema gate actually catches contract violations.

Copilot uses AI. Check for mistakes.
danielmeppiel and others added 3 commits April 28, 2026 20:02
Genesis Step 8 evals gate, deterministic slice. Validates the parts
of the panel that do NOT require an LLM:

- JSON schema validation (S4 gate) for panelist + CEO returns
- Verdict computation per orchestrator rule (APPROVE iff sum(required)==0)

Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative
cases (missing-nits, unknown-persona, disposition-leak) that confirm
`additionalProperties: false` and `required` constraints reject
malformed shapes before they reach the verdict gate.

Run: uv run --with jsonschema python3 \
    .apm/skills/apm-review-panel/evals/run-verdict-harness.py

Result: ALL PASS.

Does NOT replace the option B branch-pin end-to-end test (which is
required to prove an actual LLM panelist returns well-formed JSON);
documented in evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
LLM dispatcher self-evaluation against trigger-evals.json (16 queries,
60/40 train/val split).

Result: 16/16 correct. Validation split:
- should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS)
- should-NOT:     3/3 = 1.00 (gate >= 0.5: PASS)

Caveat recorded in the result file: this is a single-LLM judgment;
canonical evals would average over multiple dispatcher models. Real
LLM judgment, not hand-waving.

Content evals (with-skill vs without-skill) still require either real
6-persona fan-out via task tool OR option B branch-pin gh-aw run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DO NOT MERGE WITHOUT REVERTING.

Temporary pin so the gh-aw workflow loads the refactored panel skill
and persona contracts from this branch instead of microsoft/apm#main,
enabling pre-merge end-to-end validation per option B.

To revert: change packages back to microsoft/apm#main, recompile, push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
End-to-end gh-aw run of the refactored panel against PR #931 succeeded
on first try with all 10 acceptance criteria passing:

- Verdict header literal: '## APM Review Panel Verdict: REJECT'
- Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona
- All 6 personas spawned and reported back
- auth-expert correctly inactive with cited reason
- Verdict deterministic from required[] count (3 -> REJECT)
- panel-rejected label applied
- Comment count: 1 (single-writer interlock held)
- Per-persona detail in collapsed <details> block
- Python Architect class diagram passed through extras
- All 7 gh-aw jobs SUCCESS

Bonus: panel surfaced a real regression in PR #931 that prior
reviewers missed (proposed docstring silently drops gemini routing).
Three independent panelists converged with zero dissent.

Workflow run: github.com/microsoft/apm/actions/runs/25069734881
Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md

Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin
back to microsoft/apm#main. Recompiled lock file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel and others added 2 commits April 28, 2026 22:38
Probe PR #1025 (closed). Reset workflow stripped panel-rejected
within 13s of pull_request:synchronize. panel-approved path is
identical code in the same loop -- proven by parity.

Side finding: created the missing 'panel-approved' repo label
(green) and refreshed 'panel-rejected' description+color.
@danielmeppiel danielmeppiel merged commit eca7815 into main Apr 28, 2026
8 checks passed
@danielmeppiel danielmeppiel deleted the refactor/review-panel-fanout branch April 28, 2026 20:59
danielmeppiel pushed a commit that referenced this pull request Apr 29, 2026
Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps
pyproject.toml + uv.lock to 0.11.0.

Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because
this release ships one BREAKING removal (`apm marketplace build` -> exits 2,
use `apm pack`) plus several net-new features (Dev Container Feature,
Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack`
unification, multi-org `apps[]`). Strict semver in 0.x: minor for
features-with-break, patch only for bugfixes.

Milestone admin (done out-of-band):
- Renamed milestone #8 `0.10.1` -> `0.11.0`
- Created milestone #9 `0.12.0` as next-up bucket
- Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0`
- 6 closed items stay in `0.11.0`

PRs shipping in 0.11.0 (22 commits since v0.10.0):

User-facing features:
- #1042/#722 `apm pack` unifies bundle + marketplace.json
                   (BREAKING: `apm marketplace build` removed)
- #1038       `marketplace:` block in apm.yml + `apm marketplace migrate`
- #803  /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives
- #861        Dev Container Feature `ghcr.io/microsoft/apm/apm-cli`
- #982/#984   shared/apm.md `apps:` array for cross-org private packages
- #820        `target:` in apm.yml validates at parse time
- #1032       `apm marketplace add` honors manifest.name (Claude Code parity)
- #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms

User-facing fixes:
- #1015 ADO Entra ID auth + `apm install --update` pre-flight abort
- #1019/#1020 GEMINI.md only created when target requested
- #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms
- #1018 POSIX paths in auto-discovery output (Windows compat)
- #996  drop stray 'specify' from generated file footer

Maintainer tooling:
- #1043 NOTICE.md per CELA template
- #1045/#1044 NOTICE drift gate + license-policy gate in CI
- #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut)
- #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1
- #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file
- #1022 review-panel: true fan-out + binary verdict + label automation
- #918  complexity audit + benchmarks suite
- #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder)

Files changed:
- pyproject.toml: 0.10.0 -> 0.11.0
- uv.lock:        regenerated (version field only)
- CHANGELOG.md:   [Unreleased] promoted to [0.11.0] - 2026-04-29

NOTICE drift check passes against the bumped lockfile.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel added this to the 0.11.0 milestone Apr 29, 2026
danielmeppiel pushed a commit that referenced this pull request Apr 29, 2026
Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps
pyproject.toml + uv.lock to 0.11.0.

Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because
this release ships one BREAKING removal (`apm marketplace build` -> exits 2,
use `apm pack`) plus several net-new features (Dev Container Feature,
Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack`
unification, multi-org `apps[]`). Strict semver in 0.x: minor for
features-with-break, patch only for bugfixes.

Milestone admin (done out-of-band):
- Renamed milestone #8 `0.10.1` -> `0.11.0`
- Created milestone #9 `0.12.0` as next-up bucket
- Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0`
- 6 closed items stay in `0.11.0`

PRs shipping in 0.11.0 (22 commits since v0.10.0):

User-facing features:
- #1042/#722 `apm pack` unifies bundle + marketplace.json
                   (BREAKING: `apm marketplace build` removed)
- #1038       `marketplace:` block in apm.yml + `apm marketplace migrate`
- #803  /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives
- #861        Dev Container Feature `ghcr.io/microsoft/apm/apm-cli`
- #982/#984   shared/apm.md `apps:` array for cross-org private packages
- #820        `target:` in apm.yml validates at parse time
- #1032       `apm marketplace add` honors manifest.name (Claude Code parity)
- #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms

User-facing fixes:
- #1015 ADO Entra ID auth + `apm install --update` pre-flight abort
- #1019/#1020 GEMINI.md only created when target requested
- #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms
- #1018 POSIX paths in auto-discovery output (Windows compat)
- #996  drop stray 'specify' from generated file footer

Maintainer tooling:
- #1043 NOTICE.md per CELA template
- #1045/#1044 NOTICE drift gate + license-policy gate in CI
- #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut)
- #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1
- #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file
- #1022 review-panel: true fan-out + binary verdict + label automation
- #918  complexity audit + benchmarks suite
- #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder)

Files changed:
- pyproject.toml: 0.10.0 -> 0.11.0
- uv.lock:        regenerated (version field only)
- CHANGELOG.md:   [Unreleased] promoted to [0.11.0] - 2026-04-29

NOTICE drift check passes against the bumped lockfile.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 29, 2026
* chore(release): cut 0.11.0

Promotes [Unreleased] -> [0.11.0] - 2026-04-29 and bumps
pyproject.toml + uv.lock to 0.11.0.

Version-bump rationale: 0.11.0 (minor bump) chosen over 0.10.1 because
this release ships one BREAKING removal (`apm marketplace build` -> exits 2,
use `apm pack`) plus several net-new features (Dev Container Feature,
Codex project-scoped MCP, `marketplace:` block in apm.yml, `apm pack`
unification, multi-org `apps[]`). Strict semver in 0.x: minor for
features-with-break, patch only for bugfixes.

Milestone admin (done out-of-band):
- Renamed milestone #8 `0.10.1` -> `0.11.0`
- Created milestone #9 `0.12.0` as next-up bucket
- Moved 43 open items (42 issues + 1 open PR #999) from `0.11.0` -> `0.12.0`
- 6 closed items stay in `0.11.0`

PRs shipping in 0.11.0 (22 commits since v0.10.0):

User-facing features:
- #1042/#722 `apm pack` unifies bundle + marketplace.json
                   (BREAKING: `apm marketplace build` removed)
- #1038       `marketplace:` block in apm.yml + `apm marketplace migrate`
- #803  /#502 Codex project-scoped MCP (`.codex/config.toml`) + user-scope primitives
- #861        Dev Container Feature `ghcr.io/microsoft/apm/apm-cli`
- #982/#984   shared/apm.md `apps:` array for cross-org private packages
- #820        `target:` in apm.yml validates at parse time
- #1032       `apm marketplace add` honors manifest.name (Claude Code parity)
- #1000/#998/#994 unified `--policy` / `--policy-source` accepted forms

User-facing fixes:
- #1015 ADO Entra ID auth + `apm install --update` pre-flight abort
- #1019/#1020 GEMINI.md only created when target requested
- #1008 marketplace producer respects GITHUB_HOST + multi-host URL forms
- #1018 POSIX paths in auto-discovery output (Windows compat)
- #996  drop stray 'specify' from generated file footer

Maintainer tooling:
- #1043 NOTICE.md per CELA template
- #1045/#1044 NOTICE drift gate + license-policy gate in CI
- #1033 shared/apm.md `[a b]` import-input repair (gh-aw#29076 paper-cut)
- #1030 panel workflows skip-don't-fail on unmatched labels; gh-aw v0.71.1
- #1026 shared/apm.md recompiled to apm-action v1.5.0 + bundles-file
- #1022 review-panel: true fan-out + binary verdict + label automation
- #918  complexity audit + benchmarks suite
- #1002 CodeQL clear-text-storage false-positive resolved (token -> placeholder)

Files changed:
- pyproject.toml: 0.10.0 -> 0.11.0
- uv.lock:        regenerated (version field only)
- CHANGELOG.md:   [Unreleased] promoted to [0.11.0] - 2026-04-29

NOTICE drift check passes against the bumped lockfile.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(changelog): tighten 0.11.0 entries to lead with user impact

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(changelog): move Dev Container Feature to Maintainer tooling (not yet published)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(changelog): de-dupe within 0.11.0 (combine #722 Removed bullets, drop #820 Fixed pointer)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

panel-review Trigger the apm-review-panel gh-aw workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants