research(stability): head verdict stability under seed perturbation

## Problem
Whether the head verdict (ACCEPT/REJECT/NEEDS_HUMAN) stays stable when the same PR is reviewed 10 times has never been measured. With cheap head models (MiniMax M2.5, gpt-5-mini), the verdict is likely to flip from run to run.

## Why this matters
The head verdict is the authoritative output of a review. If it is noisy, the whole system is a noise generator. Consistency is a key quality metric, especially in the cheap-model regime.

## Proposed approach
### Measurement (Phase 1)
- Select 5 reference PRs (2 clear ACCEPT, 2 clear REJECT, 1 ambiguous)
- Review each PR 10 times (same config, no cache)
- Record the verdict distribution: mode, variance, flip rate
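
The per-PR statistics could be computed along these lines. This is a minimal sketch, not the project's implementation: `summarizeVerdicts` is a hypothetical helper, and "flip rate" is assumed here to mean the share of runs that disagree with the mode verdict, which the issue does not define. Variance is left out because the issue does not say how it should be computed for a categorical verdict.

```javascript
// Hypothetical helper: summarize the verdicts from repeated reviews of one PR.
// Assumption: flip rate = share of runs whose verdict differs from the mode.
function summarizeVerdicts(verdicts) {
  if (verdicts.length === 0) throw new Error("no verdicts to summarize");

  // Count how often each verdict occurred.
  const counts = {};
  for (const v of verdicts) counts[v] = (counts[v] ?? 0) + 1;

  // Pick the most frequent verdict; ties resolve to the one seen first.
  let mode = null;
  for (const [verdict, n] of Object.entries(counts)) {
    if (mode === null || n > counts[mode]) mode = verdict;
  }

  const total = verdicts.length;
  return {
    counts,
    mode,
    modeShare: counts[mode] / total,
    flipRate: (total - counts[mode]) / total,
  };
}
```

For 10 runs with 7 ACCEPT and 3 REJECT, this yields mode `ACCEPT`, a mode share of 0.7, and a flip rate of 0.3.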

### Thresholds
- **Ambiguous PR: mode verdict in at least 6/10 runs** (60% stability minimum)
- **Clear-cut cases: identical verdict in 10/10 runs** (100% required)
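
As a rough illustration, the two thresholds can be expressed as a single check on the mode share. The function name `meetsThreshold` and the `kind` labels `"ambiguous"` and `"clear"` are assumptions made for this sketch.

```javascript
// Thresholds from this issue: 6/10 for the ambiguous PR, 10/10 for clear-cut PRs.
const MIN_MODE_SHARE = { ambiguous: 0.6, clear: 1.0 };

// Hypothetical check: does the most frequent verdict reach the required share?
function meetsThreshold(kind, verdicts) {
  if (!(kind in MIN_MODE_SHARE)) throw new Error(`unknown PR kind: ${kind}`);
  if (verdicts.length === 0) return false;

  const counts = new Map();
  for (const v of verdicts) counts.set(v, (counts.get(v) ?? 0) + 1);

  const modeCount = Math.max(...counts.values());
  return modeCount / verdicts.length >= MIN_MODE_SHARE[kind];
}
```

A clear-cut PR with a single deviating run out of 10 fails this check, while the ambiguous PR passes with a 6/4 split.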

### Improvements (Phase 2, if measurement shows the need)
- Add a tie-breaking rule to the head prompt
- Majority voting: run the head 3 times and take the majority (higher cost)
- Lower the temperature (head only)
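
The majority-voting option might look like the sketch below. `pickMajority` is a hypothetical name, and escalating to `NEEDS_HUMAN` when no verdict wins a strict majority (for example, three different verdicts from three runs) is an assumption. The issue does not specify a fallback.

```javascript
// Hypothetical helper: pick the verdict held by a strict majority of head runs.
// Assumption: with no strict majority, escalate to NEEDS_HUMAN.
function pickMajority(verdicts) {
  const counts = new Map();
  for (const v of verdicts) counts.set(v, (counts.get(v) ?? 0) + 1);

  for (const [verdict, n] of counts) {
    if (n > verdicts.length / 2) return verdict;
  }
  return "NEEDS_HUMAN";
}
```

With three head runs, `pickMajority(["ACCEPT", "REJECT", "ACCEPT"])` returns `"ACCEPT"`. Each extra head run multiplies the head cost, which is the trade-off noted in the list.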

## Acceptance criteria
- [ ] `scripts/bench-stability.mjs`: runs the reference PRs repeatedly
- [ ] Baseline numbers (current stability rate)
- [ ] Re-measurement after improvements
- [ ] Periodic CI run + regression gate
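
One possible shape for the regression gate is a comparison of the current mode share per reference PR against the stored baseline. Everything here is an assumption for illustration: the `findRegressions` name, the `{ prId: modeShare }` data shape, and the `tolerance` parameter are not defined anywhere in the issue.

```javascript
// Hypothetical gate: list reference PRs whose stability dropped below baseline.
// Assumption: baseline and current map a PR id to its mode share (0..1).
function findRegressions(baseline, current, tolerance = 0) {
  const regressions = [];
  for (const [prId, baseShare] of Object.entries(baseline)) {
    const share = current[prId];
    // A PR missing from the current run is treated as a regression.
    if (share === undefined || share < baseShare - tolerance) {
      regressions.push({ prId, baseline: baseShare, current: share ?? null });
    }
  }
  return regressions;
}
```

A CI job could fail the build when the returned list is non-empty, for example by setting `process.exitCode = 1`.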

## References
- `packages/core/src/l3/` (head verdict implementation)
- Complements #7 (golden bug benchmark)
