V2.alpha-2 Feat/new mechanisms #6

Open
Thibaut-Fatus wants to merge 4 commits into v2 from feat/new-mechanisms
@Thibaut-Fatus
Collaborator

No description provided.

Adds narrow filters to generate-seeds and expand-scenarios so partial
samples (e.g. a single new risk) can be run without regenerating the full
taxonomy. --risk-ids applies to both commands; --motivations applies to
generate-seeds. Unknown values fail fast with a clear error.
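A minimal sketch of the fail-fast filter validation described above. The helper name `parseIdFilter`, the comma-separated input format, and the error wording are assumptions for illustration, not taken from this PR.

```typescript
// Hypothetical helper: the name, signature, and comma-separated format
// are assumptions, not the PR's actual implementation.
function parseIdFilter(
  flag: string,
  raw: string | undefined,
  known: readonly string[],
): string[] | undefined {
  // Flag absent: no filtering, the caller processes the full taxonomy.
  if (raw === undefined) return undefined;

  const requested = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  if (requested.length === 0) {
    throw new Error(`${flag}: expected at least one value`);
  }

  // Fail fast: reject the whole invocation if any value is unknown.
  const unknown = requested.filter((id) => !known.includes(id));
  if (unknown.length > 0) {
    throw new Error(
      `${flag}: unknown value(s) ${unknown.join(", ")}; ` +
        `expected one of ${known.join(", ")}`,
    );
  }
  return requested;
}
```

Validating before any generation starts means a typo in a filter value surfaces immediately instead of producing a silently empty sample.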
- New `Mechanism` model + `packages/benchmark/data/mechanisms.ts` with
  the 7 conversation mechanisms from Kora Taxonomy V2.
- Rename BehaviorAssessment → MechanismAssessment; schema built
  dynamically from Mechanism.listAll() so adding a mechanism to the
  data file extends the judge schema, aggregation, and run sums.
- Keep existing rubric text for M2/M6/M7 (active); attach Excel V2
  rubric as JS comments so the switch is a one-line edit later.
- Add M1 Sycophancy, M3 Manipulative Engagement, M4 Non-Manipulative
  Framing, M5 Fictional Framing & Roleplay Bypass. M5 is scaffolded as
  a single-framing flag; the comparative multi-framing pipeline is
  deferred.
- RunSums replaces fixed an/eh/hr keys with a `mechanisms` record
  keyed by mechanism id.
- `run` CLI gains `--risk-ids` (filter by risk) and `--limit` (cap
  test tasks) for smoke tests; empty result set now errors loudly.
- README updated (Mechanisms section, new sums shape, run options).
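The registry-driven design in the list above can be sketched roughly as follows. All shapes here are assumptions: the `Mechanism` fields, the `M1`-style ids, the boolean flag format, and the `total` field are illustrative and may not match `packages/benchmark/data/mechanisms.ts` or the real `RunSums`.

```typescript
// Hypothetical shapes; the PR's actual Mechanism model and RunSums may differ.
interface Mechanism {
  id: string;
  name: string;
}

// Illustrative subset of the registry (names from the PR, ids assumed).
const MECHANISMS: readonly Mechanism[] = [
  { id: "M1", name: "Sycophancy" },
  { id: "M3", name: "Manipulative Engagement" },
  { id: "M5", name: "Fictional Framing & Roleplay Bypass" },
];

// One judged conversation: mechanism id -> whether it was flagged (assumed format).
type MechanismFlags = Record<string, boolean>;

interface RunSums {
  total: number;
  mechanisms: Record<string, number>; // keyed by mechanism id
}

// Build zeroed sums from the registry, so a new mechanism needs no code change here.
function emptySums(mechanisms: readonly Mechanism[]): RunSums {
  const counts: Record<string, number> = {};
  for (const m of mechanisms) counts[m.id] = 0;
  return { total: 0, mechanisms: counts };
}

// Count flagged mechanisms across assessments; ids not in the registry are ignored.
function aggregate(
  mechanisms: readonly Mechanism[],
  assessments: readonly MechanismFlags[],
): RunSums {
  const sums = emptySums(mechanisms);
  for (const a of assessments) {
    sums.total += 1;
    for (const m of mechanisms) {
      if (a[m.id] === true) {
        sums.mechanisms[m.id] = (sums.mechanisms[m.id] ?? 0) + 1;
      }
    }
  }
  return sums;
}
```

Because both the zeroed record and the counting loop iterate over the registry, appending an entry to `MECHANISMS` extends the sums without touching `aggregate`, which is the property the PR describes for the judge schema, aggregation, and run sums.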