18 changes: 18 additions & 0 deletions .github/workflows/ci.yml
@@ -50,3 +50,21 @@ jobs:
           cache: 'pnpm'
       - run: pnpm install --frozen-lockfile
       - run: bash scripts/e2e-publish.sh
+
+  # Reproducible multi-agent scenarios under tests/scenarios/. Kept out
+  # of the `pnpm test` aggregate so a scenario failure stays attributable
+  # to the harness rather than blending into the per-package test job.
+  scenarios:
+    if: github.event_name != 'pull_request' || github.event.pull_request.draft == false
+    runs-on: ubuntu-latest
+    needs: build
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v4
+      - uses: pnpm/action-setup@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'pnpm'
+      - run: pnpm install --frozen-lockfile
+      - run: pnpm scenarios
2 changes: 1 addition & 1 deletion README.md
@@ -1015,7 +1015,7 @@ validate.
 - 🟡 Cursor and Gemini CLI installers exist but have less smoke coverage
 - 🔵 Per-runtime smoke for claim-before-edit emission
 - 🔵 Cross-runtime handoff smoke (Codex hands off to Claude, both run)
-- Reproducible test fixture set under `tests/scenarios/`
+- Reproducible test fixture set under `tests/scenarios/` (5 scenarios, harness self-tests, `pnpm scenarios`)

 > **`time-to-healthy`: still hours**, but the time the human spends _deciding what to run_ drops sharply because every signal carries its `cmd:` and `tool:` already.

31 changes: 31 additions & 0 deletions openspec/changes/scenarios-harness-2026-05-16/CHANGE.md
@@ -0,0 +1,31 @@
---
slug: scenarios-harness-2026-05-16
---

# CHANGE · scenarios-harness-2026-05-16

## §P proposal

### Problem

README §v0.x "Multi-runtime confidence" lists "Reproducible test fixture set under `tests/scenarios/`" as the last open item. Today, multi-agent situations (claim-before-edit, cross-runtime handoff, stale-claim sweep, plan claim adoption, pre/post path mismatch) live as ad-hoc smoke tests scattered across `packages/hooks/test/` and `apps/cli/test/`. Each rebuilds its own tempdir + git repo + fake-timer scaffolding inline. Reproducing a regression means hand-porting that scaffolding into a fresh file.

### Proposal

Add a reproducible test-scenarios harness under `tests/scenarios/`. Each scenario is a directory of plaintext artifacts (no binary snapshots):

- `seed.sql`: applied to a fresh tempdir SQLite DB after schema migrations.
- `inputs.jsonl`: one envelope per line, `{kind, at_ms, payload}`, where `kind` is `lifecycle | mcp | task | tick`. Lifecycle envelopes flow through the same `runOmxLifecycleEnvelope` that production hooks call.
- `expected.json`: normalized substrate snapshot with subset matchers (`toMatchObject` style), not full-row equality. Paths are normalized to `<REPO_ROOT>`.
- Optional `meta.yaml`: runtimes, tags, description.
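
The subset matching and path normalization described for `expected.json` can be sketched as follows. This is only an illustration under stated assumptions: `normalizePaths` and `matchesSubset` are hypothetical names that do not appear in the PR, and the real harness may rely on vitest's `toMatchObject` directly.

```typescript
// Hypothetical helpers for illustration; not the harness's actual code.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Replace every occurrence of the absolute repo root with `<REPO_ROOT>`.
function normalizePaths(value: Json, repoRoot: string): Json {
  if (typeof value === "string") return value.split(repoRoot).join("<REPO_ROOT>");
  if (Array.isArray(value)) return value.map((item) => normalizePaths(item, repoRoot));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, item] of Object.entries(value)) out[key] = normalizePaths(item, repoRoot);
    return out;
  }
  return value;
}

// True when every key in `expected` exists in `actual` with a matching value.
// Extra keys in `actual` are ignored; arrays must match element by element.
function matchesSubset(actual: Json, expected: Json): boolean {
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual) || actual.length !== expected.length) return false;
    const actualItems: Json[] = actual;
    return expected.every((item, i) => matchesSubset(actualItems[i], item));
  }
  if (expected !== null && typeof expected === "object") {
    if (actual === null || typeof actual !== "object" || Array.isArray(actual)) return false;
    const actualObj: { [key: string]: Json } = actual;
    return Object.entries(expected).every(
      ([key, item]) => key in actualObj && matchesSubset(actualObj[key], item),
    );
  }
  return actual === expected;
}
```

With subset matching, an expected entry only needs the fields a scenario cares about, so unrelated columns (row ids, absolute timestamps) can change without breaking the fixture.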

A shared `_harness/` drives all scenarios via `vi.useFakeTimers` + `vi.setSystemTime(BASE_TS + at_ms)` per envelope, so timing is deterministic. Embeddings are forced to `provider: none` so no scenario touches the network. Five canonical scenarios ship in this PR: claim-before-edit, cross-runtime handoff, stale-claim sweep, plan claim adoption, and pre/post path mismatch. Two harness self-tests prove the runner fails closed when `expected.json` is missing and reports a clear diff on mismatch. A separate CI job runs `pnpm scenarios` on Node 20 after `build`; it is kept out of `pnpm test` so failure attribution stays clean.
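
The per-envelope timing can be sketched without vitest. The sketch assumes that `#` lines in `inputs.jsonl` are comments (the scenario files in this PR contain them) and that `task` is a valid `kind` (scenario 02's inputs use it). `parseInputs`, `ScheduledEnvelope`, and the rejection of backwards timestamps are hypothetical and not taken from the harness.

```typescript
// Hypothetical types and names; the real harness may differ.
type EnvelopeKind = "lifecycle" | "mcp" | "task" | "tick";

interface Envelope {
  kind: EnvelopeKind;
  at_ms: number;
  payload: Record<string, unknown>;
}

interface ScheduledEnvelope extends Envelope {
  // Absolute time the fake clock would be set to before this envelope runs.
  at: number;
}

const KINDS: readonly string[] = ["lifecycle", "mcp", "task", "tick"];

// Runtime check that a parsed JSON value has the {kind, at_ms, payload} shape.
function isEnvelope(value: unknown): value is Envelope {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.kind === "string" &&
    KINDS.includes(v.kind) &&
    typeof v.at_ms === "number" &&
    typeof v.payload === "object" &&
    v.payload !== null
  );
}

// Parse inputs.jsonl text: skip blank lines and `#` comments, validate each
// envelope, and compute the absolute timestamp baseTs + at_ms.
function parseInputs(text: string, baseTs: number): ScheduledEnvelope[] {
  const scheduled: ScheduledEnvelope[] = [];
  let previousAtMs = Number.NEGATIVE_INFINITY;
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "" || line.startsWith("#")) continue;
    const value: unknown = JSON.parse(line);
    if (!isEnvelope(value)) throw new Error(`line ${i + 1}: not a {kind, at_ms, payload} envelope`);
    // Assumption: envelopes are expected in non-decreasing time order.
    if (value.at_ms < previousAtMs) throw new Error(`line ${i + 1}: at_ms goes backwards`);
    previousAtMs = value.at_ms;
    scheduled.push({ ...value, at: baseTs + value.at_ms });
  }
  return scheduled;
}
```

In a vitest-based runner, each `at` value would be what gets passed to `vi.setSystemTime` before the envelope is dispatched, which is why identical inputs always produce identical timestamps.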

### Acceptance criteria

- `pnpm scenarios` runs all five scenarios plus two harness self-tests, all green.
- `pnpm scenarios:filter <slug>` runs a single scenario by name.
- `pnpm scenarios:explain <slug>` prints a human-readable timeline.
- `pnpm scenarios:record <slug>` regenerates `expected.json` from a live run (manual trim still required for subset matcher discipline).
- `.github/workflows/ci.yml` gains a `scenarios` job after `build` running on Node 20 only.
- `pnpm typecheck` and `pnpm build` clean.
7 changes: 6 additions & 1 deletion package.json
@@ -33,13 +33,18 @@
     "p": "pnpm run release",
     "publish:cli": "bash scripts/publish-cli.sh",
     "publish:cli:dry-run": "bash scripts/publish-cli.sh --dry-run",
-    "release": "changeset publish"
+    "release": "changeset publish",
+    "scenarios": "vitest run --config tests/scenarios/_harness/vitest.config.ts",
+    "scenarios:filter": "vitest run --config tests/scenarios/_harness/vitest.config.ts -t",
+    "scenarios:explain": "tsx tests/scenarios/_harness/explain.mts",
+    "scenarios:record": "tsx tests/scenarios/_harness/record.mts"
   },
   "devDependencies": {
     "@biomejs/biome": "^1.9.4",
     "@changesets/cli": "^2.27.9",
     "@types/node": "^22.9.0",
     "tsup": "^8.3.5",
+    "tsx": "^4.19.2",
     "typescript": "^5.6.3",
     "vitest": "^2.1.5"
   }
3 changes: 3 additions & 0 deletions pnpm-lock.yaml

Generated lockfile; diff not rendered by default.

112 changes: 112 additions & 0 deletions tests/scenarios/01-claim-before-edit/expected.json
@@ -0,0 +1,112 @@
{
  "observations": [
    {
      "kind": "lifecycle_event",
      "ts_offset": 10,
      "metadata_subset": {
        "event_name": "session_start",
        "session_id": "codex@scenario-01",
        "agent": "codex",
        "branch": "agent/scenario/default",
        "binding_status": "bound_task"
      }
    },
    {
      "kind": "omx-lifecycle",
      "metadata_subset": {
        "event_id": "evt_01_session",
        "event_type": "session_start",
        "ok": true
      }
    },
    {
      "kind": "omx-lifecycle",
      "metadata_subset": {
        "event_id": "evt_01_bind",
        "event_type": "task_bind",
        "ok": true
      }
    },
    {
      "kind": "claim",
      "ts_offset": 40,
      "metadata_subset": {
        "kind": "claim",
        "source": "pre-tool-use",
        "file_path": "src/target.ts",
        "auto_claimed_before_edit": true,
        "tool": "Edit"
      }
    },
    {
      "kind": "claim-before-edit",
      "ts_offset": 40,
      "metadata_subset": {
        "kind": "claim-before-edit",
        "source": "pre-tool-use",
        "outcome": "auto_claimed_before_edit",
        "file_path": "src/target.ts",
        "tool": "Edit",
        "conflict": false
      }
    },
    {
      "kind": "omx-lifecycle",
      "ts_offset": 40,
      "metadata_subset": {
        "event_id": "evt_01_pre",
        "event_type": "pre_tool_use",
        "tool_name": "Edit",
        "extracted_paths": ["src/target.ts"]
      }
    },
    {
      "kind": "tool_use",
      "ts_offset": 60,
      "metadata_subset": {
        "tool": "Edit",
        "lifecycle_event_id": "evt_01_post",
        "parent_event_id": "evt_01_pre",
        "file_path": "src/target.ts"
      }
    },
    {
      "kind": "omx-lifecycle",
      "ts_offset": 60,
      "metadata_subset": {
        "event_id": "evt_01_post",
        "event_type": "post_tool_use",
        "parent_event_id": "evt_01_pre",
        "tool_name": "Edit"
      }
    }
  ],
  "claims": [
    {
      "task_id": 1,
      "file_path": "src/target.ts",
      "session_id": "codex@scenario-01",
      "state": "active"
    }
  ],
  "mcp_metrics": [],
  "lifecycle_events": [
    {
      "event_type": "session_start",
      "event_id": "evt_01_session"
    },
    {
      "event_type": "task_bind",
      "event_id": "evt_01_bind"
    },
    {
      "event_type": "pre_tool_use",
      "event_id": "evt_01_pre"
    },
    {
      "event_type": "post_tool_use",
      "event_id": "evt_01_post",
      "parent_event_id": "evt_01_pre"
    }
  ]
}
7 changes: 7 additions & 0 deletions tests/scenarios/01-claim-before-edit/inputs.jsonl
@@ -0,0 +1,7 @@
# Scenario 01 — claim-before-edit
# A codex session starts and binds to a task, then issues a pre_tool_use +
# post_tool_use Edit on src/target.ts; pre_tool_use auto-claims the file first.
{"kind":"lifecycle","at_ms":10,"payload":{"event_id":"evt_01_session","event_name":"session_start","session_id":"codex@scenario-01","agent":"codex","branch":"agent/scenario/default"}}
{"kind":"lifecycle","at_ms":20,"payload":{"event_id":"evt_01_bind","event_name":"task_bind","session_id":"codex@scenario-01","agent":"codex","branch":"agent/scenario/default"}}
{"kind":"lifecycle","at_ms":40,"payload":{"event_id":"evt_01_pre","event_name":"pre_tool_use","session_id":"codex@scenario-01","agent":"codex","branch":"agent/scenario/default","tool_name":"Edit","tool_input":{"operation":"replace","paths":[{"path":"<REPO_ROOT>/src/target.ts","role":"target","kind":"file"}]}}}
{"kind":"lifecycle","at_ms":60,"payload":{"event_id":"evt_01_post","event_name":"post_tool_use","parent_event_id":"evt_01_pre","session_id":"codex@scenario-01","agent":"codex","branch":"agent/scenario/default","tool_name":"Edit","tool_input":{"operation":"replace","paths":[{"path":"<REPO_ROOT>/src/target.ts","role":"target","kind":"file"}]},"tool_response":{"success":true}}}
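
The `parent_event_id` on `evt_01_post` is what ties the post-edit observation back to `evt_01_pre`. A minimal sketch of a fixture lint that checks this linkage, assuming a hypothetical `checkToolPairs` helper (nothing in the PR indicates the harness performs this check):

```typescript
// Hypothetical lint for illustration; field names follow the payloads above.
interface LifecyclePayload {
  event_id: string;
  event_name: string;
  parent_event_id?: string;
  tool_name?: string;
}

// Returns a list of problems. An empty list means every post_tool_use points
// at an earlier pre_tool_use that used the same tool.
function checkToolPairs(events: LifecyclePayload[]): string[] {
  const preByEventId = new Map<string, LifecyclePayload>();
  const problems: string[] = [];
  for (const ev of events) {
    if (ev.event_name === "pre_tool_use") preByEventId.set(ev.event_id, ev);
    if (ev.event_name !== "post_tool_use") continue;
    const parent =
      ev.parent_event_id === undefined ? undefined : preByEventId.get(ev.parent_event_id);
    if (parent === undefined) {
      problems.push(`${ev.event_id}: no earlier pre_tool_use parent`);
    } else if (parent.tool_name !== ev.tool_name) {
      problems.push(`${ev.event_id}: tool_name differs from ${parent.event_id}`);
    }
  }
  return problems;
}
```

Run against the four payloads of scenario 01, this returns an empty list, because `evt_01_post` names `evt_01_pre` as its parent and both use `Edit`.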
10 changes: 10 additions & 0 deletions tests/scenarios/01-claim-before-edit/meta.yaml
@@ -0,0 +1,10 @@
description: |
  Codex session binds to a task on agent/scenario/default and edits
  src/target.ts. Proves pre_tool_use synthesizes a claim-before-edit
  signal and that the post_tool_use observation lands with the claim
  already in place.
runtimes:
  - codex
tags:
  - claim-before-edit
  - lifecycle
2 changes: 2 additions & 0 deletions tests/scenarios/01-claim-before-edit/seed.sql
@@ -0,0 +1,2 @@
-- No seed needed: the lifecycle session_start + task_bind events drive
-- session creation and task binding by themselves.
92 changes: 92 additions & 0 deletions tests/scenarios/02-cross-runtime-handoff/expected.json
@@ -0,0 +1,92 @@
{
  "observations": [
    {
      "kind": "lifecycle_event",
      "metadata_subset": {
        "event_name": "session_start",
        "session_id": "codex@scenario-02",
        "agent": "codex"
      }
    },
    {
      "kind": "omx-lifecycle",
      "metadata_subset": {
        "event_id": "evt_02_codex_session",
        "event_type": "session_start"
      }
    },
    {
      "kind": "omx-lifecycle",
      "metadata_subset": {
        "event_id": "evt_02_codex_bind",
        "event_type": "task_bind"
      }
    },
    {
      "kind": "claim",
      "ts_offset": 30,
      "metadata_subset": {
        "kind": "claim",
        "file_path": "src/target.ts"
      }
    },
    {
      "kind": "relay",
      "ts_offset": 50,
      "metadata_subset": {
        "kind": "relay",
        "from_session_id": "codex@scenario-02",
        "from_agent": "codex",
        "reason": "quota"
      }
    },
    {
      "kind": "claim-weakened",
      "ts_offset": 50,
      "metadata_subset": {
        "kind": "claim-weakened",
        "file_path": "src/target.ts",
        "state": "handoff_pending"
      }
    },
    {
      "kind": "lifecycle_event",
      "ts_offset": 300000,
      "metadata_subset": {
        "event_name": "session_start",
        "session_id": "claude@scenario-02",
        "agent": "claude"
      }
    },
    {
      "kind": "omx-lifecycle",
      "metadata_subset": {
        "event_id": "evt_02_claude_session",
        "event_type": "session_start"
      }
    }
  ],
  "claims": [
    {
      "task_id": 1,
      "file_path": "src/target.ts",
      "session_id": "claude@scenario-02",
      "state": "active"
    }
  ],
  "mcp_metrics": [],
  "lifecycle_events": [
    {
      "event_type": "session_start",
      "event_id": "evt_02_codex_session"
    },
    {
      "event_type": "task_bind",
      "event_id": "evt_02_codex_bind"
    },
    {
      "event_type": "session_start",
      "event_id": "evt_02_claude_session"
    }
  ]
}
11 changes: 11 additions & 0 deletions tests/scenarios/02-cross-runtime-handoff/inputs.jsonl
@@ -0,0 +1,11 @@
# Scenario 02 — cross-runtime handoff
# Codex binds to a task, claims src/target.ts, then relays out. Claude
# session adopts via accept_relay. Proves the baton pass survives
# across runtimes and the claim ends up owned by claude's session.
{"kind":"lifecycle","at_ms":10,"payload":{"event_id":"evt_02_codex_session","event_name":"session_start","session_id":"codex@scenario-02","agent":"codex","branch":"agent/scenario/default"}}
{"kind":"lifecycle","at_ms":20,"payload":{"event_id":"evt_02_codex_bind","event_name":"task_bind","session_id":"codex@scenario-02","agent":"codex","branch":"agent/scenario/default"}}
{"kind":"task","at_ms":30,"payload":{"action":"claim_file","task_id":1,"session_id":"codex@scenario-02","file_path":"src/target.ts","note":"codex claims before relay"}}
{"kind":"task","at_ms":50,"payload":{"action":"relay","task_id":1,"from_session_id":"codex@scenario-02","from_agent":"codex","reason":"quota","one_line":"codex hit quota, handing off","base_branch":"main","expires_in_ms":600000}}
{"kind":"lifecycle","at_ms":300000,"payload":{"event_id":"evt_02_claude_session","event_name":"session_start","session_id":"claude@scenario-02","agent":"claude","branch":"agent/scenario/default"}}
{"kind":"task","at_ms":300100,"payload":{"action":"join","task_id":1,"session_id":"claude@scenario-02","agent":"claude"}}
{"kind":"task","at_ms":300200,"payload":{"action":"accept_relay","task_id":1,"session_id":"claude@scenario-02"}}
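
The claim's path through this scenario (`active` under codex, `handoff_pending` after the relay, `active` under claude after `accept_relay`) can be modelled as a small state transition. This is a toy model under stated assumptions: `relayOut`, `acceptRelay`, and the expiry rule (relay creation time plus `expires_in_ms`) are hypothetical and not taken from the project's code.

```typescript
// Toy model for illustration; state names come from expected.json,
// everything else is an assumption.
type ClaimState = "active" | "handoff_pending";

interface Claim {
  task_id: number;
  file_path: string;
  session_id: string;
  state: ClaimState;
}

interface Relay {
  task_id: number;
  from_session_id: string;
  created_at_ms: number;
  expires_in_ms: number;
}

// Relaying out weakens the sender's claims on the task to handoff_pending.
function relayOut(claims: Claim[], relay: Relay): Claim[] {
  return claims.map((c): Claim =>
    c.task_id === relay.task_id && c.session_id === relay.from_session_id
      ? { ...c, state: "handoff_pending" }
      : c,
  );
}

// Accepting an unexpired relay moves pending claims to the new session.
function acceptRelay(claims: Claim[], relay: Relay, toSessionId: string, nowMs: number): Claim[] {
  if (nowMs > relay.created_at_ms + relay.expires_in_ms) throw new Error("relay expired");
  return claims.map((c): Claim =>
    c.task_id === relay.task_id && c.state === "handoff_pending"
      ? { ...c, session_id: toSessionId, state: "active" }
      : c,
  );
}
```

With the scenario's numbers, the relay is created at 50 ms with `expires_in_ms: 600000` and accepted at 300200 ms, well inside the window, so the claim ends up `active` under `claude@scenario-02` as `expected.json` requires.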
13 changes: 13 additions & 0 deletions tests/scenarios/02-cross-runtime-handoff/meta.yaml
@@ -0,0 +1,13 @@
description: |
  Codex binds to a task on agent/scenario/default and claims
  src/target.ts. Codex then relays out (quota reason). 5 minutes later
  a claude session starts, joins the task, and accepts the relay.
  Proves the cross-runtime baton pass and that the claim ends up
  owned by claude's session.
runtimes:
  - codex
  - claude
tags:
  - handoff
  - relay
  - cross-runtime
2 changes: 2 additions & 0 deletions tests/scenarios/02-cross-runtime-handoff/seed.sql
@@ -0,0 +1,2 @@
-- No seed needed: lifecycle session_start + task_bind create the task,
-- task envelopes drive the relay and adoption.