From b54b78e025bf63962deab4029bec628346a4b231 Mon Sep 17 00:00:00 2001 From: Layne Penney Date: Wed, 15 Apr 2026 10:27:25 -0500 Subject: [PATCH 1/4] design: hook/event contract and PR lifecycle for gr2 Sprint 20 --- gr2/docs/HOOK-EVENT-CONTRACT.md | 607 ++++++++++++++++++++++++++++++++ gr2/docs/PR-LIFECYCLE.md | 569 ++++++++++++++++++++++++++++++ 2 files changed, 1176 insertions(+) create mode 100644 gr2/docs/HOOK-EVENT-CONTRACT.md create mode 100644 gr2/docs/PR-LIFECYCLE.md diff --git a/gr2/docs/HOOK-EVENT-CONTRACT.md b/gr2/docs/HOOK-EVENT-CONTRACT.md new file mode 100644 index 0000000..5807147 --- /dev/null +++ b/gr2/docs/HOOK-EVENT-CONTRACT.md @@ -0,0 +1,607 @@ +# gr2 Hook/Event Contract + +This document defines the event contract for gr2: what events the system emits, +their schema, delivery model, and how consumers (spawn, recall, channel bridge) +integrate. + +This is a **design document** for Sprint 20. It does not describe current +behavior; it defines the target contract. + +## 1. Design Goals + +- Every gr2 operation that changes workspace state emits a typed event. +- Events are durable, append-only, and replayable. +- Consumers read events at their own pace via cursors. gr2 does not block on + delivery. +- The event schema is the stable API between OSS gr2 and premium spawn. +- Hook execution is one event source among several, not the only one. + +## 2. Event Sources + +gr2 emits events from five operational domains: + +| Domain | Examples | Current State | +|--------|----------|---------------| +| **Lane lifecycle** | lane.created, lane.entered, lane.exited, lane.archived | Partial (SYNAPT-INTEGRATION.md defines format, not wired) | +| **Lease lifecycle** | lease.acquired, lease.released, lease.expired, lease.force_broken | Prototype only | +| **Hook execution** | hook.started, hook.completed, hook.failed | hooks.py runs commands but emits nothing | +| **PR lifecycle** | pr.created, pr.status_changed, pr.merged, pr.checks_passed | Missing (Sprint 20 deliverable) | +| **Sync operations** | sync.started, sync.repo_updated, sync.completed, sync.conflict | Missing (Atlas's sync algorithm design) | + +Each domain owns a namespace prefix. Events are globally ordered by timestamp +and monotonic sequence number within the outbox. + +## 3. Event Schema + +### 3.1 Common Envelope + +Every event shares a common envelope: + +```json +{ + "version": 1, + "event_id": "a1b2c3d4e5f6", + "seq": 42, + "timestamp": "2026-04-15T16:30:00+00:00", + "type": "lane.entered", + "workspace": "synapt-dev", + "actor": "agent:apollo", + "agent_id": "agent_apollo_xyz789", + "owner_unit": "apollo", + "payload": { ... } +} +``` + +Field definitions: + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `version` | int | yes | Schema version. Always `1` for this contract. | +| `event_id` | string | yes | Unique event identifier. 16-char hex from `os.urandom(8).hex()`. | +| `seq` | int | yes | Monotonically increasing sequence number within this outbox file. Starts at 1. | +| `timestamp` | string | yes | ISO 8601 with timezone. | +| `type` | string | yes | Dotted event type from the taxonomy (section 3.2). | +| `workspace` | string | yes | Workspace name from WorkspaceSpec. | +| `actor` | string | yes | Who triggered the event. Format: `agent:`, `human:`, or `system`. | +| `agent_id` | string | no | Persistent agent identity from premium. Opaque in OSS. | +| `owner_unit` | string | yes | Unit that owns the context where this event occurred. | +| `payload` | object | yes | Event-type-specific data. | + +Rules: +- `event_id` must be unique within a workspace. +- `seq` must be strictly monotonically increasing within a single outbox file. +- `actor` uses the prefix convention to distinguish agents from humans from + automated operations. +- `agent_id` is optional because human-triggered and system-triggered events + do not have one. + +### 3.2 Event Type Taxonomy + +#### Lane Lifecycle + +| Type | Trigger | Payload | +|------|---------|---------| +| `lane.created` | `gr2 lane create` | `{lane_name, lane_type, repos: [str], branch_map: {repo: branch}}` | +| `lane.entered` | `gr2 lane enter` | `{lane_name, lane_type, repos: [str]}` | +| `lane.exited` | `gr2 lane exit` | `{lane_name, stashed_repos: [str]}` | +| `lane.switched` | Enter a different lane (exit + enter) | `{from_lane, to_lane, stashed_repos: [str]}` | +| `lane.archived` | Lane cleanup after merge | `{lane_name, reason}` | + +#### Lease Lifecycle + +| Type | Trigger | Payload | +|------|---------|---------| +| `lease.acquired` | `gr2 lane lease acquire` | `{lane_name, mode, ttl_seconds, lease_id}` | +| `lease.released` | `gr2 lane lease release` | `{lane_name, lease_id}` | +| `lease.expired` | TTL watchdog or next acquire check | `{lane_name, lease_id, expired_at}` | +| `lease.force_broken` | `--force` acquire or admin break | `{lane_name, lease_id, broken_by, reason}` | + +#### Hook Execution + +| Type | Trigger | Payload | +|------|---------|---------| +| `hook.started` | Lifecycle hook begins execution | `{stage, hook_name, repo, command, cwd}` | +| `hook.completed` | Hook exits successfully | `{stage, hook_name, repo, duration_ms, exit_code: 0}` | +| `hook.failed` | Hook exits with non-zero code | `{stage, hook_name, repo, duration_ms, exit_code, on_failure, stderr_tail}` | +| `hook.skipped` | Hook `when` condition not met | `{stage, hook_name, repo, reason}` | + +Rules for hook events: +- `stderr_tail` is the last 500 bytes of stderr, truncated. Full output is not + stored in the event. +- `on_failure` records the policy that was applied (block, warn, skip). +- `hook.failed` with `on_failure: "block"` means the parent operation also + failed. Consumers should expect a corresponding operation failure event. + +#### PR Lifecycle + +| Type | Trigger | Payload | +|------|---------|---------| +| `pr.created` | `gr2 pr create` | `{pr_group_id, repos: [{repo, pr_number, url, title, base, head}]}` | +| `pr.status_changed` | Poll or webhook | `{pr_group_id, repo, pr_number, old_status, new_status}` | +| `pr.checks_passed` | All CI checks green | `{pr_group_id, repo, pr_number}` | +| `pr.checks_failed` | CI check failure | `{pr_group_id, repo, pr_number, failed_checks: [str]}` | +| `pr.review_submitted` | Review posted | `{pr_group_id, repo, pr_number, reviewer, verdict}` | +| `pr.merged` | `gr2 pr merge` | `{pr_group_id, repos: [{repo, pr_number, merge_sha}]}` | +| `pr.merge_failed` | Merge blocked or conflict | `{pr_group_id, repo, pr_number, reason}` | + +`pr_group_id` is the cross-repo correlation key. When `gr2 pr create` creates +PRs in multiple repos, they share the same `pr_group_id`. This is how consumers +reconstruct the cross-repo PR as a unit. + +#### Sync Operations + +| Type | Trigger | Payload | +|------|---------|---------| +| `sync.started` | `gr2 sync` begins | `{repos: [str], strategy}` | +| `sync.repo_updated` | Single repo pull/rebase completes | `{repo, old_sha, new_sha, strategy, commits_pulled: int}` | +| `sync.repo_skipped` | Repo skipped (dirty, no remote, etc.) | `{repo, reason}` | +| `sync.conflict` | Merge/rebase conflict during sync | `{repo, conflicting_files: [str]}` | +| `sync.completed` | `gr2 sync` finishes | `{repos_updated: int, repos_skipped: int, repos_failed: int, duration_ms}` | + +#### Workspace Operations + +| Type | Trigger | Payload | +|------|---------|---------| +| `workspace.materialized` | `gr2 workspace materialize` or `gr2 apply` | `{repos: [{repo, first_materialize: bool}]}` | +| `workspace.file_projected` | File link/copy applied | `{repo, kind, src, dest}` | + +### 3.3 Payload Conventions + +- All paths in payloads are relative to `workspace_root`, never absolute. +- Repo names match `WorkspaceSpec` `[[repos]]` names, not filesystem paths. +- SHA values are full 40-char hex. +- Duration values are in milliseconds as integers. +- String arrays are used for repo lists, file lists, etc. Never comma-separated strings. + +## 4. Event Outbox + +### 4.1 Storage + +Events are written to a single append-only JSONL file: + +``` +.grip/events/outbox.jsonl +``` + +One JSON object per line. No trailing commas. No array wrapper. + +The outbox file is the single source of truth for all gr2 events in a workspace. + +### 4.2 Write Path + +Events are written synchronously at the point of state change: + +1. Operation performs its work (e.g., creates a lane, runs a hook). +2. Operation calls `emit_event(type, payload)`. +3. `emit_event` assigns `event_id`, `seq`, `timestamp`. +4. Event is serialized and appended to `outbox.jsonl`. +5. File is flushed (fsync not required; OS page cache is sufficient for + local-only delivery). + +`seq` is derived from the current line count of the outbox file plus one. This +is safe because gr2 operations are single-process. If concurrent writers become +necessary (multiple agents in the same workspace), `seq` assignment must move to +a lock or use a separate sequence file. + +### 4.3 Rotation + +When `outbox.jsonl` exceeds 10 MB: + +1. Rename to `outbox.{timestamp}.jsonl`. +2. Create new empty `outbox.jsonl` with `seq` continuing from the last value. +3. Old files are retained for 7 days, then eligible for cleanup by `gr2 gc`. + +Consumers must handle rotation by checking for new files when their cursor +points past the end of the current file. + +### 4.4 No Deletion + +Events are never deleted from the outbox. They are append-only. Rotation moves +old events to archived files but does not remove them. `gr2 gc` is the only +operation that removes archived event files, and only after the retention period. + +## 5. Consumer Model + +### 5.1 Cursor-Based Reading + +Each consumer maintains a cursor file in `.grip/events/cursors/`: + +``` +.grip/events/cursors/{consumer_name}.json +``` + +Cursor format: + +```json +{ + "consumer": "channel_bridge", + "last_seq": 41, + "last_event_id": "a1b2c3d4e5f6", + "last_read": "2026-04-15T16:31:00+00:00" +} +``` + +Reading flow: + +1. Consumer opens cursor file (or starts at seq 0 if no cursor exists). +2. Consumer reads `outbox.jsonl` from line `last_seq + 1` forward. +3. Consumer processes each event. +4. Consumer updates cursor atomically (write temp file, rename). + +### 5.2 Known Consumers + +| Consumer | Location | What It Does | +|----------|----------|--------------| +| **channel_bridge** | OSS | Derives `#dev`-style notifications from events. Posts to channel transport. | +| **recall_indexer** | OSS | Indexes events into recall for searchable lane/activity history. | +| **spawn_watcher** | Premium | Watches for events that trigger agent orchestration (lane assignments, PR readiness, hook failures). | + +### 5.3 Consumer Contract + +Consumers must: +- Be idempotent. Re-processing the same event (e.g., after a crash before + cursor update) must produce the same result. +- Use `event_id` for deduplication if their target store does not naturally + deduplicate. +- Not modify or delete events in the outbox. +- Handle unknown event types gracefully (skip, log, do not crash). +- Handle schema version bumps by checking `version` and ignoring events with + a version they do not understand. + +### 5.4 Spawn Integration (Premium) + +Spawn is the premium consumer that orchestrates multi-agent workflows. It +consumes the same outbox as OSS consumers but interprets events through the +lens of org policy and agent identity. + +Events that spawn cares about: + +| Event | Spawn Reaction | +|-------|----------------| +| `lane.created` | May assign agent to lane based on policy. | +| `pr.created` | May assign reviewers based on compiled review requirements. | +| `pr.checks_passed` | May trigger merge if auto-merge policy is active. | +| `pr.checks_failed` | May notify owning agent or escalate. | +| `hook.failed` with `on_failure: "block"` | May retry, reassign, or alert. | +| `lease.expired` | May reclaim the lane or notify the agent. | +| `sync.conflict` | May pause agent work on conflicting repos. | + +Spawn does not write to the outbox. Spawn's actions (assigning agents, +triggering merges) flow back through the gr2 CLI, which then emits its own +events. This prevents circular event chains. + +## 6. Hook Execution Contract + +This section formalizes the relationship between hook execution (hooks.py) and +event emission. + +### 6.1 Current State + +`hooks.py` currently: +- Parses `.gr2/hooks.toml` +- Resolves template variables +- Runs commands via `subprocess.run` +- Raises `SystemExit` on `on_failure: "block"` failures +- Prints JSON on `on_failure: "warn"` failures +- Does nothing on `on_failure: "skip"` failures + +It does **not** emit structured events. + +### 6.2 Target State + +Every hook execution produces events: + +``` +hook.started -> (command runs) -> hook.completed | hook.failed +``` + +If the hook's `when` condition is not met: + +``` +hook.skipped +``` + +The lifecycle stage runner (`run_lifecycle_stage`) becomes the event emitter. +After running all hooks for a stage, it emits the parent lifecycle event +(e.g., `lane.entered`) with a summary of hook results in the payload. + +### 6.3 Hook Output Capture + +Hook commands produce stdout and stderr. The event contract does not store full +output in events (it would bloat the outbox). Instead: + +- `hook.completed` includes `duration_ms` and `exit_code: 0`. +- `hook.failed` includes `duration_ms`, `exit_code`, `on_failure` policy, and + `stderr_tail` (last 500 bytes). +- Full stdout/stderr is written to: + ``` + .grip/events/hook_output/{event_id}.stdout + .grip/events/hook_output/{event_id}.stderr + ``` +- Hook output files follow the same retention policy as rotated outbox files. + +### 6.4 Hook Failure Propagation + +When a hook fails with `on_failure: "block"`: + +1. `hook.failed` event is emitted with `on_failure: "block"`. +2. The parent operation (e.g., `workspace.materialized`) is **not** emitted + because the operation did not complete. +3. Instead, the calling code should emit a domain-specific failure event + (e.g., `sync.conflict` or handle it in its own error path). + +When a hook fails with `on_failure: "warn"`: + +1. `hook.failed` event is emitted with `on_failure: "warn"`. +2. The parent operation continues and eventually emits its success event. +3. Consumers can correlate the `hook.failed` event with the parent by timestamp + and `owner_unit` context. + +When a hook fails with `on_failure: "skip"`: + +1. `hook.failed` event is emitted with `on_failure: "skip"`. +2. No consumer-visible notification. The event exists for audit trail only. + +## 7. Event Emission API + +### 7.1 Python Interface + +```python +from gr2.events import emit, EventType + +# Simple emission +emit( + event_type=EventType.LANE_ENTERED, + workspace_root=workspace_root, + actor="agent:apollo", + owner_unit="apollo", + payload={ + "lane_name": "feat/hook-events", + "lane_type": "feature", + "repos": ["grip", "synapt"], + }, +) + +# With optional agent_id +emit( + event_type=EventType.HOOK_FAILED, + workspace_root=workspace_root, + actor="agent:apollo", + agent_id="agent_apollo_xyz789", + owner_unit="apollo", + payload={ + "stage": "on_materialize", + "hook_name": "editable-install", + "repo": "synapt", + "duration_ms": 3400, + "exit_code": 1, + "on_failure": "block", + "stderr_tail": "ERROR: pip install failed ...", + }, +) +``` + +### 7.2 EventType Enum + +```python +class EventType(str, Enum): + # Lane lifecycle + LANE_CREATED = "lane.created" + LANE_ENTERED = "lane.entered" + LANE_EXITED = "lane.exited" + LANE_SWITCHED = "lane.switched" + LANE_ARCHIVED = "lane.archived" + + # Lease lifecycle + LEASE_ACQUIRED = "lease.acquired" + LEASE_RELEASED = "lease.released" + LEASE_EXPIRED = "lease.expired" + LEASE_FORCE_BROKEN = "lease.force_broken" + + # Hook execution + HOOK_STARTED = "hook.started" + HOOK_COMPLETED = "hook.completed" + HOOK_FAILED = "hook.failed" + HOOK_SKIPPED = "hook.skipped" + + # PR lifecycle + PR_CREATED = "pr.created" + PR_STATUS_CHANGED = "pr.status_changed" + PR_CHECKS_PASSED = "pr.checks_passed" + PR_CHECKS_FAILED = "pr.checks_failed" + PR_REVIEW_SUBMITTED = "pr.review_submitted" + PR_MERGED = "pr.merged" + PR_MERGE_FAILED = "pr.merge_failed" + + # Sync operations + SYNC_STARTED = "sync.started" + SYNC_REPO_UPDATED = "sync.repo_updated" + SYNC_REPO_SKIPPED = "sync.repo_skipped" + SYNC_CONFLICT = "sync.conflict" + SYNC_COMPLETED = "sync.completed" + + # Workspace operations + WORKSPACE_MATERIALIZED = "workspace.materialized" + WORKSPACE_FILE_PROJECTED = "workspace.file_projected" +``` + +### 7.3 Implementation Location + +The event emission module lives at: + +``` +gr2/python_cli/events.py +``` + +This module owns: +- `emit()` function +- `EventType` enum +- Outbox file management (append, rotation, seq tracking) +- Cursor read helpers for consumers + +It does **not** own consumer logic. Each consumer is a separate module. + +## 8. Channel Bridge Event Mapping + +The channel bridge translates gr2 events into channel messages. Not every event +produces a channel message. + +| Event | Channel Message | Channel | +|-------|----------------|---------| +| `lane.created` | `"{actor} created lane {lane_name} [{lane_type}] repos={repos}"` | #dev | +| `lane.entered` | `"{actor} entered {owner_unit}/{lane_name}"` | #dev | +| `lane.exited` | `"{actor} exited {owner_unit}/{lane_name}"` | #dev | +| `pr.created` | `"{actor} opened PR group {pr_group_id}: {repos}"` | #dev | +| `pr.merged` | `"{actor} merged PR group {pr_group_id}"` | #dev | +| `pr.checks_failed` | `"CI failed on {repo}#{pr_number}: {failed_checks}"` | #dev | +| `hook.failed` (block) | `"Hook {hook_name} failed in {repo} (blocking): {stderr_tail}"` | #dev | +| `sync.conflict` | `"Sync conflict in {repo}: {conflicting_files}"` | #dev | + +Events not listed (hook.started, hook.completed, hook.skipped, lease.acquired, +lease.released, sync.repo_updated, workspace.file_projected, etc.) are **not** +posted to channels by default. They exist in the outbox for recall indexing and +spawn, but would be noise in `#dev`. + +The channel bridge can be configured to include or exclude specific event types +via a filter file at `.grip/events/channel_filter.toml`: + +```toml +[channel_bridge] +include = ["lane.*", "pr.*", "hook.failed", "sync.conflict"] +exclude = ["hook.started", "hook.completed", "hook.skipped"] +``` + +Default: the mapping table above. Filter file is optional. + +## 9. Recall Indexing + +Recall indexes all events (not just the channel-visible subset) for searchable +history. The recall indexer is a cursor-based consumer that: + +1. Reads new events from the outbox. +2. Indexes each event by: lane, actor, repo, event type, and time range. +3. Stores indexed events in recall's existing storage layer. + +Query examples that this enables: + +- `recall_files(path="grip/src/main.rs")` can include "last sync updated this + file" if sync events include file-level detail. +- `recall_search("hook failure editable-install")` returns the hook.failed event + and its context. +- `recall_timeline(actor="agent:apollo", start="2026-04-15")` shows Apollo's + full activity timeline. + +The recall indexer does **not** need premium logic. It consumes the same neutral +event stream as the channel bridge. + +## 10. Failure Modes and Recovery + +### 10.1 Outbox Write Failure + +If `emit_event` fails to append (disk full, permission error): + +- The event is lost. The operation that triggered it still completed. +- The outbox may be in an inconsistent state (partial line written). +- Recovery: consumers skip malformed lines. `gr2 gc` can truncate trailing + partial lines. + +Mitigation: `emit_event` should catch write errors and log them to stderr +without crashing the parent operation. Events are important but not +operation-critical. + +### 10.2 Consumer Crash Mid-Processing + +If a consumer crashes after reading an event but before updating its cursor: + +- On restart, it re-reads from `last_seq + 1` and reprocesses events. +- This is safe because consumers must be idempotent (section 5.3). + +### 10.3 Outbox Rotation During Consumer Read + +If the outbox rotates while a consumer is reading: + +- The consumer's cursor points to a seq that no longer exists in the current + `outbox.jsonl`. +- Consumer must scan archived `outbox.{timestamp}.jsonl` files in order to find + the file containing its cursor position. +- Once caught up through archived files, it continues reading the current + `outbox.jsonl`. + +### 10.4 Concurrent Writers + +The current design assumes single-process writes (one gr2 CLI invocation at a +time per workspace). If concurrent writes become necessary: + +- Option A: File-level advisory lock during append. +- Option B: Separate outbox files per writer, with a merge step. +- Option C: Move to SQLite WAL-mode database. + +This is explicitly out of scope for the initial implementation. The single-writer +assumption is safe because gr2 operations are CLI-driven and workspace-local. + +## 11. Versioning and Evolution + +### 11.1 Schema Version + +The `version` field in the event envelope is `1` for this initial contract. + +Version bumps happen when: +- A required field is added to the common envelope. +- A payload field's type or meaning changes in a breaking way. + +Version bumps do **not** happen when: +- A new event type is added (consumers skip unknown types). +- An optional field is added to a payload. +- A new consumer is added. + +### 11.2 Backward Compatibility + +New event types are additive. Consumers that do not understand a new type skip +it. This means adding `pr.review_submitted` in a future release does not require +updating all consumers. + +Payload changes within an existing event type should be additive (new optional +fields). If a breaking change is needed, bump the version and document the +migration. + +## 12. Relation to Existing Documents + +This document supersedes the event-related sections of: + +- **SYNAPT-INTEGRATION.md** section 4 (Lane Event -> Recall Pipeline): This + contract formalizes and extends that design. The event format here is the + canonical schema; SYNAPT-INTEGRATION.md's examples are now illustrative only. +- **SYNAPT-INTEGRATION.md** section 5 (Channel Bridge): The channel bridge model + here is consistent but more precise about filtering and cursor management. + +This document builds on: + +- **HOOK-CONFIG-MODEL.md**: The hook execution contract (section 6) extends the + lifecycle model defined there. hooks.toml schema is unchanged; the new + contribution is event emission during hook execution. + +This document is a dependency for: + +- **PR-LIFECYCLE.md** (Sprint 20, Apollo): PR lifecycle design references the + pr.* event types defined here. +- **PLATFORM-ADAPTER-AND-SYNC.md** (Sprint 20, Atlas): Sync algorithm references + the sync.* event types defined here. +- **QA Arena** (Sprint 20, Sentinel): Adversarial test scenarios should exercise + event emission failure modes (section 10). + +## 13. Open Questions + +1. **Hook output retention**: Should hook output files (`.grip/events/hook_output/`) + follow the same 7-day retention as rotated outbox files, or longer? +2. **Event batching**: Should operations that touch multiple repos emit one event + per repo or one aggregate event? Current design uses both patterns depending + on the domain (sync uses per-repo events; PR uses aggregate events with + per-repo detail in payload arrays). +3. **Webhook bridge**: Should gr2 support an HTTP webhook consumer in addition to + file-based cursor consumers? This would be relevant for remote spawn + deployments. +4. **SQLite alternative**: For workspaces with heavy event traffic (many agents, + frequent operations), should the outbox be SQLite WAL instead of JSONL? + JSONL is simpler and auditable; SQLite handles concurrent writes better. +5. **Event signing**: Should events carry a signature or checksum for tamper + detection? Relevant if the outbox is consumed by premium policy enforcement. diff --git a/gr2/docs/PR-LIFECYCLE.md b/gr2/docs/PR-LIFECYCLE.md new file mode 100644 index 0000000..f48efda --- /dev/null +++ b/gr2/docs/PR-LIFECYCLE.md @@ -0,0 +1,569 @@ +# gr2 PR Lifecycle Management + +This document defines how gr2 manages pull requests across multiple repos. It +builds on Atlas's PlatformAdapter protocol and references the event contract in +HOOK-EVENT-CONTRACT.md. + +This is a **design document** for Sprint 20. It does not describe current +behavior; it defines the target design for `gr2 pr` commands. + +## 1. Design Goals + +- PR operations are cross-repo by default. `gr2 pr create` creates linked PRs + across all repos with changes on the current lane's branch. +- A PR group is the first-class unit. Individual repo PRs are children of the + group. +- PR state transitions emit events from the hook/event contract. +- PlatformAdapter is the only interface to the hosting platform. gr2 does not + shell out to `gh`, `glab`, or platform-specific CLIs. +- Merge ordering is explicit and configurable, not implicit. + +## 2. Concepts + +### 2.1 PR Group + +A **PR group** is a set of related PRs across repos that belong to the same +logical change. When an agent works on a lane that touches `grip`, `synapt`, and +`synapt-private`, `gr2 pr create` produces one PR group with three child PRs. + +```json +{ + "pr_group_id": "pg_8a3f1b2c", + "lane_name": "feat/hook-events", + "owner_unit": "apollo", + "created_by": "agent:apollo", + "created_at": "2026-04-15T17:00:00+00:00", + "title": "feat: hook/event contract design", + "base_branch": "sprint-20", + "head_branch": "design/hook-event-contract", + "prs": [ + { + "repo": "grip", + "pr_number": 570, + "url": "https://github.com/synapt-dev/grip/pull/570", + "status": "open", + "checks_status": "pending", + "reviews": [] + }, + { + "repo": "synapt", + "pr_number": 583, + "url": "https://github.com/synapt-dev/synapt/pull/583", + "status": "open", + "checks_status": "passing", + "reviews": [{"reviewer": "sentinel", "verdict": "approved"}] + } + ] +} +``` + +The `pr_group_id` is the cross-repo correlation key from the event contract. +Format: `pg_` prefix + 8-char hex. + +### 2.2 PR Group State + +A PR group has an aggregate state derived from its children: + +| Group State | Condition | +|-------------|-----------| +| `draft` | All child PRs are draft. | +| `open` | At least one child PR is open (non-draft). | +| `checks_pending` | At least one child PR has pending checks. | +| `checks_passing` | All child PRs have passing checks. | +| `checks_failing` | At least one child PR has failing checks. | +| `review_required` | At least one child PR needs more reviews to meet compiled review requirements. | +| `approved` | All child PRs meet their review requirements. | +| `mergeable` | All children are `checks_passing` + `approved` + no merge conflicts. | +| `merged` | All child PRs have been merged. | +| `partially_merged` | Some (but not all) child PRs have been merged. This is an error state. | + +Group state is computed, not stored. `gr2 pr status` queries each child PR via +PlatformAdapter and aggregates. + +### 2.3 PR Group Storage + +PR group metadata is stored at: + +``` +.grip/pr_groups/{pr_group_id}.json +``` + +This file is created by `gr2 pr create` and updated by `gr2 pr status` (to +cache last-known state) and `gr2 pr merge` (to record merge SHAs). + +The `.grip/pr_groups/` directory is workspace-local state, not committed to any +repo. + +## 3. Commands + +### 3.1 `gr2 pr create` + +Creates linked PRs across repos. + +``` +gr2 pr create + --title "feat: hook/event contract" + [--body "description"] + [--base sprint-20] + [--draft] + [--push] + [--json] +``` + +Flow: + +1. Load lane doc for `owner_unit/lane_name`. +2. For each repo in the lane's `repos` list: + a. Check if the repo has commits on `head_branch` not in `base_branch`. + b. Skip repos with no new commits (no empty PRs). + c. Push the branch if `--push` is set. + d. Call `PlatformAdapter.create_pr(repo, head, base, title, body, draft)`. + e. Record the returned PR number and URL. +3. Generate `pr_group_id`. +4. Write PR group metadata to `.grip/pr_groups/{pr_group_id}.json`. +5. Update each child PR's body to include cross-links: + ``` + ## Linked PRs (gr2 group: pg_8a3f1b2c) + - synapt-dev/grip#570 + - synapt-dev/synapt#583 + ``` +6. Emit `pr.created` event. +7. Print summary. + +**Cross-linking** is important: each child PR's body includes references to all +sibling PRs. This makes the relationship visible on the platform even if gr2 is +not available. + +**Base branch resolution**: If `--base` is not specified, use the lane's +`base_branch` (from lane doc) or fall back to the repo's default branch. + +### 3.2 `gr2 pr status` + +Shows aggregated status of the PR group for the current lane. + +``` +gr2 pr status [] + [--json] +``` + +Flow: + +1. Find the PR group for the lane (scan `.grip/pr_groups/` for matching + `lane_name` and `owner_unit`). +2. For each child PR, call `PlatformAdapter.get_pr_status(repo, pr_number)`. +3. For each child PR, call `PlatformAdapter.get_checks(repo, pr_number)`. +4. For each child PR, call `PlatformAdapter.get_reviews(repo, pr_number)`. +5. Aggregate into group state. +6. Evaluate review requirements from compiled workspace constraints. +7. Print summary table: + +``` +PR Group pg_8a3f1b2c: feat: hook/event contract +Lane: apollo/design/hook-event-contract -> sprint-20 + + Repo PR Checks Reviews Mergeable + grip #570 passing 1/1 required yes + synapt #583 pending 0/1 required no (checks pending) + + Group state: checks_pending + Blocking: synapt checks pending +``` + +If any child PR has status changes since the last cached state, emit +`pr.status_changed` events. + +### 3.3 `gr2 pr merge` + +Merges the PR group. + +``` +gr2 pr merge [] + [--strategy squash|merge|rebase] + [--force] + [--auto] + [--json] +``` + +Flow: + +1. Find the PR group for the lane. +2. Compute group state. If not `mergeable` and `--force` is not set, abort with + an error explaining what is blocking. +3. Determine merge order (section 4). +4. For each child PR in order: + a. Call `PlatformAdapter.merge_pr(repo, pr_number, strategy)`. + b. Record the merge SHA. + c. If merge fails, stop. Do not merge remaining repos. Emit + `pr.merge_failed` event. +5. If all merges succeed, emit `pr.merged` event. +6. Update PR group metadata with merge SHAs and final state. +7. Print summary. + +**`--auto` mode**: Instead of merging immediately, enable auto-merge on each +child PR. The platform merges each PR when its checks pass. This is useful for +CI-heavy repos where checks take time. Note: auto-merge relies on platform +support (GitHub has this; others may not). + +**`--force` mode**: Skip the `mergeable` gate. Useful when a reviewer override +is needed. Still respects platform-level branch protection. + +### 3.4 `gr2 pr checks` + +Shows CI/check status for the PR group. + +``` +gr2 pr checks [] + [--watch] + [--json] +``` + +Flow: + +1. Find the PR group. +2. For each child PR, call `PlatformAdapter.get_checks(repo, pr_number)`. +3. Print status per repo. + +`--watch` mode: Poll every 30 seconds and update the display. Emit +`pr.checks_passed` or `pr.checks_failed` events when the aggregate state +changes. + +### 3.5 `gr2 pr list` + +Lists PR groups in the workspace. + +``` +gr2 pr list + [--owner-unit ] + [--state open|merged|all] + [--json] +``` + +Flow: + +1. Scan `.grip/pr_groups/` for group metadata files. +2. Filter by owner_unit and state. +3. Print summary table. + +## 4. Merge Ordering + +When merging a PR group, the order matters. If `synapt-private` depends on +`synapt`, merging `synapt-private` first could break CI on the base branch. + +### 4.1 Default Order + +Merge in `[[repos]]` declaration order from WorkspaceSpec. This is the simplest +model and works when the workspace author has already ordered repos by +dependency. + +### 4.2 Explicit Order + +The workspace spec can declare merge ordering: + +```toml +[workspace_constraints.merge_order] +strategy = "explicit" +order = ["grip", "synapt", "synapt-private"] +``` + +### 4.3 Dependency-Aware Order (Future) + +A future extension could parse repo dependency graphs (e.g., pip dependencies, +Cargo workspace members) to derive merge order automatically. This is out of +scope for Sprint 20. + +### 4.4 Partial Merge Recovery + +If merge fails partway through (repo A merged, repo B failed): + +1. The PR group enters `partially_merged` state. +2. `pr.merge_failed` event is emitted for repo B. +3. The already-merged repo A cannot be un-merged. +4. Options: + a. Fix the issue in repo B and retry `gr2 pr merge`. + b. Revert repo A's merge manually and start over. + +This is the most dangerous failure mode in cross-repo PR management. The design +doc acknowledges it but does not try to solve it automatically. The right +mitigation is: + +- Run `gr2 pr checks` and confirm all checks pass before merging. +- Use `--auto` mode to let the platform gate each merge on checks. +- Keep the merge order aligned with dependency order so downstream repos + merge after their dependencies. + +## 5. PlatformAdapter Integration + +### 5.1 Adapter Protocol (Atlas's Design) + +gr2's PR lifecycle consumes Atlas's PlatformAdapter protocol. The expected +interface (from Atlas's `platform.py`): + +```python +class PlatformAdapter(Protocol): + def create_pr(self, repo: str, head: str, base: str, + title: str, body: str, draft: bool) -> PRRef: ... + def get_pr(self, repo: str, pr_number: int) -> PRStatus: ... + def merge_pr(self, repo: str, pr_number: int, + strategy: str) -> MergeResult: ... + def get_checks(self, repo: str, pr_number: int) -> list[PRCheck]: ... + def get_reviews(self, repo: str, pr_number: int) -> list[PRReview]: ... + def update_pr_body(self, repo: str, pr_number: int, body: str) -> None: ... +``` + +### 5.2 Adapter Resolution + +`get_platform_adapter(repo_spec)` resolves the correct adapter based on the +repo's remote URL. For Sprint 20, only `GitHubAdapter` is implemented. + +### 5.3 Rate Limiting + +The adapter handles rate limiting internally. If the platform returns a rate +limit response, the adapter retries with backoff. gr2 does not manage rate +limits at the PR lifecycle level. + +### 5.4 Relation to gr1 HostingPlatform + +gr1's Rust `HostingPlatform` trait (in `src/platform/traits.rs`) covers the same +operations. The Python PlatformAdapter is the gr2 equivalent, designed for +Python-first UX validation. When Rust gr2 absorbs PR lifecycle, it should +reuse the existing `HostingPlatform` trait, not create a third adapter surface. + +The mapping: + +| gr1 Rust trait | gr2 Python adapter | +|----------------|--------------------| +| `create_pull_request` | `create_pr` | +| `get_pull_request` | `get_pr` | +| `merge_pull_request` | `merge_pr` | +| `get_status_checks` | `get_checks` | +| `get_pull_request_reviews` | `get_reviews` | +| `update_pull_request_body` | `update_pr_body` | +| `find_pr_by_branch` | Not yet in adapter (needed for `gr2 pr status` without group ID) | +| `is_pull_request_approved` | Derived from `get_reviews` | + +## 6. Event Emission + +PR lifecycle emits events defined in HOOK-EVENT-CONTRACT.md section 3.2. + +### 6.1 Create Flow Events + +``` +gr2 pr create + -> pr.created (payload: pr_group_id, repos with pr_numbers) +``` + +### 6.2 Status Check Events + +``` +gr2 pr status (or --watch poll) + -> pr.status_changed (per repo, when status differs from cached) + -> pr.checks_passed (per repo, when all checks go green) + -> pr.checks_failed (per repo, when a check fails) + -> pr.review_submitted (per repo, when new review detected) +``` + +### 6.3 Merge Flow Events + +``` +gr2 pr merge + -> pr.merged (if all repos merge successfully) + or + -> pr.merge_failed (for the first repo that fails) +``` + +### 6.4 Event Ordering + +Events are emitted in operation order. For `gr2 pr merge` with repos A, B, C: + +1. Merge A succeeds (no event yet; waiting for group completion). +2. Merge B succeeds (no event yet). +3. Merge C succeeds. +4. Emit `pr.merged` with all three repos' merge SHAs. + +If merge B fails: + +1. Merge A succeeds (no event for A alone). +2. Merge B fails. +3. Emit `pr.merge_failed` for B. +4. Do not attempt C. + +The design emits one event at the end, not per-repo events during merge. This +keeps the event stream clean: consumers see either one `pr.merged` or one +`pr.merge_failed`, not a mix. + +## 7. Review Requirements + +### 7.1 Compiled Requirements + +Review requirements come from the compiled WorkspaceSpec (originally from +premium's org policy): + +```toml +[workspace_constraints.required_reviewers] +grip = 1 +synapt = 1 +synapt-private = 2 +``` + +### 7.2 Evaluation + +`gr2 pr status` evaluates review requirements per repo: + +1. Get reviews from PlatformAdapter. +2. Count approvals (excluding stale reviews on outdated commits). +3. Compare against compiled requirement. +4. Report satisfied/unsatisfied per repo. + +This already exists in the Python CLI as `gr2 review requirements`. The PR +lifecycle integrates it into the `mergeable` gate. + +### 7.3 Boundary + +Review requirement **evaluation** (counting approvals against a threshold) is +OSS. Review requirement **definition** (who can review, role-based overrides, +org-level policies) is premium. gr2 only consumes the compiled numeric +threshold. + +## 8. Cross-Link Format + +When `gr2 pr create` creates linked PRs, it appends a standard section to each +PR body: + +```markdown +--- + +## gr2 PR Group: pg_8a3f1b2c + +| Repo | PR | +|------|----| +| grip | synapt-dev/grip#570 | +| synapt | synapt-dev/synapt#583 | +| synapt-private | synapt-dev/synapt-private#291 | + +Lane: `apollo/design/hook-event-contract` +Base: `sprint-20` + +*Managed by [gr2](https://github.com/synapt-dev/grip)* +``` + +This section is: +- Machine-parseable (table format with consistent columns). +- Human-readable on GitHub/GitLab. +- Identifiable by the `gr2 PR Group:` header for updates. + +When `gr2 pr status` detects a new child PR was added (e.g., a new repo was +added to the lane), it updates all sibling PR bodies to include the new link. + +## 9. Lane Integration + +### 9.1 Lane -> PR Group Mapping + +A lane can have at most one active PR group. Creating a second PR group for the +same lane replaces the first (the old group is archived). + +The mapping is: +- Forward: lane doc stores `pr_group_id` when a PR group is created. +- Reverse: PR group metadata stores `lane_name` and `owner_unit`. + +### 9.2 Lane Exit with Open PRs + +When `gr2 lane exit` is called while the lane has an open PR group: + +- The lane exit proceeds normally (stash dirty state, run on_exit hooks). +- The PR group remains open. PRs are on the platform; they do not depend on + the local lane state. +- `gr2 pr status` can still query the group even after the lane is exited. + +### 9.3 Lane Archive after Merge + +When `gr2 pr merge` completes successfully: + +- The PR group is marked as `merged`. +- The lane is eligible for archival (`lane.archived` event). +- Actual archival (deleting the lane root, cleaning up branches) is a separate + command or automated by spawn. + +## 10. Relation to gr1 + +gr1's `gr pr create/status/merge/checks` commands are the production surface +today. They work but have implicit cross-repo linking (via branch name +convention, not explicit group IDs). + +gr2's PR lifecycle improves on gr1 in three ways: + +1. **Explicit grouping**: PR groups with stable IDs replace implicit branch-name + matching. +2. **Event emission**: Every PR state change produces a durable event. +3. **Platform abstraction**: PlatformAdapter replaces direct `gh` CLI calls. + +The migration path: gr1 continues to handle daily PR workflow until gr2's PR +commands are proven. gr2 PR commands are validated in the playground first +(Sentinel's QA arena), then adopted for real workflow. + +## 11. Implementation Plan + +### Sprint 20 (Design) + +- This document. +- Event schema for pr.* types (done, in HOOK-EVENT-CONTRACT.md). +- Coordinate with Atlas on PlatformAdapter method signatures. +- Coordinate with Sentinel on QA scenarios for PR lifecycle. + +### Sprint 21 (Implementation Target) + +1. `gr2/python_cli/pr.py` module with PR group CRUD. +2. `gr2 pr create` command consuming PlatformAdapter. +3. `gr2 pr status` command with aggregated state. +4. `gr2 pr merge` command with ordering. +5. Event emission at each step. +6. Integration tests in QA arena. + +### Sprint 22 (Polish) + +- `gr2 pr checks --watch` with polling. +- `gr2 pr list` for workspace-wide PR overview. +- Auto-merge mode. +- Edge cases from QA arena feedback. + +## 12. QA Arena Scenarios + +These scenarios should be covered by Sentinel's adversarial test suite: + +1. **Happy path**: Create PR group with 3 repos, all checks pass, all reviews + met, merge succeeds. +2. **Partial merge failure**: Repo A merges, repo B has a conflict. Verify + `partially_merged` state and `pr.merge_failed` event. +3. **Review requirement not met**: One repo needs 2 reviews, only has 1. Verify + `gr2 pr merge` blocks (without `--force`). +4. **Stale review**: Review was approved, then new commits pushed. Verify the + stale review is not counted. +5. **PR created with no changes in some repos**: Verify repos with no new + commits are skipped, not given empty PRs. +6. **Rate limiting**: PlatformAdapter returns rate limit during `gr2 pr merge`. + Verify retry behavior. +7. **Platform timeout**: PlatformAdapter times out during `gr2 pr status`. + Verify graceful degradation (show cached state with warning). +8. **Concurrent merge**: Two agents try to merge the same PR group. Verify only + one succeeds (platform-level atomicity). +9. **Cross-link update**: New repo added to lane after initial PR creation. + Verify cross-links are updated in all sibling PRs. +10. **Auto-merge mode**: Enable auto-merge on all child PRs. Verify events are + emitted when platform auto-merges each PR. + +## 13. Open Questions + +1. **PR group ID persistence**: Should `pr_group_id` be stored in the lane doc + (tying it to local state) or only in `.grip/pr_groups/` (making it + workspace-level state)? Current design uses both for forward/reverse lookup. +2. **Multi-platform groups**: Can a PR group span repos on different platforms + (e.g., grip on GitHub, infra on GitLab)? The adapter-per-repo model supports + this, but merge ordering and cross-linking become more complex. +3. **PR updates after creation**: Should `gr2 pr update` exist to change title, + body, or base branch of an existing group? Or is that always done directly + on the platform? +4. **Branch cleanup**: Should `gr2 pr merge` automatically delete remote + branches after merge? gr1 does this. gr2 should probably follow suit but + it is a destructive operation. +5. **Manifest repo PRs**: Should the manifest repo (if tracked) get its own + child PR in the group? gr1 includes the manifest in PR operations. gr2's + lane model may not always include the manifest. From f9b1d6c6a52fdbd127de4462be0589d69e5cb9ab Mon Sep 17 00:00:00 2001 From: Layne Penney Date: Wed, 15 Apr 2026 10:34:21 -0500 Subject: [PATCH 2/4] design: add failure recovery contract (section 14) to hook/event doc Covers lease reclaim (TTL-first + optional heartbeat renewal), hook failure markers (forward-only, no rollback), and dirty state handling on lane switch (--dirty=stash|block|discard). Co-Authored-By: Claude Opus 4.6 --- gr2/docs/HOOK-EVENT-CONTRACT.md | 131 ++++++++++++++++++++++++++++++++ 1 file changed, 131 insertions(+) diff --git a/gr2/docs/HOOK-EVENT-CONTRACT.md b/gr2/docs/HOOK-EVENT-CONTRACT.md index 5807147..60097cd 100644 --- a/gr2/docs/HOOK-EVENT-CONTRACT.md +++ b/gr2/docs/HOOK-EVENT-CONTRACT.md @@ -605,3 +605,134 @@ This document is a dependency for: JSONL is simpler and auditable; SQLite handles concurrent writes better. 5. **Event signing**: Should events carry a signature or checksum for tamper detection? Relevant if the outbox is consumed by premium policy enforcement. + +## 14. Failure Recovery Contract + +This section formalizes how gr2 handles operation failures at the state level. +Section 10 covers event infrastructure failures (outbox writes, consumer +crashes). This section covers operation-level failures: what happens to +workspace state when hooks fail, leases expire, or lane switches encounter +dirty repos. + +The core principle: **gr2 operations are forward-only. There is no rollback.** +Failures leave partial state with explicit markers that require resolution. + +### 14.1 Failure Markers + +When an operation fails mid-execution, gr2 writes a failure marker: + +``` +.grip/state/failures/{operation_id}.json +``` + +Marker format: + +```json +{ + "operation_id": "op_9f2a3b4c", + "operation": "sync", + "stage": "on_enter", + "hook_name": "editable-install", + "repo": "synapt", + "owner_unit": "apollo", + "lane_name": "feat/hook-events", + "failed_at": "2026-04-15T17:00:00+00:00", + "event_id": "abc123def456", + "partial_state": { + "repos_completed": ["grip"], + "repos_pending": ["synapt-private"], + "repo_failed": "synapt" + }, + "resolved": false +} +``` + +Marker behavior: + +- **Blocking**: The next operation on the same scope (lane, repos) checks for + unresolved failure markers. If one exists, the operation refuses to proceed + and reports the marker. +- **Resolution**: `gr2 lane resolve ` clears the marker. The + agent must decide whether to retry, skip, or escalate. Resolution is always + explicit. +- **Event**: Resolving a marker emits a new event type: + `failure.resolved` with payload `{operation_id, resolved_by, resolution}`. + +Why no automatic retry: retrying a failed hook might produce the same failure. +The agent (or spawn) has context about whether retry is appropriate. gr2 does +not guess. + +Why no rollback: reverting git operations (undo fetch+merge, undo checkout) is +dangerous, sometimes impossible (remote state changed), and introduces a second +failure mode (what if the revert fails?). Forward-only resolution is simpler and +more honest about what happened. + +### 14.2 Lease Reclaim Lifecycle + +Leases use TTL-first expiry with optional heartbeat renewal. + +**TTL expiry** is the primary reclaim mechanism: + +- Every lease carries `ttl_seconds` (default 900s) and `expires_at`. +- Expiry is checked lazily: the next `acquire`, `show`, or `status` call + evaluates `is_stale_lease()` (already in prototype at + `lane_workspace_prototype.py:592`). +- No daemon or background process required. + +**Heartbeat renewal** is optional: + +- `gr2 lane lease renew ` resets + `expires_at` to `now + ttl_seconds`. +- Agents running long operations (multi-repo test suites, large builds) call + renew periodically to prevent premature expiry. +- If the agent crashes, renewal stops, and TTL expiry reclaims the lease + naturally. + +**Reclaim flow**: + +1. Agent A holds lease with `expires_at = T`. +2. Agent A crashes (no explicit release). +3. Time passes beyond T. +4. Agent B calls `gr2 lane lease acquire`. +5. `acquire` finds A's lease, evaluates `is_stale_lease()` -> true. +6. Emits `lease.expired` event (payload: `{lane_name, lease_id, expired_at}`). +7. Garbage-collects A's stale lease from the lane doc. +8. Grants B's new lease. Emits `lease.acquired` event. + +**Force break**: + +- `gr2 lane lease acquire --force` breaks a live (non-expired) lease. +- Emits `lease.force_broken` event with `{broken_by, reason}`. +- Spawn can route this event to the original holder's channel as a notification. + +### 14.3 Dirty State on Lane Switch + +Lane transitions handle uncommitted changes via an explicit `--dirty` mode. + +**Modes** (flag on `lane enter` and `lane exit`): + +| Mode | Behavior | Default? | +|------|----------|----------| +| `stash` | Auto-stash dirty repos. Stash message: `"gr2 auto-stash: exiting {unit}/{lane}"`. | Yes | +| `block` | Refuse to switch if any repo is dirty. List dirty repos in error. | No | +| `discard` | Discard uncommitted changes. Requires `--yes` flag. | No | + +**Event payloads for dirty state**: + +- `lane.exited` with `stashed_repos: ["synapt"]` when stash mode is used. +- `lane.exited` with `discarded_repos: ["synapt"]` when discard mode is used. +- No `lane.exited` event when block mode prevents the exit. + +**Re-entry with stashed state**: + +When `lane enter` is called and the lane has stashed state from a previous exit: + +- Default: warn that stashed state exists, do not auto-pop. The agent decides + whether to `git stash pop` manually. +- `--dirty=restore` on `lane enter`: auto-pop the stash. If the pop produces + a merge conflict, leave the conflict markers and emit a `hook.failed`-style + warning event. + +**Consistency rule**: The `--dirty` flag and its values (`stash`, `block`, +`discard`, `restore`) must be consistent across `lane enter`, `lane exit`, and +`sync`. This is a shared contract with Atlas's sync algorithm design. From 1f75ea48accdf312947192332ff035cc34e64c3f Mon Sep 17 00:00:00 2001 From: Layne Penney Date: Wed, 15 Apr 2026 11:18:55 -0500 Subject: [PATCH 3/4] fix: address 4 review findings on grip#572 1. Align sync event envelope: sync.completed is the single terminal event with status field (success/partial_failure/blocked/failed), matching Atlas's syncops.py pattern. No separate sync.failed type. 2. Scope pr_group_id to orchestration layer: PlatformAdapter is group-unaware. pr.py assigns group IDs and correlates per-repo PRs. 3. Add recovery events to taxonomy: failure.resolved and lease.reclaimed added to section 3.2 and EventType enum. 4. Clarify force-break notification routing: lease.force_broken, failure.resolved, and lease.reclaimed added to channel bridge mapping. Notification routing is a consumer responsibility, not a core gr2 event concern. --- gr2/docs/HOOK-EVENT-CONTRACT.md | 77 ++++++++++++++++++++++++++++++--- gr2/docs/PR-LIFECYCLE.md | 13 ++++++ 2 files changed, 83 insertions(+), 7 deletions(-) diff --git a/gr2/docs/HOOK-EVENT-CONTRACT.md b/gr2/docs/HOOK-EVENT-CONTRACT.md index 60097cd..c2bb5fe 100644 --- a/gr2/docs/HOOK-EVENT-CONTRACT.md +++ b/gr2/docs/HOOK-EVENT-CONTRACT.md @@ -35,7 +35,8 @@ and monotonic sequence number within the outbox. ### 3.1 Common Envelope -Every event shares a common envelope: +Every event is a single flat JSON object. Envelope fields and domain-specific +fields sit at the same level. There is no nested `payload` wrapper. ```json { @@ -48,11 +49,18 @@ Every event shares a common envelope: "actor": "agent:apollo", "agent_id": "agent_apollo_xyz789", "owner_unit": "apollo", - "payload": { ... } + "lane_name": "feat/hook-events", + "lane_type": "feature", + "repos": ["grip", "synapt"] } ``` -Field definitions: +This flat shape matches Atlas's sync outbox implementation (`syncops.py`), where +`_append_outbox_event` spreads caller-provided fields into the envelope via +`{**envelope, **payload}`. Consumers read domain fields directly from the +top-level object without unwrapping a nested payload. + +**Envelope fields** (added automatically by the emit function): | Field | Type | Required | Description | |-------|------|----------|-------------| @@ -61,11 +69,19 @@ Field definitions: | `seq` | int | yes | Monotonically increasing sequence number within this outbox file. Starts at 1. | | `timestamp` | string | yes | ISO 8601 with timezone. | | `type` | string | yes | Dotted event type from the taxonomy (section 3.2). | + +**Context fields** (provided by the caller, required unless noted): + +| Field | Type | Required | Description | +|-------|------|----------|-------------| | `workspace` | string | yes | Workspace name from WorkspaceSpec. | | `actor` | string | yes | Who triggered the event. Format: `agent:`, `human:`, or `system`. | | `agent_id` | string | no | Persistent agent identity from premium. Opaque in OSS. | | `owner_unit` | string | yes | Unit that owns the context where this event occurred. | -| `payload` | object | yes | Event-type-specific data. | + +**Domain fields** vary by event type. See section 3.2 for the fields each event +type carries. Domain fields are top-level keys alongside envelope and context +fields. Rules: - `event_id` must be unique within a workspace. @@ -74,6 +90,9 @@ Rules: automated operations. - `agent_id` is optional because human-triggered and system-triggered events do not have one. +- Domain field names must not collide with envelope or context field names. + The reserved names are: `version`, `event_id`, `seq`, `timestamp`, `type`, + `workspace`, `actor`, `agent_id`, `owner_unit`. ### 3.2 Event Type Taxonomy @@ -128,6 +147,12 @@ Rules for hook events: PRs in multiple repos, they share the same `pr_group_id`. This is how consumers reconstruct the cross-repo PR as a unit. +**Boundary**: `pr_group_id` is assigned by gr2's orchestration layer (`pr.py`), +not by PlatformAdapter. PlatformAdapter is group-unaware: it creates, queries, +and merges individual per-repo PRs. The `pr.py` module correlates them into a +group and assigns the `pg_` prefixed ID. This keeps platform adapters simple and +reusable across contexts that may not need grouping. + #### Sync Operations | Type | Trigger | Payload | @@ -136,7 +161,34 @@ reconstruct the cross-repo PR as a unit. | `sync.repo_updated` | Single repo pull/rebase completes | `{repo, old_sha, new_sha, strategy, commits_pulled: int}` | | `sync.repo_skipped` | Repo skipped (dirty, no remote, etc.) | `{repo, reason}` | | `sync.conflict` | Merge/rebase conflict during sync | `{repo, conflicting_files: [str]}` | -| `sync.completed` | `gr2 sync` finishes | `{repos_updated: int, repos_skipped: int, repos_failed: int, duration_ms}` | +| `sync.completed` | `gr2 sync` finishes | `{status, repos_updated: int, repos_skipped: int, repos_failed: int, duration_ms}` | + +`sync.completed` is the **single terminal event** for sync operations. There is no +separate `sync.failed` type. The `status` field distinguishes outcomes: + +| `status` value | Meaning | +|----------------|---------| +| `success` | All repos updated without error. | +| `partial_failure` | Some repos updated, some failed. `repos_failed > 0`. | +| `blocked` | Sync could not proceed (e.g., unresolved failure marker). | +| `failed` | All repos failed or sync aborted early. | + +This matches Atlas's `syncops.py` pattern, which uses `sync.completed` with a +status field rather than emitting a separate `sync.failed` event type. + +#### Recovery + +| Type | Trigger | Payload | +|------|---------|---------| +| `failure.resolved` | `gr2 lane resolve ` | `{operation_id, resolved_by, resolution, lane_name}` | +| `lease.reclaimed` | Stale lease garbage-collected during acquire | `{lane_name, lease_id, previous_holder, expired_at, reclaimed_by}` | + +`failure.resolved` is emitted when an agent explicitly clears a failure marker +(section 14.1). `lease.reclaimed` is emitted when a stale lease is +garbage-collected during a new acquire (section 14.2, step 6-7). This is +distinct from `lease.expired` (which fires at the point of staleness detection) +and `lease.force_broken` (which fires when a live lease is broken with +`--force`). #### Workspace Operations @@ -420,6 +472,10 @@ class EventType(str, Enum): SYNC_CONFLICT = "sync.conflict" SYNC_COMPLETED = "sync.completed" + # Recovery + FAILURE_RESOLVED = "failure.resolved" + LEASE_RECLAIMED = "lease.reclaimed" + # Workspace operations WORKSPACE_MATERIALIZED = "workspace.materialized" WORKSPACE_FILE_PROJECTED = "workspace.file_projected" @@ -456,6 +512,9 @@ produces a channel message. | `pr.checks_failed` | `"CI failed on {repo}#{pr_number}: {failed_checks}"` | #dev | | `hook.failed` (block) | `"Hook {hook_name} failed in {repo} (blocking): {stderr_tail}"` | #dev | | `sync.conflict` | `"Sync conflict in {repo}: {conflicting_files}"` | #dev | +| `lease.force_broken` | `"Lease on {lane_name} force-broken by {broken_by}: {reason}"` | #dev | +| `failure.resolved` | `"{resolved_by} resolved failure {operation_id} on {lane_name}"` | #dev | +| `lease.reclaimed` | `"Stale lease on {lane_name} reclaimed (was held by {previous_holder})"` | #dev | Events not listed (hook.started, hook.completed, hook.skipped, lease.acquired, lease.released, sync.repo_updated, workspace.file_projected, etc.) are **not** @@ -467,7 +526,7 @@ via a filter file at `.grip/events/channel_filter.toml`: ```toml [channel_bridge] -include = ["lane.*", "pr.*", "hook.failed", "sync.conflict"] +include = ["lane.*", "pr.*", "hook.failed", "sync.conflict", "lease.force_broken", "failure.resolved", "lease.reclaimed"] exclude = ["hook.started", "hook.completed", "hook.skipped"] ``` @@ -703,7 +762,11 @@ Leases use TTL-first expiry with optional heartbeat renewal. - `gr2 lane lease acquire --force` breaks a live (non-expired) lease. - Emits `lease.force_broken` event with `{broken_by, reason}`. -- Spawn can route this event to the original holder's channel as a notification. +- Notification routing to the original holder is a **channel_bridge consumer + responsibility**, not a core gr2 concern. The `lease.force_broken` event + carries `broken_by` and the original holder's identity in context fields. + The channel bridge (or spawn_watcher) decides how and where to deliver the + notification based on its own routing rules. ### 14.3 Dirty State on Lane Switch diff --git a/gr2/docs/PR-LIFECYCLE.md b/gr2/docs/PR-LIFECYCLE.md index f48efda..ff31f7b 100644 --- a/gr2/docs/PR-LIFECYCLE.md +++ b/gr2/docs/PR-LIFECYCLE.md @@ -310,6 +310,19 @@ class PlatformAdapter(Protocol): def update_pr_body(self, repo: str, pr_number: int, body: str) -> None: ... ``` +**PlatformAdapter is group-unaware.** It operates on individual per-repo PRs and +has no concept of `pr_group_id` or cross-repo correlation. The grouping logic +lives in gr2's `pr.py` orchestration module, which: + +1. Calls PlatformAdapter per-repo to create/query/merge individual PRs. +2. Assigns the `pr_group_id` (format: `pg_` + 8-char hex). +3. Correlates per-repo `PRRef` objects into a PR group. +4. Manages cross-link injection into PR bodies. +5. Emits `pr.*` events with the group ID. + +This separation keeps platform adapters simple and reusable. A platform adapter +can be used by other tools that don't need grouping semantics. + ### 5.2 Adapter Resolution `get_platform_adapter(repo_spec)` resolves the correct adapter based on the From 8d77d9034f480baabe11ae1be44a0cdf5d501934 Mon Sep 17 00:00:00 2001 From: Layne Penney Date: Wed, 15 Apr 2026 11:38:26 -0500 Subject: [PATCH 4/4] fix: address 6 re-review findings on grip#572 C1: Fix event_id length -- examples now use 16-char hex matching os.urandom(8).hex() spec. Was 12-char in all examples. C2: Add lane_name to failure.resolved payload in section 14.1 to match taxonomy table and channel bridge template. C3: Insert lease.reclaimed emission (step 8) in section 14.2 reclaim flow. Was defined as event type but never emitted. M1: Standardize on emit() everywhere. Sections 4.2 and 10.1 said emit_event(); section 7 API said emit(). Now consistent. M2: Fix get_pr_status -> get_pr in PR-LIFECYCLE.md flow to match the PlatformAdapter protocol definition. M3: Add filter-vs-mapping note to section 8. Globs like lane.* match types not in the mapping table; those are silently dropped. --- gr2/docs/HOOK-EVENT-CONTRACT.md | 28 +++++++++++++++++++--------- gr2/docs/PR-LIFECYCLE.md | 2 +- 2 files changed, 20 insertions(+), 10 deletions(-) diff --git a/gr2/docs/HOOK-EVENT-CONTRACT.md b/gr2/docs/HOOK-EVENT-CONTRACT.md index c2bb5fe..0889be0 100644 --- a/gr2/docs/HOOK-EVENT-CONTRACT.md +++ b/gr2/docs/HOOK-EVENT-CONTRACT.md @@ -41,7 +41,7 @@ fields sit at the same level. There is no nested `payload` wrapper. ```json { "version": 1, - "event_id": "a1b2c3d4e5f6", + "event_id": "a1b2c3d4e5f67890", "seq": 42, "timestamp": "2026-04-15T16:30:00+00:00", "type": "lane.entered", @@ -224,8 +224,8 @@ The outbox file is the single source of truth for all gr2 events in a workspace. Events are written synchronously at the point of state change: 1. Operation performs its work (e.g., creates a lane, runs a hook). -2. Operation calls `emit_event(type, payload)`. -3. `emit_event` assigns `event_id`, `seq`, `timestamp`. +2. Operation calls `emit(event_type, workspace_root, actor, owner_unit, payload)`. +3. `emit()` assigns `event_id`, `seq`, `timestamp`. 4. Event is serialized and appended to `outbox.jsonl`. 5. File is flushed (fsync not required; OS page cache is sufficient for local-only delivery). @@ -268,7 +268,7 @@ Cursor format: { "consumer": "channel_bridge", "last_seq": 41, - "last_event_id": "a1b2c3d4e5f6", + "last_event_id": "a1b2c3d4e5f67890", "last_read": "2026-04-15T16:31:00+00:00" } ``` @@ -532,6 +532,14 @@ exclude = ["hook.started", "hook.completed", "hook.skipped"] Default: the mapping table above. Filter file is optional. +**Filter vs. mapping**: The `include` globs may match event types that have no +entry in the mapping table (e.g., `lane.*` matches `lane.switched` and +`lane.archived`, which are not in the table above). Events that match the +include filter but have no mapping template are silently dropped by the bridge. +The filter controls which events the bridge *considers*; the mapping table +controls which events produce channel messages. To add a channel message for a +new event type, add both a mapping entry and ensure the filter covers it. + ## 9. Recall Indexing Recall indexes all events (not just the channel-visible subset) for searchable @@ -557,14 +565,14 @@ event stream as the channel bridge. ### 10.1 Outbox Write Failure -If `emit_event` fails to append (disk full, permission error): +If `emit()` fails to append (disk full, permission error): - The event is lost. The operation that triggered it still completed. - The outbox may be in an inconsistent state (partial line written). - Recovery: consumers skip malformed lines. `gr2 gc` can truncate trailing partial lines. -Mitigation: `emit_event` should catch write errors and log them to stderr +Mitigation: `emit()` should catch write errors and log them to stderr without crashing the parent operation. Events are important but not operation-critical. @@ -696,7 +704,7 @@ Marker format: "owner_unit": "apollo", "lane_name": "feat/hook-events", "failed_at": "2026-04-15T17:00:00+00:00", - "event_id": "abc123def456", + "event_id": "9f3a7b2c1d4e8f06", "partial_state": { "repos_completed": ["grip"], "repos_pending": ["synapt-private"], @@ -715,7 +723,7 @@ Marker behavior: agent must decide whether to retry, skip, or escalate. Resolution is always explicit. - **Event**: Resolving a marker emits a new event type: - `failure.resolved` with payload `{operation_id, resolved_by, resolution}`. + `failure.resolved` with payload `{operation_id, resolved_by, resolution, lane_name}`. Why no automatic retry: retrying a failed hook might produce the same failure. The agent (or spawn) has context about whether retry is appropriate. gr2 does @@ -756,7 +764,9 @@ Leases use TTL-first expiry with optional heartbeat renewal. 5. `acquire` finds A's lease, evaluates `is_stale_lease()` -> true. 6. Emits `lease.expired` event (payload: `{lane_name, lease_id, expired_at}`). 7. Garbage-collects A's stale lease from the lane doc. -8. Grants B's new lease. Emits `lease.acquired` event. +8. Emits `lease.reclaimed` event (payload: + `{lane_name, lease_id, previous_holder, expired_at, reclaimed_by}`). +9. Grants B's new lease. Emits `lease.acquired` event. **Force break**: diff --git a/gr2/docs/PR-LIFECYCLE.md b/gr2/docs/PR-LIFECYCLE.md index ff31f7b..0166585 100644 --- a/gr2/docs/PR-LIFECYCLE.md +++ b/gr2/docs/PR-LIFECYCLE.md @@ -150,7 +150,7 @@ Flow: 1. Find the PR group for the lane (scan `.grip/pr_groups/` for matching `lane_name` and `owner_unit`). -2. For each child PR, call `PlatformAdapter.get_pr_status(repo, pr_number)`. +2. For each child PR, call `PlatformAdapter.get_pr(repo, pr_number)`. 3. For each child PR, call `PlatformAdapter.get_checks(repo, pr_number)`. 4. For each child PR, call `PlatformAdapter.get_reviews(repo, pr_number)`. 5. Aggregate into group state.