Merged
17 commits
746e450
test(triggers): expand trigger contract conformance coverage (#1287)
aaight May 9, 2026
0c10ed6
docs(trigger): update trigger architecture contracts
May 9, 2026
5e0a2c9
docs(triggers): qualify deferred re-check as GitHub-only and fix rout…
May 9, 2026
03035f8
docs(trigger): qualify deferred re-check as GitHub-only in architectu…
May 9, 2026
cfbceef
docs(triggers): qualify buildDeferredRecheckResult as GitHub-only in …
May 9, 2026
c03c2c3
docs(resilience): qualify dispatch-failure compensation boundary accu…
May 9, 2026
2949b24
docs(triggers): fix resolveTriggerResult usage and resilience doc bou…
May 9, 2026
b8ca17c
refactor(gadgets): split CLI command factory helpers
May 9, 2026
cd7e8e5
feat(alerting): handle Sentry resource:issue webhooks (Internal Integ…
zbigniewsobiecki May 9, 2026
26acea9
Merge pull request #1289 from mongrel-intelligence/docs/trigger-archi…
zbigniewsobiecki May 9, 2026
7345285
Merge pull request #1290 from mongrel-intelligence/refactor/cli-comma…
zbigniewsobiecki May 9, 2026
63ac36f
fix(alerting): declare alerting:issue-lifecycle in alerting agent def…
May 9, 2026
ce051a4
Merge pull request #1291 from mongrel-intelligence/feat/sentry-issue-…
zbigniewsobiecki May 9, 2026
7de0b82
fix(implementation): harden agent contract against tool-output bloat …
zbigniewsobiecki May 9, 2026
63d821f
fix(backends): distinguish empty/malformed/missing sidecar in WARN log
May 9, 2026
56b5639
fix(gadgets): truncate hook output in error messages on commit/push f…
May 9, 2026
ff5048f
Merge pull request #1292 from mongrel-intelligence/fix/agent-contract…
zbigniewsobiecki May 9, 2026
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ All notable user-visible changes to CASCADE are documented here. The format is l

## Unreleased

### Documentation

- **Trigger architecture docs now describe the migrated trigger contracts.** Added guidance for canonical `TRIGGER_EVENTS`, shared PM/GitHub result builders, first-match dispatch, structured skip vs bare `null`, no-agent results, deferred bare-job re-checks, router outcome decision reasons, PM coalescing, capacity scope, dispatch failure compensation, and wedged-lock diagnostics. Migration note for future trigger contributors: new handlers should import event constants, use the shared builders, return structured skips for claimed-but-non-dispatched events, and reserve bare `null` for "continue to later handlers." See Trello card [qUbPtALY](https://trello.com/c/69fe2a950699baaf91688a5b).

### Fixed

- **`resolve-conflicts` agent no longer silently skips when GitHub's async mergeability computation hasn't resolved by the time the `pull_request` webhook is processed** (spec 020). `PRConflictDetectedTrigger` previously exhausted a 2×2s synchronous retry budget and silently discarded the event when `mergeable === null` — because GitHub never sends a follow-up webhook once mergeability resolves, the `resolve-conflicts` agent never fired. The trigger now returns `TriggerResult.deferredRecheck`, which causes the router to schedule a bare BullMQ delayed re-check job ~45s later via `scheduleCoalescedJob` (deduped per PR). The worker re-dispatches via the trigger registry to get fresh mergeability state. Multiple rapid webhooks for the same PR coalesce to a single re-check job. If mergeability is still `null` after the re-check fires, a Sentry event is captured under tag `mergeability_recheck_exhausted` and a WARN log is emitted — not a silent discard. Observed live on ucho/PR #329 (2026-05-07). See [spec 020](docs/specs/020-github-mergeability-deferred-recheck.md).
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -123,7 +123,7 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{"

**Post-completion review dispatch** — when an implementation agent succeeds with a PR, the execution pipeline checks CI status and fires the review agent deterministically (before the container exits). This guarantees review dispatch within seconds of implementation completion, regardless of GitHub webhook timing. Uses the same `claimReviewDispatch` dedup key as the `check-suite-success` trigger, so the two paths cannot double-enqueue.

**Deferred re-check** — any trigger handler can return `TriggerResult.deferredRecheck: { delayMs, coalesceKey }` (with `agentType: null`) to schedule a bare delayed job that re-dispatches via the trigger registry when it fires. The router uses `scheduleCoalescedJob` for dedup. The worker detects re-check jobs via `GitHubJob.mergeabilityRecheckAttempt` and Sentry-captures under tag `mergeability_recheck_exhausted` when the trigger still cannot resolve state. Workers do not re-queue — a second `deferredRecheck` return in the worker handler exits gracefully.
**Deferred re-check** — a trigger handler can return `TriggerResult.deferredRecheck: { delayMs, coalesceKey }` (with `agentType: null`) to schedule a bare delayed job via `scheduleCoalescedJob`. The router scheduling is adapter-agnostic, but **bare re-dispatch is currently GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` from the job so the GitHub worker re-dispatches through the trigger registry for fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job; their workers pass it to `resolveTriggerResult()`, which returns the pre-resolved `agentType: null` result without re-dispatching — a non-GitHub handler using this field would schedule a job that reuses the same result rather than re-evaluating provider state. The worker detects GitHub re-check jobs via `GitHubJob.mergeabilityRecheckAttempt` and Sentry-captures under tag `mergeability_recheck_exhausted` when state still cannot resolve. Workers do not re-queue — a second `deferredRecheck` return exits gracefully.

**Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled=<true|false> · reason="<State.Error>"`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins.
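
The grep-stable crash string described above could be produced along these lines. This is a minimal sketch, not the code in `src/router/active-workers.ts`: the `ExitStateSketch` shape and both function names are hypothetical, and only the output format is taken from the paragraph above.

```typescript
// Hypothetical input shape; the real formatCrashReason may take different arguments.
interface ExitStateSketch {
  exitCode: number;
  oomKilled: boolean; // Docker's State.OOMKilled
  error: string; // Docker's State.Error, often empty
}

// Renders the structured string stamped on the run record's `error` field.
function formatCrashReasonSketch(state: ExitStateSketch): string {
  return `Worker crashed with exit code ${state.exitCode} · OOMKilled=${state.oomKilled} · reason="${state.error}"`;
}

// Exit code 137 alone does not prove a memory kill; the OOMKilled=true marker does.
function isCgroupOomSketch(reason: string): boolean {
  return reason.includes("OOMKilled=true");
}

const killedFromInside = formatCrashReasonSketch({ exitCode: 137, oomKilled: false, error: "" });
const killedByCgroup = formatCrashReasonSketch({ exitCode: 137, oomKilled: true, error: "" });
```

Both strings carry exit code 137, but only `killedByCgroup` should be read as a memory problem during a post-mortem.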

34 changes: 32 additions & 2 deletions docs/architecture/02-webhook-pipeline.md
@@ -102,7 +102,8 @@ flowchart TD
F -->|Not found| SKIP4[Skip: no project config]
F -->|Found| G[7. Dispatch triggers with credentials]
G -->|No match| SKIP5[Skip: no trigger matched]
G -->|Matched| H{8. Work-item / agent-type lock}
G -->|Structured skip / no-agent / deferred| OUTCOME[Handle non-dispatch outcome]
G -->|Agent dispatch| H{8. Work-item / agent-type lock}
H -->|Locked| SKIP6[Skip: concurrency limit]
H -->|Free| I[9. Post ack comment]
I --> J[10. Build job]
@@ -118,13 +119,36 @@ flowchart TD
4. **Self-check** — Adapter's `isSelfAuthored()` detects bot's own actions (loop prevention)
5. **Reaction** — Fire-and-forget emoji reaction on the source event
6. **Resolve config** — Look up project by platform identifier (board ID, repo, etc.)
7. **Dispatch triggers** — Within credential scope, call `TriggerRegistry.dispatch()` to find a matching agent. PM router adapters also wrap dispatch in `withPMScopeForDispatch(fullProject, dispatch)` so shared PM gates can resolve the active provider.
7. **Dispatch triggers** — Within credential scope, call `TriggerRegistry.dispatch()` to find a matching result. PM router adapters also wrap dispatch in `withPMScopeForDispatch(fullProject, dispatch)` so shared PM gates can resolve the active provider.
8. **Concurrency** — Check work-item lock (`work-item-lock.ts`) and agent-type concurrency (`agent-type-lock.ts`)
9. **Ack comment** — Post an acknowledgment comment to the work item or PR
10. **Build job** — Package trigger result + payload + ack info into a `CascadeJob`
11. **Pre-actions** — Optional fire-and-forget actions (e.g., GitHub eyes reaction)
12. **Enqueue** — Add job to BullMQ Redis queue; mark work item and agent type as enqueued

### Router outcomes

`src/router/webhook-trigger-outcomes.ts` normalizes trigger results into stable router decisions:

| Trigger result | Router behavior | Decision reason shape |
|----------------|-----------------|-----------------------|
| `null` from registry | No handler claimed the event | `No trigger matched for event` |
| `agentType: null` + `skipReason` | Handler claimed the event but intentionally self-skipped | `Trigger <handler> skipped: <message>` |
| `agentType: null` + `deferredRecheck` | Schedule a coalesced delayed bare job and exit | `Deferred re-check scheduled: <coalesceKey>` |
| `agentType: null` without skip/defer | Side-effect-only trigger completed | `Trigger completed without agent (PM operation)` |
| `agentType` + `coalesceKey` and coalescing enabled | Schedule a delayed coalesced dispatch | `Coalesced dispatch scheduled: <agent> agent for work item <id>` |
| `agentType` without coalescing | Post ack, build job, enqueue now | `Job queued: <agent> agent for work item <id>` |
| Immediate-dispatch or PM coalesced-dispatch Redis failure | Call `onBlocked` and leave a failure reason | `Failed to enqueue job to Redis` or `Failed to schedule coalesced job to Redis` |
| Deferred re-check Redis failure | Capture Sentry under `deferred_recheck_schedule_failure`; skip `onBlocked`; treat as if scheduled | `Deferred re-check scheduled: <coalesceKey>` |

Structured skip is intentionally different from bare `null`: it preserves the handler's reason in webhook logs instead of collapsing expected non-dispatch decisions into "no trigger matched."
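
As a rough illustration of the table above, the mapping from trigger result to decision reason might look like the following. This is a sketch under assumptions, not the contents of `src/router/webhook-trigger-outcomes.ts`: `SketchTriggerResult`, `decisionReasonSketch`, and the order in which `skipReason` and `deferredRecheck` are checked are all hypothetical, and the Redis-failure rows are left out.

```typescript
// Hypothetical, trimmed-down result shape; the real TriggerResult has more fields.
interface SketchTriggerResult {
  agentType: string | null;
  workItemId?: string;
  skipReason?: { handler: string; message: string };
  coalesceKey?: string;
  deferredRecheck?: { delayMs: number; coalesceKey: string };
}

function decisionReasonSketch(
  result: SketchTriggerResult | null,
  coalesceWindowMs: number,
): string {
  // Bare null from the registry: no handler claimed the event.
  if (result === null) return "No trigger matched for event";

  if (result.agentType === null) {
    // Structured skip keeps the handler's own reason in the log line.
    if (result.skipReason) {
      return `Trigger ${result.skipReason.handler} skipped: ${result.skipReason.message}`;
    }
    if (result.deferredRecheck) {
      return `Deferred re-check scheduled: ${result.deferredRecheck.coalesceKey}`;
    }
    return "Trigger completed without agent (PM operation)";
  }

  const target = `${result.agentType} agent for work item ${result.workItemId ?? "(none)"}`;
  // Coalescing applies only when the result carries a key and the window is positive.
  if (result.coalesceKey && coalesceWindowMs > 0) {
    return `Coalesced dispatch scheduled: ${target}`;
  }
  return `Job queued: ${target}`;
}
```

The first two branches are where the difference between bare `null` and structured skip becomes visible in webhook logs.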

### Coalescing and deferred re-check

PM status-change dispatches can include a `coalesceKey`, normally `${projectId}:${workItemId}`. When `PM_COALESCE_WINDOW_MS` is positive, the router schedules a delayed job via `scheduleCoalescedJob`; a newer dispatch with the same key supersedes the pending one and releases the superseded job's in-memory locks. PM ack comments are deferred to job fire time for coalesced jobs so superseded work does not leave orphan comments.
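
The supersede behavior described above can be modeled in memory as follows. This is an illustrative sketch only: the real `scheduleCoalescedJob` schedules BullMQ delayed jobs, whereas `CoalescingSchedulerSketch` and its methods are hypothetical and use an explicit `now` argument instead of timers.

```typescript
interface PendingJobSketch<T> {
  key: string;
  payload: T;
  fireAt: number;
}

class CoalescingSchedulerSketch<T> {
  private pending = new Map<string, PendingJobSketch<T>>();

  // Schedules a job for `key`. Returns the superseded pending job, if any,
  // so the caller can release that job's in-memory locks.
  schedule(key: string, payload: T, now: number, windowMs: number): PendingJobSketch<T> | undefined {
    const superseded = this.pending.get(key);
    this.pending.set(key, { key, payload, fireAt: now + windowMs });
    return superseded;
  }

  // Removes and returns every job whose delay has elapsed.
  takeDue(now: number): PendingJobSketch<T>[] {
    const due: PendingJobSketch<T>[] = [];
    this.pending.forEach((job, key) => {
      if (job.fireAt <= now) {
        due.push(job);
        this.pending.delete(key);
      }
    });
    return due;
  }
}
```

In this model, two dispatches with the same `${projectId}:${workItemId}` key inside the window produce a single fired job carrying the newer payload, which is also why deferring the ack comment to fire time avoids orphan comments.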

Deferred re-check also uses `scheduleCoalescedJob` and exits without dispatch locks or an ack comment. The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and sets `mergeabilityRecheckAttempt: 1`, so the GitHub worker re-dispatches through the trigger registry to evaluate fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`, so their workers return the pre-resolved `agentType: null` result directly without re-dispatching. If a deferred re-check schedule call fails, the router captures Sentry under `deferred_recheck_schedule_failure` and still returns `Deferred re-check scheduled` — it does not call `onBlocked`. GitHub mergeability uses this when `mergeable` is still `null` after the synchronous retry budget; if the re-check still cannot resolve state, the GitHub worker records `mergeability_recheck_exhausted` and stops rather than re-queueing indefinitely.

### Concurrency controls

| Mechanism | File | Purpose |
@@ -136,6 +160,12 @@ flowchart TD

All locks are in-memory with TTL expiry. Work-item locks are scoped by `(projectId, workItemId, agentType)`: duplicate runs of the same agent are blocked, but different agent types can run concurrently on the same work item. When a lock rejects a webhook, logs distinguish `Awaiting worker slot` from `Work item locked (no active dispatch)`; the latter is a wedged-lock canary and captures to Sentry.

The work-item lock decision vocabulary is stable by design:

- `Job queued: ...` means the router successfully registered a dispatch and enqueued or scheduled work.
- `Awaiting worker slot: ...` means the same work item and agent type already have an active queued/waiting/running dispatch.
- `Work item locked (no active dispatch): ...` means the lock-state classifier could not correlate the lock with queued or running work. This is a wedged-lock canary, not normal backpressure.
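
The lock scope and the rejection vocabulary above can be sketched together. Everything here is a hypothetical model rather than the code in `work-item-lock.ts`: the class, the `DispatchStateSketch` type, and the explicit `now`/`ttlMs` arguments are assumptions made to keep the example deterministic.

```typescript
// "none" means the lock could not be correlated with queued or running work.
type DispatchStateSketch = "queued" | "waiting" | "running" | "none";

class WorkItemLockSketch {
  private locks = new Map<string, number>(); // key -> expiry timestamp

  // Locks are scoped by (projectId, workItemId, agentType).
  tryAcquire(projectId: string, workItemId: string, agentType: string, now: number, ttlMs: number): boolean {
    const key = `${projectId}:${workItemId}:${agentType}`;
    const expiresAt = this.locks.get(key);
    if (expiresAt !== undefined && expiresAt > now) return false; // still held
    this.locks.set(key, now + ttlMs);
    return true;
  }
}

// Chooses the log prefix for a rejected webhook and whether it is a wedged-lock canary.
function describeRejectionSketch(state: DispatchStateSketch): { prefix: string; captureToSentry: boolean } {
  if (state === "none") {
    return { prefix: "Work item locked (no active dispatch)", captureToSentry: true };
  }
  return { prefix: "Awaiting worker slot", captureToSentry: false };
}
```

A second `implementation` dispatch for the same work item is rejected while the lock is live, but a `review` dispatch for that work item is not, because the agent type is part of the key.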

## Signature Verification

`src/router/webhookVerification.ts`
40 changes: 36 additions & 4 deletions docs/architecture/03-trigger-system.md
@@ -21,6 +21,8 @@ class TriggerRegistry {
2. Call `handle(ctx)` — if it returns a `TriggerResult`, return it
3. If `handle()` returns `null`, continue to next handler

That makes dispatch first-match-wins for non-null results. A handler should return bare `null` only when it does not claim the event and later handlers should still get a chance. If a handler claims the event but decides not to run an agent, it returns a structured `TriggerResult` with `agentType: null` instead.
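
A minimal sketch of first-match-wins dispatch, assuming a simplified synchronous handler interface. The `SketchHandler` shape, the `matches` method, and the handler names are hypothetical; the real `TriggerRegistry` and `TriggerHandler` may be asynchronous and carry more context.

```typescript
interface DispatchCtxSketch {
  event: string;
}

interface DispatchResultSketch {
  agentType: string | null;
  skipReason?: { handler: string; message: string };
}

interface SketchHandler {
  name: string;
  matches(ctx: DispatchCtxSketch): boolean;
  handle(ctx: DispatchCtxSketch): DispatchResultSketch | null;
}

function dispatchFirstMatch(handlers: SketchHandler[], ctx: DispatchCtxSketch): DispatchResultSketch | null {
  for (const handler of handlers) {
    if (!handler.matches(ctx)) continue;
    const result = handler.handle(ctx);
    // Any non-null result ends dispatch, including agentType: null results.
    if (result !== null) return result;
  }
  return null; // no handler claimed the event
}

const handled: string[] = [];
const sketchHandlers: SketchHandler[] = [
  // Declines the event with bare null, so later handlers still run.
  { name: "declines", matches: () => true, handle: () => { handled.push("declines"); return null; } },
  // Claims the event but runs no agent: a structured skip.
  { name: "claims", matches: () => true, handle: () => { handled.push("claims"); return { agentType: null, skipReason: { handler: "claims", message: "trigger disabled" } }; } },
  // Never reached, because the previous handler returned a non-null result.
  { name: "late", matches: () => true, handle: () => { handled.push("late"); return { agentType: "review" }; } },
];
const dispatchOutcome = dispatchFirstMatch(sketchHandlers, { event: "scm:pr-opened" });
```

Here `dispatchOutcome` is the structured skip from the second handler, and the third handler is never called.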

## TriggerHandler

`src/triggers/types.ts`
Expand Down Expand Up @@ -58,6 +60,12 @@ interface TriggerResult {
prUrl?: string;
prTitle?: string;
onBlocked?: () => void; // Cleanup if job can't be enqueued
skipReason?: {
handler: string;
message: string;
};
lockKey?: string; // Optional work-item lock override
coalesceKey?: string; // Optional PM dispatch coalescing key
deferredRecheck?: {
delayMs: number;
coalesceKey: string;
@@ -135,15 +143,39 @@ function registerBuiltInTriggers(registry: TriggerRegistry): void {

### Event format

Triggers use category-prefixed events: `{category}:{event-name}`
- `pm:status-changed`, `pm:label-added`
- `scm:check-suite-success`, `scm:pr-review-submitted`, `scm:review-requested`
- `alerting:issue-alert`, `alerting:metric-alert`
Triggers use category-prefixed events from `src/triggers/shared/events.ts`. `TRIGGER_EVENTS` is the canonical catalog used by handlers, result builders, trigger configuration, and static tests:

- PM: `pm:status-changed`, `pm:label-added`, `pm:comment-mention`
- SCM: `scm:check-suite-success`, `scm:check-suite-failure`, `scm:pr-review-submitted`, `scm:review-requested`, `scm:pr-opened`, `scm:pr-comment-mention`, `scm:pr-merged`, `scm:pr-ready-to-merge`, `scm:pr-conflict-detected`
- Alerting: `alerting:issue-alert`, `alerting:metric-alert`
- Internal: `internal:auto-chain`

New handlers should import `TRIGGER_EVENTS` instead of adding raw string literals. The static guard in `tests/unit/triggers/trigger-event-consistency.test.ts` fails when a handler gates on one event string and emits a different `agentInput.triggerEvent`.
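
One way such a catalog could be shaped is an `as const` object with a derived union type. This is a sketch under assumptions: the event strings are the ones listed above, but the constant name, the key names, and the `categoryOfSketch` helper are hypothetical and may not match `src/triggers/shared/events.ts`.

```typescript
// Key names are illustrative; only the string values come from the list above.
const TRIGGER_EVENTS_SKETCH = {
  PM_STATUS_CHANGED: "pm:status-changed",
  PM_LABEL_ADDED: "pm:label-added",
  PM_COMMENT_MENTION: "pm:comment-mention",
  SCM_CHECK_SUITE_SUCCESS: "scm:check-suite-success",
  SCM_CHECK_SUITE_FAILURE: "scm:check-suite-failure",
  SCM_PR_REVIEW_SUBMITTED: "scm:pr-review-submitted",
  SCM_REVIEW_REQUESTED: "scm:review-requested",
  SCM_PR_OPENED: "scm:pr-opened",
  SCM_PR_COMMENT_MENTION: "scm:pr-comment-mention",
  SCM_PR_MERGED: "scm:pr-merged",
  SCM_PR_READY_TO_MERGE: "scm:pr-ready-to-merge",
  SCM_PR_CONFLICT_DETECTED: "scm:pr-conflict-detected",
  ALERTING_ISSUE_ALERT: "alerting:issue-alert",
  ALERTING_METRIC_ALERT: "alerting:metric-alert",
  INTERNAL_AUTO_CHAIN: "internal:auto-chain",
} as const;

// Union of every catalogued event string; a raw typo fails type checking.
type TriggerEventSketch = (typeof TRIGGER_EVENTS_SKETCH)[keyof typeof TRIGGER_EVENTS_SKETCH];

// Extracts the {category} part of a {category}:{event-name} string.
function categoryOfSketch(event: TriggerEventSketch): string {
  return event.slice(0, event.indexOf(":"));
}

const mergedCategory = categoryOfSketch(TRIGGER_EVENTS_SKETCH.SCM_PR_MERGED);
```

With a union type like this, a handler that gates on one constant and emits the same constant as `agentInput.triggerEvent` cannot drift apart through a mistyped literal.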

### Result builders

Shared builders live in `src/triggers/shared/result-builders.ts`, `src/triggers/shared/pm-status.ts`, `src/triggers/shared/pm-label.ts`, and `src/triggers/github/result-builders.ts`.

Use them for new handlers unless there is a concrete reason not to:

- `buildPMDispatchResult`, `buildPMStatusDispatchResult`, and `buildPMLabelDispatchResult` attach canonical PM trigger events, work-item fields, and PM coalescing keys.
- `buildGitHubPRDispatchResult` and the GitHub-specific wrappers attach PR metadata, optional linked PM work-item metadata, and normalized agent input for PR agents.
- `buildNoAgentResult` represents a matched trigger whose side effect is complete without spawning an agent, such as PM status updates after a PR merge.
- `buildSkipResult` or `skip()` represents a matched trigger that intentionally stops dispatch with a human-readable reason.
- `buildDeferredRecheckResult` represents a delayed bare-job re-dispatch.

### Structured skip vs bare `null`

Bare `null` means "this handler did not handle the event; continue registry dispatch." Structured skip means "this handler did handle the event; stop dispatch and record why no agent was queued."

The router preserves structured skips in webhook logs with `Trigger <handler> skipped: <message>`. Use structured skip for disabled trigger config, author-mode gates, self-loop gates, incomplete aggregate check-suite state, missing PR/work-item prerequisites, and similar expected non-dispatch outcomes.
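
Inside a single handler, the three possible returns might look like this. The sketch assumes a `skip(handler, message)` helper with a hypothetical signature (`skipSketch` below); the real `skip()` and `buildSkipResult` in the shared result builders may take different arguments, and `handlePrOpenedSketch` is an invented handler.

```typescript
interface HandlerResultSketch {
  agentType: string | null;
  skipReason?: { handler: string; message: string };
}

// Hypothetical stand-in for the shared skip helper.
function skipSketch(handler: string, message: string): HandlerResultSketch {
  return { agentType: null, skipReason: { handler, message } };
}

function handlePrOpenedSketch(event: string, triggerEnabled: boolean): HandlerResultSketch | null {
  // Not this handler's event: return bare null so later handlers get a chance.
  if (event !== "scm:pr-opened") return null;

  // This handler's event, but an expected non-dispatch outcome: structured skip.
  if (!triggerEnabled) {
    return skipSketch("PROpenedSketch", "trigger disabled in project config");
  }

  // This handler's event and all gates passed: dispatch an agent.
  return { agentType: "review" };
}
```

Under the log format above, the disabled case would surface as `Trigger PROpenedSketch skipped: trigger disabled in project config` instead of `No trigger matched for event`.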

### Deferred re-checks

Handlers that cannot make a final decision yet can return `deferredRecheck: { delayMs, coalesceKey }` with `agentType: null`. The router schedules a coalesced delayed BullMQ job and exits without spawning an agent. GitHub mergeability checks use this path; the worker recognizes re-check jobs via `mergeabilityRecheckAttempt` and captures a Sentry diagnostic if the second pass still cannot resolve state.

The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and sets `mergeabilityRecheckAttempt: 1`, so the GitHub worker re-dispatches through the trigger registry to evaluate fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`; `resolveTriggerResult()` returns the pre-resolved result directly, skipping registry dispatch. A non-GitHub handler returning `buildDeferredRecheckResult` would therefore schedule a job that reuses the same `agentType: null` result rather than re-evaluating provider state. See `src/triggers/README.md` for the full authoring contract. Workers do not schedule another re-check after exhaustion.
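
The difference between the GitHub and non-GitHub paths can be modeled as below. This is a simplified sketch, not the real `buildJob()` or `resolveTriggerResult()`: the job shape, the function names, and the choice to strip `triggerResult` only when `deferredRecheck` is set are assumptions made for illustration.

```typescript
interface RecheckResultSketch {
  agentType: string | null;
  deferredRecheck?: { delayMs: number; coalesceKey: string };
}

interface RecheckJobSketch {
  provider: "github" | "trello" | "jira" | "linear" | "sentry";
  triggerResult?: RecheckResultSketch;
  mergeabilityRecheckAttempt?: number;
}

// Router side: only the GitHub adapter produces a bare job (assumed condition).
function buildJobSketch(provider: RecheckJobSketch["provider"], result: RecheckResultSketch): RecheckJobSketch {
  if (provider === "github" && result.deferredRecheck) {
    return { provider, mergeabilityRecheckAttempt: 1 }; // triggerResult stripped
  }
  return { provider, triggerResult: result }; // result embedded as-is
}

// Worker side: an embedded result short-circuits registry dispatch.
function resolveOnWorkerSketch(
  job: RecheckJobSketch,
  redispatch: () => RecheckResultSketch | null,
): RecheckResultSketch | null {
  if (job.triggerResult) return job.triggerResult;
  return redispatch(); // bare job: re-evaluate fresh provider state
}

const deferred: RecheckResultSketch = {
  agentType: null,
  deferredRecheck: { delayMs: 45000, coalesceKey: "repo:329" },
};
let redispatchCalls = 0;
const fresh = (): RecheckResultSketch => {
  redispatchCalls += 1;
  return { agentType: "resolve-conflicts" };
};
const githubOutcome = resolveOnWorkerSketch(buildJobSketch("github", deferred), fresh);
const trelloOutcome = resolveOnWorkerSketch(buildJobSketch("trello", deferred), fresh);
```

In this model the GitHub job reaches the registry again and can dispatch an agent, while the Trello job replays the stale `agentType: null` result, which is the pitfall the paragraph above warns about.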

### Config resolution

`src/triggers/config-resolver.ts`
14 changes: 14 additions & 0 deletions docs/architecture/07-gadgets.md
@@ -108,6 +108,20 @@ Native-tool engines cannot invoke gadget classes directly (they run as subproces

The `cascade-tools` binary uses a separate oclif config (`bin/cascade-tools.js`) that discovers all non-dashboard commands, while `cascade` discovers only dashboard commands.

`createCLICommand()` is the stable facade used by command files under `src/cli/**`. Shared CLI behavior lives in focused helper modules under `src/gadgets/shared/cli/`:

| Helper | Role |
|--------|------|
| `commandNames.ts` | Command namespace/name derivation shared by the CLI factory and manifest generator |
| `examples.ts` | Tool example lookup, shell quoting, oclif example rendering, and JSON expected-shape hints |
| `flags.ts` | oclif flag construction and flag metadata collection |
| `booleanArgv.ts` | Boolean value-form normalization before oclif parsing |
| `parseErrors.ts` | oclif parse-error classification and unknown-flag suggestions |
| `params.ts` | File/stdin input, JSON parsing, direct parameter resolution, and git remote owner/repo resolution |
| `errorSink.ts` | Error-envelope routing through the active command instance |

New domain commands should not add branches in these helpers. They declare behavior through their `ToolDefinition` metadata (`cliAliases`, examples, file input alternatives, auto-resolution), and the shared generators consume it.

## Session State

`src/gadgets/sessionState.ts`