Merged
15 commits
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -123,7 +123,7 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{"

**Post-completion review dispatch** — when an implementation agent succeeds with a PR, the execution pipeline checks CI status and fires the review agent deterministically (before the container exits). This guarantees review dispatch within seconds of implementation completion, regardless of GitHub webhook timing. Uses the same `claimReviewDispatch` dedup key as the `check-suite-success` trigger, so the two paths cannot double-enqueue.

**Deferred re-check** — a trigger handler can return `TriggerResult.deferredRecheck: { delayMs, coalesceKey }` (with `agentType: null`) to schedule a bare delayed job via `scheduleCoalescedJob`. The router scheduling is adapter-agnostic, but **bare re-dispatch is currently GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` from the job so the GitHub worker re-dispatches through the trigger registry for fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job; their workers pass it to `resolveTriggerResult()`, which returns the pre-resolved `agentType: null` result without re-dispatching — a non-GitHub handler using this field would schedule a job that reuses the same result rather than re-evaluating provider state. The worker detects GitHub re-check jobs via `GitHubJob.mergeabilityRecheckAttempt` and Sentry-captures under tag `mergeability_recheck_exhausted` when state still cannot resolve. Workers do not re-queue — a second `deferredRecheck` return exits gracefully.
**Deferred re-check** — a trigger handler can return `TriggerResult.deferredRecheck: { delayMs, coalesceKey, recheckKind? }` (with `agentType: null`) to schedule a bare delayed job via `scheduleCoalescedJob`. The router scheduling is adapter-agnostic, but **bare re-dispatch is currently GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` from the job so the GitHub worker re-dispatches through the trigger registry for fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job; their workers pass it to `resolveTriggerResult()`, which returns the pre-resolved `agentType: null` result without re-dispatching — a non-GitHub handler using this field would schedule a job that reuses the same result rather than re-evaluating provider state. There are two recheck kinds, controlled by the optional `recheckKind` field on `deferredRecheck`: **mergeability re-check** (no `recheckKind`, sets `mergeabilityRecheckAttempt: 1` on the job) — one-shot; if the re-check still cannot resolve state, the worker Sentry-captures under `mergeability_recheck_exhausted` and stops without re-queueing. **Check-suite re-check** (`recheckKind: 'check-suite'`, sets `checkSuiteRecheckAttempt: 1` on the job) — safe rescheduling; if the Actions API is still stale when the job fires, the worker reschedules another coalesced delayed job instead of exhausting, so review/respond-to-ci dispatch stays alive until the API catches up. Used by `check-suite-success` and `check-suite-failure` handlers for the Actions-API-lag case (ucho PR #394/MNG-683, 2026-05-11).

**Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled=<true|false> · reason="<State.Error>"`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins.
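
As a rough illustration of the crash-string format described above, the sketch below assembles the same three segments from a container state object. The function names `formatCrashReasonSketch` and `wasCgroupOom` and the `ContainerStateLike` interface are hypothetical; the real formatter is `src/router/active-workers.ts:formatCrashReason`, and its signature may differ.

```typescript
// Hypothetical shape: only the Docker inspect State fields this sketch reads.
interface ContainerStateLike {
  ExitCode: number;
  OOMKilled: boolean;
  Error: string;
}

// Sketch only, not the project's implementation. Joins the three segments
// with the " · " separator shown in the documented format.
function formatCrashReasonSketch(state: ContainerStateLike): string {
  return [
    `Worker crashed with exit code ${state.ExitCode}`,
    `OOMKilled=${state.OOMKilled}`,
    `reason="${state.Error}"`,
  ].join(' · ');
}

// Hypothetical helper: because the string is grep-stable, a post-mortem can
// tell a cgroup OOM apart from any other 137 exit with a substring check.
function wasCgroupOom(errorField: string): boolean {
  return errorField.includes('OOMKilled=true');
}
```

A 137 exit whose stamped string carries `OOMKilled=false` would make `wasCgroupOom` return `false`, matching the rule that such a kill was not caused by the memory cgroup.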

2 changes: 1 addition & 1 deletion docs/architecture/01-services.md
@@ -100,7 +100,7 @@ The router passes job data to workers via Docker container env vars:
|----------|---------|
| `JOB_ID` | Unique job identifier |
| `JOB_TYPE` | `trello`, `github`, `jira`, `linear`, `sentry`, `manual-run`, `retry-run`, `debug-analysis` |
| `JOB_DATA` | JSON-encoded job payload; GitHub jobs include `mergeabilityRecheckAttempt` in this payload for deferred re-checks |
| `JOB_DATA` | JSON-encoded job payload; GitHub jobs include `mergeabilityRecheckAttempt` (mergeability re-check) or `checkSuiteRecheckAttempt` (check-suite Actions-API-lag re-check) in this payload for deferred re-checks |
| `CASCADE_CREDENTIAL_KEYS` | Comma-separated list of credential env var names |
| Individual credential vars | Pre-loaded project credentials (e.g., `GITHUB_TOKEN_IMPLEMENTER`) |
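
A worker-side reader of these variables might look roughly like the sketch below. It assumes that `JOB_DATA` is a JSON object and that the two re-check fields sit at its top level; `classifyRecheckSketch` and `GitHubJobDataLike` are hypothetical names, not part of the codebase.

```typescript
type RecheckClass = 'mergeability' | 'check-suite' | 'none';

// Hypothetical subset of the GitHub job payload: only the two re-check fields.
interface GitHubJobDataLike {
  mergeabilityRecheckAttempt?: number;
  checkSuiteRecheckAttempt?: number;
}

// Sketch only: takes the env as a plain record so it can be exercised
// without a real container environment.
function classifyRecheckSketch(env: Record<string, string | undefined>): RecheckClass {
  const jobType = env['JOB_TYPE'];
  const jobData = env['JOB_DATA'];
  // Only GitHub jobs carry the re-check fields.
  if (jobType !== 'github' || jobData === undefined) return 'none';
  const data = JSON.parse(jobData) as GitHubJobDataLike;
  if (data.checkSuiteRecheckAttempt !== undefined) return 'check-suite';
  if (data.mergeabilityRecheckAttempt !== undefined) return 'mergeability';
  return 'none';
}
```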

2 changes: 1 addition & 1 deletion docs/architecture/02-webhook-pipeline.md
@@ -147,7 +147,7 @@ Structured skip is intentionally different from bare `null`: it preserves the ha

PM status-change dispatches can include a `coalesceKey`, normally `${projectId}:${workItemId}`. When `PM_COALESCE_WINDOW_MS` is positive, the router schedules a delayed job via `scheduleCoalescedJob`; a newer dispatch with the same key supersedes the pending one and releases the superseded job's in-memory locks. PM ack comments are deferred to job fire time for coalesced jobs so superseded work does not leave orphan comments.

Deferred re-check also uses `scheduleCoalescedJob` and exits without dispatch locks or an ack comment. The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and sets `mergeabilityRecheckAttempt: 1`, so the GitHub worker re-dispatches through the trigger registry to evaluate fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`, so their workers return the pre-resolved `agentType: null` result directly without re-dispatching. If a deferred re-check schedule call fails, the router captures Sentry under `deferred_recheck_schedule_failure` and still returns `Deferred re-check scheduled` — it does not call `onBlocked`. GitHub mergeability uses this when `mergeable` is still `null` after the synchronous retry budget; if the re-check still cannot resolve state, the GitHub worker records `mergeability_recheck_exhausted` and stops rather than re-queueing indefinitely.
Deferred re-check also uses `scheduleCoalescedJob` and exits without dispatch locks or an ack comment. The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and stamps a re-check field determined by the optional `recheckKind` discriminator on `deferredRecheck`. Two kinds exist: **mergeability re-check** (no `recheckKind`, stamps `mergeabilityRecheckAttempt: 1`) — one-shot; if the re-check still cannot resolve state the worker records `mergeability_recheck_exhausted` and stops. **Check-suite re-check** (`recheckKind: 'check-suite'`, stamps `checkSuiteRecheckAttempt: 1`) — safe rescheduling; when the Actions API is still stale the worker reschedules another coalesced delayed job so dispatch stays alive until the API catches up (used for Actions-API-lag). Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`, so their workers return the pre-resolved `agentType: null` result directly without re-dispatching. If a deferred re-check schedule call fails, the router captures Sentry under `deferred_recheck_schedule_failure` and still returns `Deferred re-check scheduled` — it does not call `onBlocked`.
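
The supersede-by-key behaviour of coalesced scheduling can be modelled with a small in-memory structure. This is a simplified sketch, not the real `scheduleCoalescedJob` (whose signature is not shown here and which schedules delayed queue jobs rather than keeping a map); `CoalescingQueueSketch` and its methods are hypothetical, and time is passed in explicitly so the logic stays deterministic.

```typescript
interface PendingJobSketch<T> {
  payload: T;
  fireAt: number; // epoch-style millisecond timestamp
}

class CoalescingQueueSketch<T> {
  private pending = new Map<string, PendingJobSketch<T>>();

  // Schedule a job; a newer job with the same key replaces the pending one.
  // Returns the superseded payload (if any) so the caller can release its locks.
  schedule(coalesceKey: string, payload: T, delayMs: number, now: number): T | undefined {
    const previous = this.pending.get(coalesceKey);
    this.pending.set(coalesceKey, { payload, fireAt: now + delayMs });
    return previous === undefined ? undefined : previous.payload;
  }

  // Remove and return every payload whose delay has elapsed at `now`.
  takeDue(now: number): T[] {
    const due: T[] = [];
    this.pending.forEach((job, key) => {
      if (job.fireAt <= now) {
        due.push(job.payload);
        this.pending.delete(key);
      }
    });
    return due;
  }
}
```

In this model only the latest dispatch for a key ever fires, which is why ack comments deferred to fire time cannot be orphaned by superseded work.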

### Concurrency controls

9 changes: 7 additions & 2 deletions docs/architecture/03-trigger-system.md
@@ -172,9 +172,14 @@ The router preserves structured skips in webhook logs with `Trigger <handler> sk

### Deferred re-checks

Handlers that cannot make a final decision yet can return `deferredRecheck: { delayMs, coalesceKey }` with `agentType: null`. The router schedules a coalesced delayed BullMQ job and exits without spawning an agent. GitHub mergeability checks use this path; the worker recognizes re-check jobs via `mergeabilityRecheckAttempt` and captures a Sentry diagnostic if the second pass still cannot resolve state.
Handlers that cannot make a final decision yet can return `deferredRecheck: { delayMs, coalesceKey, recheckKind? }` with `agentType: null`. The router schedules a coalesced delayed BullMQ job and exits without spawning an agent.

The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and sets `mergeabilityRecheckAttempt: 1`, so the GitHub worker re-dispatches through the trigger registry to evaluate fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`; `resolveTriggerResult()` returns the pre-resolved result directly, skipping registry dispatch. A non-GitHub handler returning `buildDeferredRecheckResult` would therefore schedule a job that reuses the same `agentType: null` result rather than re-evaluating provider state. See `src/triggers/README.md` for the full authoring contract. Workers do not schedule another re-check after exhaustion.
The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and stamps the right re-check field based on the optional `recheckKind` discriminator. Two re-check kinds exist:

- **Mergeability re-check** (`recheckKind` absent) — `mergeabilityRecheckAttempt: 1` is set on the job. The GitHub worker re-dispatches through the registry for fresh provider state. If the re-check still cannot resolve state, the worker Sentry-captures under `mergeability_recheck_exhausted` and stops (one-shot — no further rescheduling).
- **Check-suite re-check** (`recheckKind: 'check-suite'`) — `checkSuiteRecheckAttempt: 1` is set on the job. If the Actions API is still stale when the job fires, the worker reschedules another coalesced delayed job instead of exhausting, so review/respond-to-ci dispatch stays alive until the API catches up. Used by `check-suite-success` and `check-suite-failure` for the Actions-API-lag case (ucho PR #394/MNG-683, 2026-05-11).

Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job regardless of `deferredRecheck`; `resolveTriggerResult()` returns the pre-resolved result directly, skipping registry dispatch. A non-GitHub handler returning `buildDeferredRecheckResult` would therefore schedule a job that reuses the same `agentType: null` result rather than re-evaluating provider state. See `src/triggers/README.md` for the full authoring contract.
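
To make the `recheckKind` discriminator concrete, here is a minimal sketch of the result shape and of the field-stamping step. The interfaces and `stampRecheckFieldSketch` are hypothetical stand-ins; the real stamping happens inside `GitHubRouterAdapter.buildJob()`, and the real `TriggerResult` type has more fields than shown. The `delayMs` and `coalesceKey` values are made up for illustration.

```typescript
// Hypothetical, reduced versions of the types described above.
interface DeferredRecheckSketch {
  delayMs: number;
  coalesceKey: string;
  recheckKind?: 'check-suite'; // absent means mergeability re-check
}

interface TriggerResultSketch {
  agentType: string | null;
  deferredRecheck?: DeferredRecheckSketch;
}

interface RecheckFieldsSketch {
  mergeabilityRecheckAttempt?: number;
  checkSuiteRecheckAttempt?: number;
}

// Sketch only: pick which re-check field to stamp on the job.
function stampRecheckFieldSketch(recheck: DeferredRecheckSketch): RecheckFieldsSketch {
  return recheck.recheckKind === 'check-suite'
    ? { checkSuiteRecheckAttempt: 1 }
    : { mergeabilityRecheckAttempt: 1 };
}

// What a handler might return when the Actions API has not caught up yet.
const staleActionsApiResult: TriggerResultSketch = {
  agentType: null,
  deferredRecheck: {
    delayMs: 30000, // illustrative value
    coalesceKey: 'example-project:example-pr', // illustrative value
    recheckKind: 'check-suite',
  },
};
```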

### Config resolution

7 changes: 5 additions & 2 deletions docs/architecture/10-resilience.md
@@ -97,9 +97,12 @@ This split prevents both classes of failure from wedging a work item for the loc

### Deferred re-check exhaustion

Some provider state is eventually consistent and has no follow-up webhook. A trigger can return `TriggerResult.deferredRecheck` with `agentType: null`; the router schedules a coalesced delayed bare job and does not take normal dispatch locks. The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and sets `mergeabilityRecheckAttempt: 1`, so the GitHub worker re-dispatches through the registry to get fresh provider state. Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job; their workers return the pre-resolved `agentType: null` result directly without re-dispatching through the registry.
Some provider state is eventually consistent and has no follow-up webhook. A trigger can return `TriggerResult.deferredRecheck` (with `agentType: null` and an optional `recheckKind`); the router schedules a coalesced delayed bare job and does not take normal dispatch locks. The bare re-dispatch on job fire is currently **GitHub-only**: `GitHubRouterAdapter.buildJob()` strips `triggerResult` and stamps the right re-check field. Two re-check kinds exist:

GitHub mergeability uses this for `pull_request` events where `mergeable === null`. If the deferred job still gets another deferred result, workers do not schedule a second re-check. The GitHub worker emits a WARN and captures to Sentry with tag `mergeability_recheck_exhausted`, making pathological provider latency visible without creating an infinite retry loop.
- **Mergeability re-check** (`recheckKind` absent, sets `mergeabilityRecheckAttempt: 1`) — used for `pull_request` events where `mergeable === null`. One-shot: if the deferred job still gets another deferred result, the worker does not schedule a second re-check; it emits a WARN and captures to Sentry with tag `mergeability_recheck_exhausted`.
- **Check-suite re-check** (`recheckKind: 'check-suite'`, sets `checkSuiteRecheckAttempt: 1`) — used by `check-suite-success` and `check-suite-failure` when the Actions API lags webhook delivery. Safe rescheduling: if the deferred job still sees a stale result, the worker reschedules another coalesced delayed job instead of exhausting. This keeps review/respond-to-ci dispatch alive until the API catches up without risk of infinite loop (coalesceKey deduplicates concurrent rechecks).

Non-GitHub adapters (Trello, JIRA, Linear, Sentry) embed `triggerResult` in the job; their workers return the pre-resolved `agentType: null` result directly without re-dispatching through the registry.
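
The difference between the two kinds on the second pass can be summarised as a small decision function. This is a sketch under the assumption that the worker sees either a resolved result or another deferred result; `decideSecondPassSketch` and its types are hypothetical and do not mirror the worker's actual code.

```typescript
type RecheckKindSketch = 'mergeability' | 'check-suite';

interface DeferredAgainSketch {
  delayMs: number;
  coalesceKey: string;
}

type SecondPassAction =
  | { kind: 'proceed' }
  | { kind: 'exhausted'; sentryTag: string }
  | { kind: 'reschedule'; delayMs: number; coalesceKey: string };

// Sketch only: what a re-check job does with the result of its re-dispatch.
function decideSecondPassSketch(
  recheck: RecheckKindSketch,
  deferredAgain: DeferredAgainSketch | undefined,
): SecondPassAction {
  // State resolved: continue with the normal trigger result.
  if (deferredAgain === undefined) return { kind: 'proceed' };
  // Mergeability is one-shot: report and stop, never re-queue.
  if (recheck === 'mergeability') {
    return { kind: 'exhausted', sentryTag: 'mergeability_recheck_exhausted' };
  }
  // Check-suite keeps waiting: schedule another coalesced delayed job.
  return {
    kind: 'reschedule',
    delayMs: deferredAgain.delayMs,
    coalesceKey: deferredAgain.coalesceKey,
  };
}
```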

### Wedged-lock canary
