fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660) by zbigniewsobiecki · Pull Request #1434 · mongrel-intelligence/cascade

zbigniewsobiecki · 2026-06-23T18:27:51Z

Problem

Diagnosed live on prod (project ucho, issue MNG-1660): every agent run — both planning and implementation — silently did no work. The user saw "it's not being planned" and "moving it to Todo doesn't start work."

The router serializes the entire job payload into the single JOB_DATA environment variable and passes it to the worker container via Docker Env. MNG-1660 has a ~10 KB+ markdown description + embedded plan, and for non-GitHub (PM) adapters the job embeds that content twice — once as the raw webhook payload, once inside the pre-resolved triggerResult.agentInput. The resulting JOB_DATA string exceeded the Linux per-arg limit MAX_ARG_STRLEN (128 KiB), so the kernel rejected the execve of the container entrypoint before any JS ran:

exec /usr/local/bin/docker-entrypoint.sh: argument list too long
Worker exited: statusCode: 255, oomKilled: false, durationMs: 261

All 5 MNG-1660 worker spawns crashed identically; siblings MNG-1661/1662/1663/1664 (normal-sized descriptions) ran for minutes and completed. GitHub avoids the crash by setting triggerResult: undefined and re-resolving worker-side — but PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053 implementation-freshness-gate invariant), so shrinking the payload is not an option.

Fix

Offload oversized JOB_DATA to Redis; the env carries a key pointer instead. Small payloads keep the inline env path (backward compatible).

File	Change
`src/router/job-data-offload.ts` (new)	`JOB_DATA_INLINE_MAX_BYTES` (96 KiB), `offloadJobData` (router write, fail-loud + Sentry `job_data_offload_write`), `readOffloadedJobData` (worker GET + DEL), lazy Redis singleton via `routerConfig.redisUrl`
`src/router/worker-env.ts`	`Buffer.byteLength`-gate `JOB_DATA`; over threshold → `offloadJobData` + emit `JOB_DATA_REDIS_KEY`. A throw propagates into `spawnWorker`'s catch → BullMQ retry, so no crash-guaranteed container is launched
`src/worker-entry.ts`	resolve payload from inline env or Redis key (before `scrubSensitiveEnv` strips `REDIS_URL`), DEL the key after read. Fail-loud with distinct Sentry tag `worker_job_data_redis_read` — never the cryptic exec crash
`src/utils/cascadeEnv.ts`, `src/backends/shared/envFilter.ts`	add `JOB_DATA_REDIS_KEY` to scrub/block lists (parity with `JOB_DATA`)

Why Redis, not a bind-mounted file: the cascade-router runs as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (the worker already receives REDIS_URL).

Tests (TDD — written first)

tests/unit/router/job-data-offload.test.ts (new) — key formatting, SET-with-TTL, GET+DEL, not-found, Redis-error.
tests/unit/router/job-data-roundtrip.test.ts (new) — a >128 KiB payload round-trips losslessly (router write → worker read), no env string exceeds the inline limit, no Docker. Direct regression for the MNG-1660 class.
Extended worker-env.test.ts (byte-vs-char threshold, fail-loud propagation), worker-entry.test.ts (Redis-read success / missing key / Redis down), shared-envFilter.test.ts.
✅ typecheck, lint, full unit + integration suite (3833 passed) green via pre-push.

Out of scope (follow-up)

De-duplicating the description across payload + triggerResult.agentInput — only halves size (a single >128 KiB description still overflows) and risks the MNG-1053 invariant. The Redis offload handles arbitrary size regardless.

Verification after deploy

Deploy router + worker images.
cascade runs trigger --agent-type implementation for MNG-1660.
docker logs cascade-router → expect [WorkerManager] Offloaded large JOB_DATA to Redis { jobId, bytes } then Worker started, no exit-255 / argument list too long.
Worker reaches [Worker] Starting job and proceeds past JSON parse.
Confirm a small item still uses inline JOB_DATA (no offload log line).

🤖 Generated with Claude Code

Worker containers are spawned with the full job payload serialized into the single JOB_DATA env var. For large work items (e.g. ucho/MNG-1660, whose ~10KB+ markdown description was serialized twice — raw webhook `payload` plus the pre-resolved `triggerResult.agentInput`), JOB_DATA exceeded the Linux MAX_ARG_STRLEN (128 KiB) per-arg limit. The kernel then rejected the exec of the container entrypoint with `argument list too long`, so the worker exited 255 in ~260ms before any JS ran — silently breaking every planning AND implementation run for that item, while siblings ran fine. PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053 freshness-gate invariant), so shrinking the payload is not an option. Instead, move the payload off the env channel when it is too large: - New src/router/job-data-offload.ts: when JSON.stringify(job.data) exceeds JOB_DATA_INLINE_MAX_BYTES (96 KiB), store it in Redis under cascade:jobdata:<jobId> (TTL safety net) and pass JOB_DATA_REDIS_KEY instead of JOB_DATA. Small payloads keep the inline env path (backward compatible). - worker-env.ts: byteLength-gate JOB_DATA; offload + emit JOB_DATA_REDIS_KEY for oversized payloads. Offload write throws loudly → BullMQ retry, so no crash-guaranteed container is ever launched. - worker-entry.ts: resolve the payload from the inline env or the Redis key (before scrubSensitiveEnv strips REDIS_URL), deleting the key after read. Fail-loud with distinct Sentry tags — never the cryptic exec crash. - cascadeEnv.ts / envFilter.ts: add JOB_DATA_REDIS_KEY to the scrub/block lists (parity with JOB_DATA). Redis chosen over a bind-mounted file because the router runs as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (worker already gets REDIS_URL). Tests (TDD): new job-data-offload + round-trip (>128 KiB round-trips losslessly, no env string over the limit, no Docker); extended worker-env (byte-vs-char threshold, fail-loud), worker-entry (Redis-read success / missing key / Redis down), and shared-envFilter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X

codecov · 2026-06-23T18:32:39Z

Codecov Report

❌ Patch coverage is 90.21739% with 9 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/worker-entry.ts	78.57%	6 Missing ⚠️
src/router/job-data-offload.ts	93.87%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

nhopeatall

Summary

APPROVE — correct, surgical fix for the MNG-1660 exec crash (argument list too long). I traced the full router-write → worker-read path and every supporting invariant holds.

What I verified

Root cause & threshold math. MAX_ARG_STRLEN is the 128 KiB per-string env/argv limit; the 96 KiB gate plus the JOB_DATA= prefix stays comfortably under it. Threshold is measured with Buffer.byteLength(serialized, 'utf8'), and the multibyte test proves it's bytes, not String.length (61440 chars / 122926 bytes → offloaded).
Single gate, no bypass. buildWorkerEnvWithProjectId is the only place a JOB_DATA= entry is built, and every spawn path (spawnWorker main + snapshot-fallback at container-manager.ts:247, buildWorkerEnv) routes through it.
Retry-safety. guardedSpawn → spawnWorker → buildWorkerEnvWithProjectId re-runs on every BullMQ attempt, so the worker's delete-after-read is safe — each retry re-offloads the same cascade:jobdata:<jobId> key. The 1 h TTL bounds the leak when a worker never starts.
Read-before-scrub ordering. resolveRawJobData() runs before scrubSensitiveEnv() strips REDIS_URL, and getRedis() reads routerConfig.redisUrl (captured at module load) anyway — consistent with the existing review-dispatch-dedup.ts pattern. The worker already uses the same REDIS_URL for BullMQ completion reporting, so the read needs no new connectivity assumption.
No new secret surface. job.data is already persisted by BullMQ in the same Redis; project credentials are resolved separately in worker-env.ts and never embedded in job.data, so offloading to a second key exposes nothing new.
Key uniqueness. Both queues mint IDs as ${type}-${Date.now()}-${rand} with distinct type prefixes, so the offload key is exactly as collision-resistant as BullMQ's own jobId guarantee — no cross-queue clobbering.
Env-filter parity. JOB_DATA_REDIS_KEY is correctly added to both the subprocess denylist (SHARED_BLOCKED_ENV_EXACT) and the .cascade/env protected-keys (PROTECTED_ENV_KEYS), matching JOB_DATA.
Fail-loud everywhere. Offload-write failure throws → BullMQ retry (no crash-guaranteed container launched); worker read failure exits 1 with the distinct worker_job_data_redis_read Sentry tag instead of the cryptic exec crash.

Tests are strong — the real-module round-trip (job-data-roundtrip.test.ts) is a direct regression for the exact >128 KiB incident class. Ran the new + modified suites locally: 119 tests pass; CI is 7/7 green.

The "out of scope" reasoning (dedup only halves size and risks the MNG-1053 invariant) and the Redis-vs-bind-mount justification are both sound.

Optional (non-blocking)

Given the repo's strong documentation culture (extensive CLAUDE.md + drift-guard tests), a one-line mention of the JOB_DATA → Redis offload channel under the worker/dispatch section of CLAUDE.md would help future maintainers. The inline JSDoc in job-data-offload.ts is already thorough, so this is purely a nicety.

🕵️ claude-code · claude-opus-4-8 · run details

nhopeatall approved these changes Jun 23, 2026

View reviewed changes

zbigniewsobiecki merged commit 19cd6a8 into dev Jun 23, 2026
9 checks passed

zbigniewsobiecki mentioned this pull request Jun 23, 2026

Release: worker JOB_DATA Redis offload fix (MNG-1660) #1435

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)#1434

fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)#1434
zbigniewsobiecki merged 1 commit into
devfrom
fix/worker-job-data-argmax-offload

zbigniewsobiecki commented Jun 23, 2026

Uh oh!

codecov Bot commented Jun 23, 2026

Uh oh!

nhopeatall left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zbigniewsobiecki commented Jun 23, 2026

Problem

Fix

Tests (TDD — written first)

Out of scope (follow-up)

Verification after deploy

Uh oh!

codecov Bot commented Jun 23, 2026

Codecov Report

Uh oh!

nhopeatall left a comment

Choose a reason for hiding this comment

Summary

What I verified

Optional (non-blocking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants