fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)#1434
Conversation
Worker containers are spawned with the full job payload serialized into the single JOB_DATA env var. For large work items (e.g. ucho/MNG-1660, whose ~10KB+ markdown description was serialized twice — raw webhook `payload` plus the pre-resolved `triggerResult.agentInput`), JOB_DATA exceeded the Linux MAX_ARG_STRLEN (128 KiB) per-arg limit. The kernel then rejected the exec of the container entrypoint with `argument list too long`, so the worker exited 255 in ~260ms before any JS ran — silently breaking every planning AND implementation run for that item, while siblings ran fine. PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053 freshness-gate invariant), so shrinking the payload is not an option. Instead, move the payload off the env channel when it is too large: - New src/router/job-data-offload.ts: when JSON.stringify(job.data) exceeds JOB_DATA_INLINE_MAX_BYTES (96 KiB), store it in Redis under cascade:jobdata:<jobId> (TTL safety net) and pass JOB_DATA_REDIS_KEY instead of JOB_DATA. Small payloads keep the inline env path (backward compatible). - worker-env.ts: byteLength-gate JOB_DATA; offload + emit JOB_DATA_REDIS_KEY for oversized payloads. Offload write throws loudly → BullMQ retry, so no crash-guaranteed container is ever launched. - worker-entry.ts: resolve the payload from the inline env or the Redis key (before scrubSensitiveEnv strips REDIS_URL), deleting the key after read. Fail-loud with distinct Sentry tags — never the cryptic exec crash. - cascadeEnv.ts / envFilter.ts: add JOB_DATA_REDIS_KEY to the scrub/block lists (parity with JOB_DATA). Redis chosen over a bind-mounted file because the router runs as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (worker already gets REDIS_URL). Tests (TDD): new job-data-offload + round-trip (>128 KiB round-trips losslessly, no env string over the limit, no Docker); extended worker-env (byte-vs-char threshold, fail-loud), worker-entry (Redis-read success / missing key / Redis down), and shared-envFilter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
APPROVE — correct, surgical fix for the MNG-1660 exec crash (argument list too long). I traced the full router-write → worker-read path and every supporting invariant holds.
What I verified
- Root cause & threshold math.
MAX_ARG_STRLENis the 128 KiB per-string env/argv limit; the 96 KiB gate plus theJOB_DATA=prefix stays comfortably under it. Threshold is measured withBuffer.byteLength(serialized, 'utf8'), and the multibyte test proves it's bytes, notString.length(61440 chars / 122926 bytes → offloaded). - Single gate, no bypass.
buildWorkerEnvWithProjectIdis the only place aJOB_DATA=entry is built, and every spawn path (spawnWorkermain + snapshot-fallback atcontainer-manager.ts:247,buildWorkerEnv) routes through it. - Retry-safety.
guardedSpawn → spawnWorker → buildWorkerEnvWithProjectIdre-runs on every BullMQ attempt, so the worker's delete-after-read is safe — each retry re-offloads the samecascade:jobdata:<jobId>key. The 1 h TTL bounds the leak when a worker never starts. - Read-before-scrub ordering.
resolveRawJobData()runs beforescrubSensitiveEnv()stripsREDIS_URL, andgetRedis()readsrouterConfig.redisUrl(captured at module load) anyway — consistent with the existingreview-dispatch-dedup.tspattern. The worker already uses the sameREDIS_URLfor BullMQ completion reporting, so the read needs no new connectivity assumption. - No new secret surface.
job.datais already persisted by BullMQ in the same Redis; project credentials are resolved separately inworker-env.tsand never embedded injob.data, so offloading to a second key exposes nothing new. - Key uniqueness. Both queues mint IDs as
${type}-${Date.now()}-${rand}with distinct type prefixes, so the offload key is exactly as collision-resistant as BullMQ's own jobId guarantee — no cross-queue clobbering. - Env-filter parity.
JOB_DATA_REDIS_KEYis correctly added to both the subprocess denylist (SHARED_BLOCKED_ENV_EXACT) and the.cascade/envprotected-keys (PROTECTED_ENV_KEYS), matchingJOB_DATA. - Fail-loud everywhere. Offload-write failure throws → BullMQ retry (no crash-guaranteed container launched); worker read failure exits 1 with the distinct
worker_job_data_redis_readSentry tag instead of the cryptic exec crash.
Tests are strong — the real-module round-trip (job-data-roundtrip.test.ts) is a direct regression for the exact >128 KiB incident class. Ran the new + modified suites locally: 119 tests pass; CI is 7/7 green.
The "out of scope" reasoning (dedup only halves size and risks the MNG-1053 invariant) and the Redis-vs-bind-mount justification are both sound.
Optional (non-blocking)
Given the repo's strong documentation culture (extensive CLAUDE.md + drift-guard tests), a one-line mention of the JOB_DATA → Redis offload channel under the worker/dispatch section of CLAUDE.md would help future maintainers. The inline JSDoc in job-data-offload.ts is already thorough, so this is purely a nicety.
🕵️ claude-code · claude-opus-4-8 · run details
Problem
Diagnosed live on prod (project
ucho, issue MNG-1660): every agent run — bothplanningandimplementation— silently did no work. The user saw "it's not being planned" and "moving it to Todo doesn't start work."The router serializes the entire job payload into the single
JOB_DATAenvironment variable and passes it to the worker container via DockerEnv. MNG-1660 has a ~10 KB+ markdown description + embedded plan, and for non-GitHub (PM) adapters the job embeds that content twice — once as the raw webhookpayload, once inside the pre-resolvedtriggerResult.agentInput. The resultingJOB_DATAstring exceeded the Linux per-arg limitMAX_ARG_STRLEN(128 KiB), so the kernel rejected theexecveof the container entrypoint before any JS ran:All 5 MNG-1660 worker spawns crashed identically; siblings MNG-1661/1662/1663/1664 (normal-sized descriptions) ran for minutes and completed. GitHub avoids the crash by setting
triggerResult: undefinedand re-resolving worker-side — but PM adapters must keep embedding the pre-resolvedtriggerResult(MNG-1053 implementation-freshness-gate invariant), so shrinking the payload is not an option.Fix
Offload oversized
JOB_DATAto Redis; the env carries a key pointer instead. Small payloads keep the inline env path (backward compatible).src/router/job-data-offload.ts(new)JOB_DATA_INLINE_MAX_BYTES(96 KiB),offloadJobData(router write, fail-loud + Sentryjob_data_offload_write),readOffloadedJobData(worker GET + DEL), lazy Redis singleton viarouterConfig.redisUrlsrc/router/worker-env.tsBuffer.byteLength-gateJOB_DATA; over threshold →offloadJobData+ emitJOB_DATA_REDIS_KEY. A throw propagates intospawnWorker's catch → BullMQ retry, so no crash-guaranteed container is launchedsrc/worker-entry.tsscrubSensitiveEnvstripsREDIS_URL), DEL the key after read. Fail-loud with distinct Sentry tagworker_job_data_redis_read— never the cryptic exec crashsrc/utils/cascadeEnv.ts,src/backends/shared/envFilter.tsJOB_DATA_REDIS_KEYto scrub/block lists (parity withJOB_DATA)Why Redis, not a bind-mounted file: the
cascade-routerruns as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (the worker already receivesREDIS_URL).Tests (TDD — written first)
tests/unit/router/job-data-offload.test.ts(new) — key formatting, SET-with-TTL, GET+DEL, not-found, Redis-error.tests/unit/router/job-data-roundtrip.test.ts(new) — a >128 KiB payload round-trips losslessly (router write → worker read), no env string exceeds the inline limit, no Docker. Direct regression for the MNG-1660 class.worker-env.test.ts(byte-vs-char threshold, fail-loud propagation),worker-entry.test.ts(Redis-read success / missing key / Redis down),shared-envFilter.test.ts.typecheck,lint, full unit + integration suite (3833 passed) green via pre-push.Out of scope (follow-up)
De-duplicating the description across
payload+triggerResult.agentInput— only halves size (a single >128 KiB description still overflows) and risks the MNG-1053 invariant. The Redis offload handles arbitrary size regardless.Verification after deploy
cascade runs trigger --agent-type implementationfor MNG-1660.docker logs cascade-router→ expect[WorkerManager] Offloaded large JOB_DATA to Redis { jobId, bytes }thenWorker started, no exit-255 /argument list too long.[Worker] Starting joband proceeds past JSON parse.JOB_DATA(no offload log line).🤖 Generated with Claude Code