Skip to content

fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)#1434

Merged
zbigniewsobiecki merged 1 commit into
devfrom
fix/worker-job-data-argmax-offload
Jun 23, 2026
Merged

fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)#1434
zbigniewsobiecki merged 1 commit into
devfrom
fix/worker-job-data-argmax-offload

Conversation

@zbigniewsobiecki

Copy link
Copy Markdown
Member

Problem

Diagnosed live on prod (project ucho, issue MNG-1660): every agent run — both planning and implementation — silently did no work. The user saw "it's not being planned" and "moving it to Todo doesn't start work."

The router serializes the entire job payload into the single JOB_DATA environment variable and passes it to the worker container via Docker Env. MNG-1660 has a ~10 KB+ markdown description + embedded plan, and for non-GitHub (PM) adapters the job embeds that content twice — once as the raw webhook payload, once inside the pre-resolved triggerResult.agentInput. The resulting JOB_DATA string exceeded the Linux per-arg limit MAX_ARG_STRLEN (128 KiB), so the kernel rejected the execve of the container entrypoint before any JS ran:

exec /usr/local/bin/docker-entrypoint.sh: argument list too long
Worker exited: statusCode: 255, oomKilled: false, durationMs: 261

All 5 MNG-1660 worker spawns crashed identically; siblings MNG-1661/1662/1663/1664 (normal-sized descriptions) ran for minutes and completed. GitHub avoids the crash by setting triggerResult: undefined and re-resolving worker-side — but PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053 implementation-freshness-gate invariant), so shrinking the payload is not an option.

Fix

Offload oversized JOB_DATA to Redis; the env carries a key pointer instead. Small payloads keep the inline env path (backward compatible).

File Change
src/router/job-data-offload.ts (new) JOB_DATA_INLINE_MAX_BYTES (96 KiB), offloadJobData (router write, fail-loud + Sentry job_data_offload_write), readOffloadedJobData (worker GET + DEL), lazy Redis singleton via routerConfig.redisUrl
src/router/worker-env.ts Buffer.byteLength-gate JOB_DATA; over threshold → offloadJobData + emit JOB_DATA_REDIS_KEY. A throw propagates into spawnWorker's catch → BullMQ retry, so no crash-guaranteed container is launched
src/worker-entry.ts resolve payload from inline env or Redis key (before scrubSensitiveEnv strips REDIS_URL), DEL the key after read. Fail-loud with distinct Sentry tag worker_job_data_redis_read — never the cryptic exec crash
src/utils/cascadeEnv.ts, src/backends/shared/envFilter.ts add JOB_DATA_REDIS_KEY to scrub/block lists (parity with JOB_DATA)

Why Redis, not a bind-mounted file: the cascade-router runs as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (the worker already receives REDIS_URL).

Tests (TDD — written first)

  • tests/unit/router/job-data-offload.test.ts (new) — key formatting, SET-with-TTL, GET+DEL, not-found, Redis-error.
  • tests/unit/router/job-data-roundtrip.test.ts (new) — a >128 KiB payload round-trips losslessly (router write → worker read), no env string exceeds the inline limit, no Docker. Direct regression for the MNG-1660 class.
  • Extended worker-env.test.ts (byte-vs-char threshold, fail-loud propagation), worker-entry.test.ts (Redis-read success / missing key / Redis down), shared-envFilter.test.ts.
  • typecheck, lint, full unit + integration suite (3833 passed) green via pre-push.

Out of scope (follow-up)

De-duplicating the description across payload + triggerResult.agentInput — only halves size (a single >128 KiB description still overflows) and risks the MNG-1053 invariant. The Redis offload handles arbitrary size regardless.

Verification after deploy

  1. Deploy router + worker images.
  2. cascade runs trigger --agent-type implementation for MNG-1660.
  3. docker logs cascade-router → expect [WorkerManager] Offloaded large JOB_DATA to Redis { jobId, bytes } then Worker started, no exit-255 / argument list too long.
  4. Worker reaches [Worker] Starting job and proceeds past JSON parse.
  5. Confirm a small item still uses inline JOB_DATA (no offload log line).

🤖 Generated with Claude Code

Worker containers are spawned with the full job payload serialized into the
single JOB_DATA env var. For large work items (e.g. ucho/MNG-1660, whose
~10KB+ markdown description was serialized twice — raw webhook `payload` plus
the pre-resolved `triggerResult.agentInput`), JOB_DATA exceeded the Linux
MAX_ARG_STRLEN (128 KiB) per-arg limit. The kernel then rejected the exec of
the container entrypoint with `argument list too long`, so the worker exited
255 in ~260ms before any JS ran — silently breaking every planning AND
implementation run for that item, while siblings ran fine.

PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053
freshness-gate invariant), so shrinking the payload is not an option. Instead,
move the payload off the env channel when it is too large:

- New src/router/job-data-offload.ts: when JSON.stringify(job.data) exceeds
  JOB_DATA_INLINE_MAX_BYTES (96 KiB), store it in Redis under
  cascade:jobdata:<jobId> (TTL safety net) and pass JOB_DATA_REDIS_KEY instead
  of JOB_DATA. Small payloads keep the inline env path (backward compatible).
- worker-env.ts: byteLength-gate JOB_DATA; offload + emit JOB_DATA_REDIS_KEY
  for oversized payloads. Offload write throws loudly → BullMQ retry, so no
  crash-guaranteed container is ever launched.
- worker-entry.ts: resolve the payload from the inline env or the Redis key
  (before scrubSensitiveEnv strips REDIS_URL), deleting the key after read.
  Fail-loud with distinct Sentry tags — never the cryptic exec crash.
- cascadeEnv.ts / envFilter.ts: add JOB_DATA_REDIS_KEY to the scrub/block lists
  (parity with JOB_DATA).

Redis chosen over a bind-mounted file because the router runs as a container
and spawns sibling workers via the Docker socket; a Dockerode bind path
resolves on the docker host, not inside the router container, so a file the
router writes is not the file mounted into the worker without a shared host
volume. Redis needs zero infra change (worker already gets REDIS_URL).

Tests (TDD): new job-data-offload + round-trip (>128 KiB round-trips
losslessly, no env string over the limit, no Docker); extended worker-env
(byte-vs-char threshold, fail-loud), worker-entry (Redis-read success / missing
key / Redis down), and shared-envFilter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.21739% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/worker-entry.ts 78.57% 6 Missing ⚠️
src/router/job-data-offload.ts 93.87% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

APPROVE — correct, surgical fix for the MNG-1660 exec crash (argument list too long). I traced the full router-write → worker-read path and every supporting invariant holds.

What I verified

  • Root cause & threshold math. MAX_ARG_STRLEN is the 128 KiB per-string env/argv limit; the 96 KiB gate plus the JOB_DATA= prefix stays comfortably under it. Threshold is measured with Buffer.byteLength(serialized, 'utf8'), and the multibyte test proves it's bytes, not String.length (61440 chars / 122926 bytes → offloaded).
  • Single gate, no bypass. buildWorkerEnvWithProjectId is the only place a JOB_DATA= entry is built, and every spawn path (spawnWorker main + snapshot-fallback at container-manager.ts:247, buildWorkerEnv) routes through it.
  • Retry-safety. guardedSpawn → spawnWorker → buildWorkerEnvWithProjectId re-runs on every BullMQ attempt, so the worker's delete-after-read is safe — each retry re-offloads the same cascade:jobdata:<jobId> key. The 1 h TTL bounds the leak when a worker never starts.
  • Read-before-scrub ordering. resolveRawJobData() runs before scrubSensitiveEnv() strips REDIS_URL, and getRedis() reads routerConfig.redisUrl (captured at module load) anyway — consistent with the existing review-dispatch-dedup.ts pattern. The worker already uses the same REDIS_URL for BullMQ completion reporting, so the read needs no new connectivity assumption.
  • No new secret surface. job.data is already persisted by BullMQ in the same Redis; project credentials are resolved separately in worker-env.ts and never embedded in job.data, so offloading to a second key exposes nothing new.
  • Key uniqueness. Both queues mint IDs as ${type}-${Date.now()}-${rand} with distinct type prefixes, so the offload key is exactly as collision-resistant as BullMQ's own jobId guarantee — no cross-queue clobbering.
  • Env-filter parity. JOB_DATA_REDIS_KEY is correctly added to both the subprocess denylist (SHARED_BLOCKED_ENV_EXACT) and the .cascade/env protected-keys (PROTECTED_ENV_KEYS), matching JOB_DATA.
  • Fail-loud everywhere. Offload-write failure throws → BullMQ retry (no crash-guaranteed container launched); worker read failure exits 1 with the distinct worker_job_data_redis_read Sentry tag instead of the cryptic exec crash.

Tests are strong — the real-module round-trip (job-data-roundtrip.test.ts) is a direct regression for the exact >128 KiB incident class. Ran the new + modified suites locally: 119 tests pass; CI is 7/7 green.

The "out of scope" reasoning (dedup only halves size and risks the MNG-1053 invariant) and the Redis-vs-bind-mount justification are both sound.

Optional (non-blocking)

Given the repo's strong documentation culture (extensive CLAUDE.md + drift-guard tests), a one-line mention of the JOB_DATA → Redis offload channel under the worker/dispatch section of CLAUDE.md would help future maintainers. The inline JSDoc in job-data-offload.ts is already thorough, so this is purely a nicety.

🕵️ claude-code · claude-opus-4-8 · run details

@zbigniewsobiecki zbigniewsobiecki merged commit 19cd6a8 into dev Jun 23, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants