Release: worker JOB_DATA Redis offload fix (MNG-1660)#1435

Merged
zbigniewsobiecki merged 2 commits into main from dev
Jun 23, 2026
@zbigniewsobiecki

Member

Promotes the MNG-1660 fix from dev to main (prod).

Included: #1434 fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660).

Worker containers were spawned with the full job payload in the single JOB_DATA env var. For large PM work items (ucho/MNG-1660) this exceeded the Linux MAX_ARG_STRLEN limit (128 KiB), and the container crashed at exec with `argument list too long` before any JS ran. Oversized payloads are now offloaded to Redis and referenced by JOB_DATA_REDIS_KEY; small payloads keep the inline env path.

Full unit + integration suite green on dev.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 2 commits June 23, 2026 18:26
Worker containers are spawned with the full job payload serialized into the
single JOB_DATA env var. For large work items (e.g. ucho/MNG-1660, whose
~10KB+ markdown description was serialized twice — raw webhook `payload` plus
the pre-resolved `triggerResult.agentInput`), JOB_DATA exceeded the Linux
MAX_ARG_STRLEN (128 KiB) per-arg limit. The kernel then rejected the exec of
the container entrypoint with `argument list too long`, so the worker exited
255 in ~260ms before any JS ran — silently breaking every planning AND
implementation run for that item, while siblings ran fine.
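
The per-string limit described above can be illustrated with a minimal sketch. It assumes 4 KiB pages (so MAX_ARG_STRLEN is 32 pages) and that the kernel counts the whole `NAME=value` string plus its terminating NUL. `utf8ByteLength` and `fitsInEnvString` are hypothetical helpers, not names from the repository, and `TextEncoder` is used here only to keep the example portable. It also shows why the check has to count bytes rather than characters.

```typescript
// Linux caps each individual argv/env string at MAX_ARG_STRLEN.
// Assumption: 4 KiB pages, so 32 pages = 131072 bytes = 128 KiB.
const MAX_ARG_STRLEN = 32 * 4096;

// UTF-8 byte length, which is what the kernel sees (not the JS character count).
function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

// Hypothetical helper: would "NAME=value" plus its NUL terminator fit in one env string?
function fitsInEnvString(name: string, value: string): boolean {
  return utf8ByteLength(`${name}=${value}`) + 1 <= MAX_ARG_STRLEN;
}

const ascii = "a".repeat(100_000); // 100,000 chars, 100,000 bytes
const accented = "\u00e9".repeat(100_000); // 100,000 chars, 200,000 bytes in UTF-8

console.log(ascii.length, utf8ByteLength(ascii)); // 100000 100000
console.log(accented.length, utf8ByteLength(accented)); // 100000 200000
console.log(fitsInEnvString("JOB_DATA", ascii)); // true
console.log(fitsInEnvString("JOB_DATA", accented)); // false
```

Both strings have the same `.length`, yet only the ASCII one fits, which is consistent with the byte-vs-char threshold tests mentioned in this commit.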

PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053
freshness-gate invariant), so shrinking the payload is not an option. Instead,
move the payload off the env channel when it is too large:

- New src/router/job-data-offload.ts: when JSON.stringify(job.data) exceeds
  JOB_DATA_INLINE_MAX_BYTES (96 KiB), store it in Redis under
  cascade:jobdata:<jobId> (TTL safety net) and pass JOB_DATA_REDIS_KEY instead
  of JOB_DATA. Small payloads keep the inline env path (backward compatible).
- worker-env.ts: byteLength-gate JOB_DATA; offload + emit JOB_DATA_REDIS_KEY
  for oversized payloads. Offload write throws loudly → BullMQ retry, so no
  crash-guaranteed container is ever launched.
- worker-entry.ts: resolve the payload from the inline env or the Redis key
  (before scrubSensitiveEnv strips REDIS_URL), deleting the key after read.
  Fail-loud with distinct Sentry tags — never the cryptic exec crash.
- cascadeEnv.ts / envFilter.ts: add JOB_DATA_REDIS_KEY to the scrub/block lists
  (parity with JOB_DATA).
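
The router-side and worker-side steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `job-data-offload.ts` or `worker-entry.ts`: `KeyValueStore`, `buildJobDataEnv`, `resolveJobData`, `createMemoryStore`, and the TTL value are all hypothetical, and the store is a synchronous in-memory stand-in where the real change talks to Redis. Only `JOB_DATA`, `JOB_DATA_REDIS_KEY`, the 96 KiB threshold, and the `cascade:jobdata:<jobId>` key shape come from the commit message.

```typescript
// Assumed minimal key-value interface; a real Redis client would be async.
interface KeyValueStore {
  set(key: string, value: string, ttlSeconds: number): void;
  get(key: string): string | undefined;
  del(key: string): void;
}

const JOB_DATA_INLINE_MAX_BYTES = 96 * 1024; // threshold named in the commit message
const JOB_DATA_TTL_SECONDS = 3600; // hypothetical; the commit does not state the TTL

const byteLength = (s: string): number => new TextEncoder().encode(s).length;

// Router side: choose which env var the worker container receives.
function buildJobDataEnv(jobId: string, data: unknown, store: KeyValueStore): Record<string, string> {
  const serialized = JSON.stringify(data);
  if (byteLength(serialized) <= JOB_DATA_INLINE_MAX_BYTES) {
    return { JOB_DATA: serialized }; // small payload: inline env path
  }
  const key = `cascade:jobdata:${jobId}`;
  // If this write throws, the error propagates and no container is spawned.
  store.set(key, serialized, JOB_DATA_TTL_SECONDS);
  return { JOB_DATA_REDIS_KEY: key };
}

// Worker side: resolve the payload from the inline env or the offload key.
function resolveJobData(env: Record<string, string | undefined>, store: KeyValueStore): unknown {
  if (env.JOB_DATA !== undefined) return JSON.parse(env.JOB_DATA);
  const key = env.JOB_DATA_REDIS_KEY;
  if (key === undefined) throw new Error("neither JOB_DATA nor JOB_DATA_REDIS_KEY is set");
  const serialized = store.get(key);
  if (serialized === undefined) throw new Error(`job data missing for key ${key}`);
  store.del(key); // delete the key after a successful read
  return JSON.parse(serialized);
}

// In-memory stand-in for Redis (ignores the TTL), for illustration only.
function createMemoryStore(): KeyValueStore & { size(): number } {
  const map = new Map<string, string>();
  return {
    set: (k, v) => { map.set(k, v); },
    get: (k) => map.get(k),
    del: (k) => { map.delete(k); },
    size: () => map.size,
  };
}

const store = createMemoryStore();
const large = { description: "x".repeat(200 * 1024) };
const smallEnv = buildJobDataEnv("job-1", { id: 1 }, store);
const largeEnv = buildJobDataEnv("job-2", large, store);
const roundTripped = resolveJobData(largeEnv, store);

console.log(Object.keys(smallEnv)); // [ 'JOB_DATA' ]
console.log(largeEnv.JOB_DATA_REDIS_KEY); // cascade:jobdata:job-2
console.log(JSON.stringify(roundTripped) === JSON.stringify(large), store.size()); // true 0
```

The sketch leaves out the ordering constraint noted above (reading the key before `scrubSensitiveEnv` strips `REDIS_URL`) and the Sentry tagging, since neither can be shown without the real modules.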

Redis chosen over a bind-mounted file because the router runs as a container
and spawns sibling workers via the Docker socket; a Dockerode bind path
resolves on the docker host, not inside the router container, so a file the
router writes is not the file mounted into the worker without a shared host
volume. Redis needs zero infra change (worker already gets REDIS_URL).

Tests (TDD): new job-data-offload + round-trip (>128 KiB round-trips
losslessly, no env string over the limit, no Docker); extended worker-env
(byte-vs-char threshold, fail-loud), worker-entry (Redis-read success / missing
key / Redis down), and shared-envFilter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…a-argmax-offload

fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)
@zbigniewsobiecki zbigniewsobiecki merged commit 9fb15af into main Jun 23, 2026
9 checks passed
@codecov

codecov Bot commented Jun 23, 2026

Codecov Report

❌ Patch coverage is 90.21739% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/worker-entry.ts | 78.57% | 6 Missing ⚠️ |
| src/router/job-data-offload.ts | 93.87% | 3 Missing ⚠️ |
