Release: worker JOB_DATA Redis offload fix (MNG-1660) #1435
Merged
Worker containers are spawned with the full job payload serialized into the single JOB_DATA env var. For large work items (e.g. ucho/MNG-1660, whose ~10KB+ markdown description was serialized twice: the raw webhook `payload` plus the pre-resolved `triggerResult.agentInput`), JOB_DATA exceeded the Linux MAX_ARG_STRLEN (128 KiB) per-arg limit. The kernel then rejected the exec of the container entrypoint with `argument list too long`, so the worker exited 255 in ~260ms before any JS ran, silently breaking every planning AND implementation run for that item, while siblings ran fine.

PM adapters must keep embedding the pre-resolved triggerResult (MNG-1053 freshness-gate invariant), so shrinking the payload is not an option. Instead, move the payload off the env channel when it is too large:

- New src/router/job-data-offload.ts: when JSON.stringify(job.data) exceeds JOB_DATA_INLINE_MAX_BYTES (96 KiB), store it in Redis under cascade:jobdata:<jobId> (TTL safety net) and pass JOB_DATA_REDIS_KEY instead of JOB_DATA. Small payloads keep the inline env path (backward compatible).
- worker-env.ts: byteLength-gate JOB_DATA; offload + emit JOB_DATA_REDIS_KEY for oversized payloads. Offload write throws loudly → BullMQ retry, so no crash-guaranteed container is ever launched.
- worker-entry.ts: resolve the payload from the inline env or the Redis key (before scrubSensitiveEnv strips REDIS_URL), deleting the key after read. Fail-loud with distinct Sentry tags, never the cryptic exec crash.
- cascadeEnv.ts / envFilter.ts: add JOB_DATA_REDIS_KEY to the scrub/block lists (parity with JOB_DATA).

Redis chosen over a bind-mounted file because the router runs as a container and spawns sibling workers via the Docker socket; a Dockerode bind path resolves on the docker host, not inside the router container, so a file the router writes is not the file mounted into the worker without a shared host volume. Redis needs zero infra change (worker already gets REDIS_URL).
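The router-side gate and the worker-side resolve described above can be sketched roughly as follows. This is an illustration, not the code in src/router/job-data-offload.ts: `KeyValueStore`, `InMemoryStore`, `buildJobDataEnv`, `resolveJobData`, `utf8ByteLength` and the TTL value are hypothetical names and numbers chosen for the example, and the store is a synchronous in-memory stand-in (a real Redis client would be asynchronous). Only JOB_DATA, JOB_DATA_REDIS_KEY, JOB_DATA_INLINE_MAX_BYTES (96 KiB) and the cascade:jobdata:<jobId> key shape come from the description.

```typescript
// Sketch only; see the assumptions listed in the paragraph above.
const JOB_DATA_INLINE_MAX_BYTES = 96 * 1024; // 96 KiB, below the 128 KiB per-arg limit
const JOB_DATA_TTL_SECONDS = 60 * 60; // assumed value; the source only says "TTL safety net"

// Hypothetical stand-in for the few Redis operations needed here.
interface KeyValueStore {
  set(key: string, value: string, ttlSeconds: number): void;
  get(key: string): string | null;
  del(key: string): void;
}

class InMemoryStore implements KeyValueStore {
  private readonly entries = new Map<string, string>();
  set(key: string, value: string, _ttlSeconds: number): void {
    this.entries.set(key, value); // the TTL is ignored by this stand-in
  }
  get(key: string): string | null {
    return this.entries.get(key) ?? null;
  }
  del(key: string): void {
    this.entries.delete(key);
  }
}

type WorkerEnv = { JOB_DATA: string } | { JOB_DATA_REDIS_KEY: string };

// UTF-8 byte length, not string length: multi-byte characters count in full.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Router side: keep small payloads inline, offload oversized ones.
function buildJobDataEnv(jobId: string, data: unknown, store: KeyValueStore): WorkerEnv {
  const serialized = JSON.stringify(data);
  if (utf8ByteLength(serialized) <= JOB_DATA_INLINE_MAX_BYTES) {
    return { JOB_DATA: serialized };
  }
  const key = `cascade:jobdata:${jobId}`;
  // If this write throws, the error propagates and no container env is produced.
  store.set(key, serialized, JOB_DATA_TTL_SECONDS);
  return { JOB_DATA_REDIS_KEY: key };
}

// Worker side: read inline data or the offloaded key, deleting the key after read.
function resolveJobData(env: Record<string, string | undefined>, store: KeyValueStore): unknown {
  if (env.JOB_DATA !== undefined) {
    return JSON.parse(env.JOB_DATA);
  }
  const key = env.JOB_DATA_REDIS_KEY;
  if (key === undefined) {
    throw new Error("neither JOB_DATA nor JOB_DATA_REDIS_KEY is set");
  }
  const serialized = store.get(key);
  if (serialized === null) {
    throw new Error(`job data key missing or expired: ${key}`);
  }
  store.del(key);
  return JSON.parse(serialized);
}
```

Measuring bytes rather than characters matters because the kernel limit applies to the encoded string: a payload of non-ASCII text can sit under the threshold by `length` while exceeding it once encoded as UTF-8.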
Tests (TDD): new job-data-offload + round-trip (>128 KiB round-trips losslessly, no env string over the limit, no Docker); extended worker-env (byte-vs-char threshold, fail-loud), worker-entry (Redis-read success / missing key / Redis down), and shared-envFilter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
…a-argmax-offload fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)
Promotes the MNG-1660 fix from `dev` to `main` (prod).

Included: #1434, `fix(router): offload oversized worker JOB_DATA to Redis (MNG-1660)`.

Worker containers were spawned with the full job payload in the single `JOB_DATA` env var; for large PM work items (ucho/MNG-1660) this exceeded the Linux `MAX_ARG_STRLEN` (128 KiB) and the container crashed at `exec` with `argument list too long` before any JS ran. Oversized payloads are now offloaded to Redis and referenced by `JOB_DATA_REDIS_KEY`; small payloads keep the inline env path.

Full unit + integration suite green on `dev`.

🤖 Generated with Claude Code