fix(data-pump): self-heal main fetch loop on error and validate restart input #74
Merged
Bugs A+B from data-pathways outage 2026-05-13: when the fetch loop throws
(transient API failure, invalid timeBucket from a corrupted restart command,
etc.), the pump used to die permanently; only a pod restart could recover it.
The processor loop already self-heals (v0.19.1, fragment
bcb6423f-c5e7-48d6-b56c-5723671b8197); this mirrors that pattern for the
main fetch loop.
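The self-heal pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `startMainLoop`, `loop`, `stop` and `mainLoopRestartAttempts` are names taken from the description, while `SelfHealingPump`, `FetchLoop`, `backoffDelayMs` and the doubling schedule are assumptions (the description only says "exponential 1s..30s").

```typescript
// Hypothetical sketch of a self-healing fetch loop with capped exponential backoff.
const BASE_DELAY_MS = 1_000; // 1s floor, per the PR description
const MAX_DELAY_MS = 30_000; // 30s cap, per the PR description

// Delay before restart attempt n (0-based): 1s, 2s, 4s, 8s, 16s, then 30s.
// Doubling is an assumption; the PR only states the 1s..30s range.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// The loop receives a callback to invoke after each successful pull.
type FetchLoop = (onPullSuccess: () => void) => Promise<void>;

class SelfHealingPump {
  mainLoopRestartAttempts = 0;
  private running = false;
  private retryTimer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly loop: FetchLoop;

  constructor(loop: FetchLoop) {
    this.loop = loop;
  }

  start(): void {
    this.running = true;
    this.startMainLoop();
  }

  private startMainLoop(): void {
    this.loop(() => {
      // A successful pull resets the backoff counter.
      this.mainLoopRestartAttempts = 0;
    }).catch((err: unknown) => {
      if (!this.running) return; // stopped while the loop was failing
      const delay = backoffDelayMs(this.mainLoopRestartAttempts++);
      console.error(`main loop failed, restarting in ${delay}ms`, err);
      this.retryTimer = setTimeout(() => {
        this.retryTimer = undefined;
        if (this.running) this.startMainLoop();
      }, delay);
    });
  }

  stop(): void {
    this.running = false;
    this.mainLoopRestartAttempts = 0;
    // Cancel a pending retry so nothing restarts after stop().
    if (this.retryTimer !== undefined) clearTimeout(this.retryTimer);
    this.retryTimer = undefined;
  }
}
```

In this sketch a rejected `loop()` promise schedules a restart instead of leaving the pump dead, and `stop()` cancels any pending retry timer.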
Also: restart() now validates state.timeBucket against ^\d{14}$ at the
call-site (synchronous throw) so a bad command surfaces the error
immediately instead of silently killing the pump later in the loop.
The restartTo consumption is also try/catch-wrapped as belt-and-braces.
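As a rough illustration of the two guards described above, the sketch below validates at the call-site and wraps consumption of the pending restart state in try/catch. Only `restart()`, `restartTo`, `state.timeBucket` and the `^\d{14}$` pattern come from the description; `PumpState`, `assertValidTimeBucket`, `RestartablePump`, `consumeRestartTo` and the choice to clear the pending state before applying it are assumptions made for the example.

```typescript
// Hypothetical sketch: call-site validation plus a guarded restartTo consumption.
const TIME_BUCKET_RE = /^\d{14}$/; // exactly 14 digits, as stated in the PR

interface PumpState {
  timeBucket: string;
}

// Throws synchronously so the caller of restart() sees the failure immediately.
function assertValidTimeBucket(state: PumpState): void {
  if (!TIME_BUCKET_RE.test(state.timeBucket)) {
    throw new TypeError(
      `invalid timeBucket ${JSON.stringify(state.timeBucket)}: expected 14 digits`,
    );
  }
}

class RestartablePump {
  private restartTo: PumpState | undefined = undefined;

  restart(state: PumpState): void {
    assertValidTimeBucket(state); // bad input fails here, not later in the loop
    this.restartTo = state;
  }

  // Called from inside the loop. Returns true if a restart state was applied.
  consumeRestartTo(apply: (state: PumpState) => void): boolean {
    const pending = this.restartTo;
    if (pending === undefined) return false;
    // Assumption: clear first, so a poisoned state is not retried forever.
    this.restartTo = undefined;
    try {
      apply(pending);
      return true;
    } catch (err: unknown) {
      console.error("discarding restart state that failed to apply", err);
      return false;
    }
  }
}
```

With this shape, an ISO-format value such as `"2026-05-12T13:13"` is rejected by `restart()` itself, and a state that passes validation but still fails to apply is logged and dropped rather than propagating out of the loop.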
Refs: outage chain bddefdd2-c377-472a-bc50-cd75f708f822
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds startMainLoop(), which wraps loop() with .catch() + exponential 1s..30s backoff, mirroring the v0.19.1 startProcessLoop() self-heal pattern.
- mainLoopRestartAttempts counter resets on successful pull and in stop().
- restart() now synchronously validates that state.timeBucket matches ^\d{14}$; bad input throws at the call-site instead of silently killing the pump later in the loop.
- Wraps the restartTo consumption at lines 349-358 in try/catch so a poisoned restart state can't permanently crash the loop.
Why
The 2026-05-13 outage (data-pathways outage chain bddefdd2-c377-472a-bc50-cd75f708f822) had this exact failure mode: a CP regression sent ISO-format timeBucket: "2026-05-12T13:13" → pump.restart() → loop hit getClosestTimeBucket regex → threw → pump died silently → recovery required kubectl rollout restart. The v0.19.1 fix self-heals the processor loop, but the same pattern was never applied to the main fetch loop.
Test plan
- deno fmt --check
- deno lint
- deno check src/mod.ts
- deno test -A test/tests/data-pump-restart.test.ts: 3 tests, 9 steps, all green
- deno test -A: 11 passed, 53 steps, 0 failed
Test cases added
- stop() during pending retry → no further retry
- restart() with ISO timebucket → throws synchronously
- restartTo survives: loop catches inner throw, logs, exits cleanly
🤖 Generated with Claude Code