fix(data-pump): self-heal main fetch loop on error and validate restart input#74

Merged
jbiskur merged 1 commit into main from fix/main-loop-self-heal on May 13, 2026

@jbiskur jbiskur commented May 13, 2026

Summary

  • Adds startMainLoop() that wraps loop() with .catch() + exponential 1s..30s backoff, mirroring the v0.19.1 startProcessLoop() self-heal pattern.
  • mainLoopRestartAttempts counter resets on successful pull and in stop().
  • restart() now synchronously validates state.timeBucket matches ^\d{14}$ — bad input throws at call-site instead of silently killing the pump later in the loop.
  • Try/catch around the restartTo consumption at lines 349-358 so a poisoned restart state can't permanently crash the loop.
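
A minimal sketch of the self-heal pattern described in the bullets above, not the code merged in this PR. `startMainLoop`, `loop`, `stop` and `mainLoopRestartAttempts` are names from the summary. `SelfHealingPump`, `backoffDelayMs`, `markPullSucceeded`, `retryTimer` and the constants are hypothetical and exist only for illustration.

```typescript
// Assumed constants: 1s base delay, doubling per attempt, capped at 30s.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

// Delay before restart attempt n (0-based): 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Hypothetical wrapper class; the real pump has more state than this.
class SelfHealingPump {
  private running = false;
  private mainLoopRestartAttempts = 0;
  private retryTimer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly loop: () => Promise<void>;

  constructor(loop: () => Promise<void>) {
    this.loop = loop;
  }

  start(): void {
    this.running = true;
    this.startMainLoop();
  }

  // Hypothetical hook: the loop body calls this after a successful pull.
  markPullSucceeded(): void {
    this.mainLoopRestartAttempts = 0;
  }

  stop(): void {
    this.running = false;
    this.mainLoopRestartAttempts = 0;
    // Cancel a pending retry so nothing restarts after stop().
    if (this.retryTimer !== undefined) {
      clearTimeout(this.retryTimer);
      this.retryTimer = undefined;
    }
  }

  private startMainLoop(): void {
    // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(() => this.loop())
      .catch((err: unknown) => {
        if (!this.running) return; // stopped while the loop was failing
        const delay = backoffDelayMs(this.mainLoopRestartAttempts);
        this.mainLoopRestartAttempts += 1;
        console.error(`main loop failed: ${String(err)}; retrying in ${delay} ms`);
        this.retryTimer = setTimeout(() => {
          this.retryTimer = undefined;
          if (this.running) this.startMainLoop();
        }, delay);
      });
  }
}
```

With these assumed constants, seven consecutive failures wait 1, 2, 4, 8, 16, 30 and 30 seconds, which matches the sequence in test case (b).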

Why

The 2026-05-13 outage (data-pathways outage chain bddefdd2-c377-472a-bc50-cd75f708f822) had this exact failure mode: a CP regression sent ISO-format timeBucket: "2026-05-12T13:13" → pump.restart() → loop hit getClosestTimeBucket regex → threw → pump died silently → recovery required kubectl rollout restart. The v0.19.1 fix self-heals the processor loop, but the same pattern was never applied to the main fetch loop.
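
The failure chain above starts with a malformed `timeBucket` that `restart()` accepted. A minimal sketch of the synchronous check, assuming a simplified state shape: the pattern `^\d{14}$` and the ISO value come from this PR, while `RestartState`, `assertValidTimeBucket`, `Pump` and the valid sample value are hypothetical.

```typescript
// Pattern quoted in the PR summary: exactly 14 digits.
const TIME_BUCKET_PATTERN = /^\d{14}$/;

// Assumed, simplified shape of the restart state.
interface RestartState {
  timeBucket: string;
}

// Throws at the call site instead of letting the bad value reach the loop.
function assertValidTimeBucket(state: RestartState): void {
  if (!TIME_BUCKET_PATTERN.test(state.timeBucket)) {
    throw new TypeError(
      `restart(): timeBucket must match ${TIME_BUCKET_PATTERN}, got ${
        JSON.stringify(state.timeBucket)
      }`,
    );
  }
}

// Hypothetical stand-in for the pump; only the restart path is modeled.
class Pump {
  private restartTo: RestartState | undefined = undefined;

  restart(state: RestartState): void {
    assertValidTimeBucket(state); // synchronous: the caller sees the error
    this.restartTo = state;
  }

  get pendingRestart(): RestartState | undefined {
    return this.restartTo;
  }
}

const pump = new Pump();
pump.restart({ timeBucket: "20260512131300" }); // illustrative 14-digit value
try {
  pump.restart({ timeBucket: "2026-05-12T13:13" }); // value from the outage
} catch (err) {
  console.error(String(err));
}
```

Because the check runs before any state is stored, the rejected call leaves the previously accepted restart state untouched.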

Test plan

  • deno fmt --check
  • deno lint
  • deno check src/mod.ts
  • deno test -A test/tests/data-pump-restart.test.ts — 3 tests, 9 steps, all green
  • deno test -A — 11 passed, 53 steps, 0 failed
  • TDD discipline: pre-fix verification confirmed that cases (a), (b), (d), and (e) fail meaningfully against the unfixed code

Test cases added

  • (a) Loop throws once → self-restart succeeds, attempts counter resets
  • (b) 7 successive failures → exp backoff sequence 1, 2, 4, 8, 16, 30, 30s
  • (c) stop() during pending retry → no further retry
  • (d) restart() with ISO timebucket → throws synchronously
  • (e) Loop survives a poisoned restartTo: it catches the inner throw, logs, and exits cleanly
  • (f) Happy-path regression
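
Case (e) covers the belt-and-braces guard around the restartTo consumption. One way that guard could look, as a sketch under assumptions: `resolveBucket` is a hypothetical stand-in for `getClosestTimeBucket` (whose real signature is not shown in this PR), and `consumeRestart` and `RestartOutcome` are invented names. What the loop does after a discarded restart is left to the caller here.

```typescript
// Assumed, simplified shape of the restart state.
interface RestartState {
  timeBucket: string;
}

type RestartOutcome =
  | { kind: "none" }
  | { kind: "applied"; bucket: string }
  | { kind: "discarded"; reason: string };

// Hypothetical stand-in for getClosestTimeBucket: assumed to throw on bad input.
function resolveBucket(timeBucket: string): string {
  if (!/^\d{14}$/.test(timeBucket)) {
    throw new Error(`invalid timeBucket: ${timeBucket}`);
  }
  return timeBucket;
}

// Consumes a pending restart without letting a poisoned state escape as a throw.
function consumeRestart(
  restartTo: RestartState | undefined,
  log: (msg: string) => void = console.error,
): RestartOutcome {
  if (restartTo === undefined) return { kind: "none" };
  try {
    return { kind: "applied", bucket: resolveBucket(restartTo.timeBucket) };
  } catch (err) {
    const reason = String(err);
    log(`discarding poisoned restart state: ${reason}`);
    return { kind: "discarded", reason };
  }
}
```

Returning a tagged outcome instead of rethrowing keeps the decision (skip, continue, or exit the iteration) with the loop.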

🤖 Generated with Claude Code

…rt input

Bugs A+B from data-pathways outage 2026-05-13: when the fetch loop throws
(transient API failure, invalid timeBucket from a corrupted restart command,
etc.), the pump used to die permanently — only pod restart could recover.
The processor loop already self-heals (v0.19.1, fragment
bcb6423f-c5e7-48d6-b56c-5723671b8197); this mirrors that pattern for the
main fetch loop.

Also: restart() now validates state.timeBucket against ^\d{14}$ at the
call-site (synchronous throw) so a bad command surfaces the error
immediately instead of silently killing the pump later in the loop.
The restartTo consumption is also try/catch-wrapped as belt-and-braces.

Refs: outage chain bddefdd2-c377-472a-bc50-cd75f708f822

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbiskur jbiskur merged commit 7fbbe1d into main May 13, 2026
2 checks passed
@jbiskur jbiskur deleted the fix/main-loop-self-heal branch May 13, 2026 16:06