fix(bootstrap): retry transient manifest fetch failures at boot #12

Merged
jh-lee-cryptolab merged 1 commit into CryptoLabInc:main from jh-lee-cryptolab:fix/manifest-fetch-retry
Jun 9, 2026

Conversation

@jh-lee-cryptolab

Contributor

Summary

  • What changed: FetchManifest now retries the manifest network GET with bounded exponential backoff (3 attempts, backoff 5s → 15s → 45s, aborts immediately on ctx cancellation), reusing the existing downloadWithRetry constants. The GET is split into fetchManifestBody; manifest parse / version validation stays outside the retry so deterministic errors fail fast.
  • Why: Artifact downloads already ride out transient failures via downloadWithRetry, but the manifest fetch was a single http.DefaultClient.Do — a transient GitHub CDN failure (a 504 was observed in production) hard-failed the daemon at boot. This closes that last unguarded fetch in the boot path.
  • Scope: Manifest-fetch resilience only. No mirror/fallback work, no success-path behavior change.

Validation

  • go build ./..., go vet ./internal/bootstrap/, go test ./internal/bootstrap/ all pass.
  • Added TestFetchManifest_RetriesTransient (two 504s then success); the existing 500 error test still surfaces the error (now after bounded retries, with backoff compressed in tests).

Notes for Reviewers

  • Risk areas: A persistently failing manifest fetch is now attempted up to 3 times before the error surfaces, so a hard failure at boot takes longer to report. downloadRetryBackoff is a package var so tests can compress it; parse/version errors are deliberately not retried.
  • Backward compatibility: No breaking changes. The FetchManifest(ctx) signature is unchanged; retries are logged with slog.Warn.
  • Follow-up: Pairs with rune#165 ("fix(bootstrap): retry transient network failures during install"), which adds the same retry convention to the rune CLI bootstrap chain.

FetchManifest did a single HTTP GET, so a transient GitHub CDN failure
(e.g. a 504) hard-failed the daemon at boot even though artifact
downloads already retry via downloadWithRetry. This wraps the manifest
fetch in the same bounded exponential-backoff retry (3 attempts, backoff
5s -> 15s -> 45s, ctx-cancel aware), reusing the existing download retry
constants. Only the network fetch is retried; deterministic parse/version
errors fail fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jh-lee-cryptolab jh-lee-cryptolab merged commit 42aa3d9 into CryptoLabInc:main Jun 9, 2026
6 checks passed

2 participants