
fix(bootstrap): retry transient network failures during install #165

Merged
jh-lee-cryptolab merged 1 commit into CryptoLabInc:main from jh-lee-cryptolab:fix/bootstrap-retry-504
Jun 9, 2026

Conversation

@jh-lee-cryptolab

Contributor

Summary

  • What changed: Added bounded exponential-backoff retry to every blocking network fetch in the rune CLI bootstrap chain (bin/rune shell + Go internal/bootstrap).
    • bin/rune: --retry 3 --retry-delay 2 on the latest-release lookup curl and the binary/checksums.txt download curls.
    • internal/bootstrap/download.go: new shared withRetry helper (3 attempts, backoff 2s → 6s → 18s, aborts immediately on ctx cancellation).
    • internal/bootstrap/install.go: artifact downloads (DownloadAndVerify, raw + tarball) now go through withRetry; logf is threaded so retries are visible in install logs.
    • internal/bootstrap/manifest.go: manifest fetch now retries the network GET via withRetry. Body parse / version checks stay outside the retry so deterministic errors fail fast.
  • Why: The bootstrap chain had no retry anywhere, so a single transient GitHub CDN failure (a 504 was observed in production) aborted the entire install. runed already rides these out via downloadWithRetry; the rune CLI did not. This closes that asymmetry.
  • Scope: Install-flow resilience only (the P0 retry gaps). No mirror/fallback URL work, no behavior change on the success path.

Validation

  • Tests run (or explain why not): go build ./..., go vet ./internal/bootstrap/, go test ./internal/bootstrap/ all pass. Added TestFetchManifest_RetriesTransient (two 504s then success) and confirmed the existing 404 test still fails fast. bash -n bin/rune passes.
  • Docs updated (if behavior/setup changed): no user-facing setup change; retry is transparent.
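
The "two 504s then success" scenario can be reproduced with a local `httptest` server, as in the sketch below. It is written as a standalone program rather than as the actual TestFetchManifest_RetriesTransient, whose body is not shown in this PR description. `fetchWithRetry`, `fetchOnce`, and `runScenario` are hypothetical names, and the fixed 1ms delay stands in for the compressed backoff.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// fetchOnce performs a single GET and treats any non-200 status as an error.
func fetchOnce(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// fetchWithRetry is a hypothetical stand-in for the retried manifest GET.
func fetchWithRetry(url string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay)
		}
		body, err := fetchOnce(url)
		if err == nil {
			return body, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

// runScenario serves two 504 responses followed by a 200 and reports
// the body received, the number of requests seen, and any error.
func runScenario() (string, int32, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) <= 2 {
			w.WriteHeader(http.StatusGatewayTimeout)
			return
		}
		fmt.Fprint(w, `{"version":"1.2.3"}`)
	}))
	defer srv.Close()

	body, err := fetchWithRetry(srv.URL, 3, time.Millisecond)
	return string(body), atomic.LoadInt32(&hits), err
}

func main() {
	_, hits, err := runScenario()
	fmt.Println("requests:", hits, "err:", err) // requests: 3 err: <nil>
}
```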

Cross-Agent Invariants

  • No agent-specific script duplicates bootstrap/setup logic
  • Agent-specific scripts remain thin adapters (registration/wiring only)
  • Codex-only commands (codex mcp ...) are clearly separated from cross-agent/common instructions
  • Claude/Gemini/OpenAI instructions do not include Codex-only commands
  • SKILL.md, commands/rune/*.toml, and AGENT_INTEGRATION.md stay consistent on boundaries

Notes for Reviewers

  • Risk areas: A persistently failing fetch (e.g. wrong pinned version → 404) is now attempted up to 3× before the error surfaces, adding ~8s of latency on that error path (the 2s and 6s waits between the three attempts). Backoff is a package var so tests can compress it; manifest parse errors are deliberately not retried.
  • Backward compatibility impact: None. FetchManifest and installArtifact signatures gained a logf func(string, ...any) param (internal package only); all call sites updated.
  • Follow-up work (if any): P1 retry gaps in rune-admin/install.sh version-lookup curls, and optional mirror/fallback-URL support, are tracked separately.

The install bootstrap chain had no retry anywhere, so a single transient
GitHub CDN failure (e.g. a 504) aborted the whole install. This adds
bounded exponential-backoff retry to every blocking fetch in the rune CLI
bootstrap, matching the convention already used by runed's downloadWithRetry.

- bin/rune: add `--retry 3 --retry-delay 2` to the version-lookup and
  binary/checksums curls so a transient CDN blip is ridden out.
- internal/bootstrap: add a shared withRetry helper (3 attempts, backoff
  2s -> 6s -> 18s, ctx-cancel aware) and wrap artifact downloads
  (DownloadAndVerify) and the manifest fetch. Only the network fetch is
  retried; deterministic parse/version errors fail fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jh-lee-cryptolab merged commit 8cb8268 into CryptoLabInc:main on Jun 9, 2026
1 check passed

2 participants