fix: v1->v7 migration tolerates a gc'd (410 Gone) backup source (0.6.12)#44

Merged
ehsan6sha merged 1 commit into main from fix/v7-migration-gone-backup
Jun 17, 2026

@ehsan6sha

Fixes large-file uploads failing on GC-damaged buckets. Workspace 0.6.11 → 0.6.12.

Symptom

On a bucket whose forest grew past the v7 sharding threshold, a large (multi-chunk) upload uploads every chunk (200 OK) and then fails at finalize with "failed to upload large file." The trace shows:

PUT /archives/__fula_forest_v1_backup/{ts}  (x-amz-copy-source: /archives/Qm…)  → 410 Gone

followed by the SDK cleaning up the just-uploaded chunks.

Root cause

The large upload tips the forest over the v7 sharding threshold, triggering the one-time v1→v7 migration. Its first write (Step 4) is a server-side copy_object of the current forest index to a timestamped backup — and a server-side COPY must read the source. That index object's backing IPFS CID was garbage-collected (a one-off manual ipfs repo gc; HEAD still returns the ETag, but a content read 410s), so the copy fails → migration returns DeferredTransientError → upload fails. (Confirmed via read-only inspection of the gateway: auto-gc is off, so the damage is static.)

Fix (Option B — tolerate Gone)

In the migration's backup step, treat a 410/Gone copy failure as "the source content is already gone, so there is nothing to back up" → log + skip the backup and proceed. v7 is rebuilt faithfully from the already-loaded in-memory v1 forest (monolithic = whole-or-nothing, so no entry is dropped), and with auto-gc off the fresh content-addressed v7 nodes persist. Every other copy error (transient 5xx, throttling, auth, network) still defers — a transient must never masquerade as "gc'd".

Once migrated, the bucket is healthy v7 and never hits this path again — it stops referencing the gc'd v1 blob entirely.

  • error.rs: new ClientError::is_gone(): narrow match on Gone/HTTP410/410 only.
  • encryption.rs: migrate_v1_to_v7_internal Step 4 skips the backup on is_gone(), defers otherwise.
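
The two changes above can be sketched as follows. This is a minimal illustration, not the SDK's code: the `ClientError` shape (a single error-code string), the `BackupOutcome` enum, and `decide_backup` are assumptions made for the example; only the matched codes (Gone/HTTP410/410) and the skip-versus-defer rule come from this PR.

```rust
/// Hypothetical, simplified stand-in for the SDK's error type: only the
/// S3-style error code string is modeled here.
#[derive(Debug, Clone, PartialEq)]
pub struct ClientError {
    pub code: String,
}

impl ClientError {
    pub fn new(code: &str) -> Self {
        Self { code: code.to_string() }
    }

    /// Narrow match: true only for the codes that mean "410 Gone".
    pub fn is_gone(&self) -> bool {
        matches!(self.code.as_str(), "Gone" | "HTTP410" | "410")
    }
}

/// What the migration does after attempting the backup copy
/// (hypothetical names, for illustration only).
#[derive(Debug, PartialEq)]
pub enum BackupOutcome {
    BackedUp,
    SkippedSourceGone,
    Deferred(ClientError),
}

/// Step 4 decision: skip the backup only when the source is Gone;
/// every other failure still defers the migration.
pub fn decide_backup(copy_result: Result<(), ClientError>) -> BackupOutcome {
    match copy_result {
        Ok(()) => BackupOutcome::BackedUp,
        Err(e) if e.is_gone() => BackupOutcome::SkippedSourceGone,
        Err(e) => BackupOutcome::Deferred(e),
    }
}
```

Under these assumptions, `decide_backup(Err(ClientError::new("SlowDown")))` yields `Deferred`, so a throttled copy can never be mistaken for a gc'd source.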

Why skipping the backup is safe

  • The v1 source content is already gc'd, so a "copy of it" / restore point couldn't exist regardless.
  • The backup is read only by try_v1_backup_fallback, which triggers solely if a future v7 manifest becomes unreadable — and it returns None gracefully when no backup exists (surfaces the original error, no crash).
  • The thing that made forests unreadable (manual GC) is now off, so the restore point has no forward value here.
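
The "returns None gracefully" behavior in the second bullet can be pictured with a small model. Everything here is an assumption for illustration: the in-memory `Store`, the function names, and the string error are not the real `try_v1_backup_fallback` signature, which this PR does not show.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the bucket: key -> bytes.
pub type Store = HashMap<String, Vec<u8>>;

/// Sketch of the fallback: Some(bytes) if a backup exists, None otherwise.
pub fn try_backup_fallback_sketch(store: &Store, backup_key: &str) -> Option<Vec<u8>> {
    store.get(backup_key).cloned()
}

/// Read path: prefer the manifest; consult the backup only when the manifest
/// is unreadable; with no backup, surface the original error unchanged.
pub fn load_forest(
    store: &Store,
    manifest_key: &str,
    backup_key: &str,
) -> Result<Vec<u8>, String> {
    match store.get(manifest_key) {
        Some(bytes) => Ok(bytes.clone()),
        None => {
            let original = format!("manifest unreadable: {manifest_key}");
            try_backup_fallback_sketch(store, backup_key).ok_or(original)
        }
    }
}
```

In this model a bucket that skipped the backup behaves exactly like one that never needed it, unless the manifest later becomes unreadable, in which case the caller sees the original error rather than a crash.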

Both the built-in advisor and gemini-advisor independently recommended B over re-serializing a backup from memory (Option A), which adds crypto-path code + tech debt for an obsolete v1 format.

Tests

  • error.rs: is_gone_matches_only_410_gone — Gone/HTTP410/410 ⇒ true; NoSuchKey/PreconditionFailed/HTTP412/InternalError/HTTP500/SlowDown/NotFound ⇒ false (proves the narrow boundary — gate against a transient being skipped).
  • Full fula-client --features test-fault-injection suite green.
  • Migration-completes-despite-410 is validated on the live gateway by re-trying a large upload on the affected videos-v8 bucket after publish — the migration only runs against the real server (advisory lock + heartbeat), so this scenario isn't wiremock-mockable; the migration E2E tests are #[ignore] real-server for the same reason.

🤖 Generated with Claude Code

Large-file uploads failed on GC-damaged buckets: a large upload tips the
forest past the v7 sharding threshold, triggering the v1->v7 migration, whose
first step is a server-side copy_object of the v1 index to a backup key. A
server-side COPY must read the source, but that index object's backing CID was
garbage-collected (one-off manual `ipfs repo gc`; HEAD returns the ETag, a
content read 410s) -> copy fails -> migration defers -> upload fails after all
chunks already uploaded.

Fix (Option B, advisor + gemini endorsed): treat a 410/Gone backup copy as
"source content already gone, nothing to back up" -> skip the backup and
proceed. v7 is rebuilt faithfully from the already-loaded in-memory v1 forest
(monolithic = whole-or-nothing, no entry dropped); with auto-gc off the fresh
v7 nodes persist; once migrated the bucket never hits this path again. Every
OTHER copy error still defers -- a transient must not masquerade as gc'd.

Safe because: the v1 source is already gc'd (no restore point could exist
anyway); the backup is read only by try_v1_backup_fallback, which returns None
gracefully when absent and triggers only if a future v7 manifest is unreadable.

- error.rs: ClientError::is_gone() -- narrow match on Gone/HTTP410/410 only.
- encryption.rs: migrate_v1_to_v7_internal Step 4 skips backup on is_gone(),
  defers on every other error.
- test: is_gone_matches_only_410_gone (narrow-boundary guard).

Workspace 0.6.11 -> 0.6.12. Migration-completes-despite-410 validated on the
live gateway (real large-upload retry); the migration path is real-server-only
(advisory lock + heartbeat), so it isn't wiremock-mockable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ehsan6sha ehsan6sha merged commit 8fb3464 into main Jun 17, 2026
8 checks passed
ehsan6sha added a commit that referenced this pull request Jun 18, 2026
* Streaming upload P1: plan-mode ChunkedEncoder (no-AEAD pass 1) + plan doc

Foundation for web streaming + resumable large-file uploads (see
docs/web-streaming-resumable-upload-plan.md). Adds a plan-only mode to
ChunkedEncoder (into_plan_only): it still generates the per-chunk nonce, feeds
the plaintext to the BAO + content hashers, and advances the chunk count, but
skips the AEAD encrypt and retains NO ciphertext. The chunking / BAO / nonce
code is shared verbatim with the encrypting path, so plan-mode and a full
encode produce identical root_hash / content_hash / num_chunks for the same
input (the random per-chunk nonces differ, which is fine).

This is pass 1 of the streaming upload: commit the integrity root + nonce list
without holding ciphertext, then pass 2 re-encrypts each chunk from its stored
nonce (deterministic AEAD => identical ciphertext => idempotent PUT).
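
A toy model of the two passes, to show why committing nonces first makes pass 2 repeatable. Assumptions are loud here: `toy_encrypt` is an insecure XOR keystream standing in for the real AEAD, nonces are `u64` values supplied by the caller instead of random per-chunk nonces, and the BAO and content hashing done by the real plan mode is omitted. None of these names are ChunkedEncoder's API.

```rust
/// One step of the SplitMix64 generator, used only to derive a keystream.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// TOY stand-in for an AEAD (NOT secure): XOR with a keystream derived from
/// (key, nonce). It models one property only: the same (key, nonce, data)
/// always yields the same output. XOR also makes it its own inverse.
pub fn toy_encrypt(key: u64, nonce: u64, data: &[u8]) -> Vec<u8> {
    let mut state = key ^ nonce.rotate_left(32);
    data.iter()
        .map(|b| b ^ (splitmix64(&mut state) as u8))
        .collect()
}

/// Pass 1 output: what is committed before any ciphertext exists.
pub struct Plan {
    pub nonces: Vec<u64>,
    pub num_chunks: usize,
}

/// Pass 1 (plan only): assign one nonce per chunk and retain no ciphertext.
/// `chunk_size` must be non-zero (slice::chunks panics on zero).
pub fn plan(data: &[u8], chunk_size: usize, mut next_nonce: impl FnMut() -> u64) -> Plan {
    let num_chunks = data.chunks(chunk_size).count();
    let nonces = (0..num_chunks).map(|_| next_nonce()).collect();
    Plan { nonces, num_chunks }
}

/// Pass 2: encrypt chunk `index` from its committed nonce.
pub fn encrypt_chunk(key: u64, plan: &Plan, index: usize, chunk: &[u8]) -> Vec<u8> {
    toy_encrypt(key, plan.nonces[index], chunk)
}
```

Calling `encrypt_chunk` twice for the same index returns the same bytes, which is the property that lets a retried or resumed PUT be idempotent.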

Tests (fula-crypto):
- test_plan_mode_matches_full_encode: root/content_hash/num_chunks parity vs a
  full encode; plan-mode retains no ciphertext.
- test_plan_mode_nonces_reencrypt_and_decrypt_roundtrip: commit nonces ->
  re-encrypt each chunk from its stored nonce -> decode byte-exact. This is the
  resume-safety core (deterministic re-encryption from committed nonces).

Full fula-crypto suite green (452 passed); wasm32 build green via fula-flutter.
No change to the encrypting path's behavior (existing encode/decode tests pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* P2 (wip): streaming_put_chunk -- pass-2 per-chunk encrypt-from-stored-nonce + PUT

First verifiable piece of the streaming upload core (see
docs/web-streaming-resumable-upload-plan.md + task #44 notes). Adds
EncryptedClient::streaming_put_chunk: encrypts ONE caller-supplied (pushed)
plaintext chunk with the nonce committed in pass 1, then PUTs it -- mirroring
the native resume re-encrypt (~8520) and the chunk-PUT closure in
put_object_chunked_internal (transient retry_idempotent, pinning, post-PUT CID
self-verify). Deterministic AES-GCM => identical ciphertext => idempotent
content-addressed PUT, safe to retry or repeat on resume. AAD binds ciphertext
to (storage_key, chunk_index).
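
One way to picture the AAD binding is an unambiguous byte encoding of the pair. The layout below (length-prefixed key followed by a big-endian index) and the name `chunk_aad` are assumptions for illustration; the commit does not state the actual encoding.

```rust
/// Hypothetical AAD layout: u32 big-endian key length, the key bytes, then
/// the chunk index as u64 big-endian. The length prefix keeps distinct
/// (key, index) pairs from ever encoding to the same bytes.
pub fn chunk_aad(storage_key: &str, chunk_index: u64) -> Vec<u8> {
    let key = storage_key.as_bytes();
    let mut aad = Vec::with_capacity(4 + key.len() + 8);
    aad.extend_from_slice(&(key.len() as u32).to_be_bytes());
    aad.extend_from_slice(key);
    aad.extend_from_slice(&chunk_index.to_be_bytes());
    aad
}
```

Because an AEAD authenticates the AAD along with the ciphertext, a chunk encrypted for one (storage_key, chunk_index) pair fails authentication if it is replayed at a different index or under a different key.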

Compiles native (dead-code warning expected -- wired up by the streaming
session + FRB handle in following commits). No change to existing paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* P2 (wip): fula-client streaming upload core (begin / finalize_plan / put_chunk / finish)

Push-model streaming-upload methods on EncryptedClient (pub for the FRB handle +
integration tests), completing the fula-client side of the OOM fix:
- streaming_begin: prelude (ensure_forest_loaded, generate DEK, derive flat
  storage_key incl. the v7 shard-salt path, HPKE-wrap DEK, KEK version).
  Read-only vs live state; mirrors the head of put_object_flat_deferred_locked.
- streaming_finalize_plan: end of pass 1 -- finalize the plan-only encoder, build
  PrivateMetadata (size + content_hash from the pushed plaintext) + encrypted
  metadata; returns ChunkedFileMetadata (nonces + BAO root) + the metas.
- streaming_put_chunk: pass-2 per-chunk encrypt-from-stored-nonce + PUT.
- streaming_finish: index PUT (header-safe) + forest register + flush, under the
  per-bucket write lock. register_streaming_upload_in_forest mirrors the
  wasm-proven upsert in put_object_flat_deferred_locked (v7/monolithic, WAL
  auto-skipped on wasm, orphan cleanup of the prior upload).

Peak memory is bounded by what the caller holds in flight, not file size (pass 1
holds ~1 chunk; pass 2 only the pushed chunk). Compiles native + wasm (via
fula-flutter). Not yet FRB-wired or end-to-end tested -- the stateful-mock
round-trip test (P2 gate) + the FRB handle follow.
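
The call order of the four methods can be sketched as a small state machine. This is a bookkeeping-only model with hypothetical names (`Session`, `Phase`, `plan_chunk`); it performs no crypto or I/O and is not the EncryptedClient API. Its point is that the session never stores chunk bytes, which is why peak memory follows what the caller holds in flight.

```rust
/// Hypothetical phases of a push-model upload session.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase {
    Planning,
    Uploading,
    Finished,
}

/// Tracks only bookkeeping; it never stores chunk bytes.
pub struct Session {
    phase: Phase,
    planned_chunks: usize,
    uploaded: Vec<bool>,
}

impl Session {
    pub fn begin() -> Self {
        Session { phase: Phase::Planning, planned_chunks: 0, uploaded: Vec::new() }
    }

    /// Pass 1: the chunk is only counted here; the caller drops it afterwards.
    pub fn plan_chunk(&mut self, _chunk: &[u8]) -> Result<(), &'static str> {
        if self.phase != Phase::Planning {
            return Err("plan_chunk after finalize_plan");
        }
        self.planned_chunks += 1;
        Ok(())
    }

    /// End of pass 1: fix the chunk count and move to the upload phase.
    pub fn finalize_plan(&mut self) -> Result<usize, &'static str> {
        if self.phase != Phase::Planning {
            return Err("finalize_plan called twice");
        }
        self.uploaded = vec![false; self.planned_chunks];
        self.phase = Phase::Uploading;
        Ok(self.planned_chunks)
    }

    /// Pass 2: repeating an index is allowed, modeling an idempotent PUT.
    pub fn put_chunk(&mut self, index: usize, _chunk: &[u8]) -> Result<(), &'static str> {
        if self.phase != Phase::Uploading {
            return Err("put_chunk outside the upload phase");
        }
        match self.uploaded.get_mut(index) {
            Some(slot) => {
                *slot = true;
                Ok(())
            }
            None => Err("chunk index out of range"),
        }
    }

    /// Final step: only valid once every planned chunk has been uploaded.
    pub fn finish(&mut self) -> Result<(), &'static str> {
        if self.phase != Phase::Uploading {
            return Err("finish outside the upload phase");
        }
        if !self.uploaded.iter().all(|done| *done) {
            return Err("finish with chunks missing");
        }
        self.phase = Phase::Finished;
        Ok(())
    }
}
```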

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* P2: streaming upload round-trip test (the P2 gate) -- PASSES byte-exact

Hermetic test driving the streaming methods exactly as the FRB handle will
(streaming_begin -> plan-only encoder -> streaming_finalize_plan ->
streaming_put_chunk loop -> streaming_finish) against a STATEFUL wiremock that
stores PUT bodies and serves them on GET, then downloads via get_object_flat and
asserts BYTE-EXACT recovery. No network / no credentials -- runs in CI.

Proves the streaming path produces a normally downloadable, decryptable object:
pass-1 commits nonces without ciphertext, pass-2 re-encrypts each chunk from its
stored nonce, the index + forest register + flush land correctly, and the
standard download reconstructs the exact bytes (incl. the 0.6.13 body-fallback
recovering chunk nonces). walkable-v8 post-PUT CID self-verify also exercised.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* P2 (done): FRB push-model streaming-upload handle + finalize the core

Adds the FRB handle that exposes the streaming upload to Dart (fula-flutter
forest.rs), completing P2 of the web streaming + resumable upload plan:
- StreamingUploadHandle (opaque) + StreamingPlanInfo (num_chunks, chunk_size).
- streaming_upload_begin / _plan_chunk / _finalize_plan / _upload_chunk /
  _finish. Dart slices the file from a Blob and drives the two passes; the handle
  never holds the whole file. The std::sync::Mutex is held only for brief sync
  critical sections (never across an .await), so concurrent _upload_chunk calls
  for distinct indices run their PUTs in parallel (Dart bounds concurrency).
  Pure Dart->Rust calls + handle state -- no Rust->Dart callback.
- streaming_put_chunk now reads walkable_v8 from config (one fewer param).
- cid added as a direct fula-flutter dep (handle stores per-chunk CID hints).
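
The locking discipline described for the handle (lock briefly, release, do the slow work, lock briefly again) can be sketched with plain threads. This is an assumed model: `Handle`, `HandleState`, and the checksum standing in for "encrypt + PUT" are hypothetical, and the real handle uses async PUTs instead of `std::thread`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical shared state behind the opaque handle.
pub struct HandleState {
    pub nonces: Vec<u64>,
    pub done: Vec<bool>,
}

pub struct Handle {
    state: Mutex<HandleState>,
}

impl Handle {
    pub fn new(nonces: Vec<u64>) -> Self {
        let done = vec![false; nonces.len()];
        Self { state: Mutex::new(HandleState { nonces, done }) }
    }

    pub fn upload_chunk(&self, index: usize, chunk: &[u8]) -> u64 {
        // Critical section 1: copy out what the slow step needs, then release.
        let nonce = {
            let guard = self.state.lock().unwrap();
            guard.nonces[index]
        }; // guard dropped here, before any slow work

        // Slow step (stands in for encrypt + PUT); no lock is held.
        let checksum = chunk
            .iter()
            .fold(nonce, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64));

        // Critical section 2: record completion.
        self.state.lock().unwrap().done[index] = true;
        checksum
    }

    pub fn all_done(&self) -> bool {
        self.state.lock().unwrap().done.iter().all(|d| *d)
    }
}

/// Uploads every chunk on its own thread; distinct indices do not block
/// each other because the lock is never held during the slow step.
pub fn upload_all(handle: Arc<Handle>, chunks: Vec<Vec<u8>>) -> Vec<u64> {
    let workers: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(i, c)| {
            let h = Arc::clone(&handle);
            thread::spawn(move || h.upload_chunk(i, &c))
        })
        .collect();
    workers.into_iter().map(|w| w.join().unwrap()).collect()
}
```

The same shape applies in async code: a `std::sync::Mutex` guard that is dropped before the first `.await` cannot stall other tasks while a PUT is in flight.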

Verified: fula-flutter compiles native (the Send+Sync gate for FRB opaques) AND
wasm32; the fula-client round-trip test still passes byte-exact. The handle
drives the exact sequence that test proves end-to-end. Dart bindings are
FRB-codegen'd at publish; the live wasm path is validated at P6 (browser e2e).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* P2: real-server streaming upload e2e -- PASSES byte-exact (50MB / 200 chunks)

Real-gateway e2e (#[ignore]; run with the Mode A creds): drives the streaming
sequence (begin -> plan-only encoder -> finalize_plan -> put_chunk loop ->
finish) for a 50 MB / 200-chunk file, downloads via get_object_flat, asserts
byte-exact. 200 chunks @ 256 KB pushes the index metadata past the 16 KB header
budget, so it also exercises header_safe_enc_metadata stripping + body/forest
fallback on the real server.

Verified on the production gateway: 52,428,800 bytes round-tripped byte-exact in
108s. Complements the hermetic streaming_upload_roundtrip.rs (mock).
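
The figures above are consistent with each other, as a quick check shows. The helper names are hypothetical; the inputs (50 MB read as 50 MiB, 256 KB read as 256 KiB, 108 s) are the ones quoted in this commit message.

```rust
/// Number of chunks needed for `file_size` bytes at `chunk_size` bytes each
/// (rounds up so a trailing partial chunk is counted).
pub fn chunk_count(file_size: u64, chunk_size: u64) -> u64 {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (file_size + chunk_size - 1) / chunk_size
}

/// Average throughput in whole bytes per second (integer division).
pub fn average_bytes_per_sec(bytes: u64, seconds: u64) -> u64 {
    assert!(seconds > 0, "seconds must be non-zero");
    bytes / seconds
}

pub const KIB: u64 = 1024;
pub const MIB: u64 = 1024 * KIB;
```

With these, `chunk_count(50 * MIB, 256 * KIB)` is 200 and `50 * MIB` is 52,428,800 bytes, matching the test's name and the reported byte count; `average_bytes_per_sec(52_428_800, 108)` comes to 485,451 bytes per second, a little under 0.5 MiB/s.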

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>