ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout #1827
yarikoptic wants to merge 4 commits into master from bf-zarr-upload3
Conversation
Is it chunks or shards, or does the code not distinguish between the types of objects? We asked people to use zarr v3 with sharding, so likely these are shards. Maybe it is just a nomenclature issue in the code which can be fixed.
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1827 +/- ##
==========================================
+ Coverage 76.25% 76.43% +0.17%
==========================================
Files 87 87
Lines 12484 12544 +60
==========================================
+ Hits 9520 9588 +68
+ Misses 2964 2956 -8
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
The zarr upload/download code has no clue (AFAIK) about zarr structure/metadata indeed, and I think here it also does not really matter, since we do not tell those apart anywhere in the design -- not here in the CLI nor on the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on
Force-pushed: feab887 to 1ecb8aa (Compare)
@yarikoptic Is this intended to fix the user's issue, or just provide us more info? If so, I will wait until they report back with some logs from this branch.
Both: it should make upload more resilient, and also fail explicitly on an attempt to upload a file larger than 5 GiB.
uff -- this one fell off the radar and the cognitive horizon, and my local version of the branch is not on the same base. Digging: apparently this version was force-pushed ahead of my local version, and the local version's base is older:

❯ git merge-base origin/bf-zarr-upload3 origin/master | xargs git show
commit fa33e8d395fc1bdaa211f285114e0888851af051
Merge: f42d958e 505de8fd
Author: Yaroslav Halchenko <debian@onerussian.com>
Date: Sun Apr 5 15:19:03 2026 -0400
Merge pull request #1822 from dandi/enh-validators
ENH: Machine-readable validate output with store/reload
❯ git merge-base bf-zarr-upload3 origin/master | xargs git show
commit fb5e2f014156fd8915445d539bdb82217e67a866 (gh-candleindark/master, gh-candleindark/HEAD)
Merge: 34af06ff d7d5dd4a
Author: Yaroslav Halchenko <debian@onerussian.com>
Date: Mon Mar 30 11:10:11 2026 -0400
Merge pull request #1824 from dandi/bf-zarr-upload2
BF: Seek file data to 0 before every retry in request(), not just for retryable status codes
and so I think I just rebased somewhere else ... Let's proceed here, since @jjnesbitt is also working on zarr-related code in #1839 and we already started to conflict (just in imports, no biggie), so best to merge ASAP to minimize conflicts etc.
Force-pushed: 1ecb8aa to f4478ab (Compare)
aha -- likely I just rebased in the web UI before as well! ;) So if tests pass -- I will merge. To me the changes look legit-ish, and limiting to 5 GiB had better be in place.
- Batch-level retry backoff: replace the linear sleep(1*N) with exponential backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time to recover from throttling
- Reduce parallelism on batch retry: halve the ThreadPoolExecutor workers on each retry (5→3→2→1), reducing concurrent connections that may trigger S3 prefix-level throttling
- Better failure diagnostics in _handle_failed_items_and_raise: log a summary line with failed/total counts, exception types grouped by count, and a "systematic" flag when all failures share the same exception type. This makes it immediately clear whether failures are random (flaky network) or deterministic (server-side issue)

Motivated by the investigation of #1821, where a previously-aborted OME-Zarr upload consistently fails with ConnectionAbortedError on every retry attempt for all 188 level-0 chunks (100% failure rate, 2259 connection attempts), while level-1 chunks and other fresh zarr uploads succeed fine from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
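For reference, a minimal sketch of the retry pattern described in this commit message; the backoff schedule and the worker-halving rule come from the message above, but the function and variable names are illustrative, not the actual dandi-cli internals:

```python
import math
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Backoff schedule per the commit message: 5s, 20s, 40s, 80s, capped at 120s.
BACKOFF_SCHEDULE = [5, 20, 40, 80, 120]


def retry_batch(items, upload_one, max_retries=5, initial_workers=5):
    """Retry a batch of uploads with exponential backoff plus jitter,
    halving the thread-pool size on every retry round (5 -> 3 -> 2 -> 1)."""
    workers = initial_workers
    for attempt in range(max_retries + 1):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(upload_one, items))
        failed = [item for item, ok in zip(items, results) if not ok]
        if not failed:
            return
        if attempt == max_retries:
            raise RuntimeError(
                f"{len(failed)} of {len(items)} items still failing after {max_retries} retries"
            )
        delay = BACKOFF_SCHEDULE[min(attempt, len(BACKOFF_SCHEDULE) - 1)]
        time.sleep(delay + random.uniform(0, delay / 4))  # jitter spreads out retries
        workers = max(1, math.ceil(workers / 2))  # reduce concurrency each round
        items = failed
```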
Previously, zarr chunk uploads used no explicit timeout, relying on OS TCP defaults. For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr chunks), the upload can take many minutes at typical home upload speeds (~7 min at 50 Mbps, ~70 min at 5 Mbps). Without a timeout, the client either hangs indefinitely on network failures, or the OS TCP stack's default keepalive/idle timeout (often ~120s on Windows) kills the connection before a large upload can complete.

Set an explicit requests timeout of (60, 7200): 60 seconds for TCP connect, 2 hours for response read. The read timeout covers the period after the full request body is sent while waiting for S3's response, which is where ConnectionAbortedError was observed in #1821.

Note: this does NOT limit the upload body transfer time itself. The requests library's read timeout only applies to waiting for response data, not to sending request data. So even very slow multi-hour uploads will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
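As an illustration of the (connect, read) timeout tuple semantics described above; the URL and file path are placeholders, and only the (60, 7200) values come from the commit:

```python
import requests

# (connect timeout, read timeout): 60 s to establish the TCP connection,
# 2 h to wait for S3's response after the request body has been sent.
# The read timeout does NOT bound how long the body itself may take to upload.
UPLOAD_TIMEOUT = (60, 7200)


def put_chunk(presigned_url: str, path: str) -> requests.Response:
    with open(path, "rb") as f:
        resp = requests.put(presigned_url, data=f, timeout=UPLOAD_TIMEOUT)
    resp.raise_for_status()
    return resp
```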
- test_zarr_upload_403_batch_retry_reduces_parallelism: exercises the batch-level retry loop with reduced parallelism ("workers: N" logged) and exponential backoff by mocking 403 responses at the request() level so _upload_zarr_file returns RETRY_NEEDED
- test_zarr_upload_connection_error_diagnostics: exercises _handle_failed_items_and_raise summary logging by mocking ConnectionError on all S3 PUTs, verifying that "Upload failure summary" includes exception type counts and the "systematic" flag

Covers code added in befe4a9 and 4a2da70.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
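Roughly, forcing a 403 at the request() level can be done with unittest.mock along these lines; this is a generic sketch, not the exact test code, and the real tests may patch a different layer than requests.Session:

```python
from unittest import mock

import requests


def fake_403_request(self, method, url, **kwargs):
    """Stand-in for Session.request that returns a canned 403 for every call,
    pushing the upload code down its retry path without any network traffic."""
    resp = requests.Response()
    resp.status_code = 403
    resp.url = url
    return resp


# Hypothetical usage inside a test:
# with mock.patch.object(requests.Session, "request", new=fake_403_request):
#     run_zarr_upload(...)                 # placeholder for the code under test
#     assert "workers:" in caplog.text     # caplog being the pytest logging fixture
```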
- Add an S3_MAX_SINGLE_PART_UPLOAD (5 GiB) constant and check the file size in UploadItem.from_entry() before attempting upload. S3 rejects single-part PUTs larger than 5 GiB, and since the server mints presigned URLs without knowing the file size, the client must guard against this. Raises ValueError with a clear message about the multipart upload limitation.
- Log the file size alongside the filepath in all upload failure paths: _upload_zarr_file now logs at WARNING level for both HTTPError (non-403) and generic Exception cases, making it immediately clear from the logs which file failed and how large it was.
- Include the file size in _handle_failed_items_and_raise per-item error log lines.

Motivated by #1821, where 2.6 GiB level-0 zarr chunks failed with ConnectionAbortedError but the logs didn't show the file sizes, making it hard to diagnose the size-related nature of the failure.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
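The size guard amounts to roughly the following; the constant value and the ValueError are per the commit message, while the function name and error wording here are illustrative:

```python
S3_MAX_SINGLE_PART_UPLOAD = 5 * 1024**3  # 5 GiB: S3 rejects larger single-part PUTs


def check_single_part_uploadable(filepath: str, size: int) -> None:
    """Fail fast instead of letting S3 reject the PUT after a long transfer."""
    if size > S3_MAX_SINGLE_PART_UPLOAD:
        raise ValueError(
            f"{filepath} is {size} bytes, which exceeds the 5 GiB limit for a "
            "single-part S3 upload; uploading it would require multipart upload"
        )
```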
Force-pushed: f4478ab to 0f8867b (Compare)
@candleindark a chance for your review?
We are already clashing with @jjnesbitt's work, so I will likely wait here since it is not pressing, and then will resolve conflicts.
| ), f"URL {url} should not have been retried but had {request_attempts[url]} attempts" | ||
|
|
||
|
|
||
@pytest.mark.ai_generated
I don't know the benefit of using the @pytest.mark.ai_generated mark.
Is it to mark a test that is
1. completely generated by AI without any human intervention?
2. completely generated by AI without any human intervention and without any human verification?
3. generated by AI but with human intervention and approval?

If it is case 1 and/or 2, should the mark be removed if the test is modified by a human in the future, which can be a hassle? Additionally, case 2 is nearly impossible since we are already reviewing it now. As for case 1, how do you keep track to ensure that case 1 is actually the case?
If it is case 3, how is it different from code produced with AI as a co-author that is not part of a test? We don't/can't mark that code in a similar way.
I think we can just do away with such a mark.
I mark even case 3 as such. If I invested a significant amount of time into tuning that test manually, I would remove the marker.
Code produced with AI or not, in projects like this, still needs to be reviewed carefully by human(s). AI-produced tests do not necessarily require such careful review, and could be refactored/replaced. Seeing the marker tells me right away to just grasp the testing idea in them and see overall whether the AI could be instructed to make them better, since they are typically very sloppy with lots of "copy pasted" content.
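For context on the mechanics being discussed: a custom mark like ai_generated is normally registered in pytest configuration (so pytest does not warn about an unknown mark) and can then be selected or deselected at run time. A generic sketch, not the project's actual configuration:

```python
# conftest.py
def pytest_configure(config):
    # Register the custom marker so pytest does not emit an unknown-mark warning.
    config.addinivalue_line(
        "markers",
        "ai_generated: test was produced primarily by an AI assistant",
    )


# in a test module
import pytest


@pytest.mark.ai_generated
def test_something():
    assert True
```

Running `pytest -m ai_generated` (or `pytest -m "not ai_generated"`) then selects or skips those tests.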
Summary
Follows up on #1824 (the data.seek(0) fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with ConnectionAbortedError at getresponse(): the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes
Batch-level retry improvements (57c1aab):
- Replace sleep(1 * retry_count) with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
- Halve ThreadPoolExecutor workers on each batch retry (max(1, ceil(N/2))), reducing concurrent connections that may trigger S3 prefix-level throttling
- Failure summary in _handle_failed_items_and_raise: failed/total counts, exception types via Counter, "systematic" flag when all failures share the same type

Explicit upload timeout (7fd3014):
- Set timeout=(60, 7200) on zarr chunk PUT: 60s TCP connect, 2h response read
- The read timeout covers the getresponse() wait after the body is fully sent (where ConnectionAbortedError was observed)
- Does not limit the upload body transfer time itself (the requests read timeout only applies to waiting for response data)

Test plan
🤖 Generated with Claude Code