
ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout #1827

Open
yarikoptic wants to merge 4 commits into master from bf-zarr-upload3

Conversation

@yarikoptic
Member

@yarikoptic yarikoptic commented Mar 30, 2026

Summary

  • Improve zarr upload retry resilience: exponential backoff at batch level, reduce parallelism on retry (5→3→2→1 workers), better failure diagnostics
  • Set explicit timeout (60s connect, 2h read) for zarr chunk PUT requests to prevent indefinite hangs

Follows up on #1824 (the data.seek(0) fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with ConnectionAbortedError at getresponse() — the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes

Batch-level retry improvements (57c1aab):

  • Replace linear sleep(1 * retry_count) with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
  • Halve ThreadPoolExecutor workers on each batch retry (max(1, ceil(N/2))), reducing concurrent connections that may trigger S3 prefix-level throttling
  • Add failure summary in _handle_failed_items_and_raise: failed/total counts, exception types via Counter, "systematic" flag when all failures share the same type (see the sketch after this list)
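
A minimal sketch of the batch-retry strategy described above, not the actual dandi/files/zarr.py code; names such as upload_with_retries, upload_one, and MAX_WORKERS are illustrative placeholders:

import random
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BACKOFFS = [5, 20, 40, 80, 120]  # seconds between batch retries, capped at 120
MAX_WORKERS = 5                  # initial parallelism

def upload_with_retries(items, upload_one, max_retries=5):
    workers = MAX_WORKERS
    for attempt in range(max_retries + 1):
        failures = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(lambda i: _try(upload_one, i), items)
            for item, exc in zip(items, results):
                if exc is not None:
                    failures.append((item, exc))
        if not failures:
            return
        if attempt == max_retries:
            _summarize_and_raise(failures, total=len(items))
        # exponential backoff with random jitter before retrying only the failed items
        delay = BACKOFFS[min(attempt, len(BACKOFFS) - 1)] + random.uniform(0, 5)
        time.sleep(delay)
        # halve the number of workers on every retry: 5 -> 3 -> 2 -> 1
        workers = max(1, -(-workers // 2))
        items = [item for item, _ in failures]

def _try(fn, item):
    try:
        fn(item)
        return None
    except Exception as exc:  # collected for the summary instead of raised immediately
        return exc

def _summarize_and_raise(failures, total):
    kinds = Counter(type(exc).__name__ for _, exc in failures)
    systematic = len(kinds) == 1  # all failures share the same exception type
    raise RuntimeError(
        f"Upload failure summary: {len(failures)}/{total} items failed; "
        f"exception types: {dict(kinds)}; systematic: {systematic}"
    )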

Explicit upload timeout (7fd3014):

  • Set timeout=(60, 7200) on zarr chunk PUT — 60s TCP connect, 2h response read
  • The read timeout covers the getresponse() wait after body is fully sent (where ConnectionAbortedError was observed)
  • Does NOT limit upload body transfer time (requests read timeout only applies to waiting for response data); a minimal sketch follows
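
For illustration only, a sketch of passing that explicit (connect, read) timeout to a requests PUT against a presigned S3 URL; presigned_url and chunk_path below are placeholders, not the actual dandi-cli variable names:

import requests

CONNECT_TIMEOUT = 60      # seconds to establish the TCP connection
READ_TIMEOUT = 2 * 3600   # seconds to wait for S3's response after the body is sent

presigned_url = "https://example-bucket.s3.amazonaws.com/zarr/0/0/0?X-Amz-Signature=..."  # placeholder
chunk_path = "0.0.0"  # placeholder: local path of one zarr chunk/shard file

with open(chunk_path, "rb") as f:
    resp = requests.put(
        presigned_url,
        data=f,
        # requests treats a 2-tuple as (connect timeout, read timeout); the read
        # timeout does not bound how long sending the request body itself takes
        timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
    )
resp.raise_for_status()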

Test plan

🤖 Generated with Claude Code

@satra
Member

satra commented Mar 30, 2026

is it chunks or shards, or does the code not distinguish between the types of objects? we asked people to use zarr v3 with sharding, so likely these are shards. maybe it is just a nomenclature issue in the code, which can be fixed.

@codecov

codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.43%. Comparing base (4a14a1f) to head (0f8867b).

Files with missing lines      Patch %    Lines
dandi/files/zarr.py           68.75%     5 Missing ⚠️
dandi/consts.py               0.00%      1 Missing ⚠️
dandi/tests/test_upload.py    97.82%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1827      +/-   ##
==========================================
+ Coverage   76.25%   76.43%   +0.17%     
==========================================
  Files          87       87              
  Lines       12484    12544      +60     
==========================================
+ Hits         9520     9588      +68     
+ Misses       2964     2956       -8     
Flag         Coverage Δ
unittests    76.43% <88.88%> (+0.17%) ⬆️

@yarikoptic added the "patch" (Increment the patch version when merged) label Mar 30, 2026
@yarikoptic
Member Author

Indeed, the zarr upload/download code has no clue (AFAIK) about zarr structure/metadata, and I think it does not really matter here either, since we do not tell those apart anywhere in the design, neither here in the CLI nor on the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on

@yarikoptic yarikoptic marked this pull request as ready for review April 5, 2026 19:20
@CodyCBakerPhD
Contributor

@yarikoptic Is this intended to fix the user's issue? Or just provide us more info?

If so, I will wait until they report back with some logs from this branch

@yarikoptic
Member Author

Both: it should make upload more resilient, fail explicitly on an attempt to upload a >5 GB file, and also provide more info in case of errors. We could ask @dstansby for one more retry, to get a sample of how it looks from the user side when trying to upload these not-yet-uploadable zarrs.

@yarikoptic
Member Author

uff -- this one fell off the radar and cognitive horizon, and my local version of the branch is not on the same base.

digging:

apparently this version was force-pushed ahead of my local version, and my local version's base is older:
❯ git merge-base origin/bf-zarr-upload3 origin/master | xargs git show
commit fa33e8d395fc1bdaa211f285114e0888851af051
Merge: f42d958e 505de8fd
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Sun Apr 5 15:19:03 2026 -0400

    Merge pull request #1822 from dandi/enh-validators
    
    ENH: Machine-readable validate output with store/reload

❯ git merge-base bf-zarr-upload3 origin/master | xargs git show
commit fb5e2f014156fd8915445d539bdb82217e67a866 (gh-candleindark/master, gh-candleindark/HEAD)
Merge: 34af06ff d7d5dd4a
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Mar 30 11:10:11 2026 -0400

    Merge pull request #1824 from dandi/bf-zarr-upload2
    
    BF: Seek file data to 0 before every retry in request(), not just for retryable status codes

and

❯ git rdp --help
'rdp' is aliased to 'range-diff '@{u}' '@{1}' '@''
❯ git rdp
1:  befe4a93 = 1:  befe4a93 ENH: Improve zarr upload retry resilience and diagnostics
2:  4a2da705 = 2:  4a2da705 BF: Set explicit timeout for zarr chunk upload PUT requests
3:  e9cdeacc = 3:  e9cdeacc TST: Add tests for zarr upload batch retry and failure diagnostics
-:  -------- > 4:  feab8876 ENH: Reject zarr chunks >5 GiB and improve upload failure logging

so I think I just rebased somewhere else... Let's proceed here, since @jjnesbitt is also working on zarr-related code in #1839 and we have already started to conflict (just in imports, no biggie), so best to merge ASAP to minimize conflicts etc.

@yarikoptic
Member Author

aha -- likely I just rebased in the web UI before as well! ;) so if tests pass -- I will merge. To me the changes look legit-ish, and the 5 GB limit had better be in place.

yarikoptic and others added 4 commits April 25, 2026 08:00
- Batch-level retry backoff: replace linear sleep(1*N) with exponential
  backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time
  to recover from throttling

- Reduce parallelism on batch retry: halve ThreadPoolExecutor workers on
  each retry (5→3→2→1), reducing concurrent connections that may trigger
  S3 prefix-level throttling

- Better failure diagnostics in _handle_failed_items_and_raise: log a
  summary line with failed/total counts, exception types grouped by count,
  and a "systematic" flag when all failures share the same exception type.
  This makes it immediately clear whether failures are random (flaky
  network) or deterministic (server-side issue)

Motivated by investigation of #1821 where a previously-aborted OME-Zarr
upload consistently fails with ConnectionAbortedError on every retry
attempt for all 188 level-0 chunks (100% failure rate, 2259 connection
attempts), while level-1 chunks and other fresh zarr uploads succeed fine
from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
Previously, zarr chunk uploads used no explicit timeout, relying on OS
TCP defaults.  For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr
chunks), the upload can take many minutes at typical home upload speeds
(~7 min at 50 Mbps, ~70 min at 5 Mbps).  Without a timeout, the client
either hangs indefinitely on network failures, or the OS TCP stack's
default keepalive/idle timeout (often ~120s on Windows) kills the
connection before a large upload can complete.

Set an explicit requests timeout of (60, 7200) — 60 seconds for TCP
connect, 2 hours for response read.  The read timeout covers the period
after the full request body is sent while waiting for S3's response,
which is where ConnectionAbortedError was observed in #1821.

Note: this does NOT limit the upload body transfer time itself.  The
requests library's read timeout only applies to waiting for response
data, not to sending request data.  So even very slow multi-hour uploads
will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
- test_zarr_upload_403_batch_retry_reduces_parallelism: exercises the
  batch-level retry loop with reduced parallelism (workers: N logged)
  and exponential backoff by mocking 403 responses at the request()
  level so _upload_zarr_file returns RETRY_NEEDED

- test_zarr_upload_connection_error_diagnostics: exercises
  _handle_failed_items_and_raise summary logging by mocking
  ConnectionError on all S3 PUTs, verifying "Upload failure summary"
  includes exception type counts and "systematic" flag

Covers code added in befe4a9 and 4a2da70.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
- Add S3_MAX_SINGLE_PART_UPLOAD (5 GiB) constant and check file size
  in UploadItem.from_entry() before attempting upload.  S3 rejects
  single-part PUTs larger than 5 GiB, and since the server mints
  presigned URLs without knowing the file size, the client must guard
  against this.  Raises ValueError with a clear message about the
  multipart upload limitation.

- Log file size alongside filepath in all upload failure paths:
  _upload_zarr_file now logs at WARNING level for both HTTPError
  (non-403) and generic Exception cases, making it immediately clear
  from logs which file failed and how large it was.

- Include file size in _handle_failed_items_and_raise per-item error
  log lines.

Motivated by #1821 where 2.6 GiB level-0 zarr chunks failed with
ConnectionAbortedError but the logs didn't show the file sizes,
making it hard to diagnose the size-related nature of the failure.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
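
As a rough illustration of the size guard described in the commit message above: the constant name matches the commit message, but the helper function and its surroundings are hypothetical, not the actual UploadItem.from_entry() code.

import os

S3_MAX_SINGLE_PART_UPLOAD = 5 * 1024**3  # 5 GiB, S3's limit for a single-part PUT

def check_single_part_size(filepath: str) -> int:
    # Hypothetical helper: reject files that S3 would refuse as a single PUT,
    # since zarr chunk uploads do not currently use multipart upload.
    size = os.path.getsize(filepath)
    if size > S3_MAX_SINGLE_PART_UPLOAD:
        raise ValueError(
            f"{filepath} is {size} bytes, exceeding the 5 GiB single-part S3 "
            "upload limit; it cannot be uploaded without multipart support"
        )
    return size
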
@yarikoptic
Member Author

@candleindark a chance for your /issue-analyze to shine, but be constructive. I will likely merge this before receiving a review, but a review could still be useful if it detects issues.

Comment thread dandi/files/zarr.py
@yarikoptic
Member Author

We are already clashing with @jjnesbitt's work, so I will likely wait here since this is not pressing, and then resolve conflicts.

), f"URL {url} should not have been retried but had {request_attempts[url]} attempts"


@pytest.mark.ai_generated
Member


I don't know the benefit of using the @pytest.mark.ai_generated mark.

Is it to mark a test that is

  1. completely generated by AI without any human intervention?
  2. completely generated by AI without any human intervention and without any human verification?
  3. generated by AI but with human intervention and approval?

If it is case 1 and/or 2, should the mark be removed if the test is later modified by a human, which can be a hassle? Additionally, case 2 is nearly impossible since we are already reviewing it now. As for case 1, how do you keep track to ensure that it actually holds?

If it is case 3, how is it different from code produced with AI as a co-author that is not part of a test? We don't/can't mark that code in a similar way.

I think we can just do away with such a mark.

Member Author


I mark even case 3 as such. If I were to invest a significant amount of time into tuning such a test manually, I would remove the marker.

Code in projects like this, whether produced by AI or not, still needs to be reviewed carefully by human(s). AI-produced tests do not necessarily require such careful review, and could be refactored/replaced. Seeing the marker tells me right away to just grasp the testing idea in them and see, overall, whether they could be instructed to be made better, since they are typically very sloppy with lots of "copy pasted" content.


Labels

cmd-upload · patch (Increment the patch version when merged) · zarr

4 participants