
ENH+BF: Zarr upload retry resilience, diagnostics, and explicit timeout #1827

Open
yarikoptic wants to merge 4 commits into master from bf-zarr-upload3

Conversation

@yarikoptic
Member

@yarikoptic yarikoptic commented Mar 30, 2026

Summary

  • Improve zarr upload retry resilience: exponential backoff at batch level, reduce parallelism on retry (5→3→2→1 workers), better failure diagnostics
  • Set explicit timeout (60s connect, 2h read) for zarr chunk PUT requests to prevent indefinite hangs

Follows up on #1824 (the data.seek(0) fix). While investigating #1821 further with a new log where the digest fix eliminated BadDigest errors, we found that all 188 level-0 chunks (each ~2.6 GB) still fail systematically with ConnectionAbortedError at getresponse() — the upload body is sent but TCP dies before S3's response arrives. Level-1 chunks (smaller) succeed fine. The user confirmed other large zarr uploads work from the same machine; only this previously-aborted zarr fails, pointing to a server-side issue for which a separate investigation is tracked in #1821.

Changes

Batch-level retry improvements (57c1aab):

  • Replace linear sleep(1 * retry_count) with exponential backoff (5s, 20s, 40s, 80s, 120s cap) plus random jitter
  • Halve ThreadPoolExecutor workers on each batch retry (max(1, ceil(N/2))), reducing concurrent connections that may trigger S3 prefix-level throttling
  • Add failure summary in _handle_failed_items_and_raise: failed/total counts, exception types via Counter, "systematic" flag when all failures share the same type (see the sketch after this list)
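
A minimal sketch of the batch-retry strategy described above, not the actual dandi/files/zarr.py code; names such as upload_with_retries, upload_one, and MAX_WORKERS are illustrative placeholders:

import random
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BACKOFFS = [5, 20, 40, 80, 120]  # seconds between batch retries, capped at 120
MAX_WORKERS = 5                  # initial parallelism

def upload_with_retries(items, upload_one, max_retries=5):
    workers = MAX_WORKERS
    for attempt in range(max_retries + 1):
        failures = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(lambda i: _try(upload_one, i), items)
            for item, exc in zip(items, results):
                if exc is not None:
                    failures.append((item, exc))
        if not failures:
            return
        if attempt == max_retries:
            _summarize_and_raise(failures, total=len(items))
        # exponential backoff with random jitter before retrying only the failed items
        delay = BACKOFFS[min(attempt, len(BACKOFFS) - 1)] + random.uniform(0, 5)
        time.sleep(delay)
        # halve the number of workers on every retry: 5 -> 3 -> 2 -> 1
        workers = max(1, -(-workers // 2))
        items = [item for item, _ in failures]

def _try(fn, item):
    try:
        fn(item)
        return None
    except Exception as exc:  # collected for the summary instead of raised immediately
        return exc

def _summarize_and_raise(failures, total):
    kinds = Counter(type(exc).__name__ for _, exc in failures)
    systematic = len(kinds) == 1  # all failures share the same exception type
    raise RuntimeError(
        f"Upload failure summary: {len(failures)}/{total} items failed; "
        f"exception types: {dict(kinds)}; systematic: {systematic}"
    )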

Explicit upload timeout (7fd3014):

  • Set timeout=(60, 7200) on zarr chunk PUT — 60s TCP connect, 2h response read
  • The read timeout covers the getresponse() wait after body is fully sent (where ConnectionAbortedError was observed)
  • Does NOT limit upload body transfer time (requests read timeout only applies to waiting for response data); a minimal sketch follows
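
For illustration only, a sketch of passing that explicit (connect, read) timeout to a requests PUT against a presigned S3 URL; presigned_url and chunk_path below are placeholders, not the actual dandi-cli variable names:

import requests

CONNECT_TIMEOUT = 60      # seconds to establish the TCP connection
READ_TIMEOUT = 2 * 3600   # seconds to wait for S3's response after the body is sent

presigned_url = "https://example-bucket.s3.amazonaws.com/zarr/0/0/0?X-Amz-Signature=..."  # placeholder
chunk_path = "0.0.0"  # placeholder: local path of one zarr chunk/shard file

with open(chunk_path, "rb") as f:
    resp = requests.put(
        presigned_url,
        data=f,
        # requests treats a 2-tuple as (connect timeout, read timeout); the read
        # timeout does not bound how long sending the request body itself takes
        timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
    )
resp.raise_for_status()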

Test plan

🤖 Generated with Claude Code

@satra
Member

satra commented Mar 30, 2026

is it chunks or shards, or does the code not distinguish between the types of objects? we asked people to use zarr v3 with sharding, so likely these are shards. maybe it is just a nomenclature issue in the code, which can be fixed.

@codecov

codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.43%. Comparing base (4a14a1f) to head (0f8867b).

Files with missing lines      Patch %    Lines
dandi/files/zarr.py           68.75%     5 Missing ⚠️
dandi/consts.py               0.00%      1 Missing ⚠️
dandi/tests/test_upload.py    97.82%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1827      +/-   ##
==========================================
+ Coverage   76.25%   76.43%   +0.17%     
==========================================
  Files          87       87              
  Lines       12484    12544      +60     
==========================================
+ Hits         9520     9588      +68     
+ Misses       2964     2956       -8     
Flag         Coverage Δ
unittests    76.43% <88.88%> (+0.17%) ⬆️

@yarikoptic added the "patch" (Increment the patch version when merged) label Mar 30, 2026
@yarikoptic
Member Author

Indeed, the zarr upload/download code has no clue (AFAIK) about zarr structure/metadata, and I think it does not really matter here either, since we do not tell those apart anywhere in the design, neither here in the CLI nor on the dandi-archive backend AFAIK. For the overall issue, which is more grave, see/comment on

@yarikoptic yarikoptic marked this pull request as ready for review April 5, 2026 19:20
@CodyCBakerPhD
Contributor

@yarikoptic Is this intended to fix the user's issue? Or just provide us more info?

If so, I will wait until they report back with some logs from this branch

@yarikoptic
Member Author

Both: it should make upload more resilient, fail explicitly on an attempt to upload a >5 GB file, and also provide more info in case of errors. We could ask @dstansby for one more retry, to get a sample of how it looks from the user side when trying to upload these not-yet-uploadable zarrs.

@yarikoptic
Member Author

uff -- this one fell off the radar and cognitive horizon, and my local version of the branch is not on the same base.

digging:

apparently this version was force-pushed ahead of my local version, and my local version's base is older:
❯ git merge-base origin/bf-zarr-upload3 origin/master | xargs git show
commit fa33e8d395fc1bdaa211f285114e0888851af051
Merge: f42d958e 505de8fd
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Sun Apr 5 15:19:03 2026 -0400

    Merge pull request #1822 from dandi/enh-validators
    
    ENH: Machine-readable validate output with store/reload

❯ git merge-base bf-zarr-upload3 origin/master | xargs git show
commit fb5e2f014156fd8915445d539bdb82217e67a866 (gh-candleindark/master, gh-candleindark/HEAD)
Merge: 34af06ff d7d5dd4a
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Mar 30 11:10:11 2026 -0400

    Merge pull request #1824 from dandi/bf-zarr-upload2
    
    BF: Seek file data to 0 before every retry in request(), not just for retryable status codes

and

❯ git rdp --help
'rdp' is aliased to 'range-diff '@{u}' '@{1}' '@''
❯ git rdp
1:  befe4a93 = 1:  befe4a93 ENH: Improve zarr upload retry resilience and diagnostics
2:  4a2da705 = 2:  4a2da705 BF: Set explicit timeout for zarr chunk upload PUT requests
3:  e9cdeacc = 3:  e9cdeacc TST: Add tests for zarr upload batch retry and failure diagnostics
-:  -------- > 4:  feab8876 ENH: Reject zarr chunks >5 GiB and improve upload failure logging

so I think I just rebased somewhere else... Let's proceed here, since @jjnesbitt is also working on zarr-related code in #1839 and we have already started to conflict (just in imports, no biggie), so best to merge ASAP to minimize conflicts etc.

@yarikoptic
Member Author

aha -- likely I just rebased in the web UI before as well! ;) so if tests pass -- I will merge. To me the changes look legit-ish, and the 5 GB limit had better be in place.

yarikoptic and others added 4 commits April 25, 2026 08:00
- Batch-level retry backoff: replace linear sleep(1*N) with exponential
  backoff (5s, 20s, 40s, 80s, 120s) plus random jitter, giving S3 time
  to recover from throttling

- Reduce parallelism on batch retry: halve ThreadPoolExecutor workers on
  each retry (5→3→2→1), reducing concurrent connections that may trigger
  S3 prefix-level throttling

- Better failure diagnostics in _handle_failed_items_and_raise: log a
  summary line with failed/total counts, exception types grouped by count,
  and a "systematic" flag when all failures share the same exception type.
  This makes it immediately clear whether failures are random (flaky
  network) or deterministic (server-side issue)

Motivated by investigation of #1821 where a previously-aborted OME-Zarr
upload consistently fails with ConnectionAbortedError on every retry
attempt for all 188 level-0 chunks (100% failure rate, 2259 connection
attempts), while level-1 chunks and other fresh zarr uploads succeed fine
from the same machine.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
Previously, zarr chunk uploads used no explicit timeout, relying on OS
TCP defaults.  For large zarr chunks (e.g., 2.6 GB level-0 OME-Zarr
chunks), the upload can take many minutes at typical home upload speeds
(~7 min at 50 Mbps, ~70 min at 5 Mbps).  Without a timeout, the client
either hangs indefinitely on network failures, or the OS TCP stack's
default keepalive/idle timeout (often ~120s on Windows) kills the
connection before a large upload can complete.

Set an explicit requests timeout of (60, 7200) — 60 seconds for TCP
connect, 2 hours for response read.  The read timeout covers the period
after the full request body is sent while waiting for S3's response,
which is where ConnectionAbortedError was observed in #1821.

Note: this does NOT limit the upload body transfer time itself.  The
requests library's read timeout only applies to waiting for response
data, not to sending request data.  So even very slow multi-hour uploads
will not be interrupted by this timeout.

Ref: #1821

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
- test_zarr_upload_403_batch_retry_reduces_parallelism: exercises the
  batch-level retry loop with reduced parallelism (workers: N logged)
  and exponential backoff by mocking 403 responses at the request()
  level so _upload_zarr_file returns RETRY_NEEDED

- test_zarr_upload_connection_error_diagnostics: exercises
  _handle_failed_items_and_raise summary logging by mocking
  ConnectionError on all S3 PUTs, verifying "Upload failure summary"
  includes exception type counts and "systematic" flag

Covers code added in befe4a9 and 4a2da70.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
- Add S3_MAX_SINGLE_PART_UPLOAD (5 GiB) constant and check file size
  in UploadItem.from_entry() before attempting upload.  S3 rejects
  single-part PUTs larger than 5 GiB, and since the server mints
  presigned URLs without knowing the file size, the client must guard
  against this.  Raises ValueError with a clear message about the
  multipart upload limitation.

- Log file size alongside filepath in all upload failure paths:
  _upload_zarr_file now logs at WARNING level for both HTTPError
  (non-403) and generic Exception cases, making it immediately clear
  from logs which file failed and how large it was.

- Include file size in _handle_failed_items_and_raise per-item error
  log lines.

Motivated by #1821 where 2.6 GiB level-0 zarr chunks failed with
ConnectionAbortedError but the logs didn't show the file sizes,
making it hard to diagnose the size-related nature of the failure.

Co-Authored-By: Claude Code 2.1.81 / Claude Opus 4.6 <noreply@anthropic.com>
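
As a rough illustration of the size guard described in the commit message above: the constant name matches the commit message, but the helper function and its surroundings are hypothetical, not the actual UploadItem.from_entry() code.

import os

S3_MAX_SINGLE_PART_UPLOAD = 5 * 1024**3  # 5 GiB, S3's limit for a single-part PUT

def check_single_part_size(filepath: str) -> int:
    # Hypothetical helper: reject files that S3 would refuse as a single PUT,
    # since zarr chunk uploads do not currently use multipart upload.
    size = os.path.getsize(filepath)
    if size > S3_MAX_SINGLE_PART_UPLOAD:
        raise ValueError(
            f"{filepath} is {size} bytes, exceeding the 5 GiB single-part S3 "
            "upload limit; it cannot be uploaded without multipart support"
        )
    return size
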
@yarikoptic
Member Author

@candleindark a chance for your /issue-analyze to shine, but be constructive. I will likely merge this before receiving a review, but a review could still be useful if it detects issues.

Comment thread dandi/files/zarr.py
@yarikoptic
Member Author

We are already clashing with @jjnesbitt's work, so I will likely wait here since this is not pressing, and then resolve conflicts.

), f"URL {url} should not have been retried but had {request_attempts[url]} attempts"


@pytest.mark.ai_generated
Member


I don't know the benefit of using the @pytest.mark.ai_generated mark.

Is it to mark a test that is

  1. completely generated by AI without any human intervention?
  2. completely generated by AI without any human intervention and without any human verification?
  3. generated by AI but with human intervention and approval?

If it is case 1 and/or 2, should the mark be removed if the test is later modified by a human, which can be a hassle? Additionally, case 2 is nearly impossible since we are already reviewing it now. As for case 1, how do you keep track to ensure that it actually holds?

If it is case 3, how is it different from code produced with AI as a co-author that is not part of a test? We don't/can't mark that code in a similar way.

I think we can just do away with such a mark.

Member Author


I mark even case 3 as such. If I were to invest a significant amount of time into tuning such a test manually, I would remove the marker.

Code in projects like this, whether produced by AI or not, still needs to be reviewed carefully by human(s). AI-produced tests do not necessarily require such careful review, and could be refactored/replaced. Seeing the marker tells me right away to just grasp the testing idea in them and see, overall, whether they could be instructed to be made better, since they are typically very sloppy with lots of "copy pasted" content.


Labels

cmd-upload · patch (Increment the patch version when merged) · zarr

4 participants