Use multi-part upload on zarr chunks larger than 5GB #1839
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##           master    #1839      +/-   ##
==========================================
+ Coverage   76.26%   76.27%   +0.01%
==========================================
  Files          87       87
  Lines       12512    12575      +63
==========================================
+ Hits         9542     9592      +50
- Misses       2970     2983      +13
Force-pushed from b427dc7 to 0e505bd
Force-pushed from c1843cd to 367bd24
@yarikoptic I believe this is in a good state. Regarding the test failures, it's worth noting that tests against the API are almost guaranteed to fail, as they blindly pull the latest archive image.
Thank you @jjnesbitt ! I meanwhile submitted a fix to mitigate the CI failures. Claude identified the same "cause" and provided some of its "aiplanation", but it could be slop, so beware.
Force-pushed from 367bd24 to 83db4f4
@jjnesbitt FYI - I have rebased on the now-green master.
Force-pushed from 83db4f4 to 7121c31
@yarikoptic It would be good to get your review/approval on this PR. There is only one remaining test failure, and it is due to the fact that the version of the API the tests are being run against doesn't contain the new URLs required for this change.
re tests:
So let's make such test(s) conditional and skip them if run against a released/older version of the archive. Is there anything in the
We could do that, but it would achieve very little, since as soon as the feature is merged to master and the new image is built, they will immediately be re-enabled again. There's nothing at the
Since the API change is backwards compatible, I suggest we simply merge the API change once it's ready (hopefully soon), at which point the tests will use the new image and things are back to normal.
re server-capability conditioning -- then the client should ideally be aware of that too at run time (not only in the tests), so it could just refuse/error when trying to upload 5TB files, as I implemented a while back in #1827 and will be merging shortly, instead of hitting a non-existent endpoint at EMBER and LINC (if they keep it afloat). I will leave it to you to decide how the client could discover the server capability here.
Sorry @jjnesbitt, I will not be merging a PR with totally red CI; it is out of the question.
Ideally it should actually test that the server works, with at least some TEMP commit which would somehow provision a container with that new mighty version of the server from the PR. But I guess that is for the future, to do something like I did for bids-examples, where in the PR description we can point to prospective PRs for bids-specification and bids-validator to test against: https://github.com/bids-standard/bids-examples/blob/master/.github/workflows/validate_datasets.yml#L35
That is not what I was suggesting. Once the API PR is merged, this CI will become green, because it will be using the latest image of the archive.
ok, let's wait then. |
yarikoptic
left a comment
Some minor ones about imports, but also a question to clarify why not to use a smaller threshold.
Force-pushed from 9f79117 to 3be1ccd
To avoid complicated conditioning of this feature based on the archive's capability (which they may quickly gain by upgrading anyway), we could do the following:
This removes any conditional logic based on archive capabilities in the CLI (which should be discouraged anyway) and puts the compatibility check solely on the archive side. Once LINC/EMBER merge in the supporting change, the CLI version can be removed from
@yarikoptic Do you have objections to this?
Yes, I have objections, as
But before we jump either way, I think we still need to elaborate on the overall details of the change in behavior! Thanks for quickly addressing the question of diverging checksum/etag'ing -- I left a comment there, as I think the multipart upload etag computation must be consistent with how we compute it for regular blobs, shouldn't it? And I (at least) still do not have a complete picture of how it
Potential "resolutions" to consider:
or, once again, my question -- am I overcomplicating?
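For reference on the etag-consistency point above: an S3-style multipart ETag (essentially the scheme dandi's blob etag digests follow) is the MD5 of the concatenated binary MD5s of the parts, suffixed with the part count. A minimal sketch, with a placeholder part size rather than whatever part sizing the archive actually uses:

```python
import hashlib


def multipart_style_etag(data: bytes, part_size: int = 64 * 1024 * 1024) -> str:
    """Illustrative only: compute an S3 multipart-style ETag for in-memory data."""
    parts = [data[i : i + part_size] for i in range(0, len(data), part_size)] or [b""]
    if len(parts) == 1:
        # Single-part uploads get a plain MD5 ETag.
        return hashlib.md5(parts[0]).hexdigest()
    # Multipart uploads get md5(concatenated binary part md5s) plus "-<part count>".
    combined = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"
```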
And I'll again reiterate, the archive PR is not yet merged, which is why the CI is red. Once that happens, it will be green.
What other clients besides LINC/EMBER? Why is simply excluding the new CLI version not sufficient to support these archives?
The CLI code for zarr upload is extremely convoluted as it is; adding this conditional logic further exacerbates that. However, if you're really determined to implement this conditional logic, then I'll leave it to you.
The zarr-checksum library doesn't need to know specifics of how the digests are computed. As it stands now, we simply pass it the items containing the path and digest, so we control what digests it uses. There should be no issue here.
I think short term this is probably the best solution, as it immediately unblocks people trying to upload zarrs with parts over this limit. The underlying issue should still be re-addressed, but we can hold off until there's not a funding gap looming.
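To illustrate the point above about passing items with path and digest: zarr-checksum only aggregates the (path, size, digest) leaves it is handed, so the caller decides whether a given leaf's digest is a plain MD5 or a multipart-style ETag. A rough sketch of driving the library that way; treat the import path and method names as assumptions to verify against the installed zarr-checksum version:

```python
from pathlib import Path

from zarr_checksum.tree import ZarrChecksumTree  # import path assumed; verify locally


def aggregate_zarr_checksum(items):
    """items: iterable of (relative_path, size_in_bytes, digest_string) tuples."""
    tree = ZarrChecksumTree()
    for path, size, digest in items:
        # The library never recomputes or inspects the digest; whatever string is
        # passed here (plain MD5 or "<md5>-<nparts>" ETag) is what gets rolled up
        # into the zarr-wide checksum.
        tree.add_leaf(Path(path), size, digest)
    return str(tree.process())  # zarr directory digest, e.g. "<md5>-<count>--<size>"
```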
Great -- when it gets green we can merge! I feel like I'm in Groundhog Day ;) I guess we are missing each other's point on this one.
Because it would require THEM to exclude, whenever WE introduce breakage, so I would prefer US to avoid breakage that relies on clients to mitigate, and then re-mitigate whenever we update or release yet another version. Long term -- it is the client/server couple's job IMHO to agree on acceptable interfaces to be used.
Gotcha, I will look into that... tomorrow morning. My only remaining question, I guess, is about the absent testing against the WiP archive. Did you have a chance to test it against some instance? What was its
Doesn't it provide the compute across the file tree?:
Agreed! Let's stick to the 5GB limit as the decision point for now.
Force-pushed from c349a4c to 6d99670
I've done this both by running the integration tests with this branch of the CLI code (a successful run) and by running the corresponding branch of the API locally and pointing the tests at it, instead of using the docker-compose pytest fixture. As far as what test can be done to check for server compatibility, the existence of the
What you've linked is one of the "default" file generators included in the
I've switched this back.
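For context, those "default" file generators are the library's built-in walkers that hash every file themselves. A hedged sketch of the out-of-the-box usage, with function names as I recall them from the zarr-checksum README (verify against your version), in contrast to the explicit-digest usage sketched earlier:

```python
from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local

# Walk a local zarr on disk and let the library compute each file's digest itself,
# rather than supplying precomputed digests leaf by leaf as the upload code does.
checksum = compute_zarr_checksum(yield_files_local("path/to/some.zarr"))
print(checksum)
```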
…_upload->singlepart_items

The split of a batch into "items above the 5GB threshold" and "items at or below" is really a split between items routed through the multipart upload endpoint and items uploaded with a single signed PUT. Naming them after the upload mechanism rather than the size makes the subsequent capability/gating logic read more naturally. Pure rename, no behavior change.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dandi-archive#2784 (still open) introduces the ``/uploads/zarr/initialize/`` endpoint required to upload zarr chunks larger than 5 GiB. Until that lands and is deployed, posting to the endpoint returns 404, and the multi-part code path here would surface that as an opaque ``HTTP404Error`` from inside ``_multipart_upload``.

Add a lazy ``DandiAPIClient.supports_zarr_multipart_upload`` cached property that probes the endpoint once with an empty POST body: 404 means the server lacks the feature; anything else (400 for the bogus payload, 401/403 for auth) means it is present. In ``ZarrAsset.iter_upload``, right after the batch is split into ``multipart_items`` and ``singlepart_items``, raise an ``UploadError`` when there are multipart items and the server does not support them. The error reports the count and aggregate size of the oversized chunks plus the path of the largest one, mirroring the wording proposed in #1827.

Marked with a TODO: the check (and the cached property) can be removed once all supported servers ship dandi-archive#2784 (> 0.23.0).

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
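A rough sketch of that probe-and-gate, using the endpoint and property named in the commit message; the bare requests call, the item attributes, and the error wording are placeholder assumptions rather than the actual dandi-cli code:

```python
from functools import cached_property

import requests


class DandiAPIClientSketch:
    """Stand-in for the real DandiAPIClient; only the capability probe is shown."""

    api_url = "https://api.dandiarchive.org/api"

    @cached_property
    def supports_zarr_multipart_upload(self) -> bool:
        # One empty-body POST: a server without dandi-archive#2784 answers 404,
        # while a server that has the endpoint rejects the bogus payload (400)
        # or the missing auth (401/403), which still proves the route exists.
        r = requests.post(f"{self.api_url}/uploads/zarr/initialize/", json={})
        return r.status_code != 404


def gate_multipart_items(client, multipart_items):
    # Invoked right after the batch is split into multipart_items / singlepart_items.
    if multipart_items and not client.supports_zarr_multipart_upload:
        total = sum(f.size for f in multipart_items)  # attribute name assumed
        raise RuntimeError(  # the real code raises dandi's UploadError
            f"{len(multipart_items)} zarr chunk(s) totalling {total} bytes exceed "
            "the 5 GiB single-request limit, but this server does not support "
            "zarr multipart upload"
        )
```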
…umulative

``_multipart_upload`` (used here to upload zarr chunks larger than 5 GiB) yields progress dicts whose ``current`` is the number of bytes uploaded *within the chunk currently being transferred*. The surrounding ``ZarrAsset.iter_upload`` loop reports ``current`` as the number of bytes uploaded *across the whole zarr*. Yielded as-is, the inner per-file values caused ``current`` to jump backwards each time a new multipart chunk started, breaking progress bars driven from that field.

Wrap the inner generator: when an "uploading" status carries a ``current`` field, translate it by the cumulative bytes uploaded so far before re-yielding; pass other status dicts through unchanged. The trailing post-loop yield is dropped: the last per-part translated yield already reports the full cumulative size for that chunk.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
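The wrapping described here might look roughly like the following; the progress-dict shape ({"status": "uploading", "current": <bytes>}) is taken from the commit message, and everything else is an assumption:

```python
def offset_multipart_progress(inner, bytes_uploaded_before: int):
    """Re-yield _multipart_upload progress with per-file 'current' shifted into
    whole-zarr terms, so the outer iter_upload loop never reports a value that
    jumps backwards when a new oversized chunk starts."""
    for status in inner:
        if status.get("status") == "uploading" and "current" in status:
            status = {**status, "current": bytes_uploaded_before + status["current"]}
        yield status
```

The caller would then add the finished chunk's size to its running total before wrapping the generator for the next oversized chunk.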
The two ``ai_generated`` tests added in 9bc2976 patch ``ZARR_LARGE_CHUNK_THRESHOLD`` down so chunks are routed through ``_multipart_upload`` against whatever server backs ``new_dandiset``. The local docker dandi-archive image does not yet ship dandi-archive#2784, so the new ``supports_zarr_multipart_upload`` capability check raises ``UploadError`` before ``_multipart_upload`` is ever called, and the tests fail without exercising what they claim to.

Probe the live client for capability and ``pytest.skip`` only when the endpoint is missing; this keeps the tests effective on setups whose server already carries dandi-archive#2784.

Co-Authored-By: Claude Code 2.1.123 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
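The resulting skip guard could be as small as this; the fixture and threshold names follow the commit message, while the module path and the rest of the test body are assumptions:

```python
import pytest


def test_zarr_multipart_upload(new_dandiset, monkeypatch, tmp_path):
    client = new_dandiset.client  # however the fixture exposes its DandiAPIClient
    if not client.supports_zarr_multipart_upload:
        pytest.skip("server lacks the zarr multipart endpoints from dandi-archive#2784")
    # Patch the threshold down so small fixture files take the multipart path
    # (module path assumed).
    monkeypatch.setattr("dandi.files.zarr.ZARR_LARGE_CHUNK_THRESHOLD", 1024)
    ...  # build a small zarr under tmp_path and upload it as usual
```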
Sent. The conditioning part is quite minimal, as I expected.
Zarr multipart upload tune-ups: condition on server capability, fix progress reporting, harmonize variable names
Depends on and adds support for dandi/dandi-archive#2784