fix(session): retry zarr chunk-rename on Windows transient file locks#10
Open
wizzardkyd96 wants to merge 1 commit into
Open
fix(session): retry zarr chunk-rename on Windows transient file locks#10wizzardkyd96 wants to merge 1 commit into
wizzardkyd96 wants to merge 1 commit into
Conversation
zarr commits each chunk by writing a temp file and atomically renaming it over the destination. On POSIX that rename never fails; on Windows MoveFileEx raises PermissionError (WinError 5/32) whenever another handle to the file is still open at that instant -- an AV scan of the just-written chunk, the search indexer, or the OS not releasing the handle the microsecond after close() returned. It is intermittent, but a high-rate multi-stream recording reliably hits it: every Session.append grows the arrays, each trailing-chunk rewrite is another rename, and losing the race once crashes the whole recording. MyoGestic already retries on this exact Windows lag during teardown (_robust_rmtree); this adds the missing write-path counterpart. The retry must live at the rename itself -- Session.append -> zarr.Array.append can't be retried safely (it resizes before writing chunks, so a retry after a partial failure writes at the wrong offset and corrupts the array). The proper fix belongs upstream in zarr; this is a Windows-only, guarded shim until a fixed zarr is released.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On Windows, session recording crashes mid-session with
PermissionError(WinError 5 / 32) raised from deep inside zarr's chunk write.Session.append()→zarr.Array.append()commits each chunk by writing a temp file and atomically renaming it over the destination (zarr.storage._local._atomic_write). On POSIX that rename is atomic and never fails. On Windows the underlyingMoveFileExfails whenever another handle to the file is still open at that instant — an antivirus scan of the just-written chunk, the Search indexer, or simply the OS not having released the handle the microsecond afterclose()returned.It's intermittent (a timing collision, not a concurrency bug), but a high-rate multi-stream recording hits it reliably: every
appendgrows the arrays, each trailing-chunk rewrite is another rename, and losing the race once crashes the whole recording.Why it can't be fixed at
Session.appendzarr.Array.appendresizes the array (persists the new shape) before writing the chunk data. If a chunk rename fails after the resize, retryingappendwould write at the wrong offset and corrupt the array. So the retry has to live at the rename itself.Fix
This is the write-path counterpart to the existing
_robust_rmtree, which already retries on the same Windows file-handle lag during teardown._install_windows_zarr_atomic_write_retry()wrapszarr.storage._local._atomic_writewith a bounded exponential back-off (10 attempts,50 ms × attempt, with agc.collect()between tries to drop lingering handles).sys.platform != "win32"returns immediately; complete no-op on POSIX.try/except, so if zarr's internals ever move it degrades to a no-op and can never breakimport.exclusive/_safe_movepath.Root cause / upstream note
The actual defect is in zarr (
_atomic_writehas no Windows-lock resilience), and it's unfixed as of the latest zarr (3.2.1) and on zarr'smain. The proper long-term fix belongs there; this shim keeps MyoGestic recording robust until a fixed zarr is released, at which point the wrapper is harmless.Testing
Verified against real zarr 3.2.1 by injecting forced transient
PermissionErrors into the rename: 10 forced failures were retried and the resulting array came back with the exact expected shape and byte-for-byte intact data. No-op confirmed on non-Windows (early return).