Skip to content

fix(session): retry zarr chunk-rename on Windows transient file locks#10

Open
wizzardkyd96 wants to merge 1 commit into
NsquaredLab:mainfrom
wizzardkyd96:fix/windows-zarr-atomic-write-retry
Open

fix(session): retry zarr chunk-rename on Windows transient file locks#10
wizzardkyd96 wants to merge 1 commit into
NsquaredLab:mainfrom
wizzardkyd96:fix/windows-zarr-atomic-write-retry

Conversation

@wizzardkyd96

Copy link
Copy Markdown
Collaborator

Problem

On Windows, session recording crashes mid-session with PermissionError (WinError 5 / 32) raised from deep inside zarr's chunk write.

Session.append()zarr.Array.append() commits each chunk by writing a temp file and atomically renaming it over the destination (zarr.storage._local._atomic_write). On POSIX that rename is atomic and never fails. On Windows the underlying MoveFileEx fails whenever another handle to the file is still open at that instant — an antivirus scan of the just-written chunk, the Search indexer, or simply the OS not having released the handle the microsecond after close() returned.

It's intermittent (a timing collision, not a concurrency bug), but a high-rate multi-stream recording hits it reliably: every append grows the arrays, each trailing-chunk rewrite is another rename, and losing the race once crashes the whole recording.

Why it can't be fixed at Session.append

zarr.Array.append resizes the array (persists the new shape) before writing the chunk data. If a chunk rename fails after the resize, retrying append would write at the wrong offset and corrupt the array. So the retry has to live at the rename itself.

Fix

This is the write-path counterpart to the existing _robust_rmtree, which already retries on the same Windows file-handle lag during teardown. _install_windows_zarr_atomic_write_retry() wraps zarr.storage._local._atomic_write with a bounded exponential back-off (10 attempts, 50 ms × attempt, with a gc.collect() between tries to drop lingering handles).

  • Windows-onlysys.platform != "win32" returns immediately; complete no-op on POSIX.
  • Guarded — wrapped in try/except, so if zarr's internals ever move it degrades to a no-op and can never break import.
  • Preserves the original exclusive / _safe_move path.

Root cause / upstream note

The actual defect is in zarr (_atomic_write has no Windows-lock resilience), and it's unfixed as of the latest zarr (3.2.1) and on zarr's main. The proper long-term fix belongs there; this shim keeps MyoGestic recording robust until a fixed zarr is released, at which point the wrapper is harmless.

Testing

Verified against real zarr 3.2.1 by injecting forced transient PermissionErrors into the rename: 10 forced failures were retried and the resulting array came back with the exact expected shape and byte-for-byte intact data. No-op confirmed on non-Windows (early return).

zarr commits each chunk by writing a temp file and atomically renaming it
over the destination. On POSIX that rename never fails; on Windows
MoveFileEx raises PermissionError (WinError 5/32) whenever another handle
to the file is still open at that instant -- an AV scan of the just-written
chunk, the search indexer, or the OS not releasing the handle the
microsecond after close() returned. It is intermittent, but a high-rate
multi-stream recording reliably hits it: every Session.append grows the
arrays, each trailing-chunk rewrite is another rename, and losing the race
once crashes the whole recording.

MyoGestic already retries on this exact Windows lag during teardown
(_robust_rmtree); this adds the missing write-path counterpart. The retry
must live at the rename itself -- Session.append -> zarr.Array.append can't
be retried safely (it resizes before writing chunks, so a retry after a
partial failure writes at the wrong offset and corrupts the array).

The proper fix belongs upstream in zarr; this is a Windows-only, guarded
shim until a fixed zarr is released.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant