Merged
2 changes: 2 additions & 0 deletions .claude/sweep-accuracy-state.csv
@@ -36,3 +36,5 @@ visibility,2026-04-13T12:00:00Z,,,,"Bresenham line, LOS kernel, Fresnel zone all
worley,2026-05-01,,MEDIUM,2;5,"MEDIUM: numpy backend uses np.empty_like(data) so integer input dtype produces integer output (distances truncated to 0); cupy/dask paths always produce float32. LOW: freq=inf produces 100000 sentinel (sqrt of initial min_dist=1e10), no validation of freq/seed for non-finite values."
zonal,2026-03-30T12:00:00Z,1090,,,
geotiff,2026-05-06,pending-pr-cache-stale,MEDIUM,5,"Pass 5 (2026-05-06): MEDIUM fixed in fix-mmap-cache-stale-inode -- _MmapCache returned stale bytes after the file at the same path was replaced (e.g. by to_geotiff which writes via tempfile + os.replace, swapping the inode). Reproduces on numpy and GPU read paths: write A, read A, write B at same path, read returns A. Fixed by storing (st_ino, st_size, st_mtime_ns) per cache entry and dropping the entry on acquire when the file ident has changed; in-use mmaps stay valid for current holders. Tests added for both atomic-rename replacement and in-place truncate-and-rewrite. | Pass 4 (2026-05-06) PR #1501: sparse COG tiles (TileByteCounts==0). | Pass 3 (2026-05-06) PR #1500: predictor=3 byte-order. | Pass 2 (2026-05-05) PR #1498: predictor=2 sample-wise. | Pass 1 (2026-04-23) PR #1247. Re-confirmed clean: south-up flipping is a documented limitation in test_georef_edges.test_y_coords_known_limitation; CRS round-trip 4326/3857/32633/5070 exact; LERC int default lossless; NaN bit patterns survive predictor=3 round-trip; RATIONAL den=0 returns 0.0; ASCII tag NUL termination spec-correct on write; adler32 in GPU deflate path is per-tile not per-chunk so chunk-combine math does not apply; GeoDoubleParams uses correct double-array offset semantics. Deferred: per-band different NODATA via GDAL_METADATA exposed in attrs but not applied (single dataset-level value used for masking)."
geotiff,2026-05-07,1507,MEDIUM,1;5,"Pass 6 (2026-05-07): MEDIUM fixed in PR #1507 - predictor=2 decode crashed with numba TypingError ('Unsupported array dtype: >u2/u4/u8') on big-endian TIFFs with multi-byte integer samples. Regression introduced by the pass-2 rework (#1498) that switched predictor=2 to a sample-wise numpy view in the file's byte order; numba nopython mode rejects non-native dtype arrays. Fix byte-swaps the buffer in place around the kernel call so the kernel sees native order and the on-disk view stays in file order. Tests cover uint16/int16/uint32/int32 BE round-trip plus LE sanity. uint8 BE+pred2 unaffected (single-byte path). Also noted but NOT fixed in this PR (one fix per PR rule): GPU read of BE multi-byte TIFFs crashes with AttributeError on out.byteswap() at _gpu_decode.py:1817 and 1556 because cupy.ndarray has no byteswap method; user-visible effect is a silent fallback to CPU which works for non-pred2 BE files. Re-confirmed clean: empty-array round-trip raises clearly (0x0, 0xN, 1x1 boundary), inline tag values <4 bytes positioned correctly, BigTIFF count overflow, IFD chain loop, RATIONAL den=0, NextIFDOffset > file size, sparse tiles, mmap cache invalidation post-#1506, BE without predictor reads correctly, 12-bit unpack, redirect handling, fp_predictor BE swizzle (#1500), nodata-out-of-dtype handled by string tag, HTTP short-read caught by size check at _reader.py:440, concurrent fetch propagates first error via ThreadPoolExecutor exit. | Pass 5 (2026-05-06) PR #1506: mmap cache stale after file replace. | Pass 4 (2026-05-06) PR #1501: sparse COG tiles. | Pass 3 (2026-05-06) PR #1500: predictor=3 byte order. | Pass 2 (2026-05-05) PR #1498: predictor=2 sample-wise. | Pass 1 (2026-04-23) PR #1247."
geotiff,2026-05-07,pending-pr-cache-refcount,HIGH,5,"Pass 7 (2026-05-07): HIGH fixed in fix-mmap-cache-refcount-after-replace -- _MmapCache.release() looked up the cache entry by realpath, so a holder that acquired the OLD mmap before an os.replace and released it AFTER another caller had acquired the post-replace entry would decrement the new holder's refcount. Subsequent eviction (cache full, or another acquire) closed the still-in-use mmap, breaking reads with 'mmap closed or invalid'. Real exposure: any concurrent reader/writer pattern where to_geotiff replaces a file that another reader had just opened via open_geotiff with chunks= or via _FileSource. PR #1506 added stale-replacement detection but did not fix the refcount confusion across the pop. Fix: acquire returns an opaque entry token; release takes the token and decrements that exact entry, regardless of cache state. Orphaned (popped) entries close their fh+mmap when their own refcount hits zero. _FileSource updated to pass the token. Regression test test_release_after_path_replacement_does_not_clobber_new_holder added. All 665 geotiff tests pass; GPU path verified. | Pass 6 (2026-05-07) PR #1507: BE pred2 numba TypingError. | Pass 5 (2026-05-06) PR #1506: mmap cache stale after file replace. | Pass 4 (2026-05-06) PR #1501: sparse COG tiles. | Pass 3 (2026-05-06) PR #1500: predictor=3 byte order. | Pass 2 (2026-05-05) PR #1498: predictor=2 sample-wise. | Pass 1 (2026-04-23) PR #1247. 
Re-confirmed clean over passes 2-7: items 2 (writer always emits LE TIFFs - hardcoded b'II'), 3 (RowsPerStrip default = height when missing), 4 (StripByteCounts missing raises clear ValueError), 5 (TileWidth without TileLength caught by 'tw <= 0 or th <= 0' check at _reader.py:688), 9 (read determinism on compressed+tiled+multiband), 11 (predictor=2 with awkward sample stride round-trips), 18 (compression_level=99 raises ValueError 'out of range for deflate (valid: 1-9)'), 21 (concurrent writes serialize correctly via mkstemp+os.replace), 24 (uint16 dtype preserved on numpy backend, dask honors chunks param), 26 (chunks rounds correctly with remainder chunk for non-tile-aligned). Deferred: item 8 (BytesIO/file-like sources are not supported, source.lower() error) - documented as 'str' parameter, not a bug; item 19 (LERC max_z_error not user-exposed by to_geotiff) - missing feature, not a bug."
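The worley row above flags a dtype hazard: `np.empty_like(data)` inherits the input dtype, so integer input silently produces an integer output buffer and fractional distances truncate to 0. A minimal standalone illustration of that failure mode and the obvious repair (variable names here are illustrative, not xrspatial code):

```python
import numpy as np

data = np.zeros((2, 2), dtype=np.int32)

out_bad = np.empty_like(data)    # inherits int32 from the input
out_bad[:] = 0.7                 # fractional values truncate to 0
assert out_bad.max() == 0

# Passing an explicit dtype keeps the distances intact.
out_ok = np.empty_like(data, dtype=np.float32)
out_ok[:] = 0.7
assert np.allclose(out_ok, 0.7)
```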
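The pass-6 entry above describes byte-swapping the buffer to native order around the numba kernel call, because nopython mode rejects non-native dtypes such as `>u2`. A sketch of that pattern, with a plain-numpy cumulative sum standing in for the compiled predictor=2 kernel (the function name and kernel choice are illustrative, not the code in this PR):

```python
import numpy as np


def decode_predictor2(buf: np.ndarray) -> np.ndarray:
    """Undo TIFF predictor=2 (horizontal differencing) on a row buffer.

    Sketch only: a numba nopython kernel rejects non-native dtypes
    (e.g. >u2 on a little-endian host), so swap the bytes to native
    order first. np.cumsum stands in for the compiled kernel.
    """
    if not buf.dtype.isnative:
        # Copy-swap the bytes, then view them with a native-order dtype;
        # the original file-order buffer is left untouched.
        buf = buf.byteswap().view(buf.dtype.newbyteorder('='))
    np.cumsum(buf, axis=-1, out=buf)  # predictor=2 undo: running sum
    return buf


# Big-endian differenced row decodes to a running sum in native order.
row = np.array([[10, 1, 1]], dtype='>u2')
assert decode_predictor2(row).tolist() == [[10, 11, 12]]
```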
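Pass 7's release-by-token design reduces to a small pattern: `acquire` returns the entry itself as an opaque token, and `release` decrements that exact entry, so a late release after the slot was replaced cannot touch the new holder's count. A minimal sketch (class and field names hypothetical, not the `_MmapCache` implementation):

```python
import threading


class TokenCache:
    """Sketch of acquire-token / release-token reference counting.

    Orphaned (replaced) entries close their own resource when their own
    count reaches zero; the live slot for the key is never affected.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}  # key -> entry dict

    def acquire(self, key, open_resource):
        with self._lock:
            entry = self._slots.get(key)
            if entry is None:
                entry = {'res': open_resource(), 'rc': 0, 'orphaned': False}
                self._slots[key] = entry
            entry['rc'] += 1
            return entry['res'], entry  # the entry is the opaque token

    def invalidate(self, key):
        # e.g. the file at `key` was atomically replaced on disk
        with self._lock:
            entry = self._slots.pop(key, None)
            if entry is not None:
                entry['orphaned'] = True
                if entry['rc'] <= 0:
                    entry['res'] = None  # idle: close immediately

    def release(self, token):
        with self._lock:
            token['rc'] -= 1
            if token['rc'] <= 0 and token['orphaned']:
                token['res'] = None  # unreachable now: close resources


# Late release of an old holder after replacement:
cache = TokenCache()
_, tok_a = cache.acquire('p', lambda: 'old-mmap')
cache.invalidate('p')                        # file swapped at same path
_, tok_b = cache.acquire('p', lambda: 'new-mmap')
cache.release(tok_a)                         # decrements tok_a, not tok_b
assert tok_b['rc'] == 1
assert tok_a['res'] is None
```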
92 changes: 59 additions & 33 deletions xrspatial/geotiff/_reader.py
@@ -82,11 +82,18 @@ class _MmapCache:

def __init__(self, max_size: int | None = None):
self._lock = threading.Lock()
# path -> [fh, mm, size, refcount, ident]
# path -> entry list. Each entry is
# [fh, mm, size, refcount, ident, orphaned]
#
# ``ident`` is (st_ino, st_size, st_mtime_ns) used to spot files that
# were replaced (e.g. via ``os.replace`` on an atomic write) at the
# same path. ``list`` so we can mutate in place; ``OrderedDict`` gives
# LRU semantics via move_to_end.
# same path. ``orphaned`` is True once the entry has been removed
# from ``self._entries`` (typically because the underlying file was
# replaced). An orphaned entry is no longer the cache slot for the
# path, but live ``_FileSource`` instances still hold the entry list
# by reference and decrement *its* refcount on release. This keeps
# holders of the old mmap unaffected by any new acquires for the
# same path. ``OrderedDict`` gives LRU semantics via move_to_end.
self._entries: OrderedDict[str, list] = OrderedDict()
self._max_size = (max_size if max_size is not None
else _mmap_cache_size_from_env())
@@ -100,8 +107,21 @@ def _file_ident(path: str):
return None
return (st.st_ino, st.st_size, st.st_mtime_ns)

@staticmethod
def _close_entry_locked(entry):
"""Close the file handle and mmap for *entry* (must be idle)."""
if entry[1] is not None:
entry[1].close()
entry[0].close()

def acquire(self, path: str):
"""Get or create a read-only mmap for *path*. Returns (mm, size)."""
"""Get or create a read-only mmap for *path*.

Returns ``(mm, size, entry)``. The opaque ``entry`` token must be
passed back to :meth:`release` so the matching reference count is
decremented even after the cache slot has been replaced (e.g. by an
atomic file overwrite at the same path).
"""
real = _os_module.path.realpath(path)
with self._lock:
entry = self._entries.get(real)
@@ -111,19 +131,20 @@ def acquire(self, path: str):
# size, or mtime) the cached mmap is stale. Drop the entry so
# we re-open below. If the old entry is still in use by other
# callers, leave their mmap valid -- they still hold a
# reference -- but unlink the cache slot so it isn't reused.
# reference -- but mark it orphaned so a later release of
# *that* entry closes its own resources rather than touching
# the new cache slot.
if ident is not None and entry[4] != ident:
self._entries.pop(real)
entry[5] = True # orphaned
if entry[3] <= 0:
if entry[1] is not None:
entry[1].close()
entry[0].close()
self._close_entry_locked(entry)
entry = None

if entry is not None:
entry[3] += 1
self._entries.move_to_end(real)
return entry[1], entry[2]
return entry[1], entry[2], entry

fh = open(real, 'rb')
fh.seek(0, 2)
@@ -135,26 +156,35 @@ def acquire(self, path: str):
mm = None
# Re-stat after opening so size matches the mmap we built.
ident = self._file_ident(real) or (0, size, 0)
self._entries[real] = [fh, mm, size, 1, ident]
new_entry = [fh, mm, size, 1, ident, False]
self._entries[real] = new_entry
self._evict_locked()
return mm, size
return mm, size, new_entry

def release(self, path: str):
"""Decrement the reference count.
def release(self, entry):
"""Decrement the reference count for the supplied entry token.

When the count hits zero the entry stays cached (keyed by realpath)
until LRU eviction or :meth:`clear` is called.
When the count hits zero on a still-cached entry, it stays cached
(keyed by realpath) until LRU eviction or :meth:`clear`. When the
count hits zero on an orphaned entry, its file handle and mmap are
closed immediately because no further callers can reach it.
"""
real = _os_module.path.realpath(path)
with self._lock:
entry = self._entries.get(real)
if entry is None:
return
entry[3] -= 1
if entry[3] <= 0:
# Idle but still cached; mark LRU position.
self._entries.move_to_end(real)
self._evict_locked()
if entry[3] > 0:
return
if entry[5]:
# Orphaned: not in the dict; close now.
self._close_entry_locked(entry)
return
# Find the path so we can move it to the LRU tail. The entry
# identity is unique per realpath while non-orphaned, so a
# linear search over a small dict is fine.
for key, ent in self._entries.items():
if ent is entry:
self._entries.move_to_end(key)
break
self._evict_locked()

def _evict_locked(self):
"""Drop oldest *idle* entries until the cache is at or below the cap."""
@@ -171,20 +201,14 @@ def _evict_locked(self):
to_drop.append(key)
for key in to_drop:
entry = self._entries.pop(key)
mm = entry[1]
if mm is not None:
mm.close()
entry[0].close()
self._close_entry_locked(entry)

def clear(self):
"""Close and drop all idle entries (used by tests)."""
with self._lock:
for key in [k for k, v in self._entries.items() if v[3] <= 0]:
entry = self._entries.pop(key)
mm = entry[1]
if mm is not None:
mm.close()
entry[0].close()
self._close_entry_locked(entry)


# Module-level cache shared across all reads
@@ -196,7 +220,7 @@ class _FileSource:

def __init__(self, path: str):
self._path = path
self._mm, self._size = _mmap_cache.acquire(path)
self._mm, self._size, self._entry = _mmap_cache.acquire(path)

def read_range(self, start: int, length: int) -> bytes:
if self._mm is not None:
@@ -214,7 +238,9 @@ def size(self) -> int:
return self._size

def close(self):
_mmap_cache.release(self._path)
if self._entry is not None:
_mmap_cache.release(self._entry)
self._entry = None


def _get_http_pool():
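The `(st_ino, st_size, st_mtime_ns)` fingerprint used by the cache detects atomic replacement because `os.replace` swaps the inode living at the path: the temporary file has its own inode, and both exist simultaneously before the rename. A standalone demonstration (POSIX semantics, matching the `skipif` guards used in the test suite):

```python
import os
import tempfile


def file_ident(path):
    # Same fingerprint shape as the cache entry: inode, size, mtime (ns).
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


d = tempfile.mkdtemp()
p = os.path.join(d, 'data.bin')
with open(p, 'wb') as fh:
    fh.write(b'A' * 16)
before = file_ident(p)

# Atomic-rename replacement: write to a temp file, then os.replace makes
# the temp file's inode the one now living at `p`.
tmp = p + '.tmp'
with open(tmp, 'wb') as fh:
    fh.write(b'B' * 16)
os.replace(tmp, p)

assert file_ident(p) != before   # st_ino differs even at equal size
```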
82 changes: 65 additions & 17 deletions xrspatial/geotiff/tests/test_polish_1488.py
@@ -166,8 +166,8 @@ def test_cap_evicts_oldest_idle_entry(self, tmp_path):

# Acquire and release each: all become idle.
for f in files:
cache.acquire(f)
cache.release(f)
_, _, ent = cache.acquire(f)
cache.release(ent)

# Cache should be at the cap (2), with the oldest evicted.
assert len(cache._entries) == 2
@@ -181,9 +181,9 @@ def test_inuse_entries_not_evicted(self, tmp_path):
b = tmp_path / 'p3_b_1488.bin'
b.write_bytes(b'b' * 32)

cache.acquire(str(a)) # rc=1, in use
cache.acquire(str(b)) # would exceed cap, but a is in use
cache.release(str(b)) # b idle now
cache.acquire(str(a)) # rc=1, in use; entry intentionally leaked
_, _, ent_b = cache.acquire(str(b)) # would exceed cap, but a is in use
cache.release(ent_b) # b idle now

# a still in cache because rc > 0.
assert os.path.realpath(str(a)) in cache._entries
@@ -205,8 +205,8 @@ def test_fingerprint_mismatch_invalidates_idle_entry(self, tmp_path):
p = tmp_path / 'fingerprint_1488.bin'
p.write_bytes(b'A' * 16)

mm1, _ = cache.acquire(str(p))
cache.release(str(p))
mm1, _, ent1 = cache.acquire(str(p))
cache.release(ent1)

real = os.path.realpath(str(p))
# Force the cached fingerprint to a value the file cannot match,
@@ -218,12 +218,12 @@
# Next acquire must notice the mismatch and re-mmap. The mm
# object identity should change; the new entry's fingerprint
# should match the live file.
mm2, _ = cache.acquire(str(p))
mm2, _, ent2 = cache.acquire(str(p))
try:
assert mm2 is not mm1
assert cache._entries[real][4] == cache._file_ident(real)
finally:
cache.release(str(p))
cache.release(ent2)

@pytest.mark.skipif(
sys.platform.startswith('win'),
@@ -242,19 +242,19 @@ def test_replaced_file_invalidates_idle_entry(self, tmp_path):
p = tmp_path / 'replaced_1488.bin'
p.write_bytes(b'A' * 16)

mm1, _ = cache.acquire(str(p))
mm1, _, ent1 = cache.acquire(str(p))
first = bytes(mm1[:16])
assert first == b'A' * 16
cache.release(str(p))
cache.release(ent1)

# Atomic-rename style replacement: same path, new inode.
new = tmp_path / 'replaced_1488.bin.tmp'
new.write_bytes(b'B' * 16)
os.replace(str(new), str(p))

mm2, _ = cache.acquire(str(p))
mm2, _, ent2 = cache.acquire(str(p))
assert bytes(mm2[:16]) == b'B' * 16
cache.release(str(p))
cache.release(ent2)

@pytest.mark.skipif(
sys.platform.startswith('win'),
@@ -270,20 +270,68 @@ def test_truncated_file_invalidates_idle_entry(self, tmp_path):
p = tmp_path / 'truncate_1488.bin'
p.write_bytes(b'A' * 32)

mm1, sz1 = cache.acquire(str(p))
mm1, sz1, ent1 = cache.acquire(str(p))
assert sz1 == 32
cache.release(str(p))
cache.release(ent1)

# In-place rewrite with a smaller payload.
import time
time.sleep(0.01) # ensure st_mtime_ns advances
with open(str(p), 'wb') as fh:
fh.write(b'C' * 8)

mm2, sz2 = cache.acquire(str(p))
mm2, sz2, ent2 = cache.acquire(str(p))
assert sz2 == 8
assert bytes(mm2[:8]) == b'C' * 8
cache.release(str(p))
cache.release(ent2)

@pytest.mark.skipif(
sys.platform.startswith('win'),
reason="POSIX-only os.replace semantics",
)
def test_release_after_path_replacement_does_not_clobber_new_holder(
self, tmp_path,
):
"""A holder that acquired the *old* mmap before an ``os.replace``
must decrement its own refcount, not the refcount of whatever
new entry now lives at the same path.

Pre-fix, ``release(path)`` looked up by realpath, so a late
release after a file swap could free another caller's still-live
mmap. With a small cache the next acquire would then evict the
prematurely-zeroed entry, breaking reads in flight.
"""
cache = _MmapCache(max_size=1)
a = tmp_path / 'race_a_1488.bin'
a.write_bytes(b'A' * 16)

mm_a, _, ent_a = cache.acquire(str(a))

# Replace the file at the same path while ent_a is still held.
new = tmp_path / 'race_a_1488.bin.tmp'
new.write_bytes(b'B' * 16)
import time
time.sleep(0.01)
os.replace(str(new), str(a))

# Second acquire creates a fresh entry for the replaced file.
mm_b, _, ent_b = cache.acquire(str(a))
assert mm_b is not mm_a

# Original holder releases late. Must not touch ent_b.
cache.release(ent_a)
assert ent_b[3] == 1

# Force eviction of any idle entry by acquiring an unrelated
# file. ent_b is in use and must stay alive.
c = tmp_path / 'race_c_1488.bin'
c.write_bytes(b'C' * 16)
_, _, ent_c = cache.acquire(str(c))
try:
assert bytes(mm_b[:16]) == b'B' * 16
finally:
cache.release(ent_b)
cache.release(ent_c)


# ---------------------------------------------------------------------------