10 changes: 6 additions & 4 deletions ARCHITECTURE.md
@@ -45,7 +45,7 @@ src/lazycogs/
2. Parses `time_period` into a `_TemporalGrouper` (see `_temporal.py`).
3. Converts `bbox` from the target CRS to EPSG:4326 using `pyproj.Transformer`.
4. Calls `_discover_bands()`: queries the parquet source via `duckdb_client.search(..., max_items=1)` to find asset keys. Assets with role `"data"` or media type `"image/tiff"` are returned first.
5. Calls `_smoketest_store()`: fetches one sample item from the parquet, resolves the object store for a representative data asset HREF, and calls `head()` to confirm access. Raises `RuntimeError` immediately with a clear message if the store cannot reach the asset, so misconfiguration is surfaced at `open()` time rather than deferred to the first chunk read.
5. Calls `_smoketest_store()`: fetches one sample item from the parquet, resolves the store for a representative data asset HREF, and calls `GeoTIFF.open(path, store=...)` to confirm that the configured store satisfies the same read contract the real chunk reader uses. Raises `RuntimeError` immediately with a clear message if the store cannot reach the asset, so misconfiguration is surfaced at `open()` time rather than deferred to the first chunk read.
6. Calls `_build_time_steps()`: queries the parquet source via `duckdb_client.search_to_arrow(...)` to obtain an Arrow table containing only the `datetime` and `start_datetime` columns (plus any filter/sort fields). Extracts timestamps from the Arrow columns without Python-level dict walking, buckets them with the `_TemporalGrouper`, deduplicates, and returns sorted `(filter_strings, time_coords)` pairs. Only groups with at least one item produce a time step.
7. Calls `compute_output_grid()` to get the output affine transform and dimensions (width, height). No eager coordinate arrays are produced.
8. Creates a single `MultiBandStacBackendArray` (a dataclass) with shape `(band, time, y, x)` holding all the parameters needed to materialise any chunk later, then wraps it in one `xarray.core.indexing.LazilyIndexedArray`. This avoids `xr.concat` (used internally by `ds.to_array()`), which would eagerly load `LazilyIndexedArray`-backed objects.
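The lazy-materialisation idea in step 8 can be sketched in plain Python. This is a toy illustration, not the lazycogs API: `LazyChunkArray` and `read_chunk` are hypothetical names, and the real class wires into xarray's `BackendArray`/`LazilyIndexedArray` machinery.

```python
class LazyChunkArray:
    """Toy stand-in for the lazy backend array: it only stores the
    parameters needed to materialise data, and reads nothing until it
    is actually indexed."""

    def __init__(self, shape, read_chunk):
        self.shape = shape            # e.g. (band, time, y, x)
        self._read_chunk = read_chunk
        self.reads = 0                # count of materialised reads

    def __getitem__(self, key):
        self.reads += 1               # a real implementation fetches COG bytes here
        return self._read_chunk(key)


arr = LazyChunkArray((4, 52, 1024, 1024), read_chunk=lambda key: 0.0)
assert arr.reads == 0     # construction is free: no I/O has happened yet
_ = arr[0, 0, 0, 0]       # the first index triggers the first read
assert arr.reads == 1
```

Because a single array of shape `(band, time, y, x)` is built up front, no eager concatenation across bands or time steps is needed.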
@@ -248,11 +248,13 @@ mean that requires all weeks to be present before reducing).

## Store caching and the `store` parameter

`resolve()` in `_store.py` defers to `obstore.store.from_url` for scheme detection — including the special-case HTTPS routing for `amazonaws.com`, `r2.cloudflarestorage.com`, and Azure hosts — rather than maintaining its own list of known object-store domains. The constructed store is cached per thread in a `dict[str, ObjectStore]` keyed by root URL (`scheme://netloc`). Because dask tasks run in threads, this avoids repeated connection setup within a single task while remaining safe across concurrent tasks.
`lazycogs.open(..., store=...)` accepts the same obspec-compatible async range-read contract that `async-geotiff` consumes. In practice, any object that satisfies `async_tiff.ObspecInput` can be passed through and will be forwarded unchanged to `GeoTIFF.open(...)`.

No credential defaults are applied; stores are constructed with obstore's own environment-based credential discovery (environment variables, instance metadata, config files, etc.). For public buckets that do not require signed requests, callers pass `skip_signature=True` explicitly. For authenticated access or any non-default configuration, the caller is expected to construct an `ObjectStore` and pass it via the `store=` parameter to `open()`; `resolve()` then returns it unchanged and only extracts the object path from each HREF. No introspection is done on a user-supplied store — the caller is responsible for ensuring it is rooted at the same `scheme://netloc` the HREFs point to.
`resolve()` in `_store.py` remains the default convenience layer. When `store=None`, it defers to `obstore.store.from_url` for scheme detection — including the special-case HTTPS routing for `amazonaws.com`, `r2.cloudflarestorage.com`, and Azure hosts — rather than maintaining its own list of known object-store domains. The constructed obstore-backed store is cached per thread in a `dict[str, ObjectStore]` keyed by root URL (`scheme://netloc`). Because dask tasks run in threads, this avoids repeated connection setup within a single task while remaining safe across concurrent tasks.
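A minimal sketch of this per-thread caching pattern, with `build` standing in for `obstore.store.from_url` (the names here are illustrative, not the `_store.py` internals):

```python
import threading
from urllib.parse import urlparse

_local = threading.local()  # each thread gets its own cache dict


def cached_store(href: str, build):
    """Return a store for href's root URL, constructing it at most
    once per thread. `build` stands in for obstore.store.from_url."""
    cache = getattr(_local, "stores", None)
    if cache is None:
        cache = _local.stores = {}
    parts = urlparse(href)
    root = f"{parts.scheme}://{parts.netloc}"  # cache key: scheme://netloc
    if root not in cache:
        cache[root] = build(root)
    return cache[root]
```

Two HREFs under the same root share one store per thread, so a dask task that reads many assets from one bucket pays connection setup once.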

`store_for(href, *, asset=None, **kwargs)` is a public convenience factory that automates this construction. It reads one sample item from the geoparquet file, extracts a data asset HREF, and calls `from_url` using obstore's environment-based credential discovery. If the item carries STAC Storage Extension metadata (v1.0.0 flat fields or v2.0.0 `storage:schemes`/`storage:refs`), `region` and `requester_pays` are also extracted and forwarded. Caller `kwargs` override all inferred values (e.g. `skip_signature=True` for public buckets). The returned store is not cached — the caller owns its lifetime and passes it to `open()` via `store=`.
No credential defaults are applied; auto-resolved stores are constructed with obstore's own environment-based credential discovery (environment variables, instance metadata, config files, etc.). For public buckets that do not require signed requests, callers pass `skip_signature=True` explicitly. For authenticated access or any non-default configuration, callers may still construct an obstore `ObjectStore` and pass it via `store=`; `resolve()` then returns it unchanged and only extracts the object path from each HREF. No introspection is done on a user-supplied store — the caller is responsible for ensuring it satisfies the `GeoTIFF.open` read contract and is rooted at the same `scheme://netloc` the HREFs point to.

`store_for(href, *, asset=None, **kwargs)` is a public convenience factory that automates the default obstore path. It reads one sample item from the geoparquet file, extracts a data asset HREF, and calls `from_url` using obstore's environment-based credential discovery. If the item carries STAC Storage Extension metadata (v1.0.0 flat fields or v2.0.0 `storage:schemes`/`storage:refs`), `region` and `requester_pays` are also extracted and forwarded. Caller `kwargs` override all inferred values (e.g. `skip_signature=True` for public buckets). The returned store is not cached — the caller owns its lifetime and passes it to `open()` via `store=`.
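The v1.0.0 flat-field inference could be sketched like this (a hedged approximation: `infer_store_kwargs` is a hypothetical name, field names follow the STAC Storage Extension v1.0.0, and v2.0.0 `storage:schemes`/`storage:refs` handling is omitted):

```python
def infer_store_kwargs(item: dict) -> dict:
    """Hypothetical helper: pull store kwargs from STAC Storage
    Extension v1.0.0 flat fields on an item's properties."""
    props = item.get("properties", {})
    kwargs = {}
    if "storage:region" in props:
        kwargs["region"] = props["storage:region"]
    if props.get("storage:requester_pays"):
        kwargs["requester_pays"] = True
    return kwargs


inferred = infer_store_kwargs(
    {"properties": {"storage:region": "us-west-2", "storage:requester_pays": True}}
)
# Caller kwargs take precedence over inferred values, mirroring store_for:
merged = {**inferred, "skip_signature": True}
```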

When the store root does not align with the URL structure of the asset HREFs — for example, an Azure Blob Storage store rooted at a container while the HREFs include the container name in the path — the caller can provide a `path_from_href` callable to `open()`. The callable takes the full HREF string and returns the object path to use with the store. When supplied, it replaces the default `urlparse`-based extraction in `resolve()`.
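For instance, a hypothetical Azure layout where the HREF path repeats the container name could be handled with a callable like this (the account and container names are made up for illustration):

```python
from urllib.parse import urlparse


def path_from_href(href: str) -> str:
    # Store is rooted at the container, but HREFs repeat the container
    # name as the first path segment, so strip it off.
    path = urlparse(href).path.lstrip("/")
    _container, _, object_path = path.partition("/")
    return object_path
```

The callable would then be passed as `lazycogs.open(..., path_from_href=path_from_href)`.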

9 changes: 7 additions & 2 deletions README.md
@@ -13,7 +13,7 @@ Open a lazy `(band, time, y, x)` xarray DataArray from thousands of cloud-optimi

[stackstac](https://stackstac.readthedocs.io) and [odc-stac](https://odc-stac.readthedocs.io) established the pattern that lazycogs builds on: take a STAC item collection and expose it as a spatially-aligned xarray DataArray ready for dask-parallel computation. Both are excellent tools that cover most satellite imagery workflows well. They rely on the trusty combination of rasterio and GDAL for data I/O and warping operations.

lazycogs takes the same approach but replaces GDAL and rasterio with a Rust-native stack: [rustac](https://stac-utils.github.io/rustac-py) for STAC queries over stac-geoparquet files, [async-geotiff](https://developmentseed.org/async-geotiff) for COG I/O, and [obstore](https://developmentseed.org/obstore) for cloud storage access.
lazycogs takes the same approach but replaces GDAL and rasterio with a Rust-native stack: [rustac](https://stac-utils.github.io/rustac-py) for STAC queries over stac-geoparquet files, [async-geotiff](https://developmentseed.org/async-geotiff) for COG I/O, and [obstore](https://developmentseed.org/obstore) as the default cloud storage integration.

The result is a tool that can instantly expose a lazy xarray DataArray view of massive STAC item archives in any CRS and resolution. Each array operation triggers a targeted spatial query on the stac-geoparquet file to find only the assets needed for that specific chunk — no upfront scan of every item required.

@@ -23,7 +23,7 @@ One constraint worth naming: lazycogs only reads Cloud Optimized GeoTIFFs. If yo
|---|---|
| STAC search + spatial indexing | `rustac` (DuckDB + geoparquet) |
| COG I/O | `async-geotiff` (Rust, no GDAL) |
| Cloud storage | `obstore` |
| Cloud storage | `obstore` by default; any `async-geotiff`/obspec-compatible store when passed via `store=` |
| Reprojection | `pyproj` + numpy |
| Lazy dataset construction | xarray `BackendEntrypoint` + `LazilyIndexedArray` |

@@ -97,6 +97,11 @@ event loop. Multiple concurrent chunk reads overlap naturally, so the
async path can be faster than the synchronous `da.compute()` when
reading many chunks inside an already-running loop.
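The overlap effect can be illustrated with a toy asyncio example, where `asyncio.sleep` stands in for an awaited COG range read:

```python
import asyncio
import time


async def read_chunk(i: int) -> int:
    await asyncio.sleep(0.05)  # stand-in for an awaited COG range read
    return i


async def read_all(n: int) -> list[int]:
    # All n reads are in flight at once, so wall time is roughly one
    # read's latency, not n reads back to back.
    return list(await asyncio.gather(*(read_chunk(i) for i in range(n))))


start = time.perf_counter()
chunks = asyncio.run(read_all(8))
elapsed = time.perf_counter() - start  # well under 8 * 0.05 s
```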

## Custom stores

`lazycogs.open(..., store=...)` accepts any store object that satisfies the async range-read contract consumed by `async-geotiff`.
For most users, the recommended path is still obstore: leave `store=None` to auto-resolve per-asset stores, or call `lazycogs.store_for()` to build one explicitly.

## Documentation

- [Home](https://developmentseed.github.io/lazycogs/) — quickstart and full usage guide
12 changes: 9 additions & 3 deletions docs/guides/cloud-storage.md
@@ -1,15 +1,21 @@
# Cloud storage

lazycogs uses [obstore](https://developmentseed.org/obstore/latest/) to read COG assets from cloud object storage. This guide covers how to configure object stores for different storage backends and authentication scenarios.
lazycogs uses [obstore](https://developmentseed.org/obstore/latest/) as its default way to read COG assets from cloud object storage. It can also accept any custom store object that satisfies the async range-read contract consumed by `async-geotiff`. This guide covers the default obstore path plus the custom-store contract.

## Default behavior

By default, `lazycogs.open()` parses each asset HREF into an `ObjectStore` using [`obstore.store.from_url`](https://developmentseed.org/obstore/latest/api/store/from_url/). No credential defaults are applied; the store uses obstore's own environment-based credential discovery (environment variables, instance metadata, config files, etc.).

`lazycogs.open()` runs a lightweight storage smoketest on startup: it resolves the object store for a sample data asset and calls `head()` to confirm access. If the store cannot reach the asset, a `RuntimeError` is raised immediately with a clear message rather than deferring the failure to the first chunk read.
`lazycogs.open()` runs a lightweight storage smoketest on startup: it resolves the store for a sample data asset and calls `GeoTIFF.open(..., store=...)` to confirm access through the same contract used by the real reader. If the store cannot reach the asset, a `RuntimeError` is raised immediately with a clear message rather than deferring the failure to the first chunk read.

For public buckets that do not require signed requests, pass `skip_signature=True` when constructing the store. For authenticated buckets, provide credentials via environment variables or a pre-configured store.

## Custom store contract

When you pass `store=` explicitly, lazycogs forwards that object to `async-geotiff`. The object does not need to be an obstore `ObjectStore`; it only needs to satisfy the obspec-compatible async range-read contract accepted by `GeoTIFF.open()`.

For most users, obstore is still the recommended path because `store=None` auto-resolves it for each asset HREF and `lazycogs.store_for()` constructs it for you.
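As a rough illustration of what such a store can look like, here is a toy in-memory object satisfying a sketched version of the contract. The authoritative definition is `async_tiff.ObspecInput`, so the method name and signature below are an assumption for illustration, not the spec:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RangeReader(Protocol):
    """Sketched shape of the async range-read contract; consult
    async_tiff.ObspecInput for the real signatures."""

    async def get_range_async(self, path: str, *, start: int, end: int) -> bytes: ...


class InMemoryStore:
    """Toy store backed by a dict of path -> bytes."""

    def __init__(self, blobs: dict[str, bytes]):
        self._blobs = blobs

    async def get_range_async(self, path: str, *, start: int, end: int) -> bytes:
        return self._blobs[path][start:end]


assert isinstance(InMemoryStore({}), RangeReader)  # structural check only
```

Anything with this shape, such as an instrumented or caching wrapper around a real store, can in principle be forwarded via `store=`.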

## Constructing a store from your data

`lazycogs.store_for()` inspects a geoparquet file and builds a matching `ObjectStore` automatically. It reads one sample item, derives the store root from a data asset HREF, and infers `region` and `requester_pays` from [STAC Storage Extension](https://github.com/stac-extensions/storage) metadata when present.
@@ -29,7 +35,7 @@ store = lazycogs.store_for("items.parquet", asset="B04")

## Constructing stores manually

For authenticated buckets, requester-pays buckets, custom endpoints, or non-standard authentication, construct the store yourself:
For authenticated buckets, requester-pays buckets, custom endpoints, or non-standard authentication, construct an obstore-backed store yourself:

```python
from obstore.store import S3Store, GCSStore, HTTPStore
1 change: 1 addition & 0 deletions pyproject.toml
@@ -23,6 +23,7 @@ dependencies = [
"affine>=2.4.0",
"arro3-core>=0.8.0",
"async-geotiff>=0.4.0",
"async-tiff>=0.7.1",
"cql2>=0.5.5",
"numpy>=2.4.4",
"obstore>=0.9.2",
16 changes: 9 additions & 7 deletions src/lazycogs/_backend.py
@@ -27,6 +27,7 @@
if TYPE_CHECKING:
from collections.abc import Callable

from async_tiff import ObspecInput
from rustac import DuckdbClient

from lazycogs._mosaic_methods import MosaicMethodBase
@@ -61,7 +62,8 @@ class _ChunkReadPlan:
chunk_height: Chunk height in pixels.
nodata: No-data fill value, or ``None``.
mosaic_method_cls: Mosaic method class, or ``None`` for the default.
store: Pre-configured obstore ``ObjectStore`` instance, or ``None``.
store: Pre-configured obspec-compatible store accepted by
``GeoTIFF.open``, or ``None``.
max_concurrent_reads: Maximum concurrent COG reads per chunk.
warp_cache: Shared warp map cache across time steps.
path_fn: Optional callable extracting an object path from an asset HREF.
@@ -83,7 +85,7 @@ class _ChunkReadPlan:
chunk_height: int
nodata: float | None
mosaic_method_cls: type[MosaicMethodBase] | None
store: Any | None
store: ObspecInput | None
max_concurrent_reads: int
warp_cache: dict
path_fn: Callable[[str], str] | None
@@ -335,10 +337,10 @@ class MultiBandStacBackendArray(BackendArray):
mosaic_method_cls: Mosaic method class instantiated per chunk, or
``None`` to use the default
:class:`~lazycogs._mosaic_methods.FirstMethod`.
store: Pre-configured obstore ``ObjectStore`` instance shared across
all chunk reads. When ``None``, each asset HREF is resolved to a
store via the thread-local cache in
:func:`~lazycogs._store.resolve`.
store: Pre-configured obspec-compatible store accepted by
``GeoTIFF.open`` and shared across all chunk reads. When ``None``,
each asset HREF is resolved to an obstore-backed store via the
thread-local cache in :func:`~lazycogs._store.resolve`.
max_concurrent_reads: Maximum number of COG reads to run concurrently
per chunk. Limits peak in-flight memory when a chunk overlaps
many items. Defaults to 32.
@@ -369,7 +371,7 @@ class MultiBandStacBackendArray(BackendArray):
dtype: np.dtype
nodata: float | None
mosaic_method_cls: type[MosaicMethodBase] | None = field(default=None)
store: Any | None = field(default=None)
store: ObspecInput | None = field(default=None)
max_concurrent_reads: int = field(default=32)
path_from_href: Callable[[str], str] | None = field(default=None)
shape: tuple[int, ...] = field(init=False)
19 changes: 12 additions & 7 deletions src/lazycogs/_chunk_reader.py
@@ -26,7 +26,7 @@
from collections.abc import Callable

from affine import Affine
from obstore.store import ObjectStore
from async_tiff import ObspecInput
from pyproj import CRS, Transformer

logger = logging.getLogger(__name__)
@@ -46,7 +46,7 @@ class _ChunkContext:
chunk_width: int
chunk_height: int
nodata: float | None
store: ObjectStore | None
store: ObspecInput | None
path_fn: Callable[[str], str] | None
warp_cache: dict[tuple[tuple[float, ...], CRS], WarpMap] | None

@@ -375,7 +375,7 @@ async def _read_item_band(
return None

# Open all COGs concurrently for metadata.
async def _open_band(band: str, href: str) -> tuple[str, GeoTIFF, ObjectStore]:
async def _open_band(
band: str,
href: str,
) -> tuple[str, GeoTIFF, ObspecInput]:
band_store, path = _resolve_store(href, ctx.store, ctx.path_fn)
geotiff = await GeoTIFF.open(path, store=band_store)
return band, geotiff, band_store
@@ -520,7 +523,7 @@ async def read_chunk_async(
chunk_height: int,
nodata: float | None = None,
mosaic_method_cls: type[MosaicMethodBase] | None = None,
store: ObjectStore | None = None,
store: ObspecInput | None = None,
max_concurrent_reads: int = 32,
warp_cache: dict | None = None,
path_fn: Callable[[str], str] | None = None,
@@ -544,7 +547,8 @@ nodata: No-data fill value.
nodata: No-data fill value.
mosaic_method_cls: Mosaic method class instantiated once per band.
Defaults to :class:`~lazycogs._mosaic_methods.FirstMethod`.
store: Optional pre-configured obstore ``ObjectStore`` instance.
store: Optional pre-configured obspec-compatible store accepted by
``GeoTIFF.open``.
max_concurrent_reads: Maximum number of COG reads to run concurrently.
warp_cache: Optional cache shared across calls for reusing warp maps
from earlier time steps.
@@ -626,7 +630,7 @@ def read_chunk(
chunk_height: int,
nodata: float | None = None,
mosaic_method_cls: type[MosaicMethodBase] | None = None,
store: ObjectStore | None = None,
store: ObspecInput | None = None,
max_concurrent_reads: int = 32,
warp_cache: dict | None = None,
path_fn: Callable[[str], str] | None = None,
@@ -645,7 +649,8 @@
nodata: No-data fill value.
mosaic_method_cls: Mosaic method class instantiated once per band.
Defaults to :class:`~lazycogs._mosaic_methods.FirstMethod`.
store: Optional pre-configured obstore ``ObjectStore`` instance.
store: Optional pre-configured obspec-compatible store accepted by
``GeoTIFF.open``.
max_concurrent_reads: Maximum number of COG reads to run concurrently.
warp_cache: Optional cache shared across calls for reusing warp maps
from earlier time steps.