feat: GPU keep-on-device + kvikio (GDS) reader + pipeline GPU wiring #112

Open
FIrgolitsch wants to merge 5 commits into sphinx-config from pr-m-gpu-kvikio

@FIrgolitsch commented Apr 30, 2026

Contributor

Stacked PR 13/22, review order: #115, #97, #98, #99, #100, #101, #108, #106, #107, #87, #116, #110, #111, #40, #112, #113, #117, #118, #120, #121, #122, #123, #124, #125

Base: sphinx-config. Retargets to main as upstream PRs merge.


PR — GPU keep-on-device + kvikio (GDS) reader + pipeline GPU wiring

Extends the GPU stack with end-to-end on-device data flow for the OCT reconstruction pipeline and adds a GPUDirect Storage (GDS) reader as a fast path for reading uncompressed zarr arrays straight into device memory.

GPU keep-on-device

  • New linumpy.gpu.zarr_io with gpu_zarr_context() (uses zarr.config.enable_gpu()) and read_zarr_to_gpu(...) with auto backend selection (kvikio when available, zarr-gpu otherwise).
  • linumpy.gpu.interpolation: device-preserving resize, affine_transform, map_coordinates.
  • New linumpy.gpu.interface with a GPU implementation of find_tissue_interface (no-mask path) using cupyx filters.
  • linumpy.geometry.interface.find_tissue_interface(..., use_gpu=...) and linumpy.mosaic.stacking.find_z_overlap(..., use_gpu=...) now route to GPU when requested.
  • linum_aip.py and linum_resample_mosaic_grid.py use gpu_zarr_context to keep tiles on-device through the slab loop and writer.
  • linum_detect_focal_curvature.py: vectorized roll via take_along_axis (xp dispatch) and --use_gpu/--no-use_gpu.
  • linum_stack_slices_motor.py: --use_gpu/--no-use_gpu plumbed to find_z_overlap.
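
To illustrate the "vectorized roll via take_along_axis (xp dispatch)" item above, here is a minimal sketch rather than the code in linum_detect_focal_curvature.py. The function name `roll_along_axis0`, the choice of axis 0 as the depth axis, and the `xp` parameter are assumptions made for the example; `xp` can be `numpy` or, on a GPU, `cupy`, since both expose `arange` and `take_along_axis`.

```python
import numpy as np


def roll_along_axis0(volume, shifts, xp=np):
    """Roll every column of `volume` along axis 0 by its own integer shift.

    volume: array of shape (nz, ...).
    shifts: integer array of shape volume.shape[1:], one shift per column.
    xp:     array module (numpy here; cupy is assumed to work the same way).
    """
    nz = volume.shape[0]
    # Output depth index z, shaped (nz, 1, ..., 1) so it broadcasts over columns.
    z = xp.arange(nz).reshape((nz,) + (1,) * (volume.ndim - 1))
    # Same convention as np.roll: out[z] = in[(z - shift) mod nz].
    src = (z - xp.asarray(shifts)[None, ...]) % nz
    # One gather replaces a Python loop over columns.
    return xp.take_along_axis(volume, src, axis=0)


if __name__ == "__main__":
    vol = np.arange(24).reshape(4, 3, 2)
    shifts = np.array([[0, 1], [2, -1], [3, 5]])
    rolled = roll_along_axis0(vol, shifts)
    print(rolled.shape)
```

Passing the array module explicitly keeps the function free of any hard dependency on CuPy, which is one common way to implement this kind of dispatch.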

kvikio (GDS) reader (prototype)

  • linumpy/gpu/kvikio_zarr.py: GDS reader for raw uncompressed zarr v2 + v3.
    • Refuses incompatible arrays (compressed, filtered, non-C order, mismatched endian) with NotImplementedError.
    • Uses contiguous scratch buffer for CuFile.pread.
  • scripts/linum_benchmark_kvikio_zarr.py: benchmark with kvikio and zarr.config.enable_gpu() paths for comparison.
  • read_zarr_to_gpu falls back to zarr-gpu when kvikio is in compat mode, when arrays aren't GDS-compatible, or on any runtime failure.
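
The refusal and fallback rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in linumpy/gpu/kvikio_zarr.py: `check_gds_compatible` and `read_with_fallback` are hypothetical names, the readers are passed in as plain callables, and only zarr v2 `.zarray` metadata (keys `compressor`, `filters`, `order`, `dtype`) is handled. The kvikio compat-mode check is left out.

```python
import sys


def check_gds_compatible(zarray_meta):
    """Raise NotImplementedError unless a zarr v2 .zarray dict describes
    raw, unfiltered, C-order, native-endian chunks."""
    if zarray_meta.get("compressor") is not None:
        raise NotImplementedError("compressed chunks are not supported")
    if zarray_meta.get("filters"):
        raise NotImplementedError("filtered chunks are not supported")
    if zarray_meta.get("order", "C") != "C":
        raise NotImplementedError("only C-order chunks are supported")
    dtype = zarray_meta["dtype"]
    if not isinstance(dtype, str):
        raise NotImplementedError("structured dtypes are not supported")
    # First character of a zarr v2 dtype string is the byte order:
    # '<' little, '>' big, '|' not applicable (single-byte types).
    native = "<" if sys.byteorder == "little" else ">"
    if dtype[0] not in ("|", native):
        raise NotImplementedError("byte order does not match the host")


def read_with_fallback(fast_reader, fallback_reader, zarray_meta):
    """Try the fast path; on refusal or any runtime failure use the fallback.

    Returns (data, backend_name) so callers can log which path ran.
    """
    try:
        check_gds_compatible(zarray_meta)
        return fast_reader(), "kvikio"
    except Exception:
        # Covers NotImplementedError from the check and runtime I/O errors.
        return fallback_reader(), "zarr-gpu"


if __name__ == "__main__":
    meta = {"compressor": None, "filters": None, "order": "C", "dtype": "|u1"}
    print(read_with_fallback(lambda: "fast", lambda: "slow", meta))
```

Returning the backend name alongside the data makes silent fallbacks visible in benchmarks, which matters when comparing the two paths.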

Server / build

  • shell_scripts/server_setup/nvfs_kernel7_patch.sh: nvidia-fs 2.28.4 patch for kernel 7.0; symvers helper now also handles .ko.zst.
  • pyproject.toml: bump ome-zarr to >=0.16.0 (NGFF 0.5).

Nextflow pipeline GPU wiring

  • fix_focal_curvature and stack processes pass --use_gpu/--no-use_gpu from params.use_gpu.
  • nextflow.config: withName: "fix_focal_curvature" gets maxForks = params.use_gpu ? 4 : null.
  • withName: "resample_mosaic_grid": maxForks = params.use_gpu ? 6 : null (measured ~1 GB GPU mem per fork; IO-gated).
  • _run_pipelined: prefetch + GPU compute pipeline; periodic free of cupy memory pool.
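
For readers unfamiliar with the prefetch + compute pattern named in the last item, a minimal sketch follows. It is not the `_run_pipelined` from this PR: the function name `run_pipelined`, its parameters, and the default values are assumptions for illustration. The pool-freeing step is passed in as a callback so the example runs without a GPU; with CuPy one could pass `cupy.get_default_memory_pool().free_all_blocks`.

```python
import queue
import threading


def run_pipelined(items, load, compute, prefetch=2, free_every=4, free_pool=None):
    """Overlap loading with compute using a bounded queue.

    A background thread calls `load` ahead of the consumer, at most
    `prefetch` items in advance. Every `free_every` computed items,
    `free_pool` (if given) is called to release cached memory.
    """
    q = queue.Queue(maxsize=prefetch)
    sentinel = object()
    errors = []

    def producer():
        try:
            for item in items:
                q.put(load(item))  # blocks when the queue is full
        except Exception as exc:  # surfaced to the caller after join()
            errors.append(exc)
        finally:
            q.put(sentinel)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()

    results = []
    done = 0
    while True:
        data = q.get()
        if data is sentinel:
            break
        results.append(compute(data))
        done += 1
        if free_pool is not None and done % free_every == 0:
            free_pool()
    thread.join()
    if errors:
        raise errors[0]
    return results


if __name__ == "__main__":
    out = run_pipelined(range(10), load=lambda i: i * 2, compute=lambda x: x + 1)
    print(out)
```

The bounded queue caps how many loaded tiles are held at once, which is the knob that trades memory for I/O overlap.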

@FIrgolitsch force-pushed the pr-m-gpu-kvikio branch 2 times, most recently from ab73c15 to a6f274b (May 1, 2026 17:20)

Squashed pr-m-gpu-kvikio:
- linumpy.gpu.zarr_io: high-level read_zarr_to_gpu dispatcher (kvikio/GDS native -> zarr.config.enable_gpu fallback)
- linumpy.gpu.kvikio_zarr: kvikio-backed zarr->GPU reader using contiguous CuFile.pread scratch buffer
- GPU keep-on-device for resize/affine/map_coordinates; resample writer uses gpu_zarr_context
- linum_aip / linum_aip_png GPU pipelines stay on device end-to-end
- Device-aware find_tissue_interface, find_z_overlap; vectorized focal curvature roll
- Wire --use_gpu for focal_curvature + stack into nextflow
- Increase GPU resample maxForks; tune GPU memory management in _run_pipelined
- deps: bump ome-zarr to >=0.16.0 (NGFF 0.5)
- Server: nvidia-fs 2.28.4 patch script for kernel 7.0
- Benchmark script linum_benchmark_kvikio_zarr.py
- Bug fix: detect_focal_curvature broadcasts per-tile correction across tile positions
…n kernel 7.0

- create_nv.symvers.sh: case-insensitive grep for relative CRCs ('r' vs 'R'),
  needed because kernel 7.0 / open-gpu-kmd emits __crc_nvidia_p2p_* as
  section-local rodata. Without this, modversion fallback was skipped and
  bogus relocation offsets were written, producing an unloadable nvidia-fs.
- create_nv.symvers.sh: 'zstd -df --rm' (force) so reruns don't stall on the
  'overwrite (y/n)?' prompt when nvidia.ko already exists.
- nvfs-mmap.c patch: idempotent (skip if already applied).