perf: cap PyTorch threads, bump fix_illumination forks, skip intermediate pyramids #113

Open
FIrgolitsch wants to merge 6 commits into pr-m-gpu-kvikio from pr-n-perf

@FIrgolitsch FIrgolitsch commented Apr 30, 2026

Stacked PR 14/22. Review order: #115 → #97 → #98 → #99 → #100 → #101 → #108 → #106 → #107 → #87 → #116 → #110 → #111 → #40 → #112 → #113 → #117 → #118 → #120 → #121 → #122 → #123 → #124 → #125

Base: pr-m-gpu-kvikio. Retargets to main as upstream PRs merge.


PR — Pipeline performance tuning

Three measurement-driven perf improvements to the reconst_3d Nextflow pipeline:

  1. fix_illumination thread cap. linum_fix_illumination_3d.py now calls configure_all_libraries() to cap PyTorch threads, preventing oversubscription when running multiple forks.
  2. fix_illumination maxForks 1 → 4. Measured: BaSiCPy/PyTorch uses ~374 MiB per fork; 4 forks ≈ 1.5 GB, well under per-GPU capacity.
  3. Skip pyramids on intermediate ome-zarr outputs. --n_levels 0 on intermediate steps avoids wasted multiscale generation that's only needed on final outputs.
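
The thread cap in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the body of `configure_all_libraries()`, which this description does not show. The helper names `threads_per_fork` and `cap_threads`, the list of environment variables, and the policy of dividing cores evenly across forks are all hypothetical.

```python
import os

# Environment variables commonly read by OpenMP/BLAS backends.
# Assumption: which of these matter depends on how NumPy/PyTorch were built.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def threads_per_fork(total_cpus: int, max_forks: int) -> int:
    """Split the available cores evenly across concurrent forks (at least 1 each)."""
    if max_forks < 1:
        raise ValueError("max_forks must be >= 1")
    return max(1, total_cpus // max_forks)


def cap_threads(n_threads: int) -> None:
    """Cap native thread pools so parallel forks do not oversubscribe the CPU."""
    for var in THREAD_ENV_VARS:
        os.environ[var] = str(n_threads)
    try:
        import torch  # optional: only capped when PyTorch is installed
    except ImportError:
        return
    torch.set_num_threads(n_threads)  # intra-op CPU thread pool


if __name__ == "__main__":
    n = threads_per_fork(os.cpu_count() or 1, max_forks=4)
    cap_threads(n)
    print(f"capped to {n} threads per fork")
```

The environment variables generally only take effect if they are set before the numerical libraries are first imported, so a cap like this belongs at the very top of the script.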

Squash of resample/GPU/workflow work on dev:
- Round-robin GPU assignment for resample_mosaic_grid (linumpy.gpu)
- Per-gpu active-slot counters via Helpers.gpuPinBlock for fair maxForks balance
- CUDA device pin in prefetch worker thread + workflow CUDA_VISIBLE_DEVICES
- Keeps simple per-tile CPU-read -> GPU-rescale -> CPU-write path
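
As a rough illustration of the round-robin assignment listed above, the sketch below maps a task index to a GPU and pins the process to it. It is not the `linumpy.gpu` or `Helpers.gpuPinBlock` implementation, neither of which is shown here; `assign_gpu` and `pin_to_gpu` are hypothetical names, and the per-GPU active-slot counters are not modelled.

```python
import os


def assign_gpu(task_index: int, n_gpus: int) -> int:
    """Round-robin: task i goes to GPU i modulo the number of GPUs."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be >= 1")
    return task_index % n_gpus


def pin_to_gpu(gpu_id: int) -> None:
    """Restrict this process to a single device via CUDA_VISIBLE_DEVICES."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)


if __name__ == "__main__":
    # Five tasks spread over two GPUs.
    print([assign_gpu(i, 2) for i in range(5)])
```

`CUDA_VISIBLE_DEVICES` has to be set before the CUDA runtime is initialised in the process, otherwise the pin has no effect.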

Heavy downsampling of small axes (e.g. Z=5 by a factor of 0.1) used to round
the output size to zero, producing a zero-sized output and a ZeroDivisionError
downstream. Clamp output_shape to a minimum of 1 per axis in both rescale()
and out_tile_shape.
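
The clamp can be illustrated with a small standalone function. This is a sketch only: `scaled_shape` is a hypothetical name, and the use of `round()` is an assumption, since the actual rounding inside `rescale()` is not shown here.

```python
from typing import Sequence, Tuple


def scaled_shape(shape: Sequence[int], factors: Sequence[float]) -> Tuple[int, ...]:
    """Scale each axis and clamp the result to at least 1 voxel."""
    if len(shape) != len(factors):
        raise ValueError("shape and factors must have the same length")
    # Without max(1, ...), a small axis such as Z=5 scaled by 0.1 rounds to 0.
    return tuple(max(1, round(dim * f)) for dim, f in zip(shape, factors))


if __name__ == "__main__":
    print(scaled_shape((5, 100, 100), (0.1, 0.5, 0.5)))
```

With the clamp, the Z axis in the example keeps a single slice instead of collapsing to an empty array.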