perf: pipeline-wide GPU + scratch optimizations by FIrgolitsch · Pull Request #120 · linum-uqam/linumpy

FIrgolitsch · 2026-05-20T16:26:47Z

Stacked PR 17/22 — review order: #115 → #97 → #98 → #99 → #100 → #101 → #108 → #106 → #107 → #87 → #116 → #110 → #111 → #40 → #112 → #113 → #117 → #118 → #120 → #121 → #122 → #123 → #124 → #125

Base: pr-p-py314-modernization. Retargets to main as upstream PRs merge.

Performance: pipeline-wide GPU + scratch optimizations

Stacked on #118 (pr-p-py314-modernization). Bundles the Nextflow scratch_dir config and several GPU performance fixes that touch resample + N4.

Commits

scratch dir config — adds scratch_dir Nextflow param so processes can stage IO on local NVMe instead of the work tree.
resample tile prefetch + leak-free GPU slot tracker (squashed: resample prefetch + reconst_3d resample prefetch) — hides PCIe latency by prefetching tiles while the GPU resamples the previous one; tightens slot-tracking so cancelled futures release their GPU slot.
N4 memory & D2H reductions (squashed: n4 astype + n4 D2H syncs + n4 host-mem peak + n4 ty appease) — keeps the bias-field iteration on-device, removes unnecessary host transfers, and trims peak host memory.
workflow symlink publishDir — publishDir now uses mode: 'symlink' for resample outputs so the run dir doesn't get duplicated on disk.
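The slot-tracking fix in the second commit can be illustrated with a minimal sketch. It assumes "leak-free" means each slot is returned through a future done-callback, which Python runs when a future finishes, fails, or is cancelled. `SlotTracker`, `submit_with_slot`, and `demo` are hypothetical names for illustration, not linumpy's API.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class SlotTracker:
    """Hands out a fixed number of slot ids (hypothetical helper)."""

    def __init__(self, n_slots: int) -> None:
        self._free = list(range(n_slots))
        self._cond = threading.Condition()

    def acquire(self) -> int:
        # Block until a slot is available, then hand it out.
        with self._cond:
            while not self._free:
                self._cond.wait()
            return self._free.pop()

    def release(self, slot: int) -> None:
        with self._cond:
            self._free.append(slot)
            self._cond.notify()

    def free_count(self) -> int:
        with self._cond:
            return len(self._free)


def submit_with_slot(pool, tracker, fn, *args) -> Future:
    """Submit fn(slot, *args) and return the slot however the future ends."""
    slot = tracker.acquire()
    try:
        fut = pool.submit(fn, slot, *args)
    except BaseException:
        tracker.release(slot)  # submit itself failed, so no callback will run
        raise
    # Done callbacks fire on success, on exception, and on cancel().
    fut.add_done_callback(lambda _f: tracker.release(slot))
    return fut


def demo():
    tracker = SlotTracker(2)
    gate = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first job occupies the only worker until the gate opens.
        submit_with_slot(pool, tracker, lambda slot: gate.wait(5))
        # The second job stays queued, so it can still be cancelled.
        queued = submit_with_slot(pool, tracker, lambda slot: slot)
        queued.cancel()
        after_cancel = tracker.free_count()
        gate.set()
    return after_cancel, tracker.free_count()
```

In `demo`, the cancelled job never runs, yet its slot is back in the pool as soon as `cancel()` returns. Releasing the slot only at the end of the task body would leak it in that case.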

Docs gaps

scratch_dir is not yet mentioned in docs/NEXTFLOW_WORKFLOWS.md; that doc update will land as a follow-up so this chain matches dev exactly.

On hosts where /tmp is small (62 GiB on the lab server) and lives on the same NVMe as the work directory, Nextflow's default 'scratch = true' fills /tmp with staged zarr outputs and crashes with ENOSPC. Adds a 'scratch_dir' param (default true, preserving previous behaviour) so per-subject configs can set it to false (run in workdir, no double-write) or to a /scratch_nvme path (stage on faster local FS without /tmp pressure). Wired into reconst_3d and preproc workflows. Per-subject configs on the server have been updated to scratch_dir = false.
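The three settings described above (true, false, or a path) could be chosen per host with a small check like the sketch below. `suggest_scratch_dir` and its `min_free_gib` threshold are hypothetical and not part of this PR; the sketch only encodes the reasoning in the commit message: prefer a dedicated fast directory, and disable scratch when the temporary directory shares a device with the work directory or is short on space.

```python
import os
import shutil
from pathlib import Path


def suggest_scratch_dir(work_dir, tmp_dir, fast_dir=None, min_free_gib=100.0):
    """Return a candidate scratch_dir value: a path, False, or True.

    Hypothetical helper; the 100 GiB default threshold is an arbitrary assumption.
    """
    # A dedicated local filesystem avoids pressure on the temporary directory.
    if fast_dir is not None and Path(fast_dir).is_dir():
        return str(fast_dir)
    # Same device: staging only writes the outputs twice to the same disk.
    same_device = os.stat(work_dir).st_dev == os.stat(tmp_dir).st_dev
    free_gib = shutil.disk_usage(tmp_dir).free / 2**30
    if same_device or free_gib < min_free_gib:
        return False  # run in the work directory
    return True  # keep the default scratch behaviour
```

On a host like the lab server described above, where /tmp and the work directory share one NVMe, this returns `False`, matching the value set in the per-subject configs.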

Squashed from:
7d81879 perf(resample): deepen tile prefetch (1->8) for ~3x speedup on sharded inputs
3b2dc84 fix(reconst_3d): resample prefetch param + leak-free GPU slot tracker
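A deeper prefetch of the kind named in the first squashed commit can be sketched as a generator that keeps several tile reads in flight while the consumer works on the current tile. `prefetched` and `load` are hypothetical names, and the thread pool is an assumption about how the reads are overlapped; only the depth of 8 comes from the commit title.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

_DONE = object()  # sentinel marking the end of the key iterator


def prefetched(load, keys, depth=8):
    """Yield load(key) in key order, keeping up to `depth` loads in flight."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    it = iter(keys)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the pipeline with the first `depth` reads.
        pending = deque(pool.submit(load, k) for k in islice(it, depth))
        while pending:
            tile = pending.popleft().result()
            # Top up before yielding so reads overlap with the consumer's work.
            nxt = next(it, _DONE)
            if nxt is not _DONE:
                pending.append(pool.submit(load, nxt))
            yield tile
```

A consumer would iterate `for tile in prefetched(read_tile, tile_keys): ...` and resample each tile as it arrives, with `read_tile` standing in for whatever reads one tile from storage. With `depth=1` this degenerates to a single read ahead.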

Squashed from:
856a974 fix(n4): avoid doubling GPU memory in correct_bias_field astype
fcecf2e perf(n4): cut per-iter D2H syncs and host-side per-section copies
b6c9d12 n4: cut host-memory peak in bias-field script (relieves kcompactd0)
80e941f n4: appease ty in bias-field script (typed n4_kwargs + vol_for_blend)
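One common way an `astype` call doubles memory is its default `copy=True`, which allocates a second buffer even when the array already has the target dtype. The sketch below shows the effect with NumPy on the host. Whether this is the exact change made in `correct_bias_field` is an assumption; CuPy documents the same `copy` parameter for device arrays. `as_float32` is a hypothetical helper.

```python
import numpy as np


def as_float32(volume: np.ndarray) -> np.ndarray:
    """Return float32 data, allocating a new buffer only if a conversion is needed."""
    # copy=False returns the input array itself when the dtype already matches.
    return volume.astype(np.float32, copy=False)


already_f32 = np.ones((4, 4), dtype=np.float32)
raw_u16 = np.ones((4, 4), dtype=np.uint16)

same = as_float32(already_f32)   # no new allocation
converted = as_float32(raw_u16)  # conversion still requires a new buffer
```

For a volume that is already float32, this keeps a single buffer alive instead of two.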

FIrgolitsch added 4 commits May 20, 2026 12:19

perf: resample tile prefetch + leak-free GPU slot tracker

2ca247f

workflow: always symlink stack/correct_bias_field publishDir (avoid resume cache hazard)

fcc793e

FIrgolitsch added a commit that referenced this pull request May 20, 2026

docs: catch up reference docs for new chain params (#120,#122-#125)

44cad89

FIrgolitsch wants to merge 4 commits into pr-p-py314-modernization from pr-q-perf-pipeline
