perf: pipeline-wide GPU + scratch optimizations #120

Open: FIrgolitsch wants to merge 4 commits into pr-p-py314-modernization from pr-q-perf-pipeline

@FIrgolitsch FIrgolitsch commented May 20, 2026

Stacked PR 17/22. Review order: #115, #97, #98, #99, #100, #101, #108, #106, #107, #87, #116, #110, #111, #40, #112, #113, #117, #118, #120, #121, #122, #123, #124, #125

Base: pr-p-py314-modernization. Retargets to main as upstream PRs merge.


Performance: pipeline-wide GPU + scratch optimizations

Stacked on #118 (pr-p-py314-modernization). Bundles the Nextflow scratch_dir config and several GPU performance fixes that touch resample + N4.

Commits

  • scratch dir config — adds scratch_dir Nextflow param so processes can stage IO on local NVMe instead of the work tree.
  • resample tile prefetch + leak-free GPU slot tracker (squashed: resample prefetch + reconst_3d resample prefetch) — hides PCIe latency by prefetching tiles while the GPU resamples the previous one; tightens slot-tracking so cancelled futures release their GPU slot.
  • N4 memory & D2H reductions (squashed: n4 astype + n4 D2H syncs + n4 host-mem peak + n4 ty appease) — keeps the bias-field iteration on-device, removes unnecessary host transfers, and trims peak host memory.
  • workflow symlink publishDir: publishDir now uses mode: 'symlink' for resample outputs so the run dir doesn't get duplicated on disk.

Docs gaps

scratch_dir is not yet mentioned in docs/NEXTFLOW_WORKFLOWS.md; that doc update will land as a follow-up so this chain matches dev exactly.

On hosts where /tmp is small (62 GiB on the lab server) and lives on the same NVMe as the work directory, Nextflow's default 'scratch = true' fills /tmp with staged zarr outputs and crashes with ENOSPC. Adds a 'scratch_dir' param (default true, preserving previous behaviour) so per-subject configs can set it to false (run in workdir, no double-write) or to a /scratch_nvme path (stage on faster local FS without /tmp pressure). Wired into reconst_3d and preproc workflows. Per-subject configs on the server have been updated to scratch_dir = false.
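The three accepted values of scratch_dir can be summarised in a small Python sketch. This is only an illustration of the semantics described in the commit message, not code from this PR: the function name resolve_stage_dir and the example paths are hypothetical, and the real wiring happens in the Nextflow config, which is not shown here.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def resolve_stage_dir(scratch_dir: bool | str, work_dir: Path) -> Path:
    """Return the directory a task would stage its IO in (illustrative only).

    Hypothetical helper mirroring the three scratch_dir settings:
      True  -> node-local temp dir (previous behaviour, the /tmp pressure case)
      False -> run directly in the task work dir, no double-write
      path  -> stage under that path, e.g. a local NVMe scratch mount
    """
    if scratch_dir is True:
        # tempfile.gettempdir() honours TMPDIR and falls back to /tmp on Linux.
        return Path(tempfile.gettempdir())
    if scratch_dir is False:
        return work_dir
    return Path(scratch_dir)


work = Path("/data/work/ab/1234")            # hypothetical task work dir
print(resolve_stage_dir(False, work))           # runs in the work dir
print(resolve_stage_dir("/scratch_nvme", work))  # stages on the scratch mount
```

The bool checks use `is` rather than truthiness so that a non-empty path string is never mistaken for `True`.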
Squashed from:
  7d81879 perf(resample): deepen tile prefetch (1->8) for ~3x speedup on sharded inputs
  3b2dc84 fix(reconst_3d): resample prefetch param + leak-free GPU slot tracker
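For reviewers unfamiliar with the pattern, the two ideas in this squash can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the code in these commits: prefetch, SlotTracker, and the load callable are hypothetical names, and plain threads stand in for the real tile reader and GPU work. The prefetcher keeps up to `depth` loads running ahead of the consumer, and the tracker releases a slot from a done-callback, which `concurrent.futures` fires on success, failure, and cancellation alike.

```python
from __future__ import annotations

import threading
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
T = TypeVar("T")


def prefetch(keys: Iterable[K], load: Callable[[K], T], depth: int = 8) -> Iterator[T]:
    """Yield load(key) in input order while up to `depth` loads run ahead."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending: deque[Future[T]] = deque()
        try:
            for key in keys:
                pending.append(pool.submit(load, key))
                if len(pending) > depth:
                    yield pending.popleft().result()
            while pending:
                yield pending.popleft().result()
        finally:
            # If the consumer stops early, drop loads that have not started.
            for fut in pending:
                fut.cancel()


class SlotTracker:
    """Bounded pool of slots; every submitted future frees its slot once."""

    def __init__(self, n_slots: int) -> None:
        self._sem = threading.BoundedSemaphore(n_slots)
        self._lock = threading.Lock()
        self.in_use = 0

    def submit(self, pool: ThreadPoolExecutor, fn: Callable[..., T], *args: object) -> Future[T]:
        self._sem.acquire()  # blocks until a slot is free
        with self._lock:
            self.in_use += 1
        try:
            fut = pool.submit(fn, *args)
        except BaseException:
            self._release(None)  # submission failed: give the slot back
            raise
        # Runs when the future finishes, raises, or is cancelled.
        fut.add_done_callback(self._release)
        return fut

    def _release(self, _fut: Future | None) -> None:
        with self._lock:
            self.in_use -= 1
        self._sem.release()
```

Tying the release to the future rather than to the task body is the design point: a task that is cancelled while still queued never runs its own cleanup, so a release inside the task function would leak the slot.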
Squashed from:
  856a974 fix(n4): avoid doubling GPU memory in correct_bias_field astype
  fcecf2e perf(n4): cut per-iter D2H syncs and host-side per-section copies
  b6c9d12 n4: cut host-memory peak in bias-field script (relieves kcompactd0)
  80e941f n4: appease ty in bias-field script (typed n4_kwargs + vol_for_blend)
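The astype fix in 856a974 is not shown on this page, but the general pitfall it names can be illustrated as follows. This sketch uses NumPy as a stand-in for whichever array library the N4 script uses (CuPy's `ndarray.astype` documents the same `copy` keyword), and as_float32 is a hypothetical helper, not a function from this repository. By default `astype` allocates a new array even when the dtype already matches, which briefly doubles the memory held by the volume; `copy=False` returns the input unchanged in that case.

```python
import numpy as np


def as_float32(vol: np.ndarray) -> np.ndarray:
    """Return `vol` as float32, reusing the buffer if it already is float32."""
    # copy=False only skips the copy when no conversion is needed;
    # other dtypes are still converted into a new float32 array.
    return vol.astype(np.float32, copy=False)


vol = np.zeros((4, 64, 64), dtype=np.float32)
same = as_float32(vol)             # no new allocation
copied = vol.astype(np.float32)    # default copy=True allocates a second volume

print(same is vol)                       # True
print(np.shares_memory(copied, vol))     # False
```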