feat(n4): GPU-accelerated N4 bias field correction#116
Open
FIrgolitsch wants to merge 10 commits into
Open
Conversation
This was referenced Apr 30, 2026
d941b23 to
df76137
Compare
d86f321 to
2063ff7
Compare
df76137 to
247fcc0
Compare
2063ff7 to
c737d27
Compare
247fcc0 to
e7894cc
Compare
c737d27 to
bed6f69
Compare
This was referenced May 1, 2026
n4_correct_gpu held mask_xp (full bool, ~9 GB on a 36 GB float32 volume) live through every fit iteration even though only the strided mask_small subview is used. Materialise vol_small / mask_small as contiguous copies, then drop mask_xp and free the GPU pool before entering the iteration loop. Frees ~9 GB of headroom that was being squeezed against the per-iter intermediates inside sharpen_residual.
…es downgrade Two related issues caused correct_bias_field --backend auto to run hours longer than the GPU benchmark suggested: 1) The script's --n_iterations defaulted to [50,50,50,50] (the CPU N4 default), which was passed verbatim to the GPU backend, overriding its own [25,25,25] default. The GPU N4 docstring explicitly warns that >20/level absorbs true tissue contrast, so this was both 2.7x slower AND wrong. Default to None so the backend picks. 2) n4_correct_per_section silently set n_processes=1 when GPU was selected. Now logs a warning so users see why their --n_processes is ignored.
…essure Two related issues caused dual-GPU and host-RAM pressure on the bias correction step: 1. The 'correct_bias_field' Nextflow process did not invoke Helpers.gpuPinBlock, so all GPUs stayed visible to the worker. The single PID was observed holding ~36 GB on each of two A6000s even though the GPU picker selected GPU 0. Pinning via CUDA_VISIBLE_DEVICES eliminates the second-GPU allocation regardless of which sub-call leaked onto it. 2. The script kept vol_ps + bias_ps alive through the global pass, then materialised bias_ps * bias_global as a third full-size array. On a 36 GB float32 mosaic that is ~144 GB of redundant host RAM. Drop the per-section aliases once their owning references move into bias_field_combined / working_vol, and combine biases in place.
…ions Profiling the GPU N4 path against the synthetic-phantom benchmark numbers in docs/N4_GPU.md flagged two hot per-iteration allocations that the benchmark amortises but production fits do not: 1. bspline_fit rebuilt M_z2/M_y2/M_x2/M_z3/M_y3/M_x3 and the separable denominator S = (sum_c M_z[z,c]^2)*(sum_c M_y[y,c]^2)*(sum_c M_x[x,c]^2) every iteration. S is a full small_vol-sized float32 array (~130 MB at shrink=4 on a stacked OCT mosaic). With n_iter=[25,25,25] that is 75 redundant alloc/free cycles plus six basis-power multiplies per iter, all on quantities that depend only on the bases (which the N4 driver already holds level-constant). 2. sharpen_residual cast mask_xp.astype(float32) every iteration to feed the weighted bincount, allocating another full small_vol-sized array each time. Add a bspline_fit_precompute helper that returns the seven invariants for a given bases tuple. n4_correct_gpu now builds it once before the fit loop, reuses the existing mask->float32 weights array as the sharpen mask weights, and threads both into bspline_fit / sharpen_residual via new optional kwargs. Equivalency tests still pass.
Continues the bspline_fit_precompute work (5bf20af) by squeezing the remaining per-iteration overheads out of n4_correct_gpu's fit loop. 1. bspline_fit_precompute now folds the eps clamp into the precomputed denominator (S_safe = maximum(S, eps)), so bspline_fit drops one full small_vol-sized maximum per iter. 2. Fit loop builds the residual in place (current -= sharpened; current *= weights), passes mask=None to bspline_fit (weights already encodes the mask), and accumulates with log_bias += update / coeff_total += coeffs. 3. Convergence check now does a single device-side comparison plus one bool() sync per iter instead of two float(xp.linalg.norm(...)) D2H round-trips with two sqrt ops. Equivalency tests still pass.
e7894cc to
e92a3b7
Compare
bed6f69 to
9737d06
Compare
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR #109 — N4 Bias Field Correction (CPU + GPU)
Adds N4 bias-field correction with CPU/GPU backends, replacing the previous z-intensity normalization.
Library
linumpy/intensity/bias_field.py— high-level APIlinumpy/intensity/normalization.py— extended for N4 backend, plusfix(normalize): zero agarose per slice while preserving inter-slice brightness(per-slice subtraction of the agarose-median floor pins background voxels to exactly 0 in the normalized output, restoring pre-n4 visualization behavior; global divisor preserves the 2:1 inter-section brightness invariant)linumpy/gpu/{bias_field,bspline,n4}.py— GPU kernels (CuPy)linumpy/gpu/normalization.py(superseded)Scripts
scripts/linum_correct_bias_field.py— replaceslinum_normalize_z_intensityscripts/diagnostics/linum_benchmark_n4_gpu.pyscripts/diagnostics/linum_n4_gpu_visual_compare.pyTests
linumpy/tests/test_bias_field*.py,test_gpu_{bspline,n4}.pylinumpy/tests/test_n4_gpu_{equivalency,perf}.pytest_intensity_normalization.pywithtest_normalize_volume_agarose_floor_at_zeroWorkflow integration
workflows/reconst_3d/soct_3d_reconst.nf: addscorrect_bias_fieldprocessworkflows/reconst_3d/nextflow.config: N4 parametersBuild
pyproject.toml: entry-point swap (linum_normalize_z_intensity→linum_correct_bias_field),scipy-stubs/networkx-stubsdev deps, lint exemptions forlinumpy/gpu/*and N4 diagnostic scripts