Skip to content

feat(n4): GPU-accelerated N4 bias field correction#116

Open
FIrgolitsch wants to merge 10 commits into
pr-3-slice-interpolationfrom
pr-n4-bias-field
Open

feat(n4): GPU-accelerated N4 bias field correction#116
FIrgolitsch wants to merge 10 commits into
pr-3-slice-interpolationfrom
pr-n4-bias-field

Conversation

@FIrgolitsch

@FIrgolitsch FIrgolitsch commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Stacked PR 9/22 — review order: #115#97#98#99#100#101#108#106#107#87#116#110#111#40#112#113#117#118#120#121#122#123#124#125

Base: pr-3-slice-interpolation. Retargets to main as upstream PRs merge.


PR #109 — N4 Bias Field Correction (CPU + GPU)

Adds N4 bias-field correction with CPU/GPU backends, replacing the previous z-intensity normalization.

Library

  • linumpy/intensity/bias_field.py — high-level API
  • linumpy/intensity/normalization.py — extended for N4 backend, plus fix(normalize): zero agarose per slice while preserving inter-slice brightness (per-slice subtraction of the agarose-median floor pins background voxels to exactly 0 in the normalized output, restoring pre-n4 visualization behavior; global divisor preserves the 2:1 inter-section brightness invariant)
  • linumpy/gpu/{bias_field,bspline,n4}.py — GPU kernels (CuPy)
  • Removes linumpy/gpu/normalization.py (superseded)

Scripts

  • scripts/linum_correct_bias_field.py — replaces linum_normalize_z_intensity
  • scripts/diagnostics/linum_benchmark_n4_gpu.py
  • scripts/diagnostics/linum_n4_gpu_visual_compare.py

Tests

  • linumpy/tests/test_bias_field*.py, test_gpu_{bspline,n4}.py
  • linumpy/tests/test_n4_gpu_{equivalency,perf}.py
  • Extended test_intensity_normalization.py with test_normalize_volume_agarose_floor_at_zero

Workflow integration

  • workflows/reconst_3d/soct_3d_reconst.nf: adds correct_bias_field process
  • workflows/reconst_3d/nextflow.config: N4 parameters

Build

  • pyproject.toml: entry-point swap (linum_normalize_z_intensitylinum_correct_bias_field), scipy-stubs/networkx-stubs dev deps, lint exemptions for linumpy/gpu/* and N4 diagnostic scripts

@FIrgolitsch FIrgolitsch force-pushed the pr-3-slice-interpolation branch from d941b23 to df76137 Compare April 30, 2026 03:21
@FIrgolitsch FIrgolitsch force-pushed the pr-3-slice-interpolation branch from df76137 to 247fcc0 Compare April 30, 2026 03:26
@FIrgolitsch FIrgolitsch force-pushed the pr-3-slice-interpolation branch from 247fcc0 to e7894cc Compare April 30, 2026 03:51
n4_correct_gpu held mask_xp (full bool, ~9 GB on a 36 GB float32 volume) live through every fit iteration even though only the strided mask_small subview is used. Materialise vol_small / mask_small as contiguous copies, then drop mask_xp and free the GPU pool before entering the iteration loop. Frees ~9 GB of headroom that was being squeezed against the per-iter intermediates inside sharpen_residual.
…es downgrade

Two related issues caused correct_bias_field --backend auto to run hours longer than the GPU benchmark suggested:

1) The script's --n_iterations defaulted to [50,50,50,50] (the CPU N4 default), which was passed verbatim to the GPU backend, overriding its own [25,25,25] default. The GPU N4 docstring explicitly warns that >20/level absorbs true tissue contrast, so this was both 2.7x slower AND wrong. Default to None so the backend picks.

2) n4_correct_per_section silently set n_processes=1 when GPU was selected. Now logs a warning so users see why their --n_processes is ignored.
…essure

Two related issues caused dual-GPU and host-RAM pressure on the bias correction step:

1. The 'correct_bias_field' Nextflow process did not invoke Helpers.gpuPinBlock, so all GPUs stayed visible to the worker. The single PID was observed holding ~36 GB on each of two A6000s even though the GPU picker selected GPU 0. Pinning via CUDA_VISIBLE_DEVICES eliminates the second-GPU allocation regardless of which sub-call leaked onto it.

2. The script kept vol_ps + bias_ps alive through the global pass, then materialised bias_ps * bias_global as a third full-size array. On a 36 GB float32 mosaic that is ~144 GB of redundant host RAM. Drop the per-section aliases once their owning references move into bias_field_combined / working_vol, and combine biases in place.
…ions

Profiling the GPU N4 path against the synthetic-phantom benchmark numbers in docs/N4_GPU.md flagged two hot per-iteration allocations that the benchmark amortises but production fits do not:

1. bspline_fit rebuilt M_z2/M_y2/M_x2/M_z3/M_y3/M_x3 and the separable denominator S = (sum_c M_z[z,c]^2)*(sum_c M_y[y,c]^2)*(sum_c M_x[x,c]^2) every iteration. S is a full small_vol-sized float32 array (~130 MB at shrink=4 on a stacked OCT mosaic). With n_iter=[25,25,25] that is 75 redundant alloc/free cycles plus six basis-power multiplies per iter, all on quantities that depend only on the bases (which the N4 driver already holds level-constant).

2. sharpen_residual cast mask_xp.astype(float32) every iteration to feed the weighted bincount, allocating another full small_vol-sized array each time.

Add a bspline_fit_precompute helper that returns the seven invariants for a given bases tuple. n4_correct_gpu now builds it once before the fit loop, reuses the existing mask->float32 weights array as the sharpen mask weights, and threads both into bspline_fit / sharpen_residual via new optional kwargs. Equivalency tests still pass.
Continues the bspline_fit_precompute work (5bf20af) by squeezing the remaining per-iteration overheads out of n4_correct_gpu's fit loop.

1. bspline_fit_precompute now folds the eps clamp into the precomputed denominator (S_safe = maximum(S, eps)), so bspline_fit drops one full small_vol-sized maximum per iter.

2. Fit loop builds the residual in place (current -= sharpened; current *= weights), passes mask=None to bspline_fit (weights already encodes the mask), and accumulates with log_bias += update / coeff_total += coeffs.

3. Convergence check now does a single device-side comparison plus one bool() sync per iter instead of two float(xp.linalg.norm(...)) D2H round-trips with two sqrt ops.

Equivalency tests still pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant