feat(n4): GPU-accelerated N4 bias field correction by FIrgolitsch · Pull Request #116 · linum-uqam/linumpy

FIrgolitsch · 2026-04-30T03:01:48Z

Stacked PR 9/22 — review order: #115 → #97 → #98 → #99 → #100 → #101 → #108 → #106 → #107 → #87 → #116 → #110 → #111 → #40 → #112 → #113 → #117 → #118 → #120 → #121 → #122 → #123 → #124 → #125

Base: pr-3-slice-interpolation. Retargets to main as upstream PRs merge.

PR #109 — N4 Bias Field Correction (CPU + GPU)

Adds N4 bias-field correction with CPU/GPU backends, replacing the previous z-intensity normalization.

Library

linumpy/intensity/bias_field.py — high-level API
linumpy/intensity/normalization.py — extended for N4 backend, plus fix(normalize): zero agarose per slice while preserving inter-slice brightness (per-slice subtraction of the agarose-median floor pins background voxels to exactly 0 in the normalized output, restoring pre-n4 visualization behavior; global divisor preserves the 2:1 inter-section brightness invariant)
linumpy/gpu/{bias_field,bspline,n4}.py — GPU kernels (CuPy)
Removes linumpy/gpu/normalization.py (superseded)

Scripts

scripts/linum_correct_bias_field.py — replaces linum_normalize_z_intensity
scripts/diagnostics/linum_benchmark_n4_gpu.py
scripts/diagnostics/linum_n4_gpu_visual_compare.py

Tests

linumpy/tests/test_bias_field*.py, test_gpu_{bspline,n4}.py
linumpy/tests/test_n4_gpu_{equivalency,perf}.py
Extended test_intensity_normalization.py with test_normalize_volume_agarose_floor_at_zero

Workflow integration

workflows/reconst_3d/soct_3d_reconst.nf: adds correct_bias_field process
workflows/reconst_3d/nextflow.config: N4 parameters

Build

pyproject.toml: entry-point swap (linum_normalize_z_intensity → linum_correct_bias_field), scipy-stubs/networkx-stubs dev deps, lint exemptions for linumpy/gpu/* and N4 diagnostic scripts

…ting

n4_correct_gpu held mask_xp (full bool, ~9 GB on a 36 GB float32 volume) live through every fit iteration even though only the strided mask_small subview is used. Materialise vol_small / mask_small as contiguous copies, then drop mask_xp and free the GPU pool before entering the iteration loop. Frees ~9 GB of headroom that was being squeezed against the per-iter intermediates inside sharpen_residual.

…es downgrade Two related issues caused correct_bias_field --backend auto to run hours longer than the GPU benchmark suggested: 1) The script's --n_iterations defaulted to [50,50,50,50] (the CPU N4 default), which was passed verbatim to the GPU backend, overriding its own [25,25,25] default. The GPU N4 docstring explicitly warns that >20/level absorbs true tissue contrast, so this was both 2.7x slower AND wrong. Default to None so the backend picks. 2) n4_correct_per_section silently set n_processes=1 when GPU was selected. Now logs a warning so users see why their --n_processes is ignored.

…essure Two related issues caused dual-GPU and host-RAM pressure on the bias correction step: 1. The 'correct_bias_field' Nextflow process did not invoke Helpers.gpuPinBlock, so all GPUs stayed visible to the worker. The single PID was observed holding ~36 GB on each of two A6000s even though the GPU picker selected GPU 0. Pinning via CUDA_VISIBLE_DEVICES eliminates the second-GPU allocation regardless of which sub-call leaked onto it. 2. The script kept vol_ps + bias_ps alive through the global pass, then materialised bias_ps * bias_global as a third full-size array. On a 36 GB float32 mosaic that is ~144 GB of redundant host RAM. Drop the per-section aliases once their owning references move into bias_field_combined / working_vol, and combine biases in place.

…ions Profiling the GPU N4 path against the synthetic-phantom benchmark numbers in docs/N4_GPU.md flagged two hot per-iteration allocations that the benchmark amortises but production fits do not: 1. bspline_fit rebuilt M_z2/M_y2/M_x2/M_z3/M_y3/M_x3 and the separable denominator S = (sum_c M_z[z,c]^2)*(sum_c M_y[y,c]^2)*(sum_c M_x[x,c]^2) every iteration. S is a full small_vol-sized float32 array (~130 MB at shrink=4 on a stacked OCT mosaic). With n_iter=[25,25,25] that is 75 redundant alloc/free cycles plus six basis-power multiplies per iter, all on quantities that depend only on the bases (which the N4 driver already holds level-constant). 2. sharpen_residual cast mask_xp.astype(float32) every iteration to feed the weighted bincount, allocating another full small_vol-sized array each time. Add a bspline_fit_precompute helper that returns the seven invariants for a given bases tuple. n4_correct_gpu now builds it once before the fit loop, reuses the existing mask->float32 weights array as the sharpen mask weights, and threads both into bspline_fit / sharpen_residual via new optional kwargs. Equivalency tests still pass.

Continues the bspline_fit_precompute work (5bf20af) by squeezing the remaining per-iteration overheads out of n4_correct_gpu's fit loop. 1. bspline_fit_precompute now folds the eps clamp into the precomputed denominator (S_safe = maximum(S, eps)), so bspline_fit drops one full small_vol-sized maximum per iter. 2. Fit loop builds the residual in place (current -= sharpened; current *= weights), passes mask=None to bspline_fit (weights already encodes the mask), and accumulates with log_bias += update / coeff_total += coeffs. 3. Convergence check now does a single device-side comparison plus one bool() sync per iter instead of two float(xp.linalg.norm(...)) D2H round-trips with two sqrt ops. Equivalency tests still pass.

…lidates cache)

FIrgolitsch force-pushed the pr-3-slice-interpolation branch from d941b23 to df76137 Compare April 30, 2026 03:21

FIrgolitsch force-pushed the pr-n4-bias-field branch from d86f321 to 2063ff7 Compare April 30, 2026 03:21

FIrgolitsch force-pushed the pr-3-slice-interpolation branch from df76137 to 247fcc0 Compare April 30, 2026 03:26

FIrgolitsch force-pushed the pr-n4-bias-field branch from 2063ff7 to c737d27 Compare April 30, 2026 03:26

FIrgolitsch force-pushed the pr-3-slice-interpolation branch from 247fcc0 to e7894cc Compare April 30, 2026 03:51

FIrgolitsch force-pushed the pr-n4-bias-field branch from c737d27 to bed6f69 Compare April 30, 2026 03:51

This was referenced May 1, 2026

fix(galvo): per-tile detect + threaded column strips + --skip_tiles for manual overrides #117

Open

chore: bump python floor to 3.14, modernize CI/build, ruff FURB sweep #118

Open

FIrgolitsch added 6 commits May 20, 2026 12:19

feat(n4): GPU-accelerated N4 bias field correction (#109)

c445124

test(gpu/bspline): exercise PSDB primitive via iterative residual fit…

41361b5

…ting

docs(n4): correct GPU B-spline description (PSDB) and test name

71db93d

FIrgolitsch added 4 commits May 20, 2026 12:19

n4: add out=/bias_out= to n4_correct_gpu for in-place global pass

5b021a7

workflow: --verbose for correct_bias_field (surface N4 progress; inva…

9737d06

…lidates cache)

FIrgolitsch force-pushed the pr-3-slice-interpolation branch from e7894cc to e92a3b7 Compare May 20, 2026 16:22

FIrgolitsch force-pushed the pr-n4-bias-field branch from bed6f69 to 9737d06 Compare May 20, 2026 16:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(n4): GPU-accelerated N4 bias field correction#116

feat(n4): GPU-accelerated N4 bias field correction#116
FIrgolitsch wants to merge 10 commits into
pr-3-slice-interpolationfrom
pr-n4-bias-field

FIrgolitsch commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FIrgolitsch commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #109 — N4 Bias Field Correction (CPU + GPU)

Library

Scripts

Tests

Workflow integration

Build

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FIrgolitsch commented Apr 30, 2026 •

edited

Loading