perf(mem): drop resident 16-bit k-mer vector from Raw, recompute on overflow (#32) by cjfields · Pull Request #34 · HPCBio/dada2-rs

cjfields · 2026-06-11T02:37:42Z

Closes #32 (first lever).

Summary

Pooled dada peak RSS jumps from 28.8 GB (k5) to 53.8 GB (k7) on PacBio because it
keeps a Raw resident for every unique across all samples, and the 4^k k-mer
screen arrays grow 16× from k5→k7.

The exact 16-bit frequency vector (Raw.kmer, 4^k × 2 bytes = 32 KB/unique
at k7, the single largest k-mer term) is only read on the kmer_dist8 overflow
fallback — which needs the same k-mer occurring ≥255× in both sequences
(m = a.min(b); if m == 255), essentially never for amplicon data. So it doesn't
need to be resident.
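To make the overflow condition concrete, here is a minimal sketch of an 8-bit k-mer screen. It is not the repository's `kmer_dist8`: the names `count_kmers_u8`, `kmer_index`, and `kmer_dist8_sketch`, the `Option` return for overflow, and the handling of non-ACGT bytes and too-short sequences are all assumptions made for illustration. Only the `m = a.min(b); if m == 255` test is taken from the description above.

```rust
/// Map a k-mer window over ACGT to its index in a 4^k table.
/// Returns None for windows containing any other byte (an assumption).
fn kmer_index(window: &[u8]) -> Option<usize> {
    let mut idx = 0usize;
    for &base in window {
        let code = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        idx = (idx << 2) | code;
    }
    Some(idx)
}

/// Count k-mers into saturating u8 bins: a bin that reaches 255 stays at 255.
fn count_kmers_u8(seq: &[u8], k: usize) -> Vec<u8> {
    let mut counts = vec![0u8; 1usize << (2 * k)];
    if k == 0 || seq.len() < k {
        return counts;
    }
    for window in seq.windows(k) {
        if let Some(idx) = kmer_index(window) {
            counts[idx] = counts[idx].saturating_add(1);
        }
    }
    counts
}

/// 8-bit screen distance. Returns None when some bin is saturated in BOTH
/// vectors (min == 255), because the shared count is then unknown.
fn kmer_dist8_sketch(a: &[u8], len_a: usize, b: &[u8], len_b: usize, k: usize) -> Option<f64> {
    let n_kmers = (len_a.min(len_b) + 1).saturating_sub(k);
    if n_kmers == 0 {
        return Some(1.0); // no k-mers to compare; treated as maximally distant here
    }
    let mut shared: u32 = 0;
    for (&x, &y) in a.iter().zip(b.iter()) {
        let m = x.min(y);
        if m == 255 {
            return None; // overflow: the caller needs exact counts
        }
        shared += m as u32;
    }
    Some(1.0 - shared as f64 / n_kmers as f64)
}
```

In this sketch a saturated bin in only one of the two sequences is harmless, since the minimum is still exact; the fallback is needed only when both sides hit 255 for the same k-mer.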

Change

- Remove `kmer: Option<Vec<u16>>` from `Raw`; `raw_assign_kmers` now builds only `kmer8` (the u8 screen) and `kord`.
- On the rare overflow path, recompute the u16 vectors from `seq` in `raw_align_with_buf`.
- The no-8-bit-vectors branch (cluster-center raws in `build_trans_mat`) keeps `kdist = 0.0` exactly as before. `kmer8` and the old u16 `kmer` were always absent together, so this preserves "no screen when vectors absent."
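The three behaviours listed above (no screen when vectors are absent, the normal u8 screen, and recompute-on-overflow) can be sketched as follows. This is an illustration under stated assumptions, not the code in `raw_align_with_buf`: `RawSketch`, `count_kmers_u16`, and `screen_kdist` are hypothetical names, and the distance formula and non-ACGT handling are simplified.

```rust
/// Hypothetical stand-in for `Raw` after the change: the sequence plus the
/// optional u8 screen. There is no resident u16 vector.
struct RawSketch {
    seq: Vec<u8>,
    kmer8: Option<Vec<u8>>,
}

/// Exact k-mer counts in u16 bins (4^k entries). Windows containing a
/// non-ACGT byte are skipped, which is an assumption of this sketch.
fn count_kmers_u16(seq: &[u8], k: usize) -> Vec<u16> {
    let mut counts = vec![0u16; 1usize << (2 * k)];
    if k == 0 || seq.len() < k {
        return counts;
    }
    'window: for window in seq.windows(k) {
        let mut idx = 0usize;
        for &base in window {
            let code = match base {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => continue 'window,
            };
            idx = (idx << 2) | code;
        }
        counts[idx] = counts[idx].saturating_add(1);
    }
    counts
}

impl RawSketch {
    /// Build a raw; `with_screen = false` models the cluster-center case.
    fn new(seq: &[u8], k: usize, with_screen: bool) -> Self {
        let kmer8: Option<Vec<u8>> = with_screen.then(|| {
            count_kmers_u16(seq, k).iter().map(|&c| c.min(255) as u8).collect()
        });
        RawSketch { seq: seq.to_vec(), kmer8 }
    }
}

/// Screen distance: 0.0 when either u8 vector is absent, the u8 result
/// normally, and an exact u16 recompute from `seq` when a bin overflows.
fn screen_kdist(a: &RawSketch, b: &RawSketch, k: usize) -> f64 {
    let (ka, kb) = match (&a.kmer8, &b.kmer8) {
        (Some(ka), Some(kb)) => (ka, kb),
        _ => return 0.0, // vectors absent: no screen
    };
    let n_kmers = (a.seq.len().min(b.seq.len()) + 1).saturating_sub(k);
    if n_kmers == 0 {
        return 1.0; // no k-mers to compare (assumed convention)
    }
    let mut shared = 0u64;
    let mut overflow = false;
    for (&x, &y) in ka.iter().zip(kb.iter()) {
        let m = x.min(y);
        if m == 255 {
            overflow = true;
            break;
        }
        shared += m as u64;
    }
    if overflow {
        // Rare path: rebuild the exact counts instead of keeping them resident.
        let (fa, fb) = (count_kmers_u16(&a.seq, k), count_kmers_u16(&b.seq, k));
        shared = fa.iter().zip(fb.iter()).map(|(&x, &y)| x.min(y) as u64).sum();
    }
    1.0 - shared as f64 / n_kmers as f64
}
```

The trade is two transient 4^k u16 allocations per overflowing comparison in exchange for not holding one per resident raw, which only pays off because the overflow branch is rarely taken.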

Correctness

Byte-identical screening result: the overflow path recomputes the same value the
stored vector held, and the missing-vector path is unchanged. Full suite green
(42 unit + 12 integration, incl. the dada snapshot and derep-JSON parity tests);
clippy + fmt clean.

Memory

Structural win — one 4^k-u16 vector dropped per resident Raw (32 KB at k7,
2 KB at k5). To be confirmed on the cluster: re-measure pooled k7 peak; expect
the k5→k7 gap to shrink by the dropped u16 term (kmer8 u8 + kord remain). Doesn't
touch ASVs or the dada hot loop, so wall should be flat (overflow recompute is
off the hot path).

🤖 Generated with Claude Code

…verflow (#32)

Pooled dada holds a Raw for every unique across all samples resident through the single inference, so the per-Raw k-mer vectors dominate peak RSS, and the jump from k5→k7 (28.8→53.8 GB on PacBio) is the 4^k screen arrays growing 16×.

The exact 16-bit frequency vector (`Raw.kmer`, 4^k × 2 bytes, i.e. 32 KB/unique at k7, the single largest k-mer term) is only ever read on the `kmer_dist8` overflow fallback, which requires the *same* k-mer occurring ≥255× in *both* sequences: essentially never for amplicon data. So stop storing it resident:

- Remove the `kmer: Option<Vec<u16>>` field from `Raw`; `raw_assign_kmers` now populates only `kmer8` (the u8 screen) and `kord`.
- On the rare overflow path, recompute the u16 vectors from `seq` in `raw_align_with_buf`.
- The no-8-bit-vectors branch (cluster-center raws) keeps kdist = 0.0 exactly as before; previously kmer8 and the u16 kmer were absent together.

Byte-identical: full suite green (42 unit + 12 integration, incl. dada snapshot and derep-JSON parity tests); clippy clean. Memory win to be confirmed on the cluster (pooled k7 peak; expect the k5→k7 gap to shrink by the dropped u16 term).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cjfields · 2026-06-11T17:04:56Z

Confirmed on the cluster (PacBio, 93 samples, `--pool true`, k=7, no R)

Self-vs-self, current main (pre) vs this branch (post), node-exclusive:

| metric (dada step) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 52.33 GB | 35.69 GB | −31.8% |
| learn peak | 2.65 GB | 1.79 GB | −32.7% |
| dada wall | 723.0 s | 682.2 s | −5.6% |
| dada CPU | 13186.9 s | 12865.6 s | −2.4% |
| dada cores | 18.24 | 18.86 | +3.4% |

The ~16.6 GB shed ≈ 32 KB (4^7 × 2 B) × ~508k pooled uniques, which matches the
dropped u16 vector. The bonus wall win (−5.6% dada, with CPU down and cores up) is
the signature of relieving memory-bandwidth pressure on the bloated resident set.
That also partially addresses #33: pooled core under-utilization was partly
bandwidth, not just the Amdahl serial fraction.

Concordance gate (pass/fail): PASS. compare_asvs.py pre-vs-post on the final
chimera-filtered table: 2082 ASVs vs 2082, shared=2082 only_pre=0 only_post=0 churn=0. Byte-identical at scale, including any overflow-path triggers.

Ready to merge.

cjfields · 2026-06-11T17:27:23Z

k5 control — confirms the `4^k` scaling and the bandwidth mechanism

Same setup (PacBio 93 samples, --pool true, k=5, no R), current main (pre) vs branch (post):

| metric (dada) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 28.72 GB | 27.59 GB | −3.9% |
| dada wall | 4227.1 s | 4217.7 s | −0.2% (flat) |
| dada cores | 22.77 | 22.76 | flat |

The u16 vector is 4^k × 2 B/unique = 2 KB (k5) vs 32 KB (k7), a 16× ratio.
Observed savings: −1.13 GB (k5) vs −16.6 GB (k7), a 14.7× ratio that is close to
the theoretical 16×, so the drop is consistent with being the removed vector.
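The arithmetic behind that comparison can be checked with a short sketch. The function names are hypothetical, the unique count of 508,000 is the approximate figure quoted in the k7 comment, and GB is assumed to mean 10^9 bytes (the thread does not say whether it is decimal or binary).

```rust
/// Bytes held by one 4^k-entry u16 count vector.
fn u16_vector_bytes(k: u32) -> u64 {
    4u64.pow(k) * 2
}

/// Expected saving in GB (assumed 1e9 bytes) from dropping one such vector
/// per resident unique.
fn expected_saving_gb(k: u32, uniques: u64) -> f64 {
    (u16_vector_bytes(k) * uniques) as f64 / 1e9
}

/// Ratio of the observed peak-RSS drops reported in the two tables.
fn observed_ratio() -> f64 {
    let k7_drop = 52.33 - 35.69; // GB, k7 pre minus post
    let k5_drop = 28.72 - 27.59; // GB, k5 pre minus post
    k7_drop / k5_drop
}
```

Under those assumptions the predicted savings come out to about 16.6 GB at k7 and about 1.04 GB at k5, against observed drops of 16.64 GB and 1.13 GB.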

The wall is flat at k5 because pre cores were already 22.77/24 — the 28.7 GB
resident set is below the bandwidth ceiling, so there's nothing to recover. The
k7 wall bonus (−5.6%) appeared only because k7's 52 GB set depressed cores to
18.2; relieving it recovered both. Wall improving only where cores were
depressed (and not where they were saturated) confirms the bonus is
bandwidth-relief, not a build-time CPU saving.

…profiling)

Non-algorithmic instrumentation for memory profiling. Fires once per fresh Raw build in dada_uniques_cached (not on learn-errors' cached-reuse iterations), reporting per-inference resident Raw footprint: seq+qual bytes and the k-mer screen vectors (kmer8 = 4^k bytes; kord = (len-k+1)×2). In pooled mode this is the whole resident set; in pseudo/per-sample it is one sample's share (× jobs for the peak).

Lets a --verbose run expose the k-mer footprint and the gap to peak RSS (non-Raw structures) without an external profiler.

Verified: k5 ≈ 1516 B/raw (1024 kmer8 + ~492 kord), k7 ≈ 16872 B/raw. This is the 16× 4^k growth that #34 addressed, now visible per run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
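The per-Raw k-mer footprint formula in that commit message can be written out as a small sketch. The function name is hypothetical, and the sequence length of 250 is an assumption chosen because it reproduces the quoted figures; the commit message does not state the length used for verification.

```rust
/// Bytes of k-mer screen data per Raw after the change:
/// kmer8 is 4^k one-byte bins, kord is (len - k + 1) two-byte entries.
fn kmer_screen_bytes(seq_len: usize, k: usize) -> usize {
    let kmer8 = 1usize << (2 * k);
    let kord = (seq_len + 1).saturating_sub(k) * 2;
    kmer8 + kord
}
```

With `seq_len = 250`, this gives 1024 + 492 = 1516 bytes at k5 and 16384 + 488 = 16872 bytes at k7, matching the values reported as verified.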

cjfields merged commit eecb057 into main Jun 11, 2026
5 checks passed

cjfields mentioned this pull request Jun 11, 2026

perf: pooled dada under-utilizes cores (serial bud/shuffle/p_update) — profile & parallelize #33

Open

cjfields mentioned this pull request Jun 11, 2026

perf/mem: shrink resident k-mer vectors in pooled dada — 4^k arrays dominate peak at k7 #32

Closed

cjfields merged 1 commit into main from perf/drop-resident-u16-kmer
