perf(mem): drop resident 16-bit k-mer vector from Raw, recompute on overflow (#32) #34
…verflow (#32)

Pooled dada holds a Raw for every unique across all samples resident through the single inference, so the per-Raw k-mer vectors dominate peak RSS, and the jump from k5→k7 (28.8→53.8 GB on PacBio) is the 4^k screen arrays growing 16×.

The exact 16-bit frequency vector (`Raw.kmer`, 4^k × 2 bytes, i.e. 32 KB/unique at k7, the single largest k-mer term) is only ever read on the `kmer_dist8` overflow fallback, which requires the *same* k-mer occurring ≥255× in *both* sequences. That essentially never happens for amplicon data. So stop storing it resident:

- Remove the `kmer: Option<Vec<u16>>` field from `Raw`; `raw_assign_kmers` now populates only `kmer8` (the u8 screen) and `kord`.
- On the rare overflow path, recompute the u16 vectors from `seq` in `raw_align_with_buf`.
- The no-8-bit-vectors branch (cluster-center raws) keeps kdist = 0.0 exactly as before; previously kmer8 and the u16 kmer were absent together.

Byte-identical: full suite green (42 unit + 12 integration, incl. dada snapshot and derep-JSON parity tests); clippy clean. Memory win to be confirmed on the cluster (pooled k7 peak; expect the k5→k7 gap to shrink by the dropped u16 term).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Confirmed on the cluster (PacBio, 93 samples,
| metric (dada step) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 52.33 GB | 35.69 GB | −31.8% |
| learn peak | 2.65 GB | 1.79 GB | −32.7% |
| dada wall | 723.0 s | 682.2 s | −5.6% |
| dada CPU | 13186.9 s | 12865.6 s | −2.4% |
| dada cores | 18.24 | 18.86 | +3.4% |
The ~16.6 GB shed ≈ 32 KB (4^7×2) × ~508k pooled uniques, which matches the dropped
u16 vector. The bonus wall win (−5.6% dada, CPU down and cores up) is the
signature of relieving memory-bandwidth pressure on the bloated resident set,
which also partially addresses #33 (pooled core under-utilization was partly
bandwidth, not just the Amdahl serial fraction).
Concordance gate (pass/fail): PASS. compare_asvs.py pre-vs-post on the final
chimera-filtered table: 2082 ASVs vs 2082, shared=2082 only_pre=0 only_post=0 churn=0. Byte-identical at scale, including any overflow-path triggers.
Ready to merge.
k5 control — confirms the
| metric (dada) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 28.72 GB | 27.59 GB | −3.9% |
| dada wall | 4227.1 s | 4217.7 s | −0.2% (flat) |
| dada cores | 22.77 | 22.76 | flat |
The u16 vector is 4^k × 2 B/unique = 2 KB (k5) vs 32 KB (k7), a 16× ratio.
Observed savings: −1.13 GB (k5) vs −16.6 GB (k7), a 14.7× ratio. That is close to the
theoretical 16×, so the drop is consistent with being the removed vector and nothing else.
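The arithmetic behind that comparison can be reproduced with a small sketch. The helper name `u16_vector_bytes` is hypothetical (it is not a function in this PR), and the unique count is the approximate ~508k figure quoted for the k7 run; the peak-RSS values are copied from the two tables.

```rust
/// Bytes in one exact u16 k-mer frequency vector: 4^k entries × 2 bytes.
/// (Hypothetical helper for illustration only.)
fn u16_vector_bytes(k: u32) -> u64 {
    4u64.pow(k) * 2
}

fn main() {
    let (k5, k7) = (u16_vector_bytes(5), u16_vector_bytes(7));
    println!("k5: {} B/unique, k7: {} B/unique, ratio {}x", k5, k7, k7 / k5);

    // ~508k pooled uniques is the approximate figure quoted for the k7 run.
    let uniques: u64 = 508_000;
    let expected_gb = (k7 * uniques) as f64 / 1e9;
    println!("expected k7 saving: {:.2} GB", expected_gb);

    // Observed peak-RSS deltas (GB) taken from the k7 and k5 tables.
    let observed_ratio = (52.33 - 35.69) / (28.72 - 27.59);
    println!("observed k7/k5 saving ratio: {:.1}x", observed_ratio);
}
```

The observed ratio lands slightly under 16×, which is what one would expect if the two runs do not hold exactly the same number of resident uniques.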
The wall is flat at k5 because pre cores were already 22.77/24 — the 28.7 GB
resident set is below the bandwidth ceiling, so there's nothing to recover. The
k7 wall bonus (−5.6%) appeared only because k7's 52 GB set depressed cores to
18.2; relieving it recovered both. Wall improving only where cores were
depressed (and not where they were saturated) confirms the bonus is
bandwidth-relief, not a build-time CPU saving.
…profiling)

Non-algorithmic instrumentation for memory profiling. Fires once per fresh Raw build in dada_uniques_cached (not on learn-errors' cached-reuse iterations), reporting per-inference resident Raw footprint: seq+qual bytes and the k-mer screen vectors (kmer8 = 4^k bytes; kord = (len-k+1)×2). In pooled mode this is the whole resident set; in pseudo/per-sample it is one sample's share (× jobs for the peak).

Lets a --verbose run expose the k-mer footprint and the gap to peak RSS (non-Raw structures) without an external profiler.

Verified: k5 ≈ 1516 B/raw (1024 kmer8 + ~492 kord), k7 ≈ 16872 B/raw. This is the 16× 4^k growth that #34 addressed, now visible per run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
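As a rough check of the verified figures, the two k-mer terms can be computed directly from the formulas in the commit message. This is only a sketch: `kmer_screen_bytes` is a hypothetical name, and the read length of 250 is an assumption chosen because it reproduces the quoted ~492-byte `kord` term at k5; the commit does not state the length used.

```rust
/// Bytes held by the k-mer screen structures of one Raw, per the formulas
/// in the commit message: kmer8 = 4^k bytes, kord = (len - k + 1) × 2 bytes.
/// Assumes len >= k. (Hypothetical helper, not the PR's instrumentation.)
fn kmer_screen_bytes(len: usize, k: usize) -> usize {
    let kmer8 = 1usize << (2 * k); // 4^k one-byte saturating counts
    let kord = (len - k + 1) * 2; // one u16 per k-mer position
    kmer8 + kord
}

fn main() {
    let len = 250; // assumed read length, see the note above
    for k in [5, 7] {
        println!("k{}: {} B/raw", k, kmer_screen_bytes(len, k));
    }
}
```

Under that assumption the sketch gives 1516 B at k5 and 16872 B at k7, matching the verified numbers; the `kmer8` term alone grows 16× while `kord` barely changes.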
Closes #32 (first lever).
Summary
Pooled `dada` peak RSS jumps 28.8 GB (k5) → 53.8 GB (k7) on PacBio because it holds a `Raw` for every unique across all samples resident, and the `4^k` k-mer screen arrays grow 16× from k5→k7.

The exact 16-bit frequency vector (`Raw.kmer`, `4^k × 2` bytes = 32 KB/unique at k7, the single largest k-mer term) is only read on the `kmer_dist8` overflow fallback, which needs the same k-mer occurring ≥255× in both sequences (`m = a.min(b); if m == 255`), essentially never for amplicon data. So it doesn't need to be resident.
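To make the overflow condition concrete, here is a minimal sketch of a saturating u8 k-mer screen. It is not the repository's `kmer_dist8`: the function bodies, the `Option` return type, and the distance formula (one minus the shared k-mer count over the shorter sequence's k-mer count, in the style of DADA2) are assumptions. Only the `m = a.min(b); if m == 255` check is taken from the description above.

```rust
/// Saturating u8 k-mer counts: 4^k bytes, each count capped at 255.
/// Windows containing a non-ACGT base are skipped. Assumes k >= 1.
fn kmer_counts_u8(seq: &[u8], k: usize) -> Vec<u8> {
    let mut counts = vec![0u8; 1 << (2 * k)];
    for window in seq.windows(k) {
        let idx = window.iter().try_fold(0usize, |acc, &b| {
            let code: usize = match b {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => return None,
            };
            Some((acc << 2) | code)
        });
        if let Some(i) = idx {
            counts[i] = counts[i].saturating_add(1);
        }
    }
    counts
}

/// Screen distance from two u8 vectors. Returns None when some k-mer is
/// saturated in both inputs, because the capped sum may then undercount.
/// Assumes both lengths are >= k.
fn kmer_dist8(a: &[u8], len_a: usize, b: &[u8], len_b: usize, k: usize) -> Option<f64> {
    let mut shared: u32 = 0;
    for (&x, &y) in a.iter().zip(b) {
        let m = x.min(y);
        if m == 255 {
            return None; // overflow: caller must fall back to exact counts
        }
        shared += u32::from(m);
    }
    let denom = (len_a.min(len_b) - k + 1) as f64;
    Some(1.0 - f64::from(shared) / denom)
}

fn main() {
    let k = 2;
    let (a, b) = (b"ACGTACGT", b"ACGTTCGT");
    let d = kmer_dist8(&kmer_counts_u8(a, k), a.len(), &kmer_counts_u8(b, k), b.len(), k);
    println!("ordinary pair: {:?}", d);

    // A 300-base homopolymer has 299 copies of "AA", so the u8 count saturates.
    let long = vec![b'A'; 300];
    let v = kmer_counts_u8(&long, k);
    println!("homopolymer vs itself: {:?}", kmer_dist8(&v, 300, &v, 300, k));
}
```

A saturated count in only one of the two sequences is harmless, since the minimum is then the other sequence's exact count; that is why the fallback is reached only when both hit 255.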
Change
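- Remove `kmer: Option<Vec<u16>>` from `Raw`; `raw_assign_kmers` now builds only `kmer8` (the u8 screen) and `kord`.
- On the rare overflow path, recompute the u16 vectors from `seq` in `raw_align_with_buf`.
- The no-vector branch (cluster-center raws, `build_trans_mat`) keeps `kdist = 0.0` exactly as before: `kmer8` and the old u16 `kmer` were always absent together, so this preserves "no screen when vectors absent."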
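The shape of the change can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the `Raw` struct here is a hypothetical two-field stand-in, `screen_dist` and `kmer_counts_u16` are invented names, and the distance formula is the DADA2-style shared-k-mer fraction. What it aims to show is the three behaviours listed above: no resident u16 vector, recomputation from `seq` on overflow, and `0.0` when the u8 vectors are absent.

```rust
/// Hypothetical stand-in for the slimmed Raw: the u16 vector is gone,
/// only the sequence and the optional u8 screen remain.
struct Raw {
    seq: Vec<u8>,
    kmer8: Option<Vec<u8>>,
}

/// Exact k-mer counts built on demand from the sequence (4^k u16 entries).
fn kmer_counts_u16(seq: &[u8], k: usize) -> Vec<u16> {
    let mut counts = vec![0u16; 1 << (2 * k)];
    for window in seq.windows(k) {
        let idx = window.iter().try_fold(0usize, |acc, &b| {
            let code: usize = match b {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => return None, // skip windows with ambiguous bases
            };
            Some((acc << 2) | code)
        });
        if let Some(i) = idx {
            counts[i] = counts[i].saturating_add(1);
        }
    }
    counts
}

impl Raw {
    fn new(seq: &[u8], k: usize, with_screen: bool) -> Self {
        // The u8 screen is the exact count capped at 255.
        let kmer8: Option<Vec<u8>> = with_screen.then(|| {
            kmer_counts_u16(seq, k).iter().map(|&c| c.min(255) as u8).collect()
        });
        Raw { seq: seq.to_vec(), kmer8 }
    }
}

/// Screen distance with the overflow fallback. Assumes both sequences have
/// at least k bases.
fn screen_dist(r1: &Raw, r2: &Raw, k: usize) -> f64 {
    let (a8, b8) = match (&r1.kmer8, &r2.kmer8) {
        (Some(a), Some(b)) => (a, b),
        _ => return 0.0, // vectors absent: no screen, as before
    };
    let denom = (r1.seq.len().min(r2.seq.len()) - k + 1) as f64;
    let mut shared: u32 = 0;
    let mut overflow = false;
    for (&x, &y) in a8.iter().zip(b8) {
        let m = x.min(y);
        overflow |= m == 255;
        shared += u32::from(m);
    }
    if overflow {
        // Rare path: rebuild the exact u16 vectors from the sequences.
        let a16 = kmer_counts_u16(&r1.seq, k);
        let b16 = kmer_counts_u16(&r2.seq, k);
        shared = a16.iter().zip(&b16).map(|(&x, &y)| u32::from(x.min(y))).sum();
    }
    1.0 - f64::from(shared) / denom
}

fn main() {
    let k = 2;
    let a = Raw::new(&vec![b'A'; 300], k, true);
    let mut b_seq = vec![b'A'; 280];
    b_seq.extend_from_slice(b"CCCC");
    let b = Raw::new(&b_seq, k, true);
    println!("overflow path: {:.6}", screen_dist(&a, &b, k));
    let center = Raw::new(&b_seq, k, false);
    println!("no vectors:    {}", screen_dist(&a, &center, k));
}
```

The trade is extra work on a path that is almost never taken in exchange for `4^k × 2` bytes saved on every resident `Raw`. Because the recomputed counts are the same values the stored vector would have held, the result is unchanged.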
Correctness
Byte-identical screening result: the overflow path recomputes the same value the
stored vector held, and the missing-vector path is unchanged. Full suite green
(42 unit + 12 integration, incl. the dada snapshot and derep-JSON parity tests);
clippy + fmt clean.
Memory
Structural win: one `4^k`-u16 vector dropped per resident `Raw` (32 KB at k7, 2 KB at k5). To be confirmed on the cluster: re-measure pooled k7 peak; expect
the k5→k7 gap to shrink by the dropped u16 term (kmer8 u8 + kord remain). Doesn't
touch ASVs or the `dada` hot loop, so wall should be flat (overflow recompute is off the hot path).
🤖 Generated with Claude Code