
perf(mem): drop resident 16-bit k-mer vector from Raw, recompute on overflow (#32)#34

Merged
cjfields merged 1 commit into main from perf/drop-resident-u16-kmer
Jun 11, 2026

@cjfields

Member

Closes #32 (first lever).

Summary

Pooled dada peak RSS jumps 28.8 GB (k5) → 53.8 GB (k7) on PacBio because it
holds a Raw for every unique across all samples resident, and the 4^k k-mer
screen arrays grow 16× from k5→k7.

The exact 16-bit frequency vector (Raw.kmer, 4^k × 2 bytes = 32 KB/unique at k7,
the single largest k-mer term) is only read on the kmer_dist8 overflow fallback,
which needs the same k-mer occurring ≥255× in both sequences
(m = a.min(b); if m == 255), essentially never for amplicon data. So it doesn't
need to be resident.

Change

  • Remove kmer: Option<Vec<u16>> from Raw; raw_assign_kmers now builds only
    kmer8 (the u8 screen) and kord.
  • On the rare overflow path, recompute the u16 vectors from seq in
    raw_align_with_buf.
  • The no-8-bit-vectors branch (cluster-center raws in build_trans_mat) keeps
    kdist = 0.0 exactly as before — kmer8 and the old u16 kmer were always
    absent together, so this preserves "no screen when vectors absent."
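
A minimal sketch of the pattern described above, not the project's actual code: a saturating u8 screen is kept per sequence, and the exact u16 counts are rebuilt from the sequence only when the screen reports saturation in both inputs. All function names here (`kmer_counts_u8`, `shared_u8`, `kmer_dist`, and so on) are hypothetical, the distance formula (one minus shared k-mers over the number of windows in the shorter sequence) is an assumption, and returning 1.0 for sequences shorter than k is an arbitrary choice for the sketch.

```rust
/// Map a nucleotide to a 2-bit code; anything other than A/C/G/T yields None.
fn nt_code(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Table index of every k-mer window in `seq` (k >= 1), skipping windows
/// that contain a non-ACGT base.
fn kmer_indices(seq: &[u8], k: usize) -> Vec<usize> {
    seq.windows(k)
        .filter_map(|w| {
            w.iter()
                .try_fold(0usize, |acc, &b| nt_code(b).map(|c| (acc << 2) | c))
        })
        .collect()
}

/// The resident screen: 4^k one-byte counters that clamp at 255.
fn kmer_counts_u8(seq: &[u8], k: usize) -> Vec<u8> {
    let mut counts = vec![0u8; 1usize << (2 * k)];
    for idx in kmer_indices(seq, k) {
        counts[idx] = counts[idx].saturating_add(1);
    }
    counts
}

/// The exact vector: 4^k two-byte counters, built only on demand.
fn kmer_counts_u16(seq: &[u8], k: usize) -> Vec<u16> {
    let mut counts = vec![0u16; 1usize << (2 * k)];
    for idx in kmer_indices(seq, k) {
        counts[idx] = counts[idx].saturating_add(1);
    }
    counts
}

/// Sum of per-k-mer minima, or None if any minimum hit the u8 ceiling
/// (the same k-mer is saturated in both vectors, so the sum is unreliable).
fn shared_u8(a: &[u8], b: &[u8]) -> Option<u32> {
    let mut shared = 0u32;
    for (&x, &y) in a.iter().zip(b) {
        let m = x.min(y);
        if m == u8::MAX {
            return None;
        }
        shared += u32::from(m);
    }
    Some(shared)
}

/// k-mer distance using the u8 screen, recomputing u16 counts on overflow.
fn kmer_dist(seq_a: &[u8], kmer8_a: &[u8], seq_b: &[u8], kmer8_b: &[u8], k: usize) -> f64 {
    let shortest = seq_a.len().min(seq_b.len());
    if shortest < k {
        return 1.0; // arbitrary choice for this sketch
    }
    let shared = shared_u8(kmer8_a, kmer8_b).unwrap_or_else(|| {
        // Rare path: rebuild the exact vectors from the sequences.
        let exact_a = kmer_counts_u16(seq_a, k);
        let exact_b = kmer_counts_u16(seq_b, k);
        exact_a
            .iter()
            .zip(&exact_b)
            .map(|(&x, &y)| u32::from(x.min(y)))
            .sum::<u32>()
    });
    1.0 - f64::from(shared) / (shortest - k + 1) as f64
}
```

The design point is that the fallback costs two O(len) passes and two temporary 4^k allocations per overflowing pair, which is acceptable only because that pair is expected to be rare; the common path never touches a u16 vector.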

Correctness

Byte-identical screening result: the overflow path recomputes the same value the
stored vector held, and the missing-vector path is unchanged. Full suite green
(42 unit + 12 integration, incl. the dada snapshot and derep-JSON parity tests);
clippy + fmt clean.

Memory

Structural win — one 4^k-u16 vector dropped per resident Raw (32 KB at k7,
2 KB at k5). To be confirmed on the cluster: re-measure pooled k7 peak; expect
the k5→k7 gap to shrink by the dropped u16 term (kmer8 u8 + kord remain). Doesn't
touch ASVs or the dada hot loop, so wall should be flat (overflow recompute is
off the hot path).

🤖 Generated with Claude Code

…verflow (#32)

Pooled dada holds a Raw for every unique across all samples resident through the
single inference, so the per-Raw k-mer vectors dominate peak RSS — and the jump
from k5→k7 (28.8→53.8 GB on PacBio) is the 4^k screen arrays growing 16×.

The exact 16-bit frequency vector (`Raw.kmer`, 4^k × 2 bytes — 32 KB/unique at
k7, the single largest k-mer term) is only ever read on the `kmer_dist8`
overflow fallback, which requires the *same* k-mer occurring ≥255× in *both*
sequences — essentially never for amplicon data. So stop storing it resident:

- Remove the `kmer: Option<Vec<u16>>` field from `Raw`; `raw_assign_kmers` now
  populates only `kmer8` (the u8 screen) and `kord`.
- On the rare overflow path, recompute the u16 vectors from `seq` in
  `raw_align_with_buf`.
- The no-8-bit-vectors branch (cluster-center raws) keeps kdist = 0.0 exactly as
  before — previously kmer8 and the u16 kmer were absent together.

Byte-identical: full suite green (42 unit + 12 integration, incl. dada snapshot
and derep-JSON parity tests); clippy clean. Memory win to be confirmed on the
cluster (pooled k7 peak; expect the k5→k7 gap to shrink by the dropped u16 term).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cjfields

Member Author

Confirmed on the cluster (PacBio, 93 samples, --pool true, k=7, no R)

Self-vs-self, current main (pre) vs this branch (post), node-exclusive:

| metric (dada step) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 52.33 GB | 35.69 GB | −31.8% |
| learn peak | 2.65 GB | 1.79 GB | −32.7% |
| dada wall | 723.0 s | 682.2 s | −5.6% |
| dada CPU | 13186.9 s | 12865.6 s | −2.4% |
| dada cores | 18.24 | 18.86 | +3.4% |

The ~16.6 GB shed ≈ 32 KB (4^7 × 2 B) × ~508k pooled uniques, which matches the
dropped u16 vector. The bonus wall win (−5.6% dada, CPU down and cores up) is the
signature of relieving memory-bandwidth pressure on the bloated resident set,
which also partially addresses #33 (pooled core under-utilization was partly
bandwidth, not just the Amdahl serial fraction).

Concordance gate (pass/fail): PASS. compare_asvs.py pre-vs-post on the final
chimera-filtered table: 2082 ASVs vs 2082, shared=2082 only_pre=0 only_post=0 churn=0. Byte-identical at scale, including any overflow-path triggers.
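
The pre-vs-post comparison reported above can be sketched as a set comparison over ASV sequences. This is an illustrative Rust sketch, not a port of compare_asvs.py: the `Concordance` struct, the `concordance` and `passes_gate` functions, and the pass criterion (no ASV unique to either side) are assumptions, and the `churn` figure is left out because its definition is not given here.

```rust
use std::collections::HashSet;

/// Counts of ASV sequences shared between two runs and unique to each.
#[derive(Debug, PartialEq, Eq)]
struct Concordance {
    shared: usize,
    only_pre: usize,
    only_post: usize,
}

/// Compare two ASV lists by exact sequence identity, ignoring order.
fn concordance(pre: &[&str], post: &[&str]) -> Concordance {
    let pre: HashSet<&str> = pre.iter().copied().collect();
    let post: HashSet<&str> = post.iter().copied().collect();
    Concordance {
        shared: pre.intersection(&post).count(),
        only_pre: pre.difference(&post).count(),
        only_post: post.difference(&pre).count(),
    }
}

/// Assumed gate: pass only when neither run has an ASV the other lacks.
fn passes_gate(c: &Concordance) -> bool {
    c.only_pre == 0 && c.only_post == 0
}
```

A set comparison like this checks ASV identity only; it says nothing about per-sample abundances, which would need a separate check.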

Ready to merge.

@cjfields cjfields merged commit eecb057 into main Jun 11, 2026
5 checks passed
@cjfields

Member Author

k5 control — confirms the 4^k scaling and the bandwidth mechanism

Same setup (PacBio 93 samples, --pool true, k=5, no R), current main (pre) vs branch (post):

| metric (dada) | pre | post | Δ |
|---|---|---|---|
| peak RSS | 28.72 GB | 27.59 GB | −3.9% |
| dada wall | 4227.1 s | 4217.7 s | −0.2% (flat) |
| dada cores | 22.77 | 22.76 | flat |

The u16 vector is 4^k × 2 B/unique = 2 KB (k5) vs 32 KB (k7), a 16× ratio.
Observed savings: −1.13 GB (k5) vs −16.6 GB (k7), a 14.7× ratio, close to the
theoretical 16×, so the drop is accounted for by the removed vector.
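
The arithmetic behind that comparison can be written down as a short sketch. The function names are hypothetical, and using the same ~508k unique count for both k values is an assumption (it is the pooled figure quoted for the k7 run on the same dataset).

```rust
/// Size in bytes of one exact u16 k-mer vector: 4^k entries of 2 bytes.
fn u16_vector_bytes(k: u32) -> u64 {
    2 * 4u64.pow(k)
}

/// Predicted resident-memory saving from dropping one u16 vector per unique.
fn predicted_savings_bytes(k: u32, uniques: u64) -> u64 {
    u16_vector_bytes(k) * uniques
}

/// Convert bytes to decimal gigabytes for comparison with the tables.
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With 508,000 uniques this predicts about 1.04 GB at k5 and about 16.65 GB at k7, against the measured 1.13 GB and 16.6 GB.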

The wall is flat at k5 because pre cores were already 22.77/24 — the 28.7 GB
resident set is below the bandwidth ceiling, so there's nothing to recover. The
k7 wall bonus (−5.6%) appeared only because k7's 52 GB set depressed cores to
18.2; relieving it recovered both. Wall improving only where cores were
depressed (and not where they were saturated) confirms the bonus is
bandwidth-relief, not a build-time CPU saving.

cjfields added a commit that referenced this pull request Jun 11, 2026
…profiling)

Non-algorithmic instrumentation for memory profiling. Fires once per fresh Raw
build in dada_uniques_cached (not on learn-errors' cached-reuse iterations),
reporting per-inference resident Raw footprint: seq+qual bytes and the k-mer
screen vectors (kmer8 = 4^k bytes; kord = (len-k+1)×2). In pooled mode this is
the whole resident set; in pseudo/per-sample it is one sample's share (× jobs
for the peak). Lets a --verbose run expose the k-mer footprint and the gap to
peak RSS (non-Raw structures) without an external profiler.

Verified: k5 ≈ 1516 B/raw (1024 kmer8 + ~492 kord), k7 ≈ 16872 B/raw — the 16×
4^k growth that #34 addressed, now visible per run.
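
The per-Raw screen footprint quoted above follows directly from the two formulas (kmer8 = 4^k bytes; kord = (len-k+1)×2). A small sketch, with hypothetical function names and an assumed sequence length of 250, chosen only because it reproduces the reported figures:

```rust
/// Bytes used by the u8 screen: one byte per possible k-mer.
fn kmer8_bytes(k: u32) -> usize {
    4usize.pow(k)
}

/// Bytes used by kord: one 2-byte entry per k-mer window in the sequence.
/// Sequences shorter than k have no windows.
fn kord_bytes(len: usize, k: u32) -> usize {
    (len + 1).saturating_sub(k as usize) * 2
}

/// Combined k-mer screen footprint for one Raw (excludes seq and qual bytes).
fn screen_bytes(len: usize, k: u32) -> usize {
    kmer8_bytes(k) + kord_bytes(len, k)
}
```

For a length of 250 this gives 1516 bytes at k5 and 16872 bytes at k7, with kmer8 accounting for the whole 16× growth and kord staying nearly constant.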

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf/mem: shrink resident k-mer vectors in pooled dada — 4^k arrays dominate peak at k7
