
Fused makepaddedseqdb + streaming createindex #9

Open

antonvnv wants to merge 3 commits into pskvins:master from antonvnv:fused-and-streaming

Conversation

@antonvnv

For large nucleotide databases (NT is ~1.5 TB indexed), the current 4-step RNA indexing pipeline (createdb -> splitsequence -> makepaddedseqdb -> createindex) has two problems: intermediate databases double the disk I/O, and createindex allocates an O(totalSeqBytes) buffer for SequenceLookup that doesn't fit in RAM. On a 128 GB machine indexing a 1.5 TB DB, this forces heavy swapping — the swap I/O compounds with the already large sequential read I/O, making indexing painfully slow.

This PR reduces the pipeline to 2 steps and makes memory usage constant regardless of input size, eliminating both the intermediate I/O and the swap pressure.

Based on the databases-local-index branch (PR #4).

Commit 1: makepaddedseqdb fused FASTA codepath

makepaddedseqdb now accepts FASTA directly (detected by absence of .dbtype). The fused path eliminates all intermediate databases and their I/O, while producing byte-for-byte identical output to the 3-step pipeline (verified: 144/144 file comparisons across 6 configs x 4 sizes).

Two-pass design:

  • Pass 1: stream FASTA, collect metadata only (headers + lengths), then compute shuffle/sort/offsets from metadata
  • Pass 2: mmap the FASTA, encode+pwrite chunks in parallel via OpenMP

Falls back to sequential read for compressed/stdin inputs.
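
The two-pass idea can be sketched roughly as follows; all names here (EntryMeta, computeLayout, the padding rule) are illustrative stand-ins, not the actual MMseqs2 symbols:

```cpp
// Sketch of the two-pass layout: pass 1 collects metadata only, then all
// output offsets are computed from that metadata alone. Pass 2 can mmap
// the FASTA and encode each entry independently, because every target
// offset is already known before any sequence byte is touched.
#include <cstddef>
#include <string>
#include <vector>

struct EntryMeta {
    std::string header;
    size_t length;          // sequence length (newlines excluded)
    size_t sequenceOffset;  // byte offset of the sequence in the FASTA
    size_t targetOffset;    // output offset, computed after shuffle/sort
};

// Assign each (already shuffled and sorted) entry a pad-aligned output
// offset; returns the total output size.
size_t computeLayout(std::vector<EntryMeta> &entries, size_t padding) {
    size_t offset = 0;
    for (EntryMeta &e : entries) {
        e.targetOffset = offset;
        size_t padded = ((e.length + padding - 1) / padding) * padding;
        offset += padded;   // next entry starts at a pad-aligned offset
    }
    return offset;
}
```

Because pass 2 only reads the mmap'd FASTA and writes to disjoint, precomputed offsets, its work units are independent and safe to parallelize.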

databases.sh LOCAL_FASTA pipeline updated from 4 steps to 2:

makepaddedseqdb <fasta> <outdb>
createindex <outdb> <tmp> --index-subset 2

Benchmark (64 threads, fused vs 3-step reference; percentages are CPU utilization, TIME and RSS are fused/reference ratios):

SIZE       FUSED             REF(DB)           TIME     RSS
1M         0.025s 2495%      0.119s  550%      0.21x    1.00x
10M        0.047s 1474%      0.160s  506%      0.29x    1.55x
50M        0.081s 1503%      0.188s  631%      0.43x    0.75x
100M       0.112s 1887%      0.248s  694%      0.45x    0.51x
500M       0.468s 2434%      0.649s  850%      0.72x    0.51x

Commit 2: createindex streaming SequenceLookup

When building only the SequenceLookup (--index-subset 2, no k-mer index), createindex now writes encoded+masked sequences directly to the .idx file via the DBWriter streaming API, instead of allocating an O(totalSeqBytes) heap buffer. This eliminates OOM at large input sizes (~250 GB+).

Memory: O(totalSeqBytes) -> O(numSequences * 8 + maxSeqLen). Uses posix_madvise(MADV_DONTNEED) every 64 MB to evict consumed mmap pages, keeping resident memory flat regardless of input size.
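
A minimal sketch of that eviction pattern (function names and the window constant are illustrative; the consume step is elided):

```cpp
// Walk an mmap'd region window by window; once a full 64 MB window behind
// the cursor has been consumed, advise the kernel it can drop those pages,
// so resident memory stays flat no matter how large the mapping is.
#include <sys/mman.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>

static const size_t EVICT_WINDOW = 64ULL * 1024 * 1024;  // 64 MB

void processAndEvict(char *base, size_t totalSize) {
    size_t pageSize = (size_t) sysconf(_SC_PAGESIZE);
    size_t evicted = 0;   // everything before this offset has been released
    for (size_t pos = 0; pos < totalSize; pos += EVICT_WINDOW) {
        size_t end = std::min(pos + EVICT_WINDOW, totalSize);
        // ... consume/encode bytes base[pos..end) here ...
        if (end - evicted >= EVICT_WINDOW) {
            size_t alignedEnd = end - (end % pageSize);  // page-align the cut
            posix_madvise(base + evicted, alignedEnd - evicted,
                          POSIX_MADV_DONTNEED);
            evicted = alignedEnd;
        }
    }
}
```

The madvise range must be page-aligned, hence the round-down before each advisory call.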

Parallelized with OpenMP (chunked processing, same thread-safety model as fillDatabase). Produces byte-identical index data (12/12 correctness checks).

Benchmark (64 threads, streaming vs bulk-allocation; percentages are CPU utilization, TIME and MEM are streaming/bulk ratios):

SIZE       STREAMING         BULK-ALLOC        TIME     MEM
1M         0.073s 1517%      0.058s 1783%      1.25x    n/a
10M        0.144s 1118%      0.145s 1110%      1.00x    0.88x
50M        0.278s 1524%      0.285s 1462%      0.98x    0.96x
100M       0.435s 1881%      0.427s 1770%      1.02x    1.00x
500M       1.908s 2084%      1.910s 2067%      1.00x    0.92x
1000M      3.707s 2147%      3.668s 2137%      1.01x    0.98x
2000M      7.302s 2179%      7.210s 2174%      1.01x    0.54x

Wall time is unchanged. Memory savings become significant at scale (1.0 GB vs 1.8 GB at 2 GB input, 0.54x).

Testing

Both commits include self-contained benchmark suites in util/benchmarks/ that verify correctness (byte-identical output) and measure performance. Run:

python3 util/benchmarks/makepaddedseqdb_fused/run
python3 util/benchmarks/createindex_streaming/run [--slow]

Building a nucleotide database from a FASTA file previously
required manually chaining createdb, splitsequence, makepaddedseqdb,
and createindex with the right flags. Now a single command does it:

  mmseqs databases ./input.fasta.gz outdb tmp

Both relative (./...) and absolute (/...) paths work — any argument
containing '/' that isn't a known database name is treated as a local
file. Protein inputs are rejected with a clear error since the
indexing pipeline is nucleotide-specific.
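
The selection heuristic described above can be sketched as follows (the function name and the known-database lookup are illustrative, not the actual implementation):

```cpp
// An argument containing '/' that is not a known database name is treated
// as a local FASTA path. Checking the known names first matters because
// some real database names themselves contain '/'.
#include <set>
#include <string>

bool isLocalFastaArg(const std::string &arg,
                     const std::set<std::string> &knownDatabases) {
    if (knownDatabases.count(arg)) {
        return false;   // exact match wins: treat as a downloadable database
    }
    return arg.find('/') != std::string::npos;  // ./input.fasta.gz, /abs/path
}
```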

This keeps `databases` as the single entry point to maintain indexing
requirements, and makes it suitable for reindexing external or already
manually downloaded databases.
…parallelize with mmap+OpenMP

When makepaddedseqdb detects no .dbtype file for the input, it runs a
fused codepath that accepts FASTA directly, eliminating intermediate
file I/O from the 3-step createdb -> splitsequence -> makepaddedseqdb
pipeline. The existing DB-input codepath is unchanged.

The fused path replicates the exact behavior of the 3-step pipeline:
FASTA parsing via KSeqWrapper, createdb-style shuffle (group by id%32),
splitsequence-style chunking with ORF headers, length-based sorting,
padded encoding, and lookup/source file generation. Output is
byte-for-byte identical to the 3-step pipeline (144/144 correctness
checks across 6 configs x 4 sizes).

Uses a two-pass streaming approach to avoid O(N) memory:

  Pass 1: stream FASTA collecting only metadata (headers + lengths)
  Layout: compute shuffle/sort/offsets from metadata alone
  Write:  headers, lookup, source (no sequence data needed)
  Pass 2: re-read FASTA, encode each chunk, pwrite() to pre-computed offset

Pass 2 (encode + pwrite) is parallelized with mmap + OpenMP: record
sequenceOffset/newlineCount during Pass 1, then mmap the input FASTA
and process entries in parallel — each thread strips newlines from its
mmap'd region into a thread-local buffer, encodes, and pwrites to
pre-computed offsets. For compressed/stdin inputs, falls back to the
original sequential path.
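
The shape of that parallel second pass, in rough strokes (the Entry layout is illustrative and the encode step is reduced to a comment):

```cpp
// Each OpenMP thread copies its entry's bytes out of the mmap'd FASTA
// while dropping newlines into a thread-local buffer, then pwrite()s to
// the entry's precomputed output offset. Offsets are disjoint, so the
// concurrent pwrite calls never overlap.
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>
#include <vector>

struct Entry { size_t srcOffset, srcLen, dstOffset; };

void encodePass(const char *fasta, const std::vector<Entry> &entries,
                int outFd) {
    #pragma omp parallel
    {
        std::string buf;                         // thread-local scratch
        #pragma omp for schedule(dynamic, 64)
        for (long long i = 0; i < (long long) entries.size(); ++i) {
            const Entry &e = entries[i];
            buf.clear();
            for (size_t j = 0; j < e.srcLen; ++j) {
                char c = fasta[e.srcOffset + j];
                if (c != '\n' && c != '\r') {
                    buf.push_back(c);            // encode step would go here
                }
            }
            pwrite(outFd, buf.data(), buf.size(), (off_t) e.dstOffset);
        }
    }
}
```

Without -fopenmp the pragmas are ignored and the same code runs serially, which mirrors the fallback behavior for inputs that cannot be mmap'd.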

Default --max-seq-len to 10000 to match splitsequence. Without this,
the fused FASTA codepath would not split sequences unless --max-seq-len
was explicitly passed, producing different output than the 3-step
pipeline.
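
To illustrate why the default matters: with a max length of 10000, a long sequence is cut into multiple pieces. The sketch below shows only the basic cut points; splitsequence's actual chunking (including any overlap handling) may differ, and the names are illustrative.

```cpp
// Split a sequence of length seqLen into (start, length) ranges of at
// most maxSeqLen each. With the default of 10000, a 25 kb sequence
// yields three pieces; without the default, it would stay whole.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<size_t, size_t>> splitRanges(size_t seqLen,
                                                   size_t maxSeqLen) {
    std::vector<std::pair<size_t, size_t>> ranges;
    for (size_t start = 0; start < seqLen; start += maxSeqLen) {
        ranges.push_back({start, std::min(maxSeqLen, seqLen - start)});
    }
    return ranges;
}
```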

Replace the 4-step LOCAL_FASTA pipeline in databases.sh
(createdb -> splitsequence -> makepaddedseqdb -> createindex) with
2 steps:

  makepaddedseqdb <fasta> <outdb>   (fused single-pass)
  createindex <outdb> <tmp> --index-subset 2  (streaming + MADV_DONTNEED)

This eliminates intermediate databases and their cleanup, reduces
disk I/O by ~2x, and uses constant memory for indexing regardless
of input size.

Benchmark (64 threads, fused vs reference 3-step makepaddedseqdb):

  SIZE       FUSED             REF(DB)           TIME     RSS
  1M         0.025s 2495%      0.119s  550%      0.21x    1.00x
  10M        0.047s 1474%      0.160s  506%      0.29x    1.55x
  50M        0.081s 1503%      0.188s  631%      0.43x    0.75x
  100M       0.112s 1887%      0.248s  694%      0.45x    0.51x
  500M       0.468s 2434%      0.649s  850%      0.72x    0.51x

Includes a self-contained correctness and performance benchmark suite in
util/benchmarks/makepaddedseqdb_fused/.
…el encode+mask

When building only the SequenceLookup (no k-mer index), encode+mask each
sequence and write directly to the .idx file via DBWriter streaming API,
instead of allocating the O(totalSeqBytes) heap buffer. This eliminates
the dominant memory allocation that causes OOM at ~250 GB input.

Memory usage drops from O(totalSeqBytes) to O(numSequences * 8 + maxSeqLen).
Produces byte-identical SequenceLookup data in the index file (12/12
correctness checks across 3 sizes).

The streaming path now calls posix_madvise(MADV_DONTNEED) every 64 MB
to release consumed mmap pages, keeping memory flat at ~100 MB regardless
of input size. Previously the mmap'd padded DB filled page cache
proportionally to input size (~1x). With a 1.5 TB DB and 128 GB RAM
this is problematic.

The streamSequenceLookupToIndex() streaming path is parallelized with
OpenMP using chunked processing: sequences are processed in chunks of
100K; each chunk is parallel-encoded into a shared buffer at pre-computed
non-overlapping offsets (the same thread-safety model as fillDatabase),
then written serially via writeAdd.
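
The chunked pattern above can be sketched as follows; encodeOne is a toy stand-in for the real encoder, and the writer callback stands in for DBWriter's writeAdd:

```cpp
// Within each chunk, a serial prefix sum over encoded sizes gives every
// sequence a disjoint slice of a shared buffer; the parallel loop then
// fills those slices without overlap, and the whole chunk is handed to
// the (serial) writer in one call.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for the real encoder: masks each base to 5 bits.
size_t encodeOne(const std::string &seq, char *dst) {
    for (size_t i = 0; i < seq.size(); ++i) dst[i] = seq[i] & 0x1F;
    return seq.size();
}

void streamChunks(const std::vector<std::string> &seqs, size_t chunkSize,
                  const std::function<void(const char *, size_t)> &writeAdd) {
    std::vector<char> buffer;                        // reused per chunk
    for (size_t begin = 0; begin < seqs.size(); begin += chunkSize) {
        size_t end = std::min(begin + chunkSize, seqs.size());
        // Serial prefix sum: offsets[k] is where sequence begin+k lands.
        std::vector<size_t> offsets(end - begin + 1, 0);
        for (size_t i = begin; i < end; ++i) {
            offsets[i - begin + 1] = offsets[i - begin] + seqs[i].size();
        }
        buffer.resize(offsets.back());
        #pragma omp parallel for schedule(dynamic, 1024)
        for (long long i = (long long) begin; i < (long long) end; ++i) {
            encodeOne(seqs[i], buffer.data() + offsets[i - begin]);
        }
        writeAdd(buffer.data(), buffer.size());      // serial append
    }
}
```

Keeping the write serial sidesteps any locking in the output path while still getting parallel speedup on the encode, which is where the CPU time goes.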

Benchmark (64 threads, streaming vs bulk-allocation, --slow):

  SIZE       STREAMING         BULK-ALLOC        TIME     MEM
  1M         0.073s 1517%      0.058s 1783%      1.25x    n/a
  10M        0.144s 1118%      0.145s 1110%      1.00x    0.88x
  50M        0.278s 1524%      0.285s 1462%      0.98x    0.96x
  100M       0.435s 1881%      0.427s 1770%      1.02x    1.00x
  500M       1.908s 2084%      1.910s 2067%      1.00x    0.92x
  1000M      3.707s 2147%      3.668s 2137%      1.01x    0.98x
  2000M      7.302s 2179%      7.210s 2174%      1.01x    0.54x

Wall time is unchanged (~1.0x ratio). Memory savings become significant
at 2 GB input (1.0 GB streaming vs 1.8 GB bulk, 0.54x).
