Fused makepaddedseqdb + streaming createindex #9
antonvnv wants to merge 3 commits into
Building a nucleotide database from a FASTA file previously required manually chaining createdb, splitsequence, makepaddedseqdb, and createindex with the right flags. Now a single command does it:

`mmseqs databases ./input.fasta.gz outdb tmp`

Both relative (./...) and absolute (/...) paths work — any argument containing '/' that isn't a known database name is treated as a local file. Protein inputs are rejected with a clear error since the indexing pipeline is nucleotide-specific. This keeps `databases` as the single entry point that maintains the indexing requirements, and makes it suitable for reindexing external or already manually downloaded databases.
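For context, here is a minimal sketch of the path-detection rule described above. It assumes a helper named `looksLikeLocalInput` and a pre-computed `isKnownDatabaseName` flag; neither is the actual MMseqs2 code, and the existence check is illustrative only.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical sketch of the local-file detection: an argument that contains a '/'
// and does not match a known downloadable database name is treated as a path to a
// local FASTA file. Names and signature are illustrative, not the MMseqs2 code.
static bool looksLikeLocalInput(const std::string &arg, bool isKnownDatabaseName) {
    if (isKnownDatabaseName) {
        return false;                     // e.g. "UniProtKB/Swiss-Prot" stays a download
    }
    if (arg.find('/') == std::string::npos) {
        return false;                     // bare names without '/' are database identifiers
    }
    struct stat st;
    return stat(arg.c_str(), &st) == 0;   // relative (./...) and absolute (/...) paths both work
}
```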
…parallelize with mmap+OpenMP

When makepaddedseqdb detects no .dbtype file for the input, it runs a fused codepath that accepts FASTA directly, eliminating intermediate file I/O from the 3-step createdb -> splitsequence -> makepaddedseqdb pipeline. The existing DB-input codepath is unchanged.

The fused path replicates the exact behavior of the 3-step pipeline: FASTA parsing via KSeqWrapper, createdb-style shuffle (group by id%32), splitsequence-style chunking with ORF headers, length-based sorting, padded encoding, and lookup/source file generation. Output is byte-for-byte identical to the 3-step pipeline (144/144 correctness checks across 6 configs x 4 sizes).

Uses a two-pass streaming approach to avoid O(N) memory:
- Pass 1: stream FASTA collecting only metadata (headers + lengths)
- Layout: compute shuffle/sort/offsets from metadata alone
- Write: headers, lookup, source (no sequence data needed)
- Pass 2: re-read FASTA, encode each chunk, pwrite() to pre-computed offset

Pass 2 (encode + pwrite) is parallelized with mmap + OpenMP: record sequenceOffset/newlineCount during Pass 1, then mmap the input FASTA and process entries in parallel — each thread strips newlines from its mmap'd region into a thread-local buffer, encodes, and pwrites to pre-computed offsets. For compressed/stdin inputs, it falls back to the original sequential path.

Default --max-seq-len to 10000 to match splitsequence. Without this, the fused FASTA codepath would not split sequences unless --max-seq-len was explicitly passed, producing different output than the 3-step pipeline.

Replace the 4-step LOCAL_FASTA pipeline in databases.sh (createdb -> splitsequence -> makepaddedseqdb -> createindex) with 2 steps:
1. makepaddedseqdb <fasta> <outdb> (fused single-pass)
2. createindex <outdb> <tmp> --index-subset 2 (streaming + MADV_DONTNEED)

This eliminates intermediate databases and their cleanup, reduces disk I/O by ~2x, and uses constant memory for indexing regardless of input size.

Benchmark (64 threads, fused vs reference 3-step makepaddedseqdb):

SIZE   FUSED           REF(DB)         TIME    RSS
1M     0.025s  2495%   0.119s   550%   0.21x   1.00x
10M    0.047s  1474%   0.160s   506%   0.29x   1.55x
50M    0.081s  1503%   0.188s   631%   0.43x   0.75x
100M   0.112s  1887%   0.248s   694%   0.45x   0.51x
500M   0.468s  2434%   0.649s   850%   0.72x   0.51x

Includes a self-contained correctness and performance benchmark suite in util/benchmarks/makepaddedseqdb_fused/.
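To make the Pass 2 write strategy concrete, here is a rough sketch of the parallel encode + pwrite-at-precomputed-offset idea. EntryMeta, encodeSequence and writePass2 are hypothetical names, and the encoder is a trivial stand-in; the real codepath also records newline counts during Pass 1 and performs the actual padded nucleotide encoding.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <omp.h>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: entries whose byte ranges and output offsets were recorded during
// Pass 1 are encoded in parallel and written with pwrite() to pre-computed,
// non-overlapping offsets. Field and function names are assumptions.
struct EntryMeta {
    size_t fastaOffset;   // start of the sequence bytes inside the mmap'd FASTA
    size_t fastaLength;   // raw length including newlines
    size_t outOffset;     // pre-computed offset in the padded output file
};

// stand-in for the real padded nucleotide encoder; here it just upper-cases the bases
static size_t encodeSequence(const char *seq, size_t len, char *out) {
    for (size_t i = 0; i < len; ++i) out[i] = std::toupper(static_cast<unsigned char>(seq[i]));
    return len;
}

void writePass2(const char *fastaPath, int outFd, const std::vector<EntryMeta> &entries) {
    int inFd = open(fastaPath, O_RDONLY);
    struct stat st;
    fstat(inFd, &st);
    const char *fasta = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, inFd, 0));

#pragma omp parallel
    {
        std::string stripped;   // thread-local buffer, newlines removed
        std::string encoded;    // thread-local encode buffer
#pragma omp for schedule(dynamic, 64)
        for (size_t i = 0; i < entries.size(); ++i) {
            const EntryMeta &e = entries[i];
            stripped.clear();
            for (size_t j = 0; j < e.fastaLength; ++j) {
                char c = fasta[e.fastaOffset + j];
                if (c != '\n') stripped.push_back(c);   // strip newlines from the mmap'd region
            }
            encoded.resize(stripped.size());
            size_t n = encodeSequence(stripped.data(), stripped.size(), &encoded[0]);
            // offsets are disjoint, so concurrent pwrite() calls never overlap
            pwrite(outFd, encoded.data(), n, e.outOffset);
        }
    }
    munmap(const_cast<char *>(fasta), st.st_size);
    close(inFd);
}
```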
…el encode+mask

When building only the SequenceLookup (no k-mer index), encode+mask each sequence and write directly to the .idx file via the DBWriter streaming API, instead of allocating the O(totalSeqBytes) heap buffer. This eliminates the dominant memory allocation that causes OOM at ~250 GB input. Memory usage drops from O(totalSeqBytes) to O(numSequences * 8 + maxSeqLen). Produces byte-identical SequenceLookup data in the index file (12/12 correctness checks across 3 sizes).

The streaming path now calls posix_madvise(MADV_DONTNEED) every 64 MB to release consumed mmap pages, keeping memory flat at ~100 MB regardless of input size. Previously the mmap'd padded DB filled page cache proportionally to input size (~1x). With a 1.5 TB DB and 128 GB RAM this is problematic.

The streaming streamSequenceLookupToIndex() path is parallelized with OpenMP using chunked processing: sequences are processed in chunks of 100K, each chunk parallel-encoded into a shared buffer at pre-computed non-overlapping offsets (same thread-safety model as fillDatabase), then written serially via writeAdd.

Benchmark (64 threads, streaming vs bulk-allocation, --slow):

SIZE    STREAMING       BULK-ALLOC      TIME    MEM
1M      0.073s  1517%   0.058s  1783%   1.25x   n/a
10M     0.144s  1118%   0.145s  1110%   1.00x   0.88x
50M     0.278s  1524%   0.285s  1462%   0.98x   0.96x
100M    0.435s  1881%   0.427s  1770%   1.02x   1.00x
500M    1.908s  2084%   1.910s  2067%   1.00x   0.92x
1000M   3.707s  2147%   3.668s  2137%   1.01x   0.98x
2000M   7.302s  2179%   7.210s  2174%   1.01x   0.54x

Wall time is unchanged (~1.0x ratio). Memory savings become significant at 2 GB input (1.0 GB streaming vs 1.8 GB bulk, 0.54x).
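A condensed sketch of the periodic page-release pattern follows, assuming a helper named releaseConsumedPages that the streaming loop calls after consuming each entry. The 64 MB window comes from the commit message; the helper name and loop structure are illustrative only.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Sketch only: while streaming sequentially over an mmap'd database, every ~64 MB of
// consumed data is handed back to the kernel so resident memory stays flat instead of
// growing with input size.
static const size_t kReleaseWindow = 64ULL * 1024 * 1024;   // 64 MB

static void releaseConsumedPages(char *mapBase, size_t consumedBytes, size_t &lastReleased) {
    if (consumedBytes - lastReleased < kReleaseWindow) {
        return;                                              // not enough new data consumed yet
    }
    const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    // only release whole pages that lie entirely behind the current read position
    size_t releaseEnd = (consumedBytes / pageSize) * pageSize;
    if (releaseEnd > lastReleased) {
        posix_madvise(mapBase + lastReleased, releaseEnd - lastReleased, POSIX_MADV_DONTNEED);
        lastReleased = releaseEnd;
    }
}
```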
For large nucleotide databases (NT is ~1.5 TB indexed), the current 4-step RNA indexing pipeline (createdb -> splitsequence -> makepaddedseqdb -> createindex) has two problems: intermediate databases double the disk I/O, and createindex allocates an O(totalSeqBytes) buffer for SequenceLookup that doesn't fit in RAM. On a 128 GB machine indexing a 1.5 TB DB, this forces heavy swapping — the swap I/O compounds with the already large sequential read I/O, making indexing painfully slow.
This PR reduces the pipeline to 2 steps and makes memory usage constant regardless of input size, eliminating both the intermediate I/O and the swap pressure.
Based on the databases-local-index branch (PR #4).
Commit 1: makepaddedseqdb fused FASTA codepath
`makepaddedseqdb` now accepts FASTA directly (detected by absence of .dbtype). The fused path eliminates all intermediate databases and their I/O, while producing byte-for-byte identical output to the 3-step pipeline (verified: 144/144 file comparisons across 6 configs x 4 sizes).

Two-pass design:
- Pass 1: stream the FASTA collecting only metadata (headers + lengths), compute shuffle/sort order and output offsets from that metadata alone, and write the header, lookup, and source files.
- Pass 2: re-read the FASTA, encode each chunk, and pwrite() it to its pre-computed offset, parallelized with mmap + OpenMP.

Falls back to sequential reading for compressed/stdin inputs.
databases.sh LOCAL_FASTA pipeline updated from 4 steps to 2: `makepaddedseqdb <fasta> <outdb>` (fused single pass), then `createindex <outdb> <tmp> --index-subset 2` (streaming + MADV_DONTNEED).
Benchmark (64 threads, fused vs 3-step reference): wall time is 0.21x-0.72x of the reference, with RSS dropping to 0.51x at the larger inputs (full table in the commit message above).
Commit 2: createindex streaming SequenceLookup
When building only the SequenceLookup (`--index-subset 2`, no k-mer index), createindex now writes encoded+masked sequences directly to the .idx file via the DBWriter streaming API, instead of allocating an O(totalSeqBytes) heap buffer. This eliminates OOM at large input sizes (~250 GB+). Memory: O(totalSeqBytes) -> O(numSequences * 8 + maxSeqLen). Uses posix_madvise(MADV_DONTNEED) every 64 MB to evict consumed mmap pages, keeping resident memory flat regardless of input size.
Parallelized with OpenMP (chunked processing, same thread-safety model as fillDatabase). Produces byte-identical index data (12/12 correctness checks).
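A simplified sketch of that chunked thread-safety model follows; Writer, encodedSize and encodeAndMask are stand-ins, and the writeAdd signature shown is an assumption rather than the actual DBWriter API.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Sketch only: for each chunk of up to 100K sequences, non-overlapping output offsets
// inside a shared buffer are pre-computed (prefix sum of encoded lengths), the chunk is
// encoded+masked in parallel into those slots, and the buffer is then flushed serially.
struct Writer {                                   // stand-in for the DBWriter streaming API
    std::string sink;                             // the real path appends to the .idx file
    void writeAdd(const char *data, size_t size) { sink.append(data, size); }
};

static size_t encodedSize(size_t seqLen) { return seqLen; }          // real code: encoded/padded size
static void encodeAndMask(const std::string &seq, char *dst) {       // real code: encode + mask
    std::memcpy(dst, seq.data(), seq.size());
}

void streamChunks(const std::vector<std::string> &seqs, Writer &writer) {
    const size_t kChunk = 100000;                 // 100K sequences per chunk
    std::vector<char> buffer;
    std::vector<size_t> offsets;

    for (size_t begin = 0; begin < seqs.size(); begin += kChunk) {
        const size_t end = std::min(begin + kChunk, seqs.size());

        // pre-compute non-overlapping offsets for every sequence in this chunk
        offsets.assign(end - begin + 1, 0);
        for (size_t i = begin; i < end; ++i) {
            offsets[i - begin + 1] = offsets[i - begin] + encodedSize(seqs[i].size());
        }
        buffer.resize(offsets.back());

        // parallel encode: each thread writes only into its own sequence's slot
        #pragma omp parallel for schedule(dynamic, 1024)
        for (size_t i = begin; i < end; ++i) {
            encodeAndMask(seqs[i], buffer.data() + offsets[i - begin]);
        }

        // serial flush of the whole chunk, so the on-disk layout is deterministic
        writer.writeAdd(buffer.data(), buffer.size());
    }
}
```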
Benchmark (64 threads, streaming vs bulk-allocation): see the table in the commit message above.
Wall time is unchanged. Memory savings become significant at scale (1.0 GB vs 1.8 GB at 2 GB input, 0.54x).
Testing
Both commits include self-contained benchmark suites in util/benchmarks/ that verify correctness (byte-identical output) and measure performance. Run: