Fused makepaddedseqdb + streaming createindex #9
antonvnv wants to merge 3 commits into
Building a nucleotide database from a FASTA file previously required manually chaining createdb, splitsequence, makepaddedseqdb, and createindex with the right flags. Now a single command does it:

`mmseqs databases ./input.fasta.gz outdb tmp`

Both relative (./...) and absolute (/...) paths work — any argument containing '/' that isn't a known database name is treated as a local file. Protein inputs are rejected with a clear error since the indexing pipeline is nucleotide-specific. This keeps `databases` as the single entry point that maintains the indexing requirements, and makes it suitable for reindexing external or already manually downloaded databases.
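For context, here is a minimal sketch of the path-detection rule described above. It assumes a helper named `looksLikeLocalInput` and a pre-computed `isKnownDatabaseName` flag; neither is the actual MMseqs2 code, and the existence check is illustrative only.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical sketch of the local-file detection: an argument that contains a '/'
// and does not match a known downloadable database name is treated as a path to a
// local FASTA file. Names and signature are illustrative, not the MMseqs2 code.
static bool looksLikeLocalInput(const std::string &arg, bool isKnownDatabaseName) {
    if (isKnownDatabaseName) {
        return false;                     // e.g. "UniProtKB/Swiss-Prot" stays a download
    }
    if (arg.find('/') == std::string::npos) {
        return false;                     // bare names without '/' are database identifiers
    }
    struct stat st;
    return stat(arg.c_str(), &st) == 0;   // relative (./...) and absolute (/...) paths both work
}
```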
…parallelize with mmap+OpenMP

When makepaddedseqdb detects no .dbtype file for the input, it runs a fused codepath that accepts FASTA directly, eliminating intermediate file I/O from the 3-step createdb -> splitsequence -> makepaddedseqdb pipeline. The existing DB-input codepath is unchanged.

The fused path replicates the exact behavior of the 3-step pipeline: FASTA parsing via KSeqWrapper, createdb-style shuffle (group by id%32), splitsequence-style chunking with ORF headers, length-based sorting, padded encoding, and lookup/source file generation. Output is byte-for-byte identical to the 3-step pipeline (144/144 correctness checks across 6 configs x 4 sizes).

Uses a two-pass streaming approach to avoid O(N) memory:
- Pass 1: stream FASTA collecting only metadata (headers + lengths)
- Layout: compute shuffle/sort/offsets from metadata alone
- Write: headers, lookup, source (no sequence data needed)
- Pass 2: re-read FASTA, encode each chunk, pwrite() to pre-computed offset

Pass 2 (encode + pwrite) is parallelized with mmap + OpenMP: record sequenceOffset/newlineCount during Pass 1, then mmap the input FASTA and process entries in parallel — each thread strips newlines from its mmap'd region into a thread-local buffer, encodes, and pwrites to pre-computed offsets. For compressed/stdin inputs, it falls back to the original sequential path.

Default --max-seq-len to 10000 to match splitsequence. Without this, the fused FASTA codepath would not split sequences unless --max-seq-len was explicitly passed, producing different output than the 3-step pipeline.

Replace the 4-step LOCAL_FASTA pipeline in databases.sh (createdb -> splitsequence -> makepaddedseqdb -> createindex) with 2 steps:
1. makepaddedseqdb <fasta> <outdb> (fused single-pass)
2. createindex <outdb> <tmp> --index-subset 2 (streaming + MADV_DONTNEED)

This eliminates intermediate databases and their cleanup, reduces disk I/O by ~2x, and uses constant memory for indexing regardless of input size.

Benchmark (64 threads, fused vs reference 3-step makepaddedseqdb):

SIZE   FUSED           REF(DB)         TIME    RSS
1M     0.025s  2495%   0.119s   550%   0.21x   1.00x
10M    0.047s  1474%   0.160s   506%   0.29x   1.55x
50M    0.081s  1503%   0.188s   631%   0.43x   0.75x
100M   0.112s  1887%   0.248s   694%   0.45x   0.51x
500M   0.468s  2434%   0.649s   850%   0.72x   0.51x

Includes a self-contained correctness and performance benchmark suite in util/benchmarks/makepaddedseqdb_fused/.
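To make the Pass 2 write strategy concrete, here is a rough sketch of the parallel encode + pwrite-at-precomputed-offset idea. EntryMeta, encodeSequence and writePass2 are hypothetical names, and the encoder is a trivial stand-in; the real codepath also records newline counts during Pass 1 and performs the actual padded nucleotide encoding.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <omp.h>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: entries whose byte ranges and output offsets were recorded during
// Pass 1 are encoded in parallel and written with pwrite() to pre-computed,
// non-overlapping offsets. Field and function names are assumptions.
struct EntryMeta {
    size_t fastaOffset;   // start of the sequence bytes inside the mmap'd FASTA
    size_t fastaLength;   // raw length including newlines
    size_t outOffset;     // pre-computed offset in the padded output file
};

// stand-in for the real padded nucleotide encoder; here it just upper-cases the bases
static size_t encodeSequence(const char *seq, size_t len, char *out) {
    for (size_t i = 0; i < len; ++i) out[i] = std::toupper(static_cast<unsigned char>(seq[i]));
    return len;
}

void writePass2(const char *fastaPath, int outFd, const std::vector<EntryMeta> &entries) {
    int inFd = open(fastaPath, O_RDONLY);
    struct stat st;
    fstat(inFd, &st);
    const char *fasta = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, inFd, 0));

#pragma omp parallel
    {
        std::string stripped;   // thread-local buffer, newlines removed
        std::string encoded;    // thread-local encode buffer
#pragma omp for schedule(dynamic, 64)
        for (size_t i = 0; i < entries.size(); ++i) {
            const EntryMeta &e = entries[i];
            stripped.clear();
            for (size_t j = 0; j < e.fastaLength; ++j) {
                char c = fasta[e.fastaOffset + j];
                if (c != '\n') stripped.push_back(c);   // strip newlines from the mmap'd region
            }
            encoded.resize(stripped.size());
            size_t n = encodeSequence(stripped.data(), stripped.size(), &encoded[0]);
            // offsets are disjoint, so concurrent pwrite() calls never overlap
            pwrite(outFd, encoded.data(), n, e.outOffset);
        }
    }
    munmap(const_cast<char *>(fasta), st.st_size);
    close(inFd);
}
```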
…el encode+mask

When building only the SequenceLookup (no k-mer index), encode+mask each sequence and write directly to the .idx file via the DBWriter streaming API, instead of allocating the O(totalSeqBytes) heap buffer. This eliminates the dominant memory allocation that causes OOM at ~250 GB input. Memory usage drops from O(totalSeqBytes) to O(numSequences * 8 + maxSeqLen). Produces byte-identical SequenceLookup data in the index file (12/12 correctness checks across 3 sizes).

The streaming path now calls posix_madvise(MADV_DONTNEED) every 64 MB to release consumed mmap pages, keeping memory flat at ~100 MB regardless of input size. Previously the mmap'd padded DB filled page cache proportionally to input size (~1x). With a 1.5 TB DB and 128 GB RAM this is problematic.

The streaming streamSequenceLookupToIndex() path is parallelized with OpenMP using chunked processing: sequences are processed in chunks of 100K, each chunk parallel-encoded into a shared buffer at pre-computed non-overlapping offsets (same thread-safety model as fillDatabase), then written serially via writeAdd.

Benchmark (64 threads, streaming vs bulk-allocation, --slow):

SIZE    STREAMING       BULK-ALLOC      TIME    MEM
1M      0.073s  1517%   0.058s  1783%   1.25x   n/a
10M     0.144s  1118%   0.145s  1110%   1.00x   0.88x
50M     0.278s  1524%   0.285s  1462%   0.98x   0.96x
100M    0.435s  1881%   0.427s  1770%   1.02x   1.00x
500M    1.908s  2084%   1.910s  2067%   1.00x   0.92x
1000M   3.707s  2147%   3.668s  2137%   1.01x   0.98x
2000M   7.302s  2179%   7.210s  2174%   1.01x   0.54x

Wall time is unchanged (~1.0x ratio). Memory savings become significant at 2 GB input (1.0 GB streaming vs 1.8 GB bulk, 0.54x).
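A condensed sketch of the periodic page-release pattern follows, assuming a helper named releaseConsumedPages that the streaming loop calls after consuming each entry. The 64 MB window comes from the commit message; the helper name and loop structure are illustrative only.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Sketch only: while streaming sequentially over an mmap'd database, every ~64 MB of
// consumed data is handed back to the kernel so resident memory stays flat instead of
// growing with input size.
static const size_t kReleaseWindow = 64ULL * 1024 * 1024;   // 64 MB

static void releaseConsumedPages(char *mapBase, size_t consumedBytes, size_t &lastReleased) {
    if (consumedBytes - lastReleased < kReleaseWindow) {
        return;                                              // not enough new data consumed yet
    }
    const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    // only release whole pages that lie entirely behind the current read position
    size_t releaseEnd = (consumedBytes / pageSize) * pageSize;
    if (releaseEnd > lastReleased) {
        posix_madvise(mapBase + lastReleased, releaseEnd - lastReleased, POSIX_MADV_DONTNEED);
        lastReleased = releaseEnd;
    }
}
```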
For large nucleotide databases (NT is ~1.5 TB indexed), the current 4-step RNA indexing pipeline (createdb -> splitsequence -> makepaddedseqdb -> createindex) has two problems: intermediate databases double the disk I/O, and createindex allocates an O(totalSeqBytes) buffer for SequenceLookup that doesn't fit in RAM. On a 128 GB machine indexing a 1.5 TB DB, this forces heavy swapping — the swap I/O compounds with the already large sequential read I/O, making indexing painfully slow.
This PR reduces the pipeline to 2 steps and makes memory usage constant regardless of input size, eliminating both the intermediate I/O and the swap pressure.
Based on the databases-local-index branch (PR #4).
Commit 1: makepaddedseqdb fused FASTA codepath
`makepaddedseqdb` now accepts FASTA directly (detected by absence of .dbtype). The fused path eliminates all intermediate databases and their I/O, while producing byte-for-byte identical output to the 3-step pipeline (verified: 144/144 file comparisons across 6 configs x 4 sizes).

Two-pass design:
- Pass 1: stream the FASTA collecting only metadata (headers + lengths), compute shuffle/sort order and output offsets from that metadata alone, and write the header, lookup, and source files.
- Pass 2: re-read the FASTA, encode each chunk, and pwrite() it to its pre-computed offset, parallelized with mmap + OpenMP.

Falls back to sequential reading for compressed/stdin inputs.
databases.sh LOCAL_FASTA pipeline updated from 4 steps to 2: `makepaddedseqdb <fasta> <outdb>` (fused single pass), then `createindex <outdb> <tmp> --index-subset 2` (streaming + MADV_DONTNEED).
Benchmark (64 threads, fused vs 3-step reference): wall time is 0.21x-0.72x of the reference, with RSS dropping to 0.51x at the larger inputs (full table in the commit message above).
Commit 2: createindex streaming SequenceLookup
When building only the SequenceLookup (`--index-subset 2`, no k-mer index), createindex now writes encoded+masked sequences directly to the .idx file via the DBWriter streaming API, instead of allocating an O(totalSeqBytes) heap buffer. This eliminates OOM at large input sizes (~250 GB+). Memory: O(totalSeqBytes) -> O(numSequences * 8 + maxSeqLen). Uses posix_madvise(MADV_DONTNEED) every 64 MB to evict consumed mmap pages, keeping resident memory flat regardless of input size.
Parallelized with OpenMP (chunked processing, same thread-safety model as fillDatabase). Produces byte-identical index data (12/12 correctness checks).
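A simplified sketch of that chunked thread-safety model follows; Writer, encodedSize and encodeAndMask are stand-ins, and the writeAdd signature shown is an assumption rather than the actual DBWriter API.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Sketch only: for each chunk of up to 100K sequences, non-overlapping output offsets
// inside a shared buffer are pre-computed (prefix sum of encoded lengths), the chunk is
// encoded+masked in parallel into those slots, and the buffer is then flushed serially.
struct Writer {                                   // stand-in for the DBWriter streaming API
    std::string sink;                             // the real path appends to the .idx file
    void writeAdd(const char *data, size_t size) { sink.append(data, size); }
};

static size_t encodedSize(size_t seqLen) { return seqLen; }          // real code: encoded/padded size
static void encodeAndMask(const std::string &seq, char *dst) {       // real code: encode + mask
    std::memcpy(dst, seq.data(), seq.size());
}

void streamChunks(const std::vector<std::string> &seqs, Writer &writer) {
    const size_t kChunk = 100000;                 // 100K sequences per chunk
    std::vector<char> buffer;
    std::vector<size_t> offsets;

    for (size_t begin = 0; begin < seqs.size(); begin += kChunk) {
        const size_t end = std::min(begin + kChunk, seqs.size());

        // pre-compute non-overlapping offsets for every sequence in this chunk
        offsets.assign(end - begin + 1, 0);
        for (size_t i = begin; i < end; ++i) {
            offsets[i - begin + 1] = offsets[i - begin] + encodedSize(seqs[i].size());
        }
        buffer.resize(offsets.back());

        // parallel encode: each thread writes only into its own sequence's slot
        #pragma omp parallel for schedule(dynamic, 1024)
        for (size_t i = begin; i < end; ++i) {
            encodeAndMask(seqs[i], buffer.data() + offsets[i - begin]);
        }

        // serial flush of the whole chunk, so the on-disk layout is deterministic
        writer.writeAdd(buffer.data(), buffer.size());
    }
}
```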
Benchmark (64 threads, streaming vs bulk-allocation): see the table in the commit message above.
Wall time is unchanged. Memory savings become significant at scale (1.0 GB vs 1.8 GB at 2 GB input, 0.54x).
Testing
Both commits include self-contained benchmark suites in util/benchmarks/ that verify correctness (byte-identical output) and measure performance. Run: