BubbleFinder computes all snarls, superbubbles, and ultrabubbles in genomic and pangenomic GFA graphs (i.e. bidirected graphs).
BubbleFinder is built around linear-time algorithms (linear in the size of the input graph, i.e. O(|V|+|E|)), and outputs compact representations:
- Snarls: BubbleFinder computes all snarls and outputs a representation whose total size is linear in the input graph size.
- Superbubbles: superbubbles admit a linear-size representation as pairs of endpoints, which BubbleFinder outputs.
- Ultrabubbles: BubbleFinder computes ultrabubbles in linear time by reducing them to directed superbubbles after orienting the bidirected graph (requires at least one tip per connected component, see the
ultrabubblescommand notes).
For snarls and superbubbles, BubbleFinder exploits the SPQR trees of the biconnected components of the undirected counterparts of the input bidirected graph, and traverses them efficiently to identify all snarls and superbubbles.
Important
snarls computes all snarls and aims to replicate the behavior of vg snarls -a -T, but vg outputs only a pruned, linear-size snarl decomposition.
Therefore, BubbleFinder may output more snarls than vg snarl.
Note
Empirical performance (snarls & superbubbles). Benchmarks and methodological details are reported in Sena, Politov et al., 2025.
- Snarls: BubbleFinder is consistently faster than
vg snarls -a -Ton the PGGB graphs (up to ~2× faster on larger graphs and ~3× on the smallest one). On human chromosome graphs (Chromosome 1/10/22), BubbleFinder can be up to ~2× slower end-to-end in a single-threaded run due to preprocessing (BC/SPQR tree building), but benefits from multi-threading (up to ~4× speedup at 16 threads in those datasets). - Superbubbles: BubbleFinder runs in similar times as BubbleGun on small graphs, and is about ~10× faster on larger graphs; in particular, BubbleGun hit a >3h timeout on Chromosome 1/10/22, while BubbleFinder completed in minutes in our benchmarks.
Ultrabubbles are computed using a different approach (not SPQR-based): BubbleFinder first orients each connected component of the bidirected graph with a DFS-based procedure (starting from a tip). During this orientation, conflicts (same-sign edges whose endpoints are already fixed) are resolved by introducing auxiliary source/sink tips, yielding a directed graph whose size remains linear in the original graph size. BubbleFinder then runs a linear-time directed superbubble algorithm on this oriented graph and maps the resulting superbubbles back to ultrabubbles, ignoring bubbles whose endpoints are auxiliary vertices.
Note
Empirical performance (ultrabubbles). In our ultrabubble enumeration benchmarks, BubbleFinder consistently outperformed vg across all tested datasets: about 2.25×–2.35× on 1000GP chromosome graphs, 4.75×–6.38× on PGGB graphs, 10.66× on HPRC v1.1, and up to 30.79× on HPRC v2.0 CHM13 (232 individuals), reducing wall time from 9:09:39 to 17:51.
Peak RAM was generally comparable on smaller graphs, and dropped from 459.90 GiB to 111.87 GiB on HPRC v2.0 CHM13 (~4.11× less).
A dedicated ultrabubble preprint describing this method and its benchmarks is forthcoming (link to be added).
Download the latest release:
https://github.com/algbio/BubbleFinder/releases/latest
./BubbleFinder --help
./BubbleFinder snarls -g example/tiny1.gfa -o tiny1.snarlsconda create -n bubblefinder_env -c conda-forge -c bioconda bubblefinder
conda activate bubblefinder_env
./BubbleFinder --helpSee: Building from source
| Command | Typical input | Output endpoints | Notes |
|---|---|---|---|
snarls |
bidirected GFA | oriented incidences (a+, d-) |
may output cliques (compact representation of many pairs) |
superbubbles |
bidirected GFA | segment IDs (a, e) |
computed on doubled directed graph + orientation projection |
directed-superbubbles |
directed (--graph or --gfa-directed) |
segment IDs | directed superbubbles directly |
ultrabubbles |
bidirected GFA | oriented incidences | requires ≥ 1 tip per connected component |
spqr-tree |
GFA only | .spqr v0.1 |
connected components + BC-tree + SPQR decomposition |
flowchart TD
%% POINT D'ENTRÉE GÉNÉRAL
A["Start: ./BubbleFinder COMMAND -g GRAPH -o OUT"] --> B["Read input file (option -g).<br>Auto-detect compression: .gz / .bz2 / .xz"]
B --> C{"Input format?"}
C -->|"--gfa or *.gfa"| D["Parse GFA as bidirected graph"]
C -->|"--gfa-directed"| E["Parse GFA as directed graph"]
C -->|"--graph or *.graph"| F["Parse .graph as directed graph"]
%% SÉLECTION DE COMMANDE
D --> G{"Command"}
E --> G
F --> G
%% SNARLS
G -->|snarls| S0["Build undirected counterpart<br>of bidirected GFA graph"]
S0 -->|"For each connected component"| S1["Compute BC-tree + SPQR trees<br>for each biconnected component"]
S1 --> S2["Traverse SPQR trees.<br>Enumerate ALL snarls (linear-time)"]
S2 --> S3["Output (-o): oriented incidences (a+/d-).<br>May use clique-lines (compact representation)"]
%% SUPERBUBBLES (bidirected GFA -> doubled directed)
G -->|superbubbles| SB0["Build doubled directed graph<br>with oriented copies v+ / v- for each segment v"]
SB0 -->|"For each connected component"| SB1["Compute BC-tree + SPQR trees<br>for each biconnected component"]
SB1 --> SB2["Traverse SPQR trees with DP on skeletons.<br>Enumerate ALL superbubbles<br>on the doubled graph"]
SB2 --> SB3["Orientation projection (see §“Orientation projection”).<br>Merge mirror bubbles, drop +/- , sort endpoints"]
SB3 --> SB4["Output (-o): segment ID pairs (a e).<br>Exactly one pair per line"]
%% DIRECTED SUPERBUBBLES (même algo, graphe déjà dirigé)
G -->|directed-superbubbles| DSB0["Use directed input as-is<br>(--graph or --gfa-directed)"]
DSB0 -->|"For each connected component"| DSB1["Compute BC-tree + SPQR trees<br>for each biconnected component"]
DSB1 --> DSB2["Traverse SPQR trees with DP on skeletons.<br>Enumerate ALL directed superbubbles"]
DSB2 --> DSB3["Output (-o): segment ID pairs.<br>Exactly one pair per line"]
%% ULTRABUBBLES (orientation + superbubbles dirigés)
G -->|ultrabubbles| U0{"At least 1 tip per connected component?"}
U0 -->|no| Ufail["Fail (current limitation).<br>Need at least 1 tip per component"]
U0 -->|yes| U1["Orient each component via DFS starting from a tip"]
U1 --> U2["Resolve same-sign conflicts by adding auxiliary tips<br>(keeps linear size)"]
U2 --> U3["Obtain directed skeleton"]
U3 --> U4["Run CLSD directed superbubble decomposition"]
U4 --> U5["Map superbubbles back to ultrabubbles.<br>Discard bubbles with auxiliary endpoints"]
U5 --> U6["Output (-o): oriented incidence pairs (a+ d-)"]
U5 --> U7{"Option --clsd-trees FILE?"}
U7 -->|yes| U8["Also output ultrabubble hierarchy<br>(reconnected CLSD trees)"]
U7 -->|no| End1["Done"]
%% SPQR-TREE OUTPUT
G -->|spqr-tree| P0{"GFA input?"}
P0 -->|no| Pfail["Fail: spqr-tree requires GFA input"]
P0 -->|yes| P1["Compute components + BC-tree + SPQR decomposition"]
P1 --> P2["Output: .spqr v0.1 file"]
BubbleFinder supports five commands:
-
snarls: computes all snarls and is supposed to replicate the behavior of vg snarl (when run with parameters-a -T).
Note thatvg snarlprunes some snarls to output only a linear number of snarls; thusBubbleFinderfinds more snarls thanvg snarl. -
superbubbles: computes superbubbles in a (virtually) doubled representation of the bidirected graph and is supposed to replicate the behavior of BubbleGun.
Since superbubbles are classically defined on directed graphs, BubbleFinder first runs the algorithm on this doubled directed representation, then projects the results back to unordered pairs of segment IDs (see Orientation projection).
Notice that BubbleGun also reports weak superbubbles, i.e. for a bubble with entrysand exitt, it also reports the structures which also have an edge fromttos(thus the interior of the bubble is not acyclic). -
directed-superbubbles: computes superbubbles on a directed input graph (--graphor--gfa-directed). -
ultrabubbles: computes ultrabubbles by orienting each connected component with a DFS procedure and then running the clsd superbubble algorithm on the resulting directed skeleton; requires at least one tip per connected component in the input graph. -
spqr-tree: outputs the connected components, BC-tree (blocks / cut vertices), and SPQR decomposition of each biconnected block in the SPQR tree file format.spqrv0.1 as specified here: https://github.com/sebschmi/SPQR-tree-file-format.
As an additional feature, BubbleFinder can also output a tree hierarchy of ultrabubbles using --clsd-trees <file> (see Ultrabubble hierarchy (CLSD trees)). This hierarchy is derived from the directed superbubble decomposition computed on the oriented directed skeleton (Gärtner & Stadler, 2019). We then transform this superbubble hierarchy into an ultrabubble hierarchy by removing bubbles whose entrance or exit corresponds to an auxiliary vertex introduced during skeleton construction, and by reconnecting their children to the removed bubble’s parent.
Warning
ultrabubbles currently requires at least one tip per connected component in the input graph (otherwise it might fail).
A precompiled binary for Linux (x86‑64) is available directly in the Assets section of the latest GitHub release:
https://github.com/algbio/BubbleFinder/releases/latest
After downloading the asset, extract it and run:
./BubbleFinder --helpAt the moment, building from source has been tested only on Linux:
git clone https://github.com/algbio/BubbleFinder && \
cd BubbleFinder && \
cmake -S . -B build && \
cmake --build build -j <NUM_THREADS> && \
mv build/BubbleFinder .Replace <NUM_THREADS> with the number of parallel build jobs you want to use (for example -j 8). Omitting -j will build single‑threaded but more slowly.
Now BubbleFinder is in the root directory.
If you use conda, you can install BubbleFinder from Bioconda:
conda create -n bubblefinder_env -c conda-forge -c bioconda bubblefinder
conda activate bubblefinder_env
./BubbleFinder --helpCommand line to run BubbleFinder:
./BubbleFinder <command> -g <graphFile> -o <outputFile> [options]
Available commands are:
superbubbles- Find bidirected superbubbles (GFA -> bidirected by default)directed-superbubbles- Find directed superbubbles (directed graph)snarls- Find snarls (typically on bidirected graphs from GFA)ultrabubbles- Find ultrabubbles (typically on bidirected graphs from GFA)spqr-tree- Output the connected components, BC-tree and SPQR decomposition in.spqrv0.1 format
BubbleFinder supports bidirected and directed graphs in GFA and internal .graph format.
BubbleFinder .graph text format:
- first line: two integers n and m
- n = number of distinct node IDs declared
- m = number of directed edges
- next m lines:
u v(separated by whitespace), each describing a directed edge fromutov. uandvare arbitrary node identifiers (strings without whitespace).
You can force the format of the input graph with the following flags:
--gfa
GFA input (bidirected).
--gfa-directed
GFA input interpreted as a directed graph.
--graph
BubbleFinder .graph text format with one directed edge per line (described above).
If none of these is given, the format is auto-detected from the file extension (e.g., .gfa, .graph).
Input files can be compressed with gzip, bzip2 or xz. Compression is auto-detected from the file name suffix:
.gz / .bgz -> gzip
.bz2 -> bzip2
.xz -> xz
Note
spqr-tree currently requires GFA input.
The internal .graph format is not supported for spqr-tree.
Complete list of options:
-g <file>
Input graph file (possibly compressed)
-o <file>
Output file
-j <threads>
Number of threads
--gfa
Force GFA input (bidirected)
--gfa-directed
Force GFA input interpreted as directed graph
--graph
Force .graph text format (see 'Format options' above)
--clsd-trees <file>
Compute CLSD superbubble trees (hierarchy) and write them to <file>.
Only supported for the ultrabubbles command.
--report-json <file>
Write JSON metrics report
-m <bytes>
Stack size in bytes
-h, --help
Show the help message and exit
All commands write plain text to the file specified by -o <outputFile> with the same global structure (unless stated otherwise by a specific option):
- First line: a single integer
N, the number of result lines that follow. - Lines 2..N+1: one result line per line.
Each result line encodes one or more unordered pairs of endpoints.
What an "endpoint" looks like depends on the command:
snarls/ultrabubbles: endpoints are oriented incidences, e.g.a+,d-.superbubbles/directed-superbubbles: endpoints are segment IDs without orientation, e.g.a,e.
The only difference between commands is:
snarlsmay output cliques (a line with ≥ 2 endpoints encodes all pairs between them),superbubblesanddirected-superbubblesalways output exactly one pair per line.
Used by the snarls command.
- After the header, each line contains at least two incidences, separated by whitespace.
- An incidence is a segment (or node) ID followed by an orientation sign, e.g.
a+,d-.
Example on the tiny graph in example/tiny1.gfa:
./BubbleFinder snarls -g example/tiny1.gfa -o example/tiny1.snarls --gfaThis produces:
2
g+ k-
a+ d- f+ g-
Interpretation:
- The first line
2means: 2 result lines follow. - Each result line with
k ≥ 2incidences encodes all unordered pairs among them:g+ k-encodes the single pair{g+, k-}.a+ d- f+ g-encodes the clique of pairs:{a+, d-},{a+, f+},{a+, g-},{d-, f+},{d-, g-},{f+, g-}.
So the snarl output is just a compact way to write many pairs at once:
one line = all pairs between the listed incidences.
Used by:
superbubbles(bidirected graphs from GFA),directed-superbubbles(directed graphs:--graphor--gfa-directed).
Here each result line contains exactly two tokens, e.g.:
3
a b
e f
b e
Interpretation:
- Each line
u vis a single unordered pair of segment IDs{u, v}. - IDs are segment names from GFA
Srecords (no+/-orientation).
For superbubbles, these pairs are obtained after running the superbubble algorithm on the doubled directed graph and then applying the orientation projection (see Orientation projection).
Default ultrabubble output (written to -o <outputFile>) is a flat list of endpoint pairs where each endpoint is an oriented incidence (segmentID+ / segmentID-):
N
a+ d-
g+ k-
...
Interpretation:
- each result line contains exactly two oriented incidences,
- each line encodes one unordered pair of oriented incidences
{u±, v±}.
To also output the hierarchical decomposition (/nesting structure) of ultrabubbles, run ultrabubbles with --clsd-trees <file>.
Each line corresponds to one rooted tree and follows a parenthesized representation:
- a leaf bubble is printed as:
<X,Y>
- an internal bubble with children is printed as:
(child1,child2,...,childk)<X,Y>
where X and Y are oriented incidences such as a+ or d-.
Implementation detail (relevant for interpretation): during construction of the directed skeleton, BubbleFinder may introduce auxiliary intermediate vertices. The CLSD decomposition is computed on that skeleton, and the reported ultrabubble decomposition is obtained by removing bubbles whose entrance or exit is such an introduced vertex. Their children are connected upward in the resulting hierarchy.
The spqr-tree command writes a .spqr file according to the SPQR tree file format specification:
- Specification repository: https://github.com/sebschmi/SPQR-tree-file-format
- Version used by BubbleFinder: v0.1
BubbleFinder writes the header line:
H v0.1 https://github.com/sebschmi/SPQR-tree-file-format
For details on the line types and semantics (H/G/N/B/C/S/P/R/V/E), please refer to the specification repository above, which contains the format documentation and examples.
If you look at example/tiny1.png you'll notice that the bidirected edge {a+, b+} appearing in the graph image has been encoded as:
L a + b - 0M
This is because in GFA, links are directed. So, the rule is that to compute snarls from a GFA file, for every link
a x b y
in the GFA file (where x, y ∈ {+, -}), we flip the second sign y as ¬y, and make a bidirected edge {a x, b ¬y}. Then we compute snarls in this bidirected graph.
Superbubbles are classically defined on directed graphs, whereas GFA graphs are bidirected. In superbubbles mode, BubbleFinder therefore first converts the bidirected graph into a doubled directed graph, where each segment v has two oriented copies v+ and v-, and superbubbles are detected between oriented endpoints (e.g. (a+, e-) and its mirror (e+, a-)).
To report results at the level of segments, independently of the arbitrary orientation chosen in the doubled graph, we apply an orientation projection:
- mirror pairs like
(a+, e-)and(e+, a-)are treated as the same bubble; - we then drop the
+/-signs and sort the two segment names.
The final output is a single unordered pair of segment IDs, e.g. a e. This process is illustrated below.
In this example, the directed graph on segments has three superbubbles with endpoints (a, b), (b, e) and (e, f). After running the superbubble algorithm on the doubled graph and applying the orientation projection, BubbleFinder reports exactly these three pairs in its standard output format (see Output format).
This repository includes a brute-force implementation and a random test harness to validate both the snarl and superbubble algorithms.
A brute-force program is built together with BubbleFinder from source and resides in the build directory after compilation. It can compute snarls and, in a different mode, superbubbles on a given GFA file. In the examples below we refer to this binary as ./build/snarls_bf.
To run the brute-force program on snarls for a given GFA file:
cd build
./snarls_bf gfaGraphPathThe brute-force program outputs results in a format consumed by src/bruteforce.py, which then compares them with BubbleFinder’s output. The same brute-force engine is also used for superbubble validation via the --superbubbles mode, which takes care of running the binary and parsing its output.
We also include a generator of random graphs and a driver script that compares the brute-force implementation with BubbleFinder. The script is src/bruteforce.py, and it can operate in two modes:
- Snarls
- Superbubbles
- Ultrabubbles
To run the random tests on snarls (for example, on 100 random graphs):
python3 src/bruteforce.py \
--bruteforce-bin ./build/snarls_bf \
--bubblefinder-bin ./BubbleFinder \
--n-graphs 100This will:
- generate random GFA graphs,
- run the brute-force snarl finder,
- run BubbleFinder in snarl mode,
- compare their outputs (missing / extra pairs / segfaults),
- and additionally check that BubbleFinder’s output is consistent across different thread counts if you pass several values to
--threads.
To run random tests on ultrabubbles, use the brute-force ultrabubble binary (e.g. ./build/ultrabubbles_bf). Since the current ultrabubble implementation in BubbleFinder requires at least one tip per connected component, you can restrict the generated random graphs with:
python3 src/bruteforce.py \
--bruteforce-bin ./build/ultrabubbles_bf \
--bubblefinder-bin ./BubbleFinder \
--n-graphs 100 \
--min-tips-per-cc 1To run the same style of tests for superbubbles, pass the --superbubbles flag:
python3 src/bruteforce.py \
--superbubbles \
--bruteforce-bin ./build/superbubbles_bf \
--bubblefinder-bin ./BubbleFinder \
--n-graphs 100To run superbubble tests with multiple thread configurations and check result consistency, you can run:
python3 src/bruteforce.py \
--superbubbles \
--bruteforce-bin ./build/superbubbles_bf \
--bubblefinder-bin ./BubbleFinder \
--n-graphs 100 \
--threads 1 4 8Any divergence or error is reported, and if you pass --keep-failing, the script will save the corresponding GFA and BubbleFinder outputs in /tmp for manual inspection.
-
Francisco Sena, Aleksandr Politov, Corentin Moumard, Manuel Cáceres, Sebastian Schmidt, Juha Harviainen, Alexandru I. Tomescu. Identifying all snarls and superbubbles in linear-time, via a unified SPQR-tree framework. arXiv:2511.21919 (2025). https://arxiv.org/abs/2511.21919
-
Fabian Gärtner, Peter F. Stadler. Direct superbubble detection. Algorithms 12(4):81, 2019. DOI: 10.3390/a12040081. https://www.mdpi.com/1999-4893/12/4/81
-
vg toolkit (GitHub): https://github.com/vgteam/vg
-
BubbleGun (GitHub): https://github.com/fawaz-dabbaghieh/bubble_gun
-
Scalable computation of ultrabubbles in pangenomes by orienting bidirected graphs. Preprint forthcoming (link to be added).