This repository contains a Snakemake pipeline that can (depending on the requested Snakemake targets and on the configuration in datasets.yaml):
- Build graphs (GFA) from FASTA, FASTQ (ggcat+Lighter), VCF (vg), or pre‑existing GFAs.
- Clean and "bluntify" the graphs.
- Prepare unidirectional graph representations (sgraph / edgelist), when required by the selected benchmark programs / inputs.
- Run benchmarks on several programs (BubbleGun, BubbleFinder (superbubbles / snarls / ultrabubbles), vg snarls, clsd).
- Aggregate results in a single table and produce plots (time / memory).
Running the default target all (snakemake all ...) triggers the workflow needed to produce the aggregated table and plots for all enabled datasets and selected benchmark programs.
Note: steps (build / clean / conversions / benchmarks) run only if required by the requested Snakemake targets and by datasets.yaml. In particular, any benchmark/aggregation target (e.g. results/benchmarks.tsv or all) will build data/<dataset>/<dataset>.cleaned.gfa for the datasets being benchmarked.
- 1. Requirements
- 2. Setup
- 3. Dataset configuration (datasets.yaml)
- 4. Running the pipeline
- 5. Output structure
- 6. Customizing tool binaries
- 7. Quick troubleshooting
- Unix‑like OS (Linux, macOS).
- Snakemake (recommended via conda/mamba).
- conda or mamba available in $PATH. (The pipeline uses conda: directives in the rules.)
- Internet access (to download data and binaries).
The following tools are managed automatically by the Snakefile if no binary is specified in datasets.yaml:
- BubbleFinder, cloned and built from GitHub (commit 302de4f0).
- GetBlunted, precompiled binary downloaded (release v1.0.0).
- clsd, cloned and built (commit c49598fc).
- Lighter, cloned and built (commit d8621db1).
# 1) Clone the repository
git clone https://github.com/algbio/BubbleFinder-experiments.git
cd BubbleFinder-experiments
# 2) (Optional) Create a conda/mamba env with Snakemake version 9
mamba env create -f environment.yml
conda activate bench

Make sure conda or mamba is in $PATH when you run Snakemake.
The datasets.yaml file describes:
- datasets (datasets:),
- builders (how to produce raw GFAs),
- tools and conda environments,
- benchmark programs and their parameters.
Each dataset has a builder field:
- ggcat_from_fasta
  - Input: FASTA/FNA (or .tar.gz archive containing FASTA files).
- ggcat_from_reads_lighter
  - Input: FASTQ(.gz) → Lighter correction → ggcat.
- vg_from_vcf
  - Input: fa_gz (reference) + vcf_gz → vg construct → GFA.
- gfa_from_url
  - Input: pre‑built GFA downloaded from a URL.
- vg_from_url
  - Input: a .vg file downloaded from a URL → vg convert -g → GFA.
- gbz_from_url
  - Input: a .gbz file downloaded from a URL → vg convert -g → GFA.
- pggb_from_fasta
  - Input: FASTA, graph built with pggb.
Under datasets:
- name: coli3682
  enabled: true
  builder: ggcat_from_fasta
  ...

- enabled: true → used.
- enabled: false → ignored.
- enabled: auto → enabled only if input files can be detected (useful for datasets discovered from an index page).
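The three values of enabled amount to a small decision rule. The sketch below is only an illustration: is_enabled, the inputs_detected argument, and the dataset name from_index are hypothetical. How the Snakefile detects input files, and what it does when the field is missing, are assumptions here.

```python
def is_enabled(dataset, inputs_detected):
    """Hypothetical helper mirroring the documented meaning of `enabled`."""
    # Assumption: a dataset without an `enabled` field is treated as enabled.
    value = dataset.get("enabled", True)
    if isinstance(value, str):
        text = value.strip().lower()
        if text == "auto":
            # `auto` follows whatever the input detection reports.
            return inputs_detected
        return text == "true"
    return bool(value)


datasets = [
    {"name": "coli3682", "enabled": True},
    {"name": "ecoli50", "enabled": False},
    {"name": "from_index", "enabled": "auto"},  # hypothetical dataset
]
active = [d["name"] for d in datasets if is_enabled(d, inputs_detected=True)]
print(active)  # ['coli3682', 'from_index']
```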
Benchmark programs are selected:
- globally via defaults.bench.programs, and/or
- per dataset via datasets: ... bench_programs:
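A minimal sketch of how the two levels could combine, assuming a per-dataset bench_programs list replaces the global list rather than being merged with it. The function programs_for is hypothetical and the configuration is written as plain Python dictionaries to keep the example self-contained.

```python
def programs_for(dataset, defaults):
    """Return the benchmark programs for one dataset (hypothetical helper)."""
    override = dataset.get("bench_programs")
    if override is not None:
        # Assumption: the per-dataset list fully replaces the global one.
        return list(override)
    return list(defaults.get("bench", {}).get("programs", []))


defaults = {"bench": {"programs": ["BubbleGun_gfa", "BubbleFinder", "clsd_sb"]}}
with_override = {"name": "ecoli50", "bench_programs": ["BubbleFinder"]}
without_override = {"name": "coli3682"}

print(programs_for(with_override, defaults))     # ['BubbleFinder']
print(programs_for(without_override, defaults))  # ['BubbleGun_gfa', 'BubbleFinder', 'clsd_sb']
```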
Global example:
defaults:
  bench:
    reps: 2
    programs:
      - BubbleGun_gfa
      - BubbleFinder
      - vg_snarls_gfa
      - clsd_sb

Per dataset override example:
- name: ecoli50
  ...
  bench_programs:
    - BubbleFinder
    - vg_snarls_gfa

The number of benchmark repetitions is controlled by defaults.bench.reps.
For each dataset, for each selected program (and for each configured thread count), the pipeline runs rep = 1..reps and writes one TSV per repetition.
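The run matrix described above can be enumerated as follows. This is a sketch under assumptions: benchmark_runs is a hypothetical helper, programs without a configured thread list are assumed to run with a single thread, and BubbleFinder modes (which multiply the runs further) are left out.

```python
from itertools import product


def benchmark_runs(dataset, programs, threads_by_program, reps):
    """List one (dataset, program, threads, rep) tuple per expected TSV."""
    runs = []
    for program in programs:
        # Assumption: fall back to one thread when nothing is configured.
        thread_counts = threads_by_program.get(program, [1])
        for threads, rep in product(thread_counts, range(1, reps + 1)):
            runs.append((dataset, program, threads, rep))
    return runs


runs = benchmark_runs(
    "ecoli50",
    ["BubbleFinder", "clsd_sb"],
    {"BubbleFinder": [1, 4]},
    reps=2,
)
print(len(runs))  # 6: BubbleFinder 2 thread counts x 2 reps, clsd_sb 1 x 2
```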
A TSV file is a plain-text Tab-Separated Values file (fields separated by tab characters); in this pipeline, benchmark TSVs are small key/value tables storing time/memory/exit status and some metadata (program, threads, etc.).
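A minimal sketch of reading such a file, assuming one key and one value per row. The exact layout and the field names used in the sample (program, threads, exit_status) are illustrative assumptions and may differ from what the pipeline writes.

```python
import csv
import io


def read_key_value_tsv(handle):
    """Read tab-separated key/value rows into a dict (values stay strings)."""
    record = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed rows
            record[row[0]] = row[1]
    return record


# Illustrative content only; real benchmark TSVs may use other keys.
sample = "program\tBubbleFinder\nthreads\t4\nexit_status\t0\n"
record = read_key_value_tsv(io.StringIO(sample))
print(record["program"], record["threads"])  # BubbleFinder 4
```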
Example:
defaults:
  bench:
    reps: 2

The pipeline benchmarks BubbleFinder via the program name BubbleFinder. BubbleFinder runs in one or more modes, configured in datasets.yaml.
Modes supported here (BubbleFinder commands):
- superbubbles
- snarls
- ultrabubbles
Mode selection is controlled by:
- defaults.bench.program_opts.BubbleFinder.modes (global), and/or
- per-dataset overrides under datasets: ... bench: program_opts: BubbleFinder: ...
Global example (run multiple modes):
defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["superbubbles", "snarls", "ultrabubbles"]

Per-dataset example (run only ultrabubbles on one dataset):
- name: ecoli50
  ...
  bench_programs:
    - BubbleFinder
  bench:
    program_opts:
      BubbleFinder:
        modes: ["ultrabubbles"]

BubbleFinder input type is configured per mode with input:
- input: "gfa" → BubbleFinder reads data/<dataset>/<dataset>.cleaned.gfa
- input: "sgraph" → BubbleFinder reads data/<dataset>/<dataset>.sbspqr.sgraph (built from a cleaned GFA)
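The mapping from input type to file can be written as a small lookup. The function bubblefinder_input is a hypothetical name used for illustration; only the two path patterns come from the description above.

```python
from pathlib import PurePosixPath

# File suffix read by BubbleFinder for each documented input type.
SUFFIX_BY_INPUT = {"gfa": "cleaned.gfa", "sgraph": "sbspqr.sgraph"}


def bubblefinder_input(dataset, input_type):
    """Return the path BubbleFinder would read for this dataset and input type."""
    try:
        suffix = SUFFIX_BY_INPUT[input_type]
    except KeyError:
        raise ValueError(f"unsupported input type: {input_type!r}") from None
    return PurePosixPath("data") / dataset / f"{dataset}.{suffix}"


print(bubblefinder_input("ecoli50", "gfa"))     # data/ecoli50/ecoli50.cleaned.gfa
print(bubblefinder_input("ecoli50", "sgraph"))  # data/ecoli50/ecoli50.sbspqr.sgraph
```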
Example (run BubbleFinder ultrabubbles, using GFA input):
defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["ultrabubbles"]
        default:
          input: "gfa"

Note: dataset names do not control BubbleFinder modes; only the modes: field does.
Also, BubbleFinder’s ultrabubbles mode requires at least one tip per connected component in the input graph (BubbleFinder requirement).
These derived representations are built only if required by the selected benchmark programs / inputs:
- clsd (clsd_sb) uses an edgelist: data/<dataset>/<dataset>.clsd.edgelist (generated from a cleaned GFA).
- BubbleFinder uses sgraph only when BubbleFinder is configured with input: "sgraph".
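The two rules above can be sketched as a function that lists which derived files a dataset needs. The name derived_files and its bubblefinder_inputs argument (the input types used across the configured BubbleFinder modes) are hypothetical; the Snakefile may decide this differently.

```python
def derived_files(dataset, programs, bubblefinder_inputs):
    """List derived representations required by the selected programs."""
    needed = []
    if "clsd_sb" in programs:
        # clsd reads an edgelist generated from the cleaned GFA.
        needed.append(f"data/{dataset}/{dataset}.clsd.edgelist")
    if "BubbleFinder" in programs and "sgraph" in bubblefinder_inputs:
        # Only needed when at least one BubbleFinder mode uses sgraph input.
        needed.append(f"data/{dataset}/{dataset}.sbspqr.sgraph")
    return needed


print(derived_files("ecoli50", ["BubbleFinder"], ["gfa"]))
# []
print(derived_files("ecoli50", ["BubbleFinder", "clsd_sb"], ["gfa", "sgraph"]))
# ['data/ecoli50/ecoli50.clsd.edgelist', 'data/ecoli50/ecoli50.sbspqr.sgraph']
```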
There are two distinct “thread” concepts:
- snakemake -j <N> controls how many rules run in parallel (workflow-level parallelism).
- Program-specific thread counts control how many threads a tool uses inside one rule.
Builder thread defaults are set under defaults.threads, e.g.:
defaults:
  threads:
    ggcat: 8
    vg: 16
    lighter: 8
    pggb: 4

For benchmarks, thread values can be configured:
- globally per program via defaults.bench.threads_by_program, and/or
- per dataset via datasets: ... bench_threads:
Global example:
defaults:
  bench:
    threads_by_program:
      BubbleFinder: [1, 4, 8, 16]
      vg_snarls_gfa: [1, 4, 8, 16]

Per-dataset override example:
- name: ecoli50
  ...
  bench_threads:
    vg_snarls_gfa: [1, 8]

BubbleFinder thread values can also be set per mode via program_opts:
defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["superbubbles", "ultrabubbles"]
        by_mode:
          superbubbles:
            threads: [1, 2, 4, 8]
          ultrabubbles:
            threads: [1, 4, 8, 16]

# Dry run: list the planned jobs and print their shell commands
snakemake -n -p

# Full run
snakemake all --use-conda -j 8

- --use-conda: required to create/use the environments defined in config/*.yml.
- -j 8: number of parallel jobs.
- Build and clean only the GFA of coli3682:
  snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa
- Run benchmarks + aggregation (this also produces the plots because they are generated by the same rule in this workflow):
  snakemake --use-conda -j 8 results/benchmarks.tsv

For each dataset <name>:
- data/<name>/<name>.*.gfa
  - Raw GFA (builder-dependent), e.g.:
    - ...ggcat.fasta.gfa
    - ...vg.gfa
    - ...pggb.gfa
    - ...raw.gfa (builder: gfa_from_url)
    - ...vg.url.gfa (builder: vg_from_url)
    - ...gbz.url.gfa (builder: gbz_from_url)
- data/<name>/<name>.bluntified.gfa (temporary).
- data/<name>/<name>.cleaned.gfa
  - Cleaned GFA (no H‑lines, bluntified).
- data/<name>/<name>.sb.cleaned.gfa
  - Cleaned GFA produced by the optional ggcat -f rebuild (when enabled for SB programs on ggcat-based datasets).
- data/<name>/<name>.sbspqr.sgraph
  - sgraph for BubbleFinder when configured with input: "sgraph".
- data/<name>/<name>.clsd.edgelist
  - Edgelist for clsd.
- results/.prechecks.ok
  - Marker indicating that prechecks have run.
- results/bench/
  - Per‑program, per‑dataset, per‑rep benchmark TSV files.
  - For BubbleFinder, file names include the mode: BubbleFinder/<dataset>.<mode>.t<threads>.rep<rep>.tsv
- results/prog_out/
  - Raw program outputs (BubbleFinder outputs + BubbleFinder JSON report, etc.).
- results/logs/
  - Detailed logs per step.
- results/benchmarks.tsv
  - Aggregated benchmark table.
- results/plots/
  - time_by_dataset_program.png
  - rss_by_dataset_program.png
- results/summary/reruns_planned.tsv
  - Information used for automatic reruns (timeouts, failures).
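When post-processing results by hand, the BubbleFinder file name pattern <dataset>.<mode>.t<threads>.rep<rep>.tsv can be split back into its fields. The sketch below is not part of the pipeline; parse_bubblefinder_tsv_name is a hypothetical helper, and it assumes the mode is one of the three documented modes.

```python
import re

# Matches <dataset>.<mode>.t<threads>.rep<rep>.tsv for the documented modes.
BF_NAME = re.compile(
    r"^(?P<dataset>.+)\.(?P<mode>superbubbles|snarls|ultrabubbles)"
    r"\.t(?P<threads>\d+)\.rep(?P<rep>\d+)\.tsv$"
)


def parse_bubblefinder_tsv_name(name):
    """Split a BubbleFinder benchmark file name into fields, or return None."""
    match = BF_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["threads"] = int(fields["threads"])
    fields["rep"] = int(fields["rep"])
    return fields


print(parse_bubblefinder_tsv_name("ecoli50.ultrabubbles.t8.rep2.tsv"))
# {'dataset': 'ecoli50', 'mode': 'ultrabubbles', 'threads': 8, 'rep': 2}
```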
In datasets.yaml, under defaults.tools, you can point to pre‑installed binaries to avoid automatic cloning/building:
defaults:
  tools:
    spqr_bin: /path/to/BubbleFinder
    get_blunted: /path/to/get_blunted
    lighter_bin: /path/to/lighter
    clsd_bin: /path/to/clsd

If these paths are not set, the Snakefile will:
- clone/build BubbleFinder, clsd, and Lighter under build/,
- download get_blunted into bin/.
- Error in prechecks (prechecks):
  - Check results/logs/* (especially results/logs/bench, results/logs/ggcat, results/logs/vg).
  - Ensure conda or mamba is available.
- Frequent timeouts:
  - Adjust defaults.tools.timeout.seconds in datasets.yaml.
- Problem with a specific dataset:
  - Run a single target, e.g. snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa
- Bluntification issues:
  - A WARN about get_blunted means the pipeline falls back to "naive bluntify" (overlap fields forced to *).
  - Install GetBlunted and set defaults.tools.get_blunted to use it.