Snakemake Pipeline for BubbleFinder

This repository contains a Snakemake pipeline. Depending on the requested Snakemake targets and on the configuration in datasets.yaml, it can:

Build graphs (GFA) from FASTA, FASTQ (ggcat+Lighter), VCF (vg), or pre‑existing GFAs.
Clean and "bluntify" the graphs.
Prepare unidirectional graph representations (sgraph / edgelist), when required by the selected benchmark programs / inputs.
Run benchmarks on several programs (BubbleGun, BubbleFinder (superbubbles / snarls / ultrabubbles), vg snarls, clsd).
Aggregate results in a single table and produce plots (time / memory).

Running the default target all (snakemake all ...) triggers the workflow needed to produce the aggregated table and plots for all enabled datasets and selected benchmark programs.

Note: steps (build / clean / conversions / benchmarks) run only if required by the requested Snakemake targets and by datasets.yaml. In particular, any benchmark/aggregation target (e.g. results/benchmarks.tsv or all) will build data/<dataset>/<dataset>.cleaned.gfa for the datasets being benchmarked.

1. Requirements

Unix‑like OS (Linux, macOS).
Snakemake (recommended via conda/mamba).
conda or mamba available in $PATH.
(The pipeline uses conda: directives in the rules.)
Internet access (to download data and binaries).

The following tools are managed automatically by the Snakefile if no binary is specified in datasets.yaml:

BubbleFinder, cloned and built from GitHub (commit 302de4f0).
GetBlunted, precompiled binary downloaded (release v1.0.0).
clsd, cloned and built (commit c49598fc).
Lighter, cloned and built (commit d8621db1).

2. Setup

# 1) Clone the repository
git clone https://github.com/algbio/BubbleFinder-experiments.git
cd BubbleFinder-experiments

# 2) (Optional) Create a conda/mamba env with Snakemake version 9
mamba env create -f environment.yml
conda activate bench

Make sure conda or mamba is in $PATH when you run Snakemake.

3. Dataset configuration (`datasets.yaml`)

The datasets.yaml file describes:

datasets (datasets:),
builders (how to produce raw GFAs),
tools and conda environments,
benchmark programs and their parameters.

3.1 Available builders

Each dataset has a builder field:

ggcat_from_fasta
- Input: FASTA/FNA (or .tar.gz archive containing FASTA files).
ggcat_from_reads_lighter
- Input: FASTQ(.gz) → Lighter correction → ggcat.
vg_from_vcf
- Input: fa_gz (reference) + vcf_gz → vg construct → GFA.
gfa_from_url
- Input: pre‑built GFA downloaded from a URL.
vg_from_url
- Input: a .vg file downloaded from a URL → vg convert -g → GFA.
gbz_from_url
- Input: a .gbz file downloaded from a URL → vg convert -g → GFA.
pggb_from_fasta
- Input: FASTA, graph built with pggb.

3.2 Enable / disable a dataset

Under `datasets:`:

- name: coli3682
  enabled: true
  builder: ggcat_from_fasta
  ...

enabled: true → used.
enabled: false → ignored.
enabled: auto → enabled only if input files can be detected (useful for datasets discovered from an index page).
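
The three values above can be read as a small decision function. The following is only a sketch of that logic, not the Snakefile's actual code: the function name `is_enabled`, the `input_paths` argument, and the interpretation of "detected" as "every expected input file exists locally" are assumptions made for illustration.

```python
from pathlib import Path


def is_enabled(dataset: dict, input_paths: list) -> bool:
    """Resolve the `enabled` field of one dataset entry (illustrative only)."""
    flag = dataset["enabled"]
    if isinstance(flag, bool):
        # YAML `true` / `false` normally load as Python booleans.
        return flag
    if str(flag).lower() == "auto":
        # Assumption: "detected" means every expected input file exists locally.
        return bool(input_paths) and all(Path(p).exists() for p in input_paths)
    raise ValueError(f"unsupported value for enabled: {flag!r}")
```

With this reading, an `auto` dataset whose inputs are missing is skipped silently rather than failing the run.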

3.3 Choosing benchmark programs

Benchmark programs are selected:

globally via defaults.bench.programs, and/or
per dataset via datasets: ... bench_programs:

Global example:

defaults:
  bench:
    reps: 2
    programs:
      - BubbleGun_gfa
      - BubbleFinder
      - vg_snarls_gfa
      - clsd_sb

Per dataset override example:

- name: ecoli50
  ...
  bench_programs:
    - BubbleFinder
    - vg_snarls_gfa
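
To make the override behaviour concrete, here is a minimal sketch of how the program list for one dataset could be resolved from the two places shown above. The helper name `programs_for` is hypothetical, and the configuration is written as plain Python dictionaries (as a YAML loader would produce) so that the example has no dependencies.

```python
def programs_for(dataset: dict, defaults: dict) -> list:
    """Return the benchmark programs for one dataset (illustrative only)."""
    override = dataset.get("bench_programs")
    if override is not None:
        # A per-dataset list replaces the global list entirely.
        return list(override)
    return list(defaults.get("bench", {}).get("programs", []))


defaults = {
    "bench": {
        "reps": 2,
        "programs": ["BubbleGun_gfa", "BubbleFinder", "vg_snarls_gfa", "clsd_sb"],
    }
}
ecoli50 = {"name": "ecoli50", "bench_programs": ["BubbleFinder", "vg_snarls_gfa"]}
coli3682 = {"name": "coli3682"}

print(programs_for(ecoli50, defaults))   # per-dataset list
print(programs_for(coli3682, defaults))  # falls back to defaults.bench.programs
```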

Repetitions (`reps`)

The number of benchmark repetitions is controlled by defaults.bench.reps.
For each dataset, for each selected program (and for each configured thread count), the pipeline runs rep = 1..reps and writes one TSV per repetition.

A TSV file is a plain-text Tab-Separated Values file (fields separated by tab characters); in this pipeline, benchmark TSVs are small key/value tables storing time/memory/exit status and some metadata (program, threads, etc.).

Example:

defaults:
  bench:
    reps: 2
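
The per-repetition TSVs described above can be inspected with a few lines of Python. This sketch assumes a two-column layout (key, then value, one pair per line); the real files may be laid out differently, and the keys in `sample` are placeholders rather than the pipeline's actual field names.

```python
import csv
import io


def read_kv_tsv(text: str) -> dict:
    """Parse a two-column key/value TSV into a dictionary (assumed layout)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip blank or malformed lines instead of failing on them.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Hypothetical content; key names are placeholders for illustration.
sample = "program\tBubbleFinder\nthreads\t4\nexit_status\t0\n"
record = read_kv_tsv(sample)
print(record["program"], record["threads"])
```

All values come back as strings, so numeric fields need an explicit conversion before any arithmetic.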

3.4 BubbleFinder: modes and inputs

The pipeline benchmarks BubbleFinder via the program name BubbleFinder. BubbleFinder runs in one or more modes, configured in datasets.yaml.

Modes supported here (BubbleFinder commands):

superbubbles
snarls
ultrabubbles

Mode selection is controlled by:

defaults.bench.program_opts.BubbleFinder.modes (global), and/or
per-dataset overrides under datasets: ... bench: program_opts: BubbleFinder: ...

Global example (run multiple modes):

defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["superbubbles", "snarls", "ultrabubbles"]

Per-dataset example (run only ultrabubbles on one dataset):

- name: ecoli50
  ...
  bench_programs:
    - BubbleFinder
  bench:
    program_opts:
      BubbleFinder:
        modes: ["ultrabubbles"]

BubbleFinder input type is configured per mode with `input:`:

input: "gfa" → BubbleFinder reads data/<dataset>/<dataset>.cleaned.gfa
input: "sgraph" → BubbleFinder reads data/<dataset>/<dataset>.sbspqr.sgraph (built from a cleaned GFA)

Example (run BubbleFinder ultrabubbles, using GFA input):

defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["ultrabubbles"]
        default:
          input: "gfa"

Note: dataset names do not control BubbleFinder modes; only the modes: field does.
Also, BubbleFinder’s ultrabubbles mode requires at least one tip per connected component in the input graph.
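
Since only the three modes and two input types listed above are described here, a typo in datasets.yaml is easy to catch before launching a long run. The checker below is a sketch under that assumption; `check_bubblefinder_opts` is a hypothetical helper, not part of the pipeline, and it only looks at the `modes` and `default.input` keys used in the examples above.

```python
SUPPORTED_MODES = {"superbubbles", "snarls", "ultrabubbles"}
SUPPORTED_INPUTS = {"gfa", "sgraph"}


def check_bubblefinder_opts(opts: dict) -> list:
    """Return human-readable problems found in a BubbleFinder program_opts mapping."""
    problems = []
    for mode in opts.get("modes", []):
        if mode not in SUPPORTED_MODES:
            problems.append(f"unknown mode: {mode}")
    input_kind = opts.get("default", {}).get("input")
    if input_kind is not None and input_kind not in SUPPORTED_INPUTS:
        problems.append(f"unknown input: {input_kind}")
    return problems


good = {"modes": ["ultrabubbles"], "default": {"input": "gfa"}}
bad = {"modes": ["bubbles"], "default": {"input": "vg"}}
print(check_bubblefinder_opts(good))
print(check_bubblefinder_opts(bad))
```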

3.5 When are sgraph / edgelist produced?

These derived representations are built only if required by the selected benchmark programs / inputs:

clsd (clsd_sb) uses an edgelist: data/<dataset>/<dataset>.clsd.edgelist (generated from a cleaned GFA).
BubbleFinder uses sgraph only when BubbleFinder is configured with input: "sgraph".
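
The two rules above can be summarised as a function from the selected programs to the files a dataset needs. This is a sketch of the dependency logic only; `derived_inputs` is a hypothetical name, and the real Snakefile expresses the same idea through rule inputs.

```python
def derived_inputs(name: str, programs: list, bubblefinder_input: str) -> list:
    """List the graph files needed to benchmark `programs` on dataset `name`."""
    base = f"data/{name}/{name}"
    # Every benchmark target requires the cleaned GFA.
    needed = [f"{base}.cleaned.gfa"]
    if "clsd_sb" in programs:
        needed.append(f"{base}.clsd.edgelist")
    if "BubbleFinder" in programs and bubblefinder_input == "sgraph":
        needed.append(f"{base}.sbspqr.sgraph")
    return needed


print(derived_inputs("ecoli50", ["BubbleFinder", "clsd_sb"], "sgraph"))
print(derived_inputs("ecoli50", ["BubbleFinder", "vg_snarls_gfa"], "gfa"))
```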

3.6 Threads configuration (build + benchmarks)

There are two distinct “thread” concepts:

snakemake -j <N> controls how many rules run in parallel (workflow-level parallelism).
Program-specific thread counts control how many threads a tool uses inside one rule.

Builder threads (graph construction)

Builder thread defaults are set under defaults.threads, e.g.:

defaults:
  threads:
    ggcat: 8
    vg: 16
    lighter: 8
    pggb: 4

Benchmark threads (per program)

For benchmarks, thread values can be configured:

globally per program via defaults.bench.threads_by_program, and/or
per dataset via datasets: ... bench_threads:

Global example:

defaults:
  bench:
    threads_by_program:
      BubbleFinder: [1, 4, 8, 16]
      vg_snarls_gfa: [1, 4, 8, 16]

Per-dataset override example:

- name: ecoli50
  ...
  bench_threads:
    vg_snarls_gfa: [1, 8]

BubbleFinder thread values can also be set per mode via program_opts:

defaults:
  bench:
    program_opts:
      BubbleFinder:
        modes: ["superbubbles", "ultrabubbles"]
        by_mode:
          superbubbles:
            threads: [1, 2, 4, 8]
          ultrabubbles:
            threads: [1, 4, 8, 16]
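
The three places where benchmark threads can be set suggest a lookup order. The sketch below shows one plausible order, which is an assumption and should be checked against the Snakefile: per-mode BubbleFinder values first, then the per-dataset `bench_threads`, then the global `threads_by_program`, and finally a single-thread fallback. The function name `threads_for` and the `[1]` fallback are hypothetical.

```python
def threads_for(program: str, dataset: dict, defaults: dict, mode=None) -> list:
    """Resolve the thread counts to benchmark (assumed precedence, see lead-in)."""
    bench = defaults.get("bench", {})
    if program == "BubbleFinder" and mode is not None:
        by_mode = bench.get("program_opts", {}).get("BubbleFinder", {}).get("by_mode", {})
        if "threads" in by_mode.get(mode, {}):
            return list(by_mode[mode]["threads"])
    per_dataset = dataset.get("bench_threads", {})
    if program in per_dataset:
        return list(per_dataset[program])
    # Assumption: a program with no configured thread list runs single-threaded.
    return list(bench.get("threads_by_program", {}).get(program, [1]))


defaults = {
    "bench": {
        "threads_by_program": {"BubbleFinder": [1, 4, 8, 16], "vg_snarls_gfa": [1, 4, 8, 16]},
        "program_opts": {
            "BubbleFinder": {
                "modes": ["superbubbles", "ultrabubbles"],
                "by_mode": {
                    "superbubbles": {"threads": [1, 2, 4, 8]},
                    "ultrabubbles": {"threads": [1, 4, 8, 16]},
                },
            }
        },
    }
}
ecoli50 = {"name": "ecoli50", "bench_threads": {"vg_snarls_gfa": [1, 8]}}
print(threads_for("vg_snarls_gfa", ecoli50, defaults))
print(threads_for("BubbleFinder", ecoli50, defaults, mode="superbubbles"))
```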

4. Running the pipeline

Dry‑run (no execution)

snakemake -n -p

Full run (all enabled datasets)

snakemake all --use-conda -j 8

--use-conda: required to create/use the environments defined in config/*.yml.
-j 8: number of parallel jobs.

Targeted examples

Build and clean only the GFA of coli3682:

snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa

Run benchmarks + aggregation (this also produces the plots because they are generated by the same rule in this workflow):

snakemake --use-conda -j 8 results/benchmarks.tsv

5. Output structure

`data/` directory

For each dataset <name>:

data/<name>/<name>.*.gfa
- Raw GFA (builder-dependent), e.g.:
  - ...ggcat.fasta.gfa
  - ...vg.gfa
  - ...pggb.gfa
  - ...raw.gfa (builder: gfa_from_url)
  - ...vg.url.gfa (builder: vg_from_url)
  - ...gbz.url.gfa (builder: gbz_from_url)
data/<name>/<name>.bluntified.gfa (temporary).
data/<name>/<name>.cleaned.gfa
- Cleaned GFA (no H‑lines, bluntified).
data/<name>/<name>.sb.cleaned.gfa
- Cleaned GFA produced by the optional ggcat -f rebuild (when enabled for superbubble (SB) programs on ggcat-based datasets).
data/<name>/<name>.sbspqr.sgraph
- sgraph for BubbleFinder when configured with input: "sgraph".
data/<name>/<name>.clsd.edgelist
- Edgelist for clsd.

`results/` directory

results/.prechecks.ok
- Marker indicating that prechecks have run.
results/bench/
- Per‑program, per‑dataset, per‑rep benchmark TSV files.
- For BubbleFinder, file names include the mode:
  - BubbleFinder/<dataset>.<mode>.t<threads>.rep<rep>.tsv
results/prog_out/
- Raw program outputs (BubbleFinder outputs + BubbleFinder JSON report, etc.).
results/logs/
- Detailed logs per step.
results/benchmarks.tsv
- Aggregated benchmark table.
results/plots/
- time_by_dataset_program.png
- rss_by_dataset_program.png
results/summary/reruns_planned.tsv
- Information used for automatic reruns (timeouts, failures).
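
To check that a run is complete, it can help to enumerate the per-repetition files one expects to find. The sketch below does this for BubbleFinder using the naming pattern listed above; placing the `BubbleFinder/` directory directly under `results/bench/` is an assumption, and `bubblefinder_tsvs` is a hypothetical helper.

```python
from itertools import product


def bubblefinder_tsvs(dataset: str, modes: list, threads: list, reps: int) -> list:
    """Enumerate the expected BubbleFinder benchmark TSV paths for one dataset."""
    return [
        f"results/bench/BubbleFinder/{dataset}.{mode}.t{t}.rep{rep}.tsv"
        for mode, t, rep in product(modes, threads, range(1, reps + 1))
    ]


expected = bubblefinder_tsvs("ecoli50", ["superbubbles", "ultrabubbles"], [1, 4], reps=2)
for path in expected:
    print(path)
```

Two modes, two thread counts, and two repetitions give eight files; comparing this list with the directory contents shows which runs are missing.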

6. Customizing tool binaries

In datasets.yaml, under defaults.tools, you can point to pre‑installed binaries to avoid automatic cloning/building:

defaults:
  tools:
    spqr_bin: /path/to/BubbleFinder
    get_blunted: /path/to/get_blunted
    lighter_bin: /path/to/lighter
    clsd_bin: /path/to/clsd

If these paths are not set, the Snakefile will:

clone/build BubbleFinder, clsd, and Lighter under build/,
download get_blunted into bin/.
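
Before pointing the pipeline at pre-installed binaries, it may be worth confirming that each configured path is actually executable. The following sketch does that with the standard library; `missing_tools` is a hypothetical helper, and it treats an unset key as "managed by the Snakefile", in line with the behaviour described above.

```python
import shutil

TOOL_KEYS = ("spqr_bin", "get_blunted", "lighter_bin", "clsd_bin")


def missing_tools(tools: dict) -> dict:
    """Return the configured tool paths that are not executable files."""
    problems = {}
    for key in TOOL_KEYS:
        path = tools.get(key)
        if not path:
            continue  # unset: left to the automatic clone/build/download
        # shutil.which returns None when the path is missing or not executable.
        if shutil.which(path) is None:
            problems[key] = path
    return problems


print(missing_tools({"spqr_bin": "/nonexistent/BubbleFinder"}))
```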

7. Quick troubleshooting

Error in prechecks (prechecks):
- Check results/logs/* (especially results/logs/bench, results/logs/ggcat, results/logs/vg).
- Ensure conda or mamba is available.
Frequent timeouts:
- Adjust defaults.tools.timeout.seconds in datasets.yaml.
Problem with a specific dataset:
- Run a single target, e.g.
  snakemake --use-conda -j 4 data/coli3682/coli3682.cleaned.gfa
Bluntification issues:
- A WARN about get_blunted means the pipeline falls back to "naive bluntify" (overlap fields forced to *).
- Install GetBlunted and set defaults.tools.get_blunted to use it.
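
For reference, the "naive bluntify" fallback mentioned above can be pictured as follows. This is a sketch of the idea only, assuming GFA1 link lines where the sixth tab-separated field is the overlap; it is not the pipeline's implementation, and `naive_bluntify` is a hypothetical name.

```python
def naive_bluntify(gfa_text: str) -> str:
    """Force the overlap field of every GFA1 L-line to '*', leaving other lines as-is."""
    out = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "L" and len(fields) >= 6:
            fields[5] = "*"  # overlap column of a link line
        out.append("\t".join(fields))
    return "".join(row + "\n" for row in out)


gfa = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGTTA\nL\t1\t+\t2\t+\t2M\n"
print(naive_bluntify(gfa), end="")
```

Segment sequences are left untouched, so overlapping bases are not trimmed; this is why using GetBlunted is preferable when it is available.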
