A command-line tool for querying the AllTheBacteria genomics database (~3.2M bacterial genomes), searching AMR/stress/virulence genes, finding closest genomes via sketch distances, and downloading genome assemblies.
Single binary, no dependencies.
Supported platforms: Linux, macOS, Windows (amd64 and arm64)
atb-cli was designed and architected by Thanh Le Viet in his personal capacity, using his own Claude account. The implementation was developed with coding assistance from Claude (Anthropic), an AI assistant that helped with code generation, testing, and documentation under human direction and review. Thanks to hackathon participants Jane Hawkey, Ahmed M Moustafa, Martin Hunt, and Zamin Iqbal for their input and feedback.
- Download
- Install
- Quick Start
- Updating
- Usage Examples
- LLM Integration (MCP)
- Output Formats
- Available Columns
- Performance
- Building
- Data Sources
- License
Pre-built binaries for all platforms:
| Platform | Architecture | File |
|---|---|---|
| Linux | x86_64 (amd64) | atb-cli_<version>_linux_amd64.tar.gz |
| Linux | ARM64 | atb-cli_<version>_linux_arm64.tar.gz |
| macOS | Intel (amd64) | atb-cli_<version>_darwin_amd64.tar.gz |
| macOS | Apple Silicon (arm64) | atb-cli_<version>_darwin_arm64.tar.gz |
| Windows | x86_64 (amd64) | atb-cli_<version>_windows_amd64.zip |
| Windows | ARM64 | atb-cli_<version>_windows_arm64.zip |
Latest release: github.com/allthebacteria/atb-cli/releases/latest
Download the file for your platform, extract, and place the atb binary (or atb.exe on Windows) somewhere in your PATH.
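The naming scheme in the table above can be sketched as a small bash helper that maps `uname` output to an archive name. This is only an illustration: `atb_asset_name` is a hypothetical function (not part of atb or its install script), and the exact form of `<version>` in real release file names is not stated here, so the version string is passed through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of atb): build the release archive name
# from a version string plus `uname -s` / `uname -m` style values.
atb_asset_name() {
  local version="$1" kernel="$2" machine="$3" os arch ext
  case "$kernel" in
    Linux)                os=linux;   ext=tar.gz ;;
    Darwin)               os=darwin;  ext=tar.gz ;;
    MINGW*|MSYS*|CYGWIN*) os=windows; ext=zip ;;
    *) echo "unsupported OS: $kernel" >&2; return 1 ;;
  esac
  case "$machine" in
    x86_64|amd64)  arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
  printf 'atb-cli_%s_%s_%s.%s\n' "$version" "$os" "$arch" "$ext"
}

# Print the archive name for this machine, leaving <version> as a placeholder.
atb_asset_name "<version>" "$(uname -s)" "$(uname -m)" || true
```

Replace the placeholder with the version shown on the releases page before downloading.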
One-line install (Linux/macOS):
curl -fsSL https://raw.githubusercontent.com/allthebacteria/atb-cli/main/install.sh | bash
This detects your OS and architecture, downloads the latest release, and installs to ~/.local/bin. It will add the directory to your PATH automatically if needed.
# Install a specific version
# (the variable is set on bash, which runs the script, rather than on curl)
curl -fsSL https://raw.githubusercontent.com/allthebacteria/atb-cli/main/install.sh | ATB_VERSION=v0.1.0 bash
# Install to a custom directory
curl -fsSL https://raw.githubusercontent.com/allthebacteria/atb-cli/main/install.sh | ATB_INSTALL_DIR=~/.local/bin bash
Windows: Download the .zip from the Download table above, extract, and add atb.exe to your PATH.
Other methods:
# Go install (requires Go 1.23+)
go install github.com/allthebacteria/atb-cli/cmd/atb@latest
# From source
git clone https://github.com/allthebacteria/atb-cli.git
cd atb-cli
make build # binary at ./bin/atb
# 1. Download the database (~540 MB core tables)
atb fetch
# 2. Query
atb query --species "Escherichia coli" --hq-only --limit 10
If you don't have the parquet files yet:
# Download core tables from OSF (~540MB)
./bin/atb fetch
# Or download all tables including ENA metadata (~3GB)
./bin/atb fetch --all
atb checks for new versions in the background (once every 24 hours). If a newer release is found, you'll see a notice on every run until you upgrade:
A new version of atb is available: v0.9.0 (current: v0.8.0)
What's new:
feat: find closest ATB genomes via sketch distances
...
Release: https://github.com/allthebacteria/atb-cli/releases/tag/v0.9.0
Run 'atb update' to upgrade.
To update:
# Interactive update (asks for confirmation)
atb update
# Non-interactive (for scripts/CI)
atb update --force
The updater downloads the correct binary for your OS/architecture from GitHub Releases and replaces the current binary in place.
# Get 10 high-quality E. coli genomes
atb query --species "Escherichia coli" --hq-only --limit 10
# With quality filters
atb query --species "Escherichia coli" \
--hq-only \
--min-completeness 99.5 \
--max-contamination 0.5 \
--min-n50 200000 \
--sort-by N50 --sort-desc \
--limit 20
# Select specific columns
atb query --species "Escherichia coli" --hq-only --limit 5 \
--columns sample_accession,sylph_species,N50,Completeness_General,aws_url
# Search by genus
atb query --genus Salmonella --hq-only --limit 20
# Wildcard species search
atb query --species-like "Streptococcus%" --hq-only --limit 10
# Salmonella from the UK, Illumina only
atb query --species "Salmonella enterica" \
--country "United Kingdom" \
--platform "ILLUMINA" \
--limit 20
# Genomes collected between 2020-2023
atb query --species "Escherichia coli" \
--collection-date-from 2020-01-01 \
--collection-date-to 2023-12-31 \
--limit 50
# Create a filter file
cat > my_query.toml <<'EOF'
[filter]
species = "Escherichia coli"
hq_only = true
min_completeness = 99.0
max_contamination = 2.0
min_n50 = 100000
[output]
columns = ["sample_accession", "sylph_species", "N50", "Completeness_General", "aws_url"]
sort_by = "N50"
sort_desc = true
limit = 100
format = "tsv"
output = "ecoli_results.tsv"
EOF
# Run the query
atb query --filter my_query.toml
# CLI flags override TOML values
atb query --filter my_query.toml --limit 10
atb info SAMD00000355
Output:
=== Assembly ===
sample_accession: SAMD00000355
sylph_species: Streptococcus pyogenes
hq_filter: PASS
dataset: 661k
aws_url: https://allthebacteria-assemblies.s3.eu-west-2.amazonaws.com/SAMD00000355.fa.gz
=== Assembly Stats ===
total_length: 1868526
N50: 148451
=== CheckM2 Quality ===
completeness_general: 99.06
contamination: 0.03
=== MLST ===
scheme: ecoli_achtman_4
ST: 131
status: PERFECT
score: 100
alleles: adk(53);fumC(40);gyrB(47);icd(13);mdh(36);purA(28);recA(29)
=== ENA Metadata ===
country: Japan:Aichi
collection_date: 1994
instrument_platform: ILLUMINA
# Query, then download
atb query --species "Klebsiella pneumoniae" --hq-only --limit 10 \
--columns sample_accession,aws_url --format csv -o results.csv
atb download --from results.csv --output-dir ./genomes
# Pipe query directly to download
atb query --species "Escherichia coli" --hq-only --limit 5 \
--columns sample_accession,aws_url --format csv | \
atb download --from - --output-dir ./ecoli_genomes
# Preview what would be downloaded
atb download --from results.csv --dry-run
# Download from a URL list
atb download --urls my_urls.txt --output-dir ./genomes --parallel 8
# Download a single file
atb download --url https://allthebacteria-assemblies.s3.eu-west-2.amazonaws.com/SAMD00000355.fa.gz \
--output-dir ./genomes
# Default summary of the full database
atb summarise
# Group by species (top 20)
atb summarise --by sylph_species --top 20
# Summarise a previous query result
atb query --genus Salmonella --hq-only --limit 100 \
--columns sample_accession,sylph_species,hq_filter,dataset -o salmonella.tsv
atb summarise --from salmonella.tsv
# Pipe query to summarise
atb query --species "Escherichia coli" --hq-only --limit 200 \
--columns sample_accession,sylph_species,dataset --format csv | \
atb summarise --from -
AMR data comes from AMRFinderPlus results run across all ATB genomes. All AMR, stress, and virulence data is in amrfinderplus.parquet (25.6M rows, 81 MB), downloaded automatically by atb fetch and partitioned by genus for fast queries.
# Get all AMR gene hits for E. coli (high-quality genomes only)
atb amr --species "Escherichia coli" --hq-only --limit 100
# Filter by drug class
atb amr --species "Escherichia coli" --hq-only --class "BETA-LACTAM"
# Wildcard gene search (all beta-lactamase genes)
atb amr --species "Escherichia coli" --gene "bla%"
# Compare resistance across multiple species
atb amr --species "Escherichia coli,Klebsiella pneumoniae" --class "BETA-LACTAM"
# Find a gene across ALL genera (no species filter needed)
atb amr --gene "blaCTX-M-15" --limit 100
# Filter by ENA metadata -- country, platform, or collection date
# (requires ena_20250506.parquet: run 'atb fetch --tables ena_20250506.parquet')
# Any ENA filter implies --with-ena, so country/collection_date/instrument_platform
# are appended to the output automatically.
atb amr --species "Escherichia coli" --class "BETA-LACTAM" \
--country "United Kingdom" --platform ILLUMINA --limit 50
atb amr --species "Salmonella enterica" --gene "blaCTX-M-15" \
--collection-date-from 2022-01-01
# Add ENA columns to the output without filtering
# (requires ena_20250506.parquet). Without --with-ena the ENA table is not read,
# so default AMR queries stay in the millisecond tier.
atb amr --species "Escherichia coli" --class "BETA-LACTAM" --with-ena --limit 50
# Search by drug class across all genera
atb amr --class "CARBAPENEM" --limit 50
# Filter by detection quality
atb amr --species "Escherichia coli" --min-coverage 95 --min-identity 98
# Query stress response genes
atb amr --species "Escherichia coli" --type stress
# Query virulence factors
atb amr --species "Escherichia coli" --type virulence
# Query all three categories at once
atb amr --species "Escherichia coli" --type all
# Output to file
atb amr --species "Klebsiella pneumoniae" --hq-only --format csv -o kpn_amr.csv
# Download matching assemblies directly
atb amr --species "Escherichia coli" --class "BETA-LACTAM" --hq-only --download -d ./genomes
# Preview what would be downloaded (no actual download)
atb amr --species "Klebsiella pneumoniae" --gene "blaCTX-M-15" --download --dry-run
# Cap number of assemblies to download
atb amr --species "Escherichia coli" --gene "bla%" --download --max-samples 20 -d ./bla_genomes
--species accepts comma-separated values for multi-species comparison. When omitted, --gene or --class is required to search across all genera.
The --download flag downloads the FASTA assembly for each unique sample in the results. Query output is always printed first. Use --dry-run to preview URLs without downloading, and --max-samples to cap the number of assemblies.
AMR output columns: sample_accession, gene_symbol, element_type, element_subtype, class, subclass, method, coverage, identity, species, genus
With --with-ena (or any ENA filter), three extra columns are appended: country, collection_date, instrument_platform.
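Because the output is plain delimited text, it can be summarised with standard tools. The sketch below counts rows per value of a named column. It assumes that `--format tsv` output starts with a header row using the column names listed above, which this document implies but does not state. `count_by_column` is a hypothetical helper, not an atb command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count rows per distinct value of a named column
# in a tab-separated file whose first line is a header.
# Usage: count_by_column COLUMN [FILE]   (FILE defaults to stdin)
count_by_column() {
  local column="$1" file="${2:--}"
  awk -F'\t' -v col="$column" '
    NR == 1 {
      # Locate the requested column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { counts[$idx]++ }
    END { for (k in counts) printf "%s\t%d\n", k, counts[k] }
  ' "$file" | sort -t $'\t' -k2,2nr -k1,1
}

# Example (requires atb and the fetched database):
#   atb amr --species "Escherichia coli" --class "BETA-LACTAM" --format tsv \
#     | count_by_column gene_symbol
```

The result is one line per value, most frequent first, which is usually enough for a quick look before moving the data into R or Python.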
MLST data covers 2.44M samples across 156 typing schemes. The data is included in the core metadata fetch.
# Get all STs for E. coli (high-quality only)
atb mlst --species "Escherichia coli" --hq-only --limit 20
# Find ST131 E. coli (a globally disseminated high-risk clone)
atb mlst --species "Escherichia coli" --st 131 --hq-only
# Query by MLST scheme name
atb mlst --scheme salmonella --limit 50
atb mlst --scheme ecoli_achtman_4 --limit 20
# Only perfect MLST calls (all alleles matched exactly)
atb mlst --species "Escherichia coli" --status PERFECT --limit 20
# Find novel sequence types (new allele combinations)
atb mlst --species "Salmonella enterica" --status NOVEL --limit 20
# Combine with species and quality filters
atb mlst --species "Klebsiella pneumoniae" --hq-only --status PERFECT --limit 50
# Filter MLST results by ENA metadata -- country, platform, or collection date
# (requires ena_20250506.parquet: run 'atb fetch --tables ena_20250506.parquet')
# Any ENA filter implies --with-ena, so country/collection_date/instrument_platform
# are appended to the output automatically.
atb mlst --species "Escherichia coli" --st 131 --country "United Kingdom"
atb mlst --species "Salmonella enterica" --platform ILLUMINA \
--collection-date-from 2022-01-01 --limit 100
# Add ENA columns to the output without filtering
# (requires ena_20250506.parquet). Without --with-ena the ENA table is not read,
# so default MLST queries stay in the millisecond tier.
atb mlst --species "Escherichia coli" --st 131 --with-ena
# Output as CSV
atb mlst --species "Escherichia coli" --st 131 --format csv -o st131.csv
# Download assemblies for matching samples
atb mlst --species "Escherichia coli" --st 131 --download -d ./st131
# Preview download, cap at 20 assemblies
atb mlst --species "Salmonella enterica" --status PERFECT --download --dry-run --max-samples 20
MLST output columns: sample_accession, sylph_species, mlst_scheme, mlst_st, mlst_status, mlst_score, mlst_alleles
With --with-ena (or any ENA filter), three extra columns are appended: country, collection_date, instrument_platform.
MLST status values: PERFECT (exact match), NOVEL (new combination), OK (partial), MIXED, BAD, MISSING, NONE
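The allele string is compact but awkward to filter on. A minimal sketch for splitting it into one locus per line follows. It assumes that the mlst_alleles column uses the same `locus(allele);locus(allele)` layout as the `alleles` line in the `atb info` output shown earlier. `split_alleles` is a hypothetical helper, not an atb command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "adk(53);fumC(40)" into tab-separated
# locus/allele pairs, one per line. The allele value is passed through
# unchanged, whatever it contains.
split_alleles() {
  printf '%s\n' "$1" \
    | tr ';' '\n' \
    | awk -F'[()]' 'NF >= 2 { printf "%s\t%s\n", $1, $2 }'
}

split_alleles 'adk(53);fumC(40);gyrB(47);icd(13);mdh(36);purA(28);recA(29)'
```

Each output line can then be joined or compared per locus, for example to see which loci differ between two samples.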
Find the closest genomes in the ATB database to your input sequences using MinHash sketch distances via sketchlib. Results include ANI (Average Nucleotide Identity) and enriched metadata.
Linux/macOS only -- sketchlib binaries are not available for Windows.
# 1. Install sketchlib (one-time, downloads binary next to atb)
atb sketch install
# 2. Download the ATB sketch database (~4.2 GB, one-time)
atb sketch fetch
# 3. Query your genome against ~3.2M ATB genomes
atb sketch query my_genome.fasta
# Top 50 closest matches
atb sketch query my_genome.fasta --knn 50
# Multiple input files
atb sketch query sample1.fasta sample2.fasta
# Batch from a file list
atb sketch query -f input_list.txt
# Find closest genomes AND download their assemblies
atb sketch query my_genome.fasta --download ./closest_genomes
# Preview downloads without actually downloading
atb sketch query my_genome.fasta --download ./closest --dry-run
# Raw sketchlib output (no metadata enrichment)
atb sketch query my_genome.fasta --raw
# JSON output
atb sketch query my_genome.fasta --format json
Output columns: query, sample_accession, ani, species, N50, completeness, mlst_st
When --download is used, the output directory will contain:
- Downloaded genome FASTA files (.fa.gz)
- sketch_results.tsv -- full query results with download URLs
- manifest.json -- download summary (consistent with atb download)
# Show info about the local sketch database
atb sketch info
The AllTheBacteria project hosts ~3,000 files on OSF across 75+ categories (assemblies, annotations, AMR, MLST, protein structures, and more). Browse and download them directly:
# List all project categories
atb osf ls
# Find files matching a keyword
atb osf ls AMR
atb osf ls "Protein Structures"
# Regex search across project and filename
atb osf ls --grep "bakta.*batch"
# Sort by size, different output formats
atb osf ls AMR --sort size --format json
# Preview what would be downloaded
atb osf download --dry-run "AMRFinderPlus.*results.*latest"
# Download with MD5 verification
atb osf download --verify "DefenseFinder.*results"
# Download all files in a project
atb osf download --project AllTheBacteria/MLST --all -o ./mlst_data
The file index is cached locally and refreshed every 7 days. Use --refresh to force an update.
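An age-based cache check of this kind can be sketched in bash as below. This is a generic illustration of the idea, not atb's implementation (atb is written in Go). `cache_is_stale` and the file path in the usage comment are hypothetical, and where atb stores its index is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when FILE is missing or was last
# modified more than MAX_AGE_DAYS ago (default 7); fail otherwise.
cache_is_stale() {
  local file="$1" max_age_days="${2:-7}" now mtime
  [ -f "$file" ] || return 0
  now="$(date +%s)"
  # GNU stat takes -c %Y; BSD/macOS stat takes -f %m.
  mtime="$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")"
  [ $(( now - mtime )) -gt $(( max_age_days * 86400 )) ]
}

# Example with a made-up path:
#   if cache_is_stale "$HOME/.cache/example/all_atb_files.tsv" 7; then
#     echo "index is older than 7 days"
#   fi
```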
# Download core tables including AMR and MLST (~700 MB)
# Includes: assembly, assembly_stats, checkm2, sylph, run, mlst, amrfinderplus
atb fetch
# Download all tables including ENA metadata (~3.2 GB)
atb fetch --all
# Download specific tables only
atb fetch --tables ena_20250506.parquet
# Force re-download
atb fetch --force
# Rebuild the SQLite query index (runs automatically after fetch)
atb index --force
# Create default config
atb config init
# View config
atb config show
# Set data directory
atb config set general.data_dir /path/to/parquet/files
# Set default download parallelism
atb config set download.parallel 8
Config is stored at ~/.config/atb/config.toml.
atb includes a built-in Model Context Protocol server, allowing LLMs to query the AllTheBacteria database directly through natural language.
Two transport modes:
- stdio (default) - for Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, OpenAI Codex CLI
- HTTP/SSE (--http :8080) - for ChatGPT, OpenAI Responses API, remote clients
Tools exposed:
| Tool | Description |
|---|---|
| atb_query | Search genomes by species, genus, quality, N50 |
| atb_amr | Query AMR resistance genes by species and drug class |
| atb_mlst | Query MLST scheme, ST, and allele calls |
| atb_info | Get full metadata for a specific sample |
| atb_stats | Database summary statistics |
| atb_species_list | List available species with genome counts |
# Claude Code - runs directly from GitHub, available in all projects
claude mcp add --scope user atb -- go run github.com/allthebacteria/atb-cli/cmd/atb@latest mcp
First call takes ~10s to compile; cached after that.
# 1. Install atb
curl -fsSL https://raw.githubusercontent.com/allthebacteria/atb-cli/main/install.sh | bash
# 2. Fetch the database and build the index
atb fetch
# Claude Code (available globally in all projects)
claude mcp add --scope user atb -- atb mcp
# Claude Code (current project only)
claude mcp add atb -- atb mcp
# Claude Desktop (add to ~/Library/Application Support/Claude/claude_desktop_config.json on macOS
# or %APPDATA%\Claude\claude_desktop_config.json on Windows)
{
"mcpServers": {
"atb": {
"command": "atb",
"args": ["mcp"]
}
}
}
# Cursor (Settings > MCP Servers > Add)
# Command: atb mcp
# OpenAI Codex CLI (~/.codex/config.toml)
[mcp_servers.atb]
command = "atb"
args = ["mcp"]
Note: After adding, restart your client for the MCP server to become available.
# Start the HTTP/SSE server
atb mcp --http :8080
Then configure your client with the SSE endpoint URL:
- ChatGPT: Settings > Connected apps > Add MCP server > http://your-host:8080/sse
- OpenAI Responses API: Use server_url: "http://your-host:8080/sse" in the MCP tool config
For public access, deploy with Docker or use a tunnel:
# Docker (auto-downloads data on first run)
docker compose up -d
# SSE endpoint: http://localhost:8080/sse
# Or quick public URL with ngrok
atb mcp --http :8080 &
ngrok http 8080
See docs/deployment.md for full deployment guides (Fly.io, Railway, VPS, Docker Compose).
Note: If your data is in a non-default location, add --data-dir /your/path to all commands above.
Once connected, you can ask natural language questions like:
- "How many Salmonella genomes are in the database?"
- "Find me 20 high-quality E. coli genomes with N50 > 200000"
- "What beta-lactam resistance genes does Klebsiella pneumoniae have?"
- "Show me all metadata for sample SAMD00000355"
- "What are the top 10 species by genome count?"
The LLM will call the appropriate atb tools and interpret the results for you.
By default, output is a pretty table when writing to a terminal, and TSV when piped. Override with --format:
atb query --species "Escherichia coli" --limit 5 --format tsv # tab-separated
atb query --species "Escherichia coli" --limit 5 --format csv # comma-separated
atb query --species "Escherichia coli" --limit 5 --format json # JSON array
atb query --species "Escherichia coli" --limit 5 --format table # pretty table
sample_accession, run_accession, assembly_accession, sylph_species, scientific_name, hq_filter, dataset, asm_fasta_on_osf, aws_url, osf_tarball_url
total_length, number, mean_length, longest, shortest, N50, N90
Completeness_General, Contamination, Completeness_Specific, Genome_Size, GC_Content
country, collection_date, instrument_platform, instrument_model, read_count, base_count, library_strategy, study_accession, fastq_ftp
Queries are effectively instant after atb fetch builds the indexes.
| Query type | Time | Peak RAM |
|---|---|---|
| Metadata query (atb query --species ... --limit 100) | <10ms | 15 MB |
| Sample info (atb info SAMN...) | <10ms | 14 MB |
| AMR query (atb amr --species ... --limit 100) | <10ms | 0.1 MB |
| AMR query (atb amr --species ... --class ...) | <10ms | 0.1 MB |
| AMR cross-genus gene search (atb amr --gene ... --limit 100) | ~2ms | 3 MB |
| Genome download (50 files, parallel=4) | 3.8s | 18 MB |
Post-fetch index build: ~8-11 minutes (one-time). Disk: ~3.5 GB typical install.
Full benchmark details, methodology, and comparisons across query tiers: docs/BENCHMARKS.md
The database uses GTDB taxonomy (not NCBI). Some species names differ from common usage. If a query returns 0 results, the tool suggests close matches. Example: Enterococcus faecium in GTDB may be Enterococcus_B faecium. Use --species-like "Enterococcus%faecium" to search across GTDB naming variants.
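To see why the wildcard pattern catches both spellings, the matching rule can be approximated in bash. This sketch assumes `%` stands for any run of characters, as in SQL LIKE; that is how the examples here use it, but atb's exact matching rules are not documented in this section. `species_like` is a hypothetical helper, and unlike some SQL engines it is case-sensitive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when NAME matches PATTERN, where % in
# PATTERN means "any run of characters" (an assumed LIKE-style rule).
# Other glob characters in PATTERN are not escaped in this sketch.
species_like() {
  local pattern="$1" name="$2" glob
  glob="$(printf '%s' "$pattern" | tr '%' '*')"
  # $glob is intentionally unquoted so that case treats it as a pattern.
  case "$name" in
    $glob) return 0 ;;
    *)     return 1 ;;
  esac
}

for name in "Enterococcus faecium" "Enterococcus_B faecium" "Enterococcus faecalis"; do
  if species_like "Enterococcus%faecium" "$name"; then
    echo "match:    $name"
  else
    echo "no match: $name"
  fi
done
```

The first two names match and the third does not, which is the behaviour that makes the wildcard form a convenient way to cover GTDB suffixes such as _B.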
# Build for current platform
make build
# Run tests
make test
# Cross-compile for all supported platforms
GOOS=linux GOARCH=amd64 go build -o bin/atb-linux-amd64 ./cmd/atb
GOOS=linux GOARCH=arm64 go build -o bin/atb-linux-arm64 ./cmd/atb
GOOS=darwin GOARCH=amd64 go build -o bin/atb-darwin-amd64 ./cmd/atb
GOOS=darwin GOARCH=arm64 go build -o bin/atb-darwin-arm64 ./cmd/atb
GOOS=windows GOARCH=amd64 go build -o bin/atb-windows-amd64.exe ./cmd/atb
GOOS=windows GOARCH=arm64 go build -o bin/atb-windows-arm64.exe ./cmd/atb
Requires Go 1.23+. Pure Go, no CGO - cross-compilation works out of the box.
All external URLs are defined in internal/sources/sources.go -- a single file that documents every URL the tool accesses.
| Data | Source | Used by |
|---|---|---|
| Parquet metadata (assembly, QC, species, MLST, AMR) | OSF (h7wzy) Aggregated/Latest_2025-05/ | atb fetch |
| ENA metadata (geography, platform, dates) | Same OSF project, optional tables | atb fetch --all |
| AMR/Stress/Virulence genes | AMRFinderPlus results as amrfinderplus.parquet (25.6M rows, 81 MB) | atb amr |
| OSF file index | all_atb_files.tsv (~3,000 files, 75+ categories) | atb osf ls, atb osf download |
| Sketch database | Same OSF project, atb_sketchlib.aggregated.202408 (.skm + .skd, ~4.2 GB) | atb sketch fetch |
| Genome assemblies | allthebacteria-assemblies.s3.eu-west-2.amazonaws.com | atb download, atb sketch query --download |
| sketchlib binary | bacpop/sketchlib.rust (Linux/macOS) | atb sketch install |