plasmid-host-range

Predict the bacterial host genus of a plasmid directly from its nucleotide sequence, by fine-tuning a pretrained DNA foundation model (default: DNABERT-2) on PLSDB 2025.

The headline public API is:

from plasmid_host_range.predict import predict_host_genus

preds = predict_host_genus("path/to/plasmid.fasta", top_k=3)
for p in preds:
    print(p.accession, p.top_genera, p.scores)

Setup

cd ml/plasmid-host-range
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Or with uv:

uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Data

PLSDB is distributed on Figshare. The current release is PLSDB 2024_05_31_v2 (article 27252609). The download command hits the public Figshare API and pulls only the files we need (≈ 2 GB total, mostly sequences.fasta.bz2):

plasmid-host-range download

This fetches into data/raw/:

File	Purpose
`sequences.fasta` (decompressed from `.bz2`)	Plasmid nucleotide sequences
`nuccore.csv`	Per-plasmid metadata (accession, taxon ID, length, …)
`taxonomy.csv`	Taxon ID → lineage (genus, species, …)
`biosample.csv`	Biosample-level host info (fallback source)
`README.md`	Upstream PLSDB column docs

The download is resumable at the file level: re-running download skips files that are already present and non-empty.

Then build the processed splits:

plasmid-host-range preprocess --top-n-genera 20

This joins nuccore.csv with taxonomy.csv on taxon ID, filters plasmids by length, labels the top-N most common host genera (plus Other), and writes data/processed/{train,val,test}.parquet with columns {accession, sequence, label, genus, species}, using a host-species-grouped train/val/test split to avoid leaking near-identical plasmids across splits.

Train

Smoke test (tiny subset, CPU-friendly):

plasmid-host-range train --config configs/smoke.yaml

Full fine-tune (GPU recommended):

plasmid-host-range train --config configs/default.yaml

Evaluate

plasmid-host-range evaluate --checkpoint checkpoints/best

Produces reports/test_metrics.json and reports/confusion_matrix.png. A k-mer + logistic regression baseline is reported alongside, so any DL improvement is quantified rather than assumed.

Predict on a new plasmid

plasmid-host-range predict path/to/plasmid.fasta --top-k 3

Layout

src/plasmid_host_range/
  data/         # download, preprocess, splits, torch Dataset
  model.py      # HF AutoModelForSequenceClassification wrapper
  train.py      # fine-tuning loop
  evaluate.py   # metrics + confusion matrix
  predict.py    # predict_host_genus() public API
  baselines.py  # k-mer + logistic regression baseline
  cli.py        # typer CLI
tests/          # smoke + unit tests
configs/        # default.yaml, smoke.yaml

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
configs		configs
notebooks		notebooks
src/plasmid_host_range		src/plasmid_host_range
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

plasmid-host-range

Setup

Data

Train

Evaluate

Predict on a new plasmid

Layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

plasmid-host-range

Setup

Data

Train

Evaluate

Predict on a new plasmid

Layout

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages