Autonomous AI-driven retrosynthesis model optimization.
An AI agent autonomously experiments with model architecture, hyperparameters, and training strategies to improve a retrosynthesis prediction model -- predicting how to synthesize target molecules from available starting materials.
Built on the autoresearch framework by Andrej Karpathy, with improvements inspired by auto-q-research for structured experiment tracking, novelty-guided exploration, and multi-step investment strategies.
An AI coding agent (Claude, Codex, etc.) sits in a loop:
1. Read analysis report (training dynamics, tried configs, novelty score)
2. Formulate hypothesis (guided by exploration metrics, not just intuition)
3. Modify train.py (architecture, optimizer, hyperparameters, etc.)
4. Train the model for 5 minutes
5. Run analyze.py (log results, check invest state, update reports)
6. Decide what to do with the change:
   - If accuracy improved → keep the change
   - If accuracy dropped but the change is foundational → invest (with deadline)
   - Otherwise → revert
7. Repeat forever
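The decision in step 6 can be sketched as a small function. This is an illustrative sketch, not the code in program.md or analyze.py: the names `Decision` and `decide` are hypothetical, and whether a change counts as foundational is assumed to be a judgment the agent supplies as a boolean.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    INVEST = "invest"
    REVERT = "revert"


def decide(new_accuracy: float, best_accuracy: float, foundational: bool) -> Decision:
    """Return what to do with the latest change to train.py (hypothetical helper)."""
    if new_accuracy > best_accuracy:
        return Decision.KEEP
    # Accuracy dropped, but the agent believes later experiments will build on this change.
    if new_accuracy < best_accuracy and foundational:
        return Decision.INVEST
    # No improvement and no reason to invest: undo the change.
    return Decision.REVERT
```

In this sketch an unchanged accuracy is reverted even when the change is flagged as foundational, which follows the literal wording of the loop.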
The model learns to predict reactants from products: given a target molecule as a SMILES string, generate the precursor molecules needed to synthesize it.
Input: CC(=O)Oc1ccccc1C(=O)O (aspirin)
Output: CC(=O)OC(=O)C.OC(=O)c1ccccc1O (acetic anhydride + salicylic acid)
The Gradio frontend provides multi-step retrosynthesis -- recursively decomposing complex molecules until all precursors are commercially available building blocks.
- Python 3.10+
- uv package manager
- For real training: NVIDIA GPU (tested on T4/A10G via AWS). CPU/MPS works for testing.
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Basic: download USPTO-50K, build tokenizer, extract building blocks (~2 min)
uv run prepare.py
# With SMILES augmentation (5x training data, recommended):
uv run prepare.py --augment 5
# With reaction class conditioning:
uv run prepare.py --augment 5 --reaction-class
# For local testing with a tiny subset (500 reactions):
uv run prepare.py --tiny
# Re-process data (e.g., after changing augmentation settings):
uv run prepare.py --force --augment 5 --reaction-class
# Full 5-minute run (GPU)
uv run train.py
# Quick 30-second test (CPU/MPS)
TIME_BUDGET=30 uv run train.py
Output:
---
val_accuracy: 0.120000
val_validity: 0.850000
training_seconds: 300.1
total_seconds: 340.2
peak_vram_mb: 2048.0
total_tokens_M: 12.5
num_steps: 500
num_params_M: 2.1
depth: 4
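If you want to consume this summary from a script, a parser along the following lines may help. It is a minimal sketch that assumes the summary is printed as `key: value` lines after a `---` separator, as in the sample above; `parse_summary` is a hypothetical helper, not part of the repo.

```python
def parse_summary(text: str) -> dict[str, float]:
    """Parse 'key: value' lines that follow the '---' separator into a dict."""
    metrics: dict[str, float] = {}
    in_summary = False
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            in_summary = True
            continue
        if not in_summary or ":" not in line:
            continue
        key, _, value = line.partition(":")
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            pass  # ignore lines whose value is not numeric
    return metrics


example = """---
val_accuracy: 0.120000
val_validity: 0.850000
training_seconds: 300.1
num_steps: 500
depth: 4
"""
metrics = parse_summary(example)
```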
Point your AI coding agent at the repo with permissions disabled:
Hi, have a look at program.md and let's kick off a new experiment! Let's do the setup first.
The agent creates a branch, establishes a baseline, then loops indefinitely -- modifying train.py, running experiments, analyzing results via analyze.py, keeping improvements (or investing in foundational changes), and logging everything.
uv run app.py
Opens a Gradio web UI at http://localhost:7860 where you can:
- Input molecules (SMILES or common names like "aspirin")
- See predicted retrosynthesis routes with 2D structure drawings
- Toggle beam search for top-3 predictions with confidence scores
- View autoresearch experiment history and interactive accuracy charts
Open analysis.ipynb to see the autoresearch progress chart, or use the Autoresearch Progress tab in the Gradio frontend for interactive Plotly charts.
You (the researcher)
|
| write research strategy + prepare data
v
program.md ──────────> AI Agent (Claude, Codex, etc.)
|
| 1. read analysis.txt (training dynamics, novelty score)
| 2. formulate hypothesis (guided by exploration metrics)
| 3. modify train.py, commit
v
train.py ──> 5-min training ──> val_accuracy
| |
v v
analyze.py ←─────────────────── results
|
| log to experiments.jsonl
| compute novelty score, training dynamics
| check invest state (deadlines, abort thresholds)
| generate fixed-size analysis.txt
v
analysis.txt ──> agent reads ──> next experiment
|
app.py <── best_model.pt ── checkpoint
(Gradio frontend with beam search, Plotly charts)
See ARCHITECTURE.md for detailed diagrams.
| File | Role | Who edits it |
|---|---|---|
| prepare.py | Data download (USPTO-50K), SMILES tokenizer, dataloader, evaluation, beam search, SMILES augmentation, reaction class conditioning | Researcher (read-only for agent) |
| train.py | GPT model instantiation, optimizer (AdamW), training loop, all hyperparameters, checkpoint saving | The AI agent |
| model.py | GPT model definition (shared by train.py and app.py) | Researcher |
| program.md | Agent instructions: experiment loop with analysis, invest mechanism, exploration guidance, literature search | Researcher |
| analyze.py | Post-experiment analysis: logging, training dynamics, novelty scores, invest state management, fixed-size reports | Researcher (run by agent, not modified) |
| app.py | Gradio frontend: retrosynthesis prediction with beam search, molecule visualization, experiment history, Plotly charts | Researcher |
| analysis.ipynb | Results visualization: progress chart, summary stats | Run after experiments |
| terraform/ | AWS infrastructure: g4dn.xlarge GPU instance (~$0.53/hr) | Researcher |
| File | Purpose | Who reads it |
|---|---|---|
| experiments.jsonl | Append-only full experiment history with configs | Only analyze.py |
| analysis.txt | Fixed-size report (~50 lines): dynamics, tried configs, novelty, invest state | Agent (every cycle) |
| ideas.md | Prioritized experiment ideas (max 10, mechanically capped) | Agent |
| invest_state.json | Invest mechanism state (active/inactive, deadline, thresholds) | Agent + analyze.py |
| results.tsv | Legacy experiment log (6 columns) | Legacy support |
| best_model.pt | Best model checkpoint | app.py, deployment |
| loss_curve.csv | Per-step training loss for latest experiment | analyze.py |
- GPT model -- decoder-only transformer with RoPE, RMS normalization, ReLU-squared activation, scaled dot-product attention (auto-dispatches to FlashAttention on supported hardware)
- Sequence format -- <bos> [product SMILES] <sep> [reactant SMILES] <eos>, with loss computed only on reactant tokens
- Training loop -- gradient accumulation, time-based progress tracking, LR scheduling (linear warmup + cosine warmdown), fast-fail on NaN/divergence
- Runtime autocast test -- verifies float16/bfloat16 works on the current GPU before training; falls back to float32 with a warning if CUBLAS errors occur
- Outputs -- model checkpoints (best_model.pt), per-step loss curve (loss_curve.csv), summary metrics
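The LR schedule mentioned in the training loop (linear warmup followed by cosine warmdown, driven by time-based progress) could look roughly like the sketch below. The function name, the `warmup_frac` and `final_frac` parameters, and their defaults are assumptions for illustration; the actual values live in train.py and may differ.

```python
import math


def lr_multiplier(progress: float, warmup_frac: float = 0.05, final_frac: float = 0.0) -> float:
    """Learning-rate multiplier for a run at `progress` in [0, 1] (hypothetical helper)."""
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")
    progress = min(max(progress, 0.0), 1.0)
    if progress < warmup_frac:
        return progress / warmup_frac  # linear warmup from 0 up to 1
    # Cosine warmdown from 1 to final_frac over the remainder of the run.
    t = (progress - warmup_frac) / (1.0 - warmup_frac)
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))
    return final_frac + (1.0 - final_frac) * cosine
```

With a fixed time budget, `progress` would be the elapsed training time divided by the budget, so the schedule completes regardless of how many steps the hardware manages in that time.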
- 49,015 reactions from US pharma patents
- Split: ~40K train, ~5K val, ~5K test
- 10 reaction type classes (heteroatom alkylation, acylation, C-C bond formation, etc.)
The same molecule can be written as many valid SMILES strings. prepare.py --augment N generates N random SMILES variants per training reaction via RDKit, multiplying the training data at no extra cost during training because the variants are produced once at preparation time. SMILES augmentation is a well-established technique; the gain depends on the model and setup, with improvements of roughly 5-15% in retrosynthesis accuracy being typical.
prepare.py --reaction-class extracts the 10 USPTO-50K reaction type labels and prepends class tokens (<class_0> through <class_9>) to each sequence. This gives the model a structural hint about what kind of reaction to predict.
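To make the sequence layout concrete, here is a minimal sketch that assembles one training example and the mask that restricts the loss to reactant tokens. It is not the code in prepare.py: `build_example` is a hypothetical helper, the inputs are assumed to be already tokenized, and two details are assumptions rather than facts from the source, namely that the class token goes before `<bos>` and that `<eos>` is included in the loss.

```python
from typing import Optional


def build_example(product: list[str], reactants: list[str],
                  reaction_class: Optional[int] = None) -> tuple[list[str], list[int]]:
    """Return (tokens, loss_mask) for one reaction; mask is 1 where loss is computed."""
    if reaction_class is not None and not 0 <= reaction_class <= 9:
        raise ValueError("reaction_class must be in 0..9")
    # Assumption: the class token is placed in front of <bos>.
    prefix = [f"<class_{reaction_class}>"] if reaction_class is not None else []
    context = prefix + ["<bos>"] + product + ["<sep>"]
    # Assumption: <eos> is treated as part of the target.
    target = reactants + ["<eos>"]
    tokens = context + target
    loss_mask = [0] * len(context) + [1] * len(target)
    return tokens, loss_mask


tokens, loss_mask = build_example(["C", "C", "O"], ["C", "=", "O"], reaction_class=3)
```

The mask here is aligned with the tokens themselves; a next-token training loop would shift it by one position along with the targets.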
After each experiment, analyze.py produces a fixed-size analysis.txt report with:
- Training dynamics (convergence detection, end-of-run slope, trend)
- Tried-configurations summary (one line per hyperparameter, prevents retrying)
- Config-space novelty score (deterministic L2 distance, flags low-diversity exploration)
- Diminishing returns detection (flags tapped-out hyperparameter dimensions)
- Invest mechanism state (deadline tracking, abort thresholds)
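As an illustration of the training-dynamics entry, the end-of-run slope can be estimated with a least-squares fit over the last few points of the loss curve. This is a sketch under assumptions: `tail_slope`, `trend`, the window of 50 steps, and the tolerance are hypothetical and not taken from analyze.py.

```python
def tail_slope(losses: list[float], tail: int = 50) -> float:
    """Least-squares slope (loss change per step) over the last `tail` points."""
    ys = losses[-tail:]
    n = len(ys)
    if n < 2:
        return 0.0  # not enough points to fit a line
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def trend(slope: float, tol: float = 1e-4) -> str:
    """Classify the end-of-run behaviour from the slope (tolerance is an assumption)."""
    if slope < -tol:
        return "improving"
    if slope > tol:
        return "diverging"
    return "plateaued"
```

A run that is still "improving" when the time budget expires suggests the configuration is undertrained rather than tapped out, which is useful context for the next hypothesis.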
Beyond keep/discard, the agent can mark an experiment as invest when accuracy dropped but the change is believed to be foundational (e.g., SMILES augmentation needs follow-up model scaling). Invests have deadlines and abort thresholds, enforced mechanically by analyze.py.
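A mechanical check of an active invest might look like the sketch below. The field names, the choice to count the deadline in experiments, and the status strings are all assumptions for illustration; the real schema of invest_state.json is defined by analyze.py.

```python
from dataclasses import dataclass


@dataclass
class InvestState:
    active: bool
    baseline_accuracy: float  # best accuracy before the invest started
    deadline: int             # number of follow-up experiments allowed
    abort_threshold: float    # accuracy below this ends the invest immediately
    experiments_used: int = 0


def check_invest(state: InvestState, accuracy: float) -> str:
    """Update `state` with one follow-up result and return its status."""
    if not state.active:
        return "inactive"
    state.experiments_used += 1
    if accuracy > state.baseline_accuracy:
        state.active = False
        return "success"  # the investment paid off
    if accuracy < state.abort_threshold:
        state.active = False
        return "abort"    # too far below baseline to continue
    if state.experiments_used >= state.deadline:
        state.active = False
        return "expired"  # deadline reached without recovering
    return "continue"
```

Keeping this logic outside the agent means a deadline cannot be quietly extended by an optimistic hypothesis.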
Each experiment's config is projected into a normalized hyperparameter space, and its distance from all prior experiments is computed. Low novelty triggers a suggestion to explore a different direction, preventing the agent from making tiny variations in one corner of the search space.
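One way to compute such a score is sketched below. It assumes each hyperparameter has known (low, high) bounds for normalization and takes the distance to the nearest prior experiment, scaled into [0, 1]; both choices, and the names `normalize` and `novelty`, are assumptions rather than the exact method in analyze.py.

```python
import math


def normalize(config: dict[str, float], bounds: dict[str, tuple[float, float]]) -> list[float]:
    """Scale each hyperparameter to [0, 1] using its (low, high) bounds, in sorted key order."""
    vec = []
    for name in sorted(bounds):
        low, high = bounds[name]
        value = (config[name] - low) / (high - low)
        vec.append(min(max(value, 0.0), 1.0))
    return vec


def novelty(config: dict[str, float], history: list[dict[str, float]],
            bounds: dict[str, tuple[float, float]]) -> float:
    """L2 distance to the nearest prior config, divided by sqrt(d) so it lies in [0, 1]."""
    if not history:
        return 1.0  # nothing tried yet, so everything is novel
    point = normalize(config, bounds)
    nearest = min(math.dist(point, normalize(past, bounds)) for past in history)
    return nearest / math.sqrt(len(bounds))


bounds = {"depth": (2, 12), "dropout": (0.0, 0.5)}
history = [{"depth": 4, "dropout": 0.1}]
score = novelty({"depth": 12, "dropout": 0.5}, history, bounds)
```

Because the computation has no randomness, the same experiment history always yields the same score.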
prepare.py includes both greedy and beam search generation. Beam search (width 10) explores multiple candidate predictions in parallel, typically adding 5-10% to top-1 accuracy over greedy decoding. The frontend supports toggling beam search for top-3 predictions with log-probability scores.
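The following toy example shows why beam search can beat greedy decoding. It is a generic sketch, not the implementation in prepare.py: `beam_search` and `toy_model` are hypothetical, and the "model" is a hand-written table of next-token probabilities rather than a trained network.

```python
import math
from typing import Callable

# Maps the tokens generated so far to {next_token: log-probability}.
NextLogProbs = Callable[[list[str]], dict[str, float]]


def beam_search(next_log_probs: NextLogProbs, beam_width: int = 10,
                max_len: int = 20, eos: str = "<eos>") -> list[tuple[list[str], float]]:
    """Return finished sequences sorted by total log-probability, best first."""
    beams: list[tuple[list[str], float]] = [([], 0.0)]
    finished: list[tuple[list[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:  # keep only the best candidates
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(tokens: list[str]) -> dict[str, float]:
    """Hand-written distribution where the greedy first choice leads to a worse sequence."""
    if not tokens:
        return {"A": math.log(0.6), "B": math.log(0.4)}
    if tokens == ["A"]:
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if tokens == ["B"]:
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {"<eos>": 0.0}


greedy = beam_search(toy_model, beam_width=1)[0]
best = beam_search(toy_model, beam_width=2)[0]
```

With a beam width of 1 the search commits to "A" (probability 0.6) and ends with a sequence of probability 0.3, while a width of 2 keeps "B" alive and finds the sequence with probability 0.36.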
The model predicts one step; the frontend recursively applies it, stopping when reactants are commercially available (checked against building blocks extracted from training data) or when max depth is reached.
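The recursion can be sketched as follows. This is an illustration rather than the code in app.py: `expand`, the status strings, and the stand-in predictor are hypothetical, and the lookup assumes SMILES strings are compared in a consistent (for example canonical) form.

```python
from typing import Callable

Predictor = Callable[[str], list[str]]  # product SMILES -> predicted reactant SMILES


def expand(target: str, predict: Predictor, building_blocks: set[str],
           max_depth: int = 5, depth: int = 0) -> dict:
    """Recursively decompose `target` into a route tree."""
    if target in building_blocks:
        return {"smiles": target, "status": "building_block", "children": []}
    if depth >= max_depth:
        return {"smiles": target, "status": "max_depth", "children": []}
    reactants = predict(target)
    if not reactants:
        return {"smiles": target, "status": "no_prediction", "children": []}
    children = [expand(r, predict, building_blocks, max_depth, depth + 1) for r in reactants]
    return {"smiles": target, "status": "expanded", "children": children}


# Stand-in for the single-step model, using the aspirin example from this README.
known_steps = {"CC(=O)Oc1ccccc1C(=O)O": "CC(=O)OC(=O)C.OC(=O)c1ccccc1O"}


def toy_predict(product: str) -> list[str]:
    prediction = known_steps.get(product, "")
    return prediction.split(".") if prediction else []


blocks = {"CC(=O)OC(=O)C", "OC(=O)c1ccccc1O"}
route = expand("CC(=O)Oc1ccccc1C(=O)O", toy_predict, blocks)
```

The depth limit matters because a single-step model can predict a precursor that leads back to the same molecule, which would otherwise recurse indefinitely.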
Terraform config provisions a g4dn.xlarge (T4 16GB GPU, ~$0.53/hr):
cd terraform
terraform init
terraform apply -var="ssh_key_name=your-key"
# SSH in, start Claude Code in tmux, let it run overnight
Open colab.ipynb in Colab, select T4 GPU runtime, run cells top to bottom. Free but limited: no autonomous agent, and sessions time out.
Deploy the Gradio frontend to HF Spaces for a permanent demo:
- Train on EC2 (agent loop)
- Copy best_model.pt locally
- Push repo + model to a HF Space
- Shut down EC2
uv run prepare.py --tiny
TIME_BUDGET=30 uv run train.py
uv run app.py
The training script automatically detects GPU capability and selects the appropriate precision:
- Compute >= 8.0 (A10G, A100, L4): bfloat16
- Compute < 8.0 (T4): float16
- CUBLAS failure: automatic fallback to float32 with warning
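The selection rules above reduce to a small pure function. This is a sketch, not the code in train.py: `select_precision` and the `autocast_ok` flag are hypothetical names, and on a real machine the capability tuple would come from `torch.cuda.get_device_capability()`.

```python
def select_precision(compute_capability: tuple[int, int], autocast_ok: bool = True) -> str:
    """Map a CUDA compute capability (major, minor) to a precision name."""
    if not autocast_ok:
        return "float32"  # fallback when the runtime autocast test fails
    if compute_capability >= (8, 0):
        return "bfloat16"  # e.g. A100 (8.0), A10G (8.6), L4 (8.9)
    return "float16"  # e.g. T4 (7.5)
```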
PyTorch is pinned to 2.4-2.5 with CUDA 12.4 for maximum T4 compatibility.
This project is an unofficial fork/adaptation of autoresearch by Andrej Karpathy. The agent loop improvements (structured analysis, novelty scores, invest mechanism) are inspired by auto-q-research.
MIT