MTBBench is a benchmark designed to evaluate the reasoning capabilities of multimodal large language models (LLMs) in complex clinical decision-making scenarios. It focuses on two core challenges in oncology: multimodal integration (e.g., pathology, genomics, radiology) and longitudinal reasoning across patient timelines. The benchmark includes agentic tasks requiring interaction with external foundation model-based tools and datasets.
To install all required dependencies, simply run:
bash setup.shNote: If you want to evaluate the agent on IHC data, you will need to additionally clone and install TRIDENT from source.
Before running the benchmark, configure your paths and credentials in:
neurips25/configs/base.yaml
The config file specifies paths to datasets, tool credentials, and output directories. Below, we provide guidance for acquiring the necessary external datasets.
The HANCOCK dataset contains SVS-format tissue micro arrays (TMAs). To prepare it:
- Follow the original HANCOCK GitHub repository to extract tiles and compute cell densities using QuPath.
- Reproduce our ABMIL training by extracting tumor centers and cell density measurements for Blocks 1 and 2.
- Download the dataset files from the HANCOCK project page to replicate the question curation used in the benchmark.
The MSK-CHORD dataset is available on cBioPortal. To use it:
- Download the ZIP archive from the cBioPortal page.
- Extract it and update the dataset path in
base.yaml.
To enable the DrugBank tool for longitudinal drug lookups:
- Register for a DrugBank account.
- Apply for a license to access their API.
- Download and locally host the dataset following their documentation.
- Update your API path and credentials in the config file.
We provide full logs of agent interactions for all models evaluated in the paper:
agent_logs_hancock/: Multimodal evaluation logs (HANCOCK)agent_logs_msk/: Longitudinal evaluation logs (MSK-CHORD)
Each log includes all agent–LLM conversations, intermediate reasoning steps, and generated answers.
Make sure you have:
- Installed dependencies
- Configured Hugging Face access tokens (for model download)
- Set paths in
base.yaml
To run an evaluation with Qwen/Qwen2.5-VL-7B-Instruct on the HANCOCK dataset:
python -m neurips25.benchmarks.run_agent_benchmark \
--doctor_model "Qwen/Qwen2.5-VL-7B-Instruct" \
--output_dir "./agent_logs_hancock/" \
--dataset "hancock"This fork adds a pinned, validated runtime for the agent benchmark plus the data-prep
and reproduction tooling needed to run both tracks end-to-end. The original setup.sh /
requirements.txt (conda, full WSI/ABMIL stack) still describe the upstream install; the
section below is the leaner, version-pinned path that the in-tree compatibility shims target.
| Area | Change |
|---|---|
| Runner | --use-tools (tool-augmented agent) and --max-cases N (stop after N cases) flags |
| Compatibility | shims for vLLM 0.23 / torch 2.11+cu130 / transformers 5 (neurips25/eval, neurips25/tools/conch.py) |
| Data prep | scripts/preprocess_msk_chord.py, scripts/build_cell_density_csv.py, scripts/validate_ihc_tool_path.py |
| Config | tools.conch / tools.uni in base.yaml; scripts/link_data.sh to wire staged data/models |
| Reproduction | scripts/setup_runtime.sh, scripts/smoke.sh, scripts/analyze_smoke.py, sbatch templates |
| Models | MODELS.md (the five models, sizes, load mechanism, gated-access notes) |
bash scripts/setup_runtime.sh # builds .venv with uv: vllm 0.23, torch 2.11+cu130, transformers 5.12, CONCH
source .venv/bin/activateThe pins live in requirements-runtime.txt. This is the lean runtime set — it omits the
optional offline paths (UNI/TRIDENT ABMIL training, openslide/QuPath WSI processing,
pyserini/faiss PubMed retrieval); those are not needed to run the benchmark.
The dataset's question JSONs embed relative data/... paths, so a data/ directory must
exist at the repo root. The repo already tracks a small data/ (the shipped
cell_density_measurements.csv and the ABMIL checkpoint); the staged root is a superset, so
set MTBBENCH_LINK_FORCE=1 to move the tracked data/ aside to data.bak (nothing is
deleted) before symlinking. Wire it (and models/) up with no hardcoded paths:
MTBBENCH_LINK_FORCE=1 \
MTBBENCH_DATA_ROOT=/abs/path/to/staged/data \
MTBBENCH_MODELS_ROOT=/abs/path/to/models \
bash scripts/link_data.shOn the CoreWeave reno cluster the staged root is /mnt/data/artifacts/tumor_board
(.../data, .../models), group-readable to slurm-users. See MODELS.md for the models.
export DRUGBANK_USERNAME="placeholder@example.com" # pubmed.py reads this at import time (Entrez.email); the tool is never called
# MSK-CHORD (longitudinal, no tools)
python -m neurips25.benchmarks.run_agent_benchmark \
--dataset msk --doctor_model "$MTBBENCH_MODELS_ROOT/Qwen3-32B" \
--output_dir ./agent_logs_msk/
# HANCOCK (multimodal, tool-augmented: CONCH + IHC density tool)
python -m neurips25.benchmarks.run_agent_benchmark \
--dataset hancock --use-tools --doctor_model "$MTBBENCH_MODELS_ROOT/Qwen2.5-VL-32B-Instruct" \
--output_dir ./agent_logs_hancock/CONCH downloads gated MahmoodLab/conch weights at first use; set HF_TOKEN to an account
that has accepted the gate (see MODELS.md).
scripts/smoke.sh runs the runner at --max-cases 2 on a track; scripts/analyze_smoke.py
scores the logs (logs written, no parse errors, valid answers, and — for HANCOCK — the IHC
tool + CONCH fire with zero fallbacks). SLURM templates: scripts/smoke_{msk,hancock}.sbatch.
bash scripts/smoke.sh hancock "$MTBBENCH_MODELS_ROOT/Qwen2.5-VL-32B-Instruct" 1 ./agent_logs_hancock_smoke
# Score the bar. Pass the run's stdout (.out) so IHC/CONCH fires are counted from the
# unmutated logger output -- the agent rewrites its conversation between questions, so the
# JSON blob alone is only a lower bound and the zero-fallback check is not airtight.
python scripts/analyze_smoke.py ./agent_logs_msk_smoke ./agent_logs_hancock_smoke smoke_summary.json \
--hancock-stdout ./smoke_smoke-mtb-hancock_<jobid>.out| Source | How to obtain | Prep |
|---|---|---|
| MSK-CHORD | cBioPortal msk_chord_2024 ZIP (study); CC BY-NC-ND 4.0 |
scripts/preprocess_msk_chord.py --in-dir <raw> --out-dir data/msk_chord_processed |
| HANCOCK IHC tool CSV | the shipped data/hancock/cell_density_measurements.csv holds the all-predicted (ABMIL) values, which is the faithful tool output by design (the IHC tool is the ABMIL predictor) |
see scripts/build_cell_density_csv.py to build the measured comparison CSV from a FAU QuPath export |
| CONCH / UNI | gated MahmoodLab/CONCH, MahmoodLab/UNI (accept the gate on your HF account) |
staged under models/{conch,uni} |
| DrugBank | licensed account (drugbank.com) — blocked without credentials; not needed for the two tracks above | set DRUGBANK_USERNAME/DRUGBANK_PASSWORD |
| Question JSONs + cases | generated over case data (generate_questions.py / msk_question_generation.py, needs an OpenAI key); not in upstream |
place at data/questions_{hancock,msk}_bench.json + data/{hancock,msk_bench}/cases |
Licenses: MSK-CHORD (CC BY-NC-ND 4.0) and HANCOCK (CC BY-NC) are non-commercial; MSK-CHORD also restricts derivatives. Confirm this fits your intended use before relying on the data.
Synthesized from Goodfire tumor-board experiments #1 (data staging + MSK preprocessing), #2 (HANCOCK IHC tool data), #3 (runner flags + compatibility shims + smoke validation), and #4 (model consolidation). See the PR description for details.