MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

MTBBench is a benchmark designed to evaluate the reasoning capabilities of multimodal large language models (LLMs) in complex clinical decision-making scenarios. It focuses on two core challenges in oncology: multimodal integration (e.g., pathology, genomics, radiology) and longitudinal reasoning across patient timelines. The benchmark includes agentic tasks requiring interaction with external foundation model-based tools and datasets.

🛠 Getting Started

To install all required dependencies, simply run:

bash setup.sh

Note: If you want to evaluate the agent on IHC data, you will need to additionally clone and install TRIDENT from source.

⚙️ Configuration

Before running the benchmark, configure your paths and credentials in:

neurips25/configs/base.yaml

The config file specifies paths to datasets, tool credentials, and output directories. Below, we provide guidance for acquiring the necessary external datasets.

📁 Datasets

HANCOCK (Multimodal Tissue Microarrays)

The HANCOCK dataset contains SVS-format tissue microarrays (TMAs). To prepare it:

1. Follow the original HANCOCK GitHub repository to extract tiles and compute cell densities using QuPath.
2. Reproduce our ABMIL training by extracting tumor centers and cell density measurements for Blocks 1 and 2.
3. Download the dataset files from the HANCOCK project page to replicate the question curation used in the benchmark.

MSK-CHORD (Longitudinal Genomic Profiles)

The MSK-CHORD dataset is available on cBioPortal. To use it:

1. Download the ZIP archive from the cBioPortal page.
2. Extract it and update the dataset path in base.yaml.

DrugBank API

To enable the DrugBank tool for longitudinal drug lookups:

1. Register for a DrugBank account.
2. Apply for a license to access their API.
3. Download and locally host the dataset following their documentation.
4. Update your API path and credentials in the config file.

📑 Agent Logs

We provide full logs of agent interactions for all models evaluated in the paper:

- agent_logs_hancock/: Multimodal evaluation logs (HANCOCK)
- agent_logs_msk/: Longitudinal evaluation logs (MSK-CHORD)

Each log includes all agent–LLM conversations, intermediate reasoning steps, and generated answers.

▶️ Running the Benchmark

Make sure you have:

- Installed dependencies
- Configured Hugging Face access tokens (for model download)
- Set paths in base.yaml

To run an evaluation with Qwen/Qwen2.5-VL-7B-Instruct on the HANCOCK dataset:

python -m neurips25.benchmarks.run_agent_benchmark \
  --doctor_model "Qwen/Qwen2.5-VL-7B-Instruct" \
  --output_dir "./agent_logs_hancock/" \
  --dataset "hancock"
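To evaluate several models in sequence, the command above can be assembled programmatically. The following is a minimal sketch and not part of the repository: `build_runner_cmd` is a hypothetical helper, and it uses only the flags and dataset names that appear in this README (`--doctor_model`, `--output_dir`, `--dataset`; `hancock` and `msk`).

```python
import shlex
import sys


def build_runner_cmd(model, dataset, output_dir):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    # Only the two dataset names used in this README are accepted.
    if dataset not in ("hancock", "msk"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    return [
        sys.executable, "-m", "neurips25.benchmarks.run_agent_benchmark",
        "--doctor_model", model,
        "--output_dir", output_dir,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    cmd = build_runner_cmd(
        "Qwen/Qwen2.5-VL-7B-Instruct", "hancock", "./agent_logs_hancock/"
    )
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the run; keeping the arguments as a list avoids shell-quoting problems with model paths.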

🔁 Reproducible runtime (Goodfire fork)

This fork adds a pinned, validated runtime for the agent benchmark plus the data-prep and reproduction tooling needed to run both tracks end-to-end. The original setup.sh / requirements.txt (conda, full WSI/ABMIL stack) still describe the upstream install; the section below is the leaner, version-pinned path that the in-tree compatibility shims target.

What's new in this fork

| Area | Change |
| --- | --- |
| Runner | `--use-tools` (tool-augmented agent) and `--max-cases N` (stop after N cases) flags |
| Compatibility | shims for vLLM 0.23 / torch 2.11+cu130 / transformers 5 (`neurips25/eval`, `neurips25/tools/conch.py`) |
| Data prep | `scripts/preprocess_msk_chord.py`, `scripts/build_cell_density_csv.py`, `scripts/validate_ihc_tool_path.py` |
| Config | `tools.conch` / `tools.uni` in `base.yaml`; `scripts/link_data.sh` to wire staged data/models |
| Reproduction | `scripts/setup_runtime.sh`, `scripts/smoke.sh`, `scripts/analyze_smoke.py`, sbatch templates |
| Models | `MODELS.md` (the five models, sizes, load mechanism, gated-access notes) |

1. Install the pinned runtime

bash scripts/setup_runtime.sh        # builds .venv with uv: vllm 0.23, torch 2.11+cu130, transformers 5.12, CONCH
source .venv/bin/activate

The pins live in requirements-runtime.txt. This is the lean runtime set — it omits the optional offline paths (UNI/TRIDENT ABMIL training, openslide/QuPath WSI processing, pyserini/faiss PubMed retrieval); those are not needed to run the benchmark.

2. Point at staged data + models

The dataset's question JSONs embed relative data/... paths, so a data/ directory must exist at the repo root. The repo already tracks a small data/ (the shipped cell_density_measurements.csv and the ABMIL checkpoint); the staged root is a superset, so set MTBBENCH_LINK_FORCE=1 to move the tracked data/ aside to data.bak (nothing is deleted) before symlinking. Wire it (and models/) up with no hardcoded paths:

MTBBENCH_LINK_FORCE=1 \
MTBBENCH_DATA_ROOT=/abs/path/to/staged/data \
MTBBENCH_MODELS_ROOT=/abs/path/to/models \
bash scripts/link_data.sh
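The move-aside-then-symlink behaviour described above can be illustrated in Python. This is a sketch under stated assumptions, not a port of `scripts/link_data.sh`: `link_staged_dir` and its parameters are hypothetical, and the only behaviour taken from this README is that an existing `data/` is renamed to `data.bak` (never deleted) when forcing is enabled.

```python
import os
from pathlib import Path


def link_staged_dir(repo_root, name, staged_root, force=False):
    """Symlink repo_root/name to staged_root, moving a real directory aside.

    Hypothetical sketch; scripts/link_data.sh remains the authoritative tool.
    """
    link = Path(repo_root) / name
    target = Path(staged_root).resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"staged directory not found: {target}")
    if link.is_symlink():
        # Replace a stale link; the directory it pointed at is untouched.
        link.unlink()
    elif link.exists():
        if not force:
            raise FileExistsError(f"{link} exists; pass force=True to move it aside")
        backup = link.with_name(name + ".bak")
        if backup.exists():
            raise FileExistsError(f"{backup} already exists; refusing to overwrite")
        # Nothing is deleted: the tracked directory is only renamed.
        link.rename(backup)
    os.symlink(target, link, target_is_directory=True)
    return link
```

Refusing to proceed without `force=True` mirrors the opt-in nature of `MTBBENCH_LINK_FORCE=1`, so a tracked directory is never moved by accident.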

On the CoreWeave reno cluster the staged root is /mnt/data/artifacts/tumor_board (.../data, .../models), group-readable to slurm-users. See MODELS.md for the models.

3. Run both tracks

export DRUGBANK_USERNAME="placeholder@example.com"   # pubmed.py reads this at import time (Entrez.email); the tool is never called

# MSK-CHORD (longitudinal, no tools)
python -m neurips25.benchmarks.run_agent_benchmark \
  --dataset msk --doctor_model "$MTBBENCH_MODELS_ROOT/Qwen3-32B" \
  --output_dir ./agent_logs_msk/

# HANCOCK (multimodal, tool-augmented: CONCH + IHC density tool)
python -m neurips25.benchmarks.run_agent_benchmark \
  --dataset hancock --use-tools --doctor_model "$MTBBENCH_MODELS_ROOT/Qwen2.5-VL-32B-Instruct" \
  --output_dir ./agent_logs_hancock/

CONCH downloads gated MahmoodLab/conch weights at first use; set HF_TOKEN to an account that has accepted the gate (see MODELS.md).
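Because several of these settings are read from the environment, a quick preflight check can catch an unset variable before a long run starts. The sketch below is illustrative only: `missing_env` is a hypothetical helper, and the list of variables is simply those named in this README; which of them a given run actually requires is an assumption.

```python
import os


def missing_env(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Variables mentioned in this README; adjust to the track being run.
    needed = ["DRUGBANK_USERNAME", "MTBBENCH_MODELS_ROOT", "HF_TOKEN"]
    missing = missing_env(needed)
    if missing:
        print("unset environment variables: " + ", ".join(missing))
    else:
        print("all listed environment variables are set")
```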

4. Smoke / reproduction

scripts/smoke.sh runs the runner at --max-cases 2 on a track; scripts/analyze_smoke.py scores the logs (logs written, no parse errors, valid answers, and — for HANCOCK — the IHC tool + CONCH fire with zero fallbacks). SLURM templates: scripts/smoke_{msk,hancock}.sbatch.

bash scripts/smoke.sh hancock "$MTBBENCH_MODELS_ROOT/Qwen2.5-VL-32B-Instruct" 1 ./agent_logs_hancock_smoke
# Score the bar. Pass the run's stdout (.out) so IHC/CONCH fires are counted from the
# unmutated logger output -- the agent rewrites its conversation between questions, so the
# JSON blob alone is only a lower bound and the zero-fallback check is not airtight.
python scripts/analyze_smoke.py ./agent_logs_msk_smoke ./agent_logs_hancock_smoke smoke_summary.json \
  --hancock-stdout ./smoke_smoke-mtb-hancock_<jobid>.out
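For readers who want a feel for the simplest of these checks (logs written, no parse errors), here is a minimal sketch. It is not the logic of `scripts/analyze_smoke.py`: `check_logs_parse` is a hypothetical helper, and it assumes the run writes its logs as `*.json` files somewhere under the output directory.

```python
import json
from pathlib import Path


def check_logs_parse(log_dir):
    """Return (parsed_count, errors) for every *.json file under log_dir.

    Assumes logs are JSON files; the real log layout may differ.
    """
    parsed, errors = 0, []
    for path in sorted(Path(log_dir).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
            parsed += 1
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Record the failure and keep going so all bad files are reported.
            errors.append((str(path), str(exc)))
    return parsed, errors
```

A run would pass this particular check when `parsed` is greater than zero and `errors` is empty; tool-fire and fallback counts need the stdout file, as the comment above explains.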

Data acquisition (gated / external sources)

| Source | How to obtain | Prep |
| --- | --- | --- |
| MSK-CHORD | cBioPortal `msk_chord_2024` ZIP (study); CC BY-NC-ND 4.0 | `scripts/preprocess_msk_chord.py --in-dir <raw> --out-dir data/msk_chord_processed` |
| HANCOCK IHC tool CSV | the shipped `data/hancock/cell_density_measurements.csv` holds the all-predicted (ABMIL) values, which is the faithful tool output by design (the IHC tool is the ABMIL predictor) | see `scripts/build_cell_density_csv.py` to build the measured comparison CSV from a FAU QuPath export |
| CONCH / UNI | gated `MahmoodLab/CONCH`, `MahmoodLab/UNI` (accept the gate on your HF account) | staged under `models/{conch,uni}` |
| DrugBank | licensed account (drugbank.com), blocked without credentials; not needed for the two tracks above | set `DRUGBANK_USERNAME`/`DRUGBANK_PASSWORD` |
| Question JSONs + cases | generated over case data (`generate_questions.py` / `msk_question_generation.py`, needs an OpenAI key); not in upstream | place at `data/questions_{hancock,msk}_bench.json` + `data/{hancock,msk_bench}/cases` |
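Once the question JSONs are in place, it can be worth confirming that the relative `data/...` paths they embed actually resolve from the repo root. The sketch below assumes nothing about the JSON schema beyond that: it walks the whole structure and treats any string beginning with `data/` as a path. Both function names are hypothetical, and the heuristic could misfire on free text that happens to start with `data/`.

```python
import json
from pathlib import Path


def iter_data_paths(node):
    """Yield every string starting with 'data/' anywhere in a parsed JSON value."""
    if isinstance(node, str):
        if node.startswith("data/"):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_data_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_data_paths(item)


def missing_data_paths(questions_file, repo_root="."):
    """Return the sorted referenced paths that do not exist under repo_root."""
    questions = json.loads(Path(questions_file).read_text(encoding="utf-8"))
    root = Path(repo_root)
    return sorted({p for p in iter_data_paths(questions) if not (root / p).exists()})
```

An empty list from `missing_data_paths("data/questions_hancock_bench.json")` would suggest the staged data covers everything the questions reference.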

Licenses: MSK-CHORD (CC BY-NC-ND 4.0) and HANCOCK (CC BY-NC) are non-commercial; MSK-CHORD also restricts derivatives. Confirm this fits your intended use before relying on the data.

Provenance

Synthesized from Goodfire tumor-board experiments #1 (data staging + MSK preprocessing), #2 (HANCOCK IHC tool data), #3 (runner flags + compatibility shims + smoke validation), and #4 (model consolidation). See the PR description for details.
