T3RL (mulTi-Turn Tool-use Reinforcement Learning) is a multi-turn tool-use RL training package built on slime. It implements the Environment Tuning paradigm: training LLMs to perform multi-turn function calling via GRPO by interacting with realistic execution environments rather than imitating static trajectories. It uses the Berkeley Function-Calling Leaderboard (BFCL) as the evaluation environment.
- Environment Tuning: learns tool use through environmental interaction, not trajectory imitation. See Background for details on the method and how this repo relates to the original paper.
- Non-invasive slime integration: plugs into slime's training loop via 6 hook functions (--custom-*-path), requiring zero patches to slime source code. See Architecture for the full hook table.
- SGLang parser delegation: directly imports SGLang's native reasoning and tool-call parsers; switching model families (e.g., Qwen3 -> DeepSeek) requires only changing two YAML fields. See Architecture: Parser Delegation.
- TITO token trajectory tracking: maintains a strict invariant (len(log_probs) == len(loss_mask) == response_length) across multi-turn episodes, ensuring GRPO gradients are applied to the correct tokens. See Architecture: TITO.
- Multi-turn gym environment: BFCLGymAdapter provides a gym-style step()/reset() interface over BFCLEnv, with an eval_mode flag that changes step-limit behavior and scoring semantics.
- Standalone eval pipeline: fully async evaluation that talks directly to SGLang via aiohttp, no slime dependency required. See Evaluation.
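
The TITO invariant listed above can be illustrated with a minimal sketch. Everything here is hypothetical and is not T3RL's actual API: the `Trajectory` class, its method names, and the masking convention (policy-sampled tokens get mask 1, tool-output tokens get mask 0 with a placeholder log-prob) are assumptions made only to show how the three lengths can be kept in lockstep across turns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn episode's response tokens."""

    token_ids: List[int] = field(default_factory=list)
    log_probs: List[float] = field(default_factory=list)
    loss_mask: List[int] = field(default_factory=list)

    @property
    def response_length(self) -> int:
        return len(self.token_ids)

    def append_model_turn(self, token_ids: List[int], log_probs: List[float]) -> None:
        """Tokens sampled by the policy: trained on, so loss_mask is 1 (assumed convention)."""
        if len(token_ids) != len(log_probs):
            raise ValueError("each sampled token needs exactly one log-prob")
        self.token_ids.extend(token_ids)
        self.log_probs.extend(log_probs)
        self.loss_mask.extend([1] * len(token_ids))
        self.check_invariant()

    def append_env_turn(self, token_ids: List[int]) -> None:
        """Tokens from tool output: kept in context, masked out of the loss (assumed convention)."""
        self.token_ids.extend(token_ids)
        self.log_probs.extend([0.0] * len(token_ids))  # placeholder, excluded by the mask
        self.loss_mask.extend([0] * len(token_ids))
        self.check_invariant()

    def check_invariant(self) -> None:
        # The invariant from the feature list, checked after every append.
        assert len(self.log_probs) == len(self.loss_mask) == self.response_length


# Two model turns separated by one tool response.
traj = Trajectory()
traj.append_model_turn([11, 12, 13], [-0.1, -0.2, -0.3])
traj.append_env_turn([21, 22])
traj.append_model_turn([31], [-0.5])
trained_tokens = sum(traj.loss_mask)
```

Checking the invariant after every append, rather than once at the end of the episode, means a misaligned turn fails at the point where it is introduced.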
T3RL implements the Environment Tuning paradigm (ICLR 2026): training LLMs to perform multi-turn tool use by interacting with realistic execution environments rather than imitating static trajectories. It uses a two-stage curriculum with a fine-grained progress reward (correct_turns / total_turns) and currently targets Qwen3-4B-Instruct-2507 on 8x H200. Trained checkpoints: Qwen3-4B-EnvTuning-Base (Stage 1) and Qwen3-4B-EnvTuning (Stage 2). See Background for paper techniques, curriculum details, and training curves.
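
As a rough illustration of the progress reward, the sketch below computes correct_turns / total_turns from a list of per-turn correctness flags. The function names, the boolean-list input, and the zero reward for an empty episode are assumptions for illustration; how T3RL actually judges a turn as correct is not shown here. The all-or-nothing `binary_reward` is included only as a contrast.

```python
from typing import Sequence


def progress_reward(turn_correct: Sequence[bool]) -> float:
    """Fraction of turns judged correct: correct_turns / total_turns."""
    total_turns = len(turn_correct)
    if total_turns == 0:
        return 0.0  # assumption: an empty episode earns no reward
    correct_turns = sum(1 for ok in turn_correct if ok)
    return correct_turns / total_turns


def binary_reward(turn_correct: Sequence[bool]) -> float:
    """All-or-nothing contrast: 1.0 only if every turn is correct."""
    return 1.0 if len(turn_correct) > 0 and all(turn_correct) else 0.0


# A four-turn episode where only the third turn fails.
episode = [True, True, False, True]
fine_grained = progress_reward(episode)
all_or_nothing = binary_reward(episode)
```

For this episode the progress reward is 0.75 while the binary reward is 0.0, so a mostly correct rollout still produces a learning signal.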
T3RL requires the slime/Megatron runtime environment. Because dependencies (Megatron, SGLang, etc.) are complex, using the official slime Docker images is strongly recommended:
# Clone with submodules
git clone --recurse-submodules https://github.com/IcyFish332/T3RL.git
cd T3RL
# Or initialize submodules in an existing clone
git submodule update --init --recursive
# Install T3RL + bfcl_eval (does NOT install slime)
pip install -e .
# Install slime separately (no-deps to avoid conflicts with Docker env)
pip install -e 3rdparty/slime --no-deps
# (Optional) Install dev dependencies for testing
pip install -e ".[dev]"

python data/preprocess_bfcl_data.py --output_dir data/processed/bfcl

Training is split into two stages. Both scripts accept environment variable overrides for all paths and hyperparameters.
Stage 1 trains on the base data (bfcl_train_base.jsonl):
HF_CHECKPOINT=/path/to/Qwen3-4B-Instruct-2507 \
REF_LOAD=/path/to/Qwen3-4B-Instruct-2507_torch_dist \
bash scripts/train/stage1.sh

Checkpoint conversion turns the Stage 1 output into HF and torch_dist checkpoints for Stage 2:
bash scripts/convert/meg_to_hf.sh qwen3-4B-Instruct-2507 \
--input-dir outputs/checkpoints/t3rl_bfcl/<experiment>/iter_XXXXXXX \
--output-dir outputs/models/Qwen3-4B-stage1 --force
bash scripts/convert/hf_to_meg.sh qwen3-4B-Instruct-2507 \
--hf-checkpoint outputs/models/Qwen3-4B-stage1 \
--save outputs/models/Qwen3-4B-stage1_torch_dist

Stage 2 continues training on the full data (bfcl_train.jsonl) with longer context (CP=4):
HF_CHECKPOINT=outputs/models/Qwen3-4B-stage1 \
REF_LOAD=outputs/models/Qwen3-4B-stage1_torch_dist \
bash scripts/train/stage2.sh

To resume either stage from a checkpoint:
RESUME_EXPERIMENT=<experiment_name> bash scripts/train/stage1.sh # or stage2.sh

Official BFCL eval (generate + evaluate via bfcl CLI):
MODEL_PATH=/path/to/model \
MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507-FC \
bash scripts/test/run_bfcl_multiturn_eval.sh

Standalone T3RL eval (async, talks directly to SGLang):
# Launch SGLang server
bash scripts/test/launch_sglang.sh # or: NUM_GPUS=1 PORT=30000 bash scripts/test/launch_sglang.sh
# Run eval
python -m t3rl.test.eval_bfcl \
--data-path data/processed/bfcl/bfcl_test.jsonl \
--sglang-url http://localhost:30000 \
--model-path /path/to/model \
--config configs/bfcl/instruct2507.yaml

Or use the shell wrapper:
SGLANG_URL=http://localhost:30000 bash scripts/test/run_t3rl_eval.sh

See Evaluation for full CLI options and output format.
# Debug rollout only (no Megatron training)
bash scripts/debug/debug_rollout.sh
# Debug training from saved rollout data
bash scripts/debug/debug_training.sh

- Background: paper techniques, curriculum details, trained models, and training curves
- Architecture: runtime call chain, slime integration hooks, key modules, TITO invariant, parser delegation, reward signal, configuration reference
- Evaluation: standalone eval pipeline, async design, eval_mode semantics, accuracy metric, CLI options
- BFCL Multi-Turn Reference: dataset structure, scoring logic, execution engine, API backends, and known edge cases
All runtime behavior is controlled by YAML config files in configs/bfcl/. See Architecture: Configuration for a full field reference.
# configs/bfcl/instruct2507.yaml (excerpt)
reasoning_parser: "qwen3-thinking"
tool_call_parser: "qwen"
think_start_token: "<thinking>"
think_end_token: "</thinking>"
max_step_limit: 20
tis_level: "token"
tis_mode: "truncate"

- Environment augmentation: adapt actionable environment augmentation (converting vague failures into corrective hints) as a flexible, online-enhanced LLM workflow, decoupled from the static four-stage curriculum
- Multi-backbone support: validate and tune curriculum stages for weaker backbones (e.g., Qwen2.5, Llama 3.1) that benefit from explicit format-regulation stages
- Extended evaluation: add out-of-distribution benchmarks (BFCL V4, ACEBench) to the standalone eval pipeline
- Multi-node training: support multi-node distributed training beyond the current single-node 8-GPU setup
If you find this work useful, please cite the Environment Tuning paper:
@article{lu2025don,
title={Don't Just Fine-tune the Agent, Tune the Environment},
author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
journal={arXiv preprint arXiv:2510.10197},
year={2025}
}