T3RL — Multi-Turn Tool-Use Reinforcement Learning

Paper ICLR 2026 Models Training Curves Ask DeepWiki

T3RL (mulTi-Turn Tool-use Reinforcement Learning) is a multi-turn tool-use RL training package built on slime. It implements the Environment Tuning paradigm — training LLMs to perform multi-turn function calling via GRPO by interacting with realistic execution environments, rather than imitating static trajectories. It uses the Berkeley Function-Calling Leaderboard (BFCL) as the evaluation environment.

Features

  • Environment Tuning — Learns tool-use through environmental interaction, not trajectory imitation. See Background for details on the method and how this repo relates to the original paper.

  • Non-invasive slime integration — Plugs into slime's training loop via 6 hook functions (--custom-*-path), requiring zero patches to slime source code. See Architecture for the full hook table.

  • SGLang parser delegation — Directly imports SGLang's native reasoning and tool-call parsers; switching model families (e.g., Qwen3 -> DeepSeek) requires only changing two YAML fields. See Architecture: Parser Delegation.

  • TITO token trajectory tracking — Maintains a strict invariant (len(log_probs) == len(loss_mask) == response_length) across multi-turn episodes, ensuring GRPO gradients are applied to the correct tokens. See Architecture: TITO.

  • Multi-turn gym environment — BFCLGymAdapter provides a gym-style step()/reset() interface over BFCLEnv, with an eval_mode flag that changes step-limit behavior and scoring semantics.

  • Standalone eval pipeline — Fully async evaluation that talks directly to SGLang via aiohttp, no slime dependency required. See Evaluation.
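To make the TITO invariant above concrete, here is an illustrative sketch (not the actual T3RL implementation, all names are hypothetical) of how a multi-turn episode can keep log-probs and loss mask aligned token-for-token, with environment-produced tool-output tokens masked out of the loss:

```python
def build_episode(turns):
    """Assemble per-token arrays for one multi-turn episode.

    Each turn is (model_token_logprobs, tool_response_token_count).
    Model-generated tokens carry a real log-prob and loss_mask=1; tokens fed
    back from the environment (tool outputs) get a placeholder log-prob and
    loss_mask=0, so policy gradients never touch them.
    """
    log_probs, loss_mask = [], []
    for model_lps, n_tool_tokens in turns:
        log_probs.extend(model_lps)              # trainable model tokens
        loss_mask.extend([1] * len(model_lps))
        log_probs.extend([0.0] * n_tool_tokens)  # environment tokens
        loss_mask.extend([0] * n_tool_tokens)
    response_length = len(log_probs)
    # The strict invariant described above:
    assert len(log_probs) == len(loss_mask) == response_length
    return log_probs, loss_mask, response_length
```

The key design point is that environment tokens still occupy positions in the sequence (so `response_length` matches what the model actually saw), but contribute nothing to the loss.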

Background

T3RL implements the Environment Tuning paradigm (ICLR 2026) — training LLMs to perform multi-turn tool-use by interacting with realistic execution environments, rather than imitating static trajectories. It uses a two-stage curriculum with fine-grained progress reward (correct_turns / total_turns), currently targeting Qwen3-4B-Instruct-2507 on 8x H200. Trained checkpoints: Qwen3-4B-EnvTuning-Base (Stage 1) and Qwen3-4B-EnvTuning (Stage 2). See Background for paper techniques, curriculum details, and training curves.
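The fine-grained progress reward mentioned above reduces to a simple ratio. A minimal sketch (function name and input shape are illustrative, not the repo's actual API):

```python
def progress_reward(turn_results):
    """Fine-grained progress reward: correct_turns / total_turns.

    turn_results: list of bools, one per turn (True = turn judged correct).
    Returns a float in [0, 1]; an empty episode scores 0.
    """
    if not turn_results:
        return 0.0
    return sum(turn_results) / len(turn_results)
```

Compared with a binary all-or-nothing episode reward, this gives partial credit for partially correct episodes, which densifies the learning signal early in training.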

Installation

Prerequisites

T3RL requires the slime/Megatron runtime environment. Because the dependencies (Megatron, SGLang, etc.) are complex, using the official slime Docker images is strongly recommended.

Install

# Clone with submodules
git clone --recurse-submodules https://github.com/IcyFish332/T3RL.git
cd T3RL

# Or initialize submodules in an existing clone
git submodule update --init --recursive

# Install T3RL + bfcl_eval (does NOT install slime)
pip install -e .

# Install slime separately (no-deps to avoid conflicts with Docker env)
pip install -e 3rdparty/slime --no-deps

# (Optional) Install dev dependencies for testing
pip install -e ".[dev]"

Quick Start

1. Data Preprocessing

python data/preprocess_bfcl_data.py --output_dir data/processed/bfcl

2. Training (8x H200)

Training is split into two stages. Both scripts accept environment variable overrides for all paths and hyperparameters.

Stage 1 — Train on base data (bfcl_train_base.jsonl):

HF_CHECKPOINT=/path/to/Qwen3-4B-Instruct-2507 \
REF_LOAD=/path/to/Qwen3-4B-Instruct-2507_torch_dist \
bash scripts/train/stage1.sh

Checkpoint conversion — Convert Stage 1 output to HF + torch_dist for Stage 2:

bash scripts/convert/meg_to_hf.sh qwen3-4B-Instruct-2507 \
    --input-dir outputs/checkpoints/t3rl_bfcl/<experiment>/iter_XXXXXXX \
    --output-dir outputs/models/Qwen3-4B-stage1 --force

bash scripts/convert/hf_to_meg.sh qwen3-4B-Instruct-2507 \
    --hf-checkpoint outputs/models/Qwen3-4B-stage1 \
    --save outputs/models/Qwen3-4B-stage1_torch_dist

Stage 2 — Continue training on full data (bfcl_train.jsonl) with longer context (CP=4):

HF_CHECKPOINT=outputs/models/Qwen3-4B-stage1 \
REF_LOAD=outputs/models/Qwen3-4B-stage1_torch_dist \
bash scripts/train/stage2.sh

To resume either stage from a checkpoint:

RESUME_EXPERIMENT=<experiment_name> bash scripts/train/stage1.sh  # or stage2.sh

3. Evaluation

Official BFCL eval (generate + evaluate via bfcl CLI):

MODEL_PATH=/path/to/model \
MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507-FC \
bash scripts/test/run_bfcl_multiturn_eval.sh

Standalone T3RL eval (async, talks directly to SGLang):

# Launch SGLang server
bash scripts/test/launch_sglang.sh  # or: NUM_GPUS=1 PORT=30000 bash scripts/test/launch_sglang.sh

# Run eval
python -m t3rl.test.eval_bfcl \
    --data-path data/processed/bfcl/bfcl_test.jsonl \
    --sglang-url http://localhost:30000 \
    --model-path /path/to/model \
    --config configs/bfcl/instruct2507.yaml

Or use the shell wrapper:

SGLANG_URL=http://localhost:30000 bash scripts/test/run_t3rl_eval.sh

See Evaluation for full CLI options and output format.
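The standalone eval's request path can be sketched as a payload builder plus one async call to SGLang's native /generate endpoint. The endpoint and payload shape follow SGLang's native generation API; the function names here are illustrative, not T3RL's actual internals:

```python
def build_generate_payload(prompt, max_new_tokens=512, temperature=0.0):
    """Build the JSON body for SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

async def generate(session, sglang_url, prompt):
    """One async generation call; `session` is an aiohttp.ClientSession."""
    async with session.post(f"{sglang_url}/generate",
                            json=build_generate_payload(prompt)) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["text"]  # generated completion text
```

Because each episode is just a chain of such awaitable calls, many BFCL episodes can be evaluated concurrently with `asyncio.gather` against a single SGLang server.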

4. Debug Scripts

# Debug rollout only (no Megatron training)
bash scripts/debug/debug_rollout.sh

# Debug training from saved rollout data
bash scripts/debug/debug_training.sh

Documentation

  • Background — Paper techniques, curriculum details, trained models, and training curves
  • Architecture — Runtime call chain, slime integration hooks, key modules, TITO invariant, parser delegation, reward signal, configuration reference
  • Evaluation — Standalone eval pipeline, async design, eval_mode semantics, accuracy metric, CLI options
  • BFCL Multi-Turn Reference — Dataset structure, scoring logic, execution engine, API backends, and known edge cases

Configuration

All runtime behavior is controlled by YAML config files in configs/bfcl/. See Architecture: Configuration for a full field reference.

# configs/bfcl/instruct2507.yaml (excerpt)
reasoning_parser: "qwen3-thinking"
tool_call_parser: "qwen"
think_start_token: "<thinking>"
think_end_token: "</thinking>"
max_step_limit: 20
tis_level: "token"
tis_mode: "truncate"

Roadmap

  • Environment augmentation — Adapt actionable environment augmentation (converting vague failures into corrective hints) as a flexible, online-enhanced LLM workflow, decoupled from the static four-stage curriculum
  • Multi-backbone support — Validate and tune curriculum stages for weaker backbones (e.g., Qwen2.5, Llama 3.1) that benefit from explicit format-regulation stages
  • Extended evaluation — Add out-of-distribution benchmarks (BFCL V4, ACEBench) to the standalone eval pipeline
  • Multi-node training — Support multi-node distributed training beyond the current single-node 8-GPU setup

Citation

If you find this work useful, please cite the Environment Tuning paper:

@article{lu2025don,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025}
}
