T3RL — Multi-Turn Tool-Use Reinforcement Learning

Paper ICLR 2026 Models Training Curves Ask DeepWiki

T3RL (mulTi-Turn Tool-use Reinforcement Learning) is a multi-turn tool-use RL training package built on slime. It implements the Environment Tuning paradigm — training LLMs to perform multi-turn function calling via GRPO by interacting with realistic execution environments, rather than imitating static trajectories. It uses the Berkeley Function-Calling Leaderboard (BFCL) as the evaluation environment.

Features

  • Environment Tuning — Learns tool-use through environmental interaction, not trajectory imitation. See Background for details on the method and how this repo relates to the original paper.

  • Non-invasive slime integration — Plugs into slime's training loop via 6 hook functions (--custom-*-path), requiring zero patches to slime source code. See Architecture for the full hook table.

  • SGLang parser delegation — Directly imports SGLang's native reasoning and tool-call parsers; switching model families (e.g., Qwen3 -> DeepSeek) requires only changing two YAML fields. See Architecture: Parser Delegation.

  • TITO token trajectory tracking — Maintains a strict invariant (len(log_probs) == len(loss_mask) == response_length) across multi-turn episodes, ensuring GRPO gradients are applied to the correct tokens. See Architecture: TITO.

  • Multi-turn gym environment — BFCLGymAdapter provides a gym-style step()/reset() interface over BFCLEnv, with an eval_mode flag that changes step-limit behavior and scoring semantics.

  • Standalone eval pipeline — Fully async evaluation that talks directly to SGLang via aiohttp, no slime dependency required. See Evaluation.
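To make the TITO invariant above concrete, here is an illustrative sketch (not the actual T3RL implementation, all names are hypothetical) of how a multi-turn episode can keep log-probs and loss mask aligned token-for-token, with environment-produced tool-output tokens masked out of the loss:

```python
def build_episode(turns):
    """Assemble per-token arrays for one multi-turn episode.

    Each turn is (model_token_logprobs, tool_response_token_count).
    Model-generated tokens carry a real log-prob and loss_mask=1; tokens fed
    back from the environment (tool outputs) get a placeholder log-prob and
    loss_mask=0, so policy gradients never touch them.
    """
    log_probs, loss_mask = [], []
    for model_lps, n_tool_tokens in turns:
        log_probs.extend(model_lps)              # trainable model tokens
        loss_mask.extend([1] * len(model_lps))
        log_probs.extend([0.0] * n_tool_tokens)  # environment tokens
        loss_mask.extend([0] * n_tool_tokens)
    response_length = len(log_probs)
    # The strict invariant described above:
    assert len(log_probs) == len(loss_mask) == response_length
    return log_probs, loss_mask, response_length
```

The key design point is that environment tokens still occupy positions in the sequence (so `response_length` matches what the model actually saw), but contribute nothing to the loss.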

Background

T3RL implements the Environment Tuning paradigm (ICLR 2026) — training LLMs to perform multi-turn tool-use by interacting with realistic execution environments, rather than imitating static trajectories. It uses a two-stage curriculum with fine-grained progress reward (correct_turns / total_turns), currently targeting Qwen3-4B-Instruct-2507 on 8x H200. Trained checkpoints: Qwen3-4B-EnvTuning-Base (Stage 1) and Qwen3-4B-EnvTuning (Stage 2). See Background for paper techniques, curriculum details, and training curves.
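The fine-grained progress reward mentioned above reduces to a simple ratio. A minimal sketch (function name and input shape are illustrative, not the repo's actual API):

```python
def progress_reward(turn_results):
    """Fine-grained progress reward: correct_turns / total_turns.

    turn_results: list of bools, one per turn (True = turn judged correct).
    Returns a float in [0, 1]; an empty episode scores 0.
    """
    if not turn_results:
        return 0.0
    return sum(turn_results) / len(turn_results)
```

Compared with a binary all-or-nothing episode reward, this gives partial credit for partially correct episodes, which densifies the learning signal early in training.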

Installation

Prerequisites

T3RL requires the slime/Megatron runtime environment. Because the dependencies (Megatron, SGLang, etc.) are complex, using the official slime Docker images is strongly recommended.

Install

# Clone with submodules
git clone --recurse-submodules https://github.com/IcyFish332/T3RL.git
cd T3RL

# Or initialize submodules in an existing clone
git submodule update --init --recursive

# Install T3RL + bfcl_eval (does NOT install slime)
pip install -e .

# Install slime separately (no-deps to avoid conflicts with Docker env)
pip install -e 3rdparty/slime --no-deps

# (Optional) Install dev dependencies for testing
pip install -e ".[dev]"

Quick Start

1. Data Preprocessing

python data/preprocess_bfcl_data.py --output_dir data/processed/bfcl

2. Training (8x H200)

Training is split into two stages. Both scripts accept environment variable overrides for all paths and hyperparameters.

Stage 1 — Train on base data (bfcl_train_base.jsonl):

HF_CHECKPOINT=/path/to/Qwen3-4B-Instruct-2507 \
REF_LOAD=/path/to/Qwen3-4B-Instruct-2507_torch_dist \
bash scripts/train/stage1.sh

Checkpoint conversion — Convert Stage 1 output to HF + torch_dist for Stage 2:

bash scripts/convert/meg_to_hf.sh qwen3-4B-Instruct-2507 \
    --input-dir outputs/checkpoints/t3rl_bfcl/<experiment>/iter_XXXXXXX \
    --output-dir outputs/models/Qwen3-4B-stage1 --force

bash scripts/convert/hf_to_meg.sh qwen3-4B-Instruct-2507 \
    --hf-checkpoint outputs/models/Qwen3-4B-stage1 \
    --save outputs/models/Qwen3-4B-stage1_torch_dist

Stage 2 — Continue training on full data (bfcl_train.jsonl) with longer context (CP=4):

HF_CHECKPOINT=outputs/models/Qwen3-4B-stage1 \
REF_LOAD=outputs/models/Qwen3-4B-stage1_torch_dist \
bash scripts/train/stage2.sh

To resume either stage from a checkpoint:

RESUME_EXPERIMENT=<experiment_name> bash scripts/train/stage1.sh  # or stage2.sh

3. Evaluation

Official BFCL eval (generate + evaluate via bfcl CLI):

MODEL_PATH=/path/to/model \
MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507-FC \
bash scripts/test/run_bfcl_multiturn_eval.sh

Standalone T3RL eval (async, talks directly to SGLang):

# Launch SGLang server
bash scripts/test/launch_sglang.sh  # or: NUM_GPUS=1 PORT=30000 bash scripts/test/launch_sglang.sh

# Run eval
python -m t3rl.test.eval_bfcl \
    --data-path data/processed/bfcl/bfcl_test.jsonl \
    --sglang-url http://localhost:30000 \
    --model-path /path/to/model \
    --config configs/bfcl/instruct2507.yaml

Or use the shell wrapper:

SGLANG_URL=http://localhost:30000 bash scripts/test/run_t3rl_eval.sh

See Evaluation for full CLI options and output format.
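The standalone eval's request path can be sketched as a payload builder plus one async call to SGLang's native /generate endpoint. The endpoint and payload shape follow SGLang's native generation API; the function names here are illustrative, not T3RL's actual internals:

```python
def build_generate_payload(prompt, max_new_tokens=512, temperature=0.0):
    """Build the JSON body for SGLang's native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

async def generate(session, sglang_url, prompt):
    """One async generation call; `session` is an aiohttp.ClientSession."""
    async with session.post(f"{sglang_url}/generate",
                            json=build_generate_payload(prompt)) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["text"]  # generated completion text
```

Because each episode is just a chain of such awaitable calls, many BFCL episodes can be evaluated concurrently with `asyncio.gather` against a single SGLang server.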

4. Debug Scripts

# Debug rollout only (no Megatron training)
bash scripts/debug/debug_rollout.sh

# Debug training from saved rollout data
bash scripts/debug/debug_training.sh

Documentation

  • Background — Paper techniques, curriculum details, trained models, and training curves
  • Architecture — Runtime call chain, slime integration hooks, key modules, TITO invariant, parser delegation, reward signal, configuration reference
  • Evaluation — Standalone eval pipeline, async design, eval_mode semantics, accuracy metric, CLI options
  • BFCL Multi-Turn Reference — Dataset structure, scoring logic, execution engine, API backends, and known edge cases

Configuration

All runtime behavior is controlled by YAML config files in configs/bfcl/. See Architecture: Configuration for a full field reference.

# configs/bfcl/instruct2507.yaml (excerpt)
reasoning_parser: "qwen3-thinking"
tool_call_parser: "qwen"
think_start_token: "<thinking>"
think_end_token: "</thinking>"
max_step_limit: 20
tis_level: "token"
tis_mode: "truncate"

Roadmap

  • Environment augmentation — Adapt actionable environment augmentation (converting vague failures into corrective hints) as a flexible, online-enhanced LLM workflow, decoupled from the static four-stage curriculum
  • Multi-backbone support — Validate and tune curriculum stages for weaker backbones (e.g., Qwen2.5, Llama 3.1) that benefit from explicit format-regulation stages
  • Extended evaluation — Add out-of-distribution benchmarks (BFCL V4, ACEBench) to the standalone eval pipeline
  • Multi-node training — Support multi-node distributed training beyond the current single-node 8-GPU setup

Citation

If you find this work useful, please cite the Environment Tuning paper:

@article{lu2025don,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025}
}
