This repository implements the KeyChain data creation pipeline from LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts. KeyChain synthesizes high-quality, verifiable long-context QA training data by embedding questions within long contexts using UUID key-value chain linking and multi-level distractors — designed specifically for reinforcement learning over long contexts with fine-grained difficulty control.
RL Training: For the reinforcement learning training framework, see LoongRL.
- KeyChain Pipeline — original UUID-chain synthesis
- Plain Multihop Synthesis — vanilla long-context QA data generation
- Trajectory Generation — rollout generation + scoring for SFT
- Repository Structure
- Citation
KeyChain constructs long-context QA instances through a three-stage pipeline:
- Data Filtering — Filter source multi-hop questions (HotpotQA, MuSiQue, 2WikiMQA) using Qwen2.5-32B, retaining only questions of appropriate difficulty
- Long Context Filling — Compose long contexts (4K–128K tokens) by shuffling and inserting documents around the question-relevant passages
- KeyChain Insertion — Generate UUID key-value chains and insert the pairs at random positions throughout the context. One chain leads to the real question; other chains lead to distractor questions. The model must follow the correct chain starting from a given UUID to locate the question, then reason over the surrounding documents to answer it
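The context-filling step can be sketched roughly as follows (a minimal illustration, not the repository's exact code; `count_tokens` stands in for any tokenizer-backed length function):

```python
import random

def fill_context(relevant_docs, filler_docs, token_budget, count_tokens):
    """Compose a long context: keep the question-relevant passages and add
    shuffled filler documents until the token budget is reached."""
    docs = list(relevant_docs)
    fillers = list(filler_docs)
    random.shuffle(fillers)
    used = sum(count_tokens(d) for d in docs)
    for d in fillers:
        n = count_tokens(d)
        if used + n > token_budget:
            break
        docs.append(d)
        used += n
    random.shuffle(docs)  # interleave relevant and filler passages
    return docs
```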
Each instance presents the model with a long context containing documents interleaved with UUID key-value pairs. The model receives a starting UUID and must follow the chain to find the hidden question:
Please read the following text.
Document 0: ...
Document 3:
Who's Who? is a studio album by American jazz musician John Scofield. ...
{"bdd640fb-0667-4ad1-9c80-317fa3b1799d": "23b8c1e9-3924-46de-beb1-3b9046685257"}.
...
Document 10:
... Sonoma State offers 92 Bachelor's degrees, 19 Master's degrees ...
{"972a8469-1641-4f82-8b9d-2434e465e150": "Musician and satirist Allie Goertz
wrote a song about the "The Simpsons" character Milhouse, who Matt Groening
named after who?"}.
...
Document 47:
Neil Affleck
{"bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9": "972a8469-1641-4f82-8b9d-2434e465e150"}.
...
In the context above, there is one correct question to answer. The correct
question can only be found by following the correct consecutive chain of
key:value pairs encoded with UUID strings, starting from
"bdd640fb-0667-4ad1-9c80-317fa3b1799d".
Find the correct question first, then answer it.
| Dataset | Training QA Pairs | Unique Documents |
|---|---|---|
| HotpotQA | 90,447 | 483,696 |
| MuSiQue | 19,938 | 797,413 |
| 2WikiMQA | 167,454 | 369,378 |
pip install transformers tenacity openai azure-identity tqdm gdown
sudo apt update -y && sudo apt install unzip

# Standard long-context QA synthesis
bash scripts/synth.sh
# Relevant-documents-only (no fillers)
bash scripts/synth_relevant_only.sh
# Full pipeline: context filling + KeyChain insertion across all datasets × lengths
bash scripts/synth_qwen_filter_core_gaussian_add_distractor.sh
# Extract GPT-4o reasoning steps
bash scripts/synth_gpt_call_reasoning.sh

Custom generation:
python qa.py \
--save_dir=./ \
--save_name=hotpotqa \
--dataset=hotpot_train_v1.1.json \
--tokenizer_path=Qwen/Qwen2.5-7B-Instruct \
--tokenizer_type=hf \
--max_seq_length=32768 \
--tokens_to_generate=128 \
--num_samples=100 \
--template="{context}"

Synthesize vanilla long-context multihop QA data (no UUID/KeyChain) for SFT pre-training or evaluation. Covers all three datasets at five context lengths.
# Download datasets + generate 1000 samples per (dataset × length) combo
# 15 jobs total: hotpotqa/musique/2wikimqa × 4k/8k/16k/32k/64k
bash scripts/synth_multihop_plain.sh

Datasets are auto-downloaded on first run. Override the tokenizer via env var:
TOKENIZER_PATH=/path/to/model bash scripts/synth_multihop_plain.sh

Scale to a different sample count:
NUM_SAMPLES=500 bash scripts/synth_multihop_plain.sh

output/
├── hotpotqa/
│ ├── train-num_sample_1000-max_seq_4096.jsonl
│ ├── train-num_sample_1000-max_seq_8192.jsonl
│ ├── train-num_sample_1000-max_seq_16384.jsonl
│ ├── train-num_sample_1000-max_seq_32768.jsonl
│ └── train-num_sample_1000-max_seq_65536.jsonl
├── musique/ (same structure)
└── 2wikimqa/ (same structure)
Each line:
{
"index": 0,
"input": "Which magazine was started first Arthur's Magazine or First for Women?",
"context": "Passage 1:\n...\n\nPassage 2:\n...",
"answers": ["Arthur's Magazine"],
"length": 4096
}

Default tokenizer: Qwen/Qwen3-8B (only the tokenizer config is downloaded, no model weights).
Generate model rollouts over the plain multihop data and score them with rule-based rewards for SFT training.
Rewards are adapted from LoongRL (rule-based only, no LLM judge):
| Metric | Description |
|---|---|
| `sub_em` | Bidirectional substring containment after normalization — primary reward; matches the LoongRL training signal |
| `em` | Strict exact match after normalization |
| `f1` | Token-level F1 score |
Answer extraction: the scorer finds the last `</think>` tag and extracts `\boxed{answer}` from the text after it (for thinking models). `is_correct = sub_em`.
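The extraction step and the primary reward can be sketched as follows (a simplified reading of the description above; the repository's normalization details may differ):

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(c for c in s if c not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def sub_em(pred, gold):
    """Bidirectional substring containment after normalization."""
    p, g = normalize(pred), normalize(gold)
    return int(bool(p) and bool(g) and (p in g or g in p))

def extract_answer(text):
    """Take the text after the last </think> tag, then pull \\boxed{...}."""
    tail = text.rsplit("</think>", 1)[-1]
    m = re.search(r"\\boxed\{([^{}]*)\}", tail)
    return m.group(1).strip() if m else None
```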
# All 15 combos sequentially (vLLM owns all GPUs per job)
bash trajectory_gen/scripts/gen_trajectories.sh
# Override parameters
MODEL=/path/to/model TP_SIZE=8 N=8 bash trajectory_gen/scripts/gen_trajectories.sh

Default model: /home/wsy0227/qwen14b_2e_1node_16k_2k_FILTERedAGAIN_dis_256bsz_grpo_SUBEM_end-step151
Single file:
python trajectory_gen/generate_trajectories.py \
--input_file output/hotpotqa/train-num_sample_1000-max_seq_8192.jsonl \
--model /path/to/model \
--tp_size 4 \
--n 4 \
--temperature 0.6

# Single file via API
python trajectory_gen/generate_trajectories.py \
--input_file output/hotpotqa/train-num_sample_1000-max_seq_8192.jsonl \
--backend openai \
--model DeepSeek-V3.2 \
--openai_base_url http://host:port/v1
# All 15 combos
BACKEND=openai MODEL=DeepSeek-V3.2 OPENAI_BASE_URL=http://host:port/v1 \
bash trajectory_gen/scripts/gen_trajectories.sh

Note: for API backends, set the OPENAI_API_KEY env var if the server requires it.
OPENAI_BASE_URL=http://host:port/v1 \
OPENAI_API_KEY=dummy \
python trajectory_gen/test_api_trajectory.py
# Optional overrides
N_SAMPLES=50 N_ROLLOUTS=4 MODEL=gpt-4o \
OPENAI_BASE_URL=http://host:port/v1 \
python trajectory_gen/test_api_trajectory.py

Uses a different prompt template (step-by-step reasoning, "The answer is X" format) suited to non-thinking models. API calls run concurrently via asyncio.
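The concurrent-call pattern might look like this (a generic sketch, not the script's actual code; `client` stands in for any async callable wrapping the OpenAI-compatible endpoint):

```python
import asyncio

async def rollout(client, prompt, semaphore):
    # Bound in-flight requests so the API server is not flooded.
    async with semaphore:
        return await client(prompt)

async def run_rollouts(client, prompts, max_concurrency=8):
    """Fan out one request per prompt, at most max_concurrency at a time;
    gather preserves the input order of prompts in the results."""
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(rollout(client, p, sem) for p in prompts))
```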
trajectories/
├── hotpotqa/
│ ├── train-num_sample_1000-max_seq_4096-{model_tag}.jsonl
│ └── ...
├── musique/ (same structure)
└── 2wikimqa/ (same structure)
Each line:
{
"index": 0,
"input": "Which magazine was started first...",
"context": "Passage 1:\n...",
"answers": ["Arthur's Magazine"],
"length": 8192,
"model": "qwen14b_...step151",
"trajectories": [
{
"text": "<think>...</think>\\boxed{Arthur's Magazine}",
"extracted_answer": "Arthur's Magazine",
"sub_em": 1,
"em": 1,
"f1": 1.0,
"is_correct": 1
}
],
"num_correct": 1,
"pass_rate": 0.25
}

# Unit tests for scoring functions (no GPU required)
python -m pytest trajectory_gen/tests/test_scoring.py -v

This directory contains tools to prepare SFT training data in the ms-swift format, along with the resulting data files.
| File | Description | Rows |
|---|---|---|
| `{dataset}/train-...-LoongRL-14b-swift.jsonl` | All 4 trajectories per query from LoongRL-14b rollouts | 4,000 per file |
| `{dataset}/train-...-LoongRL-14b-swift-filtered.jsonl` | Best trajectory per query (sub_em=1, highest f1) | ~800–990 per file |
| `chatqa2_summary/long_sft_train_summary.jsonl` | Long-context summarization data from ChatQA2 | 5,776 |
Datasets: hotpotqa, musique, 2wikimqa × lengths: 4096, 8192, 16384, 32768
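The filtering rule can be sketched as follows (a hypothetical helper mirroring the description above, not `filter_swift.py` itself):

```python
def best_trajectory(trajectories):
    """Keep only correct rollouts (sub_em == 1) and return the one with the
    highest token-level F1; None if no rollout was correct."""
    correct = [t for t in trajectories if t.get("sub_em") == 1]
    if not correct:
        return None
    return max(correct, key=lambda t: t["f1"])
```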
sft_data/
├── convert_to_swift.py # Convert LoongRL-14b trajectory JSONL → ms-swift format
├── filter_swift.py # Filter to best trajectory per query (sub_em=1, max f1)
├── validate_swift.py # Validate ms-swift format (rule check + AutoPreprocessor)
├── extract_chatqa2_summary.py # Extract summarization data from ChatQA2-Long-SFT-data
├── run_convert.sh # Batch: convert all 12 LoongRL-14b files
└── run_filter.sh # Batch: filter all 12 converted files
All data files use the ms-swift messages format:
{
"messages": [
{"role": "user", "content": "The following are given passages.\n...\n\nQuestion: ..."},
{"role": "assistant", "content": "<model response>"}
],
"is_correct": 1,
"sub_em": 1,
"em": 1,
"f1": 1.0
}

To validate any file before use:
python sft_data/validate_swift.py path/to/file.jsonl

This runs two layers:
- Rule checks — verifies the `messages` field, role/content presence, non-empty content, and that the last role is `assistant`
- AutoPreprocessor dry-run — passes 100 sample rows through the ms-swift 4.x `AutoPreprocessor`; exit code 0 = pass, 1 = fail
Example output:
Validating: long_sft_train_summary.jsonl
Layer 1: rule checks...
PASS: 5776 rows, no rule errors
Layer 2: ms-swift AutoPreprocessor...
[swift] AutoPreprocessor: OK (100 rows sampled)
✓ long_sft_train_summary.jsonl passed all checks
Use --full to run AutoPreprocessor on all rows instead of 100.
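The layer-1 rule checks might look like this (a hypothetical re-implementation of the checks listed above, not the validator's actual code):

```python
def rule_check(row):
    """A row passes if it has a non-empty 'messages' list, every turn has a
    valid role and non-empty string content, and the last turn is the
    assistant's."""
    msgs = row.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if m.get("role") not in ("system", "user", "assistant"):
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    return msgs[-1]["role"] == "assistant"
```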
# 1. Convert LoongRL-14b trajectories to ms-swift format
bash sft_data/run_convert.sh
# 2. Filter to best trajectory per query
bash sft_data/run_filter.sh
# 3. Extract ChatQA2 summarization data
python sft_data/extract_chatqa2_summary.py

KeyChain/
│
├── ── Plain Multihop Synthesis ──────────────────────────────────────────────
│
├── qa.py # Long-context QA synthesis (plain, no KeyChain)
│ # --distract_questions -1 to disable distractors
├── scripts/synth_multihop_plain.sh # Run all 15 (dataset × length) synthesis jobs
│
├── ── Trajectory Generation ─────────────────────────────────────────────────
│
├── trajectory_gen/
│ ├── generate_trajectories.py # Rollout generation + rule-based scoring
│ │ # Backends: vllm, openai
│ │ # Rewards: sub_em (primary), em, f1
│ ├── test_api_trajectory.py # Quick async test against any OpenAI-compatible API
│ ├── scripts/
│ │ └── gen_trajectories.sh # Batch runner for all 15 combos (sequential)
│ └── tests/
│ └── test_scoring.py # Unit tests for scoring logic (no GPU needed)
│
├── ── SFT Data ──────────────────────────────────────────────────────────────
│
├── sft_data/
│ ├── convert_to_swift.py # LoongRL-14b trajectories → ms-swift format
│ ├── filter_swift.py # Best trajectory per query (sub_em=1, max f1)
│ ├── validate_swift.py # Format validator (rules + AutoPreprocessor)
│ ├── extract_chatqa2_summary.py # Extract summarization data from ChatQA2
│ ├── run_convert.sh # Batch convert all 12 LoongRL-14b files
│ └── run_filter.sh # Batch filter all 12 converted files
│
├── ── KeyChain Pipeline ─────────────────────────────────────────────────────
│
├── qa_filter_data.py # Context generation with pre-filtering
├── qa_filter_data_core_gaussian_middle.py # Gaussian question placement
├── qa_qwen_filtered_core_gaussian_add_distractor.py # Full pipeline: context + KeyChain
├── qa_add_distractor.py # Distractor injection module
├── qa_relevant_only.py # Supporting-facts-only generation
├── qa_musique_hard.py # MuSiQue-specific variant
├── uuid_test.py # UUID chain/tree generation utilities
├── gpt_call.py # GPT-4o reasoning extraction
├── tokenizer.py # Multi-backend tokenizer (HF, NeMo, OpenAI, Gemini)
│
├── ── Quality Filtering ─────────────────────────────────────────────────────
│
├── filter_question/ # Stage 1: model-based QA quality filtering
│ ├── filter_infer.py # vLLM distributed inference (Qwen2.5-32B)
│ ├── merge_output.py # Prediction merging & metric computation
│ └── convert_*.py # Dataset format converters
├── filter_again/ # Stage 2: secondary quality control
│ ├── judge_filters.py # Answer matching & filtering
│ └── judge_utils.py # Metrics: EM, F1, CEM
│
├── ── Scripts ───────────────────────────────────────────────────────────────
│
├── scripts/
│ ├── synth_multihop_plain.sh # Plain multihop synthesis (NEW)
│ ├── synth.sh # Basic synthesis
│ ├── synth_relevant_only.sh # Supporting-facts-only
│ ├── synth_qwen_filter_core_gaussian_add_distractor.sh # Full KeyChain pipeline
│ ├── synth_context_all_lengths.sh # All context lengths (no filtering)
│ └── synth_gpt_call_reasoning.sh # GPT-4o reasoning extraction
│
└── difficulty_analysis/ # Dataset analysis notebooks
If you use KeyChain in your research, please cite our paper:
@misc{wang2025loongrlreinforcementlearningadvanced,
title={LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts},
author={Siyuan Wang and Gaokai Zhang and Li Lyna Zhang and Ning Shang and Fan Yang and Dongyao Chen and Mao Yang},
year={2025},
eprint={2510.19363},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.19363},
}