This repository implements the KeyChain data creation pipeline from LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts. KeyChain synthesizes high-quality, verifiable long-context QA training data by embedding questions within long contexts using UUID key-value chain linking and multi-level distractors — designed specifically for reinforcement learning over long contexts with fine-grained difficulty control.
RL Training: For the reinforcement learning training framework, see LoongRL.
- KeyChain Pipeline — original UUID-chain synthesis
- Plain Multihop Synthesis — vanilla long-context QA data generation
- Trajectory Generation — rollout generation + scoring for SFT
- Repository Structure
- Citation
KeyChain constructs long-context QA instances through a three-stage pipeline:
- Data Filtering — Filter source multi-hop questions (HotpotQA, MuSiQue, 2WikiMQA) using Qwen2.5-32B, retaining only questions of appropriate difficulty
- Long Context Filling — Compose long contexts (4K–128K tokens) by shuffling and inserting documents around the question-relevant passages
- KeyChain Insertion — Generate UUID key-value chains and insert the pairs at random positions throughout the context. One chain leads to the real question; other chains lead to distractor questions. The model must follow the correct chain starting from a given UUID to locate the question, then reason over the surrounding documents to answer it
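The context-filling step can be sketched roughly as follows (a minimal illustration, not the repository's exact code; `count_tokens` stands in for any tokenizer-backed length function):

```python
import random

def fill_context(relevant_docs, filler_docs, token_budget, count_tokens):
    """Compose a long context: keep the question-relevant passages and add
    shuffled filler documents until the token budget is reached."""
    docs = list(relevant_docs)
    fillers = list(filler_docs)
    random.shuffle(fillers)
    used = sum(count_tokens(d) for d in docs)
    for d in fillers:
        n = count_tokens(d)
        if used + n > token_budget:
            break
        docs.append(d)
        used += n
    random.shuffle(docs)  # interleave relevant and filler passages
    return docs
```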
Each instance presents the model with a long context containing documents interleaved with UUID key-value pairs. The model receives a starting UUID and must follow the chain to find the hidden question:
Please read the following text.
Document 0: ...
Document 3:
Who's Who? is a studio album by American jazz musician John Scofield. ...
{"bdd640fb-0667-4ad1-9c80-317fa3b1799d": "23b8c1e9-3924-46de-beb1-3b9046685257"}.
...
Document 10:
... Sonoma State offers 92 Bachelor's degrees, 19 Master's degrees ...
{"972a8469-1641-4f82-8b9d-2434e465e150": "Musician and satirist Allie Goertz
wrote a song about the "The Simpsons" character Milhouse, who Matt Groening
named after who?"}.
...
Document 47:
Neil Affleck
{"bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9": "972a8469-1641-4f82-8b9d-2434e465e150"}.
...
In the context above, there is one correct question to answer. The correct
question can only be found by following the correct consecutive chain of
key:value pairs encoded with UUID strings, starting from
"bdd640fb-0667-4ad1-9c80-317fa3b1799d".
Find the correct question first, then answer it.
| Dataset | Training QA Pairs | Unique Documents |
|---|---|---|
| HotpotQA | 90,447 | 483,696 |
| MuSiQue | 19,938 | 797,413 |
| 2WikiMQA | 167,454 | 369,378 |
pip install transformers tenacity openai azure-identity tqdm gdown
sudo apt update -y && sudo apt install unzip

# Standard long-context QA synthesis
bash scripts/synth.sh
# Relevant-documents-only (no fillers)
bash scripts/synth_relevant_only.sh
# Full pipeline: context filling + KeyChain insertion across all datasets × lengths
bash scripts/synth_qwen_filter_core_gaussian_add_distractor.sh
# Extract GPT-4o reasoning steps
bash scripts/synth_gpt_call_reasoning.sh

Custom generation:
python qa.py \
--save_dir=./ \
--save_name=hotpotqa \
--dataset=hotpot_train_v1.1.json \
--tokenizer_path=Qwen/Qwen2.5-7B-Instruct \
--tokenizer_type=hf \
--max_seq_length=32768 \
--tokens_to_generate=128 \
--num_samples=100 \
--template="{context}"

Synthesize vanilla long-context multihop QA data (no UUID/KeyChain) for SFT pre-training or evaluation. Covers all three datasets at five context lengths.
# Download datasets + generate 1000 samples per (dataset × length) combo
# 15 jobs total: hotpotqa/musique/2wikimqa × 4k/8k/16k/32k/64k
bash scripts/synth_multihop_plain.sh

Datasets are auto-downloaded on first run. Override the tokenizer via env var:
TOKENIZER_PATH=/path/to/model bash scripts/synth_multihop_plain.sh

Scale to a different sample count:
NUM_SAMPLES=500 bash scripts/synth_multihop_plain.sh

output/
├── hotpotqa/
│ ├── train-num_sample_1000-max_seq_4096.jsonl
│ ├── train-num_sample_1000-max_seq_8192.jsonl
│ ├── train-num_sample_1000-max_seq_16384.jsonl
│ ├── train-num_sample_1000-max_seq_32768.jsonl
│ └── train-num_sample_1000-max_seq_65536.jsonl
├── musique/ (same structure)
└── 2wikimqa/ (same structure)
Each line:
{
"index": 0,
"input": "Which magazine was started first Arthur's Magazine or First for Women?",
"context": "Passage 1:\n...\n\nPassage 2:\n...",
"answers": ["Arthur's Magazine"],
"length": 4096
}

Default tokenizer: Qwen/Qwen3-8B (only the tokenizer config is downloaded, no model weights).
Generate model rollouts over the plain multihop data and score them with rule-based rewards for SFT training.
Rewards are adapted from LoongRL (rule-based only, no LLM judge):
| Metric | Description |
|---|---|
| `sub_em` | Bidirectional substring containment after normalization — primary reward; matches the LoongRL training signal |
| `em` | Strict exact match after normalization |
| `f1` | Token-level F1 score |
Answer extraction: the scorer finds the last `</think>` tag and extracts `\boxed{answer}` from the text after it (for thinking models). `is_correct = sub_em`.
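The extraction step and the primary reward can be sketched as follows (a simplified reading of the description above; the repository's normalization details may differ):

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(c for c in s if c not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def sub_em(pred, gold):
    """Bidirectional substring containment after normalization."""
    p, g = normalize(pred), normalize(gold)
    return int(bool(p) and bool(g) and (p in g or g in p))

def extract_answer(text):
    """Take the text after the last </think> tag, then pull \\boxed{...}."""
    tail = text.rsplit("</think>", 1)[-1]
    m = re.search(r"\\boxed\{([^{}]*)\}", tail)
    return m.group(1).strip() if m else None
```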
# All 15 combos sequentially (vLLM owns all GPUs per job)
bash trajectory_gen/scripts/gen_trajectories.sh
# Override parameters
MODEL=/path/to/model TP_SIZE=8 N=8 bash trajectory_gen/scripts/gen_trajectories.sh

Default model: /home/wsy0227/qwen14b_2e_1node_16k_2k_FILTERedAGAIN_dis_256bsz_grpo_SUBEM_end-step151
Single file:
python trajectory_gen/generate_trajectories.py \
--input_file output/hotpotqa/train-num_sample_1000-max_seq_8192.jsonl \
--model /path/to/model \
--tp_size 4 \
--n 4 \
--temperature 0.6

# Single file via API
python trajectory_gen/generate_trajectories.py \
--input_file output/hotpotqa/train-num_sample_1000-max_seq_8192.jsonl \
--backend openai \
--model DeepSeek-V3.2 \
--openai_base_url http://host:port/v1
# All 15 combos
BACKEND=openai MODEL=DeepSeek-V3.2 OPENAI_BASE_URL=http://host:port/v1 \
bash trajectory_gen/scripts/gen_trajectories.sh

Note: for API backends, set the OPENAI_API_KEY env var if the server requires it.
OPENAI_BASE_URL=http://host:port/v1 \
OPENAI_API_KEY=dummy \
python trajectory_gen/test_api_trajectory.py
# Optional overrides
N_SAMPLES=50 N_ROLLOUTS=4 MODEL=gpt-4o \
OPENAI_BASE_URL=http://host:port/v1 \
python trajectory_gen/test_api_trajectory.py

Uses a different prompt template (step-by-step reasoning, "The answer is X" format) suited to non-thinking models. API calls run concurrently via asyncio.
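The concurrent-call pattern might look like this (a generic sketch, not the script's actual code; `client` stands in for any async callable wrapping the OpenAI-compatible endpoint):

```python
import asyncio

async def rollout(client, prompt, semaphore):
    # Bound in-flight requests so the API server is not flooded.
    async with semaphore:
        return await client(prompt)

async def run_rollouts(client, prompts, max_concurrency=8):
    """Fan out one request per prompt, at most max_concurrency at a time;
    gather preserves the input order of prompts in the results."""
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(rollout(client, p, sem) for p in prompts))
```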
trajectories/
├── hotpotqa/
│ ├── train-num_sample_1000-max_seq_4096-{model_tag}.jsonl
│ └── ...
├── musique/ (same structure)
└── 2wikimqa/ (same structure)
Each line:
{
"index": 0,
"input": "Which magazine was started first...",
"context": "Passage 1:\n...",
"answers": ["Arthur's Magazine"],
"length": 8192,
"model": "qwen14b_...step151",
"trajectories": [
{
"text": "<think>...</think>\\boxed{Arthur's Magazine}",
"extracted_answer": "Arthur's Magazine",
"sub_em": 1,
"em": 1,
"f1": 1.0,
"is_correct": 1
}
],
"num_correct": 1,
"pass_rate": 0.25
}

# Unit tests for scoring functions (no GPU required)
python -m pytest trajectory_gen/tests/test_scoring.py -v

This directory contains tools to prepare SFT training data in the ms-swift format, along with the resulting data files.
| File | Description | Rows |
|---|---|---|
| `{dataset}/train-...-LoongRL-14b-swift.jsonl` | All 4 trajectories per query from LoongRL-14b rollouts | 4,000 per file |
| `{dataset}/train-...-LoongRL-14b-swift-filtered.jsonl` | Best trajectory per query (sub_em=1, highest f1) | ~800–990 per file |
| `chatqa2_summary/long_sft_train_summary.jsonl` | Long-context summarization data from ChatQA2 | 5,776 |
Datasets: hotpotqa, musique, 2wikimqa × lengths: 4096, 8192, 16384, 32768
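The filtering rule can be sketched as follows (a hypothetical helper mirroring the description above, not `filter_swift.py` itself):

```python
def best_trajectory(trajectories):
    """Keep only correct rollouts (sub_em == 1) and return the one with the
    highest token-level F1; None if no rollout was correct."""
    correct = [t for t in trajectories if t.get("sub_em") == 1]
    if not correct:
        return None
    return max(correct, key=lambda t: t["f1"])
```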
sft_data/
├── convert_to_swift.py # Convert LoongRL-14b trajectory JSONL → ms-swift format
├── filter_swift.py # Filter to best trajectory per query (sub_em=1, max f1)
├── validate_swift.py # Validate ms-swift format (rule check + AutoPreprocessor)
├── extract_chatqa2_summary.py # Extract summarization data from ChatQA2-Long-SFT-data
├── run_convert.sh # Batch: convert all 12 LoongRL-14b files
└── run_filter.sh # Batch: filter all 12 converted files
All data files use the ms-swift messages format:
{
"messages": [
{"role": "user", "content": "The following are given passages.\n...\n\nQuestion: ..."},
{"role": "assistant", "content": "<model response>"}
],
"is_correct": 1,
"sub_em": 1,
"em": 1,
"f1": 1.0
}

To validate any file before use:
python sft_data/validate_swift.py path/to/file.jsonl

This runs two layers:
- Rule checks — verifies the `messages` field, role/content presence, non-empty content, and that the last role is `assistant`
- AutoPreprocessor dry-run — passes 100 sample rows through the ms-swift 4.x `AutoPreprocessor`; exit code 0 = pass, 1 = fail
Example output:
Validating: long_sft_train_summary.jsonl
Layer 1: rule checks...
PASS: 5776 rows, no rule errors
Layer 2: ms-swift AutoPreprocessor...
[swift] AutoPreprocessor: OK (100 rows sampled)
✓ long_sft_train_summary.jsonl passed all checks
Use --full to run AutoPreprocessor on all rows instead of 100.
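The layer-1 rule checks might look like this (a hypothetical re-implementation of the checks listed above, not the validator's actual code):

```python
def rule_check(row):
    """A row passes if it has a non-empty 'messages' list, every turn has a
    valid role and non-empty string content, and the last turn is the
    assistant's."""
    msgs = row.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if m.get("role") not in ("system", "user", "assistant"):
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    return msgs[-1]["role"] == "assistant"
```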
# 1. Convert LoongRL-14b trajectories to ms-swift format
bash sft_data/run_convert.sh
# 2. Filter to best trajectory per query
bash sft_data/run_filter.sh
# 3. Extract ChatQA2 summarization data
python sft_data/extract_chatqa2_summary.py

KeyChain/
│
├── ── Plain Multihop Synthesis ──────────────────────────────────────────────
│
├── qa.py # Long-context QA synthesis (plain, no KeyChain)
│ # --distract_questions -1 to disable distractors
├── scripts/synth_multihop_plain.sh # Run all 15 (dataset × length) synthesis jobs
│
├── ── Trajectory Generation ─────────────────────────────────────────────────
│
├── trajectory_gen/
│ ├── generate_trajectories.py # Rollout generation + rule-based scoring
│ │ # Backends: vllm, openai
│ │ # Rewards: sub_em (primary), em, f1
│ ├── test_api_trajectory.py # Quick async test against any OpenAI-compatible API
│ ├── scripts/
│ │ └── gen_trajectories.sh # Batch runner for all 15 combos (sequential)
│ └── tests/
│ └── test_scoring.py # Unit tests for scoring logic (no GPU needed)
│
├── ── SFT Data ──────────────────────────────────────────────────────────────
│
├── sft_data/
│ ├── convert_to_swift.py # LoongRL-14b trajectories → ms-swift format
│ ├── filter_swift.py # Best trajectory per query (sub_em=1, max f1)
│ ├── validate_swift.py # Format validator (rules + AutoPreprocessor)
│ ├── extract_chatqa2_summary.py # Extract summarization data from ChatQA2
│ ├── run_convert.sh # Batch convert all 12 LoongRL-14b files
│ └── run_filter.sh # Batch filter all 12 converted files
│
├── ── KeyChain Pipeline ─────────────────────────────────────────────────────
│
├── qa_filter_data.py # Context generation with pre-filtering
├── qa_filter_data_core_gaussian_middle.py # Gaussian question placement
├── qa_qwen_filtered_core_gaussian_add_distractor.py # Full pipeline: context + KeyChain
├── qa_add_distractor.py # Distractor injection module
├── qa_relevant_only.py # Supporting-facts-only generation
├── qa_musique_hard.py # MuSiQue-specific variant
├── uuid_test.py # UUID chain/tree generation utilities
├── gpt_call.py # GPT-4o reasoning extraction
├── tokenizer.py # Multi-backend tokenizer (HF, NeMo, OpenAI, Gemini)
│
├── ── Quality Filtering ─────────────────────────────────────────────────────
│
├── filter_question/ # Stage 1: model-based QA quality filtering
│ ├── filter_infer.py # vLLM distributed inference (Qwen2.5-32B)
│ ├── merge_output.py # Prediction merging & metric computation
│ └── convert_*.py # Dataset format converters
├── filter_again/ # Stage 2: secondary quality control
│ ├── judge_filters.py # Answer matching & filtering
│ └── judge_utils.py # Metrics: EM, F1, CEM
│
├── ── Scripts ───────────────────────────────────────────────────────────────
│
├── scripts/
│ ├── synth_multihop_plain.sh # Plain multihop synthesis (NEW)
│ ├── synth.sh # Basic synthesis
│ ├── synth_relevant_only.sh # Supporting-facts-only
│ ├── synth_qwen_filter_core_gaussian_add_distractor.sh # Full KeyChain pipeline
│ ├── synth_context_all_lengths.sh # All context lengths (no filtering)
│ └── synth_gpt_call_reasoning.sh # GPT-4o reasoning extraction
│
└── difficulty_analysis/ # Dataset analysis notebooks
If you use KeyChain in your research, please cite our paper:
@misc{wang2025loongrlreinforcementlearningadvanced,
title={LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts},
author={Siyuan Wang and Gaokai Zhang and Li Lyna Zhang and Ning Shang and Fan Yang and Dongyao Chen and Mao Yang},
year={2025},
eprint={2510.19363},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.19363},
}