Fold pipeline RL into GSM8K example and add client fixes by kiddyboots216 · Pull Request #2 · togethercomputer/xorl-client

kiddyboots216 · 2026-03-20T21:13:05Z

Summary

- Move the pipeline RL functionality into examples/qwen_gsm8k_rl_loop.py
- Remove the standalone filler-token example
- Add routed_expert_logits support to Datum and TrainingClient request forwarding
- Update training-client, chunking, and type tests

Closes #4.
Supersedes #5.

Testing

python -m py_compile examples/qwen_gsm8k_rl_loop.py
pytest -q tests/test_training_client.py tests/test_types.py tests/test_chunked_forward_backward.py

Adds examples/filler_rl.py, a ~1000-line pipeline RL script for training models to generate random-number filler tokens before answering 4-digit multiplication problems. This is the minimal "V3" filler RL implementation, distilled from the 2900-line tomi version.

Key design decisions (hardcoded, not configurable):
- Random number filler only
- 4-digit multiplication problems
- answer_only advantage weighting (filler tokens get 0 advantage)
- maxrl advantage normalization
- PPO clipping (policy_loss)
- Full-weights with NCCL sync (no LoRA)
- Pipeline RL always on (generation overlaps training)
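As a rough illustration of two of the decisions above (answer_only weighting and PPO clipping), here is a minimal pure-Python sketch. It is not the code from examples/filler_rl.py: the function names, the `clip_eps=0.2` default, and the choice to average the loss over all tokens (filler included) are assumptions made for the example.

```python
import math


def answer_only_advantages(advantage, is_answer_token):
    """Broadcast one per-sample advantage to its tokens, zeroing filler positions."""
    return [advantage if is_answer else 0.0 for is_answer in is_answer_token]


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate loss over tokens (a quantity to minimize).

    clip_eps and the mean over all tokens are illustrative assumptions.
    """
    if not advantages:
        return 0.0
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (min) surrogate, negated so that lower is better.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two filler tokens followed by two answer tokens.
token_advantages = answer_only_advantages(1.5, [False, False, True, True])
loss = ppo_clipped_loss([-1.0] * 4, [-1.0] * 4, token_advantages)
```

Because filler positions carry zero advantage, they contribute nothing to the surrogate, so only the answer tokens shape the policy gradient in this sketch.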

…ull_weights_safetensors

The xorl server doesn't have a /api/v1/save_full_weights_safetensors endpoint. In full-weights mode, save_weights_for_sampler does the same thing (saves the full model as safetensors via the save_full_weights operation). Also fixes the default log_path to use /data/apanda/outputs/.

When all samples for a problem have identical reward (zero advantage), draw a fresh problem from the eval pool and sample it. This fills the datum budget so training steps have more gradient signal. Reward metrics (mean, correct_rate, format_rate) are computed from the original batch only; replacements contribute datums but don't inflate the reported numbers, since all-correct groups being replaced would otherwise bias reward upward.
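The replacement logic described above could be sketched roughly as follows. This is a simplified illustration, not the script's actual code: `build_training_groups`, `sample_group`, and `draw_replacement` are hypothetical names, and the sketch assumes a single replacement attempt per zero-advantage group, dropping the slot if the replacement also has uniform rewards.

```python
from statistics import mean


def has_signal(rewards):
    """A group yields nonzero advantages only if its rewards are not all identical."""
    return len(set(rewards)) > 1


def build_training_groups(problems, sample_group, draw_replacement):
    """Return (groups_for_training, reported_mean_reward).

    sample_group(problem) -> list of rewards, one per sampled completion.
    draw_replacement()    -> a fresh problem (hypothetical helper).
    """
    original = [(p, sample_group(p)) for p in problems]
    # Metrics come from the original batch only, before any replacement.
    reported = mean(r for _, rewards in original for r in rewards)

    groups = []
    for problem, rewards in original:
        if not has_signal(rewards):
            # Assumption: one replacement attempt per degenerate group.
            problem = draw_replacement()
            rewards = sample_group(problem)
        if has_signal(rewards):
            groups.append((problem, rewards))
    return groups, reported


# Toy usage with fixed rewards: "a" is all-correct and gets replaced by "c".
toy_rewards = {"a": [1.0, 1.0, 1.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 0.0, 1.0]}
groups, reported = build_training_groups(
    ["a", "b"], toy_rewards.__getitem__, lambda: "c"
)
```

In the toy run, the reported mean still reflects problems "a" and "b", while the training groups contain "c" and "b".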

kiddyboots216 added 3 commits March 20, 2026 14:12

kiddyboots216 force-pushed the examples/filler-rl-pipeline branch from 2fd12ba to 7b2f372 Compare March 22, 2026 00:15

kiddyboots216 added 7 commits March 22, 2026 22:47

update script

93e4782

install wandb

6b29ff2

Add routed_expert_logits support to R3 datums

dd46933

Fix stale chunking test expectations

f8368c1

Move pipeline RL into GSM8K example

f0104c6

Merge pull request #5 from togethercomputer/issue-4-routed-expert-logits

fd76acc

Add worker_port support for inference endpoints

dd3abff

kiddyboots216 changed the title ~~Add minimal filler-token RL example with pipeline mode~~ Fold pipeline RL into GSM8K example and add client fixes Apr 10, 2026

kiddyboots216 mentioned this pull request Apr 10, 2026

Add routed_expert_logits support to R3 datums #5

Closed

Remove worker_port inference endpoint helper

9b9f488

kiddyboots216 mentioned this pull request Apr 10, 2026

Add worker_port support to TrainingClient.add_inference_endpoint() #3

Closed


kiddyboots216 wants to merge 11 commits into main from examples/filler-rl-pipeline


1 participant
