Fold pipeline RL into GSM8K example and add client fixes #2

Open
kiddyboots216 wants to merge 11 commits into main from examples/filler-rl-pipeline

kiddyboots216 commented Mar 20, 2026

Summary

  • move the pipeline RL functionality into examples/qwen_gsm8k_rl_loop.py
  • remove the standalone filler-token example
  • add routed_expert_logits support to Datum and TrainingClient request forwarding
  • update training-client, chunking, and type tests

Closes #4.
Supersedes #5.

Testing

  • python -m py_compile examples/qwen_gsm8k_rl_loop.py
  • pytest -q tests/test_training_client.py tests/test_types.py tests/test_chunked_forward_backward.py

Adds examples/filler_rl.py — a ~1000-line pipeline RL script for
training models to generate random-number filler tokens before answering
4-digit multiplication problems. This is the minimal "V3" filler RL
implementation, distilled from the 2900-line tomi version.

Key design decisions (hardcoded, not configurable):
- Random number filler only
- 4-digit multiplication problems
- answer_only advantage weighting (filler tokens get 0 advantage)
- maxrl advantage normalization
- PPO clipping (policy_loss)
- Full-weights with NCCL sync (no LoRA)
- Pipeline RL always on (generation overlaps training)
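Two of the decisions above, answer_only advantage weighting and PPO clipping, can be illustrated with a minimal sketch. This is not the code from examples/filler_rl.py: the function names, the `is_answer` mask, the `clip_eps` default of 0.2, and averaging over all tokens are assumptions made for illustration. The maxrl normalization is left out because the commit message does not define it.

```python
import math
from typing import List


def answer_only_advantages(advantage: float, is_answer: List[bool]) -> List[float]:
    """Broadcast one sequence-level advantage to tokens; filler tokens get 0."""
    return [advantage if flag else 0.0 for flag in is_answer]


def ppo_clipped_loss(
    new_logprobs: List[float],
    old_logprobs: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,  # hypothetical default, not taken from the PR
) -> float:
    """Mean of the negated PPO clipped surrogate over all tokens."""
    if not advantages:
        return 0.0
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) surrogate, negated so that lower is better.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two filler tokens followed by two answer tokens, sequence advantage 1.0.
token_advantages = answer_only_advantages(1.0, [False, False, True, True])
loss = ppo_clipped_loss([0.0] * 4, [0.0] * 4, token_advantages)
```

With zero advantage on the filler positions, those tokens contribute nothing to the surrogate, so only the answer tokens drive the update.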
…ull_weights_safetensors

The xorl server doesn't have a /api/v1/save_full_weights_safetensors
endpoint. In full-weights mode, save_weights_for_sampler does the same
thing (saves full model as safetensors via save_full_weights operation).

Also fixes default log_path to use /data/apanda/outputs/.

When all samples for a problem have identical reward (zero advantage),
draw a fresh problem from the eval pool and sample it. This fills the
datum budget so training steps have more gradient signal.

Reward metrics (mean, correct_rate, format_rate) are computed from the
original batch only — replacements contribute datums but don't inflate
the reported numbers, since all-correct groups being replaced would
otherwise bias reward upward.
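
The replacement logic described above might look roughly like the following sketch. It is not the PR's implementation: `collect_groups`, `sample_rewards`, `replacement_pool`, and `max_replacements` are hypothetical names, and retrying a replacement that also has identical rewards (up to a bound) is an assumption the commit message does not confirm. Only the mean reward is computed here; correct_rate and format_rate would be taken from the same original-batch rewards.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Group = Tuple[str, List[float]]


def has_signal(rewards: Sequence[float]) -> bool:
    """A group carries gradient signal only if its rewards are not all identical."""
    return len(set(rewards)) > 1


def collect_groups(
    batch: Sequence[str],
    replacement_pool: Sequence[str],
    sample_rewards: Callable[[str], List[float]],
    rng: random.Random,
    max_replacements: int,
) -> Tuple[List[Group], Dict[str, float]]:
    original_rewards: List[float] = []
    kept: List[Group] = []
    missing = 0
    for problem in batch:
        rewards = sample_rewards(problem)
        original_rewards.extend(rewards)  # metrics see the original batch only
        if has_signal(rewards):
            kept.append((problem, rewards))
        else:
            missing += 1  # zero-advantage group, needs a replacement

    attempts = 0
    while missing > 0 and attempts < max_replacements:
        attempts += 1
        problem = rng.choice(replacement_pool)
        rewards = sample_rewards(problem)
        if has_signal(rewards):
            kept.append((problem, rewards))  # contributes datums, not metrics
            missing -= 1

    metrics = {"reward_mean": mean(original_rewards) if original_rewards else 0.0}
    return kept, metrics
```

Keeping `original_rewards` separate from `kept` is what prevents the reported mean from shifting when groups are swapped out.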
kiddyboots216 force-pushed the examples/filler-rl-pipeline branch from 2fd12ba to 7b2f372 on March 22, 2026 00:15
kiddyboots216 changed the title from "Add minimal filler-token RL example with pipeline mode" to "Fold pipeline RL into GSM8K example and add client fixes" on Apr 10, 2026

Development

Successfully merging this pull request may close these issues.

Add routed_expert_logits support to xorl-client R3 datums

1 participant