
fix: pipecleaning 0.1.1 envs for rl #768

Merged
bxyu-nvidia merged 27 commits into main from fsiino/pipecleaning-rl-envs
Mar 3, 2026

fsiino-nvidia (Contributor) commented Feb 26, 2026

This PR contains infrastructure and environment-specific changes made while pipecleaning the environments below so that they run successfully in nemo-rl.

rollout_collection:
Any server crash, network timeout, model error, or resource exhaustion can cause a rollout to fail. Without handling, any single failed row kills the entire batch. This PR returns a zero-reward fallback for a row after waiting for ROLLOUT_ROW_TIMEOUT_SECONDS.
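
A minimal sketch of the per-row fallback described above, assuming an asyncio-based collector. `run_row`, `run_row_with_fallback`, `collect`, the result shape, and the timeout value are hypothetical stand-ins for illustration, not the code in this PR.

```python
import asyncio

ROLLOUT_ROW_TIMEOUT_SECONDS = 0.05  # hypothetical value, kept small for the demo


async def run_row(row, delay):
    """Stand-in for the real per-row rollout request (hypothetical)."""
    await asyncio.sleep(delay)
    return row, {"reward": 1.0}


async def run_row_with_fallback(row, delay):
    """Return a zero-reward result instead of raising when a row fails or times out."""
    try:
        return await asyncio.wait_for(
            run_row(row, delay), timeout=ROLLOUT_ROW_TIMEOUT_SECONDS
        )
    except Exception:  # timeouts, network errors, server errors
        return row, {"reward": 0.0}


async def collect(rows_with_delays):
    """Run all rows concurrently; one bad row no longer fails the whole batch."""
    tasks = [run_row_with_fallback(row, delay) for row, delay in rows_with_delays]
    return await asyncio.gather(*tasks)


# Row 0 finishes immediately; row 1 exceeds the timeout and gets the fallback.
results = asyncio.run(collect([({"id": 0}, 0.0), ({"id": 1}, 1.0)]))
rewards = [result["reward"] for _, result in results]
```

The trade-off is that a slow but otherwise healthy row is also scored as zero once the timeout elapses.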

equivalence_llm_judge:
Previously, an overly long generated answer from a thinking model would overflow the judge model's context window, crash the judge call, and lead to cascading rollout failures. This PR adds opt-in truncation of the generated answer before it is passed to the judge model, configured via max_judge_input_tokens and chars_per_token_estimate.
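
One way the opt-in truncation could look, as a sketch only. The two parameter names come from the description above; the function name, the default of 4.0 characters per token, and the choice to keep the tail of the answer are assumptions.

```python
from typing import Optional


def truncate_for_judge(
    generated_answer: str,
    max_judge_input_tokens: Optional[int] = None,
    chars_per_token_estimate: float = 4.0,  # assumed default
) -> str:
    """Trim the generated answer to an estimated character budget."""
    if max_judge_input_tokens is None:
        # Opt-in: with no limit configured, the answer is passed through unchanged.
        return generated_answer
    max_chars = int(max_judge_input_tokens * chars_per_token_estimate)
    if len(generated_answer) <= max_chars:
        return generated_answer
    # Assumption: keep the end of the text, where a thinking model
    # usually states its final answer.
    return generated_answer[len(generated_answer) - max_chars:]
```

For example, `truncate_for_judge(text, max_judge_input_tokens=2000)` keeps roughly the last 8000 characters of `text` under the assumed default ratio.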

math_with_code:
Adds a fallback for \boxed{} answer extraction. Previously, only assistant text messages were searched for the answer, so a model that answered via code execution received a zero reward. The fallback also searches function_call_output items, so the answer can be extracted from the tool output if the model prints it inside the executed code.
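
The fallback order can be sketched as follows. The item layout (plain dicts with `type`, `role`, `content`, and `output` keys) and both function names are assumptions made for illustration; the actual response objects in the environment may differ.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 0
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            if depth == 0:
                return text[begin:i]
            depth -= 1
    return None  # unbalanced braces


def extract_answer(output_items: list) -> Optional[str]:
    """Search assistant text first, then fall back to tool outputs."""
    assistant_text, tool_text = [], []
    for item in output_items:
        if item.get("type") == "message" and item.get("role") == "assistant":
            for part in item.get("content", []):
                assistant_text.append(part.get("text", ""))
        elif item.get("type") == "function_call_output":
            tool_text.append(str(item.get("output", "")))
    answer = last_boxed("\n".join(assistant_text))
    if answer is None:
        answer = last_boxed("\n".join(tool_text))
    return answer
```

With this ordering, an answer stated in assistant text still takes precedence, and the tool output is consulted only when no boxed answer is found there.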

math_with_judge:
- Add config fields for judge truncation, similar to equivalence_llm_judge.
- Similar to math_with_code:
  - Send the extracted answer parsed from \boxed{} to the judge model instead of the raw generated answer, to avoid overflowing the judge's context window.
  - Expected answers wrapped in \(...\) produce \boxed{\(42\)}, which the math_verify library cannot parse. This strips the outer delimiters to fix parsing.
- Prevent a crash if the extracted answer is empty.
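
A sketch of the delimiter stripping and the empty-answer guard from the list above. All three helper names are hypothetical, and the code deliberately does not call math_verify itself.

```python
from typing import Optional


def strip_inline_math(answer: str) -> str:
    """Remove one outer \\( ... \\) pair so that wrapping in \\boxed{} stays parsable."""
    stripped = answer.strip()
    if len(stripped) >= 4 and stripped.startswith("\\(") and stripped.endswith("\\)"):
        return stripped[2:-2].strip()
    return stripped


def to_boxed(expected_answer: str) -> str:
    """Wrap the cleaned expected answer in \\boxed{}."""
    return "\\boxed{" + strip_inline_math(expected_answer) + "}"


def should_call_judge(extracted_answer: Optional[str]) -> bool:
    """Skip the judge call when nothing was extracted, instead of crashing on it."""
    return bool(extracted_answer and extracted_answer.strip())
```

Here `to_boxed("\\(42\\)")` yields `\boxed{42}` rather than `\boxed{\(42\)}`.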

mcqa:
Handle null option values in the dataset. Previously these were skipped, which led to invalid rows being collected and to training crashes.

mini_swe_agent:
Add a standalone data processing script to convert raw SWE-Bench/SWE-Gym data into the nemo-gym training format.

Frankie Siino and others added 23 commits February 26, 2026 12:57
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
…llection
…r handling, simplified judge prompt
…racted answer for judge, add truncation and warmup support
…er normalization, numeric fallback, max_steps limit
@fsiino-nvidia fsiino-nvidia force-pushed the fsiino/pipecleaning-rl-envs branch from 6f11b41 to 9270c18 Compare February 26, 2026 21:00
@fsiino-nvidia fsiino-nvidia linked an issue Feb 27, 2026 that may be closed by this pull request
2 tasks
@fsiino-nvidia fsiino-nvidia requested a review from a team February 27, 2026 18:00
…l-envs


# Conflicts:
#	README.md
res = await server_client.post(server_name=row["agent_ref"]["name"], url_path="/run", json=row)
await raise_for_status(res)
return row, await get_response_json(res)
try:
Review comment (Contributor):

can we revert this pls given we don't want to artificially limit the rollout collection time?

@bxyu-nvidia bxyu-nvidia merged commit d1f57f7 into main Mar 3, 2026
6 checks passed
@bxyu-nvidia bxyu-nvidia deleted the fsiino/pipecleaning-rl-envs branch March 3, 2026 19:02

Linked issue (may be closed by this pull request): Pipeclean NeMo RL training with 0.1.1 environments