added 2 commits on March 26, 2026 03:06 (force-pushed 9998905 to feb4c6d)
Remove observability logging additions not needed for OpenReward integration. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Overview
This PR integrates OpenReward as a first-class agentic environment in ROLL, enabling RL training on any task hosted on the OpenReward platform. The initial demo targets the EndlessTerminals task — a containerised Linux terminal benchmark with 3,255 verified shell tasks and binary episode-level rewards.
What is OpenReward?
OpenReward is a platform for hosting and serving RL training environments. It exposes each environment through a session-based SDK: the agent calls tools, receives observations and rewards, and the platform handles task setup, execution, and verification.
What is EndlessTerminals?
EndlessTerminals (`kanishk/EndlessTerminals`) is a scalable RL environment for terminal agents. At each step the model issues a shell command via a tool call; the episode ends when the model calls `done` or the step budget is exhausted. Reward is binary (1 = task solved, 0 = not). Tasks span file operations, scripting, git, data processing, archiving, and more — all auto-generated and verified inside Apptainer containers with no human annotation.
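The episode contract described above can be sketched with a toy stand-in. Every name below (`ToyEnv`, `run_episode`) is illustrative only, not part of the OpenReward SDK or ROLL:

```python
# Toy sketch of the EndlessTerminals episode contract. ToyEnv and
# run_episode are invented stand-ins, not OpenReward SDK classes.

def run_episode(policy, env, max_steps=8):
    """One episode: the policy emits tool calls until it calls `done`
    or the step budget runs out. Reward is binary: 1 = solved, 0 = not."""
    obs = env.reset()
    for _ in range(max_steps):
        call = policy(obs)                    # e.g. {"name": "bash", "arguments": {...}}
        if call["name"] == "done":            # model declares the task finished
            return 1 if env.verify() else 0   # platform-side verification
        obs = env.execute(call["arguments"]["command"])
    return 0                                  # budget exhausted -> unsolved

class ToyEnv:
    """In-memory stand-in for a containerised shell task."""
    def __init__(self):
        self.files = set()

    def reset(self):
        return "task: create a file named hello.txt"

    def execute(self, command):
        if command.startswith("touch "):
            self.files.add(command.split(" ", 1)[1])
        return "ok"

    def verify(self):
        return "hello.txt" in self.files

scripted = iter([
    {"name": "bash", "arguments": {"command": "touch hello.txt"}},
    {"name": "done", "arguments": {}},
])
reward = run_episode(lambda obs: next(scripted), ToyEnv())
```

The real environment runs commands inside an Apptainer container and verifies the task server-side; the binary episode reward and the `done`/budget termination logic are the parts this sketch preserves.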
New files
- `roll/pipeline/agentic/env/openreward/openreward_env.py`: `OpenRewardEnv`, which implements the `gem.Env` interface by wrapping the OpenReward sync SDK. Handles session lifecycle, tool-call parsing, reward reduction, retry logic, and forced-termination signals from the env manager.
- `roll/pipeline/agentic/env/openreward/tool_utils.py`: helpers for the `apply_chat_template` format; parses `<tool_call>` blocks (Qwen3.5 native XML + JSON fallback) and reduces per-step rewards to a scalar.
- `roll/pipeline/agentic/env/openreward/__init__.py`: exports `OpenRewardEnv`.
- `examples/agentic_demo/openreward_endless_terminals_reinforce_qwen35_2b.yaml`: demo training config.
- `examples/agentic_demo/run_openreward_endless_terminals.sh`: launch script that invokes `start_agentic_pipeline.py`.
Entry point
The env is registered in `roll/pipeline/agentic/env/__init__.py` under the key `openreward_env`. Use it in any YAML config via:
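A plausible fragment might look like the following; apart from the `openreward_env` key itself, every field name here is an assumption, so check the shipped demo YAML for the real schema:

```yaml
# Hypothetical config fragment -- field names other than `openreward_env`
# are assumptions; see examples/agentic_demo/*.yaml for the actual schema.
custom_envs:
  EndlessTerminals:
    env_type: openreward_env      # key registered in roll/pipeline/agentic/env/__init__.py
    env_config:
      task: kanishk/EndlessTerminals
      max_steps: 32               # per-episode step budget (assumed knob)
      reward_reduction: sum       # sum / mean / max / min
```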
Running the demo
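Presumably the demo is launched through the shipped script; any credentials or cluster-specific flags are assumptions about your setup, not documented requirements:

```
# Assumed invocation of the launch script added by this PR; export any
# OpenReward credentials your deployment needs before running.
bash examples/agentic_demo/run_openreward_endless_terminals.sh
```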
Key design notes
- `OpenRewardEnv.reset()` returns tool specs via `info["tools"]`; the Qwen tokenizer's Jinja2 template builds the system prompt automatically with the correct `<function=name>` tool-call format.
- `env_reset_failed` / `env_timeout` flags propagate to the env manager for clean episode skipping.
- Tool calls are parsed in two formats: Qwen-native XML (`<function=name><parameter=key>...`) and JSON fallback (`{"name": ..., "arguments": {...}}`). Malformed calls return a nudge message rather than crashing the episode.
- Per-step rewards are reduced with a configurable operator (`sum`, `mean`, `max`, `min`); an optional `nonterminal_reward` penalty applies when the episode truncates before reaching a terminal state.
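The two-format parsing can be sketched with regex-based extraction; this is a minimal illustration, not the actual `tool_utils.py` implementation:

```python
import json
import re

def parse_tool_call(text):
    """Parse a model tool call in either Qwen-native XML form
    (<function=name><parameter=key>value</parameter></function>) or
    JSON fallback ({"name": ..., "arguments": {...}}).
    Returns a {"name", "arguments"} dict, or None for malformed input
    (the env would then reply with a corrective nudge instead of crashing)."""
    m = re.search(r"<function=(\w+)>(.*?)</function>", text, re.DOTALL)
    if m:
        params = re.findall(r"<parameter=(\w+)>\s*(.*?)\s*</parameter>",
                            m.group(2), re.DOTALL)
        return {"name": m.group(1), "arguments": dict(params)}
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "name" in obj:
            return {"name": obj["name"], "arguments": obj.get("arguments", {})}
    except json.JSONDecodeError:
        pass
    return None
```

Malformed text falls through to `None`, which maps onto the nudge-message behaviour noted above.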