
ROLL with OpenReward Environments#401

Open
shamanez wants to merge 4 commits into alibaba:main from shamanez:feat/openreward-integration

Conversation


@shamanez shamanez commented Mar 25, 2026

Overview

This PR integrates OpenReward as a first-class agentic environment in ROLL, enabling RL training on any task hosted on the OpenReward platform. The initial demo targets the EndlessTerminals task — a containerised Linux terminal benchmark with 3,255 verified shell tasks and binary episode-level rewards.

What is OpenReward?

OpenReward is a platform for hosting and serving RL training environments. It exposes each environment through a session-based SDK: the agent calls tools, receives observations and rewards, and the platform handles task setup, execution, and verification.

What is EndlessTerminals?

EndlessTerminals (kanishk/EndlessTerminals) is a scalable RL environment for terminal agents. At each step the model issues a shell command via a tool call; the episode ends when the model calls done or the step budget is exhausted. Reward is binary (1 = task solved, 0 = not). Tasks span file operations, scripting, git, data processing, archiving, and more — all auto-generated and verified inside Apptainer containers with no human annotation.
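The episode contract can be sketched with a small mock. This is illustrative only, not the OpenReward SDK: the `run_episode` helper, the observation strings, and the sample task are all hypothetical, but the control flow matches the description above (shell commands as tool calls, a `done` call or exhausted step budget ends the episode, binary reward).

```python
# Hypothetical mock of the EndlessTerminals episode contract.
MAX_STEPS = 16

def run_episode(policy, verify):
    """policy(observation) -> tool-call dict; verify() -> bool (task solved?)."""
    observation = "task: create /tmp/out.txt containing 'hello'"
    for _ in range(MAX_STEPS):
        call = policy(observation)
        if call["name"] == "done":
            return 1 if verify() else 0   # binary episode-level reward
        observation = f"ran: {call['arguments']['command']}"
    return 0  # step budget exhausted without calling done

# A trivial policy that "solves" the task in one step, then calls done:
steps = iter([
    {"name": "bash", "arguments": {"command": "echo hello > /tmp/out.txt"}},
    {"name": "done", "arguments": {}},
])
reward = run_episode(lambda obs: next(steps), verify=lambda: True)
```

A policy that never calls `done` burns through the 16-step budget and receives reward 0 regardless of the final state.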

New files

| File | Purpose |
| --- | --- |
| `roll/pipeline/agentic/env/openreward/openreward_env.py` | `OpenRewardEnv` — implements the `gem.Env` interface by wrapping the OpenReward sync SDK. Handles session lifecycle, tool-call parsing, reward reduction, retry logic, and forced-termination signals from the env manager. |
| `roll/pipeline/agentic/env/openreward/tool_utils.py` | Stateless helpers: converts OpenReward tool specs to the Qwen `apply_chat_template` format, parses `<tool_call>` blocks (Qwen3.5 native XML plus a JSON fallback), and reduces per-step rewards to a scalar. |
| `roll/pipeline/agentic/env/openreward/__init__.py` | Package init — re-exports `OpenRewardEnv`. |
| `examples/agentic_demo/openreward_endless_terminals_reinforce_qwen35_2b.yaml` | Ready-to-use Hydra config: Qwen3.5-2B + STEP_REINFORCE + Megatron-Core (TP=2, CP=2) + vLLM inference on 8 GPUs. |
| `examples/agentic_demo/run_openreward_endless_terminals.sh` | Launch script that sets the required NCCL env vars and calls `start_agentic_pipeline.py`. |
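The retry logic mentioned for `openreward_env.py` follows a standard exponential-backoff pattern. The sketch below is a generic illustration, not the PR's actual code: `create_session_with_retry`, the attempt count, and the delay schedule are assumptions.

```python
import time

def create_session_with_retry(create_fn, max_attempts=5, base_delay=1.0):
    """Retry a flaky session-creation call with exponential backoff.

    Hypothetical helper: `create_fn` stands in for the OpenReward SDK's
    session constructor; delays double after each failure (1s, 2s, 4s, ...).
    """
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the env manager skip the episode
            time.sleep(base_delay * (2 ** attempt))

# Simulate two transient failures followed by success:
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "session-ok"

session = create_session_with_retry(flaky, base_delay=0.0)
```

Re-raising on the final attempt is what lets the `env_reset_failed` flag propagate so the env manager can skip the episode cleanly instead of hanging.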

Entry point

The env is registered in roll/pipeline/agentic/env/__init__.py under the key openreward_env:

```python
gem.register("openreward_env", entry_point="roll.pipeline.agentic.env.openreward:OpenRewardEnv")
```

Use it in any YAML config via:

```yaml
custom_envs:
  MyEnvTag:
    env_type: "openreward_env"
    env_config:
      environment_name: "kanishk/EndlessTerminals"   # or any other OpenReward task
      split: "train"
      max_steps: 16
      reward_reduction: "sum"   # sum | mean | max | min
```
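The `reward_reduction` key collapses a session's per-step rewards into one scalar per episode. A minimal sketch of the four modes (the function name is illustrative, not the PR's actual helper):

```python
# Sketch of reward_reduction semantics: sum | mean | max | min.
def reduce_rewards(rewards, mode="sum"):
    """Collapse a list of per-step rewards to a single episode-level scalar."""
    if not rewards:
        return 0.0  # no steps recorded: treat as zero reward
    ops = {
        "sum": sum,
        "mean": lambda r: sum(r) / len(r),
        "max": max,
        "min": min,
    }
    return float(ops[mode](rewards))

per_step = [0.0, 0.0, 1.0]   # e.g. a binary reward granted on the final step
episode_reward = reduce_rewards(per_step, "sum")   # 1.0
```

For a binary terminal reward like EndlessTerminals', `sum` and `max` give the same result; `mean` would scale the signal down by episode length.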

Running the demo

```bash
# Prerequisites
pip install openreward

# Set credentials
export OPENREWARD_API_KEY="..."
export WANDB_API_KEY="..."

# Launch training (Qwen3.5-2B, STEP_REINFORCE, 8 GPUs)
bash examples/agentic_demo/run_openreward_endless_terminals.sh

# Or override config params directly
python examples/start_agentic_pipeline.py \
  --config_path agentic_demo \
  --config_name openreward_endless_terminals_reinforce_qwen35_2b \
  max_steps=100 rollout_batch_size=32
```

Key design notes

  • No system prompt needed — OpenRewardEnv.reset() returns tool specs via info["tools"]; the Qwen tokenizer's Jinja2 template builds the system prompt automatically with the correct <function=name> tool-call format.
  • Graceful failure handling — session creation uses exponential backoff retries; env_reset_failed / env_timeout flags propagate to the env manager for clean episode skipping.
  • Tool-call format — supports both Qwen3.5 native XML (<function=name><parameter=key>...) and JSON fallback ({"name": ..., "arguments": {...}}). Malformed calls return a nudge message rather than crashing the episode.
  • Reward reduction — configurable (sum, mean, max, min) over per-step rewards; an optional nonterminal_reward penalty applies when the episode truncates before reaching a terminal state.
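The dual-format tool-call parsing described above can be sketched as follows. This is a simplified illustration of the two formats named in the PR, not the actual `tool_utils.py` implementation; the regexes and the `parse_tool_call` name are assumptions.

```python
import json
import re

def parse_tool_call(text):
    """Parse a model tool call: Qwen-style XML first, JSON object as fallback.

    Returns {"name": ..., "arguments": {...}} or None for malformed input
    (the real env would return a nudge message rather than crash).
    """
    # Qwen3.5 native XML: <function=name><parameter=key>value</parameter>...
    m = re.search(r"<function=(\w+)>(.*?)</function>", text, re.DOTALL)
    if m:
        name, body = m.group(1), m.group(2)
        params = dict(re.findall(
            r"<parameter=(\w+)>(.*?)</parameter>", body, re.DOTALL))
        return {"name": name, "arguments": params}
    # JSON fallback: {"name": ..., "arguments": {...}}
    try:
        call = json.loads(text)
        if isinstance(call, dict) and "name" in call:
            return call
    except json.JSONDecodeError:
        pass
    return None  # malformed: caller nudges the model instead of crashing

xml = "<function=bash><parameter=command>ls -la</parameter></function>"
parsed = parse_tool_call(xml)
# {'name': 'bash', 'arguments': {'command': 'ls -la'}}
```

Returning `None` (and nudging) instead of raising keeps one malformed generation from aborting an otherwise recoverable episode.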


CLAassistant commented Mar 25, 2026

CLA assistant check
All committers have signed the CLA.

@shamanez shamanez force-pushed the feat/openreward-integration branch from 9998905 to feb4c6d on March 25, 2026 at 16:06
shamanez and others added 2 commits March 26, 2026 03:08
Remove observability logging additions not needed for OpenReward integration.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@shamanez shamanez changed the title Feat/openreward integration feat: OpenReward environment integration Mar 25, 2026
@shamanez shamanez changed the title feat: OpenReward environment integration ROLL with OpenReward Environments Mar 25, 2026