
ROLL with OpenReward Environments#401

Open
shamanez wants to merge 4 commits into alibaba:main from shamanez:feat/openreward-integration

Conversation


@shamanez shamanez commented Mar 25, 2026

Overview

This PR integrates OpenReward as a first-class agentic environment in ROLL, enabling RL training on any task hosted on the OpenReward platform. The initial demo targets the EndlessTerminals task — a containerised Linux terminal benchmark with 3,255 verified shell tasks and binary episode-level rewards.

What is OpenReward?

OpenReward is a platform for hosting and serving RL training environments. It exposes each environment through a session-based SDK: the agent calls tools, receives observations and rewards, and the platform handles task setup, execution, and verification.

What is EndlessTerminals?

EndlessTerminals (kanishk/EndlessTerminals) is a scalable RL environment for terminal agents. At each step the model issues a shell command via a tool call; the episode ends when the model calls done or the step budget is exhausted. Reward is binary (1 = task solved, 0 = not). Tasks span file operations, scripting, git, data processing, archiving, and more — all auto-generated and verified inside Apptainer containers with no human annotation.
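The episode contract can be sketched with a small mock. This is illustrative only, not the OpenReward SDK: the `run_episode` helper, the observation strings, and the sample task are all hypothetical, but the control flow matches the description above (shell commands as tool calls, a `done` call or exhausted step budget ends the episode, binary reward).

```python
# Hypothetical mock of the EndlessTerminals episode contract.
MAX_STEPS = 16

def run_episode(policy, verify):
    """policy(observation) -> tool-call dict; verify() -> bool (task solved?)."""
    observation = "task: create /tmp/out.txt containing 'hello'"
    for _ in range(MAX_STEPS):
        call = policy(observation)
        if call["name"] == "done":
            return 1 if verify() else 0   # binary episode-level reward
        observation = f"ran: {call['arguments']['command']}"
    return 0  # step budget exhausted without calling done

# A trivial policy that "solves" the task in one step, then calls done:
steps = iter([
    {"name": "bash", "arguments": {"command": "echo hello > /tmp/out.txt"}},
    {"name": "done", "arguments": {}},
])
reward = run_episode(lambda obs: next(steps), verify=lambda: True)
```

A policy that never calls `done` burns through the 16-step budget and receives reward 0 regardless of the final state.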

New files

| File | Purpose |
| --- | --- |
| `roll/pipeline/agentic/env/openreward/openreward_env.py` | `OpenRewardEnv` — implements the `gem.Env` interface by wrapping the OpenReward sync SDK. Handles session lifecycle, tool-call parsing, reward reduction, retry logic, and forced-termination signals from the env manager. |
| `roll/pipeline/agentic/env/openreward/tool_utils.py` | Stateless helpers: converts OpenReward tool specs to the Qwen `apply_chat_template` format, parses `<tool_call>` blocks (Qwen3.5 native XML plus a JSON fallback), and reduces per-step rewards to a scalar. |
| `roll/pipeline/agentic/env/openreward/__init__.py` | Package init — re-exports `OpenRewardEnv`. |
| `examples/agentic_demo/openreward_endless_terminals_reinforce_qwen35_2b.yaml` | Ready-to-use Hydra config: Qwen3.5-2B + STEP_REINFORCE + Megatron-Core (TP=2, CP=2) + vLLM inference on 8 GPUs. |
| `examples/agentic_demo/run_openreward_endless_terminals.sh` | Launch script that sets the required NCCL env vars and calls `start_agentic_pipeline.py`. |
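The retry logic mentioned for `openreward_env.py` follows a standard exponential-backoff pattern. The sketch below is a generic illustration, not the PR's actual code: `create_session_with_retry`, the attempt count, and the delay schedule are assumptions.

```python
import time

def create_session_with_retry(create_fn, max_attempts=5, base_delay=1.0):
    """Retry a flaky session-creation call with exponential backoff.

    Hypothetical helper: `create_fn` stands in for the OpenReward SDK's
    session constructor; delays double after each failure (1s, 2s, 4s, ...).
    """
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the env manager skip the episode
            time.sleep(base_delay * (2 ** attempt))

# Simulate two transient failures followed by success:
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "session-ok"

session = create_session_with_retry(flaky, base_delay=0.0)
```

Re-raising on the final attempt is what lets the `env_reset_failed` flag propagate so the env manager can skip the episode cleanly instead of hanging.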

Entry point

The env is registered in roll/pipeline/agentic/env/__init__.py under the key openreward_env:

```python
gem.register("openreward_env", entry_point="roll.pipeline.agentic.env.openreward:OpenRewardEnv")
```

Use it in any YAML config via:

```yaml
custom_envs:
  MyEnvTag:
    env_type: "openreward_env"
    env_config:
      environment_name: "kanishk/EndlessTerminals"   # or any other OpenReward task
      split: "train"
      max_steps: 16
      reward_reduction: "sum"   # sum | mean | max | min
```
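The `reward_reduction` key collapses a session's per-step rewards into one scalar per episode. A minimal sketch of the four modes (the function name is illustrative, not the PR's actual helper):

```python
# Sketch of reward_reduction semantics: sum | mean | max | min.
def reduce_rewards(rewards, mode="sum"):
    """Collapse a list of per-step rewards to a single episode-level scalar."""
    if not rewards:
        return 0.0  # no steps recorded: treat as zero reward
    ops = {
        "sum": sum,
        "mean": lambda r: sum(r) / len(r),
        "max": max,
        "min": min,
    }
    return float(ops[mode](rewards))

per_step = [0.0, 0.0, 1.0]   # e.g. a binary reward granted on the final step
episode_reward = reduce_rewards(per_step, "sum")   # 1.0
```

For a binary terminal reward like EndlessTerminals', `sum` and `max` give the same result; `mean` would scale the signal down by episode length.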

Running the demo

```bash
# Prerequisites
pip install openreward

# Set credentials
export OPENREWARD_API_KEY="..."
export WANDB_API_KEY="..."

# Launch training (Qwen3.5-2B, STEP_REINFORCE, 8 GPUs)
bash examples/agentic_demo/run_openreward_endless_terminals.sh

# Or override config params directly
python examples/start_agentic_pipeline.py \
  --config_path agentic_demo \
  --config_name openreward_endless_terminals_reinforce_qwen35_2b \
  max_steps=100 rollout_batch_size=32
```

Key design notes

  • No system prompt needed — OpenRewardEnv.reset() returns tool specs via info["tools"]; the Qwen tokenizer's Jinja2 template builds the system prompt automatically with the correct <function=name> tool-call format.
  • Graceful failure handling — session creation uses exponential backoff retries; env_reset_failed / env_timeout flags propagate to the env manager for clean episode skipping.
  • Tool-call format — supports both Qwen3.5 native XML (<function=name><parameter=key>...) and JSON fallback ({"name": ..., "arguments": {...}}). Malformed calls return a nudge message rather than crashing the episode.
  • Reward reduction — configurable (sum, mean, max, min) over per-step rewards; an optional nonterminal_reward penalty applies when the episode truncates before reaching a terminal state.
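The dual-format tool-call parsing described above can be sketched as follows. This is a simplified illustration of the two formats named in the PR, not the actual `tool_utils.py` implementation; the regexes and the `parse_tool_call` name are assumptions.

```python
import json
import re

def parse_tool_call(text):
    """Parse a model tool call: Qwen-style XML first, JSON object as fallback.

    Returns {"name": ..., "arguments": {...}} or None for malformed input
    (the real env would return a nudge message rather than crash).
    """
    # Qwen3.5 native XML: <function=name><parameter=key>value</parameter>...
    m = re.search(r"<function=(\w+)>(.*?)</function>", text, re.DOTALL)
    if m:
        name, body = m.group(1), m.group(2)
        params = dict(re.findall(
            r"<parameter=(\w+)>(.*?)</parameter>", body, re.DOTALL))
        return {"name": name, "arguments": params}
    # JSON fallback: {"name": ..., "arguments": {...}}
    try:
        call = json.loads(text)
        if isinstance(call, dict) and "name" in call:
            return call
    except json.JSONDecodeError:
        pass
    return None  # malformed: caller nudges the model instead of crashing

xml = "<function=bash><parameter=command>ls -la</parameter></function>"
parsed = parse_tool_call(xml)
# {'name': 'bash', 'arguments': {'command': 'ls -la'}}
```

Returning `None` (and nudging) instead of raising keeps one malformed generation from aborting an otherwise recoverable episode.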


CLAassistant commented Mar 25, 2026

CLA assistant check
All committers have signed the CLA.

@shamanez shamanez force-pushed the feat/openreward-integration branch from 9998905 to feb4c6d on March 25, 2026 at 16:06
shamanez and others added 2 commits March 26, 2026 03:08
Remove observability logging additions not needed for OpenReward integration.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@shamanez shamanez changed the title Feat/openreward integration feat: OpenReward environment integration Mar 25, 2026
@shamanez shamanez changed the title feat: OpenReward environment integration ROLL with OpenReward Environments Mar 25, 2026