
feat: Adds GenRM Response API Model with support for custom roles used in pairwise response comparison.#674

Merged
bxyu-nvidia merged 7 commits into main from ffrujeri/genrm-model
Mar 4, 2026

Conversation

@ffrujeri (Contributor) commented Feb 11, 2026

What does this PR do?

Adds GenRM support via a dedicated Response API Model package (responses_api_models/genrm_model/). The package provides a single local variant: a GenRM model that uses a locally managed vLLM server (downloads the model and starts vLLM, e.g. via Ray).

The key design goal is to keep all GenRM-specific logic inside the model server, so that the resources server (and the base schema) can use standard OpenAI roles throughout. Role remapping to the GenRM chat-template roles (response_1, response_2, principle) happens as a final preprocessing step, just before the request is forwarded to vLLM.

Related to PR #523. Part of #516.


Architecture

Package layout (responses_api_models/genrm_model/)

genrm_model/
├── __init__.py
├── app.py
├── pyproject.toml
├── setup.py
├── README.md
├── configs/
│   └── genrm_model.yaml
└── tests/
    └── test_app.py      # config tests + preprocessing unit tests

Components

  • GenRMModelConfig — extends LocalVLLMModelConfig with supports_principle_role.
  • GenRMModelMixin — overrides two hooks on VLLMModel:
    • get_converter() — returns a plain VLLMConverter (no custom converter needed).
    • _preprocess_chat_completion_create_params() — reads the comparison payload from
      metadata and appends the GenRM chat-template roles immediately before the vLLM call:
      • metadata["response_1"] → appended as a "response_1" message
      • metadata["response_2"] → appended as a "response_2" message
      • metadata["principle"] → appended as a "principle" message (when supports_principle_role=True)
      • metadata is consumed here and not forwarded to vLLM.
  • GenRMModel — GenRMModelMixin + LocalVLLMModel.
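The metadata-to-custom-role remapping described above can be sketched as follows. This is a minimal illustration based on the PR description; the helper name `preprocess_genrm_params` is hypothetical, and the real hook lives on the mixin with the signature shown in the next section:

```python
from typing import Any, Dict, List


def preprocess_genrm_params(body_dict: Dict[str, Any],
                            supports_principle_role: bool = True) -> Dict[str, Any]:
    """Sketch of GenRMModelMixin._preprocess_chat_completion_create_params:
    move the comparison payload from `metadata` into custom-role messages."""
    # Pop metadata so it is consumed here and never forwarded to vLLM.
    metadata = body_dict.pop("metadata", None) or {}
    messages: List[Dict[str, Any]] = body_dict.setdefault("messages", [])

    # Append the GenRM chat-template roles in a fixed order.
    for key in ("response_1", "response_2"):
        if key in metadata:
            messages.append({"role": key, "content": metadata[key]})
    if supports_principle_role and "principle" in metadata:
        messages.append({"role": "principle", "content": metadata["principle"]})
    return body_dict
```

Because the remapping runs immediately before the vLLM call, everything upstream of the model server only ever sees standard OpenAI roles.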

Changes to vllm_model/app.py

_preprocess_chat_completion_create_params(self, request, body_dict) -> Dict[str, Any] is extracted from chat_completions() as an overridable hook. The base implementation covers replace_developer_role_with_system, model / chat_template_kwargs injection, token-ID augmentation, reasoning-parser handling, and extra_body merging. Subclasses override it to apply model-specific transformations before the vLLM call.
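The shape of this hook pattern, heavily simplified: the real base implementation in vllm_model/app.py does much more than the stub below (role replacement, token-ID augmentation, reasoning-parser handling, extra_body merging), and the class name `VLLMModelSketch` is illustrative only:

```python
from typing import Any, Dict


class VLLMModelSketch:
    """Illustrative shape of the hook extracted from chat_completions()."""

    model = "unknown"

    def _preprocess_chat_completion_create_params(
        self, request: Any, body_dict: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Base stub: inject the served model name if absent.  Subclasses
        # (e.g. GenRMModelMixin) override this to apply model-specific
        # transformations, then call super() for the shared behavior.
        body_dict.setdefault("model", self.model)
        return body_dict

    def chat_completions(self, request: Any, body_dict: Dict[str, Any]) -> Dict[str, Any]:
        body_dict = self._preprocess_chat_completion_create_params(request, body_dict)
        # ... forward body_dict to vLLM here ...
        return body_dict
```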

Schema (nemo_gym/openai_utils.py)

NeMoGymEasyInputMessage.role and NeMoGymMessage.role use standard OpenAI roles only (user, assistant, system, developer). The custom response_1 / response_2 / principle literals are not part of the request schema — they are an internal vLLM chat-template detail handled entirely within GenRMModelMixin.
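A sketch of that schema constraint, assuming a `Literal`-based role type as is common in OpenAI-style schemas (the validator function here is hypothetical, not the actual nemo_gym/openai_utils.py code):

```python
from typing import Literal, get_args

# Only standard OpenAI roles are valid on the wire; the GenRM
# chat-template roles (response_1 / response_2 / principle) never
# appear in the request schema.
OpenAIRole = Literal["user", "assistant", "system", "developer"]


def validate_role(role: str) -> str:
    """Reject anything outside the standard OpenAI role set."""
    if role not in get_args(OpenAIRole):
        raise ValueError(f"unsupported role: {role!r}")
    return role
```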


Usage

Config key: genrm_model under responses_api_models, with entrypoint: app.py. See configs/genrm_model.yaml.

from responses_api_models.genrm_model.app import GenRMModel, GenRMModelConfig

How the resources server calls the GenRM model server:

The input field carries only the conversation history (standard OpenAI roles). The comparison
payload is passed via metadata so the request schema stays generic:

responses_create_params.input = conversation_history_messages   # user / assistant turns only
responses_create_params.metadata = {
    "response_1": response_1_text,
    "response_2": response_2_text,
    "principle":  principle_text,   # omit key when use_principle=False
}

GenRMModelMixin._preprocess_chat_completion_create_params reads metadata, appends the
custom-role messages to the conversation, and pops metadata before forwarding to vLLM.
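The client-side assembly shown above can be sketched as a small helper (the function name `build_genrm_request` is hypothetical; the field layout follows the PR description):

```python
from typing import Any, Dict, List, Optional


def build_genrm_request(conversation: List[Dict[str, str]],
                        response_1: str,
                        response_2: str,
                        principle: Optional[str] = None) -> Dict[str, Any]:
    """Assemble responses-create params: standard OpenAI roles in `input`,
    the comparison payload in `metadata`."""
    metadata = {"response_1": response_1, "response_2": response_2}
    if principle is not None:  # omit the key when use_principle=False
        metadata["principle"] = principle
    return {"input": conversation, "metadata": metadata}
```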


Testing

Launching a multi-node config:

ng_run "+config_paths=[responses_api_models/genrm_model/configs/genrm_model.yaml]"

The contents of configs/genrm_model.yaml:
genrm_model:
  responses_api_models:
    genrm_model:
      entrypoint: app.py
      model: <MODEL_PATH>
      uses_reasoning_parser: true
      return_token_id_information: false
      supports_principle_role: true
      debug: true
      hf_home: null

      vllm_serve_env_vars:
        VLLM_RAY_DP_PACK_STRATEGY: strict

      vllm_serve_kwargs:
        tensor_parallel_size: 8
        data_parallel_size: 2
        data_parallel_size_local: 1
        pipeline_parallel_size: 1
        reasoning_parser: deepseek_r1
        gpu_memory_utilization: 0.85
        max_model_len: 60000
        model_loader_extra_config:
          enable_multithread_load: true
          num_threads: 108

We get a runnable server:

All 1 / 1 servers ready! Polling every 60s

####################################################################################################
#
# Server Instances
#
####################################################################################################

[1] genrm_model (responses_api_models/genrm_model)
{
    'config_path': 'genrm_model',
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'name': 'genrm_model',
    'port': 11093,
    'server_type': 'responses_api_models',
    'url': 'http://127.0.0.1:11093',
}
####################################################################################################

And a successful response (input is conversation history only; responses and principle go in metadata):

curl -s -X POST http://127.0.0.1:11093/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      {"role": "user", "content": "What is the capital of France?", "type": "message"}
    ],
    "metadata": {
      "principle":  "Judge which response is better.",
      "response_1": "The capital of France is Paris.",
      "response_2": "Paris is the capital city of France."
    },
    "temperature": 0.0,
    "max_output_tokens": 512
  }'


@ffrujeri ffrujeri changed the title feat: Adds GenRM (Generative Reward Model) Response API Model with support for custom roles used in pairwise response comparison. feat: Adds GenRM Response API Model with support for custom roles used in pairwise response comparison. Feb 12, 2026
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from c22967f to dd172a0 on February 18, 2026 02:33
@ffrujeri ffrujeri changed the base branch from main to ffrujeri/multi-node-local-vllm February 18, 2026 02:36
@ffrujeri ffrujeri marked this pull request as ready for review February 18, 2026 02:38
@ffrujeri ffrujeri requested a review from a team as a code owner February 18, 2026 02:38
@bxyu-nvidia bxyu-nvidia linked an issue Feb 18, 2026 that may be closed by this pull request
@ffrujeri ffrujeri force-pushed the ffrujeri/multi-node-local-vllm branch from 7d3a839 to 2971f31 on February 18, 2026 16:58
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from dd172a0 to 7458914 on February 18, 2026 16:58
@ffrujeri ffrujeri force-pushed the ffrujeri/multi-node-local-vllm branch from 2971f31 to ce4168f on February 28, 2026 18:55
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 7458914 to 588a33b on March 3, 2026 00:08
@ffrujeri ffrujeri changed the base branch from ffrujeri/multi-node-local-vllm to bxyu/rollout-collection-infra March 3, 2026 00:08
Base automatically changed from bxyu/rollout-collection-infra to main March 3, 2026 04:10
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch 2 times, most recently from 4fc3148 to 6138812 on March 3, 2026 19:53
ffrujeri added 5 commits March 3, 2026 22:55
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 90f96d3 to ca71797 on March 3, 2026 22:55
ffrujeri added 2 commits March 3, 2026 23:04
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
…training workflows. (#679)

# GenRM Compare Resource Server & Cohort-Based Verify

## What does this PR do?

This PR adds a production-ready **Resource Server** for comparing
multiple candidate responses using GenRM models, and moves RLHF-specific
reward logic (cohort buffering and comparison) into that server so that
**rollout collection and consumer libraries (e.g. nemo-rl) stay
generic**.

## Issues

- Related to PR #523 (reference).
- Part of #516.

## Summary

In RLHF, rewards are **relative to other rollouts for the same task**
(e.g. same prompt), not independent. This PR addresses that by:

- **Cohort-based verify**: The genrm_compare server’s `/verify` endpoint
buffers rollouts by prompt (and optional principle). When
`num_rollouts_per_prompt` rollouts have been received for a prompt, it
runs pairwise comparison, aggregates scores, and returns the appropriate
reward to each of the N callers. Callers naturally “wait” until their
cohort is complete via the async verify flow.
- **No RLHF hacks in Gym or NeMo RL**: Rollout collection stays a simple
“post each row to agent `/run`”. The agent calls the resources server’s
`/verify` with the response; genrm_compare owns all buffering and
comparison. No comparison strategy or prompt buffering in
`rollout_collection.py`.
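The cohort flow above can be sketched in simplified, synchronous form. The real `/verify` endpoint resolves each async caller individually; this illustrative class (name `CohortBufferSketch` is hypothetical) instead returns the whole cohort's rewards to the caller that completes it, and uses placeholder scoring in place of the GenRM pairwise comparison:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class CohortBufferSketch:
    """Sketch of cohort-based verify: buffer rollouts per prompt key,
    score the cohort once it is full, then release all rewards."""

    def __init__(self, num_rollouts_per_prompt: int, default_score: float = 0.0):
        self.n = num_rollouts_per_prompt
        self.default_score = default_score
        self.buffers: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def verify(self, prompt: str, principle: str, response: str) -> Optional[List[float]]:
        if self.n <= 1:
            return [self.default_score]  # no comparison possible; stub reward
        key = (prompt, principle)  # prompt_key = input + principle
        self.buffers[key].append(response)
        if len(self.buffers[key]) < self.n:
            return None  # caller waits until the cohort completes
        cohort = self.buffers.pop(key)
        # Placeholder scoring: the real server runs pairwise GenRM
        # comparison and aggregation here.
        return [float(len(r)) for r in cohort]
```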

### Key features

- **Cohort-based verify**: Configurable `num_rollouts_per_prompt`;
verify buffers by prompt (and principle), runs comparison when cohort is
full, distributes rewards.
- **Batch `/compare` API**: Direct comparison of N `response_objs` (e.g.
for scripts or tests).
- **Pairwise comparison**: Circular and all-pairs strategies; tiebreaker
and length-based bonuses; optional principle-based judging.
- **GenRM model alignment**: Config aligned with `genrm_model` (server
name `genrm_model`; custom roles `response_1`, `response_2`,
`principle`).
- **Clean boundaries**: Zero GenRM-specific code in rollout collection
or config types; all RLHF logic in genrm_compare.
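The two pairing strategies named above can be sketched as index generation over the candidate list (the function name `comparison_pairs` and the strategy string identifiers are assumptions for illustration; only the strategies themselves are from the PR):

```python
from typing import List, Tuple


def comparison_pairs(n: int, strategy: str = "circular") -> List[Tuple[int, int]]:
    """Return ordered (response_i, response_j) index pairs to judge."""
    if strategy == "circular":
        # Each response is judged once against its successor, wrapping
        # around, so every response appears in exactly two comparisons.
        return [(i, (i + 1) % n) for i in range(n)]
    if strategy == "all_pairs":
        # Every ordered pair in both directions, which helps average out
        # the judge's position bias at quadratic cost.
        return [(i, j) for i in range(n) for j in range(n) if i != j]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Note the example output later in this PR shows both (0, 1) and (1, 0) being judged, consistent with position-debiased pairwise comparison.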

## Architecture

```
Rollout collection
    └── For each row: POST to agent /run  (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
    └── /run: generate response → POST to resources server /verify with (params, response, optional principle)

GenRM Compare Resource Server
    ├── /verify (per-rollout)
    │   ├── num_rollouts_per_prompt <= 1 → return default_score
    │   └── num_rollouts_per_prompt > 1:
    │       ├── Buffer by prompt_key (input + principle)
    │       ├── When cohort size == num_rollouts_per_prompt:
    │       │   ├── Run pairwise comparison (GenRM model)
    │       │   ├── Aggregate scores (tiebreaker, length bonuses)
    │       │   └── Resolve all N pending verify callers with their rewards
    │       └── Return this rollout’s reward
    └── /compare (batch)
        └── Compare N response_objs; return rewards + metrics (for scripts/tests)
```

- **Config**: genrm_compare config includes `num_rollouts_per_prompt`,
`genrm_model_server` (name `genrm_model`), and comparison/aggregation
options. No `comparison_strategy` in global config for rollout.
- **Data**: For RLHF, provide `num_rollouts_per_prompt` rows per prompt
(e.g. via `num_repeats` when loading data).

## Testing

```bash
curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .
```

GenRM returns a response with reasoning and a final message containing
JSON scores, e.g.:

```json
{
  "rewards": [
    1.025,
    4.475
  ],
  "comparison_results": [
    {
      "response_i": 0,
      "response_j": 1,
      "judge_idx": 0,
      "score_1": 1.0,
      "score_2": 5.0,
      "ranking": 6.0
    },
    {
      "response_i": 1,
      "response_j": 0,
      "judge_idx": 0,
      "score_1": 4.0,
      "score_2": 1.0,
      "ranking": 1.0
    }
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
```
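Since the GenRM completion mixes free-form reasoning with a final JSON score object, the server must extract the scores from the text. A minimal sketch of that parsing step, assuming a flat `{"score_1": ..., "score_2": ...}` object as in the output above (the function name and the last-object heuristic are assumptions, not the actual utils implementation):

```python
import json
import re
from typing import Optional, Tuple


def parse_judge_scores(text: str) -> Optional[Tuple[float, float]]:
    """Pull the last flat JSON object with score_1/score_2 out of a
    completion that may also contain free-form reasoning."""
    last = None
    for match in re.finditer(r"\{[^{}]*\}", text):
        last = match  # keep only the final JSON-looking object
    if last is None:
        return None
    try:
        obj = json.loads(last.group(0))
        return float(obj["score_1"]), float(obj["score_2"])
    except (ValueError, KeyError):
        return None  # malformed or incomplete judge output
```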

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic,
compare), utils (prompt key, parsing, aggregation), and
comparison_strategies (batch client and helpers).

---------

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@bxyu-nvidia bxyu-nvidia merged commit 0c62ff0 into main Mar 4, 2026
5 checks passed
@bxyu-nvidia bxyu-nvidia deleted the ffrujeri/genrm-model branch March 4, 2026 00:02


Development

Successfully merging this pull request may close these issues.

feat: Reward model support
