feat: Adds GenRM Response API Model with support for custom roles used in pairwise response comparison. (#674)
Merged
bxyu-nvidia merged 7 commits into main (Mar 4, 2026)
bxyu-nvidia requested changes (Mar 3, 2026)
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
…training workflows. (#679)

# GenRM Compare Resource Server & Cohort-Based Verify

## What does this PR do?

This PR adds a production-ready **Resource Server** for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that **rollout collection and consumer libraries (e.g. nemo-rl) stay generic**.

## Issues

- Related to PR #523 (reference).
- Part of #516.

## Summary

In RLHF, rewards are **relative to other rollouts for the same task** (e.g. the same prompt), not independent. This PR addresses that by:

- **Cohort-based verify**: the genrm_compare server's `/verify` endpoint buffers rollouts by prompt (and optional principle). When `num_rollouts_per_prompt` rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally "wait" until their cohort is complete via the async verify flow.
- **No RLHF hacks in Gym or NeMo RL**: rollout collection stays a simple "post each row to agent `/run`". The agent calls the resources server's `/verify` with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in `rollout_collection.py`.

### Key features

- **Cohort-based verify**: configurable `num_rollouts_per_prompt`; verify buffers by prompt (and principle), runs comparison when the cohort is full, and distributes rewards.
- **Batch `/compare` API**: direct comparison of N `response_objs` (e.g. for scripts or tests).
- **Pairwise comparison**: circular and all-pairs strategies; tiebreaker and length-based bonuses; optional principle-based judging.
- **GenRM model alignment**: config aligned with `genrm_model` (server name `genrm_model`; custom roles `response_1`, `response_2`, `principle`).
- **Clean boundaries**: zero GenRM-specific code in rollout collection or config types; all RLHF logic in genrm_compare.
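The cohort-based verify flow described above can be sketched with a small asyncio buffer. Everything here (class and function names, the toy length-based judge, the reward accumulation) is an illustrative assumption, not the server's actual implementation:

```python
import asyncio
from collections import defaultdict

def circular_pairs(n: int):
    """Circular pairing strategy: compare each response against its successor (wraps around)."""
    return [(i, (i + 1) % n) for i in range(n)]

class CohortBuffer:
    """Buffers /verify calls by prompt key and resolves them once the cohort is full."""

    def __init__(self, num_rollouts_per_prompt: int, judge):
        self.n = num_rollouts_per_prompt
        self.judge = judge  # judge(a, b) -> (score_a, score_b); a GenRM call in the real system
        self.pending = defaultdict(list)  # prompt_key -> [(response, future), ...]

    async def verify(self, prompt_key: str, response: str) -> float:
        if self.n <= 1:
            return 0.0  # stand-in for default_score: no comparison configured
        fut = asyncio.get_running_loop().create_future()
        cohort = self.pending[prompt_key]
        cohort.append((response, fut))
        if len(cohort) == self.n:
            del self.pending[prompt_key]  # cohort is full: compare and resolve all callers
            responses = [r for r, _ in cohort]
            rewards = [0.0] * self.n
            for i, j in circular_pairs(self.n):
                s_i, s_j = self.judge(responses[i], responses[j])
                rewards[i] += s_i
                rewards[j] += s_j
            for (_, f), reward in zip(cohort, rewards):
                f.set_result(reward)  # wake every pending /verify caller
        return await fut

async def demo():
    # Toy judge: the longer response wins (the real system queries the GenRM model).
    def judge(a, b):
        return (1.0, 0.0) if len(a) > len(b) else (0.0, 1.0)

    buf = CohortBuffer(num_rollouts_per_prompt=2, judge=judge)
    # Two callers for the same prompt; both block until the cohort is complete.
    return await asyncio.gather(
        buf.verify("prompt-A", "short answer"),
        buf.verify("prompt-A", "a considerably longer answer"),
    )

rewards = asyncio.run(demo())
print(rewards)  # → [0.0, 2.0]
```

The key property the sketch demonstrates is that no caller returns until its whole cohort has been scored, which is how the rollout-collection side stays oblivious to the relative-reward logic.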
## Architecture

```
Rollout collection
└── For each row: POST to agent /run   (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
└── /run: generate response → POST to resources server /verify
    with (params, response, optional principle)

GenRM Compare Resource Server
├── /verify (per-rollout)
│   ├── num_rollouts_per_prompt <= 1 → return default_score
│   └── num_rollouts_per_prompt > 1:
│       ├── Buffer by prompt_key (input + principle)
│       ├── When cohort size == num_rollouts_per_prompt:
│       │   ├── Run pairwise comparison (GenRM model)
│       │   ├── Aggregate scores (tiebreaker, length bonuses)
│       │   └── Resolve all N pending verify callers with their rewards
│       └── Return this rollout's reward
└── /compare (batch)
    └── Compare N response_objs; return rewards + metrics (for scripts/tests)
```

- **Config**: genrm_compare config includes `num_rollouts_per_prompt`, `genrm_model_server` (name `genrm_model`), and comparison/aggregation options. No `comparison_strategy` in the global config for rollout.
- **Data**: for RLHF, provide `num_rollouts_per_prompt` rows per prompt (e.g. via `num_repeats` when loading data).

## Testing

```bash
curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .
```

GenRM returns a response with reasoning and a final message containing JSON scores, e.g.:

```json
{
  "rewards": [1.025, 4.475],
  "comparison_results": [
    {"response_i": 0, "response_j": 1, "judge_idx": 0, "score_1": 1.0, "score_2": 5.0, "ranking": 6.0},
    {"response_i": 1, "response_j": 0, "judge_idx": 0, "score_1": 4.0, "score_2": 1.0, "ranking": 1.0}
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
```

Unit tests cover genrm_compare (verify stub when N ≤ 1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).

---------

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
bxyu-nvidia approved these changes (Mar 3, 2026)
## What does this PR do?

Adds GenRM support via a dedicated Response API Model package (`responses_api_models/genrm_model/`). The package provides a single local variant: a GenRM model that uses a locally managed vLLM server (download the model and start vLLM, e.g. via Ray).

The key design goal is to keep all GenRM-specific logic inside the model server, so that the resources server (and the base schema) can use standard OpenAI roles throughout. Role remapping to the GenRM chat-template roles (`response_1`, `response_2`, `principle`) happens as a final preprocessing step, just before the request is forwarded to vLLM.

Related to PR #523. Part of #516.
## Architecture

Package layout: `responses_api_models/genrm_model/`

### Components
- `GenRMModelConfig` — extends `LocalVLLMModelConfig` with `supports_principle_role`.
- `GenRMModelMixin` — overrides two hooks on `VLLMModel`:
  - `get_converter()` — returns a plain `VLLMConverter` (no custom converter needed).
  - `_preprocess_chat_completion_create_params()` — reads the comparison payload from `metadata` and appends the GenRM chat-template roles immediately before the vLLM call:
    - `metadata["response_1"]` → appended as a `"response_1"` message
    - `metadata["response_2"]` → appended as a `"response_2"` message
    - `metadata["principle"]` → appended as a `"principle"` message (when `supports_principle_role=True`)

    `metadata` is consumed here and not forwarded to vLLM.
- `GenRMModel` — `GenRMModelMixin` + `LocalVLLMModel`.

### Changes to `vllm_model/app.py`

`_preprocess_chat_completion_create_params(self, request, body_dict) -> Dict[str, Any]` is extracted from `chat_completions()` as an overrideable hook. The base implementation covers `replace_developer_role_with_system`, model / `chat_template_kwargs` injection, token-ID augmentation, reasoning-parser handling, and `extra_body` merging. Subclasses override it to apply model-specific transformations before the vLLM call.

### Schema (`nemo_gym/openai_utils.py`)

`NeMoGymEasyInputMessage.role` and `NeMoGymMessage.role` use standard OpenAI roles only (`user`, `assistant`, `system`, `developer`). The custom `response_1` / `response_2` / `principle` literals are not part of the request schema — they are an internal vLLM chat-template detail handled entirely within `GenRMModelMixin`.

## Usage
Config key: `genrm_model` under `responses_api_models`, with `entrypoint: app.py`. See `configs/genrm_model.yaml`.

How the resources server calls the GenRM model server: the `input` field carries only the conversation history (standard OpenAI roles). The comparison payload is passed via `metadata`, so the request schema stays generic: `GenRMModelMixin._preprocess_chat_completion_create_params` reads `metadata`, appends the custom-role messages to the conversation, and pops `metadata` before forwarding to vLLM.

## Testing
Launching a multi-node config:

We get a runnable server:

And a successful response (`input` is conversation history only; responses and principle go in `metadata`):
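The original server output is not reproduced here. For illustration only, a request following the stated convention (conversation history in `input`, comparison payload in `metadata`) might look like the following; the specific field values and the optional `principle` entry are assumptions:

```json
{
  "input": [
    {"role": "user", "content": "What is SKILL?"}
  ],
  "metadata": {
    "response_1": "SKILL is a verb meaning to kill.",
    "response_2": "Skill refers to the ability to perform a task well.",
    "principle": "Prefer factually accurate definitions."
  }
}
```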