
fix: Add patch for placement groups in local_vllm_model. #694

Closed
ffrujeri wants to merge 3 commits into main from ffrujeri/multi-node-local-vllm

Conversation


@ffrujeri ffrujeri commented Feb 13, 2026

What does this PR do?

Improves LocalVLLMModel robustness when launching multiple vLLM response API servers concurrently on Ray clusters.

This PR addresses multiple failure modes seen during concurrent startup:

  1. vLLM DP placement group over-allocation (creating more than dp_size groups)
  2. Ray node resource key parsing incompatibility (node:<ip>_group_* synthetic keys)
  3. startup-time scheduling races between multiple LocalVLLMModel instances
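Failure mode (2) can be illustrated with a small sketch. This is not the patch itself, only the filtering idea it describes: keep the plain node:<ip> keys and skip the synthetic node:<ip>_group_* entries that newer Ray versions add for placement groups (`canonical_node_keys` is a hypothetical helper name).

```python
# Sketch of the node-key filtering idea (hypothetical helper, not the patch).
# Ray resource maps can contain synthetic placement-group keys such as
# "node:<ip>_group_*" alongside the plain "node:<ip>" key; code that assumes
# exactly one "node:*" entry breaks on them.

def canonical_node_keys(resources: dict) -> list[str]:
    """Return only plain node:<ip> keys, skipping synthetic _group_ keys."""
    return [
        key for key in resources
        if key.startswith("node:") and "_group_" not in key
    ]

resources = {
    "CPU": 8.0,
    "GPU": 8.0,
    "node:10.0.0.1": 1.0,                    # canonical node key
    "node:10.0.0.1_group_0_abcd1234": 1.0,   # synthetic placement-group key
}
print(canonical_node_keys(resources))  # ['node:10.0.0.1']
```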

Issues

Fixes https://github.com/NVIDIA-NeMo/Internal-Planning/issues/148


Usage

No user-facing API changes.

Existing LocalVLLMModel configs continue to work as-is.
Behavioral change: startup of local vLLM servers is now serialized via a Ray-wide lock to avoid placement races when multiple models are launched at the same time.


Additional Information

  • Root causes

    • vLLM v1 create_dp_placement_groups can continue outer-loop iteration after reaching dp_size, leading to over-creation and an assertion failure.
    • Ray resource maps may include multiple node:* keys (e.g. node:<ip>_group_*), which broke code assuming a single node key.
    • Two independent LocalVLLMModelActors can race during placement-group creation against the same shared Ray cluster resources.
  • Fixes in this PR

    • Added/updated responses_api_models/local_vllm_model/vllm_patches.py patch for DP placement:
      • stops allocation once dp_size placement groups are created (breaks outer loop)
      • handles Ray synthetic node keys by selecting canonical node IP keys
      • supports strict/headless-compatible allocation paths safely
    • Updated responses_api_models/local_vllm_model/app.py:
      • improved head-node selection using a heuristic based on the required local GPU capacity
      • added a detached Ray startup lock actor to serialize local vLLM server startup across concurrent model launches
      • lock is acquired before actor startup and released in finally for safety
  • Net effect

    • Concurrent startup of multiple response API model servers is significantly more reliable.
    • DP placement logic is resilient to newer Ray node-resource key formats.
    • No config migration required.

  • Testing
config_paths="responses_api_models/local_vllm_model/configs/openai/gpt-oss-20b-reasoning-high.yaml,\
responses_api_models/local_vllm_model/configs/openai/gpt-oss-120b-reasoning-high.yaml"
ng_run "+config_paths=[${config_paths}]"
[1] gpt-oss-20b-reasoning-high (responses_api_models/local_vllm_model)
{
    'config_path': 'gpt-oss-20b-reasoning-high',
    'dir_path': (
        '/scratch/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/ffrujeri/Gym/responses_api_mo'
        'dels/local_vllm_model'
    ),
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'name': 'local_vllm_model',
    'pid': 432161,
    'port': 13538,
    'process_name': 'gpt-oss-20b-reasoning-high',
    'server_type': 'responses_api_models',
    'url': 'http://127.0.0.1:13538',
}
[2] gpt-oss-120b-reasoning-high (responses_api_models/local_vllm_model)
{
    'config_path': 'gpt-oss-120b-reasoning-high',
    'dir_path': (
        '/scratch/fsw/portfolios/llmservice/projects/llmservice_modelalignment_ppo/users/ffrujeri/Gym/responses_api_mo'
        'dels/local_vllm_model'
    ),
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'name': 'local_vllm_model',
    'pid': 432162,
    'port': 18253,
    'process_name': 'gpt-oss-120b-reasoning-high',
    'server_type': 'responses_api_models',
    'url': 'http://127.0.0.1:18253',
}
####################################################################################################
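The over-allocation guard in the vllm_patches.py change can be illustrated abstractly. The sketch below is not vLLM's actual create_dp_placement_groups code; it only shows the shape of the fix: once dp_size groups exist, the inner per-node allocation stops and, crucially, the outer node loop stops too.

```python
# Abstract sketch of the DP placement-group guard (not vLLM's real code).
# Without the outer-loop break, iteration over the remaining nodes can create
# more than dp_size groups and trip vLLM's assertion.

def allocate_dp_groups(nodes, dp_size, gpus_per_group):
    """nodes: [{'ip': str, 'gpus': int}]; returns at most dp_size groups."""
    groups = []
    for node in nodes:                      # outer loop over cluster nodes
        free = node["gpus"]
        while free >= gpus_per_group and len(groups) < dp_size:
            groups.append({"node": node["ip"], "gpus": gpus_per_group})
            free -= gpus_per_group
        if len(groups) >= dp_size:
            break                           # the fix: stop the outer loop too
    return groups

nodes = [{"ip": "10.0.0.1", "gpus": 8}, {"ip": "10.0.0.2", "gpus": 8}]
print(len(allocate_dp_groups(nodes, dp_size=2, gpus_per_group=4)))  # 2
```

With the guard, a cluster that has spare capacity on later nodes still yields exactly dp_size groups instead of one per schedulable slot.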

copy-pr-bot bot commented Feb 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ffrujeri ffrujeri changed the title from "Add patch for placement groups in local_vllm_model." to "fix: Add patch for placement groups in local_vllm_model." Feb 17, 2026
@ffrujeri ffrujeri marked this pull request as ready for review February 18, 2026 02:38
@ffrujeri ffrujeri force-pushed the ffrujeri/multi-node-local-vllm branch from 7d3a839 to 2971f31 on February 18, 2026 16:58
@ffrujeri ffrujeri requested a review from bxyu-nvidia February 28, 2026 00:43
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/multi-node-local-vllm branch from 2971f31 to ce4168f on February 28, 2026 18:55
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri closed this Mar 3, 2026