
Support DSE metrics for vLLM #816

Merged
amaslenn merged 2 commits into main from am/cloudai-19-vllm-dse-support-requested-metrics
Feb 26, 2026

Conversation

@amaslenn
Contributor

Summary

Add several metrics to the vLLM workload to enable DSE: throughput (the default), tps-per-user, and tps-per-gpu.

Test Plan

  1. CI (extended)
  2. Manual runs

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Exports and metrics for vLLM benchmarks were extended: VLLMBenchReport gained throughput/concurrency fields and derived properties; the report generation strategy added a metrics catalog plus GPU- and metric-retrieval helpers; GPU ID computation was factored into a helper; tests were added for metrics behavior.

Changes

Cohort / File(s) — Summary

  • API Exports — src/cloudai/workloads/vllm/__init__.py
    Adds VLLMBenchReport to the module's public exports (__all__).
  • Metrics & Report Strategy — src/cloudai/workloads/vllm/report_generation_strategy.py
    Extends VLLMBenchReport with output_throughput, max_concurrency, and properties throughput, concurrency, and tps_per_user; adds a metrics catalog plus used_gpus_count() / get_metric() on VLLMBenchReportGenerationStrategy.
  • GPU ID Helper / Refactor — src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
    Introduces vllm_all_gpu_ids(tdef, system_gpus_per_node) and refactors gpu_ids to delegate GPU ID computation to it.
  • Tests — tests/workloads/vllm/test_report_gen_strategy.py
    Adds tests: a fixture with VLLMBenchReport bench data; validates can_handle_directory(), get_metric() for multiple metrics (throughput, tps-per-user, tps-per-gpu), invalid-metric handling, and GPU-count-based tps-per-gpu behavior.
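Based on the fields and properties named in the walkthrough, the extended report model might look roughly like this. The field defaults, the METRIC_ERROR sentinel value, and the zero-concurrency handling are assumptions for illustration, not the actual CloudAI implementation:

```python
from dataclasses import dataclass

METRIC_ERROR = -1.0  # assumed sentinel; CloudAI's real error value may differ


@dataclass
class VLLMBenchReport:
    """Sketch of the extended report model described in the walkthrough."""

    output_throughput: float = 0.0  # tokens/s from the vllm bench JSON output
    max_concurrency: int = 0        # number of concurrent requests in the run

    @property
    def throughput(self) -> float:
        # Derived alias over the raw benchmark field.
        return self.output_throughput

    @property
    def concurrency(self) -> int:
        return self.max_concurrency

    @property
    def tps_per_user(self) -> float:
        # Guard against division by zero when concurrency was never set.
        if self.concurrency == 0:
            return METRIC_ERROR
        return self.throughput / self.concurrency
```

With this shape, tps-per-gpu would then be computed by the report strategy, which knows the GPU count, rather than by the model itself.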

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through JSON, metrics in paw,
I counted GPUs—oh what a law!
Throughput and tps in tidy array,
Tests clap their paws and say "Hooray!" ✨

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check — Passed: the title clearly summarizes the main change, adding DSE metrics support for the vLLM workload.
  • Description check — Passed: the description directly addresses the changeset, explaining the metrics being added and the test plan.



@amaslenn amaslenn marked this pull request as ready for review February 26, 2026 10:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 96-112: The final "return METRIC_ERROR" in get_metric is
unreachable because invalid metrics return early and all valid metric branches
("default"/"throughput", "tps-per-user", "tps-per-gpu") return a value; remove
that trailing return to avoid confusion. Edit the get_metric method (references:
get_metric, METRIC_ERROR, parse_vllm_bench_output, used_gpus_count,
VLLM_BENCH_JSON_FILE) and delete the final fallback return; if you prefer
defensive coding, replace it with a clear comment explaining why it's
unreachable instead of keeping the dead return.
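The restructuring this comment suggests can be sketched as follows. The function signature, the VALID_METRICS tuple, and the METRIC_ERROR value are illustrative stand-ins for the real CloudAI code:

```python
METRIC_ERROR = -1.0  # assumed sentinel; the real constant lives in CloudAI

VALID_METRICS = ("default", "throughput", "tps-per-user", "tps-per-gpu")


def get_metric(metric: str, throughput: float, concurrency: int, gpu_count: int) -> float:
    """Validate up front so every remaining branch returns a value."""
    if metric not in VALID_METRICS:
        return METRIC_ERROR
    if metric in ("default", "throughput"):
        return throughput
    if metric == "tps-per-user":
        # Zero concurrency would divide by zero, so report an error instead.
        return throughput / concurrency if concurrency else METRIC_ERROR
    # Only "tps-per-gpu" can reach this point, so a trailing
    # `return METRIC_ERROR` after it would be unreachable dead code.
    return throughput / gpu_count
```

Because the early validation exhausts the invalid case, the final branch is total and no fallback return is needed.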

In `@tests/workloads/vllm/test_report_gen_strategy.py`:
- Around line 65-76: The test_vllm_metrics test hardcodes BENCH_DATA.throughput
/ 8 for the "tps-per-gpu" case which couples the test to an 8-GPU fixture;
update the test to follow the pattern in test_vllm_tps_per_gpu by deriving the
expected GPU count from the SlurmSystem fixture or mocking used_gpus_count
instead of assuming 8. Specifically, change the parametrized expected for
"tps-per-gpu" to compute BENCH_DATA.throughput / slurm_system.gpus_per_node (or
set a mock used_gpus_count on the VLLMBenchReportGenerationStrategy instance) so
VLLMBenchReportGenerationStrategy.get_metric and BENCH_DATA.throughput remain
consistent with the fixture.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f209d4 and c712694.

📒 Files selected for processing (4)
  • src/cloudai/workloads/vllm/__init__.py
  • src/cloudai/workloads/vllm/report_generation_strategy.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • tests/workloads/vllm/test_report_gen_strategy.py

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR successfully adds DSE metric support for vLLM workloads by introducing three metric calculations: throughput, tps-per-user, and tps-per-gpu. The implementation is clean and well-tested.

Key changes:

  • Refactored GPU ID calculation logic into a standalone function vllm_all_gpu_ids() for reusability between command generation and report generation strategies
  • Added output_throughput and max_concurrency fields to VLLMBenchReport model
  • Implemented get_metric() method with proper error handling (returns METRIC_ERROR for invalid metrics or when concurrency is zero for tps-per-user)
  • Added comprehensive test coverage including edge cases (zero concurrency, invalid metrics, different GPU counts)

The code follows existing patterns in the codebase and handles edge cases appropriately. The refactoring improves code organization without changing behavior.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Clean refactoring with comprehensive test coverage, proper error handling, and no breaking changes. The changes are well-contained to the vLLM workload module with clear intent and correct implementation
  • No files require special attention

Important Files Changed

  • src/cloudai/workloads/vllm/__init__.py — adds VLLMBenchReport to exports; straightforward and correct.
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py — refactors GPU ID logic into a reusable function; clean extraction with no behavior changes.
  • src/cloudai/workloads/vllm/report_generation_strategy.py — implements DSE metrics (throughput, tps-per-user, tps-per-gpu) with proper error handling.
  • tests/workloads/vllm/test_report_gen_strategy.py — comprehensive test coverage for all new metrics, including edge cases.

Last reviewed commit: 935b22f


@greptile-apps greptile-apps bot left a comment


4 files reviewed, no comments



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 91-94: The call to vllm_all_gpu_ids in used_gpus_count can raise
on malformed CUDA_VISIBLE_DEVICES and must be guarded so it doesn't abort metric
retrieval; wrap the call that uses vllm_all_gpu_ids(cast(VllmTestDefinition,
self.test_run.test), getattr(self.system, "gpus_per_node", None)) in a
try/except that catches the exception and returns METRIC_ERROR (use the existing
METRIC_ERROR symbol) instead of letting it bubble, and apply the same try/except
guard around the code path that computes the "tps-per-gpu" metric (the block
around lines that currently call vllm_all_gpu_ids at metric retrieval) so any
malformed GPU config results in returning METRIC_ERROR.
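The suggested guard can be sketched as follows. Here vllm_all_gpu_ids is simplified to parse a CUDA_VISIBLE_DEVICES-style string, whereas the real helper takes a test definition and a system GPU count; the METRIC_ERROR value is likewise an assumption:

```python
METRIC_ERROR = -1.0  # assumed sentinel; the real constant lives in CloudAI


def vllm_all_gpu_ids(cuda_visible_devices: str) -> list[int]:
    # Simplified stand-in: a malformed device string raises ValueError,
    # mimicking the failure mode the comment warns about.
    return [int(x) for x in cuda_visible_devices.split(",") if x]


def used_gpus_count(cuda_visible_devices: str) -> float:
    # Wrap the helper call so a malformed GPU configuration yields
    # METRIC_ERROR instead of an exception aborting metric retrieval.
    try:
        return float(len(vllm_all_gpu_ids(cuda_visible_devices)))
    except (ValueError, TypeError):
        return METRIC_ERROR
```

The same try/except would wrap the tps-per-gpu branch of get_metric, so both call sites degrade to METRIC_ERROR rather than raising.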

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c712694 and 935b22f.

📒 Files selected for processing (1)
  • src/cloudai/workloads/vllm/report_generation_strategy.py

@amaslenn amaslenn requested a review from podkidyshev February 26, 2026 12:11
@amaslenn amaslenn merged commit 9bdf826 into main Feb 26, 2026
5 checks passed
@amaslenn amaslenn deleted the am/cloudai-19-vllm-dse-support-requested-metrics branch February 26, 2026 12:37

3 participants