
Support DSE metrics for vLLM #816

Merged
amaslenn merged 2 commits into main from am/cloudai-19-vllm-dse-support-requested-metrics
Feb 26, 2026

Conversation

@amaslenn
Contributor

Summary

Add several metrics to the vLLM workload to enable DSE: throughput (the default), tps-per-user, and tps-per-gpu.

Test Plan

  1. CI (extended)
  2. Manual runs

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Exports and metrics for vLLM benchmarks were extended: VLLMBenchReport gained throughput/concurrency fields and derived properties; the report generation strategy added a metrics catalog plus GPU- and metric-retrieval helpers; GPU ID computation was factored into a helper; tests were added for metrics behavior.

Changes

Cohort / File(s) — Summary

  • API Exports — src/cloudai/workloads/vllm/__init__.py
    Adds VLLMBenchReport to the module's public exports (__all__).
  • Metrics & Report Strategy — src/cloudai/workloads/vllm/report_generation_strategy.py
    Extends VLLMBenchReport with output_throughput, max_concurrency, and properties throughput, concurrency, and tps_per_user; adds a metrics catalog plus used_gpus_count() / get_metric() on VLLMBenchReportGenerationStrategy.
  • GPU ID Helper / Refactor — src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
    Introduces vllm_all_gpu_ids(tdef, system_gpus_per_node) and refactors gpu_ids to delegate GPU ID computation to it.
  • Tests — tests/workloads/vllm/test_report_gen_strategy.py
    Adds tests: a fixture with VLLMBenchReport bench data; validates can_handle_directory(), get_metric() for multiple metrics (throughput, tps-per-user, tps-per-gpu), invalid-metric handling, and GPU-count-based tps-per-gpu behavior.
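Based on the fields and properties named in the walkthrough, the extended report model might look roughly like this. The field defaults, the METRIC_ERROR sentinel value, and the zero-concurrency handling are assumptions for illustration, not the actual CloudAI implementation:

```python
from dataclasses import dataclass

METRIC_ERROR = -1.0  # assumed sentinel; CloudAI's real error value may differ


@dataclass
class VLLMBenchReport:
    """Sketch of the extended report model described in the walkthrough."""

    output_throughput: float = 0.0  # tokens/s from the vllm bench JSON output
    max_concurrency: int = 0        # number of concurrent requests in the run

    @property
    def throughput(self) -> float:
        # Derived alias over the raw benchmark field.
        return self.output_throughput

    @property
    def concurrency(self) -> int:
        return self.max_concurrency

    @property
    def tps_per_user(self) -> float:
        # Guard against division by zero when concurrency was never set.
        if self.concurrency == 0:
            return METRIC_ERROR
        return self.throughput / self.concurrency
```

With this shape, tps-per-gpu would then be computed by the report strategy, which knows the GPU count, rather than by the model itself.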

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through JSON, metrics in paw,
I counted GPUs—oh what a law!
Throughput and tps in tidy array,
Tests clap their paws and say "Hooray!" ✨

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check — Passed: the title clearly summarizes the main change, adding DSE metrics support for the vLLM workload.
  • Description check — Passed: the description directly addresses the changeset, explaining the metrics being added and the test plan.



@amaslenn amaslenn marked this pull request as ready for review February 26, 2026 10:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 96-112: The final "return METRIC_ERROR" in get_metric is
unreachable because invalid metrics return early and all valid metric branches
("default"/"throughput", "tps-per-user", "tps-per-gpu") return a value; remove
that trailing return to avoid confusion. Edit the get_metric method (references:
get_metric, METRIC_ERROR, parse_vllm_bench_output, used_gpus_count,
VLLM_BENCH_JSON_FILE) and delete the final fallback return; if you prefer
defensive coding, replace it with a clear comment explaining why it's
unreachable instead of keeping the dead return.
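The restructuring this comment suggests can be sketched as follows. The function signature, the VALID_METRICS tuple, and the METRIC_ERROR value are illustrative stand-ins for the real CloudAI code:

```python
METRIC_ERROR = -1.0  # assumed sentinel; the real constant lives in CloudAI

VALID_METRICS = ("default", "throughput", "tps-per-user", "tps-per-gpu")


def get_metric(metric: str, throughput: float, concurrency: int, gpu_count: int) -> float:
    """Validate up front so every remaining branch returns a value."""
    if metric not in VALID_METRICS:
        return METRIC_ERROR
    if metric in ("default", "throughput"):
        return throughput
    if metric == "tps-per-user":
        # Zero concurrency would divide by zero, so report an error instead.
        return throughput / concurrency if concurrency else METRIC_ERROR
    # Only "tps-per-gpu" can reach this point, so a trailing
    # `return METRIC_ERROR` after it would be unreachable dead code.
    return throughput / gpu_count
```

Because the early validation exhausts the invalid case, the final branch is total and no fallback return is needed.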

In `@tests/workloads/vllm/test_report_gen_strategy.py`:
- Around line 65-76: The test_vllm_metrics test hardcodes BENCH_DATA.throughput
/ 8 for the "tps-per-gpu" case which couples the test to an 8-GPU fixture;
update the test to follow the pattern in test_vllm_tps_per_gpu by deriving the
expected GPU count from the SlurmSystem fixture or mocking used_gpus_count
instead of assuming 8. Specifically, change the parametrized expected for
"tps-per-gpu" to compute BENCH_DATA.throughput / slurm_system.gpus_per_node (or
set a mock used_gpus_count on the VLLMBenchReportGenerationStrategy instance) so
VLLMBenchReportGenerationStrategy.get_metric and BENCH_DATA.throughput remain
consistent with the fixture.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f209d4 and c712694.

📒 Files selected for processing (4)
  • src/cloudai/workloads/vllm/__init__.py
  • src/cloudai/workloads/vllm/report_generation_strategy.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • tests/workloads/vllm/test_report_gen_strategy.py

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR successfully adds DSE metric support for vLLM workloads by introducing three metric calculations: throughput, tps-per-user, and tps-per-gpu. The implementation is clean and well-tested.

Key changes:

  • Refactored GPU ID calculation logic into a standalone function vllm_all_gpu_ids() for reusability between command generation and report generation strategies
  • Added output_throughput and max_concurrency fields to VLLMBenchReport model
  • Implemented get_metric() method with proper error handling (returns METRIC_ERROR for invalid metrics or when concurrency is zero for tps-per-user)
  • Added comprehensive test coverage including edge cases (zero concurrency, invalid metrics, different GPU counts)

The code follows existing patterns in the codebase and handles edge cases appropriately. The refactoring improves code organization without changing behavior.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Clean refactoring with comprehensive test coverage, proper error handling, and no breaking changes. The changes are well-contained to the vLLM workload module with clear intent and correct implementation
  • No files require special attention

Important Files Changed

  • src/cloudai/workloads/vllm/__init__.py — adds VLLMBenchReport to exports; straightforward and correct.
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py — refactors GPU ID logic into a reusable function; clean extraction with no behavior changes.
  • src/cloudai/workloads/vllm/report_generation_strategy.py — implements DSE metrics (throughput, tps-per-user, tps-per-gpu) with proper error handling.
  • tests/workloads/vllm/test_report_gen_strategy.py — comprehensive test coverage for all new metrics, including edge cases.

Last reviewed commit: 935b22f


@greptile-apps greptile-apps bot left a comment


4 files reviewed, no comments



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 91-94: The call to vllm_all_gpu_ids in used_gpus_count can raise
on malformed CUDA_VISIBLE_DEVICES and must be guarded so it doesn't abort metric
retrieval; wrap the call that uses vllm_all_gpu_ids(cast(VllmTestDefinition,
self.test_run.test), getattr(self.system, "gpus_per_node", None)) in a
try/except that catches the exception and returns METRIC_ERROR (use the existing
METRIC_ERROR symbol) instead of letting it bubble, and apply the same try/except
guard around the code path that computes the "tps-per-gpu" metric (the block
around lines that currently call vllm_all_gpu_ids at metric retrieval) so any
malformed GPU config results in returning METRIC_ERROR.
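The suggested guard can be sketched as follows. Here vllm_all_gpu_ids is simplified to parse a CUDA_VISIBLE_DEVICES-style string, whereas the real helper takes a test definition and a system GPU count; the METRIC_ERROR value is likewise an assumption:

```python
METRIC_ERROR = -1.0  # assumed sentinel; the real constant lives in CloudAI


def vllm_all_gpu_ids(cuda_visible_devices: str) -> list[int]:
    # Simplified stand-in: a malformed device string raises ValueError,
    # mimicking the failure mode the comment warns about.
    return [int(x) for x in cuda_visible_devices.split(",") if x]


def used_gpus_count(cuda_visible_devices: str) -> float:
    # Wrap the helper call so a malformed GPU configuration yields
    # METRIC_ERROR instead of an exception aborting metric retrieval.
    try:
        return float(len(vllm_all_gpu_ids(cuda_visible_devices)))
    except (ValueError, TypeError):
        return METRIC_ERROR
```

The same try/except would wrap the tps-per-gpu branch of get_metric, so both call sites degrade to METRIC_ERROR rather than raising.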

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c712694 and 935b22f.

📒 Files selected for processing (1)
  • src/cloudai/workloads/vllm/report_generation_strategy.py

@amaslenn amaslenn requested a review from podkidyshev February 26, 2026 12:11
@amaslenn amaslenn merged commit 9bdf826 into main Feb 26, 2026
5 checks passed
@amaslenn amaslenn deleted the am/cloudai-19-vllm-dse-support-requested-metrics branch February 26, 2026 12:37

3 participants