📝 Walkthrough

Exports and metrics for vLLM benchmarks were extended.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 96-112: The final "return METRIC_ERROR" in get_metric is
unreachable because invalid metrics return early and all valid metric branches
("default"/"throughput", "tps-per-user", "tps-per-gpu") return a value; remove
that trailing return to avoid confusion. Edit the get_metric method (references:
get_metric, METRIC_ERROR, parse_vllm_bench_output, used_gpus_count,
VLLM_BENCH_JSON_FILE) and delete the final fallback return; if you prefer
defensive coding, replace it with a clear comment explaining why it's
unreachable instead of keeping the dead return.
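The shape the reviewer describes can be sketched as follows. This is a hypothetical, simplified stand-in for `get_metric` (the real method reads `parse_vllm_bench_output` and `used_gpus_count`; the signature and sentinel value below are assumptions, not the repo's actual code):

```python
# Hypothetical sketch: invalid metrics return early, and every valid
# branch returns a value, so no trailing fallback return is needed.
METRIC_ERROR = -1.0  # assumed sentinel; the repo's METRIC_ERROR may differ

VALID_METRICS = {"default", "throughput", "tps-per-user", "tps-per-gpu"}


def get_metric(metric: str, throughput: float, num_users: int, num_gpus: int) -> float:
    if metric not in VALID_METRICS:
        return METRIC_ERROR  # early return for invalid metrics
    if metric in ("default", "throughput"):
        return throughput
    if metric == "tps-per-user":
        return throughput / num_users
    # The only remaining valid metric is "tps-per-gpu", so this branch is
    # exhaustive; a final `return METRIC_ERROR` after it would be dead code.
    return throughput / num_gpus
```

With the branches exhaustive over `VALID_METRICS`, a trailing `return METRIC_ERROR` can never execute, which is the confusion the reviewer wants removed.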
In `@tests/workloads/vllm/test_report_gen_strategy.py`:
- Around line 65-76: The test_vllm_metrics test hardcodes BENCH_DATA.throughput
/ 8 for the "tps-per-gpu" case which couples the test to an 8-GPU fixture;
update the test to follow the pattern in test_vllm_tps_per_gpu by deriving the
expected GPU count from the SlurmSystem fixture or mocking used_gpus_count
instead of assuming 8. Specifically, change the parametrized expected for
"tps-per-gpu" to compute BENCH_DATA.throughput / slurm_system.gpus_per_node (or
set a mock used_gpus_count on the VLLMBenchReportGenerationStrategy instance) so
VLLMBenchReportGenerationStrategy.get_metric and BENCH_DATA.throughput remain
consistent with the fixture.
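A minimal sketch of the suggested fix, in plain Python: the expected "tps-per-gpu" value is derived from the system fixture rather than a hardcoded `/ 8`. The names (`gpus_per_node`, the `BENCH_DATA.throughput` stand-in) mirror the review text and are not verified against the repo:

```python
# Hypothetical stand-ins for the SlurmSystem fixture and BENCH_DATA.
from dataclasses import dataclass


@dataclass
class FakeSlurmSystem:
    gpus_per_node: int = 8  # stand-in for the real fixture's GPU count


@dataclass
class FakeBenchData:
    throughput: float = 1600.0  # stand-in for BENCH_DATA.throughput


def expected_tps_per_gpu(bench: FakeBenchData, system: FakeSlurmSystem) -> float:
    # Deriving the divisor from the fixture keeps the expectation valid
    # even if the fixture's GPU count changes later.
    return bench.throughput / system.gpus_per_node
```

The same idea carries over to the parametrized pytest case: compute the expected value from `slurm_system.gpus_per_node` (or mock `used_gpus_count`) instead of baking in `8`.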
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- src/cloudai/workloads/vllm/__init__.py
- src/cloudai/workloads/vllm/report_generation_strategy.py
- src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
- tests/workloads/vllm/test_report_gen_strategy.py
Greptile Summary

This PR adds DSE metric support for vLLM workloads by introducing three metric calculations: default (== throughput), tps-per-user, and tps-per-gpu.

The code follows existing patterns in the codebase and handles edge cases appropriately. The refactoring improves code organization without changing behavior.

Confidence Score: 5/5
Important Files Changed
Last reviewed commit: 935b22f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 91-94: The call to vllm_all_gpu_ids in used_gpus_count can raise
on malformed CUDA_VISIBLE_DEVICES and must be guarded so it doesn't abort metric
retrieval; wrap the call that uses vllm_all_gpu_ids(cast(VllmTestDefinition,
self.test_run.test), getattr(self.system, "gpus_per_node", None)) in a
try/except that catches the exception and returns METRIC_ERROR (use the existing
METRIC_ERROR symbol) instead of letting it bubble, and apply the same try/except
guard around the code path that computes the "tps-per-gpu" metric (the block
around lines that currently call vllm_all_gpu_ids at metric retrieval) so any
malformed GPU config results in returning METRIC_ERROR.
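The guard the reviewer asks for can be sketched like this. `parse_gpu_ids` below is a hypothetical stand-in for `vllm_all_gpu_ids` (the real helper's signature and exception types are assumptions taken from the review text):

```python
METRIC_ERROR = -1.0  # assumed sentinel; the repo's METRIC_ERROR may differ


def parse_gpu_ids(cuda_visible_devices: str) -> list[int]:
    # Stand-in for vllm_all_gpu_ids: raises ValueError on malformed input
    # such as "0,x" or an empty string.
    return [int(tok) for tok in cuda_visible_devices.split(",")]


def tps_per_gpu(throughput: float, cuda_visible_devices: str) -> float:
    try:
        gpus = parse_gpu_ids(cuda_visible_devices)
    except (ValueError, TypeError):
        # Malformed CUDA_VISIBLE_DEVICES must not abort metric retrieval;
        # surface the failure as METRIC_ERROR instead of an exception.
        return METRIC_ERROR
    if not gpus:
        return METRIC_ERROR
    return throughput / len(gpus)
```

The same try/except wrapping would apply at both call sites the reviewer names: `used_gpus_count` and the "tps-per-gpu" branch of metric retrieval.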
Summary

Add several metrics for the vLLM workload to enable DSE: default (== throughput), tps-per-user, tps-per-gpu.

Test Plan
Additional Notes
–