
[Feature Request] Add Kubernetes-native runner for distributed inference benchmarking (llm-d) #1045

@cemigo114

Description


Is your feature request related to a problem? Please describe.

Today, the runners/ directory is Slurm-centric for multi-node setups — e.g., launch_b200-dgxc-slurm.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh. Slurm is great for HPC-style clusters, but it's limiting for reproducing these benchmarks on the cloud-native stacks that most production LLM serving actually runs on: Kubernetes on EKS/GKE/AKS/OpenShift and on-prem K8s GPU fleets.

The repo already demonstrates disaggregated serving via NVIDIA Dynamo on Slurm (e.g., launch_gb200-nv.sh + PR #1008 for Kimi K2.5 NVFP4 GB200 disaggregated vLLM), so disaggregation itself is supported — the gap is K8s-native orchestration of the same patterns (disaggregated P/D, KV-cache-aware routing, wide-EP, autoscaling). Without that, community users can't easily reproduce InferenceX results in their own K8s environments, and newer serving patterns that are first-class in K8s-native stacks are harder to cover.

Describe the solution you'd like

Add a first-class Kubernetes-native runner targeting llm-d as a reference, analogous to the existing Slurm runners. Concretely:

  • New runner(s) under runners/ (e.g., launch_b200-k8s-llmd.sh, launch_mi355x-k8s-llmd.sh) that stand up llm-d on a K8s cluster and drive benchmarks through the existing harness (see the sketch after this list).
  • Reuse llm-d's upstream Helm charts and reproducible benchmark workflows (shipped in llm-d v0.5, Feb 2026), which already include validated B200 numbers (~3.1k tok/s per decode GPU on wide-EP; up to 50k output tok/s on a 16×16 B200 P/D topology). This minimizes new orchestration code on the InferenceX side.
  • Integration with benchmarks/ so K8s-native results are directly comparable to Slurm-based runs on the same metrics (TTFT, ITL, throughput, goodput, per-GPU utilization).
  • Support the serving patterns llm-d exposes natively: disaggregated prefill/decode via NIXL, KV-cache-aware inference scheduling via the Gateway API, wide-EP for MoE models (DeepSeek, Qwen3.5, gpt-oss), and tiered KV offload.
  • Docs for running InferenceX benchmarks on a K8s cluster (GB200 NVL72 / B200 / H100 / MI355X) using llm-d as the orchestration layer.
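For concreteness, here's a rough sketch of what one such runner could look like. Everything cluster-specific below is a placeholder: the Helm repo URL and chart name, the values heredoc and its keys, the Gateway resource name, and the benchmarks/ entrypoint and flags are illustrative assumptions, not confirmed llm-d or InferenceX interfaces.

```bash
#!/usr/bin/env bash
# Hypothetical runners/launch_b200-k8s-llmd.sh (sketch only).
set -euo pipefail

NAMESPACE="${NAMESPACE:-llmd-bench}"
RELEASE="${RELEASE:-llm-d}"
MODEL="${MODEL:-deepseek-ai/DeepSeek-R1}"   # placeholder model id

# Minimal illustrative values for a disaggregated prefill/decode deployment
# on B200 nodes; the real keys depend on the upstream llm-d chart schema.
cat > /tmp/llmd-b200-values.yaml <<'EOF'
model: deepseek-ai/DeepSeek-R1
prefill:
  replicas: 4
  gpusPerReplica: 8
decode:
  replicas: 4
  gpusPerReplica: 8
routing:
  kvCacheAware: true
EOF

# 1. Stand up llm-d via its upstream Helm chart (repo URL is a placeholder).
helm repo add llm-d https://llm-d.ai/charts
helm upgrade --install "$RELEASE" llm-d/llm-d \
  --namespace "$NAMESPACE" --create-namespace \
  --values /tmp/llmd-b200-values.yaml \
  --wait --timeout 30m

# 2. Wait for the serving stack to become ready, then resolve its endpoint
#    from the Gateway API resource the chart creates (name assumed).
kubectl -n "$NAMESPACE" wait --for=condition=Available deployment --all --timeout=20m
ENDPOINT="$(kubectl -n "$NAMESPACE" get gateway "$RELEASE" \
  -o jsonpath='{.status.addresses[0].value}')"

# 3. Drive the existing harness against the K8s endpoint so results in
#    benchmarks/ are directly comparable to the Slurm-based runs.
python benchmarks/run_benchmark.py \
  --base-url "http://${ENDPOINT}" \
  --model "$MODEL" \
  --output-dir results/b200-k8s-llmd   # harness CLI shown is hypothetical

# 4. Optional teardown between sweeps.
# helm uninstall "$RELEASE" --namespace "$NAMESPACE"
```

The same skeleton should generalize to launch_mi355x-k8s-llmd.sh by swapping the values file and node selection; only the deployment step changes, the benchmark invocation stays identical to the Slurm runners.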

Describe alternatives you've considered

  • Slurm-only (status quo): works for the current set of supported clusters, but limits reproducibility for the broader K8s-based community and makes it harder to benchmark K8s-native patterns (Gateway-API-based smart routing, HPA/VPA autoscaling, workload-variant autoscaler).
  • Raw Kubernetes Deployments/StatefulSets without llm-d: workable, but reinvents disaggregated serving, KV-cache-aware routing, and autoscaling that llm-d already provides on top of vLLM/SGLang.
  • Ray Serve / KServe / NVIDIA Dynamo on K8s: viable alternatives — could be added as additional K8s runners later. llm-d seems like a strong first target because it's purpose-built for distributed LLM inference, aligns with the vLLM/SGLang stack already used here, is Apache-2.0, and has an existing reproducible benchmark workflow that can be leveraged directly.

Additional context

  • llm-d: https://github.com/llm-d/llm-d — Kubernetes-native distributed inference stack with disaggregated P/D, KV-cache-aware scheduling, wide-EP, and native vLLM/SGLang support. Supported accelerators per their docs include NVIDIA A100+, AMD MI250+, Intel GPU Max, and Google TPU v5e+ — overlapping well with InferenceX's hardware coverage.
  • A K8s-native runner would also make it easier to onboard new accelerators/clouds without waiting for Slurm integration on each provider.
  • Happy to help prototype a runner if maintainers are interested and can point at a preferred starting cluster (B200 or MI355X).
