
[Feature Request] Add Kubernetes-native runner for distributed inference benchmarking (llm-d) #1045

@cemigo114

Description


Is your feature request related to a problem? Please describe.

Today, the runners/ directory is Slurm-centric for multi-node setups — e.g., launch_b200-dgxc-slurm.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh. Slurm is great for HPC-style clusters, but it's limiting for reproducing these benchmarks on the cloud-native stacks that most production LLM serving actually runs on: Kubernetes on EKS/GKE/AKS/OpenShift and on-prem K8s GPU fleets.

The repo already demonstrates disaggregated serving via NVIDIA Dynamo on Slurm (e.g., launch_gb200-nv.sh + PR #1008 for Kimi K2.5 NVFP4 GB200 disaggregated vLLM), so disaggregation itself is supported — the gap is K8s-native orchestration of the same patterns (disaggregated P/D, KV-cache-aware routing, wide-EP, autoscaling). Without that, community users can't easily reproduce InferenceX results in their own K8s environments, and newer serving patterns that are first-class in K8s-native stacks are harder to cover.

Describe the solution you'd like

Add a first-class Kubernetes-native runner targeting llm-d as a reference, analogous to the existing Slurm runners. Concretely:

  • New runner(s) under runners/ (e.g., launch_b200-k8s-llmd.sh, launch_mi355x-k8s-llmd.sh) that stand up llm-d on a K8s cluster and drive benchmarks through the existing harness (see the sketch after this list).
  • Reuse llm-d's upstream Helm charts and reproducible benchmark workflows (shipped in llm-d v0.5, Feb 2026), which already include validated B200 numbers (~3.1k tok/s per decode GPU on wide-EP; up to 50k output tok/s on a 16×16 B200 P/D topology). This minimizes new orchestration code on the InferenceX side.
  • Integration with benchmarks/ so K8s-native results are directly comparable to Slurm-based runs on the same metrics (TTFT, ITL, throughput, goodput, per-GPU utilization).
  • Support the serving patterns llm-d exposes natively: disaggregated prefill/decode via NIXL, KV-cache-aware inference scheduling via the Gateway API, wide-EP for MoE models (DeepSeek, Qwen3.5, gpt-oss), and tiered KV offload.
  • Docs for running InferenceX benchmarks on a K8s cluster (GB200 NVL72 / B200 / H100 / MI355X) using llm-d as the orchestration layer.
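For concreteness, here's a rough sketch of what one such runner could look like. Everything cluster-specific below is a placeholder: the Helm repo URL and chart name, the values heredoc and its keys, the Gateway resource name, and the benchmarks/ entrypoint and flags are illustrative assumptions, not confirmed llm-d or InferenceX interfaces.

```bash
#!/usr/bin/env bash
# Hypothetical runners/launch_b200-k8s-llmd.sh (sketch only).
set -euo pipefail

NAMESPACE="${NAMESPACE:-llmd-bench}"
RELEASE="${RELEASE:-llm-d}"
MODEL="${MODEL:-deepseek-ai/DeepSeek-R1}"   # placeholder model id

# Minimal illustrative values for a disaggregated prefill/decode deployment
# on B200 nodes; the real keys depend on the upstream llm-d chart schema.
cat > /tmp/llmd-b200-values.yaml <<'EOF'
model: deepseek-ai/DeepSeek-R1
prefill:
  replicas: 4
  gpusPerReplica: 8
decode:
  replicas: 4
  gpusPerReplica: 8
routing:
  kvCacheAware: true
EOF

# 1. Stand up llm-d via its upstream Helm chart (repo URL is a placeholder).
helm repo add llm-d https://llm-d.ai/charts
helm upgrade --install "$RELEASE" llm-d/llm-d \
  --namespace "$NAMESPACE" --create-namespace \
  --values /tmp/llmd-b200-values.yaml \
  --wait --timeout 30m

# 2. Wait for the serving stack to become ready, then resolve its endpoint
#    from the Gateway API resource the chart creates (name assumed).
kubectl -n "$NAMESPACE" wait --for=condition=Available deployment --all --timeout=20m
ENDPOINT="$(kubectl -n "$NAMESPACE" get gateway "$RELEASE" \
  -o jsonpath='{.status.addresses[0].value}')"

# 3. Drive the existing harness against the K8s endpoint so results in
#    benchmarks/ are directly comparable to the Slurm-based runs.
python benchmarks/run_benchmark.py \
  --base-url "http://${ENDPOINT}" \
  --model "$MODEL" \
  --output-dir results/b200-k8s-llmd   # harness CLI shown is hypothetical

# 4. Optional teardown between sweeps.
# helm uninstall "$RELEASE" --namespace "$NAMESPACE"
```

The same skeleton should generalize to launch_mi355x-k8s-llmd.sh by swapping the values file and node selection; only the deployment step changes, the benchmark invocation stays identical to the Slurm runners.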

Describe alternatives you've considered

  • Slurm-only (status quo): works for the current set of supported clusters, but limits reproducibility for the broader K8s-based community and makes it harder to benchmark K8s-native patterns (Gateway-API-based smart routing, HPA/VPA autoscaling, workload-variant autoscaler).
  • Raw Kubernetes Deployments/StatefulSets without llm-d: workable, but reinvents disaggregated serving, KV-cache-aware routing, and autoscaling that llm-d already provides on top of vLLM/SGLang.
  • Ray Serve / KServe / NVIDIA Dynamo on K8s: viable alternatives — could be added as additional K8s runners later. llm-d seems like a strong first target because it's purpose-built for distributed LLM inference, aligns with the vLLM/SGLang stack already used here, is Apache-2.0, and has an existing reproducible benchmark workflow that can be leveraged directly.

Additional context

  • llm-d: https://github.com/llm-d/llm-d — Kubernetes-native distributed inference stack with disaggregated P/D, KV-cache-aware scheduling, wide-EP, and native vLLM/SGLang support. Supported accelerators per their docs include NVIDIA A100+, AMD MI250+, Intel GPU Max, and Google TPU v5e+ — overlapping well with InferenceX's hardware coverage.
  • A K8s-native runner would also make it easier to onboard new accelerators/clouds without waiting for Slurm integration on each provider.
  • Happy to help prototype a runner if maintainers are interested and can point at a preferred starting cluster (B200 or MI355X).
