
[WIP] Allow overriding srt-slurm repo/ref at the launcher level#1118

Open
Oseltamivir wants to merge 3 commits into main from srt-slurm-override

Conversation

@Oseltamivir (Collaborator) commented on Apr 22, 2026

Summary

  • Adds SRT_SLURM_REPO / SRT_SLURM_REF env-var override in runners/launch_gb200-nv.sh so a workflow_dispatch can test a WIP srt-slurm fork/branch without touching production defaults.
  • Exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing INFMAX_WORKSPACE so the launcher is forward-compatible with the generalized lm-eval runner in NVIDIA/srt-slurm#41 (which reads LM_EVAL_WORKSPACE) while legacy sa-submission-* branches keep working (they read INFMAX_WORKSPACE).
  • Plumbs srt-slurm-repo and srt-slurm-ref as optional inputs through e2e-tests.yml -> benchmark-multinode-tmpl.yml so the override can be set at dispatch time.
  • Adds .github/configs/srt-slurm-validation.yaml, a single-entry sweep config mirroring the cheapest dsr1-fp8-gb200-dynamo-trt entry so the validation dispatch runs exactly one eval job instead of the full gb200 matrix.
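The override described in the first bullet can be sketched as follows. This is a minimal illustration, not the actual launcher diff: the default repo/ref values and variable names are taken from the PR description, and the surrounding launcher logic (framework-branch selection, working directories) is omitted.

```shell
# Sketch of the override block in runners/launch_gb200-nv.sh
# (defaults per the PR description; real script has per-framework defaults).
DEFAULT_SRT_REPO="https://github.com/NVIDIA/srt-slurm.git"
DEFAULT_SRT_REF="sa-submission-q2-2026"

# An env var wins only when set and non-empty; an unfilled workflow input
# arrives as an empty string and falls through to the default.
SRT_REPO="${SRT_SLURM_REPO:-$DEFAULT_SRT_REPO}"
SRT_REF="${SRT_SLURM_REF:-$DEFAULT_SRT_REF}"

echo "cloning ${SRT_REPO} @ ${SRT_REF}"
# git clone --branch "$SRT_REF" --depth 1 "$SRT_REPO" srt-slurm
```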

Why

NVIDIA/srt-slurm#41 generalizes the lm-eval runner so it no longer requires SA-specific infra (the reviewer's ask on #12). Before switching defaults, we want to validate the generalized version end-to-end on GB200 with the existing InferenceX eval harness. This PR adds that knob and leaves every existing code path unchanged.

How to test

gh workflow run e2e-tests.yml \
  --repo SemiAnalysisAI/InferenceX \
  --ref srt-slurm-override \
  -f generate-cli-command='full-sweep --config-files .github/configs/srt-slurm-validation.yaml --evals-only --runner-type gb200 --multi-node' \
  -f test-name='srt-slurm generalized lm-eval validation' \
  -f srt-slurm-repo='https://github.com/Oseltamivir/srt-slurm-nvidia.git' \
  -f srt-slurm-ref='lm-eval-main'

Runs exactly one eval-only job: dsr1_8k1k dynamo-trt fp8 | P(tp8/ep8/dptrue/nw1) D(tp8/ep8/dpfalse/nw3) | disagg-true spec-none conc-63 | eval-only. Clones the generalized srt-slurm fork (Oseltamivir/srt-slurm-nvidia@lm-eval-main) instead of the default NVIDIA/srt-slurm@sa-submission-q2-2026.

Scope / non-goals

  • Only launch_gb200-nv.sh is updated. The other NVIDIA launchers (launch_{b200,b300,gb300,h100,h200}-*.sh) keep their current sa-submission-q2-2026 clone and can be updated in a follow-up once the smoke test passes.
  • Defaults are deliberately unchanged: no env vars set → identical behavior to today.
  • srt-slurm-validation.yaml is a test fixture referenced only by explicit --config-files in the dispatch command above; no automated workflow picks it up.

Adds optional SRT_SLURM_REPO / SRT_SLURM_REF env vars that the GB200
launcher reads to choose a non-default srt-slurm fork/branch. Defaults
are unchanged — production runs that don't set these still clone
NVIDIA/srt-slurm@sa-submission-q2-2026 (dynamo-vllm, dynamo-trt+kimik2.5)
or ishandhanani/srt-slurm@sa-submission-q1-2026 (everything else) as
before.

Also exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing
INFMAX_WORKSPACE. The generalized lm-eval support in NVIDIA/srt-slurm#41
reads LM_EVAL_WORKSPACE; the legacy sa-submission branches read
INFMAX_WORKSPACE. Exporting both keeps a single launcher working with
either version.

Plumbs srt-slurm-repo and srt-slurm-ref as optional inputs through
e2e-tests.yml -> benchmark-multinode-tmpl.yml so workflow_dispatch runs
can target a specific srt-slurm fork/branch for validation without
touching production defaults.
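The plumbing step can be pictured with a hypothetical shim like the one below. The `INPUT_*` variable names are assumptions for illustration; the real wiring is done in the YAML of e2e-tests.yml and benchmark-multinode-tmpl.yml, where GitHub Actions interpolates the optional inputs (unfilled inputs arrive as empty strings, which the launcher's `${VAR:-default}` expansion ignores).

```shell
# Hypothetical dispatch -> launcher hand-off (names assumed; the real
# wiring lives in e2e-tests.yml / benchmark-multinode-tmpl.yml).
export SRT_SLURM_REPO="${INPUT_SRT_SLURM_REPO:-}"
export SRT_SLURM_REF="${INPUT_SRT_SLURM_REF:-}"
echo "launcher sees SRT_SLURM_REPO='${SRT_SLURM_REPO}' SRT_SLURM_REF='${SRT_SLURM_REF}'"
```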
@claude (Bot, Contributor) left a comment


LGTM — additive override plumbing with unchanged defaults.

Extended reasoning

Overview

This PR adds optional srt-slurm-repo / srt-slurm-ref inputs to e2e-tests.yml (both workflow_dispatch and workflow_call) and to benchmark-multinode-tmpl.yml, exposes them to the launcher as SRT_SLURM_REPO / SRT_SLURM_REF env vars, and refactors the three git clone branches in runners/launch_gb200-nv.sh into a single clone block driven by DEFAULT_SRT_REPO/DEFAULT_SRT_REF variables with env-var override. It also exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing INFMAX_WORKSPACE for forward-compat with a generalized upstream srt-slurm branch.

Security risks

None meaningful. The override inputs let a user with workflow-dispatch permission point the launcher at an arbitrary git URL, but that capability already exists for anyone who can edit the launcher script, and the clone/checkout runs inside the benchmarking job — not a trust boundary that would elevate privilege. No secrets handling changes.

Level of scrutiny

Low. This is CI/benchmark infra, not production-serving code. Defaults are byte-identical to the prior behavior: the three framework branches still map to the same (repo, ref) pairs, and ${VAR:-default} falls through to the default when the workflow passes an empty string. The change is isolated to one launcher (launch_gb200-nv.sh) — the other GPU launchers are explicitly out of scope per the PR description.
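The `${VAR:-default}` behavior the review relies on — that an empty string (not just an unset variable) falls through to the default — can be checked directly in any POSIX shell:

```shell
DEFAULT_REF="sa-submission-q2-2026"

# Unset: default is used.
unset OVERRIDE_REF
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # sa-submission-q2-2026

# Empty string (what an unfilled workflow input passes): still the default.
OVERRIDE_REF=""
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # sa-submission-q2-2026

# Non-empty: the override wins.
OVERRIDE_REF="lm-eval-main"
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # lm-eval-main
```

Note that `${VAR-default}` (without the colon) would instead honor the empty string, which would break the unfilled-input case.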

Other factors

No bugs were found by the bug hunting system. The refactor is mechanical and the override pattern is idiomatic for shell scripts. Adding LM_EVAL_WORKSPACE alongside INFMAX_WORKSPACE is purely additive and justified by the compatibility rationale.

One-entry sweep config used by the manual workflow_dispatch that
validates NVIDIA/srt-slurm#41 on GB200. Mirrors the cheapest entry
from dsr1-fp8-gb200-dynamo-trt (8k1k stp, eval-conc=63, 1P/3D) so
the end-to-end eval path is exercised without running the full
gb200 sweep.

Not referenced by any automated workflow; picked up only when passed
explicitly via --config-files.
The fp8 variant mapped to /mnt/numa1/.../deepseek-r1-0528/, which is only
visible from SLURM compute nodes and not from the gb200 GH runners, so the
new preflight validation in NVIDIA/srt-slurm@lm-eval-main (not present in
sa-submission-q2-2026) rejected the submission on the runner side before
sbatch. The dsr1-fp4 path in the launcher maps to
/mnt/lustre01/models/deepseek-r1-0528-fp4-v2/, which is on the shared
lustre mount accessible from both the runner and compute nodes.

Recipe ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml is the smallest 8k1k gb200
trtllm fp4 recipe (1P+8D=9 nodes, single variant, nginx-sqsh container
alias matches the launcher's srtslurm.yaml).
Oseltamivir changed the title from "Allow overriding srt-slurm repo/ref at the launcher level" to "[WIP} Allow overriding srt-slurm repo/ref at the launcher level" on Apr 23, 2026
Oseltamivir changed the title from "[WIP} Allow overriding srt-slurm repo/ref at the launcher level" to "[WIP] Allow overriding srt-slurm repo/ref at the launcher level" on Apr 23, 2026