
[WIP] Allow overriding srt-slurm repo/ref at the launcher level#1118

Open
Oseltamivir wants to merge 3 commits into main from srt-slurm-override

Conversation

@Oseltamivir (Collaborator) commented on Apr 22, 2026

Summary

  • Adds SRT_SLURM_REPO / SRT_SLURM_REF env-var override in runners/launch_gb200-nv.sh so a workflow_dispatch can test a WIP srt-slurm fork/branch without touching production defaults.
  • Exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing INFMAX_WORKSPACE so the launcher is forward-compatible with the generalized lm-eval runner in NVIDIA/srt-slurm#41 (which reads LM_EVAL_WORKSPACE) while legacy sa-submission-* branches keep working (they read INFMAX_WORKSPACE).
  • Plumbs srt-slurm-repo and srt-slurm-ref as optional inputs through e2e-tests.yml -> benchmark-multinode-tmpl.yml so the override can be set at dispatch time.
  • Adds .github/configs/srt-slurm-validation.yaml, a single-entry sweep config mirroring the cheapest dsr1-fp8-gb200-dynamo-trt entry so the validation dispatch runs exactly one eval job instead of the full gb200 matrix.
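The override described in the first bullet can be sketched as follows. This is a minimal illustration, not the actual launcher diff: the default repo/ref values and variable names are taken from the PR description, and the surrounding launcher logic (framework-branch selection, working directories) is omitted.

```shell
# Sketch of the override block in runners/launch_gb200-nv.sh
# (defaults per the PR description; real script has per-framework defaults).
DEFAULT_SRT_REPO="https://github.com/NVIDIA/srt-slurm.git"
DEFAULT_SRT_REF="sa-submission-q2-2026"

# An env var wins only when set and non-empty; an unfilled workflow input
# arrives as an empty string and falls through to the default.
SRT_REPO="${SRT_SLURM_REPO:-$DEFAULT_SRT_REPO}"
SRT_REF="${SRT_SLURM_REF:-$DEFAULT_SRT_REF}"

echo "cloning ${SRT_REPO} @ ${SRT_REF}"
# git clone --branch "$SRT_REF" --depth 1 "$SRT_REPO" srt-slurm
```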

Why

NVIDIA/srt-slurm#41 generalizes the lm-eval runner so it no longer requires SA-specific infra (the reviewer's ask on #12). Before switching defaults, we want to validate the generalized version end-to-end on GB200 with the existing InferenceX eval harness. This PR adds that knob and leaves every existing code path unchanged.

How to test

gh workflow run e2e-tests.yml \
  --repo SemiAnalysisAI/InferenceX \
  --ref srt-slurm-override \
  -f generate-cli-command='full-sweep --config-files .github/configs/srt-slurm-validation.yaml --evals-only --runner-type gb200 --multi-node' \
  -f test-name='srt-slurm generalized lm-eval validation' \
  -f srt-slurm-repo='https://github.com/Oseltamivir/srt-slurm-nvidia.git' \
  -f srt-slurm-ref='lm-eval-main'

Runs exactly one eval-only job: dsr1_8k1k dynamo-trt fp8 | P(tp8/ep8/dptrue/nw1) D(tp8/ep8/dpfalse/nw3) | disagg-true spec-none conc-63 | eval-only. Clones the generalized srt-slurm fork (Oseltamivir/srt-slurm-nvidia@lm-eval-main) instead of the default NVIDIA/srt-slurm@sa-submission-q2-2026.

Scope / non-goals

  • Only launch_gb200-nv.sh is updated. The other NVIDIA launchers (launch_{b200,b300,gb300,h100,h200}-*.sh) keep their current sa-submission-q2-2026 clone and can be updated in a follow-up once the smoke test passes.
  • Defaults are deliberately unchanged: no env vars set → identical behavior to today.
  • srt-slurm-validation.yaml is a test fixture referenced only by explicit --config-files in the dispatch command above; no automated workflow picks it up.

Adds optional SRT_SLURM_REPO / SRT_SLURM_REF env vars that the GB200
launcher reads to choose a non-default srt-slurm fork/branch. Defaults
are unchanged — production runs that don't set these still clone
NVIDIA/srt-slurm@sa-submission-q2-2026 (dynamo-vllm, dynamo-trt+kimik2.5)
or ishandhanani/srt-slurm@sa-submission-q1-2026 (everything else) as
before.

Also exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing
INFMAX_WORKSPACE. The generalized lm-eval support in NVIDIA/srt-slurm#41
reads LM_EVAL_WORKSPACE; the legacy sa-submission branches read
INFMAX_WORKSPACE. Exporting both keeps a single launcher working with
either version.

Plumbs srt-slurm-repo and srt-slurm-ref as optional inputs through
e2e-tests.yml -> benchmark-multinode-tmpl.yml so workflow_dispatch runs
can target a specific srt-slurm fork/branch for validation without
touching production defaults.
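The plumbing step can be pictured with a hypothetical shim like the one below. The `INPUT_*` variable names are assumptions for illustration; the real wiring is done in the YAML of e2e-tests.yml and benchmark-multinode-tmpl.yml, where GitHub Actions interpolates the optional inputs (unfilled inputs arrive as empty strings, which the launcher's `${VAR:-default}` expansion ignores).

```shell
# Hypothetical dispatch -> launcher hand-off (names assumed; the real
# wiring lives in e2e-tests.yml / benchmark-multinode-tmpl.yml).
export SRT_SLURM_REPO="${INPUT_SRT_SLURM_REPO:-}"
export SRT_SLURM_REF="${INPUT_SRT_SLURM_REF:-}"
echo "launcher sees SRT_SLURM_REPO='${SRT_SLURM_REPO}' SRT_SLURM_REF='${SRT_SLURM_REF}'"
```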
@claude (Bot, Contributor) left a comment


LGTM — additive override plumbing with unchanged defaults.

Extended reasoning

Overview

This PR adds optional srt-slurm-repo / srt-slurm-ref inputs to e2e-tests.yml (both workflow_dispatch and workflow_call) and to benchmark-multinode-tmpl.yml, exposes them to the launcher as SRT_SLURM_REPO / SRT_SLURM_REF env vars, and refactors the three git clone branches in runners/launch_gb200-nv.sh into a single clone block driven by DEFAULT_SRT_REPO/DEFAULT_SRT_REF variables with env-var override. It also exports LM_EVAL_WORKSPACE=$GITHUB_WORKSPACE alongside the existing INFMAX_WORKSPACE for forward-compat with a generalized upstream srt-slurm branch.

Security risks

None meaningful. The override inputs let a user with workflow-dispatch permission point the launcher at an arbitrary git URL, but that capability already exists for anyone who can edit the launcher script, and the clone/checkout runs inside the benchmarking job — not a trust boundary that would elevate privilege. No secrets handling changes.

Level of scrutiny

Low. This is CI/benchmark infra, not production-serving code. Defaults are byte-identical to the prior behavior: the three framework branches still map to the same (repo, ref) pairs, and ${VAR:-default} falls through to the default when the workflow passes an empty string. The change is isolated to one launcher (launch_gb200-nv.sh) — the other GPU launchers are explicitly out of scope per the PR description.
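The `${VAR:-default}` behavior the review relies on — that an empty string (not just an unset variable) falls through to the default — can be checked directly in any POSIX shell:

```shell
DEFAULT_REF="sa-submission-q2-2026"

# Unset: default is used.
unset OVERRIDE_REF
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # sa-submission-q2-2026

# Empty string (what an unfilled workflow input passes): still the default.
OVERRIDE_REF=""
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # sa-submission-q2-2026

# Non-empty: the override wins.
OVERRIDE_REF="lm-eval-main"
echo "${OVERRIDE_REF:-$DEFAULT_REF}"   # lm-eval-main
```

Note that `${VAR-default}` (without the colon) would instead honor the empty string, which would break the unfilled-input case.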

Other factors

No bugs were found by the bug hunting system. The refactor is mechanical and the override pattern is idiomatic for shell scripts. Adding LM_EVAL_WORKSPACE alongside INFMAX_WORKSPACE is purely additive and justified by the compatibility rationale.

One-entry sweep config used by the manual workflow_dispatch that
validates NVIDIA/srt-slurm#41 on GB200. Mirrors the cheapest entry
from dsr1-fp8-gb200-dynamo-trt (8k1k stp, eval-conc=63, 1P/3D) so
the end-to-end eval path is exercised without running the full
gb200 sweep.

Not referenced by any automated workflow; picked up only when passed
explicitly via --config-files.
The fp8 variant mapped to /mnt/numa1/.../deepseek-r1-0528/, which is only
visible from SLURM compute nodes and not from the gb200 GH runners, so the
new preflight validation in NVIDIA/srt-slurm@lm-eval-main (not present in
sa-submission-q2-2026) rejected the submission on the runner side before
sbatch. The dsr1-fp4 path in the launcher maps to
/mnt/lustre01/models/deepseek-r1-0528-fp4-v2/, which is on the shared
lustre mount accessible from both the runner and compute nodes.

Recipe ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml is the smallest 8k1k gb200
trtllm fp4 recipe (1P+8D=9 nodes, single variant, nginx-sqsh container
alias matches the launcher's srtslurm.yaml).
Oseltamivir changed the title from "Allow overriding srt-slurm repo/ref at the launcher level" to "[WIP} Allow overriding srt-slurm repo/ref at the launcher level" on Apr 23, 2026
Oseltamivir changed the title from "[WIP} Allow overriding srt-slurm repo/ref at the launcher level" to "[WIP] Allow overriding srt-slurm repo/ref at the launcher level" on Apr 23, 2026