
Day 0 GB300 DeepSeek-V4-Pro FP4 vLLM disagg #1150

Open
Oseltamivir wants to merge 29 commits into main from dsv4-fp4-gb300-dynamo-vllm-disagg

Conversation

@Oseltamivir
Collaborator

@Oseltamivir Oseltamivir commented Apr 25, 2026

Summary

Adds dsv4-fp4-gb300-dynamo-vllm — same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cw (CoreWeave) cluster. Topologies, per-worker tuning, container, and concurrency sweep are identical to the gb200 config; only gpu_type, the launch script's filesystem assumptions, and the SLURM partition differ.

What's in here

  • runners/launch_gb300-cw.sh (new): adapted from launch_gb200-nv.sh. Stages weights at /mnt/vast/models/dsv4/, squash files at /mnt/vast/squash/, partition all. cw has no Lustre and no compute-node-local NVMe — VAST is the only option.
  • runners.yaml: new gb300-cw group with gb300-cw_0/_1 (kept separate from existing gb300 group so dsr1-fp8-gb300-dynamo-sglang doesn't get cross-routed onto cw's launch script).
  • 6 new recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/{1k1k,8k1k}/disagg-gb300-*.yaml: line-for-line mirrors of the gb200 recipes, changing only gpu_type: gb300 and the headers. Tuning kept verbatim — GB300's extra HBM (288 GB vs 184 GB) probably means the CPU/DRAM offload knobs in the tep8 recipes can be dropped, but worth measuring first rather than re-tuning blind.
  • nvidia-master.yaml: dsv4-fp4-gb300-dynamo-vllm config entry, runner: gb300-cw, recipes pointing to the new gb300 paths.
  • perf-changelog.yaml: additions-only entry.

Rack pinning (cw-specific)

cw is 2x 18-node racks. srtctl already auto-emits #SBATCH --segment={total_nodes} by default (use_segment_sbatch_directive: true is the schema default), and the launch script spells this out in srtslurm.yaml so it's obvious. The largest topology (8k/1k 7p1d-dep8-dep16) needs 18 nodes, filling one rack exactly; anything wider would not fit.

Test plan

  • Manually trigger dsv4-fp4-gb300-dynamo-vllm on the gb300-cw runners — start with the 4-node 1p1d-dep8-tep8 recipe to validate cluster plumbing before any 18-node job.
  • If the cw cluster has spare HBM headroom, follow up with a tuning PR to drop CPU/DRAM offload in the tep8 recipes and see if max-num-seqs can go higher.
  • Validate that gb300-cw_0/1 GitHub runners are registered with label gb300-cw (assumed, not verified by this PR).

Validation done locally

  • process_changelog.py against main: passes, produces 6 multi-node entries (3x 1k/1k + 3x 8k/1k), 23 benchmark points.
  • Recipe benchmark.concurrencies audit vs matrix conc-list: all 6 pairs aligned.
  • generate_sweep_configs.py full-sweep --runner-type gb300-cw: returns 6 entries.

Adds the same set of topologies (1k/1k: 1p1d-dep8-tep8, 1p1d-dep8-dep16,
3p1d-dep8-dep16; 8k/1k: same plus 7p1d-dep8-dep16) targeted at the
gb300-cr cluster (CoreWeave, 2x 18-node racks). Per-worker tuning is
identical to the gb200 sweep — only gpu_type, name, and the launch
script's filesystem / partition assumptions differ.

- Adds gb300-cr runner group (gb300-cr_0/1) and launch_gb300-cr.sh.
- Recipes mounted at /mnt/vast/models/deepseek-v4-pro/ and squash files
  under /mnt/vast/squash/; SLURM partition is 'all'.
- Each job rack-pins via srtctl's auto '#SBATCH --segment={total_nodes}';
  the 18-node 7p1d topology fits one rack exactly.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Runner names use the existing CoreWeave 'cw' suffix convention (matches
b200-cw_*, h100-cw_*, etc.) — gb300-cr was wrong. Model weights are at
/mnt/vast/models/dsv4/ (the directory the user already populated), not
.../deepseek-v4-pro/ as I'd guessed.
@Oseltamivir Oseltamivir changed the title Port DeepSeek-V4-Pro FP4 vLLM disagg sweep to gb300-cr Port DeepSeek-V4-Pro FP4 vLLM disagg sweep to gb300-cw Apr 25, 2026
@Oseltamivir Oseltamivir changed the title Port DeepSeek-V4-Pro FP4 vLLM disagg sweep to gb300-cw Day 0 GB300 DeepSeek-V4-Pro FP4 vLLM disagg Apr 25, 2026
Comment thread perf-changelog.yaml
Comment on lines +1820 to +1822
- "Same topologies, same per-worker tuning, same container (vllm/vllm-openai:deepseekv4-cu130). Recipes duplicated as disagg-gb300-*.yaml with gpu_type: gb300; 1k/1k and 8k/1k both included"
- "New runners group gb300-cr (gb300-cr_0/1) and launch_gb300-cr.sh: SLURM partition `all`, model staging at /mnt/vast/models/deepseek-v4-pro/, squash files at /mnt/vast/squash/. Each job rack-pins via srtctl's auto `#SBATCH --segment={total_nodes}` (max 18-node 7p1d topology fits one rack exactly)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150
Contributor


🟡 The new dsv4-fp4-gb300-dynamo-vllm changelog entry at perf-changelog.yaml:1822 ends with pr-link: TBD rather than the actual PR URL. The immediately-preceding gb200 sibling entry (line 1814) and most recent merges follow the convention of using the real https://github.com/SemiAnalysisAI/InferenceX/pull/<num> link, so this should be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merge. Cosmetic / nit — does not affect runtime, but the placeholder will be permanently retained in the changelog once merged.

Extended reasoning...

What's wrong. The new entry added at the bottom of perf-changelog.yaml (lines 1820-1822) has pr-link: TBD as its final line. The convention in this file is to fill in the actual GitHub PR URL — the directly-preceding gb200 sibling entry (lines 1807-1814, the dsv4-fp4-gb200-dynamo-vllm port from PR #1129) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129, and most recently-merged entries (#1144, #1147, #1148) follow the same pattern.

Why it matters. Once this PR is merged as #1150, the placeholder TBD becomes permanent in the changelog history — no automated process rewrites the entry post-merge. So the changelog will permanently link to nothing for this entry, breaking the audit trail that lets readers click through from a config-key change back to its originating PR.

Step-by-step proof.

  1. Check the current state of HEAD for perf-changelog.yaml: git show 154be8d -- perf-changelog.yaml shows the diff hunk adding lines 1815-1822, ending with pr-link: TBD.
  2. Read perf-changelog.yaml lines 1820-1822 directly: the literal string pr-link: TBD appears, not a URL.
  3. Compare to the gb200 sibling 8 lines above (line 1814): pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129 — proper URL.
  4. The PR description explicitly calls out perf-changelog.yaml: additions-only entry and the PR number is 1150, so the intended value is unambiguous.

Note on the PR diff display. The bug-tracker's rendered diff in the review pane shows this line as pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150, which differs from what is actually committed at HEAD. The git working tree is the source of truth — git show 154be8d -- perf-changelog.yaml confirms the literal TBD.

Fix. Replace pr-link: TBD on line 1822 with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1150 before merging, matching the convention of the gb200 sibling above.

Severity rationale (nit). This is documentation-only and has no runtime impact. Some prior entries in the file already have pull/XXX, pull/XXXX, or pull/TBD placeholder leakage from previously-merged PRs (e.g., lines 16, 46, 53, 349, 824, 852, 889, 906, 1556-1676), so the convention is clearly not enforced. Worth fixing for parity with the gb200 sibling immediately above, but not a blocker.

Comment on lines +7666 to +7668
# Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just
# pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
# racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.
Contributor


🟡 Comment at line 7666 reads "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" but it's inside the dsv4-fp4-gb300-dynamo-vllm config block (declared at line 7657) — so the config is referring to itself as having a gb200 sibling. Almost certainly a copy-paste leftover from the GB200→GB300 port; should reference dsv4-fp4-gb200-dynamo-vllm instead. Pure comment-only nit, no runtime effect.

Extended reasoning...

What's wrong

In .github/configs/nvidia-master.yaml the new config block dsv4-fp4-gb300-dynamo-vllm: is declared at line 7657. The header comment for that block, lines 7666-7668, currently reads:

dsv4-fp4-gb300-dynamo-vllm:        # line 7657
  ...
  # Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling, just  # line 7666
  # pointed at the gb300 recipe variants. Cluster gb300-cr is 2x 18-node
  # racks; each job is rack-pinned via srtctl's auto `#SBATCH --segment={N}`.

A config can't be its own sibling. The author clearly intended to point at dsv4-fp4-gb200-dynamo-vllm — that is the existing GB200 config defined immediately above (ending at line 7655) and the actual upstream this PR ports from.

Step-by-step proof

  1. Line 7657: dsv4-fp4-gb300-dynamo-vllm: — this opens the config block.
  2. Lines 7658-7665: scalar fields (image, model, model-prefix, runner, precision, framework, multinode, disagg) — all still inside the block opened at 7657.
  3. Line 7666: comment under that same key, which begins "Same topology + tuning as dsv4-fp4-gb300-dynamo-vllm's gb200 sibling…"
  4. The phrase "dsv4-fp4-gb300-dynamo-vllm's gb200 sibling" reads as "the gb200 sibling of dsv4-fp4-gb300-dynamo-vllm" — i.e. the current block's gb200 sibling, which is dsv4-fp4-gb200-dynamo-vllm (lines 7544-7655). Saying "X's gb200 sibling" while being X is a tautology with no referent.
  5. The PR description corroborates: "Same DSV4-Pro FP4 sweep we already run on gb200, ported to the gb300-cr cluster" — i.e. the sibling is gb200, not gb300.

Impact

None on runtime, parsing, generated artifacts, or sweep behavior — YAML comments are inert. This is purely a readability issue: a future reader following the comment will go looking for a non-existent reference.

Fix

Change dsv4-fp4-gb300-dynamo-vllm's gb200 sibling to dsv4-fp4-gb200-dynamo-vllm (or equivalent phrasing such as "as the gb200 sibling (dsv4-fp4-gb200-dynamo-vllm)"). One-token edit while the PR is still open.

Oseltamivir and others added 18 commits April 24, 2026 22:24
- SLURM_ACCOUNT: benchmark -> cw-sup. The 'benchmark' account was
  inherited from launch_gb200-nv.sh but doesn't exist on the cw cluster;
  sacctmgr shows the user is associated with cw-sup.
- Extend gb300-cw runner group to include gb300-cw_2 and gb300-cw_3.
  All four cw runners now have the gb300-cw label, so list them all so
  matrix expansion can round-robin across the full pool.
srtctl's slurm template (job_script_minimal.j2) does `if ! command -v
uv` and only installs its own (ARM64) uv when missing. The runner pod
is x86 and /mnt/home is shared NFS with the aarch64 compute nodes; the
default uv install location $HOME/.local/bin lands on that shared NFS
path and shadows the template's install on the compute side, causing
`Exec format error` from slurmd.

Install via XDG_BIN_HOME to a runner-pod-local /tmp tmpfs path. Scrub
any stale x86 uv from prior runs out of $HOME/.local/bin and fail loud
if XDG_BIN_HOME isn't honored or the install leaks to NFS anyway.
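The guard logic from this commit might look roughly like the sketch below — the install paths and the shape of the fail-loud check are assumptions; only the XDG_BIN_HOME mechanism and the NFS-scrub come from the commit message:

```shell
# Point uv's installer at a runner-pod-local tmpfs path instead of NFS $HOME.
export XDG_BIN_HOME=/tmp/uv-bin          # assumed tmpfs location on the runner pod
mkdir -p "$XDG_BIN_HOME"

# Scrub any stale x86 uv a previous run left on the shared NFS home, so the
# aarch64 compute nodes never resolve it first on $PATH.
rm -f "$HOME/.local/bin/uv"

# (uv install step elided; per the commit, the installer honors XDG_BIN_HOME.)

# Fail loud if the install leaked onto the shared NFS path anyway.
if [ -e "$HOME/.local/bin/uv" ]; then
    echo "ERROR: x86 uv leaked to NFS at \$HOME/.local/bin/uv" >&2
    exit 1
fi
```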
Previously relied on srtctl's auto '#SBATCH --segment={total_nodes}'
(controlled by use_segment_sbatch_directive=true, the schema default).
Real runs on gb300-cw showed the directive was missing from the
generated sbatch — workers landed on different racks.

Make the constraint explicit per recipe:
  sbatch_directives:
    segment: "<total_nodes>"

and turn off the auto path in srtslurm.yaml so we don't emit two
overlapping #SBATCH --segment lines. Each gb300 recipe now declares
its own segment value matching its prefill_nodes + decode_nodes
total (4, 6, 10, or 18).
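Going by the commit message, the resulting fragments would look something like this (key names as quoted above; the exact srt-slurm schema is assumed):

```yaml
# Per-recipe, explicit rack pinning; value = prefill_nodes + decode_nodes,
# e.g. "18" for the 8k/1k 7p1d-dep8-dep16 topology.
sbatch_directives:
  segment: "18"

# And in srtslurm.yaml, so the auto path doesn't emit a second,
# overlapping #SBATCH --segment line:
use_segment_sbatch_directive: false
```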
OOM during 'maturin build' of dynamo source on gb300-cw. Cargo defaults
to nproc parallel rustc workers; on Grace ARM (~72 cores per node) the
peak RAM during the link phase exceeded the SLURM cgroup limit, causing
SIGKILL with 'task 0: Out Of Memory' before vLLM ever started.

Capped at 4 in both prefill_environment and decode_environment of every
gb300 recipe. Each rustc uses ~5-10GB during linking, so 4 parallel jobs
keep peak well under any reasonable per-task cgroup limit.

(gb200-nv runs the same install via the same srt-slurm path and works
without this cap, so cw evidently has tighter per-task memory limits.)
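A minimal sketch of that environment fragment, with the surrounding keys assumed from the commit's wording:

```yaml
# Hypothetical recipe fragment — the knob and value are from the commit,
# the enclosing keys are assumed.
prefill_environment:
  CARGO_BUILD_JOBS: "4"    # cap parallel rustc workers during the dynamo build
decode_environment:
  CARGO_BUILD_JOBS: "4"
```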
…oc backtick bug

Two changes:

1. Add 'mem: "0"' to sbatch_directives in every gb300 recipe so each
   sbatch emits '#SBATCH --mem=0'. cw evidently has a tighter default
   per-task memory cgroup than nv; without --mem=0 the workers were
   getting killed with 'srun: task 0: Out Of Memory' partway through
   model load (and possibly during the dynamo source build before
   that). --mem=0 means 'use all node memory', which is what we want
   for these node-exclusive ML jobs.

2. Drop backticks from the comment in launch_gb300-cw.sh's heredoc.
   The heredoc terminator is unquoted (<<EOF), so bash performed
   command substitution on the backtick content, producing a noisy
   'sbatch_directives:: command not found' error. Cosmetic only — the
   srtslurm.yaml was still written correctly — but the error looked
   alarming.
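The heredoc fix is easy to reproduce in isolation — quoting the terminator stops bash from command-substituting the backticks; the file path here is just a demo stand-in:

```shell
# With an unquoted terminator (<<EOF), bash would treat `#SBATCH --mem=0` as a
# command substitution and emit a noisy "command not found". Quoting the
# terminator (<<'EOF') passes the backticks through literally.
cat > /tmp/srtslurm-demo.yaml <<'EOF'
# sbatch_directives: each job emits `#SBATCH --mem=0`
use_segment_sbatch_directive: false
EOF
```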
The recipe header comments still claimed each job is rack-pinned
'via srtctl's auto #SBATCH --segment={total_nodes}', but two commits
ago we flipped use_segment_sbatch_directive to false in srtslurm.yaml
and added explicit sbatch_directives.segment per recipe. Update the
six gb300 recipe headers to match the actual mechanism.
…cuda.so.1

First gb300-cw run died with 'ImportError: libcuda.so.1: cannot open
shared object file' inside the decode worker container — vllm._C is
linked against libcuda but the shared lib wasn't on the dynamic linker
search path. cw's pyxis/enroot doesn't auto-inject the host NVIDIA
driver libraries the way gb200-nv's setup does; the prestart hook
needs NVIDIA_VISIBLE_DEVICES + NVIDIA_DRIVER_CAPABILITIES in the
runtime env to know which devices and capabilities to expose.

Setting them in the launch script before 'srtctl apply' propagates
through SLURM's default --export=ALL on both sbatch and srun, so they
reach the enroot prestart hook and trigger the libcuda + libnvidia-*
bind-mounts.
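A sketch of the launch-script change — the commit names the variables but not their values, so `all` and `compute,utility` are assumptions (they are the conventional libnvidia-container settings):

```shell
# Exported before srtctl apply; SLURM's default --export=ALL on sbatch and
# srun carries these to the enroot prestart hook, which then bind-mounts
# libcuda + libnvidia-* into the containers.
export NVIDIA_VISIBLE_DEVICES=all                  # assumed value
export NVIDIA_DRIVER_CAPABILITIES=compute,utility  # assumed value

# srtctl apply ...   (submission elided)
```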
Failure mode (now diagnosed): srt-slurm's DP+EP path launches one srun
container per GPU. Each container independently runs the dynamo source
install ('maturin build' of the rust runtime, ~10 min on Grace ARM).
With 4 ranks per node x 2 nodes per worker the install times vary
enough across ranks that the early finishers hit vLLM's hardcoded
5-min 'Did not receive response from front-end process' deadline
while late finishers (rank 0 included) are still compiling.

Fix:
- runners/gb300-cw-vllm-container-deps.sh: new setup script that takes
  a global flock on /mnt/vast and, on cache miss, builds the dynamo
  wheel + a pruned source archive ONCE. Every rank pip-installs from
  the cache (~30 s) so timing across ranks stays tight.
- launch_gb300-cw.sh: overlay the custom script into the cloned
  srt-slurm's configs/ dir so the recipes' setup_script reference
  resolves to it.
- All 6 gb300 recipes: dynamo.install: false (was true) so srt-slurm's
  hardcoded per-rank install path is skipped — our setup script is the
  sole installer.
Previous attempt's logs proved every rank ran maturin build in parallel
('[dynamo-cache] cold cache — building...' showed up in ALL worker
output), so the flock on /mnt/vast was a silent no-op. /mnt/vast is
NFS-backed and flock is unreliable there without explicit nolock
config — typical in clusters.

mkdir IS atomic across NFS. Switch to mkdir-based leader election: the
rank whose mkdir of <hash>.building succeeds is the leader and runs the
build; everyone else polls for .done. Followers timeout at 30 min if
the leader crashes; in practice the build is ~10 min.
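The mkdir-based election described here can be sketched as below — paths, hash, and poll interval are hypothetical, and note that a later commit retires this scheme in favor of a pre-build step:

```shell
# Leader election via mkdir, which is atomic even across NFS clients.
CACHE=/tmp/dynamo_cache/deadbeef      # stands in for /mnt/vast/dynamo_cache/<hash>
LOCKDIR="$CACHE.building"
DONE="$CACHE.done"
mkdir -p "$(dirname "$CACHE")"

if mkdir "$LOCKDIR" 2>/dev/null; then
    # Leader: the one rank whose mkdir succeeded runs the build once.
    :   # maturin build + publish into $CACHE would go here
    touch "$DONE"
else
    # Followers: poll for the leader's .done marker, bail after a deadline
    # (30 min here, matching the commit) in case the leader crashed.
    deadline=$(( $(date +%s) + 1800 ))
    until [ -e "$DONE" ]; do
        [ "$(date +%s)" -lt "$deadline" ] || { echo "leader never finished" >&2; exit 1; }
        sleep 5
    done
fi
```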
Two prior attempts at coordinating a one-time dynamo build across the
~60 worker containers via fs-level locks on /mnt/vast both failed: NFS
silently no-ops flock and races negatively-cached mkdir. Every rank
ended up running maturin build in parallel, the timing skew across
nodes blew vLLM's hardcoded 5-min 'Did not receive response from
front-end' deadline, and ranks died.

New design eliminates all per-rank coordination:

* launch_gb300-cw.sh now runs a one-shot prebuild srun BEFORE submitting the main sbatch. That srun
  builds the dynamo wheel + a pruned source archive into a temp dir on
  /mnt/vast and atomically renames into place. Same-dir rename on NFS
  IS atomic (unlike flock or mkdir-vs-cache), so even when both
  gb300-cw_0 and gb300-cw_1 race on a cold cache the loser cleanly
  discards its build.

* gb300-cw-vllm-container-deps.sh becomes pure pip-install-from-cache;
  it errors out fast if the prebuild didn't run, instead of trying to
  build on its own.

Net: per-rank setup is now ~30 s (pip install of prebuilt wheel) vs.
~10 min cargo build, and identical across all ranks, so we don't blow
vLLM's startup window.
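The publish-by-rename step might look like this sketch — directory layout and names are hypothetical; the one property it relies on, per the commit, is that same-directory rename is atomic on NFS:

```shell
CACHE_ROOT=/tmp/vast/dynamo_cache     # stands in for /mnt/vast/dynamo_cache
FINAL="$CACHE_ROOT/deadbeef"          # <hash> of the dynamo source (hypothetical)
mkdir -p "$CACHE_ROOT"

# Build into a temp dir in the SAME directory as the final path...
TMP=$(mktemp -d "$CACHE_ROOT/.build.XXXXXX")
echo wheel > "$TMP/dynamo.whl"        # placeholder for the real wheel + archive

# ...then atomically rename into place. If another runner already published
# (cold-cache race between gb300-cw_0 and _1), the rename fails and the
# loser cleanly discards its build instead of clobbering the winner's.
if mv -T "$TMP" "$FINAL" 2>/dev/null; then
    echo "published $FINAL"
else
    rm -rf "$TMP"
fi
```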
Last attempt's prebuild srun got OOM-killed mid-build:

  error: could not compile `moxcms` (lib)
  Caused by: process didn't exit successfully ... (signal: 9, SIGKILL)
  error: Detected 1 oom_kill event in StepId=71.0
  srun: task 0: Out Of Memory

Default per-task memory cgroup is too small for cargo's link phase on a
big rust workspace. Three knobs added:

  --mem=0                 claim full node memory (same lever the main
                          sbatch already uses)
  CARGO_BUILD_JOBS=8      cap parallel rustc workers; on 72-core Grace
                          ARM the default nproc setting can have dozens
                          of rustc processes peaking together
  -C debuginfo=0          default debuginfo=2 from cargo is what makes
                          the link phase memory-hungry; we don't need
                          debug symbols in the runtime wheel
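As a fragment — only the three knobs come from the commit; wiring `-C debuginfo=0` through RUSTFLAGS (rather than a cargo profile) is an assumption:

```shell
export CARGO_BUILD_JOBS=8              # cap parallel rustc workers on 72-core Grace
export RUSTFLAGS="-C debuginfo=0"      # skip debuginfo: the link phase is what OOMs

# srun --mem=0 ... maturin build --release    # --mem=0 claims full node memory
```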
Last attempt's prebuild succeeded, the launch script reported
'[dynamo-prebuild] published cache at /mnt/vast/dynamo_cache/<hash>',
but every worker still errored with our 'prebuilt cache missing'
message. Reason: srt-slurm only mounts the model dir
(/mnt/vast/models/dsv4) into worker containers — /mnt/vast/dynamo_cache
isn't visible inside, so setup_script's stat of the cache always fails.

Add extra_mount: /mnt/vast/dynamo_cache:/mnt/vast/dynamo_cache to all
six gb300 recipes. Verified the recipes still parse cleanly via
srtctl's load_config; cfg.extra_mount is now populated as expected.
Latest run got past dynamo install (cache mount + prebuild both work
now — 41 ranks all succeeded), then hit a different wall:

  RuntimeError: Did not receive response from front-end process
  within 5 minutes

This is vllm's hardcoded engine-core handshake deadline. With DSV4-Pro
weights (~850 GB) on /mnt/vast NFS and 8 DP ranks reading in parallel
through one NFS client mount, rank 0's model load runs longer than
5 minutes under contention; the other DP ranks then time out waiting
for the front-end (rank 0's DPAsyncMPClient) to respond.

The 5-min limit is a module-level constant HANDSHAKE_TIMEOUT_MINS in
vllm/v1/engine/core.py with no env-var override. The setup script now
seds it to 30 in each rank's container after the dynamo install
completes. (No-op + warning if the constant ever changes upstream.)
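A sketch of that sed patch against a stand-in file — the real target is vllm/v1/engine/core.py inside each rank's container:

```shell
CORE_PY=/tmp/core.py                              # demo stand-in path
echo 'HANDSHAKE_TIMEOUT_MINS = 5' > "$CORE_PY"    # demo content only

if grep -q '^HANDSHAKE_TIMEOUT_MINS = 5$' "$CORE_PY"; then
    sed -i 's/^HANDSHAKE_TIMEOUT_MINS = 5$/HANDSHAKE_TIMEOUT_MINS = 30/' "$CORE_PY"
else
    # No-op + warning if the constant ever changes upstream, as the commit says.
    echo "WARN: HANDSHAKE_TIMEOUT_MINS = 5 not found; skipping patch" >&2
fi
```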
After patching the handshake timeout to 30 min, every rank still hits
'Did not receive response from front-end process within 30 minutes'.
Rank 0 itself goes silent right after vllm config init — no model load
progress, just a 30+ min gap. Suggests NCCL init is hanging, not slow
NFS load.

Two cw-specific tweaks:
- NCCL_MNNVL_ENABLE: removed. cw does not have multi-node NVLink (that's
  a gb200-nv tray feature). Telling NCCL it's there can confuse init.
- NCCL_P2P_LEVEL: NVL: removed. Across nodes there is no NVLink path,
  so forcing NVL-only P2P is wrong; let NCCL auto-pick (PIX/NET/etc).

Plus NCCL_DEBUG=INFO so the next run's worker logs show where NCCL is
stuck. We can revert the debug log once we know the root cause.
…ducer

NVL72 GB300 HAS multi-node NVLink — removing NCCL_MNNVL_ENABLE was
wrong. This commit restores it (and NCCL_P2P_LEVEL=NVL on tep8
recipes) to match the working gb200 references.

Adds NCCL_DEBUG_SUBSYS + NCCL_DEBUG_FILE to all gb300 recipes so NCCL
init/bootstrap/net diagnostics land in per-process log files instead
of flooding the main sweep log. Also adds VLLM_ENGINE_READY_TIMEOUT_S
to dep16 recipes (was only on tep8 before).

Reduces nvidia-master search space to just the 1p1d-dep8-tep8 topology
(4 nodes) for both ISL configs to isolate the DP Coordinator startup
failure before scaling up to larger topologies.
Oseltamivir and others added 7 commits April 25, 2026 14:26
…L_DEBUG_FILE

Three changes to diagnose the prefill DP Coordinator startup failure:

1. Restore the HANDSHAKE_TIMEOUT_MINS 5→30 sed patch in the setup
   script. Removing it (87bdf1f) caused follower DP ranks to hit the
   hardcoded 5-minute front-end handshake timeout during model load
   from VAST NFS. VLLM_ENGINE_READY_TIMEOUT_S does not control this
   code path.

2. Add a Python patch to vllm's coordinator.py that logs the DP
   Coordinator child's pid, alive status, and exitcode when the parent
   sees "failed to report ZMQ addresses". This surfaces the actual
   child failure instead of the opaque parent-side error.

3. Remove NCCL_DEBUG_FILE from all gb300 recipes — /tmp inside the
   container is ephemeral and not collected. NCCL debug now goes to
   stderr which lands in the SLURM .out files.
The previous coordinator patch (7f526db) failed because the needle
strings didn't match the actual multi-line format in
vllm/v1/engine/coordinator.py. Rewrote based on the real source:

(a) Bump _wait_for_zmq_addrs timeout=30 → timeout=300 by matching
    the exact "[zmq_addr_pipe, self.proc.sentinel], timeout=30" string.

(b) Insert child-process debug logging (pid, alive, exitcode) before
    the RuntimeError raise, matching the exact multi-line raise block.

This should expose whether the DP Coordinator child is crashing vs
just slow, and give it 5 minutes instead of 30 seconds to report
ZMQ addresses.
Previous patches (7f526db, 6415458) failed because exact string
matching was too brittle for the multi-line raise block in
coordinator.py. Now:

- Timeout bump: still exact-matches "[zmq_addr_pipe, self.proc.sentinel],
  timeout=30" → timeout=300 (this string is stable)
- Debug logging: regex-matches the RuntimeError raise block with
  flexible indentation/whitespace, injects child proc debug info
  (pid, alive, exitcode, sentinel) using self.proc (not the wrong
  self._coordinator_proc from the v1 attempt)
- Verification: dumps inspect.getsource(DPCoordinator._wait_for_zmq_addrs)
  so the per-rank logs show exactly what code will run

Separates timeout bump and logging into independent python blocks
so a failure in one doesn't skip the other.
@Oseltamivir Oseltamivir requested a review from Qiaolin-Yu as a code owner April 26, 2026 05:49
Oseltamivir added a commit that referenced this pull request Apr 27, 2026