
Clean llm #5

Open
QiJune wants to merge 23 commits into main from clean_llm

Conversation

QiJune (Owner) commented Jun 25, 2025

No description provided.

QiJune added 23 commits June 24, 2025 11:40
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
QiJune added a commit that referenced this pull request May 28, 2026
…n DGX_B200

Source-verified each test's actual GPU requirement, then rebalanced placement
so each test runs on a stage whose reserved GPU count matches what it uses.

l0_dgx_b200.yml:
- Add a new 2-GPU pre-merge pytorch/mpi condition for 9 tests that previously
  ran on a 4-GPU stage but use only 2 GPUs (test_autotuner_distributed_strategy, two
  TestQwen3_5_35B_A3B::test_bf16[tp2-*], test_disaggregated_deepseek_v3_lite_fp8_nixl,
  three TestKVCacheV2DSv3Lite::test_mtp_*, two TestFlux* 2-GPU pipeline tests).
- Move TestDeepSeekV32::test_nvfp4_attn_multi_gpus from the 8-GPU post-merge
  stage to the 4-GPU post-merge stage (test uses tp=4 per
  @skip_less_mpi_world_size(4)).
- Remove test_configurable_moe_single_gpu -k "MEGAMOE_DEEPGEMM", 8 unconditional
  1-GPU visual_gen tests, and test_ray_disaggregated_serving[tp2]; they now
  live in their right-sized stages (see below).

l0_b200.yml:
- Add the single-GPU MEGAMOE_DEEPGEMM row next to the existing
  test_configurable_moe_single_gpu CUTLASS/TRTLLM/CUTEDSL/DEEPGEMM/DENSEGEMM
  rows in the 1-GPU pre-merge pytorch condition.
- Add 8 visual_gen 1-GPU tests (test_visual_gen_quickstart, five LPIPS golden
  tests, two visual_gen_benchmark tests) to the 1-GPU post-merge pytorch
  condition. All use VisualGenArgs either without parallel_config or with
  explicit cfg_size=1 / ulysses_size=1.

l0_dgx_h100.yml:
- Add test_ray_disaggregated_serving[tp2] to the 4-GPU pytorch/ray pre-merge
  condition. The test is disaggregated with tp=2 in each of the context and
  generation servers; the in-body check skips when device_count < tp_size*2,
  so [tp2] actually needs 4 GPUs (not 2). Placed in
  DGX_H100-4_GPUs-PyTorch-Ray-1; no new stage needed.
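
The GPU arithmetic behind the [tp2] placement can be sketched as follows. This is a minimal illustration, not the test's actual code: `required_gpus` and `should_skip` are hypothetical helper names, and the device count is passed in as a plain integer so the snippet runs without a GPU.

```python
def required_gpus(tp_size: int, num_servers: int = 2) -> int:
    """GPUs needed when each server (context and generation) uses tp_size GPUs.

    num_servers=2 reflects the one context server plus one generation server
    described in the commit message (an assumption about the test layout).
    """
    return tp_size * num_servers


def should_skip(device_count: int, tp_size: int) -> bool:
    """Mirror of the in-body check: skip when device_count < tp_size * 2."""
    return device_count < tp_size * 2


if __name__ == "__main__":
    print(required_gpus(2))    # GPUs needed by the [tp2] variant
    print(should_skip(2, 2))   # would a 2-GPU stage skip it?
    print(should_skip(4, 2))   # would a 4-GPU stage skip it?
```

On a 2-GPU stage the check would skip the test rather than fail it, so the misplacement would not show up as a red CI run.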

L0_Test.groovy:
- Add DGX_B200-2_GPUs-PyTorch-1 stage to x86SlurmTestConfigs (single split,
  2 GPUs, dgx-b200-flex pool).
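
The rebalancing described above amounts to finding tests whose stage reserves more GPUs than the test uses. A rough sketch of such a check, assuming a hypothetical list of (test, reserved GPUs, used GPUs) records; the two entries are taken from this commit message, while the record format and function names are illustrative only:

```python
# Hypothetical records: (test name, GPUs reserved by the old stage, GPUs used).
PLACEMENTS = [
    ("test_autotuner_distributed_strategy", 4, 2),
    ("TestDeepSeekV32::test_nvfp4_attn_multi_gpus", 8, 4),
]


def oversized(placements):
    """Return the records whose stage reserves more GPUs than the test uses."""
    return [
        (name, reserved, used)
        for name, reserved, used in placements
        if reserved > used
    ]


def wasted_gpus(placements):
    """Total GPUs reserved but left idle across all records."""
    return sum(reserved - used for _, reserved, used in placements)


if __name__ == "__main__":
    for name, reserved, used in oversized(PLACEMENTS):
        print(f"{name}: reserves {reserved}, uses {used}")
    print("idle GPUs:", wasted_gpus(PLACEMENTS))
```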

Note: 6 conditional visual_gen tests under l0_dgx_b200.yml condition #5
(test_wan_t2v_example, four test_vbench_dimension_score_wan*, two
test_vbench_dimension_score_ltx2_*) were considered but kept in place. They
call _generate_wan_video / _generate_ltx2_video, which append --cfg_size 2
only when torch.cuda.device_count() >= 2. Moving to a 1-GPU stage would
silently drop the cfg_size=2 code path from CI coverage.
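
The coverage concern in the note can be illustrated with a small sketch. The real `_generate_wan_video` / `_generate_ltx2_video` helpers are not shown in this PR, so `build_generate_args` below is a hypothetical stand-in; it takes the device count as an argument instead of calling `torch.cuda.device_count()` so that it runs anywhere.

```python
def build_generate_args(base_args, device_count):
    """Append --cfg_size 2 only when at least 2 GPUs are visible.

    Hypothetical stand-in for the argument-building step described in the
    note; only the --cfg_size flag and the >= 2 threshold come from the source.
    """
    args = list(base_args)  # copy so the caller's list is not mutated
    if device_count >= 2:
        args += ["--cfg_size", "2"]
    return args


if __name__ == "__main__":
    base = ["--prompt", "a cat"]  # placeholder arguments, not from the source
    print(build_generate_args(base, 1))  # 1-GPU stage: flag is never added
    print(build_generate_args(base, 2))  # 2-GPU stage: cfg_size=2 path runs
```

On a 1-GPU stage the test would still pass, which is why the loss of the cfg_size=2 path would be silent.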

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>