Separate CI job for Megatron GPU tests #888

Merged
kevalmorabia97 merged 2 commits into main from kmorabia/split-mcore-gpu-tests
Feb 13, 2026

@kevalmorabia97 kevalmorabia97 commented Feb 13, 2026

What does this PR do?

[Short term]: Megatron-based tests take a long time, often resulting in CI/CD timeouts. This PR splits the Megatron tests into a dedicated CI/CD job so the overall CI/CD run finishes faster.
[Mid/Long term]: Run all Megatron GPU tests with torchrun instead of pytest, so that all distributed processes are created once up front and individual tests no longer need to set up and destroy their own processes, which adds a lot of overhead per test.
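
A minimal sketch of the mid/long-term idea, assuming the suite is launched with torchrun so that every worker process already has RANK, LOCAL_RANK, and WORLD_SIZE in its environment. The names `DistEnv`, `read_dist_env`, and `init_once` are hypothetical and not part of this PR; the sketch only illustrates initializing the process group once per worker instead of once per test.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int


def read_dist_env(env=None) -> DistEnv:
    """Read the variables torchrun exports; fall back to a single-process layout."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def init_once(backend: str = "nccl") -> DistEnv:
    """Initialize torch.distributed a single time per worker (hypothetical helper)."""
    # Imported lazily so the module still loads on machines without torch.
    import torch
    import torch.distributed as dist

    dist_env = read_dist_env()
    if not dist.is_initialized():
        torch.cuda.set_device(dist_env.local_rank)
        # With no init_method given, the default env:// rendezvous reads
        # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment.
        dist.init_process_group(backend=backend)
    return dist_env
```

Under this assumption, an entry script would call `init_once()` before running any test, and every test would reuse the same process group instead of spawning and tearing down its own workers.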

Testing

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

coderabbitai bot commented Feb 13, 2026

📝 Walkthrough

Restructures GPU testing infrastructure by introducing matrix-based parallel execution for multiple test variants, reducing job timeouts to 90 minutes, separating Megatron-specific test configuration into dedicated environments, and consolidating shared test fixtures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Workflows: .github/workflows/gpu_tests.yml | Implements matrix-based parallel GPU test execution with example variants (py312-cuda12-gpu, py312-cuda12-gpu-megatron), applies fail-fast: false strategy, and reduces job timeouts from 120–150 minutes to 90 minutes. |
| Megatron Test Configuration: tests/gpu_megatron/_extensions, tests/gpu_megatron/torch/conftest.py | Adds path references to shared GPU test fixtures at ../gpu/_extensions/ and ../../gpu/torch/conftest.py to enable fixture reuse across test suites. |
| Tox Environment Configuration: tox.ini | Removes megatron-core from common cuda12-gpu testenv, creates dedicated [testenv:{py310,py311,py312}-cuda12-gpu-megatron] with megatron-core pre-installation, removes Eagle-3 dependencies, and adjusts test execution paths to target tests/gpu_megatron. |
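
The fixture reuse described above can be sketched as follows, assuming the "path references" are relative symlinks (the walkthrough does not say so explicitly). The helper `link_shared_fixture` and the placeholder file contents are hypothetical; only the relative target path ../../gpu/torch/conftest.py comes from the summary.

```python
import os
import tempfile
from pathlib import Path


def link_shared_fixture(root: Path) -> Path:
    """Create tests/gpu_megatron/torch/conftest.py as a relative symlink to the tests/gpu copy."""
    shared = root / "tests" / "gpu" / "torch" / "conftest.py"
    shared.parent.mkdir(parents=True, exist_ok=True)
    shared.write_text("# shared fixtures live here\n")  # placeholder content

    link = root / "tests" / "gpu_megatron" / "torch" / "conftest.py"
    link.parent.mkdir(parents=True, exist_ok=True)
    # The target is resolved relative to the directory that contains the link.
    os.symlink("../../gpu/torch/conftest.py", link)
    return link


with tempfile.TemporaryDirectory() as tmp:
    link = link_shared_fixture(Path(tmp))
    shared = Path(tmp) / "tests" / "gpu" / "torch" / "conftest.py"
    print(link.is_symlink(), link.resolve() == shared.resolve())
```

A relative target keeps the link valid wherever the repository is checked out, which is presumably why the summary lists relative paths.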

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title accurately captures the main objective: separating Megatron GPU tests into a dedicated CI job, which is reflected in all modified files (.github/workflows/gpu_tests.yml, tox.ini) and the PR description. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into main |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/gpu_tests.yml (1)

42-47: ⚠️ Potential issue | 🔴 Critical

Bug: tests/gpu_megatron/** is missing from the changed-files filter.

Changes made only under tests/gpu_megatron/ won't trigger the PR GPU tests because this path isn't included in the file-change detection. The gpu-tests-pr job will be skipped.

Proposed fix
          files: |
            .github/workflows/gpu_tests.yml
            modelopt/**
            tests/gpu/**
+           tests/gpu_megatron/**
            tox.ini
            setup.py
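
The gap can be checked with a rough sketch that matches a changed path against the filter patterns. It assumes Python's `fnmatch` is a close enough stand-in for the glob semantics of the changed-files step, which may differ in edge cases; `triggers_gpu_tests` is a hypothetical helper, and the pattern list is copied from the snippet above.

```python
from fnmatch import fnmatch

# Patterns from the current changed-files filter (before the proposed fix).
CURRENT_FILTER = [
    ".github/workflows/gpu_tests.yml",
    "modelopt/**",
    "tests/gpu/**",
    "tox.ini",
    "setup.py",
]


def triggers_gpu_tests(changed_path: str, patterns: list[str]) -> bool:
    """Return True if any filter pattern matches the changed path."""
    return any(fnmatch(changed_path, pattern) for pattern in patterns)


megatron_only_change = "tests/gpu_megatron/torch/conftest.py"

# "tests/gpu/**" requires the literal prefix "tests/gpu/", so this does not match.
print(triggers_gpu_tests(megatron_only_change, CURRENT_FILTER))  # False

# With the proposed pattern added, the same change is detected.
print(triggers_gpu_tests(megatron_only_change, CURRENT_FILTER + ["tests/gpu_megatron/**"]))  # True
```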

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.73%. Comparing base (95511a0) to head (7f8ccaf).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #888   +/-   ##
=======================================
  Coverage   73.73%   73.73%           
=======================================
  Files         199      199           
  Lines       21165    21165           
=======================================
  Hits        15606    15606           
  Misses       5559     5559           

@jenchen13

I think we need to fix the export unit test here to pass in the HF model name for the tokenizer (as the second argument of export_mcore_gpt_to_hf).

@jenchen13 jenchen13 self-requested a review February 13, 2026 18:29
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/split-mcore-gpu-tests branch from f0d5c4b to 7f8ccaf Compare February 13, 2026 19:46
@kevalmorabia97 kevalmorabia97 merged commit ae69d5d into main Feb 13, 2026
43 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/split-mcore-gpu-tests branch February 13, 2026 21:12