Add FP8 support for SALMAutomodel by pzelasko · Pull Request #15754 · NVIDIA-NeMo/NeMo

pzelasko · 2026-06-04T17:48:19Z

Summary

add SpeechLM2 FP8 helpers for TorchAO config creation, TE FP8 autocast/patching, FSDP scale precompute, and TE FP8 padding/alignment
wire SALMAutomodel through those helpers without cached FP8 state and keep the forward/backward flow minimal
support TE FP8 padding for BSHD and THD packed inputs, including context-parallel alignment metadata
document TE FP8 and TorchAO FP8 config examples and add focused unit/runtime coverage

Testing

git diff --cached --check
pytest tests/collections/speechlm2/test_fp8.py tests/collections/speechlm2/test_salm_packed_sequences.py -q
in nemo-speech:cu13-h100plus: pytest tests/collections/speechlm2/test_salm_automodel.py -q
in nemo-speech:cu13-h100plus: 2-GPU LibriSpeech training with Nemotron 3 Nano + Canary-1B-Flash, FSDP2 dp=2 ep=2, TE FP8 recipe=block, completed 2000/2000 steps in 1:18:02 with final logged loss 0.0586

copy-pr-bot · 2026-06-04T17:48:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

pzelasko · 2026-06-04T17:53:23Z

/ok to test 614e4b4

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko · 2026-06-04T18:26:23Z

/ok to test 23761ed

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko · 2026-06-04T21:43:27Z

/ok to test 329f23a

KunalDhawan

Great work @pzelasko. I added minor comments below; other than that, LGTM.

KunalDhawan · 2026-06-04T22:48:33Z

+    """Return the minimal sequence-length multiple so B*T is divisible by 8."""
+    if batch_size <= 0:
+        raise ValueError(f"batch_size must be positive; got {batch_size}.")
+    return 8 // gcd(batch_size, 8)


This only ensures B*T % 8 == 0 and ignores tp_size, which is inconsistent with the THD helper (8 * cp_size * tp_size). Under BSHD + TP + TE-FP8 I think that breaks in two ways:

prepare_inputs truncates the seq dim to a multiple of tp_size so sequence parallelism doesn't silently reshape the input (salm_automodel.py ~L269), but then forward appends pad = (-T) % seq_multiple tokens, so the padded length is no longer guaranteed divisible by tp_size → SP shape break.

With SP the local TE Linear sees M = B*T/tp_size, so FP8 actually needs B*T % (8*tp_size) == 0, not just % 8.

Could we either thread tp_size through here (note 8*tp_size alone isn't enough — e.g. B=16, tp=4 → multiple of 2, still not divisible by 4 — so probably needs an explicit lcm(tp_size, ...)), or add a validate_fp8_config rejection for BSHD + TP + TE-FP8 pointing folks at the THD packed path? A BSHD analogue of test_maybe_pad_thd_..._accounts_for_cp_and_tp would lock it down. This combo wasn't in the 2-GPU run (dp=2 ep=2, no TP), so it's currently untested.
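The first option above (threading tp_size through the BSHD helper) could be sketched roughly as follows. This is only an illustration of the arithmetic in the comment, not the code that landed in the PR: the names `bshd_seq_multiple` and `pad_len`, and the `fp8_align` parameter, are hypothetical. It assumes FP8 needs `B*T % (8*tp_size) == 0` and sequence parallelism needs `T % tp_size == 0`.

```python
from math import gcd, lcm


def bshd_seq_multiple(batch_size: int, tp_size: int = 1, fp8_align: int = 8) -> int:
    """Smallest sequence-length multiple satisfying both constraints (hypothetical helper).

    1. B*T is divisible by fp8_align * tp_size (local rows after sequence parallelism).
    2. T is divisible by tp_size (so sequence parallelism can shard evenly).
    """
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive; got {batch_size}.")
    if tp_size <= 0:
        raise ValueError(f"tp_size must be positive; got {tp_size}.")
    rows = fp8_align * tp_size
    # Smallest T such that B*T is a multiple of `rows`.
    min_for_fp8 = rows // gcd(batch_size, rows)
    # T must also be a multiple of tp_size, hence the explicit lcm.
    return lcm(tp_size, min_for_fp8)


def pad_len(seq_len: int, multiple: int) -> int:
    """Number of tokens to append so seq_len becomes a multiple of `multiple`."""
    return (-seq_len) % multiple


# The reviewer's counterexample: B=16, tp=4. Using 8*tp_size alone would give 2.
print(bshd_seq_multiple(16, tp_size=4))
print(pad_len(10, bshd_seq_multiple(16, tp_size=4)))
```

With `tp_size=1` this reduces to the original `8 // gcd(batch_size, 8)`, so the non-TP path would be unchanged under these assumptions.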

KunalDhawan · 2026-06-04T22:49:44Z

    def backward(self, *args, **kwargs):
        self._setup_moe_fsdp_sync()
-        with loss_parallel():
+        with loss_parallel(), te_fp8_context(self.cfg.get("automodel_backend", None)):


Quick question on wrapping backward in te_fp8_context too: standard TE usage only wraps the forward, and the backward consumes the FP8 metadata captured during the forward's fp8_autocast. Re-entering fp8_autocast here can, for history/delayed-scaling recipes, trigger an extra amax/scale update (and a second amax all-reduce) at context exit. Probably harmless for block/current (which is what the run used), but it's an easy source of subtle scale drift on other recipes. Is it deliberate / needed for something specific? If not, dropping it from backward seems safer.
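To make the concern concrete, here is a toy model of it. It does not use TransformerEngine: `ToyScaleState` and `toy_fp8_context` are hypothetical stand-ins, and the assumption that every context exit performs one scale update is taken from the comment above, not from TE's source. Real behavior depends on the recipe and the TE version.

```python
from contextlib import contextmanager


class ToyScaleState:
    """Stand-in for delayed-scaling bookkeeping; not TransformerEngine code."""

    def __init__(self):
        self.updates = 0


@contextmanager
def toy_fp8_context(state):
    # Hypothetical stand-in for an FP8 autocast region.
    try:
        yield
    finally:
        # Assumption under test: leaving the region refreshes amax/scale once.
        state.updates += 1


def train_step(state, wrap_backward):
    with toy_fp8_context(state):
        pass  # forward pass would run here
    if wrap_backward:
        with toy_fp8_context(state):
            pass  # backward pass re-enters the context
    # Otherwise backward runs outside and reuses the forward's metadata.


forward_only = ToyScaleState()
both = ToyScaleState()
for _ in range(3):
    train_step(forward_only, wrap_backward=False)
    train_step(both, wrap_backward=True)

print(forward_only.updates, both.updates)  # 3 6
```

Under that assumption, wrapping backward doubles the number of scale updates per step, which is the drift risk raised for history-based recipes.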

nice catch too

pzelasko · 2026-06-05T13:33:04Z

/ok to test 17e203f

MahmoudAshraf97 · 2026-06-08T08:04:08Z

Have you gotten any speedups from this? I tried FP8 with FastConformer and it was slower than BF16.

pzelasko · 2026-06-08T15:24:31Z

I'm still debugging a few issues before this is ready. I'll report back when I have the numbers.

The expected speedup from TransformerEngine's FP8 is about 2x on Hopper, but only if the training is compute-bound. If your matmuls have very small problem sizes, the overhead of scaling etc. can outweigh the gain on an already tiny kernel. The two easiest ways to get the speedup are using larger models or larger batch sizes.
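One rough way to judge whether a matmul is large enough to be compute-bound is its arithmetic intensity (FLOPs per byte moved). The sketch below uses the standard GEMM counts and assumes 2-byte (BF16) operands and outputs with no cache reuse across kernels; the function name and the example shapes are illustrative only and are not taken from the PR.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n], reading A and B and writing C once."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Same weight matrix (k=1024, n=1024), different numbers of tokens (m = B*T).
small = gemm_arithmetic_intensity(m=64, n=1024, k=1024)
large = gemm_arithmetic_intensity(m=8192, n=1024, k=1024)
print(f"small batch: {small:.1f} FLOPs/byte, large batch: {large:.1f} FLOPs/byte")
```

The larger the intensity, the more of the runtime is spent in math that FP8 can accelerate; at low intensity the kernel is limited by memory traffic and fixed overheads, so lower-precision math helps less.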

pzelasko requested review from KunalDhawan and nithinraok June 4, 2026 17:52

pzelasko added 2 commits June 4, 2026 11:14

Add FP8 support for SALMAutomodel

cbef0ab

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Mark legacy docs pages as orphaned

23761ed

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko force-pushed the codex/salm-automodel-fp8 branch from 6c20005 to 23761ed Compare June 4, 2026 18:14

Fix FP8 tests and bump Automodel pin

329f23a

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko requested a review from a team as a code owner June 4, 2026 21:42

KunalDhawan reviewed Jun 4, 2026

Fix TE FP8 padding with tensor parallelism

17e203f
