AMD-AGI · zhitwang17 · Jun 17, 2026
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # ALTO: Advanced Low-precision Training and Optimization
 
-ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.
+ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, NVFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.
 
 ## Contents
 
@@ -20,6 +20,7 @@ Training-oriented kernels and schemes include:
 
 - **[Blockwise FP8](alto/kernels/blockwise_fp8)** — linear, grouped GEMM, and FlashAttention.
 - **[MXFP4](alto/kernels/fp4/mxfp4)** — linear, grouped GEMM, and FlashAttention.
+- **[NVFP4](alto/kernels/fp4/nvfp4)** — linear and grouped GEMM, using an E4M3 inner-block scale with an optional two-level (tensorwise) outer scale.
 
 Techniques used to narrow the gap versus BF16 include:
 
@@ -132,16 +133,19 @@ Illustrative recipe fragment:
 training_stage:
   lpt_modifiers:
     LowPrecisionTrainingModifier:
-      scheme: "mxfp4"
+      scheme: "mxfp4"          # also supports "nvfp4" (plus "mxfp8_e4m3" / "mxfp8_e5m2")
       targets: ["Linear", "GptOssGroupedExperts"]
       ignore: ["output", "re:.*\\.router\\.gate"]
       use_2dblock_x: false
       use_2dblock_w: true
       use_hadamard: true
       use_sr_grad: true
       use_dge: false
+      two_level_scaling: "none"  # set to "tensorwise" to enable NVFP4's outer scale
 ```
 
+To train the same model with NVFP4 instead, set `scheme: "nvfp4"` and, to enable the optional outer scale, `two_level_scaling: "tensorwise"`. The NVFP4 kernels require the environment variable `TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1` to be set in the environment that launches training.
+
 ## Export and evaluation
 
 TorchTitan typically saves checkpoints in PyTorch DCP format; you can convert to Hugging Face Safetensors and run lm-eval tasks with the bundled export utility: