diff --git a/README.md b/README.md
index bb4a5e2..9cd8bf2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # ALTO: Advanced Low-precision Training and Optimization
 
-ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.
+ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, NVFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.
 
 ## Contents
 
@@ -20,6 +20,7 @@ Training-oriented kernels and schemes include:
 
 - **[Blockwise FP8](alto/kernels/blockwise_fp8)** — linear, grouped GEMM, and FlashAttention.
 - **[MXFP4](alto/kernels/fp4/mxfp4)** — linear, grouped GEMM, and FlashAttention.
+- **[NVFP4](alto/kernels/fp4/nvfp4)** — linear and grouped GEMM, using an E4M3 inner-block scale with an optional two-level (tensorwise) outer scale.
 
 Techniques used to narrow the gap versus BF16 include:
 
@@ -132,7 +133,7 @@ Illustrative recipe fragment:
 training_stage:
   lpt_modifiers:
     LowPrecisionTrainingModifier:
-      scheme: "mxfp4"
+      scheme: "mxfp4" # also supports "nvfp4" (plus "mxfp8_e4m3" / "mxfp8_e5m2")
       targets: ["Linear", "GptOssGroupedExperts"]
       ignore: ["output", "re:.*\\.router\\.gate"]
       use_2dblock_x: false
@@ -140,8 +141,11 @@ training_stage:
       use_hadamard: true
       use_sr_grad: true
       use_dge: false
+      two_level_scaling: none # use "tensorwise" to enable NVFP4's outer scale
 ```
 
+To train the same model with NVFP4 instead, set `scheme: "nvfp4"` and `two_level_scaling: "tensorwise"`. NVFP4 kernels require the environment variable `TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1` at launch.
+
 ## Export and evaluation
 
 TorchTitan typically saves checkpoints in PyTorch DCP format; you can convert to Hugging Face Safetensors and run lm-eval tasks with the bundled export utility:
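
The NVFP4 switch this patch documents comes down to two recipe fields and one environment variable. The sketch below illustrates that change on a plain nested dict that mirrors the README's YAML fragment. The helpers `to_nvfp4` and `launch_env` are hypothetical names used only for illustration and are not part of ALTO or TorchTitan; how ALTO actually loads and validates recipes is not shown in this diff.

```python
import copy
import os

# Assumption: the recipe is modeled as a nested dict shaped like the README's
# YAML fragment. ALTO's real recipe loader may represent it differently.
MODIFIER_PATH = ("training_stage", "lpt_modifiers", "LowPrecisionTrainingModifier")


def to_nvfp4(recipe: dict) -> dict:
    """Return a copy of `recipe` switched to NVFP4 with the tensorwise outer scale."""
    updated = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    node = updated
    for key in MODIFIER_PATH:
        node = node[key]
    node["scheme"] = "nvfp4"
    node["two_level_scaling"] = "tensorwise"
    return updated


def launch_env(base_env=None) -> dict:
    """Return a copy of the environment with the flag the README says NVFP4 needs."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRITON_ALLOW_NON_CONSTEXPR_GLOBALS"] = "1"
    return env


# A trimmed version of the MXFP4 recipe from the README fragment.
mxfp4_recipe = {
    "training_stage": {
        "lpt_modifiers": {
            "LowPrecisionTrainingModifier": {
                "scheme": "mxfp4",
                "targets": ["Linear", "GptOssGroupedExperts"],
                "use_hadamard": True,
                "use_sr_grad": True,
                "use_dge": False,
                "two_level_scaling": "none",
            }
        }
    }
}

nvfp4_recipe = to_nvfp4(mxfp4_recipe)
env = launch_env({})  # start from an empty environment for the demonstration
```

The resulting `env` could be passed to `subprocess.run(..., env=env)` when starting a training job; the launch command itself is outside what this patch describes.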