8 changes: 6 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# ALTO: Advanced Low-precision Training and Optimization

-ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.
+ALTO is a Python library for low-precision model training and optimization, built on top of the [TorchTitan fork](https://github.com/AMD-AGI/torchtitan-amd/tree/dev/alto). It ships Triton-backed low-precision kernels (MXFP4, NVFP4, block-scaled FP8, and related utilities) and a configurable stack of **modifiers**—low-precision training (LPT)—wired into TorchTitan through a model-converter pipeline.

## Contents

@@ -20,6 +20,7 @@ Training-oriented kernels and schemes include:

- **[Blockwise FP8](alto/kernels/blockwise_fp8)** — linear, grouped GEMM, and FlashAttention.
- **[MXFP4](alto/kernels/fp4/mxfp4)** — linear, grouped GEMM, and FlashAttention.
+- **[NVFP4](alto/kernels/fp4/nvfp4)** — linear and grouped GEMM, using an E4M3 inner-block scale with an optional two-level (tensorwise) outer scale.
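
The two-level scaling named in the NVFP4 entry can be sketched in plain Python. This is an illustrative model only, not ALTO's Triton kernels: the helper names, the 16-element block size, and the choice to leave the inner scale unrounded (a real kernel would store it in E4M3) are all assumptions made for the example.

```python
# Illustrative sketch only; names and block size are assumptions, not ALTO's API.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # non-negative FP4 (E2M1) magnitudes
FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
BLOCK = 16        # assumed inner-block size


def round_to_fp4(v):
    """Round v to the nearest representable E2M1 value, keeping the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return -mag if v < 0 else mag


def quantize_nvfp4(values, two_level=True):
    """Return (codes, block_scales, outer_scale) for a flat list of floats."""
    amax = max((abs(v) for v in values), default=0.0)
    # Outer (tensorwise) scale: chosen so every inner scale fits the E4M3 range.
    outer = amax / (FP4_MAX * E4M3_MAX) if two_level and amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Inner per-block scale; kept as a Python float here (no E4M3 rounding).
        scale = block_amax / (FP4_MAX * outer) if block_amax > 0 else 1.0
        block_scales.append(scale)
        codes.extend(round_to_fp4(v / (scale * outer)) for v in block)
    return codes, block_scales, outer


def dequantize_nvfp4(codes, block_scales, outer):
    """Invert quantize_nvfp4: value = code * inner scale * outer scale."""
    return [c * block_scales[i // BLOCK] * outer for i, c in enumerate(codes)]
```

With the outer scale enabled, each inner scale works out to `448 * block_amax / tensor_amax`, so it never exceeds the E4M3 maximum. Without it, a block with a large absolute maximum could need a scale outside that range.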

Techniques used to narrow the gap versus BF16 include:

@@ -132,16 +133,19 @@ Illustrative recipe fragment:
 training_stage:
   lpt_modifiers:
     LowPrecisionTrainingModifier:
-      scheme: "mxfp4"
+      scheme: "mxfp4"  # also supports "nvfp4" (plus "mxfp8_e4m3" / "mxfp8_e5m2")
       targets: ["Linear", "GptOssGroupedExperts"]
       ignore: ["output", "re:.*\\.router\\.gate"]
       use_2dblock_x: false
       use_2dblock_w: true
       use_hadamard: true
       use_sr_grad: true
       use_dge: false
+      two_level_scaling: none  # use "tensorwise" to enable NVFP4's outer scale
```

+To train the same model with NVFP4 instead, set `scheme: "nvfp4"` and `two_level_scaling: "tensorwise"`. NVFP4 kernels require the environment variable `TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1` at launch.
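
As a rough illustration of the NVFP4 switch described here, the sketch below mirrors the recipe fragment as a Python dict and applies the two settings, then builds an environment with the required Triton variable. The helper names `to_nvfp4` and `nvfp4_env` are hypothetical, and the text does not specify a launch command, so none is shown.

```python
import copy
import os

# Dict mirror of the recipe fragment; the keys come from that fragment.
MXFP4_RECIPE = {
    "training_stage": {
        "lpt_modifiers": {
            "LowPrecisionTrainingModifier": {
                "scheme": "mxfp4",
                "targets": ["Linear", "GptOssGroupedExperts"],
                "ignore": ["output", r"re:.*\.router\.gate"],
                "two_level_scaling": "none",
            }
        }
    }
}


def to_nvfp4(recipe):
    """Return a copy of the recipe with the NVFP4 settings applied (hypothetical helper)."""
    out = copy.deepcopy(recipe)
    mod = out["training_stage"]["lpt_modifiers"]["LowPrecisionTrainingModifier"]
    mod["scheme"] = "nvfp4"
    mod["two_level_scaling"] = "tensorwise"
    return out


def nvfp4_env(base_env=None):
    """Return a copy of the environment with the Triton flag NVFP4 kernels need."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRITON_ALLOW_NON_CONSTEXPR_GLOBALS"] = "1"
    return env
```

The returned environment can be passed as `env=` to `subprocess.run` with whatever launcher the training setup uses. Copying rather than mutating `os.environ` keeps the flag scoped to the launched process.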

## Export and evaluation

TorchTitan typically saves checkpoints in PyTorch DCP format; you can convert to Hugging Face Safetensors and run lm-eval tasks with the bundled export utility: