20 commits
3cea870
feat(quantization): add SDNQ (SD.Next Quantization) support
Pfannkuchensack Jan 14, 2026
241db2e
fix(sdnq): improve uint4 dequantization and add diffusers format support
Pfannkuchensack Jan 15, 2026
f36f0aa
test(sdnq): align asymmetric dequant expectation with upstream conven…
Pfannkuchensack May 23, 2026
53749bf
test(sdnq): align asymmetric dequant expectation with upstream conven…
Pfannkuchensack May 23, 2026
7a01941
feat(sdnq): support SDNQ-quantized T5 encoders
Pfannkuchensack May 23, 2026
5dfde1e
feat(sdnq): support SDNQ-quantized FLUX.2 transformers
Pfannkuchensack May 23, 2026
478caa7
feat(sdnq): support full ZImagePipeline diffusers folders
Pfannkuchensack May 23, 2026
d8b1e46
fix(sdnq): match T5 SDNQ submodel layout in FluxPipeline bundles
Pfannkuchensack May 23, 2026
c3cc06e
fix(sdnq): swap scale/shift halves in FLUX BFL converter's norm_out
Pfannkuchensack May 24, 2026
3261c03
feat(sdnq): dispatch all ZImagePipeline submodels via SDNQ loader
Pfannkuchensack May 24, 2026
474af00
feat(sdnq): support FLUX.2 Klein dynamic mixed-precision pipelines
Pfannkuchensack May 24, 2026
9350045
- Reject SDNQ-quantized folders in Main_Diffusers_FLUX_Config and
Pfannkuchensack May 24, 2026
010a83f
- Merge multi-shard safetensors in sdnq_sd_loader so Klein 9B's
Pfannkuchensack May 24, 2026
99a891b
- Treat SDNQ ZImagePipeline / Flux2KleinPipeline folders as
Pfannkuchensack May 24, 2026
1fdeaf9
fix(sdnq): unblock FLUX.2 Klein SDNQ pipelines in the UI
Pfannkuchensack May 24, 2026
81ef83b
feat(sdnq): add starter models and user-facing docs
Pfannkuchensack May 24, 2026
75ca01f
Chore Fix Path
Pfannkuchensack May 24, 2026
076ac22
fix(sdnq): add missing variant/cpu_only fields to SDNQ configs
Pfannkuchensack May 24, 2026
5b1c658
Fix openapi schema.ts
Pfannkuchensack May 24, 2026
acad055
Fix Path
Pfannkuchensack May 24, 2026
2 changes: 1 addition & 1 deletion docs/src/content/docs/configuration/fp8-storage.mdx
@@ -11,7 +11,7 @@ FP8 Storage cuts a model's VRAM footprint roughly in half by keeping weights on
It pairs well with [Low-VRAM mode](/configuration/low-vram-mode/): low-VRAM mode streams layers between RAM and VRAM, while FP8 Storage shrinks the layers themselves.

:::caution[For full precision models only]
FP8 Storage only applies to **full precision** checkpoints (FP16 / BF16 / FP32). It is **silently a no-op** for already-quantized formats — **GGUF**, **NF4**, and **int8** checkpoints carry their own storage precision and the loader returns a different module type that the FP8 layer cast does not touch. If your model is already quantized, the toggle has no effect; use the full-precision variant of the model if you want to enable FP8 Storage.
FP8 Storage only applies to **full precision** checkpoints (FP16 / BF16 / FP32). It is **silently a no-op** for already-quantized formats — **GGUF**, **NF4**, **int8**, and [**SDNQ**](/configuration/sdnq-quantization/) checkpoints carry their own storage precision and the loader returns a different module type that the FP8 layer cast does not touch. If your model is already quantized, the toggle has no effect; use the full-precision variant of the model if you want to enable FP8 Storage.
:::

## Requirements
130 changes: 130 additions & 0 deletions docs/src/content/docs/configuration/sdnq-quantization.mdx
@@ -0,0 +1,130 @@
---
title: SDNQ Quantization
sidebar:
  order: 4
---

import { Steps } from '@astrojs/starlight/components';

SDNQ ([SD.Next Quantization Engine](https://github.com/Disty0/sdnq)) is a quantization scheme that stores model weights at 4–5 bits with an optional low-rank SVD correction. InvokeAI loads SDNQ-quantized models as full HuggingFace diffusers pipelines and dequantizes weights on the fly during inference, with no extra Python package required.

## Supported models

| Model family | Status | Format(s) supported |
| ------------------ | ------ | ---------------------------------------------------- |
| **FLUX.1 schnell / dev** | ✅ | Diffusers pipeline (uint4 + SVD), single-file |
| **FLUX.2 Klein 4B / 9B** | ✅ | Diffusers pipeline (uint4 / int5 mixed, ± SVD), single-file |
| **Z-Image Turbo** | ✅ | Diffusers pipeline (uint4 + SVD), single-file |
| **T5 Encoder** | ✅ | Folder + standalone |
| **Qwen3 Encoder** | ✅ | Folder + standalone |
| **VAE (AutoencoderKL)** | ✅ | Folder |
| **SDXL / SD1 / SD2** | ❌ | Not yet — UNet pipeline conversion outstanding |

:::caution[SDNQ vs SVDQuant / Nunchaku]
SDNQ ("SD.Next Quantization") and SVDQuant ("[Nunchaku](https://github.com/nunchaku-ai/nunchaku)") both apply SVD low-rank correction to 4-bit weights, but they use **different on-disk formats** and **different inference engines**. A file like `svdq-int4-flux.1-schnell.safetensors` from `mit-han-lab` is a Nunchaku checkpoint and will *not* load through InvokeAI's SDNQ path — it has keys like `qweight`, `wscales`, `smooth`, `lora_up` rather than SDNQ's `weight`, `scale`, `zero_point`, `svd_up`. Look for the `Disty0/...-SDNQ-...` repo prefix on HuggingFace to be sure you're picking up the right format.
:::

## Memory footprint

Typical reductions vs. the bfloat16 baseline:

| Model | bfloat16 | SDNQ uint4 + SVD | Approx. VRAM at inference |
| --------------------------- | -------- | ---------------- | ------------------------- |
| FLUX.1 schnell | ~33 GB | ~15 GB | ~12 GB |
| FLUX.2 Klein 4B (dynamic) | ~8 GB | ~5 GB | ~5 GB |
| FLUX.2 Klein 9B (dynamic + SVD) | ~18 GB | ~13 GB | ~11 GB |
| Z-Image Turbo | ~12 GB | ~5 GB | ~5 GB |

The actual peak VRAM depends on resolution, batch size, attention backend, and whether [Low-VRAM mode](/configuration/low-vram-mode/) is enabled.

## Installing SDNQ models

The easiest way is via the **Starter Models** picker — search for "SDNQ":

<Steps>
1. Open the **Model Manager** → **Starter Models** tab.
2. Search for `SDNQ`.
3. Click **Install** on the variant you want (each entry shows the HuggingFace source).
</Steps>

To install a different SDNQ model from HuggingFace:

<Steps>
1. Open the **Model Manager** → **Add Model** → **HuggingFace** tab.
2. Enter the repo, e.g. `Disty0/FLUX.2-klein-9B-SDNQ-4bit-dynamic-svd-r32`.
3. Click **Install**. The whole pipeline folder downloads (transformer + text encoder + tokenizer + VAE).
:::tip
For very large SDNQ models, you can also pre-download with `huggingface-cli` and then point InvokeAI at the local folder via **Add Model** → **Folder**.
:::
</Steps>

InvokeAI auto-detects the SDNQ format from `transformer/quantization_config.json` (the `quant_method: "sdnq"` marker). The Model Manager shows the format as **sdnq** in the model badge once installed.
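
As a rough illustration of the detection described above, the sketch below checks a local pipeline folder for the marker. It is not InvokeAI's actual probe: the function name `looks_like_sdnq_pipeline` is hypothetical, and it assumes `quant_method` is a top-level key of `transformer/quantization_config.json`.

```python
import json
from pathlib import Path


def looks_like_sdnq_pipeline(root: Path) -> bool:
    """Return True if root/transformer/quantization_config.json declares quant_method 'sdnq'.

    Hypothetical helper for illustration only; assumes the marker is a top-level key.
    """
    config_path = Path(root) / "transformer" / "quantization_config.json"
    if not config_path.is_file():
        return False
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed config: treat as "not SDNQ" rather than failing.
        return False
    return isinstance(config, dict) and config.get("quant_method") == "sdnq"
```

Running this against a pre-downloaded folder before adding it in the Model Manager is a quick way to confirm that the repo is an SDNQ export rather than, say, a Nunchaku checkpoint.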

## What gets quantized

Inside a typical SDNQ pipeline folder:

- **`transformer/`** — diffusion transformer weights (most of the savings come from here). Quantized to uint4 or, for dynamic-mixed exports, a per-layer mix of uint4 and int5.
- **`text_encoder/`** — for FLUX.1 this is T5 + CLIP (T5 is SDNQ'd, CLIP stays full precision); for FLUX.2 Klein and Z-Image, Qwen3 is SDNQ'd.
- **`vae/`** — left as bfloat16 in current Disty0 exports (the VAE is small enough that quantizing it isn't worth the quality risk).

Layers in the producer's `modules_to_not_convert` list (typically embeddings, final projection, layer norms) stay full precision in all cases.

## LoRA compatibility

LoRAs apply to SDNQ-quantized models via the **sidecar patching path**: instead of merging the LoRA delta into the quantized weight, InvokeAI keeps the LoRA as a separate residual that runs alongside each forward pass.

- ✅ Standard LoRA, LoKr, DoRA, FluxControl-LoRA, FullLayer patches all work.
- ⚠️ Inference is slightly slower per step than non-quantized LoRA application (the sidecar adds an extra matmul per patched layer), but the slowdown is small in practice.
- ❌ LoRA training against SDNQ-quantized weights is **not supported**.
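
To make the sidecar idea concrete, here is a toy sketch in plain Python that uses no InvokeAI or PyTorch APIs. It assumes a standard LoRA delta of the form `lora_scale * (lora_up @ lora_down)`; the helper names `matvec` and `sidecar_forward` are made up for illustration. The base weight is only used to compute its own output, and the LoRA contribution is added afterwards as a separate residual.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sidecar_forward(base_weight, lora_down, lora_up, lora_scale, x):
    """Compute base(x) + lora_scale * up(down(x)) without modifying base_weight.

    base_weight stands in for the dequantize-and-matmul path of a quantized layer
    (an assumption for this sketch); it is never merged with the LoRA factors.
    """
    base_out = matvec(base_weight, x)
    residual = matvec(lora_up, matvec(lora_down, x))
    return [b + lora_scale * r for b, r in zip(base_out, residual)]


# Rank-1 LoRA on a 2x2 identity layer.
y = sidecar_forward(
    base_weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_down=[[1.0, 1.0]],    # shape (rank=1, in=2)
    lora_up=[[1.0], [2.0]],    # shape (out=2, rank=1)
    lora_scale=0.5,
    x=[2.0, 4.0],
)
```

The result matches what a merged weight `W + lora_scale * up @ down` would produce, which is why the sidecar path works regardless of how the base weight is stored.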

If you stack many LoRAs on a heavily quantized model and notice quality drift, try lowering individual LoRA weights — the 4-bit base already eats some headroom for cumulative perturbations.

## Quality trade-offs

The dynamic mixed-precision FLUX.2 Klein exports (`Disty0/FLUX.2-klein-{4B,9B}-SDNQ-4bit-dynamic-...`) let SDNQ promote individual layers from uint4 to int5 when a layer's quantization error exceeds a per-group budget. In practice this keeps the most sensitive attention projections at int5 while everything else stays uint4, with no user-visible quality regression vs. bfloat16 for most prompts.
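
The promotion rule can be pictured with a toy example. The sketch below is not SDNQ's implementation: the min/max uniform quantizer, the relative-RMS error metric, and the names `quantize_roundtrip`, `relative_error`, and `choose_bits` are all assumptions chosen for illustration, and it applies one budget per layer rather than per group.

```python
def quantize_roundtrip(values, bits):
    """Quantize to 2**bits uniform levels between min and max, then dequantize."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [round((v - lo) / scale) * scale + lo for v in values]


def relative_error(values, approx):
    """Root of (squared error / squared magnitude); 0.0 means a perfect round trip."""
    num = sum((v - a) ** 2 for v, a in zip(values, approx))
    den = sum(v * v for v in values) or 1.0
    return (num / den) ** 0.5


def choose_bits(values, budget, low_bits=4, high_bits=5):
    """Keep low_bits if the round-trip error fits the budget, otherwise promote."""
    err = relative_error(values, quantize_roundtrip(values, low_bits))
    return low_bits if err <= budget else high_bits


easy_layer = [i / 15 for i in range(16)]   # lies exactly on a 16-level grid
hard_layer = [i / 31 for i in range(32)]   # needs 32 levels to be exact
bits_easy = choose_bits(easy_layer, budget=0.01)
bits_hard = choose_bits(hard_layer, budget=0.01)
```

With this budget the first layer stays at 4 bits and the second is promoted to 5, which mirrors the behaviour described above in miniature.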

The static uint4 + SVD exports (`...-SDNQ-uint4-svd-r32`) are slightly more aggressive but use rank-32 SVD residuals to recover the lost precision. The SVD correction adds ~3 % of the original weight size back to the file but largely closes the quality gap.
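
The size of the SVD correction can be sanity-checked with simple arithmetic. The sketch below assumes the rank-`r` correction is stored as two factors of shapes `(out, r)` and `(r, in)` at the same precision as the original weights. The 3072×3072 layer shape is an illustrative assumption, `svd_up` is the key name mentioned earlier, and `svd_down` is an assumed name for its counterpart.

```python
def svd_overhead_fraction(out_features, in_features, rank=32):
    """Extra values stored by a rank-`rank` correction, relative to the full matrix."""
    extra = rank * (out_features + in_features)
    return extra / (out_features * in_features)


def apply_svd_correction(dequantized, svd_up, svd_down):
    """Return dequantized + svd_up @ svd_down for row-major nested lists."""
    rank = len(svd_down)
    return [
        [
            dequantized[i][j] + sum(svd_up[i][k] * svd_down[k][j] for k in range(rank))
            for j in range(len(dequantized[i]))
        ]
        for i in range(len(dequantized))
    ]


# Hypothetical square projection layer, rank 32: 64 / 3072, i.e. about 2 %.
fraction = svd_overhead_fraction(3072, 3072, rank=32)

# Tiny rank-1 example of adding the correction back onto dequantized weights.
corrected = apply_svd_correction(
    dequantized=[[1.0, 2.0], [3.0, 4.0]],
    svd_up=[[1.0], [0.5]],
    svd_down=[[0.1, 0.2]],
)
```

For that layer shape the overhead lands in the same low-single-digit range as the figure quoted above; the exact value depends on the layer shapes in the model and on the dtype used for the factors.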

SDNQ-specific quality issues are most likely to show up with:

- **Very high CFG values** (> 8) on Klein 4B dynamic — the 4-bit attention saturates faster than bfloat16.
- **Long generations with heavy LoRA stacks** — cumulative quantization noise becomes visible after dozens of steps.

If you need higher quality and have the VRAM, the static `Disty0/FLUX.1-dev-SDNQ-uint4-svd-r32` (FLUX.1 dev, 12 B params) is the most faithful SDNQ option.

## Comparison with other quantization formats

| Format | Size | VRAM at inference | LoRA support | Loading path |
| ----------------- | ----- | ----------------- | ------------ | ---------------------------------------- |
| **SDNQ uint4 + SVD** | ~50 % | ~50 % | ✅ sidecar | Full diffusers pipeline |
| **GGUF Q4_K_M** | ~30 % | ~30 % | ✅ sidecar | Single-file transformer + separate encoders/VAE |
| **BnB NF4** | ~50 % | ~50 % | ✅ sidecar | Single-file transformer + separate encoders/VAE |
| **FP8 storage** | ~50 % | ~50 % | ✅ direct | Any full-precision model (toggle in Model Manager) |

The headline differences:

- **SDNQ models are pipeline-shaped**: one install pulls everything you need (transformer + encoders + VAE). GGUF and BnB usually need you to also install a T5 / Qwen3 / VAE separately.
- **SDNQ has the cleanest dynamic-precision story**: GGUF picks one bit-width per file; SDNQ dynamic-mixed exports tune precision per layer.
- **GGUF is more memory-efficient** at the same nominal bit-width because it uses smaller groups. SDNQ trades that for the SVD correction option.

For low VRAM (~6–8 GB), GGUF Q4 is still the best fit. For 12–16 GB cards that can host a FLUX-class model, SDNQ is the simplest "install one thing, get a working pipeline" option.

## Troubleshooting

### "Non-diffusers FLUX.2 Klein models require a standalone Qwen3 Encoder" (Invoke button greyed out)

The Klein SDNQ pipeline carries its own Qwen3 encoder, so this readiness gate shouldn't fire — if it does, the install most likely happened before SDNQ-pipeline support was wired up and the model is cached with the wrong format in the database. Delete the model from the Model Manager and re-install it; the second install will pick up the correct `sdnq_quantized` format with the submodels populated.

### Heavy high-frequency noise overlay on output

If the structure of your image is recognizable (rough subject + composition) but colored static is layered over it, you have most likely hit a quantization-loader bug. Open an issue with:

- The exact HuggingFace repo of the model.
- A side-by-side with a non-SDNQ variant of the same prompt (e.g. compare against GGUF Q4 of the same base model).

Historically this has been caused by missing key-permutation steps in the diffusers→BFL state-dict conversion (e.g. `scale`/`shift` halves swapped). It's not a sign that the file itself is broken.

### "no safetensors files found" or "size mismatch for weight"

You probably pointed InvokeAI at the wrong subfolder. SDNQ pipelines are installed from the **whole repo root** (the folder that contains `model_index.json`), not from `transformer/` directly. The Model Manager's **Add Model** → **Folder** flow expects the pipeline root.
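
If you are unsure which folder to pick, a small check like the one below can help. It is a hypothetical helper (`find_pipeline_root` is not part of InvokeAI) that only looks for `model_index.json` in the given folder and in its parent, which covers the common mistake of selecting `transformer/`.

```python
from pathlib import Path
from typing import Optional


def find_pipeline_root(path: Path) -> Optional[Path]:
    """Return the folder holding model_index.json: `path` itself, its parent, or None."""
    path = Path(path)
    for candidate in (path, path.parent):
        if (candidate / "model_index.json").is_file():
            return candidate
    return None
```

If the function returns the parent of the folder you selected, point the Model Manager at that parent instead.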
1 change: 1 addition & 0 deletions invokeai/app/invocations/flux2_denoise.py
@@ -462,6 +462,7 @@ def _run_diffusion(self, context: InvocationContext) -> torch.Tensor:
ModelFormat.BnbQuantizedLlmInt8b,
ModelFormat.BnbQuantizednf4b,
ModelFormat.GGUFQuantized,
ModelFormat.SDNQQuantized,
]:
model_is_quantized = True
else:
35 changes: 23 additions & 12 deletions invokeai/app/invocations/flux2_klein_model_loader.py
@@ -94,14 +94,13 @@ class Flux2KleinModelLoaderInvocation(BaseInvocation):

qwen3_source_model: Optional[ModelIdentifierField] = InputField(
default=None,
description="Diffusers Flux2 Klein model to extract VAE and/or Qwen3 encoder from. "
"Use this if you don't have separate VAE/Qwen3 models. "
description="Diffusers or SDNQ-pipeline Flux2 Klein model to extract VAE and/or Qwen3 "
"encoder from. Use this if you don't have separate VAE/Qwen3 models. "
"Ignored if both VAE and Qwen3 Encoder are provided separately.",
input=Input.Direct,
ui_model_base=BaseModelType.Flux2,
ui_model_type=ModelType.Main,
ui_model_format=ModelFormat.Diffusers,
title="Qwen3 Source (Diffusers)",
title="Qwen3 Source",
)

max_seq_len: Literal[256, 512] = InputField(
@@ -114,9 +113,15 @@ def invoke(self, context: InvocationContext) -> Flux2KleinModelLoaderOutput:
# Transformer always comes from the main model
transformer = self.model.model_copy(update={"submodel_type": SubModelType.Transformer})

# Check if main model is Diffusers format (can extract VAE directly)
# Check if main model is a pipeline-shaped config we can extract submodels from.
# Plain diffusers pipelines satisfy this; so do SDNQ-quantized pipeline folders (which
# ship the same submodels layout — transformer / vae / text_encoder / tokenizer).
# Single-file SDNQ/GGUF checkpoints don't have submodels populated and fall through to
# the standalone-encoder branch.
main_config = context.models.get_config(self.model)
main_is_diffusers = main_config.format == ModelFormat.Diffusers
main_is_diffusers = main_config.format == ModelFormat.Diffusers or (
main_config.format == ModelFormat.SDNQQuantized and bool(getattr(main_config, "submodels", None))
)

# Determine VAE source
# IMPORTANT: FLUX.2 Klein uses a 32-channel VAE (AutoencoderKLFlux2), not the 16-channel FLUX.1 VAE.
@@ -173,13 +178,19 @@ def invoke(self, context: InvocationContext) -> Flux2KleinModelLoaderOutput:
def _validate_diffusers_format(
self, context: InvocationContext, model: ModelIdentifierField, model_name: str
) -> None:
"""Validate that a model is in Diffusers format."""
"""Validate that a model exposes the diffusers-style submodel layout. Both plain diffusers
pipelines and SDNQ-quantized pipeline folders (which ship the same submodels) qualify;
single-file SDNQ FLUX.2 checkpoints don't have submodels populated and are still rejected.
"""
config = context.models.get_config(model)
if config.format != ModelFormat.Diffusers:
raise ValueError(
f"The {model_name} model must be a Diffusers format model. "
f"The selected model '{config.name}' is in {config.format.value} format."
)
if config.format == ModelFormat.Diffusers:
return
if config.format == ModelFormat.SDNQQuantized and getattr(config, "submodels", None):
return
raise ValueError(
f"The {model_name} model must be a Diffusers-style FLUX.2 pipeline (with VAE / Qwen3 "
f"submodels). The selected model '{config.name}' is in {config.format.value} format."
)

def _validate_qwen3_encoder_variant(self, context: InvocationContext, main_config) -> None:
"""Validate that the standalone Qwen3 encoder variant matches the FLUX.2 Klein variant.
1 change: 1 addition & 0 deletions invokeai/app/invocations/flux_denoise.py
@@ -440,6 +440,7 @@ def _run_diffusion(
ModelFormat.BnbQuantizedLlmInt8b,
ModelFormat.BnbQuantizednf4b,
ModelFormat.GGUFQuantized,
ModelFormat.SDNQQuantized,
]:
model_is_quantized = True
else:
3 changes: 2 additions & 1 deletion invokeai/app/invocations/flux_model_loader.py
@@ -15,6 +15,7 @@
)
from invokeai.backend.flux.util import get_flux_max_seq_length
from invokeai.backend.model_manager.configs.base import Checkpoint_Config_Base
from invokeai.backend.model_manager.configs.main import Main_SDNQ_Diffusers_FLUX_Config
from invokeai.backend.model_manager.taxonomy import BaseModelType, ModelType, SubModelType


@@ -82,7 +83,7 @@ def invoke(self, context: InvocationContext) -> FluxModelLoaderOutput:
t5_encoder = preprocess_t5_encoder_model_identifier(self.t5_encoder_model)

transformer_config = context.models.get_config(transformer)
assert isinstance(transformer_config, Checkpoint_Config_Base)
assert isinstance(transformer_config, (Checkpoint_Config_Base, Main_SDNQ_Diffusers_FLUX_Config))

return FluxModelLoaderOutput(
transformer=TransformerField(transformer=transformer, loras=[]),
13 changes: 13 additions & 0 deletions invokeai/app/invocations/flux_text_encoder.py
@@ -97,6 +97,7 @@ def _t5_encode(self, context: InvocationContext) -> torch.Tensor:
ModelFormat.BnbQuantizedLlmInt8b,
ModelFormat.BnbQuantizednf4b,
ModelFormat.GGUFQuantized,
ModelFormat.SDNQQuantized,
]:
model_is_quantized = True
else:
@@ -154,6 +155,18 @@ def _clip_encode(self, context: InvocationContext) -> torch.Tensor:
cached_weights=cached_weights,
)
)
elif clip_text_encoder_config.format in [ModelFormat.SDNQQuantized]:
# SDNQ-quantized CLIP - apply LoRA as sidecar layers
exit_stack.enter_context(
LayerPatcher.apply_smart_model_patches(
model=clip_text_encoder,
patches=self._clip_lora_iterator(context),
prefix=FLUX_LORA_CLIP_PREFIX,
dtype=clip_text_encoder.dtype,
cached_weights=cached_weights,
force_sidecar_patching=True,
)
)
else:
# There are currently no supported CLIP quantized models. Add support here if needed.
raise ValueError(f"Unsupported model format: {clip_text_encoder_config.format}")
27 changes: 21 additions & 6 deletions invokeai/app/invocations/flux_vae_decode.py
@@ -1,4 +1,5 @@
import torch
from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL
from einops import rearrange
from PIL import Image

@@ -40,15 +41,29 @@ class FluxVaeDecodeInvocation(BaseInvocation, WithMetadata, WithBoard):
)

def _vae_decode(self, vae_info: LoadedModel, latents: torch.Tensor) -> Image.Image:
assert isinstance(vae_info.model, AutoEncoder)
estimated_working_memory = estimate_vae_working_memory_flux(
operation="decode", image_tensor=latents, vae=vae_info.model
)
assert isinstance(vae_info.model, (AutoEncoder, AutoencoderKL))

# Only estimate working memory for BFL AutoEncoder (diffusers VAE handles this internally)
if isinstance(vae_info.model, AutoEncoder):
estimated_working_memory = estimate_vae_working_memory_flux(
operation="decode", image_tensor=latents, vae=vae_info.model
)
else:
estimated_working_memory = 0

with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
assert isinstance(vae, AutoEncoder)
assert isinstance(vae, (AutoEncoder, AutoencoderKL))
vae_dtype = next(iter(vae.parameters())).dtype
latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
img = vae.decode(latents)

if isinstance(vae, AutoEncoder):
# BFL AutoEncoder returns tensor directly
img = vae.decode(latents)
else:
# Diffusers AutoencoderKL returns DecoderOutput with .sample attribute
# Scale latents for diffusers VAE (FLUX uses shift_factor and scale_factor)
latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
img = vae.decode(latents, return_dict=False)[0]

img = img.clamp(-1, 1)
img = rearrange(img[0], "c h w -> h w c") # noqa: F821
2 changes: 1 addition & 1 deletion invokeai/app/invocations/z_image_denoise.py
@@ -438,7 +438,7 @@ def _run_diffusion(self, context: InvocationContext) -> torch.Tensor:
# slower inference than direct patching, but is agnostic to the quantization format.
if transformer_config.format in [ModelFormat.Diffusers, ModelFormat.Checkpoint]:
model_is_quantized = False
elif transformer_config.format in [ModelFormat.GGUFQuantized]:
elif transformer_config.format in [ModelFormat.GGUFQuantized, ModelFormat.SDNQQuantized]:
model_is_quantized = True
else:
raise ValueError(f"Unsupported Z-Image model format: {transformer_config.format}")