invoke-ai · Pfannkuchensack · Jan 14, 2026 · Jan 15, 2026 · May 23, 2026 · May 23, 2026
diff --git a/docs/src/content/docs/configuration/fp8-storage.mdx b/docs/src/content/docs/configuration/fp8-storage.mdx
@@ -11,7 +11,7 @@ FP8 Storage cuts a model's VRAM footprint roughly in half by keeping weights on
 It pairs well with [Low-VRAM mode](/configuration/low-vram-mode/): low-VRAM mode streams layers between RAM and VRAM, while FP8 Storage shrinks the layers themselves.
 
 :::caution[For full precision models only]
-FP8 Storage only applies to **full precision** checkpoints (FP16 / BF16 / FP32). It is **silently a no-op** for already-quantized formats — **GGUF**, **NF4**, and **int8** checkpoints carry their own storage precision and the loader returns a different module type that the FP8 layer cast does not touch. If your model is already quantized, the toggle has no effect; use the full-precision variant of the model if you want to enable FP8 Storage.
+FP8 Storage only applies to **full precision** checkpoints (FP16 / BF16 / FP32). It is **silently a no-op** for already-quantized formats — **GGUF**, **NF4**, **int8**, and [**SDNQ**](/configuration/sdnq-quantization/) checkpoints carry their own storage precision and the loader returns a different module type that the FP8 layer cast does not touch. If your model is already quantized, the toggle has no effect; use the full-precision variant of the model if you want to enable FP8 Storage.
 :::
 
 ## Requirements

diff --git a/docs/src/content/docs/configuration/sdnq-quantization.mdx b/docs/src/content/docs/configuration/sdnq-quantization.mdx
@@ -0,0 +1,130 @@
+---
+title: SDNQ Quantization
+sidebar:
+  order: 4
+---
+
+import { Steps } from '@astrojs/starlight/components';
+
+SDNQ ([SD.Next Quantization Engine](https://github.com/Disty0/sdnq)) is a quantization scheme that stores model weights at 4–5 bits with an optional low-rank SVD correction. InvokeAI loads SDNQ-quantized models as full HuggingFace diffusers pipelines and dequantizes weights on the fly during inference, with no extra Python package required.
+
+## Supported models
+
+| Model family       | Status | Format(s) supported                                  |
+| ------------------ | ------ | ---------------------------------------------------- |
+| **FLUX.1 schnell / dev** | ✅ | Diffusers pipeline (uint4 + SVD), single-file        |
+| **FLUX.2 Klein 4B / 9B** | ✅ | Diffusers pipeline (uint4 / int5 mixed, ± SVD), single-file |
+| **Z-Image Turbo**        | ✅ | Diffusers pipeline (uint4 + SVD), single-file        |
+| **T5 Encoder**           | ✅ | Folder + standalone                                  |
+| **Qwen3 Encoder**        | ✅ | Folder + standalone                                  |
+| **VAE (AutoencoderKL)**  | ✅ | Folder                                               |
+| **SDXL / SD1 / SD2**     | ❌ | Not yet — UNet pipeline conversion outstanding       |
+
+:::caution[SDNQ vs SVDQuant / Nunchaku]
+SDNQ (the SD.Next Quantization Engine) and SVDQuant (run through the [Nunchaku](https://github.com/nunchaku-ai/nunchaku) inference engine) both apply SVD low-rank correction to 4-bit weights, but they use **different on-disk formats** and **different inference engines**. A file like `svdq-int4-flux.1-schnell.safetensors` from `mit-han-lab` is a Nunchaku checkpoint and will *not* load through InvokeAI's SDNQ path: it has keys like `qweight`, `wscales`, `smooth`, and `lora_up` rather than SDNQ's `weight`, `scale`, `zero_point`, and `svd_up`. To be sure you are picking up the right format, look for the `Disty0/...-SDNQ-...` repo naming pattern on HuggingFace.
+:::
+
+## Memory footprint
+
+Typical reductions vs. the bfloat16 baseline:
+
+| Model                       | bfloat16 | SDNQ uint4 + SVD | Approx. VRAM at inference |
+| --------------------------- | -------- | ---------------- | ------------------------- |
+| FLUX.1 schnell              | ~33 GB   | ~15 GB           | ~12 GB                    |
+| FLUX.2 Klein 4B (dynamic)   | ~8 GB    | ~5 GB            | ~5 GB                     |
+| FLUX.2 Klein 9B (dynamic + SVD) | ~18 GB | ~13 GB         | ~11 GB                    |
+| Z-Image Turbo               | ~12 GB   | ~5 GB            | ~5 GB                     |
+
+The actual peak VRAM depends on resolution, batch size, attention backend, and whether [Low-VRAM mode](/configuration/low-vram-mode/) is enabled.
+
+## Installing SDNQ models
+
+The easiest way is via the **Starter Models** picker:
+
+<Steps>
+1. Open the **Model Manager** → **Starter Models** tab.
+2. Search for `SDNQ`.
+3. Click **Install** on the variant you want (each entry shows the HuggingFace source).
+</Steps>
+
+To install a different SDNQ model from HuggingFace:
+
+<Steps>
+1. Open the **Model Manager** → **Add Model** → **HuggingFace** tab.
+2. Enter the repo, e.g. `Disty0/FLUX.2-klein-9B-SDNQ-4bit-dynamic-svd-r32`.
+3. Click **Install**. The whole pipeline folder downloads (transformer + text encoder + tokenizer + VAE).
+   :::tip
+   For very large SDNQ models, you can also pre-download with `huggingface-cli` and then point InvokeAI at the local folder via **Add Model** → **Folder**.
+   :::
+</Steps>
+
+InvokeAI auto-detects the SDNQ format from `transformer/quantization_config.json` (the `quant_method: "sdnq"` marker). The Model Manager shows the format as **sdnq** in the model badge once installed.
+
+## What gets quantized
+
+Inside a typical SDNQ pipeline folder:
+
+- **`transformer/`** — diffusion transformer weights (most of the savings come from here). Quantized to uint4 or, for dynamic-mixed exports, a per-layer mix of uint4 and int5.
+- **`text_encoder/`** — for FLUX.1 this is T5 + CLIP (T5 is SDNQ'd, CLIP stays full precision); for FLUX.2 Klein and Z-Image, Qwen3 is SDNQ'd.
+- **`vae/`** — left as bfloat16 in current Disty0 exports (the VAE is small enough that quantizing it isn't worth the quality risk).
+
+Layers in the producer's `modules_to_not_convert` list (typically embeddings, final projection, layer norms) stay full precision in all cases.
+
+## LoRA compatibility
+
+LoRAs apply to SDNQ-quantized models via the **sidecar patching path**: instead of merging the LoRA delta into the quantized weight, InvokeAI keeps the LoRA as a separate residual that runs alongside each forward pass.
+
+- ✅ Standard LoRA, LoKr, DoRA, FluxControl-LoRA, FullLayer patches all work.
+- ⚠️ Inference is slightly slower per step than non-quantized LoRA application (the sidecar adds an extra matmul per patched layer), but the slowdown is small in practice.
+- ❌ LoRA training against SDNQ-quantized weights is **not supported**.
+
+If you stack many LoRAs on a heavily quantized model and notice quality drift, try lowering individual LoRA weights — the 4-bit base already eats some headroom for cumulative perturbations.
+
+## Quality trade-offs
+
+The dynamic mixed-precision FLUX.2 Klein exports (`Disty0/FLUX.2-klein-{4B,9B}-SDNQ-4bit-dynamic-...`) let SDNQ promote individual layers from uint4 to int5 when a layer's quantization error exceeds a per-group budget. In practice this keeps the most sensitive attention projections at int5 while everything else stays at uint4, with no user-visible quality regression vs. bfloat16 for most prompts.
+
+The static uint4 + SVD exports (`...-SDNQ-uint4-svd-r32`) are slightly more aggressive but use rank-32 SVD residuals to recover the lost precision. The SVD correction adds ~3 % of the original weight size back to the file but largely closes the quality gap.
+
+SDNQ-specific quality issues are most likely to show up with:
+
+- **Very high CFG values** (> 8) on Klein 4B dynamic — the 4-bit attention saturates faster than bfloat16.
+- **Long generations with heavy LoRA stacks** — cumulative quantization noise becomes visible after dozens of steps.
+
+If you need higher quality and have the VRAM, the static `Disty0/FLUX.2-dev-SDNQ-uint4-svd-r32` (FLUX.2 dev, 32 B params) is the most faithful SDNQ option.
+
+## Comparison with other quantization formats
+
+| Format               | Size (vs. bfloat16) | VRAM at inference (vs. bfloat16) | LoRA support | Loading path                                       |
+| -------------------- | ------------------- | -------------------------------- | ------------ | -------------------------------------------------- |
+| **SDNQ uint4 + SVD** | ~50 %               | ~50 %                            | ✅ sidecar   | Full diffusers pipeline                            |
+| **GGUF Q4_K_M**      | ~30 %               | ~30 %                            | ✅ sidecar   | Single-file transformer + separate encoders/VAE    |
+| **BnB NF4**          | ~50 %               | ~50 %                            | ✅ sidecar   | Single-file transformer + separate encoders/VAE    |
+| **FP8 Storage**      | ~50 %               | ~50 %                            | ✅ direct    | Any full-precision model (toggle in Model Manager) |
+
+The headline differences:
+
+- **SDNQ models are pipeline-shaped**: one install pulls everything you need (transformer + encoders + VAE). GGUF and BnB usually need you to also install a T5 / Qwen3 / VAE separately.
+- **SDNQ has the cleanest dynamic-precision story**: GGUF picks one bit-width per file; SDNQ dynamic-mixed exports tune precision per layer.
+- **GGUF is more memory-efficient** at the same nominal bit-width because it uses smaller groups. SDNQ trades that for the SVD correction option.
+
+For low VRAM (~6–8 GB), GGUF Q4 is still the best fit. For 12–16 GB cards that can host a FLUX-class model, SDNQ is the simplest "install one thing, get a working pipeline" option.
+
+## Troubleshooting
+
+### "Non-diffusers FLUX.2 Klein models require a standalone Qwen3 Encoder" (Invoke button greyed out)
+
+The Klein SDNQ pipeline carries its own Qwen3 encoder, so this readiness gate shouldn't fire — if it does, the install most likely happened before SDNQ-pipeline support was wired up and the model is cached with the wrong format in the database. Delete the model from the Model Manager and re-install it; the second install will pick up the correct `sdnq_quantized` format with the submodels populated.
+
+### Heavy high-frequency noise overlay on output
+
+If the structure of your image is recognizable (rough subject + composition) but colored static is layered over it, you have most likely hit a quantization-loader bug. Open an issue with:
+
+- The exact HuggingFace repo of the model.
+- A side-by-side with a non-SDNQ variant of the same prompt (e.g. compare against GGUF Q4 of the same base model).
+
+Historically this has been caused by missing key-permutation steps in the diffusers→BFL state-dict conversion (e.g. `scale`/`shift` halves swapped). It's not a sign that the file itself is broken.
+
+### "no safetensors files found" or "size mismatch for weight"
+
+You probably pointed InvokeAI at the wrong subfolder. SDNQ pipelines are installed from the **whole repo root** (the folder that contains `model_index.json`), not from `transformer/` directly. The Model Manager's **Add Model** → **Folder** flow expects the pipeline root.
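
The detection and folder-layout rules described in the SDNQ page above can be checked before installing. The following is a minimal sketch, not InvokeAI's actual probe: `find_pipeline_root` and `is_sdnq_pipeline` are hypothetical helper names, and the only facts taken from the docs are the `model_index.json` file at the pipeline root and the `quant_method: "sdnq"` marker in `transformer/quantization_config.json`.

```python
import json
from pathlib import Path
from typing import Optional


def find_pipeline_root(path: Path) -> Optional[Path]:
    """Walk upward from `path` until a folder containing model_index.json is found.

    Hypothetical helper: useful when a subfolder such as transformer/ was selected
    instead of the repo root.
    """
    start = path if path.is_dir() else path.parent
    for candidate in [start, *start.parents]:
        if (candidate / "model_index.json").is_file():
            return candidate
    return None


def is_sdnq_pipeline(root: Path) -> bool:
    """Return True if transformer/quantization_config.json has quant_method == "sdnq"."""
    config_path = root / "transformer" / "quantization_config.json"
    if not config_path.is_file():
        return False
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(config, dict) and config.get("quant_method") == "sdnq"
```

Calling `find_pipeline_root` on a `transformer/` subfolder returns the enclosing repo root, which is the folder the **Add Model** → **Folder** flow expects.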
@@ -462,6 +462,7 @@ def _run_diffusion(self, context: InvocationContext) -> torch.Tensor:
                 ModelFormat.BnbQuantizedLlmInt8b,
                 ModelFormat.BnbQuantizednf4b,
                 ModelFormat.GGUFQuantized,
+                ModelFormat.SDNQQuantized,
             ]:
                 model_is_quantized = True
             else:

@@ -94,14 +94,13 @@ class Flux2KleinModelLoaderInvocation(BaseInvocation):
 
     qwen3_source_model: Optional[ModelIdentifierField] = InputField(
         default=None,
-        description="Diffusers Flux2 Klein model to extract VAE and/or Qwen3 encoder from. "
-        "Use this if you don't have separate VAE/Qwen3 models. "
+        description="Diffusers or SDNQ-pipeline Flux2 Klein model to extract VAE and/or Qwen3 "
+        "encoder from. Use this if you don't have separate VAE/Qwen3 models. "
         "Ignored if both VAE and Qwen3 Encoder are provided separately.",
         input=Input.Direct,
         ui_model_base=BaseModelType.Flux2,
         ui_model_type=ModelType.Main,
-        ui_model_format=ModelFormat.Diffusers,
-        title="Qwen3 Source (Diffusers)",
+        title="Qwen3 Source",
     )
 
     max_seq_len: Literal[256, 512] = InputField(
@@ -114,9 +113,15 @@ def invoke(self, context: InvocationContext) -> Flux2KleinModelLoaderOutput:
         # Transformer always comes from the main model
         transformer = self.model.model_copy(update={"submodel_type": SubModelType.Transformer})
 
-        # Check if main model is Diffusers format (can extract VAE directly)
+        # Check if main model is a pipeline-shaped config we can extract submodels from.
+        # Plain diffusers pipelines satisfy this; so do SDNQ-quantized pipeline folders (which
+        # ship the same submodels layout — transformer / vae / text_encoder / tokenizer).
+        # Single-file SDNQ/GGUF checkpoints don't have submodels populated and fall through to
+        # the standalone-encoder branch.
         main_config = context.models.get_config(self.model)
-        main_is_diffusers = main_config.format == ModelFormat.Diffusers
+        main_is_diffusers = main_config.format == ModelFormat.Diffusers or (
+            main_config.format == ModelFormat.SDNQQuantized and bool(getattr(main_config, "submodels", None))
+        )
 
         # Determine VAE source
         # IMPORTANT: FLUX.2 Klein uses a 32-channel VAE (AutoencoderKLFlux2), not the 16-channel FLUX.1 VAE.
@@ -173,13 +178,19 @@ def invoke(self, context: InvocationContext) -> Flux2KleinModelLoaderOutput:
     def _validate_diffusers_format(
         self, context: InvocationContext, model: ModelIdentifierField, model_name: str
     ) -> None:
-        """Validate that a model is in Diffusers format."""
+        """Validate that a model exposes the diffusers-style submodel layout. Both plain diffusers
+        pipelines and SDNQ-quantized pipeline folders (which ship the same submodels) qualify;
+        single-file SDNQ FLUX.2 checkpoints don't have submodels populated and are still rejected.
+        """
         config = context.models.get_config(model)
-        if config.format != ModelFormat.Diffusers:
-            raise ValueError(
-                f"The {model_name} model must be a Diffusers format model. "
-                f"The selected model '{config.name}' is in {config.format.value} format."
-            )
+        if config.format == ModelFormat.Diffusers:
+            return
+        if config.format == ModelFormat.SDNQQuantized and getattr(config, "submodels", None):
+            return
+        raise ValueError(
+            f"The {model_name} model must be a Diffusers-style FLUX.2 pipeline (with VAE / Qwen3 "
+            f"submodels). The selected model '{config.name}' is in {config.format.value} format."
+        )
 
     def _validate_qwen3_encoder_variant(self, context: InvocationContext, main_config) -> None:
         """Validate that the standalone Qwen3 encoder variant matches the FLUX.2 Klein variant.

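The format gate introduced in the hunk above can be exercised in isolation. This is a sketch under stated assumptions: `ModelFormat` and `FakeConfig` are stand-ins defined here for illustration (the enum string values are not taken from InvokeAI), and `exposes_submodels` is a hypothetical name for the shared predicate.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class ModelFormat(str, Enum):
    # Stand-in for InvokeAI's ModelFormat; string values are illustrative only.
    Diffusers = "diffusers"
    SDNQQuantized = "sdnq_quantized"
    GGUFQuantized = "gguf_quantized"


@dataclass
class FakeConfig:
    # Stand-in for a model config record; single-file configs leave submodels empty.
    name: str
    format: ModelFormat
    submodels: Dict[str, str] = field(default_factory=dict)


def exposes_submodels(config) -> bool:
    """True for plain diffusers pipelines and for SDNQ configs with populated submodels."""
    if config.format == ModelFormat.Diffusers:
        return True
    return config.format == ModelFormat.SDNQQuantized and bool(getattr(config, "submodels", None))
```

A single-file SDNQ checkpoint (empty `submodels`) and a GGUF checkpoint both return `False`, which corresponds to the standalone-encoder branch described in the code comments.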
@@ -440,6 +440,7 @@ def _run_diffusion(
                 ModelFormat.BnbQuantizedLlmInt8b,
                 ModelFormat.BnbQuantizednf4b,
                 ModelFormat.GGUFQuantized,
+                ModelFormat.SDNQQuantized,
             ]:
                 model_is_quantized = True
             else:

@@ -15,6 +15,7 @@
 )
 from invokeai.backend.flux.util import get_flux_max_seq_length
 from invokeai.backend.model_manager.configs.base import Checkpoint_Config_Base
+from invokeai.backend.model_manager.configs.main import Main_SDNQ_Diffusers_FLUX_Config
 from invokeai.backend.model_manager.taxonomy import BaseModelType, ModelType, SubModelType
 
 
@@ -82,7 +83,7 @@ def invoke(self, context: InvocationContext) -> FluxModelLoaderOutput:
         t5_encoder = preprocess_t5_encoder_model_identifier(self.t5_encoder_model)
 
         transformer_config = context.models.get_config(transformer)
-        assert isinstance(transformer_config, Checkpoint_Config_Base)
+        assert isinstance(transformer_config, (Checkpoint_Config_Base, Main_SDNQ_Diffusers_FLUX_Config))
 
         return FluxModelLoaderOutput(
             transformer=TransformerField(transformer=transformer, loras=[]),

@@ -97,6 +97,7 @@ def _t5_encode(self, context: InvocationContext) -> torch.Tensor:
                 ModelFormat.BnbQuantizedLlmInt8b,
                 ModelFormat.BnbQuantizednf4b,
                 ModelFormat.GGUFQuantized,
+                ModelFormat.SDNQQuantized,
             ]:
                 model_is_quantized = True
             else:
@@ -154,6 +155,18 @@ def _clip_encode(self, context: InvocationContext) -> torch.Tensor:
                         cached_weights=cached_weights,
                     )
                 )
+            elif clip_text_encoder_config.format in [ModelFormat.SDNQQuantized]:
+                # SDNQ-quantized CLIP - apply LoRA as sidecar layers
+                exit_stack.enter_context(
+                    LayerPatcher.apply_smart_model_patches(
+                        model=clip_text_encoder,
+                        patches=self._clip_lora_iterator(context),
+                        prefix=FLUX_LORA_CLIP_PREFIX,
+                        dtype=clip_text_encoder.dtype,
+                        cached_weights=cached_weights,
+                        force_sidecar_patching=True,
+                    )
+                )
             else:
                 # There are currently no supported CLIP quantized models. Add support here if needed.
                 raise ValueError(f"Unsupported model format: {clip_text_encoder_config.format}")

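To illustrate what `force_sidecar_patching=True` implies, the sketch below compares a sidecar LoRA residual with a merged weight on a tiny dense layer. It is not `LayerPatcher`'s implementation: all function names are hypothetical, plain Python lists stand in for tensors, and the base layer is treated as an opaque callable because a quantized weight cannot be edited in place.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sidecar_forward(
    base_forward: Callable[[Vector], Vector],
    lora_down: Matrix,
    lora_up: Matrix,
    scale: float,
    x: Vector,
) -> Vector:
    """Run the untouched base layer, then add scale * up(down(x)) as a residual."""
    base_out = base_forward(x)
    residual = matvec(lora_up, matvec(lora_down, x))
    return [b + scale * r for b, r in zip(base_out, residual)]


def merged_forward(weight: Matrix, lora_down: Matrix, lora_up: Matrix, scale: float, x: Vector) -> Vector:
    """Direct patching: fold scale * (up @ down) into the weight, then run once."""
    rank = len(lora_down)
    delta = [
        [sum(lora_up[i][k] * lora_down[k][j] for k in range(rank)) for j in range(len(lora_down[0]))]
        for i in range(len(lora_up))
    ]
    merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]
    return matvec(merged, x)
```

Both paths produce the same output for an unquantized weight. The sidecar path pays for the extra matmuls on every forward pass but never needs to modify the base weight.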
@@ -1,4 +1,5 @@
 import torch
+from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL
 from einops import rearrange
 from PIL import Image
 
@@ -40,15 +41,29 @@ class FluxVaeDecodeInvocation(BaseInvocation, WithMetadata, WithBoard):
     )
 
     def _vae_decode(self, vae_info: LoadedModel, latents: torch.Tensor) -> Image.Image:
-        assert isinstance(vae_info.model, AutoEncoder)
-        estimated_working_memory = estimate_vae_working_memory_flux(
-            operation="decode", image_tensor=latents, vae=vae_info.model
-        )
+        assert isinstance(vae_info.model, (AutoEncoder, AutoencoderKL))
+
+        # Only estimate working memory for BFL AutoEncoder (diffusers VAE handles this internally)
+        if isinstance(vae_info.model, AutoEncoder):
+            estimated_working_memory = estimate_vae_working_memory_flux(
+                operation="decode", image_tensor=latents, vae=vae_info.model
+            )
+        else:
+            estimated_working_memory = 0
+
         with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
-            assert isinstance(vae, AutoEncoder)
+            assert isinstance(vae, (AutoEncoder, AutoencoderKL))
             vae_dtype = next(iter(vae.parameters())).dtype
             latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
-            img = vae.decode(latents)
+
+            if isinstance(vae, AutoEncoder):
+                # BFL AutoEncoder returns tensor directly
+                img = vae.decode(latents)
+            else:
+                # With return_dict=False, diffusers AutoencoderKL.decode returns a tuple whose
+                # first element is the decoded sample (otherwise a DecoderOutput with .sample).
+                # Undo the latent normalization first (FLUX uses shift_factor and scale_factor).
+                latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
+                img = vae.decode(latents, return_dict=False)[0]
 
         img = img.clamp(-1, 1)
         img = rearrange(img[0], "c h w -> h w c")  # noqa: F821

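The latent rescaling in the `AutoencoderKL` branch above is the inverse of the normalization applied after encoding. The sketch below shows that round trip on scalars; the function names are hypothetical, and the factors used in any call are placeholders, since the real values come from `vae.config`.

```python
def normalize_after_encode(value: float, scaling_factor: float, shift_factor: float) -> float:
    """Map a raw VAE latent into the normalized space the denoiser works in."""
    return (value - shift_factor) * scaling_factor


def denormalize_for_decode(value: float, scaling_factor: float, shift_factor: float) -> float:
    """Inverse mapping, matching `(latents / scaling_factor) + shift_factor` in the hunk above."""
    return value / scaling_factor + shift_factor
```

Applying `denormalize_for_decode` to the result of `normalize_after_encode` with the same factors recovers the original latent, up to floating-point error.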
@@ -438,7 +438,7 @@ def _run_diffusion(self, context: InvocationContext) -> torch.Tensor:
             # slower inference than direct patching, but is agnostic to the quantization format.
             if transformer_config.format in [ModelFormat.Diffusers, ModelFormat.Checkpoint]:
                 model_is_quantized = False
-            elif transformer_config.format in [ModelFormat.GGUFQuantized]:
+            elif transformer_config.format in [ModelFormat.GGUFQuantized, ModelFormat.SDNQQuantized]:
                 model_is_quantized = True
             else:
                 raise ValueError(f"Unsupported Z-Image model format: {transformer_config.format}")
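
The format check in the last hunk follows a three-way pattern: known unquantized formats, known quantized formats, and everything else rejected. Here is a minimal standalone sketch of that pattern. The `ModelFormat` enum is a stand-in with illustrative string values, and `needs_sidecar_patching` is a hypothetical name.

```python
from enum import Enum


class ModelFormat(str, Enum):
    # Stand-in for InvokeAI's ModelFormat; string values are illustrative only.
    Diffusers = "diffusers"
    Checkpoint = "checkpoint"
    GGUFQuantized = "gguf_quantized"
    SDNQQuantized = "sdnq_quantized"
    BnbQuantizednf4b = "bnb_quantized_nf4b"


UNQUANTIZED_FORMATS = {ModelFormat.Diffusers, ModelFormat.Checkpoint}
QUANTIZED_FORMATS = {ModelFormat.GGUFQuantized, ModelFormat.SDNQQuantized}


def needs_sidecar_patching(model_format: ModelFormat) -> bool:
    """Return True when LoRA patches must run as sidecar layers for this format."""
    if model_format in UNQUANTIZED_FORMATS:
        return False
    if model_format in QUANTIZED_FORMATS:
        return True
    # Unknown formats are rejected rather than silently treated as unquantized.
    raise ValueError(f"Unsupported Z-Image model format: {model_format.value}")
```

Raising on unknown formats means a newly added format fails loudly until it is placed in one of the two sets.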