fix(fp8): route ModelMixin through hook-based path to survive partial load by Pfannkuchensack · Pull Request #9231 · invoke-ai/InvokeAI

Pfannkuchensack · 2026-05-24T16:59:52Z

Summary

Diffusers' enable_layerwise_casting() installs a LayerwiseCastingHook that (a) only casts dtype in pre_forward, not device, and (b) replaces Linear.forward with an instance-level wrapper that calls the original Linear.forward captured before the hook was installed. ModelCache.put() later runs apply_custom_layers_to_model, which constructs a new CustomLinear sharing the original Linear's __dict__ — so the diffusers wrapper carries over and routes calls to the captured original forward, silently bypassing CustomLinear.forward and its cast_to_device autocast.
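The shadowing described above can be reproduced without torch. The following is a minimal sketch, not InvokeAI's or diffusers' actual code: `Linear`, `CustomLinear`, `install_instance_wrapper` and `wrap_sharing_dict` are hypothetical stand-ins that only mimic the behaviour the summary describes (an instance-level `forward` wrapper, then a subclass instance that reuses the same `__dict__`).

```python
class Linear:
    """Stand-in for torch.nn.Linear (hypothetical, no torch needed)."""

    def forward(self, x):
        return f"Linear.forward({x})"


class CustomLinear(Linear):
    """Stand-in for a custom layer whose forward adds device handling."""

    def forward(self, x):
        return f"CustomLinear.forward({x})"


def install_instance_wrapper(module):
    # Capture the bound forward *before* wrapping, then store the wrapper
    # as an instance attribute, i.e. inside module.__dict__.
    original_forward = module.forward

    def wrapper(x):
        return original_forward(x)

    module.forward = wrapper


def wrap_sharing_dict(module):
    # Build a CustomLinear that shares the original instance's __dict__,
    # so any instance-level attributes (including "forward") carry over.
    custom = CustomLinear.__new__(CustomLinear)
    custom.__dict__ = module.__dict__
    return custom


hooked = Linear()
install_instance_wrapper(hooked)
bypassed = wrap_sharing_dict(hooked).forward("x")

clean = wrap_sharing_dict(Linear()).forward("x")

print(bypassed)  # Linear.forward(x): the subclass forward is never reached
print(clean)     # CustomLinear.forward(x)
```

Because instance attributes take precedence over plain methods defined on the class, the wrapper stored in `__dict__` wins the attribute lookup for `forward`, and the subclass override is skipped.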

With partial loading (e.g. FLUX.2 Klein 9B on a constrained GPU), some Linear weights stay on CPU. The diffusers pre_forward only casts dtype, so F.linear then sees input on cuda:0 and weight on cpu and raises "Expected all tensors to be on the same device".

Fix: route every nn.Module, including ModelMixin, through _apply_fp8_to_nn_module, which uses register_forward_pre_hook / register_forward_hook(always_call=True). nn.Module._call_impl dispatches these hooks around forward without replacing it, so CustomLinear.forward is still reached and cast_to_device moves the weight to the input device. This drops diffusers' _disable_peft_input_autocast, which is irrelevant here: InvokeAI patches LoRAs through CustomLinear's _patches_and_weights, not PEFT's BaseTunerLayer.

Add a regression test that asserts the ModelMixin branch calls _apply_fp8_to_nn_module and not enable_layerwise_casting.
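The shape of such a test might look like the sketch below. It is an assumption about structure only: `FakeLoader` and `apply_fp8` are hypothetical stand-ins rather than InvokeAI's loader, and the real test in test_load_default_fp8.py may be organised differently.

```python
from unittest import mock


class FakeLoader:
    """Hypothetical stand-in for the model loader; not InvokeAI's real class."""

    def _apply_fp8_to_nn_module(self, model, storage_dtype):
        raise NotImplementedError("patched out in the test")

    def apply_fp8(self, model, storage_dtype="fp8"):
        # Post-fix behaviour as described in the PR: every module takes the
        # hook-based path, even if it exposes enable_layerwise_casting.
        self._apply_fp8_to_nn_module(model, storage_dtype)


def test_hook_path_used_for_model_mixin_like_object():
    # A mock that exposes enable_layerwise_casting, like a ModelMixin would.
    model = mock.Mock(spec=["enable_layerwise_casting"])
    loader = FakeLoader()

    with mock.patch.object(loader, "_apply_fp8_to_nn_module") as hook_path:
        loader.apply_fp8(model)

    hook_path.assert_called_once_with(model, "fp8")
    model.enable_layerwise_casting.assert_not_called()
```

Run against a loader that still branched into `enable_layerwise_casting`, the final assertion would fail, which matches the pre-fix failure described in the QA instructions.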

Related Issues / Discussions

https://discord.com/channels/1020123559063990373/1508132779164962850

Reported on Discord: FP8 storage on FLUX.2 Klein 9B crashes with
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
at Flux2FeedForward.linear_out inside ff_context.

Stack trace points to the diffusers LayerwiseCastingHook wrapper (diffusers/hooks/hooks.py:189 → torch/nn/modules/linear.py:125).

QA Instructions

Repro (pre-fix):

1. Install FLUX.2 Klein 9B (Diffusers format) + matching Qwen3 8B encoder.
2. In Model Manager → FLUX.2 Klein 9B → enable FP8 Storage.
3. Constrain VRAM so partial loading kicks in (e.g. set max VRAM well below 9 GB, or run on a 12 GB GPU with other models cached).
4. Generate an image with the Flux2 Klein workflow.

Pre-fix: crash with the device-mismatch RuntimeError at linear_out in the first transformer block.
Post-fix: generation completes normally.

Regression coverage:

pytest tests/backend/model_manager/load/test_load_default_fp8.py — 13 tests, all green. The new test_apply_fp8_layerwise_casting_uses_hook_path_for_model_mixin fails on the pre-fix code (it would observe enable_layerwise_casting being called) and passes on the fix.

Also verify the existing FP8 paths still work:

FLUX.1 + FP8 on Diffusers format → should still cast and infer correctly.
FLUX.1 + FP8 on single-file checkpoint → already used _apply_fp8_to_nn_module, behavior unchanged.
SDXL / SD1 + FP8 → should still work.
LoRA on top of an FP8 base model → CustomLinear._autocast_forward_with_patches branch should fire (covered by test_wrap_forward_reaches_custom_linear_after_apply_custom_layers).

Merge Plan

Straight merge. No DB or schema changes. No frontend changes. Cache invalidation on the FP8 toggle already exists (drop_model on settings change), so a user toggling FP8 off/on after pulling this PR will get the fixed loader on next load.

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)


JPPhoto · 2026-05-25T19:13:44Z

It looks like this needs a documentation update.

invokeai/docs/src/content/docs/configuration/fp8-storage.mdx:26 and :34 still say InvokeAI's FP8 path uses Diffusers enable_layerwise_casting, but PR 9231 removes that path for ModelMixin and routes every torch.nn.Module through InvokeAI's hook-based _apply_fp8_to_nn_module path at invokeai/backend/model_manager/load/load_default.py:221-233. This leaves the user-facing FP8 Storage docs stale for anyone diagnosing FP8 behavior or hardware impact.

Pfannkuchensack requested review from JPPhoto, blessedcoolant, dunkeroni and lstein as code owners May 24, 2026 16:59

github-actions bot added the python, backend, and python-tests labels May 24, 2026

lstein added the v6.13.x label May 25, 2026

lstein added this to Invoke - Community Roadmap May 25, 2026

lstein moved this to 6.13.x Theme: MODELS in Invoke - Community Roadmap May 25, 2026

lstein assigned JPPhoto May 25, 2026

Pfannkuchensack wants to merge 1 commit into invoke-ai:main from Pfannkuchensack:fix/fp8-klein9b
