fix(fp8): route ModelMixin through hook-based path to survive partial load #9231

Open
Pfannkuchensack wants to merge 1 commit into invoke-ai:main from Pfannkuchensack:fix/fp8-klein9b
Conversation

@Pfannkuchensack
Collaborator

Summary

Diffusers' enable_layerwise_casting() installs a LayerwiseCastingHook that (a) casts only dtype in pre_forward, not device, and (b) replaces Linear.forward with an instance-level wrapper that calls the original Linear.forward captured before the hook was installed. ModelCache.put() later runs apply_custom_layers_to_model, which constructs a new CustomLinear sharing the original Linear's __dict__. The diffusers wrapper lives in that __dict__, so it carries over and routes calls to the captured original forward, silently bypassing CustomLinear.forward and its cast_to_device autocast.
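
The shadowing described above can be reproduced without torch or diffusers. The following is a minimal sketch using toy stand-ins: the classes `Linear` and `CustomLinear` and the helpers `install_instance_wrapper` and `wrap_sharing_dict` are hypothetical and only mimic the attribute-lookup behaviour, not the real diffusers or InvokeAI code.

```python
class Linear:
    """Toy stand-in for torch.nn.Linear (hypothetical, no tensors)."""

    def __init__(self):
        self.calls = []

    def forward(self, x):
        self.calls.append("Linear.forward")
        return x

    def __call__(self, x):
        # Attribute lookup finds an instance-level `forward` in __dict__
        # before it finds the method defined on the class.
        return self.forward(x)


class CustomLinear(Linear):
    """Toy stand-in for the custom layer that should handle device casting."""

    def forward(self, x):
        self.calls.append("CustomLinear.forward")
        return super().forward(x)


def install_instance_wrapper(module):
    """Mimic a hook that replaces `forward` on the instance itself."""
    original_forward = module.forward  # bound method captured before wrapping

    def wrapped(x):
        module.calls.append("wrapper")
        return original_forward(x)

    module.forward = wrapped  # stored in module.__dict__


def wrap_sharing_dict(module):
    """Build a CustomLinear that shares the original module's __dict__."""
    custom = CustomLinear.__new__(CustomLinear)
    custom.__dict__ = module.__dict__
    return custom


linear = Linear()
install_instance_wrapper(linear)
custom = wrap_sharing_dict(linear)
custom(1)
print(custom.calls)  # "CustomLinear.forward" never appears
```

Because the wrapper is an instance attribute, it travels with the shared `__dict__` and wins over the `forward` defined on the new class.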

With partial loading (e.g. FLUX.2 Klein 9B on a constrained GPU), some Linear weights stay on CPU. The diffusers pre_forward only casts dtype, so F.linear then sees input on cuda:0 and weight on cpu and raises "Expected all tensors to be on the same device".

Route every nn.Module, including ModelMixin, through _apply_fp8_to_nn_module, which uses register_forward_pre_hook / register_forward_hook(always_call=True). nn.Module._call_impl dispatches these around forward without replacing it, so CustomLinear.forward is still reached and cast_to_device moves the weight to the input device. This drops diffusers' _disable_peft_input_autocast, which is irrelevant here: InvokeAI patches LoRAs through CustomLinear's _patches_and_weights, not PEFT BaseTunerLayer.
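
To illustrate why hooks dispatched around forward survive the class swap, here is a toy model of that dispatch in plain Python. It is a sketch under assumptions: `ToyModule`, `ToyCustomLinear`, and the string-valued `dtype` field are hypothetical stand-ins, and the `__call__` below only approximates what nn.Module._call_impl does.

```python
class ToyModule:
    """Toy stand-in for nn.Module hook dispatch (hypothetical)."""

    def __init__(self):
        self.pre_hooks = []
        self.post_hooks = []
        self.dtype = "fp8"  # pretend storage dtype
        self.log = []

    def forward(self, x):
        self.log.append(f"Linear.forward dtype={self.dtype}")
        return x

    def __call__(self, x):
        for hook in self.pre_hooks:
            hook(self)
        try:
            # `forward` is resolved normally, so a subclass override is used.
            return self.forward(x)
        finally:
            # Runs even if forward raises, similar in spirit to always_call=True.
            for hook in self.post_hooks:
                hook(self)


class ToyCustomLinear(ToyModule):
    def forward(self, x):
        self.log.append(f"CustomLinear.forward dtype={self.dtype}")
        return super().forward(x)


def to_compute_dtype(module):
    module.dtype = "bf16"  # upcast before forward


def to_storage_dtype(module):
    module.dtype = "fp8"  # restore storage dtype after forward


layer = ToyModule()
layer.pre_hooks.append(to_compute_dtype)
layer.post_hooks.append(to_storage_dtype)

# Swap in the subclass while sharing __dict__, as the summary describes
# for apply_custom_layers_to_model.
custom = ToyCustomLinear.__new__(ToyCustomLinear)
custom.__dict__ = layer.__dict__
custom(1)
print(custom.log)
print(custom.dtype)
```

The hooks are stored as data next to the module rather than in place of `forward`, so the subclass's `forward` runs with the upcast dtype and the storage dtype is restored afterwards.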

Add a regression test that asserts the ModelMixin branch calls _apply_fp8_to_nn_module and not enable_layerwise_casting.

Related Issues / Discussions

https://discord.com/channels/1020123559063990373/1508132779164962850

Reported on Discord: FP8 storage on FLUX.2 Klein 9B crashes with
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
at Flux2FeedForward.linear_out inside ff_context.

Stack trace points to the diffusers LayerwiseCastingHook wrapper (diffusers/hooks/hooks.py:189 → torch/nn/modules/linear.py:125).

QA Instructions

Repro (pre-fix):

  1. Install FLUX.2 Klein 9B (Diffusers format) + matching Qwen3 8B encoder.
  2. In Model Manager → FLUX.2 Klein 9B → enable FP8 Storage.
  3. Constrain VRAM so partial loading kicks in (e.g. set max VRAM well below 9 GB, or run on a 12 GB GPU with other models cached).
  4. Generate an image with the Flux2 Klein workflow.
  5. Pre-fix: crash with the device-mismatch RuntimeError at linear_out in the first transformer block.
  6. Post-fix: generation completes normally.

Regression coverage:

  • pytest tests/backend/model_manager/load/test_load_default_fp8.py — 13 tests, all green. The new test_apply_fp8_layerwise_casting_uses_hook_path_for_model_mixin fails on the pre-fix code (it would observe enable_layerwise_casting being called) and passes on the fix.
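
The shape of such a regression test can be sketched with unittest.mock. This is not the test from the PR: `FakeModelMixin`, `Loader`, and `apply_fp8` are hypothetical stand-ins, and the real signature of _apply_fp8_to_nn_module is not shown in this description, so a single-argument form is assumed.

```python
from unittest import mock


class FakeModelMixin:
    """Hypothetical stand-in for a diffusers ModelMixin instance."""

    def enable_layerwise_casting(self, **kwargs):
        raise AssertionError("diffusers path must not be used")


class Loader:
    """Hypothetical loader; only the dispatch under test is modelled."""

    def _apply_fp8_to_nn_module(self, model):
        pass  # the real code would register forward hooks here

    def apply_fp8(self, model):
        # Post-fix behaviour: every module takes the hook-based path.
        self._apply_fp8_to_nn_module(model)


def test_model_mixin_uses_hook_path():
    loader = Loader()
    model = FakeModelMixin()
    with mock.patch.object(loader, "_apply_fp8_to_nn_module") as hook_path, \
            mock.patch.object(model, "enable_layerwise_casting") as diffusers_path:
        loader.apply_fp8(model)
    # The hook-based path is taken exactly once with the model...
    hook_path.assert_called_once_with(model)
    # ...and the diffusers path is never touched.
    diffusers_path.assert_not_called()
```

A pre-fix loader that called `model.enable_layerwise_casting(...)` for ModelMixin instances would trip `assert_not_called`, which matches the failure mode described for the new test.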

Also verify the existing FP8 paths still work:

  • FLUX.1 + FP8 on Diffusers format → should still cast and infer correctly.
  • FLUX.1 + FP8 on single-file checkpoint → already used _apply_fp8_to_nn_module, behavior unchanged.
  • SDXL / SD1 + FP8 → should still work.
  • LoRA on top of an FP8 base model → CustomLinear._autocast_forward_with_patches branch should fire (covered by test_wrap_forward_reaches_custom_linear_after_apply_custom_layers).

Merge Plan

Straight merge. No DB or schema changes. No frontend changes. Cache invalidation on the FP8 toggle already exists (drop_model on settings change), so a user toggling FP8 off/on after pulling this PR will get the fixed loader on next load.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions (bot) added the python, backend, and python-tests labels May 24, 2026
@lstein lstein added the v6.13.x label May 25, 2026
@lstein lstein moved this to 6.13.x Theme: MODELS in Invoke - Community Roadmap May 25, 2026
@JPPhoto
Collaborator

JPPhoto commented May 25, 2026

It looks like this needs a documentation update.

invokeai/docs/src/content/docs/configuration/fp8-storage.mdx:26 and :34 still say InvokeAI's FP8 path uses Diffusers enable_layerwise_casting, but PR 9231 removes that path for ModelMixin and routes every torch.nn.Module through InvokeAI's hook-based _apply_fp8_to_nn_module path at invokeai/backend/model_manager/load/load_default.py:221-233. This leaves the user-facing FP8 Storage docs stale for anyone diagnosing FP8 behavior or hardware impact.