feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs#1477
fszontagh wants to merge 92 commits into
Add runtime tensor offloading to enable running large models (Q8+) on GPUs with limited VRAM by dynamically moving components between GPU and CPU memory.

- `cond_only`: Offload cond_stage (LLM/CLIP) after conditioning
- `cond_diffusion`: Offload both cond_stage and diffusion after use
- `aggressive`: Offload each component immediately after use

- Add OffloadConfig struct with mode, flags for cond_stage/diffusion
- Add move_params_to_cpu/gpu methods to GGMLRunner
- Add set_auto_offload() to control automatic offloading behavior
- Implement on-demand reload before conditioning/diffusion steps
- Track VRAM usage for offloaded components

Enables 1024x1024 generation with Z-Image Q8 (~7GB) + Qwen3-4B Q8 (~4GB) + VAE (~320MB) on 12GB GPU by offloading the ~4GB LLM after conditioning completes, freeing VRAM for diffusion compute buffers.

Without offloading: CUDA OOM during diffusion
With cond_only offload: Successful generation in ~66s

Tested configurations:
- offload_mode=none: OOM at 1024x1024 with Q8 models
- offload_mode=cond_only: Success, ~66s generation time
- offload_mode=cond_only + vae_tiling: Success, ~149s
Expose the dynamic tensor offloading feature through CLI options:
- --offload-mode: Set offload mode (none, cond_only, cond_diffusion, aggressive)
- --offload-log: Enable offload event logging
- --no-offload-log: Disable offload event logging

The cond_only mode is particularly useful for 12GB GPUs running large Q8 models with LLMs, as it offloads the LLM/CLIP to CPU after conditioning, freeing VRAM for diffusion compute buffers.

Changes:
- Add sd_offload_mode_name() and str_to_offload_mode() helper functions
- Add sd_offload_config_init() for default configuration
- Add offload_config member to SDContextParams
- Wire offload_config through to_sd_ctx_params_t()
- Add CLI options in get_options()
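A minimal sketch of what a name/enum round trip for the four modes could look like. The enum, the `_sketch` function names, and the sentinel value are illustrative assumptions, not the PR's actual definitions of sd_offload_mode_name() or str_to_offload_mode():

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the offload-mode enum (assumption, not the PR's header).
enum OffloadModeSketch {
    OFFLOAD_SKETCH_NONE,
    OFFLOAD_SKETCH_COND_ONLY,
    OFFLOAD_SKETCH_COND_DIFFUSION,
    OFFLOAD_SKETCH_AGGRESSIVE,
    OFFLOAD_SKETCH_COUNT  // sentinel: also returned for unknown strings
};

// Names in enum order, matching the values accepted by --offload-mode.
static const char* const kOffloadModeNames[OFFLOAD_SKETCH_COUNT] = {
    "none", "cond_only", "cond_diffusion", "aggressive"};

// Enum -> string; out-of-range values map to "unknown".
const char* offload_mode_name_sketch(OffloadModeSketch mode) {
    int i = static_cast<int>(mode);
    return (i >= 0 && i < OFFLOAD_SKETCH_COUNT) ? kOffloadModeNames[i] : "unknown";
}

// String -> enum; returns the sentinel when the string is not a known mode.
OffloadModeSketch str_to_offload_mode_sketch(const std::string& s) {
    for (int i = 0; i < OFFLOAD_SKETCH_COUNT; ++i) {
        if (s == kOffloadModeNames[i]) {
            return static_cast<OffloadModeSketch>(i);
        }
    }
    return OFFLOAD_SKETCH_COUNT;
}
```

Returning a sentinel for unknown input lets a CLI parser reject a mistyped mode instead of silently falling back to a default.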
When dynamic offloading is enabled and the LLM/CLIP model was offloaded to CPU, attempting to reload it to GPU could fail if there's not enough VRAM available.

Previously, the code logged a misleading warning "conditioning will run on CPU (slower)" but then crashed (SEGV) because:
1. move_params_to_gpu() failed and returned false
2. Code continued to call get_learned_condition()
3. compute() tried offload_params_to_runtime_backend() which failed again
4. compute() returned false but caller didn't check return value
5. Code tried to use uninitialized data, causing SEGV

Fix:
- Return NULL from generate_image/generate_video when GPU reload fails
- Return false from load() if initial GPU move fails
- This gives callers a proper error to handle instead of crashing

The user will see a clear error message suggesting to reduce resolution, use smaller models, or disable dynamic offloading.
When offload_mode is enabled and LoRAs are being applied, the cond_stage (LLM/CLIP) may still be on GPU from initial model loading. This uses up VRAM and causes LoRA allocation to fail with OOM.

Fix: Before applying LoRAs in generate_image(), check if:
1. offload_mode is enabled
2. offload_cond_stage is true
3. We have LoRAs to apply
4. cond_stage is currently on GPU

If all conditions are met, offload cond_stage to CPU first to free VRAM for LoRA allocation. The cond_stage will be reloaded on-demand before conditioning runs.

This allows using LoRAs with large LLM models (like qwen3-4b) on 12GB GPUs that would otherwise OOM during LoRA allocation.
When cond_stage reload fails due to LoRA buffers using VRAM:
1. Free LoRA buffers to make room
2. Retry cond_stage reload
3. Reload LoRA weights from disk

Added reload_params() method to LoraModel to support reloading weights after buffer is freed and reallocated.

This enables using LoRA with cond_only offload mode on GPUs where cond_stage + LoRA can't both fit alongside diffusion model.
- Add enable_offload parameter to LoraModel constructor - Enable CPU offload for LoRA when dynamic offloading is active - Use move_params_to_cpu()/move_params_to_gpu() for fast memory transfers instead of free_params_buffer()/reload_params() disk I/O This makes LoRA offloading ~10-50ms instead of ~500-1000ms from disk.
When offload mode is enabled, GGMLRunner has both: - params_buffer (CPU) - runtime_params_buffer (GPU) The destructor only freed params_buffer, causing GPU memory to leak when LoRA models were destroyed while on GPU. This caused OOM errors after multiple generations with LoRAs.
- Add sd_vram_estimation_t enum for estimation method selection
  - SD_VRAM_EST_DRYRUN (default): accurate graph-based estimation
  - SD_VRAM_EST_FORMULA: faster formula-based approximation
- Add estimate_compute_buffer_size() to GGMLRunner for dry-run allocation that returns required buffer size without allocating
- Add estimate_vae_decode_vram() to calculate VAE decode requirements using either dry-run or formula method
- Add smart_offload_for_vae() that estimates VRAM needed and offloads only what's necessary before VAE decode
- Call smart_offload_for_vae() before decode in image and video generation paths

This enables smarter offloading - only offload components when actually needed based on accurate VRAM estimation.
- Add get_free_vram() helper to query actual GPU memory via CUDA
- Add estimate_diffusion_vram() for diffusion sampling memory estimate
- Add should_offload_cond_stage_for_diffusion() smart check
- Add should_offload_diffusion_for_vae() smart check
- Replace unconditional offload with VRAM-aware decisions
- Only offload when free_vram < next_phase_needs + 300MB margin
- Apply to both txt2img and img2img/video generation paths
- Update common.hpp for vram_estimation struct field order

On larger GPUs, components stay on GPU between phases for speed. On tight VRAM, offloading still occurs as needed.
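The offload condition above reduces to a single comparison. The sketch below is an illustration only: the function name is hypothetical, and treating the "300MB" margin as 300 MiB is an assumption.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kMiB = 1024ull * 1024ull;
// Safety margin from the commit message; 300 MiB is an assumed interpretation.
constexpr uint64_t kOffloadMarginBytes = 300ull * kMiB;

// Hypothetical helper: offload the previous component only when the next
// phase (plus the margin) would not fit in the VRAM that is currently free.
bool should_offload_sketch(uint64_t free_vram_bytes, uint64_t next_phase_needs_bytes) {
    return free_vram_bytes < next_phase_needs_bytes + kOffloadMarginBytes;
}
```

With 4096 MiB free and a next phase that needs 4000 MiB, the margin tips the decision toward offloading; with 8192 MiB free the component stays resident.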
- Add reload_diffusion field to sd_offload_config_t struct
- Default to true (matches previous always-reload behavior)
- Make post-generation reload of diffusion model respect config
- Update both txt2img and video generation paths
- Allows keeping diffusion offloaded between generations for batch work

Benchmark results on 12GB GPU with Z-Image Q8_0:
- no_reload: 29-30s generation, 1.9GB GPU after
- reload: 32s generation, 8.1GB GPU after
New CLI options:
- --offload-cond-stage / --no-offload-cond-stage
- --offload-diffusion / --no-offload-diffusion
- --reload-cond-stage / --no-reload-cond-stage
- --reload-diffusion / --no-reload-diffusion
- --vram-estimation [dryrun|formula]

Also adds:
- sd_vram_estimation_name() and str_to_vram_estimation() API functions
- Extended toString() output showing all offload config details
This commit adds the foundation for layer-by-layer tensor streaming, enabling models larger than VRAM to run by loading weights on-demand.

New components:
- TensorRegistry: Tracks individual tensor locations (GPU/CPU) by layer
- MemoryBudgetManager: Manages VRAM budget with eviction policies
- LayerExecutionEngine: Orchestrates per-layer execution with prefetch

Integration:
- FluxRunner gains enable_layer_streaming() for streaming mode
- New SD_OFFLOAD_LAYER_STREAMING offload mode
- CLI: --offload-mode layer_streaming

This is the infrastructure foundation. Per-block execution will be added in subsequent commits.
GGMLBlock stores tensor names in its internal `params` map hierarchy, but never calls ggml_set_name() on the actual GGML tensors. This caused register_from_context() to get empty names for all tensors, mapping everything to the "_global" layer (resulting in "registered 1 layers"). Fix: Add register_from_map() method that takes the tensor map from get_param_tensors(), which preserves proper tensor names like "model.diffusion_model.double_blocks.5.img_attn.qkv.weight". Result: 58 layers now registered correctly for Flux models (19 double_blocks + 38 single_blocks + 1 _global) instead of just 1.
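To illustrate how a tensor name can be mapped to a layer key, here is a minimal sketch. The function name and the exact key format are assumptions; only the block prefixes, the `_global` bucket, and the `double_blocks` example name come from the commit message above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: extract "double_blocks.N" or "single_blocks.N" from a
// tensor name; everything else falls into the shared "_global" layer.
std::string flux_layer_key_sketch(const std::string& tensor_name) {
    static const std::string prefixes[] = {"double_blocks.", "single_blocks."};
    for (const std::string& prefix : prefixes) {
        std::size_t pos = tensor_name.find(prefix);
        if (pos == std::string::npos) {
            continue;
        }
        // Consume the decimal block index that follows the prefix.
        std::size_t start = pos + prefix.size();
        std::size_t end   = start;
        while (end < tensor_name.size() &&
               std::isdigit(static_cast<unsigned char>(tensor_name[end]))) {
            ++end;
        }
        if (end > start) {
            return tensor_name.substr(pos, end - pos);
        }
    }
    return "_global";
}
```

An empty name, which is what the unnamed GGML tensors produced, lands in `_global`; that matches the "registered 1 layers" symptom described above.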
…cking 1. Skip move_params_to_gpu() for diffusion model in layer_streaming mode - Before sampling: don't bulk-load entire diffusion model to GPU - After generation: don't reload diffusion in streaming mode 2. Fix tensor name tracking in TensorRegistry::move_layer_to_gpu - Use stored tensor names instead of relying on ggml_get_name() - GGMLBlock doesn't call ggml_set_name() on original tensors Known issue: Graph context invalidation in streaming path needs fixing (alloc_compute_buffer resets compute_ctx after graph is built)
Two critical fixes for layer streaming mode:

1. Flux preprocessing: Add to_backend() calls for input tensors
   - The regular build_graph() converts external tensors to compute_ctx
   - Streaming preprocessing was missing this, causing mul_mat assertions
   - Now properly converts x, context, timesteps, y, guidance to backend

2. UNet streaming: Add skip_param_offload parameter to compute()
   - In streaming mode, weights are managed by the streaming engine
   - The regular compute() was trying to bulk-allocate all weights to GPU
   - This failed with OOM because streaming only loads layers on demand
   - New skip_param_offload=true prevents this bulk allocation

Testing: Successfully generated 512x512 image with SDXL model using --offload-mode layer_streaming, 4 steps completed in 3.78s
MMDiT has no skip connections, making it ideal for layer streaming: - Added mmdit_layer_pattern() to parse joint_blocks.N tensor names - Added streaming infrastructure to MMDiTRunner (enable/disable/compute) - Added compute_streaming() that loads all joint_blocks before execution - Wired MMDiTModel to DiffusionModel streaming interface MMDiT structure: - 24 joint_blocks (each with context_block + x_block) - Global tensors: x_embedder, t_embedder, y_embedder, context_embedder, final_layer
WAN has sequential transformer blocks ideal for streaming: - Added wan_layer_pattern() to parse blocks.N and vace_blocks.N tensor names - Added streaming infrastructure to WanRunner (enable/disable/compute) - Added compute_streaming() that loads all blocks before execution - Wired WanModel to DiffusionModel streaming interface WAN structure: - 30-40 blocks.N (main transformer blocks) - Optional vace_blocks.N (VACE interleaved blocks) - Global tensors: patch_embedding, text_embedding, time_embedding, head
- Add qwen_image_layer_pattern() for 60 transformer_blocks - Add zimage_layer_pattern() for context_refiner + noise_refiner + layers - Add streaming infrastructure to QwenImageRunner and ZImageRunner - Wire both models to DiffusionModel streaming interface - Update compute() methods to accept skip_param_offload parameter All 6 diffusion model architectures now support layer streaming.
- Add ref_latents and increase_ref_index parameters to compute_streaming - Update FluxModel::compute_streaming to pass ref_latents - Convert ref_latents to backend in preprocessing graph - Handle ref_latents patchification and concatenation Note: Flux streaming still has tensor context issue in preprocessing that needs investigation.
The per-layer mini-graph approach was architecturally broken because:
1. GGML tensors are bound to their compute context
2. alloc_compute_buffer() resets context internally
3. Intermediate results cannot be passed between separate graphs

Changed to coarse-stage approach:
1. Load all model weights to GPU via streaming engine
2. Execute full compute graph with skip_param_offload=true
3. This matches the working UNet streaming implementation

Also added skip_param_offload parameter to FluxRunner::compute()
In layer_streaming mode, the cond_stage (T5) must be offloaded before layer streaming begins, otherwise there won't be enough VRAM for the diffusion model layers. Changes: - Set free_params_immediately=false for layer_streaming mode in CLI This enables smart offload logic instead of immediate param freeing - Add explicit layer_streaming check in should_offload_cond_stage_for_diffusion() Forces T5 offload regardless of VRAM heuristics Without this fix, T5 (~9GB) stays on GPU while layer streaming tries to load Flux layers (~6.5GB), causing OOM on 12GB cards. Tested with Flux Schnell Q4_K + T5XXL fp16 on RTX 3060 12GB: - T5 properly offloaded after conditioning - Layer streaming loads all 58 layers successfully - Image generation completes without OOM
Implements the same coarse-stage layer streaming approach used by Flux, MMDiT, UNet, and other models for the new Anima diffusion model. Changes: - tensor_registry.hpp: Add anima_layer_pattern() for net.blocks.N extraction - anima.hpp: Add streaming engine, enable/disable/compute_streaming methods - diffusion_model.hpp: Add AnimaModel streaming wrapper methods Anima has 28 transformer blocks by default, similar in structure to other DiT models, making it a good candidate for VRAM offloading on memory-constrained systems.
AnimaConditioner: - Add GPU offloading methods (is_params_on_gpu, move_params_to_cpu, move_params_to_gpu, get_params_vram_size, set_auto_offload) delegating to underlying LLM - This enables proper VRAM management for Anima's Qwen3 text encoder Layer streaming state consistency: - Skip diffusion model state manipulation in layer_streaming mode - The TensorRegistry uses direct buffer pointer swapping which leaves GGMLRunner's internal state (params_on_runtime_backend) out of sync - Querying or manipulating diffusion offload state after streaming would cause crashes due to this inconsistency - cond_stage offload still works normally (not managed by streaming) Tested: Anima model generates identical output with and without layer_streaming enabled (verified via MD5 hash comparison)
Problem: After layer streaming completes, all diffusion model layers remain on GPU. For large models like QwenImage (8.6GB), this leaves insufficient VRAM for VAE decoding. Solution: Add offload_streaming_layers() method to all streaming-enabled models that moves all layers back to CPU before VAE decode. Changes: - Add offload_streaming_layers() to DiffusionModel base interface - Implement in all runners: UNet, MMDiT, Flux, Anima, Wan, QwenImage, ZImage - Add override methods in all Model wrapper classes - Call offload_streaming_layers() in stable-diffusion.cpp before VAE decode This enables running models larger than VRAM: - QwenImage Edit (16GB model) now runs on 12GB GPU via layer_streaming - Tested: Anima streaming produces identical output with ~1% overhead
- Add staged forward methods to QwenImageModel: - forward_input_stage(): patchify + input projections - forward_single_block(): execute one transformer block - forward_output_stage(): norm + proj + unpatchify - Implement compute_streaming_true() for QwenImage that: - Executes each of the 60 transformer blocks as a separate mini-graph - Stores intermediate img/txt tensors in CPU memory between blocks - Loads/offloads ~140MB per block during execution - Enables running 8.5GB+ models on 12GB VRAM GPUs - Update all model architectures (Flux, MMDiT, Anima, WAN, ZImage, UNet) with improved VRAM checking in compute_streaming() This is true per-layer streaming where only ONE block's weights plus activation memory is needed at any time, enabling models larger than available VRAM to run. Tested with Qwen-Image-Edit-2509-Q3_K_S.gguf (8.5GB) on RTX 3060 12GB.
…utput read

Bug: When compute() was called with free_compute_buffer_immediately=true, the buffer holding output tensors was freed before ggml_backend_tensor_get() could read them, causing "CUDA error: invalid device ordinal".

Fixes:
1. alloc_compute_buffer() now returns graph via out_gf parameter for reuse
2. compute() reuses graph from alloc_compute_buffer to avoid tensor mismatch
3. copy_data_to_backend_tensor() skips tensors without allocated buffers
4. All TRUE per-layer streaming stages now use free_compute_buffer_immediately=false and manually call free_compute_buffer() after reading outputs

Affected models: Flux, MMDiT, Anima, UNet, ZImage, QwenImage
- Add estimate_vae_encode_vram() for VRAM estimation before encoding - Add smart_offload_for_vae_encode() to offload cond_stage and diffusion models before VAE encode operations - Call smart_offload_for_vae_encode() before all encode_first_stage() and vae_encode() calls across generate_image and generate_video paths: - img2img init image encoding - ref image encoding (for edit modes) - control net image encoding - video frame encoding (WAN, VACE, Anima) This prevents OOM during VAE encoding of large images by freeing VRAM from models not needed during the encode phase. With layer_streaming mode, this allows encoding images that previously caused OOM.
Key changes: - Add async prefetch methods to LayerExecutionEngine: prefetch_layer(), wait_for_prefetch(), wait_for_all_prefetches() - Add AsyncLoadState struct and async layer load methods to TensorRegistry: start_async_layer_load(), complete_async_layer_load() - Use ggml_backend_tensor_copy_async() to overlap memory transfers with GPU computation during TRUE per-layer streaming - Update qwen_image.hpp to start prefetching next block before computing current block, reducing GPU idle time - Fix sd_offload_config_t initialization with correct field order - Offload diffusion model layers to CPU at startup when layer_streaming mode is enabled, freeing VRAM for LLM/CLIP conditioning This enables overlapped memory transfers during per-layer streaming, reducing periodic GPU pauses caused by blocking PCIe transfers.
Adds async prefetching pattern to overlap PCIe memory transfer with GPU computation during layer streaming. Before computing each block, prefetch the next block's weights asynchronously. Models updated: - Flux: double_blocks and single_blocks loops - UNet: input_blocks and output_blocks loops - MMDiT: joint_blocks loop - ZImage: layers loop - Anima: blocks loop Note: WAN model doesn't have true per-layer streaming yet (uses full graph).
When using CFG (multiple model calls per diffusion step), the VRAM check didn't account for layers already loaded on GPU. This caused the second CFG call to see full VRAM and switch to slow TRUE per-layer streaming. Now tracks already_on_gpu and only checks remaining_to_load against available VRAM. Second+ CFG calls complete in ~0.15s instead of 3+ seconds. Applied to all 7 architectures: Flux, UNet, MMDiT, ZImage, Anima, WAN, QwenImage
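A minimal sketch of the corrected check, assuming a simple per-layer record; the struct and function names are hypothetical. Only layers that are not yet resident count against free VRAM, so a second CFG call with everything already loaded stays on the coarse-stage path.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-layer record; the PR's registry tracks more than this.
struct LayerInfoSketch {
    uint64_t bytes;   // size of the layer's weights
    bool     on_gpu;  // true if the layer is already resident on the GPU
};

// Returns true when the layers that still need loading fit into free VRAM.
// Layers that are already resident are skipped rather than counted again.
bool fits_on_gpu_sketch(const std::vector<LayerInfoSketch>& layers, uint64_t free_vram_bytes) {
    uint64_t remaining_to_load = 0;
    for (const LayerInfoSketch& layer : layers) {
        if (!layer.on_gpu) {
            remaining_to_load += layer.bytes;
        }
    }
    return remaining_to_load <= free_vram_bytes;
}
```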
UNet's compute_streaming had four bugs that didn't surface until SDXL + --max-vram pushed the planner into per-layer mode:

1. Coarse-stage path called regular compute() without skip_param_offload=true, double-allocating UNet params on the runtime backend (4.79 GB ZImage, 4.79 GB SDXL). Other architectures already pass true; only unet.hpp was missing it.

2. forward_input_block() called resblock_forward() for every input_blocks.X.0 entry, but at indices 3 and 6 the slot is a DownSampleBlock -- the dynamic_pointer_cast<ResBlock> returned null and the next forward() segfaulted silently. Now dispatches DownSampleBlock vs ResBlock by actual type.

3. forward_output_block() called attention_layer_forward() for output_blocks.X.1, but on SD1.x's deepest output block (no attention at that resolution) the slot holds an UpSampleBlock, producing the same null-cast crash. Now walks .1 and .2 once each and dispatches UpSampleBlock vs SpatialTransformer by type.

4. get_num_input_blocks()/get_num_output_blocks() returned a hardcoded 12. SDXL has 9, tiny_unet variants have gaps. Replaced with a scan of the blocks map for the actual max index, so the streaming loop iterates over indices the model actually has.

Verified with --max-vram cap forcing per-layer streaming on SDXL 1024x1024, SD1.5 512x512, plus regression on Z-Image bf16, Z-Image Q8, Flux schnell, Chroma, Anima, Qwen Image, and SD3.5 Large.
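The max-index scan from item 4 could look roughly like the sketch below. It assumes block keys of the form `input_blocks.N.M` held in a plain list of strings; the function name and the key container are illustrative, not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the largest block index found after `prefix`
// among the given keys, or -1 when no key carries that prefix.
int max_block_index_sketch(const std::vector<std::string>& keys, const std::string& prefix) {
    int max_idx = -1;
    for (const std::string& key : keys) {
        if (key.compare(0, prefix.size(), prefix) != 0) {
            continue;  // key does not start with the prefix
        }
        std::size_t pos = prefix.size();
        int idx         = 0;
        bool has_digit  = false;
        while (pos < key.size() && std::isdigit(static_cast<unsigned char>(key[pos]))) {
            idx       = idx * 10 + (key[pos] - '0');
            has_digit = true;
            ++pos;
        }
        if (has_digit) {
            max_idx = std::max(max_idx, idx);
        }
    }
    return max_idx;
}
```

A streaming loop would then iterate up to the returned index and skip indices that are absent, which also covers variants with gaps.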
Layer streaming streams the diffusion model's params from CPU pinned to GPU one block at a time, but the VAE was sitting GPU-resident through the entire sampler loop even though it's only used at decode time. On Z-Image bf16 with no --offload-to-cpu master switch, that wasted ~300 MB of VRAM that the per-block compute buffer needed and produced mid-stream cudaMalloc failures (e.g. layer 19 needing 539 MiB). Two pieces: 1. Internal escalation: when offload_config.mode == LAYER_STREAMING, construct the VAE with offload_params_to_cpu=true regardless of the user-facing --offload-to-cpu master switch. This mirrors the existing escalation for cond_stage and diffusion. The user's master flag is preserved as a separate knob. 2. Opportunistic offload: if the VAE somehow ended up on GPU (not the default path under streaming, but possible via VAE backend construction quirks), park it on its CPU-pinned twin between cond_stage and the sampler loop via the existing move_params_to_cpu swap. The next decode_first_stage call reloads it via the runner's normal compute path. Generic across architectures — every VAE/TAE variant (AutoEncoderKL, WanVAERunner, TinyImage/VideoAutoEncoder, FakeVAE) flows through the same vae_offload_to_cpu plumbing.
This effort is really exciting! Is there a way to automatically select the appropriate configuration so users don't need to manually do so? It would be nice if these performance improvements Just Worked ™️ (similar to how llama.cpp's |
Some pieces are already auto: e.g. our A Roughly:
Mid-stream cudaMalloc OOM (e.g. compute-buffer alloc fails at layer N because the resident warm cache + new compute buffer don't fit) leaves the streaming engine's GPU residency in place — the success path's offload_streaming_layers() at the end of the sampler loop never runs on the failure path. Result: the next job inherits 8-9 GB of stale streaming layers on GPU, has no headroom for its own compute buffer, and fails at roughly the same layer index. Manually retrying the same job hits the same OOM in a feedback loop. Add an explicit offload_streaming_layers() call on every sampling failure return path: txt2img, hires, video high-noise, video low-noise. Cheap because each layer's CPU-pinned twin already exists, so the eviction is just pointer swaps. This restores the invariant that "between jobs, GPU is clean enough for the next compute_streaming_true to start fresh," matching the success path.
Picks up 8 commits since the previous sync at 90e87bc:

  0b82969 docs: add .github/pull_request_template.md
  381e0df docs: add CONTRIBUTING.md
  0665a7f feat: add hidream o1 image support (leejet#1485)
  eeac950 fix: Use PkgConfig for WebP and WebM (leejet#1400)
  57ff2eb feat: support for memory-mapping model weights (leejet#1414)
  9d68341 feat: add Euler CFG++ and Euler-A CFG++ samplers (leejet#1354)
  60477fd docs: add new go bindings for stable-diffusion.cpp (leejet#1480)
  6ee0684 feat: display server url with "http://" prefix. (leejet#1486)

Conflicts, all in src/ggml_extend.hpp:

1. copy_data_to_backend_tensor signature: upstream made gf required (graph-cut needs the segment's graph to restrict uploads); our layer-streaming path needs gf=nullptr so each mini-graph uploads its full backend_tensor_data_map without filtering. Resolution: keep gf optional (default nullptr) and guard the graph_tensor_set filter on gf != nullptr. Upstream's new read_graph_tensor<T> template is added unchanged above copy_data_to_backend_tensor.

2. Tensor-loop null check: upstream added tensor/data null guards and a single ggml_get_name() lookup. Kept both, with our gf-gate layered on top of upstream's set-membership check.

3. alloc_params_buffer: upstream's mmap fast-path (skip allocation when every tensor already has data, since ggml_backend_alloc_ctx_tensors would hit n_buffers==0) and our pinned-host fast-path (allocate weights in the GPU device's host buffer for async H2D under offload) collide on the same function. Resolution: mmap check runs first and returns early -- mmapped tensors can't be moved into pinned host memory -- then the pinned-host path runs for the non-mmap CPU-params-with-GPU-runtime case, then the original pageable params_backend alloc as the final fallback.
Smoke-tested on Z-Image-Turbo Q8 at 512x512: --offload-mode layer_streaming -> 4.0s total (coarse-stage path) --offload-to-cpu --max-vram 4 -> 8.3s total (3 graph-cut segments) HiDream O1 streaming hooks deferred to a follow-up commit.
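The alloc_params_buffer resolution in the merge notes above is an ordered three-way choice. The sketch below only models that ordering; the enum, the function, and its boolean inputs are hypothetical stand-ins for the checks the real function performs.

```cpp
#include <cassert>

// Hypothetical labels for the three allocation paths named in the merge notes.
enum class ParamsAllocPathSketch {
    MmapNoAlloc,  // every tensor already has data (mmapped): skip allocation
    PinnedHost,   // CPU params with a GPU runtime: pinned host buffer for async H2D
    Pageable      // original pageable params_backend allocation (fallback)
};

// Order matters: the mmap check runs first because mmapped tensors cannot be
// moved into pinned host memory.
ParamsAllocPathSketch choose_params_alloc_path_sketch(bool all_tensors_have_data,
                                                      bool params_on_cpu,
                                                      bool runtime_on_gpu) {
    if (all_tensors_have_data) {
        return ParamsAllocPathSketch::MmapNoAlloc;
    }
    if (params_on_cpu && runtime_on_gpu) {
        return ParamsAllocPathSketch::PinnedHost;
    }
    return ParamsAllocPathSketch::Pageable;
}
```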
HiDream O1's diffusion forward pass is structurally an LLM transformer
applied to a concatenated [text-tokens | image-tokens] sequence, with
three small heads on each side (token embed, t_embedder, x_embedder up
front; final RMSNorm + final_layer2 + slice + unpatchify + velocity at
the end). The dominant param size lives in language_model.layers.N — so
streaming those LLM blocks is the natural per-layer unit.
Pieces:
- tensor_registry.hpp: hidream_o1_layer_pattern matches
"language_model.layers.N" substrings. Everything else (embed_tokens,
norm, t_embedder1, x_embedder, final_layer2) maps to _global and
stays resident.
- llm.hpp: TextModel grows two small helpers that expose individual
block access without disrupting forward_embeds — forward_layer_block
(run one TransformerBlock) and forward_final_norm (run the trailing
RMSNorm). Both are public, additive, and don't change shared
text-encoder code paths.
- hidream_o1.hpp: HiDreamO1Runner gets enable_layer_streaming,
compute_streaming (coarse-stage fallback when model fits), and
compute_streaming_true (three-stage execution):
Stage 1 inputs_embeds prelude: embed + image-embed splice +
t_embedder concat + patchify x + ref concat + x_embedder
+ final concat. Output read to pinned host.
Stage 2 per-layer LLM forward: one mini-graph per
language_model.layers.i. attention_mask and input_pos are
precomputed CPU-side once and re-bound into each layer's
graph (they don't change layer-to-layer). Layer weights
stream in, compute runs, hidden state reads back, layer
evicts.
Stage 3 final RMSNorm + final_layer2 + slice (x_pred_start ..
+target_tokens) + unpatchify + (x - x_pred)/sigma velocity
prediction.
- diffusion_model.hpp: HiDreamO1Model wrapper now overrides the layer
streaming interface and routes compute_streaming through
StreamingParamConverter, with inline conversion of the
image_embeds vector<pair<int, Tensor>> since the converter doesn't
have a pair-vector helper.
Untested against a real HiDream O1 checkpoint (no model file
available locally). Z-Image-Turbo Q8 layer_streaming and max-vram
graph-cut both regression-tested at 512x512 step=1 to confirm no
collateral damage to existing runners. Upstream's --offload-to-cpu
--max-vram path already worked for HiDream O1 via the LLM module's
mark_graph_cut calls; this commit adds our --offload-mode
layer_streaming path on top.
VERSION_HIDREAM_O1's branch constructed its conditioner and diffusion model with offload_params_to_cpu — the user-facing master flag — while every other model (z_image, qwen_image, anima, flux, wan, ...) uses the per-component cond_stage_offload_to_cpu / diffusion_offload_to_cpu escalated flags. That meant --offload-mode layer_streaming couldn't escalate HiDream's params onto CPU before allocation, so the 16 GB bf16 checkpoint went straight into cudaMalloc on a 12 GB GPU and OOMed before sampling could begin. Switch HiDream O1 to the same per-component pattern. Verified end-to- end on the docs example (1024x1024, 4 steps, cfg-scale=1.0, seed=42): true per-layer streaming runs 36 LLM transformer blocks in ~2.84 s per step, full image generation in 19.34 s, output renders the requested sign text legibly.
Nice job! This looks like a solid direction overall. I have a few review comments, mostly around keeping the execution path unified. Conceptually, layer streaming is still segment execution: mark boundaries, split the graph, load the segment's params, execute, then evict or keep them resident. If we build pinned host buffers, next-segment/layer prefetch, and resident-block caching on top of the existing graph split path, we can keep one generic execution framework and avoid carrying complex per-model manual split code for Flux, Z-Image, Qwen, Wan, UNet, etc. Ideally the actual split/execution machinery should be shared with the current graph split implementation. For cross-stage component placement, I may be missing some details, but it looks like this overlaps with the newer |
13 new upstream commits since previous sync at 0b82969. The big one is leejet#1500 (module backend assignment): ~1.5k LOC churn that splits backend code into a new ggml_extend_backend.{h,cpp} pair and replaces every runner's (backend_t backend, bool offload_params_to_cpu) constructor arg with (backend_t runtime, backend_t params). New CLI flags --backend te=cpu,vae=cuda0,... and --params-backend te=cpu,vae=cpu,...

Other notable upstream changes folded in:

  3633072 module backend assignment (leejet#1500)
  38b14ad --max-vram -1 auto-detect (leejet#1498)
  67dda3f LTX 2.3 architecture (leejet#1463)
  06accf2 LTXAV latent2rgb projection
  9d68341 Euler/DDIM unification (leejet#1474)
  cde20d5 stereo handling in sd_audio
  d7ecbe1 T5 EOS dedup in Anima
  bd17f53 / 0c1ca17 / 839f6a9 / 3b4d26f ROCm/docs/CI
  db08b84 GCC 16 build fix
  686856e fake-VAE log demotion
  0b82969 / 381e0df PR template + CONTRIBUTING.md

Conflicts:
- examples/common/common.cpp, include/stable-diffusion.h: kept our offload_config alongside upstream's new backend/params_backend strings. sd_ctx_params_t now carries both axes.
- src/lora.hpp: dropped our enable_offload bool. The new params_backend argument expresses the same intent (CPU = offload).
- src/hidream_o1.hpp: kept params_prefix member, switched constructor to upstream's (backend, params_backend) signature.
- src/stable-diffusion.cpp: every runner-construction site took upstream's backend_for(MODULE) / params_backend_for(MODULE) lookups. Removed the dead cond_stage/diffusion/vae_offload_to_cpu local-bool derivation; replaced with calls to a new SDBackendManager::force_module_params_backend(MODULE, "cpu") helper that mutates params_assignment_ after init_backend() runs. The offload_config-driven escalations now land in the same data structure upstream's --params-backend writes to.

Post-merge fixups surfaced by retesting HiDream O1 streaming:
- src/llm.hpp: TextModel.forward_final_norm now casts to LLMRMSNorm, not RMSNorm. Upstream changed the "norm" block's concrete type; our pre-merge cast returned nullptr and crashed on first forward().
- src/hidream_o1.hpp: Stage 1 of compute_streaming_true scales inputs_embeds by sqrt(hidden_size) when params.llm.normalize_input, matching what forward_embeds does. No-op for HiDream O1 today but keeps the streaming path drift-free if a future arch flips it.

Smoke-tested on 12 GB GPU:
  Z-Image-Turbo Q8 layer_streaming -> 4.32 s
  HiDream O1 bf16 dev layer_streaming -> 17.44 s (4 steps, 1024x1024)
- Forward-declare GGMLRunner as struct (matches actual definition) - Drop redundant #include <ggml.h> (arrives via layer_streaming.hpp) - Use __LAYER_STREAMING_EXECUTOR_HPP__ guard (matches sibling files) - Clarify post_compute lifecycle for chunk-K and output_stage in doc comments
Per-layer load/compute/evict cycle. Chunk-K resident graph + profiling land in subsequent tasks. Nothing calls run_streaming yet — first caller arrives with the HiDream O1 migration in Task 5. Adds a friend declaration in struct GGMLRunner so the executor can reach the protected streaming_engine_ handle and analyze_vram_budget() helper without widening visibility for unrelated members.
- Free compute buffer on per-layer failure paths (prevents shape-mismatch reuse on subsequent invocations) - Correct header doc to reflect actual cleanup contract (caller handles layer eviction via offload_streaming_layers; executor only frees its own compute buffer) - Warn (don't silently ignore) when output_stage.post_compute is set - Drop unused <cstdlib>/<cstring> includes
Lets callers pre-dispatch their chunk-K resident-layer mega-graph (via the existing LayerStreaming::ChunkGraph helper) and have the executor pick up streaming from layer K onwards. Default 0 means stream every layer, matching current behavior.
The previous commit gave start_layer_idx a default, which forced output_out and output_ctx to also gain nullptr defaults (C++ contiguous-defaults rule). Default-nullptr output params would let callers silently produce no output. Drop all three defaults; every caller must explicitly pass start_layer_idx (typically 0 or K) and the output handles.
Hoist the per-layer timing locals from z_image's hand-written streaming path into the shared executor. Every migrated runner now reports wait/load/advance/compute/evict microseconds per sampling step when SDCPP_STREAM_PROFILE=1.
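One way such per-step counters could be gathered is with a scoped timer that adds elapsed microseconds to a bucket. This is a sketch under assumptions: `StepProfile`, `ScopedTimer`, and `stream_profile_enabled` are hypothetical names, and treating exactly the string "1" as "enabled" is a guess at how the environment variable is parsed.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical per-sampling-step buckets, one per phase named in the commit.
struct StepProfile {
    std::int64_t wait_us    = 0;
    std::int64_t load_us    = 0;
    std::int64_t advance_us = 0;
    std::int64_t compute_us = 0;
    std::int64_t evict_us   = 0;
    std::int64_t total_us() const {
        return wait_us + load_us + advance_us + compute_us + evict_us;
    }
};

// Assumption: profiling is on only when the variable is exactly "1".
inline bool stream_profile_enabled() {
    const char* v = std::getenv("SDCPP_STREAM_PROFILE");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Adds the lifetime of the object, in microseconds, to *sink when enabled.
class ScopedTimer {
public:
    ScopedTimer(std::int64_t* sink, bool enabled)
        : sink_(enabled ? sink : nullptr),
          start_(std::chrono::steady_clock::now()) {
        assert(!enabled || sink != nullptr);
    }
    ~ScopedTimer() {
        if (sink_ == nullptr) {
            return;  // profiling disabled: leave the bucket untouched
        }
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        *sink_ += std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    }
    ScopedTimer(const ScopedTimer&)            = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::int64_t* sink_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each phase in a block such as `{ ScopedTimer t(&profile.load_us, enabled); /* load layer */ }` keeps the timing code out of the phase logic itself.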
First migration to LayerStreaming::run_streaming. compute_streaming_true drops from ~280 LOC to ~210 LOC: three builder lambdas + run_streaming call. Per-layer load/evict/prefetch/buffer-lifecycle now lives in the executor.

Also fix a latent bug in run_stage: when a post_compute is attached and free_buffer_after=true, the prior code freed the compute buffer before post_compute ran, so ggml_backend_tensor_get on a captured output handle read from a freed allocation. Defer the free until after post_compute completes.

Verified: hidream_o1_image_dev_bf16 cat test at 1024x1024 4 steps seed 42 produces a visually identical cat; Z-Image streaming (still hand-written) regression-clean.
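The ordering fix in run_stage can be illustrated with a small stand-in. `ComputeBuffer`, `Stage`, and this `run_stage` are hypothetical simplifications, not the executor's real types; the point is only that the read-back callback runs while the buffer is still allocated, and the free happens afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a backend compute buffer holding a stage's output.
struct ComputeBuffer {
    std::vector<float> data;
    bool allocated = false;
    void alloc(std::size_t n) { data.assign(n, 0.0f); allocated = true; }
    void release() { data.clear(); allocated = false; }
};

struct Stage {
    std::function<void(ComputeBuffer&)> compute;             // fills the buffer
    std::function<void(const ComputeBuffer&)> post_compute;  // optional read-back
    bool free_buffer_after = false;
};

inline void run_stage(const Stage& stage, ComputeBuffer& buf, std::size_t n) {
    assert(stage.compute);
    buf.alloc(n);
    stage.compute(buf);
    // Read-back must see a live buffer, so it runs before any free.
    if (stage.post_compute) {
        stage.post_compute(buf);
    }
    // Only after post_compute has finished is it safe to release the buffer.
    if (stage.free_buffer_after) {
        buf.release();
    }
}
```

Swapping the last two `if` blocks reproduces the bug class described above: the callback would read from a buffer that has already been released.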
Most complex migration: two persisted activations (txt_img + t_emb),
refiners in Stage 1, chunk-K resident-layer dispatch via the existing
LayerStreaming::ChunkGraph helper, then per-layer streaming for the
non-resident block via the executor's run_streaming() with
start_layer_idx=K.
Chunk-K dispatch stays per-model (inside Stage 1's post_compute, after
refiner output reaches host) since the chunk graph's input descriptors
are model-specific. The executor's start_layer_idx parameter from
Task 3 is what makes this clean.
Refiner layers (context_refiner.N, noise_refiner.N) are loaded at the
top of Stage 1's build_graph -- after the executor's _global load and
before the refiner forward calls -- so they stay GPU-resident through
the streaming loop without polluting the executor with model-specific
"_global_extras" concepts.
Verified:
- Z-Image-Turbo Q8 512x512 1 step: coarse path, 3.91s, IDAT-identical
to baseline /tmp/postmerge_zimage_stream.png
- Z-Image-Turbo bf16 1024x688 4 steps: per-layer + chunk-K both fire
("layer cache: 17 resident, 13 streamed per step"); 16.78s,
coherent cat output
- HiDream O1 regression: 16.64s, cat with sign
Net diff: -114 LOC.
Two persisted activations (txt + img, both update per-layer) plus t_emb. No chunk-K today; prev_gpu_output factory parameter is wired for executor-contract parity but unused. Layer name pattern: transformer_blocks.N. Verified: Qwen Image Q4_0 13B streaming smoke (per-layer engaged via --max-vram 4 cap, 40 streamed layers); HiDream + Z-Image regression-clean.
Per-layer factory dispatches by layer_idx: double_blocks for the early phase (returns updated img+txt pair), single_blocks for the later phase (concatenated [txt|img] stream). Layer name pattern follows the same split. No smoke test in this commit -- memory budget; full smoke matrix runs after Task 13.
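A sketch of how a flat executor index could be mapped onto the two Flux phases. `flux_layer_name` is a hypothetical helper; only the `double_blocks.N` / `single_blocks.N` key shapes come from the code quoted elsewhere in this PR, and the block counts are parameters rather than claims about any particular model.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the executor sees one flat range of layers, while Flux
// stores them as double blocks followed by single blocks.
inline std::string flux_layer_name(int layer_idx,
                                   int num_double_blocks,
                                   int num_single_blocks) {
    assert(num_double_blocks >= 0 && num_single_blocks >= 0);
    assert(layer_idx >= 0 && layer_idx < num_double_blocks + num_single_blocks);
    if (layer_idx < num_double_blocks) {
        // Early phase: separate img and txt streams.
        return "double_blocks." + std::to_string(layer_idx);
    }
    // Later phase: concatenated [txt|img] stream, indexed from zero again.
    return "single_blocks." + std::to_string(layer_idx - num_double_blocks);
}
```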
Lift per-layer load/compute/evict/prefetch boilerplate into LayerStreaming::run_streaming. WAN's high-noise / low-noise diffusion split is unchanged — each WanModel instance still gets its own streaming_engine_ independently. No smoke test in this commit — memory budget; full smoke matrix runs after Task 13.
Anima's compute_streaming_true previously open-coded the streaming loop: direct registry.move_layer_to_gpu / prime_prefetch / wait_for_prefetch / advance_prefetch / move_layer_to_cpu around an inline per-block dispatch. This was a real per-block streamer (unlike WAN's placeholder), so the migration lifts the three stages (input prelude, per-block, output) verbatim into the LayerStreaming::run_streaming three-lambda pattern. State that previously lived on the stack now lives as AnimaRunner members so the lambdas can read/write across executor boundaries: stage1_*_out_ tensor handles, x_ne_ / context_ne_ / embedded_ts_ne_ / temb_ne_ shape arrays, and persistent_*_ pinned host buffers with matching std::vector fallbacks. context is optional in some Anima variants — persistent_context_ stays nullptr when stage1_context_out_ is null, mirroring the original behavior. Layer naming uses "blocks.N" (registry-side key produced by anima_layer_pattern from "net.blocks.N"); start_layer_idx=0 (no chunk-K dispatch); the executor evicts every streamed layer unconditionally, same as before. resident_blocks_ is still computed on the first invocation for logging parity. LOC delta: +212 / -260 (net -48).
Rewrites MMDiTRunner::compute_streaming_true on top of
LayerStreaming::run_streaming using the standard three-lambda pattern
(input_stage / per-layer factory / output_stage), replacing the bespoke
inline _global-load + per-block compute loop.
The previous implementation was already a real per-block streamer (not a
placeholder): Stage 1 ran forward_input_stage to produce x / context /
c_mod and persisted them into pinned host buffers, Stage 2 iterated
joint_blocks.{i} with sync load + wait_for_prefetch + move_layer_to_cpu,
and Stage 3 ran forward_output_stage + unpatchify_and_crop. The new
factory mirrors that behavior verbatim against the shared executor:
- input_stage.post_compute reads back x / c_mod (and context when
non-null) into persistent_* member buffers; resident_joint_blocks_ is
decided on first invocation as before for logging parity.
- The per-block factory rebinds x_in / c_mod_in / context_in from host
buffers each iteration (prev_gpu_output ignored; no chunk-K dispatch
path for MMDiT today) and reads layer_x_out_ / layer_context_out_
back via ggml_backend_tensor_get in post_compute.
- skip_layers is honored via a trivial no-op stage (matching Flux's
pattern) so persistent activations pass through unchanged, mirroring
the previous `continue` semantics.
- output_stage.build_graph runs forward_output_stage + unpatchify_and_crop;
the executor writes results into output / output_ctx.
Streaming state (stage1_*_out_, layer_*_out_, x_ne_ / context_ne_ /
c_mod_ne_, persistent_* buffers + fallback vectors) is lifted into
MMDiTRunner members so the captured-by-this lambdas can survive across
stages.
Net: -41 lines.
3-phase architecture (input_blocks -> middle_block -> output_blocks) with skip connections persisted to host across phases. Treats the diffusion as num_input + 1 + num_output 'layers' for the executor; the per-block factory dispatches by phase to the existing forward_input_block / forward_middle_block / forward_output_block helpers (which already encode the DownSample/UpSample type-dispatch fixes from commit dbd4a35). No smoke test in this commit -- memory budget; full smoke matrix runs after Task 13.
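The "num_input + 1 + num_output layers" view can be sketched as an index-to-phase mapping. `UNetPhase`, `UNetSlot`, and `unet_slot` are hypothetical names for illustration; the real factory dispatches to the forward_* helpers named above.

```cpp
#include <cassert>

enum class UNetPhase { Input, Middle, Output };

// Which phase a flat executor index falls into, and the index within it.
struct UNetSlot {
    UNetPhase phase;
    int local_idx;
};

// Hypothetical mapping: [0, num_input) are input blocks, index num_input is
// the single middle block, and the remainder are output blocks.
inline UNetSlot unet_slot(int layer_idx, int num_input, int num_output) {
    assert(num_input >= 0 && num_output >= 0);
    assert(layer_idx >= 0 && layer_idx < num_input + 1 + num_output);
    if (layer_idx < num_input) {
        return {UNetPhase::Input, layer_idx};
    }
    if (layer_idx == num_input) {
        return {UNetPhase::Middle, 0};
    }
    return {UNetPhase::Output, layer_idx - num_input - 1};
}
```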
After migrating all 8 runners to LayerStreaming::run_streaming (Tasks 5-12), sweep each runner for code orphaned by the migration: member variables that no longer have a reader, private helpers that only the old compute_streaming_true called, etc.
- hidream_o1: drop unused persistent_inputs_embeds_fallback.
- qwen_image: drop logging-only resident_transformer_blocks_ and the old StreamingState struct + copy_tensor_to_storage / create_tensor_from_storage helpers.
- flux: drop logging-only resident_double_blocks_ / resident_single_blocks_, plus Flux::StreamingContext and the forward_preprocessing / forward_double_block(StreamingContext) / forward_single_block(StreamingContext) / forward_postprocessing helpers and the FluxRunner::streaming_ctx_ member that used them.
- anima: drop logging-only resident_blocks_.
- mmdit: drop logging-only resident_joint_blocks_.
- unet: drop cfg.keep_layers_behind override (only consulted by the unused LayerExecutionEngine::execute_streaming path).

Kept intentionally: z_image's chunk_graph_ / dispatch_resident_chunk / resident_layer_count_ (chunk-K dispatch lives in z_image's Stage 1 post_compute), and all forward_* inner-model helpers (called by the migrated lambdas). The two public forward_double_block / forward_single_block overloads in flux.hpp (the ones returning ggml_tensor* / std::pair, not bool) stay; those are the ones the migrated lambdas call.
Good evening (or good day), thanks for this awesome PR!! I tried it on my own system (Vulkan, AMD RX 580 8GB, Arch Linux, flux-2-klein-9b-Q8_0.gguf) and got a segmentation fault (core dumped). Using the AI (I don't know much about ML or C++), the AI suggested the following:

Cause of the bug:
For Flux.2 and Flux.2 Klein models, In the original non-streaming path, these global modulations are precalculated and passed to

Suggested Fix:
To resolve this, we can make In

```cpp
std::pair<ggml_tensor*, ggml_tensor*> forward_double_block(GGMLRunnerContext* ctx,
                                                           int block_idx,
                                                           ggml_tensor* img,
                                                           ggml_tensor* txt,
                                                           ggml_tensor* vec,
                                                           ggml_tensor* pe,
                                                           ggml_tensor* txt_img_mask,
                                                           std::vector<ModulationOut>& ds_img_mods,
                                                           std::vector<ModulationOut>& ds_txt_mods) {
    if (params.share_modulation && ds_img_mods.empty()) {
        auto double_stream_modulation_img = std::dynamic_pointer_cast<Modulation>(blocks["double_stream_modulation_img"]);
        auto double_stream_modulation_txt = std::dynamic_pointer_cast<Modulation>(blocks["double_stream_modulation_txt"]);
        ds_img_mods = double_stream_modulation_img->forward(ctx, vec);
        ds_txt_mods = double_stream_modulation_txt->forward(ctx, vec);
    }
    auto block = std::dynamic_pointer_cast<DoubleStreamBlock>(blocks["double_blocks." + std::to_string(block_idx)]);
    auto img_txt = block->forward(ctx, img, txt, vec, pe, txt_img_mask, ds_img_mods, ds_txt_mods);
    return img_txt;
}
```

And modify

```cpp
ggml_tensor* forward_single_block(GGMLRunnerContext* ctx,
                                  int block_idx,
                                  ggml_tensor* txt_img,
                                  ggml_tensor* vec,
                                  ggml_tensor* pe,
                                  ggml_tensor* txt_img_mask,
                                  std::vector<ModulationOut>& ss_mods) {
    if (params.share_modulation && ss_mods.empty()) {
        auto single_stream_modulation = std::dynamic_pointer_cast<Modulation>(blocks["single_stream_modulation"]);
        ss_mods = single_stream_modulation->forward(ctx, vec);
    }
    auto block = std::dynamic_pointer_cast<SingleStreamBlock>(blocks["single_blocks." + std::to_string(block_idx)]);
    return block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods);
}
```

After following his advice, everything went smoothly and very quickly. Sorry for using AI
I was just hit by the same problem while testing Wan2.2 with the exact same card on the exact same board (PCIe 3). I can make Wan2.2 work with WAN2GP, but I would really love to make it work with sd.cpp instead so that I can use the GGUF version of the model.
25 new upstream commits since the previous sync. Highlights:
  3a8788c refactor: unify extra argument parsing (leejet#1540)
  449165c feat: stream LTX VAE temporal tile decoding (leejet#1539)
  adaa599 Feat: Temporal tile custom size with overlap (leejet#1510)
  2e35146 perf: run LTX audio VAE decode in one ggml graph (leejet#1538)
  47d8198 feat: add taeltx2_3_wide support (leejet#1535)
  ef92a00 feat: add graph cut markers for LTXAV transformer (leejet#1534)
  b3374e6 feat: add LTX spatial latent upscale hires support (leejet#1533)
  bdd937f feat: add taeltx2/taeltx2.3 support (leejet#1531)
  c51ec7c fix: always load runtime lora params on runtime backend (leejet#1532)
  e7eb92f feat: add Gradient Estimation sampler (leejet#1484)
  50134e5 refactor: split guidance composition (leejet#1506)
  e43b24c feat: add ltx2.3 flf2v support (leejet#1505)
  b706d68 fix: restore singleton dims for LLM outputs (leejet#1518)
  b758b7d fix: only enable TAE after successful load (leejet#1517)
  f683c88 feat: make negative max_vram control the amount of spare vram (leejet#1503)
  baf7eda refactor: minify vocab files (leejet#1509)
  22c8c40 sync: update ggml (leejet#1520)
plus 8 CI / docs / docker fixes.

Conflict resolution: src/stable-diffusion.cpp had a single conflict in the video-generation post-sampling block. Our HEAD had the smart-offload-for-VAE-decode hook (move diffusion model to CPU when free_params_immediately is false and VRAM is tight). Upstream added the LTX spatial latent upscale hires path that runs a second sampler invocation. Both pieces are needed and they're complementary: smart offload is video-agnostic and runs only on the non-upscale code path; the upscale block manages its own params lifecycle through its own sampler+free invocation. Resolution: upstream's `if (latent_upscale_enabled)` block kept as-is, and our smart-offload + free_params_immediately handling moved into the matching `else` branch. No semantic change to either feature.

All other touched files (include/stable-diffusion.h, src/llm.hpp, src/ggml_extend.hpp, src/diffusion_model.hpp, examples/common/...) auto-merged cleanly. Our additions (friend declaration in ggml_extend for the streaming executor, forward_layer_block / forward_final_norm helpers on LLM::TextModel, offload_config field on sd_ctx_params_t) all interoperate with the upstream changes. Build is clean.

Smoke test: Z-Image-Turbo Q8 generates a valid cat image at 512x512 after the merge. Host CUDA driver currently shows NVML version mismatch (220s wallclock); requires driver reload to re-validate expected timings.
Models with share_modulation=true (Flux 2, Flux 2 Klein) do NOT instantiate local img_mod / txt_mod / modulation blocks inside DoubleStreamBlock and SingleStreamBlock (flux.hpp:272, 285). Their modulation is computed once at the parent Flux level and threaded into each block via ds_img_mods / ds_txt_mods / ss_mods vectors.

The non-streaming path computes these in forward_input_stage and passes them all the way through forward_orig. The layer-streaming path, however, has always constructed FRESH empty vectors inside its per-block factory (preserved across the Task 8 migration). When the block forward sees an empty mod vector, it falls back to its local modulation block, which is nullptr under share_modulation, triggering a null-pointer dereference and an immediate segfault.

Bug surfaced for the first time when flux-2-klein-9b-Q8_0 hit our streaming path. PR leejet#1477 comment from @AndriiParf with stack-trace analysis from an AI tool, confirmed by reading the code: empty ds_img_mods/ds_txt_mods/ss_mods at the per-block call site, share_modulation guard in the DoubleStreamBlock/SingleStreamBlock constructors that skips local-modulation instantiation, block->forward unconditional dereference of the local pointer.

Fix: in Flux::forward_double_block and Flux::forward_single_block, when share_modulation is active and the incoming mod vectors are empty, compute the shared modulations from `vec` on demand using the parent-level Modulation blocks (always _global resident, so always on GPU during streaming). Adds one Linear forward per block per step (sub-millisecond aggregate), but avoids the much-more-invasive alternative of persisting Stage-1 ModulationOut tensors to host buffers and re-binding them per layer.

Coarse-stage path unaffected: forward_input_stage still precomputes the mods and the non-empty vectors short-circuit the on-demand guard.

A separate report from @nArn0 on PR leejet#1477 describes a WAN 2.2 segfault on the same RX 580 / Vulkan / PCIe 3 hardware. WAN's transformer is structurally different (no share_modulation; modulation is a per-block weight parameter at params["modulation"]). That report likely involves either Vulkan-specific streaming hazards already documented in vulkan_compat.md notes, or a different latent issue in the per-block streaming path that Task 9's migration newly exercises. Not addressed here; needs a stack trace to localize.
feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs
Why
Two problems that come up on small GPUs running large diffusion models:
This PR adds a single new flag, --offload-mode, that handles cross-stage placement, plus a per-layer streaming path (--offload-mode layer_streaming) for the doesn't-fit-at-all case.
New CLI flags
- --offload-mode <mode>: none, cond_only, cond_diffusion, aggressive, layer_streaming. Default none.
- --offload-cond-stage / --no-offload-cond-stage
- --offload-diffusion / --no-offload-diffusion
- --offload-log / --no-offload-log
- --vram-estimation <method>: dryrun (probe graph) or formula (analytic).
- --streaming-prefetch <N>
- --streaming-min-vram <MB>
What each mode does
- none (default)
- cond_only
- cond_diffusion
- aggressive
- layer_streaming
How layer streaming works
Three pieces, each a known-but-effective optimization at a different layer of the stack:
cudaMemcpyAsync actually goes async (a pageable source falls through to a synchronous bounce-buffer copy in the driver).
A unified VRAM heuristic decides automatically which layers stay resident and which stream, based on actual free VRAM. Users don't have to pick a budget manually.
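To make the resident-versus-streamed decision concrete, here is a greedy sketch: keep a prefix of layers resident until the free-VRAM budget, minus a reserve for compute buffers, is used up, and stream the rest. `Placement`, `plan_residency`, and the reserve parameter are hypothetical; the PR's actual heuristic may weigh things differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Placement {
    int resident_count;  // leading layers kept on the GPU
    int streamed_count;  // remaining layers loaded and evicted each step
};

// Hypothetical greedy plan: walk the layers in execution order and keep each
// one resident while it still fits in (free VRAM - reserve).
inline Placement plan_residency(const std::vector<std::uint64_t>& layer_bytes,
                                std::uint64_t free_vram_bytes,
                                std::uint64_t reserve_bytes) {
    const std::uint64_t budget =
        free_vram_bytes > reserve_bytes ? free_vram_bytes - reserve_bytes : 0;
    std::uint64_t used = 0;
    int resident = 0;
    for (std::uint64_t sz : layer_bytes) {
        if (sz > budget - used) {
            break;  // this layer and every later one will be streamed
        }
        used += sz;
        ++resident;
    }
    const int total = static_cast<int>(layer_bytes.size());
    assert(resident <= total);
    return {resident, total - resident};
}
```

A prefix-based split like this matches the shape of the "N resident, M streamed per step" log line quoted in the Z-Image commit, though that correspondence is an inference rather than something the PR states.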
Benchmarks - RTX 3060 (12 GB), PCIe 3.0 x16
Hardware: RTX 3060 12 GB. The card itself supports PCIe 4.0, but the board is DDR3-era so the slot is capped at PCIe 3.0 x16 (8.0 GT/s). PCIe bandwidth is the dominant cost during streaming, so faster boards (PCIe 4.0 x16, ~24 GB/s practical) should reduce these numbers materially.
All numbers below: batch_count=4, steps=12, resolution=688x1024, LoRA applied at runtime, same prompt/seed across configs.
Z-Image-Turbo bf16 (11.5 GB diffusion model — does NOT fit in 12 GB)
Workload: 4 images per generation, 12 sampling steps each, batch=4. This is where streaming matters most — without offload of some kind, the model can't even load.
- --offload-mode layer_streaming
- --offload-to-cpu --max-vram 9
Z-Image-Turbo Q8 (6.7 GB diffusion model; fits in VRAM, but VAE compute buffer doesn't)
Workload: 4 images per generation, 12 sampling steps each, batch=4. When the model fits, streaming gives up most of its advantage and the simpler existing offload paths are slightly faster. Listed for completeness.
- --offload-to-cpu
- --vae-tiling
- --offload-mode layer_streaming
- --offload-to-cpu --max-vram 6
- --vae-on-cpu
So the recommendation in the docs is: pick --offload-mode layer_streaming when the model doesn't fit (where it's ~2× faster than alternatives), and stick with the existing --offload-to-cpu (or no offload) when it does. --offload-mode none (default) keeps current master behaviour.
Architectures
The streaming runtime is shared via tensor_registry.hpp, layer_streaming.hpp, memory_budget.hpp. Verified end-to-end on RTX 3060:
Implemented and built but not personally verified by me - appreciate someone with the hardware/models confirming:
Known issues
- --lora-apply-mode immediately + --offload-mode layer_streaming crashes - the immediate folder reaches into weight buffers that haven't been uploaded to GPU yet under streaming. Use at_runtime (default auto already picks this in streaming mode). Pre-existing class of issue surfaced by streaming.
- dryrun is more accurate but adds a small startup cost. Switch to dryrun if you hit OOM during the first step.
Backwards compatibility
Default behaviour is unchanged. --offload-mode none matches current master byte-for-byte. All new flags are opt-in.
Bug fixes folded in
While exercising the offload paths I found and fixed a small set of pre-existing bugs. They're independent of the new offload modes and benefit users who never set --offload-mode. Happy to split these into a separate small PR if preferred.
- GGMLRunner destructor leaked runtime_params_buffer and partial_runtime_params_buffer. free_params_buffer() only released the CPU-side params_buffer. When the runner had been staged onto the runtime backend (any offload mode active, including the segmented offload from feat: add max-vram based segmented param offload #1476), the GPU-side weight buffer(s) leaked on destruction. Real leak under LoRA + offload: many short-lived runners are created during LoRA application. Two-line additions to the destructor.
- aggressive mode.
- MultiLoraAdapter fix (already merged into master); will rebase to drop that commit at PR time.
Documentation
docs/vram_offloading.md covers the modes, decision tree, and example commands.