gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer #24
Open
PhilEgly wants to merge 1 commit into
…ers == n_layer

The GEMMA4 hparam-loading path already disables KV reuse when shared_kv_layers leaves no dedicated KV layers, but the GEMMA4_ASSISTANT path next to it does not. For 26B/31B assistants where block_count == shared_kv_layers == 4, this leaves hparams.n_layer_kv_from_start at 0 and downstream tensor-creation code hits a 0-length vector subscript (visible on Windows debug-iterators as "invalid vector subscript"; UB elsewhere).

Mirrors the existing GEMMA4 protection a few lines above.

Reproduces with google/gemma-4-26B-A4B-it-assistant converted via convert_hf_to_gguf.py. Edge variants (E2B/E4B) and the new 2026-06-03 12B Unified assistant likely have different shared_kv_layers values that avoid this edge case, which is why current AtomicChat-published GGUFs do not exhibit it.
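The guard described above can be illustrated with a small standalone sketch. It is not the code in src/llama-model.cpp: the struct, the function names, the warning text, and the use of -1 as the "KV reuse disabled" value are all assumptions made for illustration. The only parts taken from the description are the n_layer - shared_kv_layers computation and the block_count == shared_kv_layers == 4 case.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the two loader fields discussed here.
// This is NOT the real llama.cpp hparams struct.
struct kv_hparams_sketch {
    uint32_t n_layer               = 0;
    int32_t  n_layer_kv_from_start = -1; // assumption: negative means "KV reuse disabled"
};

// Computes n_layer_kv_from_start from shared_kv_layers.
// Returns true when the guard fired and KV reuse was disabled.
inline bool apply_shared_kv_guard(kv_hparams_sketch & hp, uint32_t shared_kv_layers) {
    if (shared_kv_layers >= hp.n_layer) {
        // No dedicated KV layers would remain, so skip the subtraction
        // instead of storing 0 (the value that breaks the loader).
        std::fprintf(stderr,
                     "warning: shared_kv_layers (%u) leaves no dedicated KV layers "
                     "out of %u; disabling KV reuse\n",
                     static_cast<unsigned>(shared_kv_layers),
                     static_cast<unsigned>(hp.n_layer));
        hp.n_layer_kv_from_start = -1;
        return true;
    }
    hp.n_layer_kv_from_start = static_cast<int32_t>(hp.n_layer - shared_kv_layers);
    return false;
}

// Self-check covering the failing case from the description and one
// ordinary case with made-up layer counts.
inline void shared_kv_guard_self_check() {
    kv_hparams_sketch drafter;
    drafter.n_layer = 4;                       // block_count == 4
    assert(apply_shared_kv_guard(drafter, 4)); // shared_kv_layers == 4: guard fires
    assert(drafter.n_layer_kv_from_start == -1);

    kv_hparams_sketch other;
    other.n_layer = 30;                        // illustrative values only
    assert(!apply_shared_kv_guard(other, 10));
    assert(other.n_layer_kv_from_start == 20);
}
```

Calling shared_kv_guard_self_check() from a main function exercises both branches. The comparison uses >= rather than == so that a shared_kv_layers value larger than n_layer cannot wrap around in the unsigned subtraction.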
Summary
The GEMMA4 hparam-loading path in src/llama-model.cpp already disables KV reuse when shared_kv_layers leaves no dedicated KV layers (lines 1620-1626 of master).

The GEMMA4_ASSISTANT path immediately below performs the same n_layer - shared_kv_layers computation but is missing the guard. This PR mirrors the existing GEMMA4 protection.

Why this matters
For drafters where block_count == shared_kv_layers, n_layer_kv_from_start ends up at 0. Downstream tensor-create code treats this as a 0-length vector, and on MSVC debug-mode iterators it surfaces as "invalid vector subscript"; in release builds it's UB and depends on the heap layout.

This matters specifically for the published google/gemma-4-26B-A4B-it-assistant drafter (and almost certainly the 31B variant): converting it with the in-tree convert_hf_to_gguf.py produces metadata with block_count == shared_kv_layers == 4, which yields n_layer_kv_from_start = 4 - 4 = 0, and the load fails. The Edge variants (E2B / E4B) appear to have different shared_kv_layers values and don't hit this edge case, which is consistent with the AtomicChat-published GGUFs currently shipping for E2B / E4B / 31B but not 26B-A4B or the new 12B Unified assistant.

Reproduction
    python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it-assistant \
        --outfile gemma-4-26B-A4B-it-assistant-bf16.gguf --outtype bf16

    python scripts/verify-gemma4-assistant-gguf.py gemma-4-26B-A4B-it-assistant-bf16.gguf
    # -> "ok: token_embd.weight shape=(1024, 262144) embedding_length_kv=1024"

    build/bin/llama-server \
        --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
        --mtp-head gemma-assistant-mtp.gguf \
        --spec-type mtp \
        --n-gpu-layers 99 --gpu-layers-draft 99 \
        -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
        --jinja --port 8080
    # ...
    # llama_model_load: error loading model: invalid vector subscript
    # llama_model_load_mtp_from_file: failed to load assistant

With this PR applied, the warning fires and load proceeds past this checkpoint.

Caveat: second issue exists downstream
After this patch, the loader still fails for the 26B drafter with the same "invalid vector subscript" message, later in the drafter's tensor-create loop. This PR does not fix that second issue; it only restores parity with the GEMMA4 path's guard. I have not yet root-caused the second site (it appears to be a per-layer hparam array indexing issue in or near the tensor-create loop for LLM_ARCH_GEMMA4_ASSISTANT). I'm happy to investigate further if the maintainers would like, but wanted to ship this trivial parity fix independently since it's a clean ~7-line change.

Test environment
cmake -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DCMAKE_BUILD_TYPE=Release against commit 0a635dcd9 (master HEAD at time of patch)
unsloth/gemma-4-26B-A4B-it-GGUF UD-Q4_K_XL (17 GB) + self-converted Q8_0 drafter