Skip to content

gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer#24

Open
PhilEgly wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
PhilEgly:fix/gemma4-assistant-shared-kv-protective-check
Open

gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer#24
PhilEgly wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
PhilEgly:fix/gemma4-assistant-shared-kv-protective-check

Conversation

@PhilEgly
Copy link
Copy Markdown

@PhilEgly PhilEgly commented Jun 4, 2026

Summary

The GEMMA4 hparam-loading path in src/llama-model.cpp already disables KV reuse when shared_kv_layers leaves no dedicated KV layers (lines 1620-1626 of master):

if (hparams.n_layer > 0 && hparams.n_layer_kv_from_start <= 0) {
    LLAMA_LOG_WARN(...);
    hparams.n_layer_kv_from_start = (int32_t) hparams.n_layer;
}

The GEMMA4_ASSISTANT path immediately below does the same n_layer - shared_kv_layers computation but is missing the guard. This PR mirrors the existing GEMMA4 protection.

Why this matters

For drafters where block_count == shared_kv_layers, n_layer_kv_from_start ends up at 0. Downstream tensor-create code treats this as a 0-length vector and on MSVC debug-mode iterators it surfaces as invalid vector subscript; in release builds it's UB and depends on the heap layout.

This matters specifically for the published google/gemma-4-26B-A4B-it-assistant drafter (and almost certainly the 31B variant): converting it with the in-tree convert_hf_to_gguf.py produces metadata with:

gemma4_assistant.block_count               = 4
gemma4_assistant.attention.shared_kv_layers = 4

which yields n_layer_kv_from_start = 4 - 4 = 0 and the load fails. The Edge variants (E2B / E4B) appear to have different shared_kv_layers values and don't hit this edge case, which is consistent with the AtomicChat-published GGUFs currently shipping for E2B / E4B / 31B but not 26B-A4B or the new 12B Unified assistant.

Reproduction

  1. Pull safetensors from HF:
    huggingface-cli download google/gemma-4-26B-A4B-it-assistant --local-dir ./gemma-4-26B-A4B-it-assistant
  2. Convert to GGUF with the in-tree converter (no patch needed at this step — the converter already handles this arch):
    python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it-assistant \
        --outfile gemma-4-26B-A4B-it-assistant-bf16.gguf --outtype bf16
  3. Verify with the in-tree verifier (passes — file is structurally correct):
    python scripts/verify-gemma4-assistant-gguf.py gemma-4-26B-A4B-it-assistant-bf16.gguf
    # -> "ok: token_embd.weight shape=(1024, 262144) embedding_length_kv=1024"
  4. Quantize to Q8_0 (optional, succeeds):
    build/bin/llama-quantize gemma-4-26B-A4B-it-assistant-bf16.gguf gemma-assistant-mtp.gguf Q8_0
  5. Launch the server with the fork's run-script defaults — fails on master:
    build/bin/llama-server \
        --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
        --mtp-head gemma-assistant-mtp.gguf \
        --spec-type mtp \
        --n-gpu-layers 99 --gpu-layers-draft 99 \
        -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
        --jinja --port 8080
    # ...
    # llama_model_load: error loading model: invalid vector subscript
    # llama_model_load_mtp_from_file: failed to load assistant

With this PR applied, the warning fires and load proceeds past this checkpoint:

load_hparams: gemma4_assistant KV sharing metadata leaves no dedicated KV layers (n_layer=4, shared_kv_layers=4); disabling reuse

Caveat — second issue exists downstream

After this patch, the loader still fails with the same "invalid vector subscript" message later in the drafter's tensor-create loop for the 26B drafter. This PR does not fix that second issue — it only restores parity with the GEMMA4 path's guard. I have not yet root-caused the second site (it appears to be a per-layer hparam array indexing issue in or near the tensor-create loop for LLM_ARCH_GEMMA4_ASSISTANT). I'm happy to investigate further if the maintainers would like, but wanted to ship this trivial parity fix independently since it's a clean ~7-line change.

Test environment

  • Windows 11 Pro for Workstations, Visual Studio 2022 BuildTools 17.14 (MSVC 14.44.35207)
  • CUDA Toolkit 13.3.33, RTX 5070 Ti (sm_120 / compute capability 12.0)
  • Built with cmake -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DCMAKE_BUILD_TYPE=Release against commit 0a635dcd9 (master HEAD at time of patch)
  • Repro target: unsloth/gemma-4-26B-A4B-it-GGUF UD-Q4_K_XL (17 GB) + self-converted Q8_0 drafter

…ers == n_layer

The GEMMA4 hparam-loading path already disables KV reuse when shared_kv_layers
leaves no dedicated KV layers, but the GEMMA4_ASSISTANT path next to it does
not. For 26B/31B assistants where block_count == shared_kv_layers == 4, this
leaves hparams.n_layer_kv_from_start at 0 and downstream tensor-creation code
hits a 0-length vector subscript (visible on Windows debug-iterators as
"invalid vector subscript"; UB elsewhere).

Mirrors the existing GEMMA4 protection a few lines above. Reproduces with
google/gemma-4-26B-A4B-it-assistant converted via convert_hf_to_gguf.py.

Edge variants (E2B/E4B) and the new 2026-06-03 12B Unified assistant likely
have different shared_kv_layers values that avoid this edge case, which is
why current AtomicChat-published GGUFs do not exhibit it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant