gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer by PhilEgly · Pull Request #24 · AtomicBot-ai/atomic-llama-cpp-turboquant

PhilEgly · 2026-06-04T00:46:03Z

Summary

The GEMMA4 hparam-loading path in src/llama-model.cpp already disables KV reuse when shared_kv_layers leaves no dedicated KV layers (lines 1620-1626 of master):

if (hparams.n_layer > 0 && hparams.n_layer_kv_from_start <= 0) {
    LLAMA_LOG_WARN(...);
    hparams.n_layer_kv_from_start = (int32_t) hparams.n_layer;
}

The GEMMA4_ASSISTANT path immediately below does the same n_layer - shared_kv_layers computation but is missing the guard. This PR mirrors the existing GEMMA4 protection.

Why this matters

For drafters where block_count == shared_kv_layers, n_layer_kv_from_start ends up at 0. Downstream tensor-create code treats this as a 0-length vector and on MSVC debug-mode iterators it surfaces as invalid vector subscript; in release builds it's UB and depends on the heap layout.

This matters specifically for the published google/gemma-4-26B-A4B-it-assistant drafter (and almost certainly the 31B variant): converting it with the in-tree convert_hf_to_gguf.py produces metadata with:

gemma4_assistant.block_count               = 4
gemma4_assistant.attention.shared_kv_layers = 4

which yields n_layer_kv_from_start = 4 - 4 = 0 and the load fails. The Edge variants (E2B / E4B) appear to have different shared_kv_layers values and don't hit this edge case, which is consistent with the AtomicChat-published GGUFs currently shipping for E2B / E4B / 31B but not 26B-A4B or the new 12B Unified assistant.

Reproduction

Pull safetensors from HF:

huggingface-cli download google/gemma-4-26B-A4B-it-assistant --local-dir ./gemma-4-26B-A4B-it-assistant

Convert to GGUF with the in-tree converter (no patch needed at this step — the converter already handles this arch):

python convert_hf_to_gguf.py ./gemma-4-26B-A4B-it-assistant \
    --outfile gemma-4-26B-A4B-it-assistant-bf16.gguf --outtype bf16

Verify with the in-tree verifier (passes — file is structurally correct):

python scripts/verify-gemma4-assistant-gguf.py gemma-4-26B-A4B-it-assistant-bf16.gguf
# -> "ok: token_embd.weight shape=(1024, 262144) embedding_length_kv=1024"

Quantize to Q8_0 (optional, succeeds):

build/bin/llama-quantize gemma-4-26B-A4B-it-assistant-bf16.gguf gemma-assistant-mtp.gguf Q8_0

Launch the server with the fork's run-script defaults — fails on master:

build/bin/llama-server \
    --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --mtp-head gemma-assistant-mtp.gguf \
    --spec-type mtp \
    --n-gpu-layers 99 --gpu-layers-draft 99 \
    -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
    --jinja --port 8080
# ...
# llama_model_load: error loading model: invalid vector subscript
# llama_model_load_mtp_from_file: failed to load assistant

With this PR applied, the warning fires and load proceeds past this checkpoint:

load_hparams: gemma4_assistant KV sharing metadata leaves no dedicated KV layers (n_layer=4, shared_kv_layers=4); disabling reuse

Caveat — second issue exists downstream

After this patch, the loader still fails with the same "invalid vector subscript" message later in the drafter's tensor-create loop for the 26B drafter. This PR does not fix that second issue — it only restores parity with the GEMMA4 path's guard. I have not yet root-caused the second site (it appears to be a per-layer hparam array indexing issue in or near the tensor-create loop for LLM_ARCH_GEMMA4_ASSISTANT). I'm happy to investigate further if the maintainers would like, but wanted to ship this trivial parity fix independently since it's a clean ~7-line change.

Test environment

Windows 11 Pro for Workstations, Visual Studio 2022 BuildTools 17.14 (MSVC 14.44.35207)
CUDA Toolkit 13.3.33, RTX 5070 Ti (sm_120 / compute capability 12.0)
Built with cmake -G Ninja -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DCMAKE_BUILD_TYPE=Release against commit 0a635dcd9 (master HEAD at time of patch)
Repro target: unsloth/gemma-4-26B-A4B-it-GGUF UD-Q4_K_XL (17 GB) + self-converted Q8_0 drafter

…ers == n_layer The GEMMA4 hparam-loading path already disables KV reuse when shared_kv_layers leaves no dedicated KV layers, but the GEMMA4_ASSISTANT path next to it does not. For 26B/31B assistants where block_count == shared_kv_layers == 4, this leaves hparams.n_layer_kv_from_start at 0 and downstream tensor-creation code hits a 0-length vector subscript (visible on Windows debug-iterators as "invalid vector subscript"; UB elsewhere). Mirrors the existing GEMMA4 protection a few lines above. Reproduces with google/gemma-4-26B-A4B-it-assistant converted via convert_hf_to_gguf.py. Edge variants (E2B/E4B) and the new 2026-06-03 12B Unified assistant likely have different shared_kv_layers values that avoid this edge case, which is why current AtomicChat-published GGUFs do not exhibit it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer#24

gemma4_assistant: protect n_layer_kv_from_start against shared_kv_layers == n_layer#24
PhilEgly wants to merge 1 commit into
AtomicBot-ai:feature/turboquant-kv-cachefrom
PhilEgly:fix/gemma4-assistant-shared-kv-protective-check

PhilEgly commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PhilEgly commented Jun 4, 2026

Summary

Why this matters

Reproduction

Caveat — second issue exists downstream

Test environment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant