feat: Qwen 3.5 GDN support with hybrid model fixes #2133

Open

r-dh wants to merge 4 commits into abetlen:main from r-dh:fix/qwen35-struct-alignment

Conversation


@r-dh r-dh commented Mar 4, 2026

Summary

Adds Qwen 3.5 (Gated Delta Network) support, building on the work in #2132 by @codavidgarcia, with additional fixes needed to make hybrid GDN models work correctly:

  • Update llama.cpp submodule and bindings for Qwen 3.5 support (from #2132)
  • Fix CMake build: set BUILD_NUMBER and LLAMA_INSTALL_VERSION for mtmd
  • Return bool from kv_cache_seq_rm so callers can detect when partial memory removal fails
  • Fix prefix-caching in generate(): GDN hybrid models reject partial memory removal via llama_memory_seq_rm(), but the return value was being ignored, causing llama_decode to fail with -1 on subsequent calls. generate() now falls back to full prompt re-evaluation when partial removal is not supported.

The GDN prefix-caching fix affects all hybrid architecture models (not just Qwen 3.5), since any model mixing KV-cache attention with recurrent state will return False from llama_memory_seq_rm() for partial ranges.
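The fallback logic can be sketched roughly as follows. This is a minimal, self-contained illustration: `HybridMemory` and `resume_point` are stand-ins invented here to show the control flow, not the actual llama-cpp-python internals.

```python
class HybridMemory:
    """Stand-in for a GDN hybrid cache: partial range removal is unsupported."""

    def __init__(self):
        self.cleared = False

    def seq_rm(self, seq_id: int, p0: int, p1: int) -> bool:
        # Hybrid (attention + recurrent) backends reject partial ranges;
        # only removing the whole sequence ([0, end)) succeeds.
        if p0 <= 0 and p1 < 0:
            self.cleared = True
            return True
        return False


def resume_point(mem, seq_id: int, n_matched: int) -> int:
    """Return the token index generation should resume from.

    Try to keep the matched prefix by removing only the stale suffix
    [n_matched, end). If the backend rejects partial removal, clear the
    sequence entirely and re-evaluate the full prompt from position 0.
    """
    if mem.seq_rm(seq_id, n_matched, -1):
        return n_matched  # prefix cache kept: resume after the match
    mem.seq_rm(seq_id, 0, -1)  # full removal is always supported
    return 0
```

For a pure-attention model the partial removal succeeds and the prefix cache is reused; for a hybrid model the same call returns False, and only then does the code pay the cost of a full re-evaluation.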

Test plan

  • Verified with Qwen3.5-4B-GGUF (Q4_K_M) at n_ctx=6144 on Apple Metal
  • Multi-turn inference (prefix-caching across calls) works correctly
  • Embeddings work with updated llama_memory_clear API
  • RAG and agentic tool-calling flows pass end-to-end

codavidgarcia and others added 4 commits March 3, 2026 17:31
Updates the llama.cpp submodule to da348c9df which includes support for
the Qwen 3.5 model architecture (hybrid SSM + attention).

Changes to Python bindings:

1. llama_cpp.py: Sync llama_context_params struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
   - Add samplers (void*) and n_samplers (size_t) fields
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants

2. llama.py: Update flash_attn parameter handling
   - Map flash_attn=True/False to flash_attn_type=1/0

3. _ctypes_extensions.py: Graceful handling of deprecated symbols
   - ctypes_function decorator returns stub instead of crashing
     when a symbol is not found in the shared library
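The deprecated-symbol handling in item 3 can be illustrated with a simplified sketch. The real decorator's signature and error message may differ; `FakeLib` is a stand-in for a loaded `ctypes.CDLL`, which raises AttributeError on missing symbols.

```python
def ctypes_function(lib, name, argtypes, restype):
    """Bind `name` from `lib`; return a late-failing stub if it is missing."""

    def decorator(f):
        try:
            func = getattr(lib, name)  # CDLL raises AttributeError if absent
        except AttributeError:
            def stub(*args, **kwargs):
                # Fail only if the deprecated binding is actually called,
                # instead of crashing the whole module at import time.
                raise NotImplementedError(
                    f"symbol '{name}' not found in shared library"
                )
            return stub
        func.argtypes = argtypes
        func.restype = restype
        return func

    return decorator


class FakeLib:  # stand-in for ctypes.CDLL("libllama.so")
    pass


# Importing succeeds; only calling the stub raises.
stub = ctypes_function(FakeLib(), "llama_removed_symbol", [], None)(lambda: None)
```

This lets the bindings load against older or newer libllama builds where some symbols have been removed, deferring the failure to the call site.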

Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU
