feat: Qwen 3.5 GDN support with hybrid model fixes #2133

Open

r-dh wants to merge 4 commits into abetlen:main from r-dh:fix/qwen35-struct-alignment

Conversation


@r-dh r-dh commented Mar 4, 2026

Summary

Adds Qwen 3.5 (Gated Delta Network) support, building on the work in #2132 by @codavidgarcia, with additional fixes needed to make hybrid GDN models work correctly:

  • Update llama.cpp submodule and bindings for Qwen 3.5 support (from #2132)
  • Fix CMake build: set BUILD_NUMBER and LLAMA_INSTALL_VERSION for mtmd
  • Return bool from kv_cache_seq_rm so callers can detect when partial memory removal fails
  • Fix prefix-caching in generate(): GDN hybrid models reject partial memory removal via llama_memory_seq_rm(), but the return value was being ignored, causing llama_decode to fail with -1 on subsequent calls. generate() now falls back to full prompt re-evaluation when partial removal is not supported.

The GDN prefix-caching fix affects all hybrid architecture models (not just Qwen 3.5), since any model mixing KV-cache attention with recurrent state will return False from llama_memory_seq_rm() for partial ranges.
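The fallback logic can be sketched roughly as follows. This is a minimal, self-contained illustration: `HybridMemory` and `resume_point` are stand-ins invented here to show the control flow, not the actual llama-cpp-python internals.

```python
class HybridMemory:
    """Stand-in for a GDN hybrid cache: partial range removal is unsupported."""

    def __init__(self):
        self.cleared = False

    def seq_rm(self, seq_id: int, p0: int, p1: int) -> bool:
        # Hybrid (attention + recurrent) backends reject partial ranges;
        # only removing the whole sequence ([0, end)) succeeds.
        if p0 <= 0 and p1 < 0:
            self.cleared = True
            return True
        return False


def resume_point(mem, seq_id: int, n_matched: int) -> int:
    """Return the token index generation should resume from.

    Try to keep the matched prefix by removing only the stale suffix
    [n_matched, end). If the backend rejects partial removal, clear the
    sequence entirely and re-evaluate the full prompt from position 0.
    """
    if mem.seq_rm(seq_id, n_matched, -1):
        return n_matched  # prefix cache kept: resume after the match
    mem.seq_rm(seq_id, 0, -1)  # full removal is always supported
    return 0
```

For a pure-attention model the partial removal succeeds and the prefix cache is reused; for a hybrid model the same call returns False, and only then does the code pay the cost of a full re-evaluation.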

Test plan

  • Verified with Qwen3.5-4B-GGUF (Q4_K_M) at n_ctx=6144 on Apple Metal
  • Multi-turn inference (prefix-caching across calls) works correctly
  • Embeddings work with updated llama_memory_clear API
  • RAG and agentic tool-calling flows pass end-to-end

codavidgarcia and others added 4 commits March 3, 2026 17:31
Updates the llama.cpp submodule to da348c9df which includes support for
the Qwen 3.5 model architecture (hybrid SSM + attention).

Changes to Python bindings:

1. llama_cpp.py: Sync llama_context_params struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
   - Add samplers (void*) and n_samplers (size_t) fields
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants

2. llama.py: Update flash_attn parameter handling
   - Map flash_attn=True/False to flash_attn_type=1/0

3. _ctypes_extensions.py: Graceful handling of deprecated symbols
   - ctypes_function decorator returns stub instead of crashing
     when a symbol is not found in the shared library
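The deprecated-symbol handling in item 3 can be illustrated with a simplified sketch. The real decorator's signature and error message may differ; `FakeLib` is a stand-in for a loaded `ctypes.CDLL`, which raises AttributeError on missing symbols.

```python
def ctypes_function(lib, name, argtypes, restype):
    """Bind `name` from `lib`; return a late-failing stub if it is missing."""

    def decorator(f):
        try:
            func = getattr(lib, name)  # CDLL raises AttributeError if absent
        except AttributeError:
            def stub(*args, **kwargs):
                # Fail only if the deprecated binding is actually called,
                # instead of crashing the whole module at import time.
                raise NotImplementedError(
                    f"symbol '{name}' not found in shared library"
                )
            return stub
        func.argtypes = argtypes
        func.restype = restype
        return func

    return decorator


class FakeLib:  # stand-in for ctypes.CDLL("libllama.so")
    pass


# Importing succeeds; only calling the stub raises.
stub = ctypes_function(FakeLib(), "llama_removed_symbol", [], None)(lambda: None)
```

This lets the bindings load against older or newer libllama builds where some symbols have been removed, deferring the failure to the call site.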

Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU
