feat: update llama.cpp submodule and bindings for Qwen 3.5 support #2132
codavidgarcia wants to merge 1 commit into abetlen:main from
Conversation
Pull request overview
Updates the vendored llama.cpp submodule and adjusts Python ctypes bindings to match upstream C API changes needed to load/support Qwen 3.5 (qwen35) GGUF models.
Changes:
- Bumps vendor/llama.cpp to a newer upstream commit with Qwen 3.5 architecture fixes.
- Syncs llama_context_params with upstream (new flash_attn_type, added sampler fields).
- Makes ctypes symbol loading more tolerant of removed/deprecated C symbols.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vendor/llama.cpp | Updates submodule pointer to upstream commit with Qwen 3.5 support. |
| llama_cpp/llama_cpp.py | Updates ctypes structs/constants to match new upstream C API fields and enum. |
| llama_cpp/llama.py | Maps the existing flash_attn boolean to the new flash_attn_type field. |
| llama_cpp/_ctypes_extensions.py | Avoids import-time crashes when a C symbol is missing by returning the Python stub. |
llama_cpp/llama_cpp.py
Outdated
```python
    swa_full: bool
    kv_unified: bool
    samplers: ctypes.c_void_p
    n_samplers: int
```
The type annotation for n_samplers is int, but the corresponding _fields_ entry uses ctypes.c_size_t. Align the annotation with the actual C type (e.g., ctypes.c_size_t or an explicit int alias documented as size_t) to avoid misleading type hints and subtle platform-size confusion.
Suggested change:
```diff
-    n_samplers: int
+    n_samplers: ctypes.c_size_t
```
Fixed: updated the type annotation to ctypes.c_size_t to match the _fields_ entry.
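The agreed fix can be sketched with a trimmed-down version of the struct (field set abbreviated; the real llama_context_params carries many more fields). The point is that the class-level annotation should name the same ctypes type used in _fields_, so type hints reflect the actual platform-sized C type:

```python
import ctypes

# Trimmed-down sketch of llama_context_params, showing only the fields
# discussed in this thread. ctypes ignores the annotations; they exist
# purely as documentation/typing, so they must mirror _fields_.
class llama_context_params(ctypes.Structure):
    swa_full: bool
    kv_unified: bool
    samplers: ctypes.c_void_p
    n_samplers: ctypes.c_size_t  # annotation now matches the _fields_ entry

    _fields_ = [
        ("swa_full", ctypes.c_bool),
        ("kv_unified", ctypes.c_bool),
        ("samplers", ctypes.c_void_p),
        ("n_samplers", ctypes.c_size_t),
    ]
```

Since `size_t` is 4 bytes on some platforms and 8 on others, annotating the field as a plain `int` would hide exactly the platform-size detail the reviewer flagged.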
llama_cpp/llama.py
Outdated
```diff
     embedding=self.context_params.embeddings,
     offload_kqv=self.context_params.offload_kqv,
-    flash_attn=self.context_params.flash_attn,
+    flash_attn=bool(self.context_params.flash_attn_type),
```
bool(self.context_params.flash_attn_type) will treat LLAMA_FLASH_ATTN_TYPE_AUTO = -1 as True, which can incorrectly round-trip state (AUTO becomes enabled after re-init). Consider serializing the integer flash_attn_type explicitly (and keeping backward compatibility by accepting a bool on load), or mapping only ENABLED to True and everything else to False.
Suggested change:
```diff
-    flash_attn=bool(self.context_params.flash_attn_type),
+    flash_attn=(
+        self.context_params.flash_attn_type
+        == llama_cpp.LLAMA_FLASH_ATTN_TYPE_ENABLED
+    ),
```
Fixed. Now mapping explicitly against LLAMA_FLASH_ATTN_TYPE_ENABLED in both the setter and the serializer (__getstate__), so AUTO = -1 won't round-trip as enabled.
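The pitfall and the fix can be reduced to a small sketch. The constant values mirror the upstream enum as described in this PR (AUTO = -1, DISABLED = 0, ENABLED = 1); the helper name is illustrative, not the actual code in llama.py:

```python
# Upstream enum values as described in this PR.
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def flash_attn_to_bool(flash_attn_type: int) -> bool:
    # Only an explicit ENABLED should serialize as True. A plain
    # bool() cast would also turn AUTO (-1) into True, so a model
    # created with AUTO would round-trip as flash-attention-enabled.
    return flash_attn_type == LLAMA_FLASH_ATTN_TYPE_ENABLED
```

The key observation is that `bool(-1)` is `True` in Python, which is why the original `bool(...)` cast conflated AUTO with ENABLED.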
llama_cpp/_ctypes_extensions.py
Outdated
```python
        try:
            func = getattr(lib, name)
            func.argtypes = argtypes
            func.restype = restype
            functools.wraps(f)(func)
            return func
        except AttributeError:
            # Symbol not found in shared library (deprecated/removed)
            return f
```
Silently returning the stub on missing symbols can make runtime failures harder to diagnose (the import succeeds, but calls may later fail in less obvious ways). Consider emitting a warning (optionally gated by an env var / debug flag) or returning a stub that raises a clear NotImplementedError indicating the symbol was not found in the loaded shared library.
Agreed. Changed to return a stub that raises NotImplementedError with the missing symbol name, so the import still succeeds but any actual call to a removed function fails clearly.
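A minimal sketch of this fix, assuming a simplified decorator shape (the real ctypes_function in _ctypes_extensions.py takes its library from an enclosing factory and may differ in detail): on a missing symbol, return a stub that raises at call time rather than the silent Python placeholder.

```python
import functools

def ctypes_function(lib, name, argtypes, restype):
    # Hedged sketch, not the exact _ctypes_extensions.py code: bind a C
    # symbol from `lib`, falling back to a loud stub when it is absent.
    def decorator(f):
        try:
            func = getattr(lib, name)
        except AttributeError:
            # Deprecated/removed symbol: the import still succeeds, but
            # any actual call fails with a message naming the symbol.
            @functools.wraps(f)
            def stub(*args, **kwargs):
                raise NotImplementedError(
                    f"symbol '{name}' not found in the loaded shared library"
                )
            return stub
        func.argtypes = argtypes
        func.restype = restype
        functools.wraps(f)(func)
        return func
    return decorator
```

This works unchanged against a real ctypes.CDLL, since attribute lookup on a CDLL raises AttributeError for symbols the shared library does not export.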
Force-pushed from 8aada9a to 2e3bda1 (Compare)
Updates the llama.cpp submodule to da348c9df, which includes support for
the Qwen 3.5 model architecture (hybrid SSM + attention).
Changes to Python bindings:
1. llama_cpp.py: Sync llama_context_params struct with upstream C API
- flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
- Add samplers (void*) and n_samplers (size_t) fields
- Add LLAMA_FLASH_ATTN_TYPE_* enum constants
2. llama.py: Update flash_attn parameter handling
- Map flash_attn=True/False to flash_attn_type=1/0
3. _ctypes_extensions.py: Graceful handling of deprecated symbols
- ctypes_function decorator returns stub instead of crashing
when a symbol is not found in the shared library
Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU
Force-pushed from 2e3bda1 to d21ef67 (Compare)
Addressed all review comments and squashed into a single commit. Summary of changes since the review:
Tested locally on M1 Pro with Qwen3.5-0.8B Q4_K_M. The model loads, offloads to Metal, and completes both text and chat inference correctly.
Summary
This PR updates the llama.cpp submodule and Python bindings to support the Qwen 3.5 model architecture (hybrid SSM + attention), released on March 2, 2026.
Without this fix, loading any Qwen 3.5 GGUF results in:
```
error loading model architecture: unknown model architecture: 'qwen35'
```
Changes
1. llama_cpp.py — Sync struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type), with AUTO = -1, DISABLED = 0, ENABLED = 1
   - Add samplers (void*) and n_samplers (size_t) fields to llama_context_params
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants
2. llama.py — Backward-compatible flash_attn handling
   - Map the flash_attn=True/False Python parameter to the new flash_attn_type=1/0 C field
3. _ctypes_extensions.py — Graceful symbol resolution
   - ctypes_function decorator now catches AttributeError for deprecated/removed symbols (e.g., llama_get_kv_self, deprecated in favor of llama_get_memory)
4. vendor/llama.cpp — Submodule update
   - Bump to da348c9df (models: fix qwen3.5 beta/gate shapes)
Testing
Tested on Apple Silicon (M1 Pro) with Qwen3.5-0.8B-Q4_K_M.gguf:
Motivation
Qwen 3.5 was released on March 2, 2026 and is already one of the most downloaded model families. Upstream llama.cpp merged full support days ago, but llama-cpp-python currently ships an older submodule pin that doesn't recognize the qwen35 architecture, which blocks all Python users from loading these models.