
feat: update llama.cpp submodule and bindings for Qwen 3.5 support#2132

Open
codavidgarcia wants to merge 1 commit into abetlen:main from codavidgarcia:fix/qwen35-struct-alignment

Conversation


@codavidgarcia codavidgarcia commented Mar 3, 2026

Summary

This PR updates the llama.cpp submodule and Python bindings to support the Qwen 3.5 model architecture (hybrid SSM + attention), released on March 2, 2026.

Without this fix, loading any Qwen 3.5 GGUF results in:
error loading model architecture: unknown model architecture: 'qwen35'

Changes

1. llama_cpp.py — Sync struct with upstream C API

  • flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
    • The C struct changed this field from a bool to an enum with values AUTO=-1, DISABLED=0, ENABLED=1
  • Added samplers (void*) and n_samplers (size_t) fields to llama_context_params
    • Required by the new sampler chain API in upstream
  • Added LLAMA_FLASH_ATTN_TYPE_* enum constants
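An abridged sketch of the struct sync, using only the fields and enum values described above (the real llama_context_params has many more fields):

```python
import ctypes

# Enum values for llama_flash_attn_type, as described in this PR
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

class llama_context_params(ctypes.Structure):
    # Trimmed to the fields this PR touches; the upstream struct has many more
    _fields_ = [
        ("flash_attn_type", ctypes.c_int),  # was ("flash_attn", ctypes.c_bool)
        ("samplers", ctypes.c_void_p),      # new: sampler chain pointer
        ("n_samplers", ctypes.c_size_t),    # new: number of samplers
    ]

params = llama_context_params()
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO
```

Because ctypes maps fields by offset, a bool-vs-enum mismatch here silently misaligns every field that follows, which is why the struct must track upstream exactly.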

2. llama.py — Backward-compatible flash_attn handling

  • Maps the existing flash_attn=True/False Python parameter to the new flash_attn_type=1/0 C field
  • No breaking change to the public API
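A minimal sketch of the mapping; map_flash_attn is an illustrative name, the real logic lives inline in Llama.__init__:

```python
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def map_flash_attn(flash_attn: bool) -> int:
    """Translate the legacy flash_attn bool kwarg to the new C enum value."""
    if flash_attn:
        return LLAMA_FLASH_ATTN_TYPE_ENABLED
    return LLAMA_FLASH_ATTN_TYPE_DISABLED
```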

3. _ctypes_extensions.py — Graceful symbol resolution

  • ctypes_function decorator now catches AttributeError for deprecated/removed symbols
  • Returns the original Python stub instead of crashing at import time
  • This handles the C API evolution (e.g. llama_get_kv_self deprecated in favor of llama_get_memory)
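A simplified sketch of the fallback behavior; the decorator's real signature in _ctypes_extensions.py differs, and the stand-in library object below is purely for illustration:

```python
import functools

def ctypes_function(lib, name, argtypes, restype):
    """Bind a C symbol if present; otherwise keep the Python stub
    so importing the module does not crash."""
    def decorator(f):
        try:
            func = getattr(lib, name)
            func.argtypes = argtypes
            func.restype = restype
            functools.wraps(f)(func)
            return func
        except AttributeError:
            # Symbol not found in the shared library (deprecated/removed)
            return f
    return decorator

# Demo with a stand-in "library" object instead of a real ctypes.CDLL:
# the requested symbol does not exist, so the Python stub is kept.
class _FakeLib:
    pass

@ctypes_function(_FakeLib, "llama_get_kv_self", [], None)
def llama_get_kv_self():
    return "python stub"
```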

4. vendor/llama.cpp — Submodule update

  • Updated to commit da348c9df (models: fix qwen3.5 beta/gate shapes)

Testing

Tested on Apple Silicon (M1 Pro) with Qwen3.5-0.8B-Q4_K_M.gguf:

Metric              Result
Model load          0.56s
First inference     0.89s
Chat completion     ✅ Working
Text completion     ✅ Working
Metal GPU offload   ✅ All layers

Motivation

Qwen 3.5 was released on March 2, 2026 and is already one of the most downloaded model families. The llama.cpp upstream merged full support days ago, but llama-cpp-python currently ships an older submodule pin that doesn't recognize the qwen35 architecture. This blocks all Python users from using these models.

Copilot AI review requested due to automatic review settings March 3, 2026 22:03

Copilot AI left a comment


Pull request overview

Updates the vendored llama.cpp submodule and adjusts Python ctypes bindings to match upstream C API changes needed to load/support Qwen 3.5 (qwen35) GGUF models.

Changes:

  • Bumps vendor/llama.cpp to a newer upstream commit with Qwen 3.5 architecture fixes.
  • Syncs llama_context_params with upstream (new flash_attn_type, added sampler fields).
  • Makes ctypes symbol loading more tolerant of removed/deprecated C symbols.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

  • vendor/llama.cpp: Updates submodule pointer to upstream commit with Qwen 3.5 support.
  • llama_cpp/llama_cpp.py: Updates ctypes structs/constants to match new upstream C API fields and enum.
  • llama_cpp/llama.py: Maps the existing flash_attn boolean to the new flash_attn_type field.
  • llama_cpp/_ctypes_extensions.py: Avoids import-time crashes when a C symbol is missing by returning the Python stub.


swa_full: bool
kv_unified: bool
samplers: ctypes.c_void_p
n_samplers: int

Copilot AI Mar 3, 2026


The type annotation for n_samplers is int, but the corresponding _fields_ entry uses ctypes.c_size_t. Align the annotation with the actual C type (e.g., ctypes.c_size_t or an explicit int alias documented as size_t) to avoid misleading type hints and subtle platform-size confusion.

Suggested change:
-    n_samplers: int
+    n_samplers: ctypes.c_size_t

Author

@codavidgarcia codavidgarcia Mar 3, 2026


Fixed, updated the type annotation to ctypes.c_size_t to match the _fields_ entry.

     embedding=self.context_params.embeddings,
     offload_kqv=self.context_params.offload_kqv,
-    flash_attn=self.context_params.flash_attn,
+    flash_attn=bool(self.context_params.flash_attn_type),

Copilot AI Mar 3, 2026


bool(self.context_params.flash_attn_type) will treat LLAMA_FLASH_ATTN_TYPE_AUTO = -1 as True, which can incorrectly round-trip state (AUTO becomes enabled after re-init). Consider serializing the integer flash_attn_type explicitly (and keeping backward compatibility by accepting a bool on load), or mapping only ENABLED to True and everything else to False.

Suggested change:
-    flash_attn=bool(self.context_params.flash_attn_type),
+    flash_attn=(
+        self.context_params.flash_attn_type
+        == llama_cpp.LLAMA_FLASH_ATTN_TYPE_ENABLED
+    ),

Author


Fixed. Now mapping explicitly against LLAMA_FLASH_ATTN_TYPE_ENABLED in both the setter and the serializer (__getstate__), so AUTO = -1 won't round-trip as enabled.
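The fix described above can be sketched as a predicate over the enum; serialize_flash_attn is an illustrative helper name, the real code compares inline in __getstate__ and the setter:

```python
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def serialize_flash_attn(flash_attn_type: int) -> bool:
    """Only ENABLED round-trips as True; AUTO (-1) and DISABLED (0) become
    False, avoiding the bool(-1) == True pitfall flagged in review."""
    return flash_attn_type == LLAMA_FLASH_ATTN_TYPE_ENABLED
```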

Comment on lines +113 to +121
try:
    func = getattr(lib, name)
    func.argtypes = argtypes
    func.restype = restype
    functools.wraps(f)(func)
    return func
except AttributeError:
    # Symbol not found in shared library (deprecated/removed)
    return f

Copilot AI Mar 3, 2026


Silently returning the stub on missing symbols can make runtime failures harder to diagnose (the import succeeds, but calls may later fail in less obvious ways). Consider emitting a warning (optionally gated by an env var / debug flag) or returning a stub that raises a clear NotImplementedError indicating the symbol was not found in the loaded shared library.

Author


Agreed. Changed to return a stub that raises NotImplementedError with the missing symbol name, so the import still succeeds but any actual call to a removed function fails clearly.
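A sketch of that stub; _missing_symbol_stub is an illustrative helper name, not necessarily the exact code in the PR:

```python
def _missing_symbol_stub(name: str):
    """Return a callable that raises clearly if a removed C symbol is
    actually invoked, so import succeeds but misuse fails loudly."""
    def stub(*args, **kwargs):
        raise NotImplementedError(
            f"C symbol '{name}' not found in the loaded shared library"
        )
    return stub

# Example: a symbol removed upstream resolves to a loud stub
llama_get_kv_self = _missing_symbol_stub("llama_get_kv_self")
```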

@codavidgarcia codavidgarcia force-pushed the fix/qwen35-struct-alignment branch from 8aada9a to 2e3bda1 on March 3, 2026 22:31
Updates the llama.cpp submodule to da348c9df which includes support for
the Qwen 3.5 model architecture (hybrid SSM + attention).

Changes to Python bindings:

1. llama_cpp.py: Sync llama_context_params struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
   - Add samplers (void*) and n_samplers (size_t) fields
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants

2. llama.py: Update flash_attn parameter handling
   - Map flash_attn=True/False to flash_attn_type=1/0

3. _ctypes_extensions.py: Graceful handling of deprecated symbols
   - ctypes_function decorator returns stub instead of crashing
     when a symbol is not found in the shared library

Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU
@codavidgarcia codavidgarcia force-pushed the fix/qwen35-struct-alignment branch from 2e3bda1 to d21ef67 on March 3, 2026 22:31
@codavidgarcia
Author

Addressed all review comments and squashed into a single commit.

Summary of changes since the review:

  • n_samplers type hint aligned to ctypes.c_size_t
  • flash_attn_type mapping now uses the named constants (LLAMA_FLASH_ATTN_TYPE_ENABLED / DISABLED) instead of raw ints, and the __getstate__ serializer checks against ENABLED explicitly to avoid the bool(-1) == True round-trip issue with AUTO
  • Missing symbol handler now raises NotImplementedError with the symbol name instead of silently returning the stub
  • Synced the C header comment block, docstring, and TYPE_CHECKING annotations with the actual struct layout (added flash_attn_type, samplers, n_samplers; removed stale flash_attn references)

Tested locally on M1 Pro with Qwen3.5-0.8B Q4_K_M. Model loads, offloads to Metal, and completes both text and chat inference correctly.
