
feat: update llama.cpp submodule and bindings for Qwen 3.5 support#2132

Open
codavidgarcia wants to merge 1 commit into abetlen:main from codavidgarcia:fix/qwen35-struct-alignment

Conversation


@codavidgarcia codavidgarcia commented Mar 3, 2026

Summary

This PR updates the llama.cpp submodule and Python bindings to support the Qwen 3.5 model architecture (hybrid SSM + attention), released on March 2, 2026.

Without this fix, loading any Qwen 3.5 GGUF results in:
error loading model architecture: unknown model architecture: 'qwen35'

Changes

1. llama_cpp.py — Sync struct with upstream C API

  • flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
    • The C struct changed this field from a bool to an enum with values AUTO=-1, DISABLED=0, ENABLED=1
  • Added samplers (void*) and n_samplers (size_t) fields to llama_context_params
    • Required by the new sampler chain API in upstream
  • Added LLAMA_FLASH_ATTN_TYPE_* enum constants
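An abridged sketch of the struct sync, using only the fields and enum values described above (the real llama_context_params has many more fields):

```python
import ctypes

# Enum values for llama_flash_attn_type, as described in this PR
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

class llama_context_params(ctypes.Structure):
    # Trimmed to the fields this PR touches; the upstream struct has many more
    _fields_ = [
        ("flash_attn_type", ctypes.c_int),  # was ("flash_attn", ctypes.c_bool)
        ("samplers", ctypes.c_void_p),      # new: sampler chain pointer
        ("n_samplers", ctypes.c_size_t),    # new: number of samplers
    ]

params = llama_context_params()
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO
```

Because ctypes maps fields by offset, a bool-vs-enum mismatch here silently misaligns every field that follows, which is why the struct must track upstream exactly.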

2. llama.py — Backward-compatible flash_attn handling

  • Maps the existing flash_attn=True/False Python parameter to the new flash_attn_type=1/0 C field
  • No breaking change to the public API
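A minimal sketch of the mapping; map_flash_attn is an illustrative name, the real logic lives inline in Llama.__init__:

```python
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def map_flash_attn(flash_attn: bool) -> int:
    """Translate the legacy flash_attn bool kwarg to the new C enum value."""
    if flash_attn:
        return LLAMA_FLASH_ATTN_TYPE_ENABLED
    return LLAMA_FLASH_ATTN_TYPE_DISABLED
```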

3. _ctypes_extensions.py — Graceful symbol resolution

  • ctypes_function decorator now catches AttributeError for deprecated/removed symbols
  • Returns the original Python stub instead of crashing at import time
  • This handles the C API evolution (e.g. llama_get_kv_self deprecated in favor of llama_get_memory)
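A simplified sketch of the fallback behavior; the decorator's real signature in _ctypes_extensions.py differs, and the stand-in library object below is purely for illustration:

```python
import functools

def ctypes_function(lib, name, argtypes, restype):
    """Bind a C symbol if present; otherwise keep the Python stub
    so importing the module does not crash."""
    def decorator(f):
        try:
            func = getattr(lib, name)
            func.argtypes = argtypes
            func.restype = restype
            functools.wraps(f)(func)
            return func
        except AttributeError:
            # Symbol not found in the shared library (deprecated/removed)
            return f
    return decorator

# Demo with a stand-in "library" object instead of a real ctypes.CDLL:
# the requested symbol does not exist, so the Python stub is kept.
class _FakeLib:
    pass

@ctypes_function(_FakeLib, "llama_get_kv_self", [], None)
def llama_get_kv_self():
    return "python stub"
```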

4. vendor/llama.cpp — Submodule update

  • Updated to commit da348c9df (models: fix qwen3.5 beta/gate shapes)

Testing

Tested on Apple Silicon (M1 Pro) with Qwen3.5-0.8B-Q4_K_M.gguf:

Metric              Result
Model load          0.56s
First inference     0.89s
Chat completion     ✅ Working
Text completion     ✅ Working
Metal GPU offload   ✅ All layers

Motivation

Qwen 3.5 was released on March 2, 2026 and is already one of the most downloaded model families. The llama.cpp upstream merged full support days ago, but llama-cpp-python currently ships an older submodule pin that doesn't recognize the qwen35 architecture. This blocks all Python users from using these models.

Copilot AI review requested due to automatic review settings March 3, 2026 22:03

Copilot AI left a comment


Pull request overview

Updates the vendored llama.cpp submodule and adjusts Python ctypes bindings to match upstream C API changes needed to load/support Qwen 3.5 (qwen35) GGUF models.

Changes:

  • Bumps vendor/llama.cpp to a newer upstream commit with Qwen 3.5 architecture fixes.
  • Syncs llama_context_params with upstream (new flash_attn_type, added sampler fields).
  • Makes ctypes symbol loading more tolerant of removed/deprecated C symbols.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

  • vendor/llama.cpp: Updates submodule pointer to upstream commit with Qwen 3.5 support.
  • llama_cpp/llama_cpp.py: Updates ctypes structs/constants to match new upstream C API fields and enum.
  • llama_cpp/llama.py: Maps the existing flash_attn boolean to the new flash_attn_type field.
  • llama_cpp/_ctypes_extensions.py: Avoids import-time crashes when a C symbol is missing by returning the Python stub.


swa_full: bool
kv_unified: bool
samplers: ctypes.c_void_p
n_samplers: int

Copilot AI Mar 3, 2026


The type annotation for n_samplers is int, but the corresponding _fields_ entry uses ctypes.c_size_t. Align the annotation with the actual C type (e.g., ctypes.c_size_t or an explicit int alias documented as size_t) to avoid misleading type hints and subtle platform-size confusion.

Suggested change:
-    n_samplers: int
+    n_samplers: ctypes.c_size_t

Author

@codavidgarcia codavidgarcia Mar 3, 2026


Fixed, updated the type annotation to ctypes.c_size_t to match the _fields_ entry.

     embedding=self.context_params.embeddings,
     offload_kqv=self.context_params.offload_kqv,
-    flash_attn=self.context_params.flash_attn,
+    flash_attn=bool(self.context_params.flash_attn_type),

Copilot AI Mar 3, 2026


bool(self.context_params.flash_attn_type) will treat LLAMA_FLASH_ATTN_TYPE_AUTO = -1 as True, which can incorrectly round-trip state (AUTO becomes enabled after re-init). Consider serializing the integer flash_attn_type explicitly (and keeping backward compatibility by accepting a bool on load), or mapping only ENABLED to True and everything else to False.

Suggested change:
-    flash_attn=bool(self.context_params.flash_attn_type),
+    flash_attn=(
+        self.context_params.flash_attn_type
+        == llama_cpp.LLAMA_FLASH_ATTN_TYPE_ENABLED
+    ),

Author


Fixed. Now mapping explicitly against LLAMA_FLASH_ATTN_TYPE_ENABLED in both the setter and the serializer (__getstate__), so AUTO = -1 won't round-trip as enabled.
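The fix described above can be sketched as a predicate over the enum; serialize_flash_attn is an illustrative helper name, the real code compares inline in __getstate__ and the setter:

```python
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def serialize_flash_attn(flash_attn_type: int) -> bool:
    """Only ENABLED round-trips as True; AUTO (-1) and DISABLED (0) become
    False, avoiding the bool(-1) == True pitfall flagged in review."""
    return flash_attn_type == LLAMA_FLASH_ATTN_TYPE_ENABLED
```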

Comment on lines +113 to +121
try:
    func = getattr(lib, name)
    func.argtypes = argtypes
    func.restype = restype
    functools.wraps(f)(func)
    return func
except AttributeError:
    # Symbol not found in shared library (deprecated/removed)
    return f

Copilot AI Mar 3, 2026


Silently returning the stub on missing symbols can make runtime failures harder to diagnose (the import succeeds, but calls may later fail in less obvious ways). Consider emitting a warning (optionally gated by an env var / debug flag) or returning a stub that raises a clear NotImplementedError indicating the symbol was not found in the loaded shared library.

Author


Agreed. Changed to return a stub that raises NotImplementedError with the missing symbol name, so the import still succeeds but any actual call to a removed function fails clearly.
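A sketch of that stub; _missing_symbol_stub is an illustrative helper name, not necessarily the exact code in the PR:

```python
def _missing_symbol_stub(name: str):
    """Return a callable that raises clearly if a removed C symbol is
    actually invoked, so import succeeds but misuse fails loudly."""
    def stub(*args, **kwargs):
        raise NotImplementedError(
            f"C symbol '{name}' not found in the loaded shared library"
        )
    return stub

# Example: a symbol removed upstream resolves to a loud stub
llama_get_kv_self = _missing_symbol_stub("llama_get_kv_self")
```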

@codavidgarcia codavidgarcia force-pushed the fix/qwen35-struct-alignment branch from 8aada9a to 2e3bda1 on March 3, 2026 22:31
Updates the llama.cpp submodule to da348c9df which includes support for
the Qwen 3.5 model architecture (hybrid SSM + attention).

Changes to Python bindings:

1. llama_cpp.py: Sync llama_context_params struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
   - Add samplers (void*) and n_samplers (size_t) fields
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants

2. llama.py: Update flash_attn parameter handling
   - Map flash_attn=True/False to flash_attn_type=1/0

3. _ctypes_extensions.py: Graceful handling of deprecated symbols
   - ctypes_function decorator returns stub instead of crashing
     when a symbol is not found in the shared library

Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU
@codavidgarcia codavidgarcia force-pushed the fix/qwen35-struct-alignment branch from 2e3bda1 to d21ef67 on March 3, 2026 22:31
@codavidgarcia
Author

Addressed all review comments and squashed into a single commit.

Summary of changes since the review:

  • n_samplers type hint aligned to ctypes.c_size_t
  • flash_attn_type mapping now uses the named constants (LLAMA_FLASH_ATTN_TYPE_ENABLED / DISABLED) instead of raw ints, and the __getstate__ serializer checks against ENABLED explicitly to avoid the bool(-1) == True round-trip issue with AUTO
  • Missing symbol handler now raises NotImplementedError with the symbol name instead of silently returning the stub
  • Synced the C header comment block, docstring, and TYPE_CHECKING annotations with the actual struct layout (added flash_attn_type, samplers, n_samplers; removed stale flash_attn references)

Tested locally on M1 Pro with Qwen3.5-0.8B Q4_K_M. Model loads, offloads to Metal, and completes both text and chat inference correctly.
