This PR implements the previously stubbed state management methods in the _internals.py module and updates the corresponding API calls in llama.py to use the correct underlying C++ function names. #2134

Open
bsides230 wants to merge 6 commits into abetlen:main from bsides230:kv-caching-issue

Conversation

bsides230 commented Mar 5, 2026

Builds on previous PR https://github.com/abetlen/llama-cpp-python/pull/2133/

Key Changes

- Implemented `copy_state_data()` method in `Llama._ctx` that wraps `llama_cpp.llama_state_get_data()`
- Implemented `set_state_data()` method in `Llama._ctx` that wraps `llama_cpp.llama_state_set_data()`
- Updated `save_state()` to call `llama_state_get_data()` instead of the deprecated `llama_copy_state_data()` and to pass the `state_size` parameter
- Updated `load_state()` to call `llama_state_set_data()` instead of `llama_set_state_data()` and to pass the `state_size` parameter
- Corrected the call in `save_state()` from `llama_get_state_size()` to `llama_state_get_size()` for consistency
Implementation Details
The changes align the Python wrapper with the underlying C++ API by using the newer llama_state_* function naming convention. The size parameter is now explicitly passed to both copy_state_data() and set_state_data(), as required by the updated C++ interface.
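The call pattern described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual diff: the injected `get_size`/`get_data`/`set_data` callables stand in for `llama_cpp.llama_state_get_size`, `llama_state_get_data`, and `llama_state_set_data` (which in upstream llama.cpp take the context, a `uint8_t*` buffer, and an explicit size, and return a byte count). It shows how the explicit size parameter flows through `save_state()`/`load_state()`:

```python
import ctypes

class StateMixin:
    """Sketch of the state-management wrappers, with the ctypes bindings
    injected as callables so the flow can be shown without a loaded model."""

    def __init__(self, ctx, get_size, get_data, set_data):
        self.ctx = ctx            # would be a llama_context pointer
        self._get_size = get_size # stand-in for llama_state_get_size
        self._get_data = get_data # stand-in for llama_state_get_data
        self._set_data = set_data # stand-in for llama_state_set_data

    def save_state(self) -> bytes:
        # Ask the context for the serialized state size, allocate a
        # buffer of that size, then copy the state out. The size is
        # passed explicitly, as the newer C API requires.
        size = self._get_size(self.ctx)
        buf = (ctypes.c_uint8 * size)()
        written = self._get_data(self.ctx, buf, size)
        return bytes(buf[:written])

    def load_state(self, data: bytes) -> None:
        # Copy the serialized bytes into a ctypes buffer and hand it,
        # with its explicit size, to the set-data binding.
        size = len(data)
        buf = (ctypes.c_uint8 * size).from_buffer_copy(data)
        self._set_data(self.ctx, buf, size)
```

The injection here is only for illustration; the actual wrapper calls the `llama_cpp` module functions directly.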

codavidgarcia and others added 6 commits March 3, 2026 17:31
Updates the llama.cpp submodule to da348c9df, which includes support for the Qwen 3.5 model architecture (hybrid SSM + attention).

Changes to Python bindings:

1. llama_cpp.py: Sync llama_context_params struct with upstream C API
   - flash_attn (bool) → flash_attn_type (enum llama_flash_attn_type)
   - Add samplers (void*) and n_samplers (size_t) fields
   - Add LLAMA_FLASH_ATTN_TYPE_* enum constants

2. llama.py: Update flash_attn parameter handling
   - Map flash_attn=True/False to flash_attn_type=1/0

3. _ctypes_extensions.py: Graceful handling of deprecated symbols
   - ctypes_function decorator returns stub instead of crashing
     when a symbol is not found in the shared library
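The graceful-degradation idea in item 3 can be sketched roughly like this. The decorator name and signature are simplified assumptions for illustration, not the actual `_ctypes_extensions.py` code: the point is that a missing symbol produces a stub that raises only when called, rather than an import-time crash.

```python
import ctypes

def ctypes_function(lib, name, argtypes, restype):
    """Bind `name` from `lib`; if the symbol is absent, return a stub
    that raises NotImplementedError when invoked instead of failing at
    decoration (i.e., import) time."""
    def decorator(pyfunc):
        try:
            cfunc = getattr(lib, name)  # AttributeError if symbol missing
        except AttributeError:
            def stub(*args, **kwargs):
                raise NotImplementedError(
                    f"{name} is not available in this build of the library"
                )
            return stub
        cfunc.argtypes = argtypes
        cfunc.restype = restype
        def wrapper(*args):
            return cfunc(*args)
        return wrapper
    return decorator
```

This lets bindings generated against a newer llama.cpp still load against an older shared library; only code paths that actually touch the removed or renamed symbol fail.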

Tested with Qwen3.5-0.8B-Q4_K_M.gguf on Apple Silicon (M1 Pro):
- Cold start: ~4s (vs ~40s with mlx-vlm)
- Inference: ~0.6s per chat completion
- Model loads and runs correctly on Metal GPU