feat:upgrade by lochjin · Pull Request #5 · Qitmeer/llama.cpp

lochjin · 2026-04-26T01:08:39Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

* rpc : refactor the RPC transport

  Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details.
* fix win32
* better socket_t construction

* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if they are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml-org#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga
Refs: ggml-org#21630
Co-authored-by: texasich <texasich@users.noreply.github.com>

* convert : support sentence-transformer 5.4 config files * fix: embeddinggemma * fix: mapping Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: pooling_mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* cache subgraph splits when cgraph is unchanged

  Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively. Assign uid to every sub-graph so that CUDA's fast uid check path hits too.
* Address review comments
* Keep the scope as is
* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.
* Address review comments
* Update ggml/src/ggml-backend-meta.cpp
* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
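The caching idea in the commit above can be illustrated with a small, self-contained sketch. It assumes that a graph carries a non-zero `uid` that stays the same while the graph is unchanged. The types `Graph`, `Split`, and `SplitCache` are hypothetical stand-ins, not the ggml structures, and the 4-node chunking is a toy replacement for the real split logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real ggml types are not reproduced here.
struct Graph { uint64_t uid; int n_nodes; };
struct Split { int first_node; int n_nodes; };

struct SplitCache {
    uint64_t last_uid = 0;        // 0 means "nothing cached yet"
    std::vector<Split> splits;    // splits built for the graph with last_uid
    int rebuilds = 0;             // how often the splits were actually rebuilt

    const std::vector<Split> & get(const Graph & g) {
        // Fast path: same graph as the previous call, reuse the cached splits.
        if (g.uid != 0 && g.uid == last_uid) {
            return splits;
        }
        // Slow path: rebuild. Toy rule: chunks of at most 4 nodes.
        splits.clear();
        for (int i = 0; i < g.n_nodes; i += 4) {
            splits.push_back({i, std::min(4, g.n_nodes - i)});
        }
        last_uid = g.uid;
        ++rebuilds;
        return splits;
    }
};
```

Calling `get` twice in a row with the same graph rebuilds the splits only once; a graph with a different `uid` triggers a rebuild.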

ggml-org#22082) * mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos * fix build

* CUDA: refactor mma data loading for AMD * fix CDNA MMQ occupancy * fix CDNA3 mma * fix RDNA3 compile

@arthw

* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

  The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K asserted that block_num_y was a multiple of 16 subgroups. Models with a vocab size not divisible by 16 (for example HY-MT at 120818) aborted on model load when the output projection tripped the assert.

  I replaced the assert with padding: block_num_y now rounds up to a whole number of subgroup-sized workgroups. The kernel already has the row bounds check (`if (row >= nrows) return;`) so the extra padded threads early-exit cleanly. Row values are uniform across a subgroup so the collective reduce stays safe.

  For aligned vocab sizes the padded block_num_y equals the old value, so the kernel launch is identical and there is no regression.

  Thanks to @arthw for flagging the relationship to ggml-org#21527. Fixes ggml-org#22020. AI assisted coding, tested on Intel B70 hardware.
* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

  Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel target where WARP_SIZE is 16, but makes the relationship to subgroup size explicit. Per review by @NeoZhangJianyu on ggml-org#22035. Assisted by Claude.
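The round-up padding described in the commit above can be sketched in plain C++. This is not the SYCL launch code: it assumes for simplicity that block_num_y counts one unit per row and that there are 16 subgroups per workgroup (the value the commit gives for WARP_SIZE on the Intel target). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed subgroup count per workgroup (16 on the Intel target per the commit text).
constexpr int64_t kNumSubgroups = 16;

// Hypothetical helper: round n up to the next multiple of m.
constexpr int64_t round_up(int64_t n, int64_t m) {
    return ((n + m - 1) / m) * m;
}

// Padded launch height. Already-aligned sizes are returned unchanged.
constexpr int64_t padded_block_num_y(int64_t nrows) {
    return round_up(nrows, kNumSubgroups);
}

// Host-side stand-in for the kernel's bounds check: counts the threads
// that would do real work after the padded threads early-exit.
inline int64_t count_active_rows(int64_t nrows) {
    int64_t active = 0;
    for (int64_t row = 0; row < padded_block_num_y(nrows); ++row) {
        if (row >= nrows) {
            continue; // mirrors `if (row >= nrows) return;` in the kernel
        }
        ++active;
    }
    return active;
}
```

For the vocab size quoted in the commit, 120818 rows pad to 120832 (7552 × 16), and the 14 extra threads do no work. An aligned size such as 128 is left as is, which matches the "no regression" claim for aligned vocab sizes.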

…22102) * llama: fix crash in print_info for GLM-DSA when vocab_only is set * addressed code review comments * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* merged properly, but slow q3_k and q5_k with u32 indexing * Start on new mat-vec * New format float paths working * Working q4_0 * Work on remaining legacy q-types * port k-quants to new matvec * remove old shader * Remove old constants, format * remove accidental file --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com>

…g#21636) * Implemented optimized q1_0 dot for x86 and generic * Removed redundant helper definition * Removed two redundant instructions from AVX q1_0 dot * Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback * Style cleanup around AVX q1_0 dot * Replaced explicitly unrolled blocks with inner for loop for q1_0 * Replaced scalar ARM q1_0 impl with new generic one

* TP: fix 0-sized tensor slices, AllReduce fallback * fix layer structure <-> GPU count aliasing * add missing std::fill * fix CUDA device set, max ggml ctx size

* Fix delayed AllReduce on Gemma-4 MoE Skip forward past nodes that don't consume the current one, and allow a chain of MULs. * Check for all sources before skipping nodes * Address review comments

* server : remove /api endpoints * cont : remove /api/tags

* ggml-cuda: flush legacy pool on OOM and retry Signed-off-by: 梁厚宏 <2695316095@qq.com> * Address review comments: add explicit sync, update destructor, clean up MUSA macros Signed-off-by: 梁厚宏 <2695316095@qq.com> --------- Signed-off-by: 梁厚宏 <2695316095@qq.com>
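The "flush on OOM and retry" pattern named in the commit above can be illustrated with a toy allocator. This is a sketch under stated assumptions, not the ggml-cuda pool: `ToyPool` and its members are hypothetical, memory is modeled as plain byte counters, and the explicit device synchronization mentioned in the review follow-up is omitted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy pool: tracks byte counts only, no real device memory.
struct ToyPool {
    size_t capacity;     // total "device" memory
    size_t in_use;       // bytes currently handed out to callers
    size_t cached;       // bytes returned by callers but still held by the pool
    int    flushes;      // number of times the cache was flushed

    explicit ToyPool(size_t cap) : capacity(cap), in_use(0), cached(0), flushes(0) {}

    // Underlying allocation: fails when in-use plus cached memory leaves no room.
    bool raw_alloc(size_t n) {
        if (in_use + cached + n > capacity) {
            return false;
        }
        in_use += n;
        return true;
    }

    // Returned buffers stay cached inside the pool for later reuse.
    void release(size_t n) {
        in_use -= n;
        cached += n;
    }

    // Give all cached buffers back to the "device".
    void flush() {
        cached = 0;
        ++flushes;
    }

    // Allocate; on failure flush the cache once and retry before giving up.
    bool alloc(size_t n) {
        if (raw_alloc(n)) {
            return true;
        }
        flush();
        return raw_alloc(n);
    }
};
```

With a capacity of 100, allocating and releasing 60 bytes leaves them cached, so a following request for 80 bytes succeeds only because of the flush and retry.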

…org#18760) (ggml-org#22003) Fixes: ggml-org#18760 Co-authored-by: Christian <christian@example.com>

…ice (ggml-org#22171) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners

…ggml-org#22199)
* ggml-webgpu: add tile flash attention fallback
* ggml-webgpu: add new fields and discard usage of mnk for tile version
* ggml-webgpu: modify the vec path to discard the mnk parameter
* ggml-webgpu: enable flash attention vec and tile version for browser
* ggml-webgpu: staging KV for flash attention tile version
* formatting
* turn on subgroup uniformity check
* remove Q_TILE as it is always 1 for vec path
* make row_max and exp_sum to local register
* make different bindings with same underlying buffer to have the same usage flags
* move path selection into the shader library and have the host consume a single flash-attn decision object.
* turn off skip_validation and address buffer overlapping when nwg==1
* formatting
* merge binding when kv overlap

…2303)
* switch ubuntu-latest to ubuntu-slim
* Fix the path for upload so CI doesn't fail
* Update .github/workflows/build-and-test-snapdragon.yml
* Use -slim image for key check and consistent naming for artifact dir
* Remove check-secret extra job
* move QDC key check for Run QDC jobs step specifically
* add a step before to check the secret for qdc jobs

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg

* fix very stupid structured output bug * Things just cannot be too easy.

…ggml-org#22327) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support

* opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter

* gitignore : add .pi + personal SYSTEM.md * cont : fix requirements heading in PR template * cont : shorten line

Change the default `ftype` in `llama_model_quantize_params` from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`. In case some external program naively uses the default quantization params, we should probably default to a known-good type like Q8_0 rather than Q5_1, which is rather old.
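The concern in the commit above, external programs that inherit whatever the library default happens to be, can be shown with a stand-in sketch. Everything in the `sketch` namespace is hypothetical: the enumerator names only echo the commit text, their numeric values are arbitrary, and none of this is the llama.cpp header.

```cpp
#include <cassert>

namespace sketch {

// Stand-in enum: names echo the commit text, values are arbitrary here.
enum ftype { FTYPE_MOSTLY_Q5_1, FTYPE_MOSTLY_Q8_0, FTYPE_MOSTLY_Q4_0 };

// Stand-in for a quantization parameter struct with a default type.
struct quantize_params {
    ftype type;
};

// Stand-in for a library default that may change between releases
// (Q8_0 after the change described in the commit).
inline quantize_params default_params() {
    quantize_params p = { FTYPE_MOSTLY_Q8_0 };
    return p;
}

// A caller that needs a specific type sets it explicitly,
// so a changed library default cannot alter its output.
inline quantize_params make_params(ftype wanted) {
    quantize_params p = default_params();
    p.type = wanted;
    return p;
}

} // namespace sketch
```

A caller using `default_params()` unchanged gets Q8_0 in this sketch, while `make_params(FTYPE_MOSTLY_Q4_0)` is unaffected by what the default is.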

…#20962)
* Optimize Metal Tensor API usage for matmul2d

  Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.

  Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
* cont : cleanup
* cont : cleanup
* cont : cleanup

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc

* chat: fix handling of space in reasoning markers * fix tests * whitespace

rgerganov and others added 30 commits April 19, 2026 10:21

rpc : refactor the RPC transport (ggml-org#21998)

91fef95

* rpc : refactor the RPC transport Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details. * fix win32 * better socket_t construction

ci : install spirv-headers for vulkan-cross (ggml-org#22109)

037bfe3

mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (ggml-org#22082)

1912407

* mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos
* fix build

HIP: Remove unnecessary NCCL_CHECK (ggml-org#21914)

471540a

common/autoparser : allow space after tool call (ggml-org#22073)

d5b780a

CUDA: refactor mma data loading for AMD (ggml-org#22051)

4eac5b4

* CUDA: refactor mma data loading for AMD * fix CDNA MMQ occupancy * fix CDNA3 mma * fix RDNA3 compile

vendor : update cpp-httplib to 0.42.0 (ggml-org#21781)

e365e65

server: rename --clear-idle to --cache-idle-slots (ggml-org#21741)

9d49acb

server : refactor "use checkpoint" logic (ggml-org#22114)

de71b5f

fix: GLM-DSA crash in llama-tokenize when using vocab_only (ggml-org#22102)

81df3f7

* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

mtmd: refactor mtmd_decode_use_mrope (ggml-org#22161)

a678916

TP: fix 0-sized tensor slices, AllReduce fallback (ggml-org#21808)

fb19f94

* TP: fix 0-sized tensor slices, AllReduce fallback * fix layer structure <-> GPU count aliasing * add missing std::fill * fix CUDA device set, max ggml ctx size

Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (ggml-org#22129)

fd6ae4c

* Fix delayed AllReduce on Gemma-4 MoE Skip forward past nodes that don't consume the current one, and allow a chain of MULs. * Check for all sources before skipping nodes * Address review comments

server : remove /api endpoints (ggml-org#22165)

cf8b0db

* server : remove /api endpoints * cont : remove /api/tags

mtmd: correct get_n_pos / get_decoder_pos (ggml-org#22175)

86f8daa

server : fix hardcoded proxy connection timeout in router mode (ggml-org#18760) (ggml-org#22003)

ff6b106

Fixes: ggml-org#18760
Co-authored-by: Christian <christian@example.com>

ggml : bump version to 0.10.0 (ggml/1463)

041fe83

sync : ggml

4889afb

llama-ext : fix exports (ggml-org#22202)

cd03ec7

mtmd: correct mtmd_decode_use_mrope() (ggml-org#22188)

9998d88

vulkan: Support F16 OP_FILL (ggml-org#22177)

82209ef

ArberSephirotheca and others added 13 commits April 24, 2026 10:39

Hexagon: Bump HMX Frequency to Max Corner (ggml-org#22334)

361fe72

* hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg

parser: fix structured output bug (ggml-org#22302)

0adede8

* fix very stupid structured output bug * Things just cannot be too easy.

ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (ggml-org#22327)

dd2914d

* Implement ssm_scan
* Remove blocking in graph_compute and check for set rows
* Fix bindings
* Update op support

[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (ggml-org#22291)

eddd7a1

* opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter

gitignore : add .pi + personal SYSTEM.md (ggml-org#22316)

8ea8fee

* gitignore : add .pi + personal SYSTEM.md * cont : fix requirements heading in PR template * cont : shorten line

CUDA: reduce MMQ stream-k overhead (ggml-org#22298)

9725a31

* CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc

spec : fix vocab compat checks (ggml-org#22358)

98dc141

chat: fix handling of space in reasoning markers (ggml-org#22353)

dcad77c

* chat: fix handling of space in reasoning markers * fix tests * whitespace

feat:upgrade

4be2164

github-actions bot added labels on Apr 26, 2026: documentation, Apple Metal, SYCL, Nvidia GPU, Vulkan, testing, examples, devops, python, script, server, ggml, model, jinja parser, Hexagon, WebGPU, OpenVINO

lochjin wants to merge 94 commits into Qitmeer:master from lochjin:master


20 participants
