git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Georgi Gerganov [Sun, 25 Jan 2026 07:12:50 +0000 (09:12 +0200)]

completion : fix prompt cache for recurrent models (#19045)

Molly Sophia [Sun, 25 Jan 2026 07:11:19 +0000 (15:11 +0800)]

readme: update RWKV7 model links (#19061)

Signed-off-by: Molly Sophia <redacted>

Jakkala Mahesh [Sun, 25 Jan 2026 07:10:52 +0000 (12:40 +0530)]

llama: fix integer type consistency in split helpers (#18894)

* llama: fix integer type consistency in split helpers

* llama: apply minor style fixes

* llama: remove trailing whitespace

Daniel Bevenius [Sun, 25 Jan 2026 06:31:42 +0000 (07:31 +0100)]

common : use two decimal places for float arg help messages (#19048)

* common : use two decimal places for float arg help messages

This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.

The motivation for this change is that currently only having one decimal
place means that values generated using --help or llama-gen-docs will not
display the correct values.

For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually '0.95'. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.

* docs : run llama-gen-docs to update docs
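
The rounding problem described in this commit message is easy to reproduce. The following is a minimal sketch in Python rather than the C++ of arg.cpp; it only assumes printf-style format strings, and the variable names are hypothetical.

```python
# Default value of top-p, as stated in the commit message above.
top_p_default = 0.95

# One decimal place: the printed value no longer matches the real default.
one_decimal = "%.1f" % top_p_default

# Two decimal places: the printed value matches the real default.
two_decimals = "%.2f" % top_p_default

print(one_decimal)   # 0.9
print(two_decimals)  # 0.95
```

Because generated docs are built from the help output, a one-decimal format would hide the difference between `0.9` and `0.95`, which is the situation the commit describes.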

Bartowski [Sun, 25 Jan 2026 01:36:47 +0000 (20:36 -0500)]

convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064)

* Add undo_permute = False where needed

* Replace super().modify_tensors with ModelBase

* Add one more ModelBase.modify_tensors

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Johannes Gäßler [Sat, 24 Jan 2026 21:13:08 +0000 (22:13 +0100)]

llama-fit-params: keep explicit --ctx-size 0 (#19070)

Johannes Gäßler [Sat, 24 Jan 2026 20:57:51 +0000 (21:57 +0100)]

GGUF: check that tensor size is representable (#19072)

Xuan-Son Nguyen [Sat, 24 Jan 2026 16:58:45 +0000 (17:58 +0100)]

chat: fix language input for translategemma (#19052)

* chat: fix language input for translategemma

* Update common/chat.cpp

Co-authored-by: Aldehir Rojas <redacted>
---------

Co-authored-by: Aldehir Rojas <redacted>

Johannes Gäßler [Sat, 24 Jan 2026 09:09:36 +0000 (10:09 +0100)]

CUDA: re-use MLA K data for V in MMA FA (#19057)

Aman Gupta [Sat, 24 Jan 2026 06:25:20 +0000 (14:25 +0800)]

ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934)

* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds

nullname [Sat, 24 Jan 2026 06:02:07 +0000 (14:02 +0800)]

ggml-hexagon: flash-attn opt (#19025)

* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip

Georgi Gerganov [Fri, 23 Jan 2026 16:22:34 +0000 (18:22 +0200)]

graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898)

* graph : avoid branches between embedding and token inputs

* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology

* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI

* cont : pad token embeddings to n_embd_inp

Neo Zhang [Fri, 23 Jan 2026 12:54:10 +0000 (20:54 +0800)]

[SYCL] use malloc to support both iGPU and dGPU in same time (#18992)

* use malloc to support both iGPU and dGPU in same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <redacted>

Xuan-Son Nguyen [Fri, 23 Jan 2026 11:03:42 +0000 (12:03 +0100)]

chat : fix translategemma crash on common_chat_format_example (#19019)

Daniel Bevenius [Fri, 23 Jan 2026 08:01:36 +0000 (09:01 +0100)]

model-conversion : use BUILD_DIR variable in all scripts (#19015)

This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.

The motivation for this is that Commit
3d55846a5c626e2e608db8c24fa9ee6defaacca9 ("model-conversion : add
BUILD_DIR variable to run-converted-model scripts") introduced this
variable to the causal and embeddings scripts, but I missed the scripts
in the utils directory.

Alberto Cabrera Pérez [Fri, 23 Jan 2026 07:55:08 +0000 (07:55 +0000)]

ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860)

* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <redacted>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <redacted>
* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <redacted>
* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <redacted>
* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <redacted>
* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <redacted>

Aldehir Rojas [Fri, 23 Jan 2026 02:31:22 +0000 (20:31 -0600)]

cli : load parser definition (#19031)

* cli : load parser definition

* cont : only unload if a parser is defined

Xuan-Son Nguyen [Thu, 22 Jan 2026 20:30:06 +0000 (21:30 +0100)]

server : support preserving reasoning_content in assistant message (#18994)

* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code

Georgi Gerganov [Thu, 22 Jan 2026 20:09:01 +0000 (22:09 +0200)]

mla : make the V tensor a view of K (#18986)

* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>

Johannes Gäßler [Thu, 22 Jan 2026 19:39:25 +0000 (20:39 +0100)]

CUDA: fix alignment check for FA (#19023)

Aman Gupta [Thu, 22 Jan 2026 18:58:07 +0000 (02:58 +0800)]

convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866)

lhez [Thu, 22 Jan 2026 18:29:25 +0000 (10:29 -0800)]

opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970)

* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well

Xuan-Son Nguyen [Thu, 22 Jan 2026 18:24:37 +0000 (19:24 +0100)]

server: do not log certain endpoints (avoid log spam) (#19028)

Georgi Gerganov [Thu, 22 Jan 2026 14:17:06 +0000 (16:17 +0200)]

quant : manual overrides of tensor types take precedence (#18952)

Aaron Teo [Thu, 22 Jan 2026 13:38:02 +0000 (21:38 +0800)]

release: update github api (#19022)

Xuan-Son Nguyen [Thu, 22 Jan 2026 13:36:32 +0000 (14:36 +0100)]

mtmd : update docs to use llama_model_n_embd_inp (#18999)

손희준 [Thu, 22 Jan 2026 13:36:04 +0000 (22:36 +0900)]

server: Reorder methods in `server-task.cpp` (#19016)

* Move `task_result_state::update_chat_msg` to match with header

* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header

---------

Co-authored-by: openingnow <>

Aman Gupta [Thu, 22 Jan 2026 10:51:53 +0000 (18:51 +0800)]

CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953)

shaofeiqi [Thu, 22 Jan 2026 06:05:54 +0000 (22:05 -0800)]

opencl: add TRI op support (#18979)

Aleksei Nikiforov [Thu, 22 Jan 2026 00:16:21 +0000 (01:16 +0100)]

ggml-zdnn : mark zDNN buffers as non-host (#18967)

While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848

Pádraic Slattery [Wed, 21 Jan 2026 23:57:18 +0000 (00:57 +0100)]

ci : update GitHub Actions versions [no ci] (#18935)

Mariusz Woloszyn [Wed, 21 Jan 2026 23:55:55 +0000 (00:55 +0100)]

convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972)

* Add Ministral3ForCausalLM architecture

This adds support for newer architectures like Devstral-2

* removed blank line found after function decorator

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 18:24:37 +0000 (19:24 +0100)]

jinja: support none|string (#18995)

* jinja: support none|string

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Add as_string()

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Hendrik Erz [Wed, 21 Jan 2026 17:46:01 +0000 (18:46 +0100)]

fix: Use `tabular-nums` for chat message statistics (#18915)

* fix: Use `tabular-nums` for chat message statistics

* fix: Rebuild WebUI

Daniel Bevenius [Wed, 21 Jan 2026 17:31:34 +0000 (18:31 +0100)]

llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997)

This commit removes the mention of RoPE in the comment for the Q and K
computation as RoPE is not applied.

Jeff Bolz [Wed, 21 Jan 2026 17:01:40 +0000 (11:01 -0600)]

vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)

* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger

Adrien Gallouët [Wed, 21 Jan 2026 16:58:38 +0000 (17:58 +0100)]

common : improve error message when HTTPS is missing but required (#18987)

Signed-off-by: Adrien Gallouët <redacted>

손희준 [Wed, 21 Jan 2026 16:47:23 +0000 (01:47 +0900)]

server: /v1/responses (partial) (#18486)

* from previous PR

* Make instruction(system) as first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>

Jeff Bolz [Wed, 21 Jan 2026 16:43:43 +0000 (10:43 -0600)]

vulkan: support flash attention GQA/split_k with small batches (#18938)

Masato Nakasaka [Wed, 21 Jan 2026 16:13:43 +0000 (01:13 +0900)]

Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)

This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.

Jeff Bolz [Wed, 21 Jan 2026 15:22:02 +0000 (09:22 -0600)]

vulkan: Use mul_mat_vec_id for small values of n (#18918)

Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
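
The idea of looping a mat-vec over the batch dimension can be illustrated with a small sketch. This is plain Python, not the Vulkan shader, and it leaves out the expert selection of the ID path entirely; the function names and data are hypothetical.

```python
def mat_vec(a, x):
    """Multiply matrix a (rows x cols) by vector x (length cols)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def mat_mat_via_mat_vec(a, columns):
    """Multiply a by n column vectors, running one mat-vec per column."""
    return [mat_vec(a, col) for col in columns]


weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 input vectors

result = mat_mat_via_mat_vec(weights, batch)
print(result)  # [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0], [3.0, 7.0, 11.0]]
```

Each additional input vector adds one more mat-vec pass, so the work grows linearly with n, which matches the cost behaviour the commit message describes.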

Tarek Dakhran [Wed, 21 Jan 2026 12:30:23 +0000 (13:30 +0100)]

memory : add llama_memory_hybrid_iswa (#18601)

* memory : add llama_memory_hybrid_iswa

* Update src/llama-memory-hybrid-iswa.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 11:35:20 +0000 (12:35 +0100)]

Fix GLM 4.7 Lite MoE gating func (#18980)

* Fix GLM 4.7 MoE gating func

* Update src/models/deepseek2.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>

Matthieu Coudron [Wed, 21 Jan 2026 06:52:46 +0000 (07:52 +0100)]

gguf: display strerrno when cant load a model (#18884)

I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'

and I was sure it could access the file. Seems like --models-dir and
--models-presets don't interact like I thought they would, but I salvaged
this snippet that helps with troubleshooting:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
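
The same technique, appending the operating system's error text to a "failed to open" message, can be sketched in Python. This is an illustration of the idea only, not the gguf C code; `open_model` and the path are hypothetical.

```python
import errno
import os


def open_model(path):
    """Open a file and, on failure, include the OS error text in the message."""
    try:
        return open(path, "rb")
    except OSError as e:
        # e.errno carries the C errno value; os.strerror() maps it to text.
        raise RuntimeError(
            f"failed to open GGUF file '{path}' (errno {os.strerror(e.errno)})"
        ) from e


try:
    open_model("/nonexistent-dir/model.gguf")
except RuntimeError as err:
    message = str(err)
    cause = err.__cause__
    print(message)
```

With the error text in the message, a missing file can be told apart from, for example, a permissions problem without further debugging.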

Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]

CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)

* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5).

* Unindent as per code review request

Adrien Gallouët [Tue, 20 Jan 2026 17:28:43 +0000 (18:28 +0100)]

common, server : use the same User-Agent by default (#18957)

This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.

Signed-off-by: Adrien Gallouët <redacted>

Xuan-Son Nguyen [Tue, 20 Jan 2026 17:23:25 +0000 (18:23 +0100)]

cli : fix reasoning responses in CLI (#18961)

* cli : fix reasoning responses in CLI

* fix build

* fix build (2)

Oliver Simons [Tue, 20 Jan 2026 12:11:01 +0000 (13:11 +0100)]

CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)

* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu

Adrien Gallouët [Tue, 20 Jan 2026 10:42:49 +0000 (11:42 +0100)]

ggml : cleanup path_str() (#18928)

- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <redacted>

Georgi Gerganov [Tue, 20 Jan 2026 10:21:28 +0000 (12:21 +0200)]

metal : enable FA for MLA heads (#18950)

Daniel Bevenius [Tue, 20 Jan 2026 05:55:24 +0000 (06:55 +0100)]

convert : use n_groups instead of hardcoded values in reshape (#18929)

* convert : use n_groups instead of hardcoded values in reshape

This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allow Python to calculate the
last dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.

* use self.n_group instead of self.hparams["n_groups"]
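
Reshaping with `-1` as the last dimension, as this commit describes, can be shown with a short sketch. NumPy is used here as an assumption for illustration; the conversion script works on its own tensor objects, and the values of `n_groups` and the tensor size below are made up.

```python
import numpy as np

n_groups = 8                     # hypothetical hyperparameter value
norm_weight = np.arange(4096.0)  # hypothetical flat tensor with 4096 elements

# -1 asks the library to compute the last dimension: 4096 / 8 = 512.
reshaped = norm_weight.reshape(n_groups, -1)

print(reshaped.shape)  # (8, 512)
```

Letting the library infer the last dimension avoids hardcoding a size that only holds for one model configuration.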

Xuan-Son Nguyen [Mon, 19 Jan 2026 22:28:01 +0000 (23:28 +0100)]

server : refactor oai_parser_opt, move it to server_chat_params (#18937)

* server_chat_params

* move chat format into CLI

* use meta whenever possible

* clean up, no more chatml fallback

ddh0 [Mon, 19 Jan 2026 22:09:20 +0000 (16:09 -0600)]

convert : support Glm4MoeLite (#18936)

* initial commit for branch

* add glm-4.7-flash, move tokenizer hash

* use `glm4` pretok

* silence flake8 E302 (CI)

* apply review feedback

* add <|user|> as eog

* also add EOG `<|observation|>`

* revert llama-vocab

* inherit vocab from glm4

---------

Co-authored-by: Xuan Son Nguyen <redacted>

Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:43 +0000 (20:29 +0100)]

jinja : fix undefined keys and attributes and int/float as bool (#18924)

* fix undefined keys and attributes

* add falsy tests

* as_bool for integers and floats

* more falsy/truthy tests

* --typo

Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:15 +0000 (20:29 +0100)]

ci : run test-jinja -py on high perf [no ci] (#18916)

Lennart Austenfeld [Mon, 19 Jan 2026 18:13:31 +0000 (19:13 +0100)]

server: fix memory reservations in populate_token_probs (#18787)

Georgi Gerganov [Mon, 19 Jan 2026 18:03:19 +0000 (20:03 +0200)]

ggml : add ggml_build_forward_select (#18550)

* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment

Daniel Bevenius [Mon, 19 Jan 2026 12:12:38 +0000 (13:12 +0100)]

model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)

This commit adds a BUILD_DIR variable to the scripts used for running
converted models.

The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.
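
The pattern of an optional build directory with a default can be sketched as follows. The real scripts are not shown in this log, so this Python version is only an illustration; the `bin/` layout and the tool name are assumptions.

```python
import os


def resolve_build_dir(env=None):
    """Return BUILD_DIR if it is set, otherwise the default 'build'."""
    env = os.environ if env is None else env
    return env.get("BUILD_DIR", "build")


def tool_path(tool, env=None):
    # The bin/ subdirectory is an assumption made for this example.
    return os.path.join(resolve_build_dir(env), "bin", tool)


print(resolve_build_dir({}))                           # build
print(resolve_build_dir({"BUILD_DIR": "build-cuda"}))  # build-cuda
```

Passing the environment as a parameter keeps the helper easy to test with different configurations without touching the real process environment.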

Julius Tischbein [Sun, 18 Jan 2026 16:35:57 +0000 (17:35 +0100)]

llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887)

Francisco Herrera [Sun, 18 Jan 2026 10:03:35 +0000 (05:03 -0500)]

docs: add linux to index (#18907)

Xuan-Son Nguyen [Sun, 18 Jan 2026 07:14:27 +0000 (08:14 +0100)]

tests : add test-jinja -py option for cross-checking (#18906)

* tests : add test-jinja -py option for cross-checking

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix + add source

* SandboxedEnvironment

* fix array.map case

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Sigbjørn Skjæret [Sun, 18 Jan 2026 02:40:06 +0000 (03:40 +0100)]

jinja : fix object item order (and properly implement dictsort) (#18904)

* fix object item order

* as_ordered_object

* copy whole object

Sigbjørn Skjæret [Sun, 18 Jan 2026 01:53:01 +0000 (02:53 +0100)]

jinja : attribute support for join, map and sort (#18883)

* support negative array index and default value

* attribute support (int and str) for join, map and sort

* add tests

* update CODEOWNERS

* improve fixme sorting comment

Sigbjørn Skjæret [Sun, 18 Jan 2026 00:05:09 +0000 (01:05 +0100)]

jinja : add missing tojson filter for bool (#18900)

* add missing tojson for bool

* add more literal tests

Sigbjørn Skjæret [Sat, 17 Jan 2026 23:57:51 +0000 (00:57 +0100)]

jinja : fix lexing of float literals with sign (#18901)

* fix lexing of float literals with sign

* add test

* consume_numeric

Xuan-Son Nguyen [Sat, 17 Jan 2026 23:48:55 +0000 (00:48 +0100)]

jinja: correct member access rule (#18905)

lhez [Sat, 17 Jan 2026 21:50:32 +0000 (13:50 -0800)]

opencl: fix q6_K mv for m=1 (#18893)

Sigbjørn Skjæret [Sat, 17 Jan 2026 20:52:02 +0000 (21:52 +0100)]

ci : add label for jinja changes (#18903)

Georgi Gerganov [Sat, 17 Jan 2026 13:42:42 +0000 (15:42 +0200)]

kv-cache : optimize KQ mask construction (#18842)

* kv-cache : optimize KQ mask construction

* cont : add explanation + improve

* cont : fix

Reese Levine [Sat, 17 Jan 2026 00:12:43 +0000 (16:12 -0800)]

ggml webgpu: support for backend sampling (#18880)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Add argmax

* Add argmax,cumsum,sum,sum_rows

* Add necessary CPY/GET_ROWS operators

* Support for argsort using multi-pass strategy

* Update set_rows for i32 indices, move to pre-wgsl

* Port unary operators to pre-wgsl and support FILL

* Implement PAD

* Add support for top-k

* clean up, scope pipeline init mutex

* fix newline

* Add support for log

* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <redacted>
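
The SOFTPLUS entry in the commit above mentions f16 overflow in intermediate calculations. The sketch below shows why that can happen and one common way to avoid it. The commit does not spell out the pattern the shader uses, so the stable formulation here is an assumption, not a transcription of the WebGPU code.

```python
import math

F16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def softplus_naive(x):
    """Direct form: log(1 + exp(x)). The intermediate exp(x) grows quickly."""
    return math.log(1.0 + math.exp(x))


def softplus_stable(x):
    """max(x, 0) + log1p(exp(-|x|)), algebraically equal to log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


x = 12.0
print(math.exp(x) > F16_MAX)  # True: exp(12) is already outside the f16 range
print(softplus_stable(x))
```

In the stable form the exponential is only ever taken of a non-positive number, so the intermediate value stays within (0, 1] regardless of the input.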

Thore Koritzius [Fri, 16 Jan 2026 14:59:56 +0000 (15:59 +0100)]

ggml : extend ggml_pool_1d + metal (#16429)

* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <redacted>

hipudding [Fri, 16 Jan 2026 12:32:17 +0000 (20:32 +0800)]

docs : update ops.md for CANN backend (#18654)

Perry Naseck [Fri, 16 Jan 2026 11:38:25 +0000 (06:38 -0500)]

ggml-blas: hide warnings from included BLAS headers (#18818)

* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers

Tarek Dakhran [Fri, 16 Jan 2026 10:23:08 +0000 (11:23 +0100)]

mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876)

Xuan-Son Nguyen [Fri, 16 Jan 2026 10:22:06 +0000 (11:22 +0100)]

common : implement new jinja template engine (#18462)

* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support binded functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inference

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null -- > none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* implement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Georgi Gerganov <redacted>

Julius Tischbein [Fri, 16 Jan 2026 08:46:51 +0000 (09:46 +0100)]

Setting mmap and direct_io to false as default in llama-bench.cpp (#18841)

Raul Torres [Fri, 16 Jan 2026 08:34:09 +0000 (08:34 +0000)]

CANN: Remove unused `ggml_cann_get_device` function (#18625)

Chenguang Li [Fri, 16 Jan 2026 08:24:04 +0000 (16:24 +0800)]

CANN: fix an issue where get_env was not fully renamed (#18796)

* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph

hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]

CANN: support gated linear attn (#18653)

* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <redacted>
Co-authored-by: hipudding <redacted>
* CANN: optimize OP gla

Optimize gla for high performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <redacted>
Co-authored-by: YushengZhao <redacted>

shaofeiqi [Thu, 15 Jan 2026 19:17:17 +0000 (11:17 -0800)]

OpenCL: add SOLVE_TRI op support (#18846)

Georgi Gerganov [Thu, 15 Jan 2026 18:53:01 +0000 (20:53 +0200)]

cuda : print less debug logs when disabling cuda graphs (#18868)

Georgi Gerganov [Thu, 15 Jan 2026 17:35:57 +0000 (19:35 +0200)]

context : do not reserve scheduler for warmups (#18867)

ddh0 [Thu, 15 Jan 2026 17:16:29 +0000 (11:16 -0600)]

llama : add adaptive-p sampler (#17927)

* initial commit for branch

* simplify constants

* add params to `struct common_params_sampling`, add reference to PR

* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`

* add args, rename `queue_size` -> `window_size`

* improved comments

* minor

* remove old unused code from algorithm

* minor

* add power law case to `common_sampler_init`, add sampler name mappings

* clarify behaviour when `window_size = 0`

* add missing enums

* remove `target_range` param, make `target == 1` no-op, cleanup code

* oops, straggler

* add missing parameters in `server-task.cpp`

* copy from author

ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069

* remove old debug log, style nit

* fix compiler warning, add commented-out logging per token

* re-write + change parameters + simplify

* oops forgot args.cpp

* fix leftover `window_size`

* add missing values to `common_params_sampling::print()`

* with logging

* does this fix it?

* no, but does this?

* update default decay

* optimize

* fix bad merge

my git skills are lacking

* silence `missing initializer for member`

* update default decay to 0.9

* fix logging

* format (double)

* add power law to the new `samplers` vector

* log sampler init values

* improve logging messages in llama_sampler_power_law

* remove extraneous logging

* simplify target computation

last commit with debug logging!

* remove debug logging, explicitly clamp params at init

* add `use_power_law` flag + logic, minor cleanup

* update `power-law` -> `adaptive-p`

* fix cold start EMA

- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`

this fixes a "cold start" problem with the moving average

* update `SHARPNESS` constant to `10.0f`

* minor style fixes

no functional changes

* minor style fixes cont.

* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)

* separate into `apply` + `accept` functions

* `pending_token_idx`: switch from `llama_token` to `int32`

functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now

* don't transform logits <= -1e9f

* fix masking in backend top-p, min-p

* address review comments

* typo in comments `RND` -> `RNG`

* add docs

* add recommended values in completion docs

* address PR feedback

* remove trailing whitespace (for CI `editorconfig`)

* add adaptive-p to `common_sampler_types_from_chars`

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 15 Jan 2026 16:10:28 +0000 (17:10 +0100)]

server: improve slots scheduling for n_cmpl (#18789)

* server : make sure children tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Jan 2026 14:39:17 +0000 (16:39 +0200)]

context : reserve new scheduler when graph topology changes (#18547)

* context : reserve new scheduler when graph topology changes

* cont : fix

* cont : fix reserve

* cont : reserve only when changes occur + timing

* context : add comments

* llama : reserve on sampler changes

* common : allow null common_sampler

* server : task declares needs (embd, logits, sampling)

* server : do not init sampler if not needed

* llama : fix need_reserve when unsetting a sampler

* server : consolidate slot reset/clear logic

commit | commitdiff | tree

Johannes Gäßler [Thu, 15 Jan 2026 14:14:50 +0000 (15:14 +0100)]

CUDA: fix alignment on register spill for FA (#18815)

commit | commitdiff | tree

shalinib-ibm [Thu, 15 Jan 2026 09:31:18 +0000 (15:01 +0530)]

ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837)

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 15 Jan 2026 09:24:28 +0000 (10:24 +0100)]

lora: make sure model keep track of associated adapters (#18490)

* lora: make sure model keep track of associated adapters

* deprecate llama_adapter_lora_free

* minor : std::unordered_set over std::set

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Thu, 15 Jan 2026 09:12:46 +0000 (10:12 +0100)]

model-loader : support bool array sliding window pattern (#18850)

commit | commitdiff | tree

Adrien Gallouët [Thu, 15 Jan 2026 08:47:29 +0000 (09:47 +0100)]

tests : download models only when running ctest (#18843)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Max Krasnyansky [Thu, 15 Jan 2026 05:46:12 +0000 (21:46 -0800)]

hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)

* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factor out all hvx_copy functions into hvx-copy.h header and reduce code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.

Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param
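
Several notes in the hexagon commit above mention aligned (`_a`) and unaligned (`_u`) variants generated from a shared macro. The sketch below shows that general pattern in portable scalar C++. It uses no HVX intrinsics, and every name in it (`load_f32_a`, `load_f32_u`, `DEFINE_REDUCE_SUM_F32`, `sketch_reduce_sum_f32_*`, the `demo_*` helpers) is hypothetical rather than taken from `hvx-reduce.h`.

```cpp
#include <cstddef>
#include <cstring>

// Aligned load: assumes `src` really points at a properly aligned float array.
static inline float load_f32_a(const void * src, std::size_t i) {
    return static_cast<const float *>(src)[i];
}

// Unaligned load: copies the bytes out, so any address is acceptable.
static inline float load_f32_u(const void * src, std::size_t i) {
    float v;
    std::memcpy(&v, static_cast<const unsigned char *>(src) + i * sizeof(float), sizeof(float));
    return v;
}

// One loop body, stamped out once per load flavour.
#define DEFINE_REDUCE_SUM_F32(suffix, load_op)                                             \
    static inline float sketch_reduce_sum_f32_##suffix(const void * src, std::size_t n) {  \
        float acc = 0.0f;                                                                  \
        for (std::size_t i = 0; i < n; ++i) {                                              \
            acc += load_op(src, i);                                                        \
        }                                                                                  \
        return acc;                                                                        \
    }

DEFINE_REDUCE_SUM_F32(a, load_f32_a)
DEFINE_REDUCE_SUM_F32(u, load_f32_u)

// Sums a naturally aligned float array through the aligned variant.
static inline float demo_sum_aligned() {
    const float vals[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return sketch_reduce_sum_f32_a(vals, 4);
}

// Copies the same values to an offset of one byte, then sums them
// through the unaligned variant.
static inline float demo_sum_unaligned() {
    const float vals[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    unsigned char buf[sizeof(vals) + 1];
    std::memcpy(buf + 1, vals, sizeof(vals));
    return sketch_reduce_sum_f32_u(buf + 1, 4);
}
```

The point of the pattern is that the loop logic exists once, and only the load (or store) primitive changes between variants. A real vector implementation would presumably swap in aligned and unaligned vector loads instead of the scalar helpers used here.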

commit | commitdiff | tree

Oliver Simons [Thu, 15 Jan 2026 02:44:54 +0000 (03:44 +0100)]

CUDA: Factor out and re-use `block_reduce` function (#18785)

* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <redacted>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <redacted>
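
Two notes in the commit above lend themselves to a short illustration: using `enum class` for `block_reduce_method`, and making the `static_assert` condition type-dependent. The sketch below is plain host C++ rather than CUDA. Only the name `block_reduce_method` comes from the commit message. The enumerators, `legacy_method`, `dependent_false`, and `reduce_op` are hypothetical and are not taken from `common.cuh`.

```cpp
#include <type_traits>

// Scoped enum: enumerators stay inside block_reduce_method:: and do not
// convert to int implicitly. The enumerator names are assumptions.
enum class block_reduce_method { SUM, MAX };

// Plain enum, shown only for contrast.
enum legacy_method { LEGACY_SUM, LEGACY_MAX };

static_assert( std::is_convertible<legacy_method, int>::value,
              "a plain enum converts to int implicitly");
static_assert(!std::is_convertible<block_reduce_method, int>::value,
              "a scoped enum does not");

// Always false, but the value depends on T, so the compiler cannot
// evaluate it before the enclosing template is instantiated.
template <typename T>
struct dependent_false : std::false_type {};

// Primary template: rejects unsupported element types, but only when
// someone actually instantiates it. Writing static_assert(false, ...)
// here instead may be rejected by some compilers while parsing.
template <block_reduce_method method, typename T>
struct reduce_op {
    static_assert(dependent_false<T>::value, "unsupported element type for reduce_op");
};

// Supported case: float.
template <block_reduce_method method>
struct reduce_op<method, float> {
    static float apply(float a, float b) {
        return method == block_reduce_method::SUM ? a + b : (a > b ? a : b);
    }
};
```

With this layout, `reduce_op<block_reduce_method::SUM, float>::apply(1.0f, 2.0f)` compiles and adds its arguments. Naming `reduce_op<block_reduce_method::SUM, double>::apply` would trigger the assertion message at the point of use rather than when the header is parsed.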

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Wed, 14 Jan 2026 19:29:35 +0000 (20:29 +0100)]

Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)

* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>

* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>

* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>

---------

Co-authored-by: Xuan-Son Nguyen <redacted>

commit | commitdiff | tree

Junwon Hwang [Wed, 14 Jan 2026 18:38:21 +0000 (03:38 +0900)]

model : clean up and fix EXAONE-MoE configuration (#18840)

* Fix mismatch of EXAONE-MoE configuration

* ensure gating func is set, cleanup

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Adrien Gallouët [Wed, 14 Jan 2026 17:02:47 +0000 (18:02 +0100)]

refactor : remove libcurl, use OpenSSL when available (#18828)

commit | commitdiff | tree

Jeff Bolz [Wed, 14 Jan 2026 09:59:05 +0000 (03:59 -0600)]

vulkan: Check maxStorageBufferRange in supports_op (#18709)

* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled

commit | commitdiff | tree

Aman Gupta [Wed, 14 Jan 2026 09:55:15 +0000 (17:55 +0800)]

llama-model: fix unfortunate typo (#18832)

commit | commitdiff | tree

Daniel Bevenius [Wed, 14 Jan 2026 09:31:49 +0000 (10:31 +0100)]

CUDA : fix typo in clang pragma comment [no ci] (#18830)

commit | commitdiff | tree

Ruben Ortlam [Wed, 14 Jan 2026 08:41:23 +0000 (09:41 +0100)]

vulkan: work around Intel fp16 bug in mmq (#18814)

commit | commitdiff | tree

Perry Naseck [Wed, 14 Jan 2026 07:22:25 +0000 (02:22 -0500)]

ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705)

Packaging of ggml-org/llama.cpp
