git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Neo Zhang [Fri, 23 Jan 2026 12:54:10 +0000 (20:54 +0800)]
[SYCL] use malloc to support both iGPU and dGPU at the same time (#18992)

* use malloc to support both iGPU and dGPU at the same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <redacted>
Xuan-Son Nguyen [Fri, 23 Jan 2026 11:03:42 +0000 (12:03 +0100)]
chat : fix translategemma crash on common_chat_format_example (#19019)

Daniel Bevenius [Fri, 23 Jan 2026 08:01:36 +0000 (09:01 +0100)]
model-conversion : use BUILD_DIR variable in all scripts (#19015)

This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.

The motivation for this is that commit
3d55846a5c626e2e608db8c24fa9ee6defaacca9 ("model-conversion : add
BUILD_DIR variable to run-converted-model scripts") introduced this
variable to the causal and embeddings scripts, but I missed the scripts
in the utils directory.

Alberto Cabrera Pérez [Fri, 23 Jan 2026 07:55:08 +0000 (07:55 +0000)]
ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860)

* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <redacted>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <redacted>
* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <redacted>
* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <redacted>
* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <redacted>
* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <redacted>
Aldehir Rojas [Fri, 23 Jan 2026 02:31:22 +0000 (20:31 -0600)]
cli : load parser definition (#19031)

* cli : load parser definition

* cont : only unload if a parser is defined

Xuan-Son Nguyen [Thu, 22 Jan 2026 20:30:06 +0000 (21:30 +0100)]
server : support preserving reasoning_content in assistant message (#18994)

* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code

Georgi Gerganov [Thu, 22 Jan 2026 20:09:01 +0000 (22:09 +0200)]
mla : make the V tensor a view of K (#18986)

* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
Johannes Gäßler [Thu, 22 Jan 2026 19:39:25 +0000 (20:39 +0100)]
CUDA: fix alignment check for FA (#19023)

Aman Gupta [Thu, 22 Jan 2026 18:58:07 +0000 (02:58 +0800)]
convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866)

lhez [Thu, 22 Jan 2026 18:29:25 +0000 (10:29 -0800)]
opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970)

* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well

Xuan-Son Nguyen [Thu, 22 Jan 2026 18:24:37 +0000 (19:24 +0100)]
server: do not log certain endpoints (avoid log spam) (#19028)

Georgi Gerganov [Thu, 22 Jan 2026 14:17:06 +0000 (16:17 +0200)]
quant : manual overrides of tensor types take precedence (#18952)

Aaron Teo [Thu, 22 Jan 2026 13:38:02 +0000 (21:38 +0800)]
release: update github api (#19022)

Xuan-Son Nguyen [Thu, 22 Jan 2026 13:36:32 +0000 (14:36 +0100)]
mtmd : update docs to use llama_model_n_embd_inp (#18999)

손희준 [Thu, 22 Jan 2026 13:36:04 +0000 (22:36 +0900)]
server: Reorder methods in `server-task.cpp` (#19016)

* Move `task_result_state::update_chat_msg` to match with header

* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header

---------

Co-authored-by: openingnow <>
Aman Gupta [Thu, 22 Jan 2026 10:51:53 +0000 (18:51 +0800)]
CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953)

shaofeiqi [Thu, 22 Jan 2026 06:05:54 +0000 (22:05 -0800)]
opencl: add TRI op support (#18979)

Aleksei Nikiforov [Thu, 22 Jan 2026 00:16:21 +0000 (01:16 +0100)]
ggml-zdnn : mark zDNN buffers as non-host (#18967)

While the buffers reside in host memory, an additional transformation
is needed before they can be used with zDNN.

Fixes #18848

Pádraic Slattery [Wed, 21 Jan 2026 23:57:18 +0000 (00:57 +0100)]
ci : update GitHub Actions versions [no ci] (#18935)

Mariusz Woloszyn [Wed, 21 Jan 2026 23:55:55 +0000 (00:55 +0100)]
convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972)

* Add Ministral3ForCausalLM architecture

This adds support for newer architectures like Devstral-2

* removed blank line found after function decorator

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 18:24:37 +0000 (19:24 +0100)]
jinja: support none|string (#18995)

* jinja: support none|string

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Add as_string()

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Hendrik Erz [Wed, 21 Jan 2026 17:46:01 +0000 (18:46 +0100)]
fix: Use `tabular-nums` for chat message statistics (#18915)

* fix: Use `tabular-nums` for chat message statistics

* fix: Rebuild WebUI

Daniel Bevenius [Wed, 21 Jan 2026 17:31:34 +0000 (18:31 +0100)]
llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997)

This commit removes the mention of RoPE in the comment for the Q and K
computation as RoPE is not applied.

Jeff Bolz [Wed, 21 Jan 2026 17:01:40 +0000 (11:01 -0600)]
vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)

* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger

Adrien Gallouët [Wed, 21 Jan 2026 16:58:38 +0000 (17:58 +0100)]
common : improve error message when HTTPS is missing but required (#18987)

Signed-off-by: Adrien Gallouët <redacted>
손희준 [Wed, 21 Jan 2026 16:47:23 +0000 (01:47 +0900)]
server: /v1/responses (partial) (#18486)

* from previous PR

* Make the instruction (system) the first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>
Jeff Bolz [Wed, 21 Jan 2026 16:43:43 +0000 (10:43 -0600)]
vulkan: support flash attention GQA/split_k with small batches (#18938)

Masato Nakasaka [Wed, 21 Jan 2026 16:13:43 +0000 (01:13 +0900)]
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)

This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.

Jeff Bolz [Wed, 21 Jan 2026 15:22:02 +0000 (09:22 -0600)]
vulkan: Use mul_mat_vec_id for small values of n (#18918)

Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.

Tarek Dakhran [Wed, 21 Jan 2026 12:30:23 +0000 (13:30 +0100)]
memory : add llama_memory_hybrid_iswa (#18601)

* memory : add llama_memory_hybrid_iswa

* Update src/llama-memory-hybrid-iswa.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 11:35:20 +0000 (12:35 +0100)]
Fix GLM 4.7 Lite MoE gating func (#18980)

* Fix GLM 4.7 MoE gating func

* Update src/models/deepseek2.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Matthieu Coudron [Wed, 21 Jan 2026 06:52:46 +0000 (07:52 +0100)]
gguf: display strerror(errno) when a model can't be loaded (#18884)

I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'

even though I was sure it could access the file. It seems --models-dir and
--models-presets don't interact the way I thought they would, but I salvaged
this snippet, which helps with troubleshooting:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
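The improved log line can be reproduced with a minimal sketch (in Python rather than the C++ of the actual gguf loader; the file name is illustrative): on failure, append the human-readable errno description to the message.

```python
import os

def open_error_message(fname: str) -> str:
    """Return "" on success, otherwise an error line that includes
    os.strerror(errno), mirroring the improved gguf_init_from_file log."""
    try:
        with open(fname, "rb"):
            return ""
    except OSError as e:
        return (f"gguf_init_from_file: failed to open GGUF file "
                f"'{fname}' (errno {os.strerror(e.errno)})")

print(open_error_message("no-such-dir/model.gguf"))
```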

Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)

* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request

Adrien Gallouët [Tue, 20 Jan 2026 17:28:43 +0000 (18:28 +0100)]
common, server : use the same User-Agent by default (#18957)

This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.

Signed-off-by: Adrien Gallouët <redacted>
Xuan-Son Nguyen [Tue, 20 Jan 2026 17:23:25 +0000 (18:23 +0100)]
cli : fix reasoning responses in CLI (#18961)

* cli : fix reasoning responses in CLI

* fix build

* fix build (2)

Oliver Simons [Tue, 20 Jan 2026 12:11:01 +0000 (13:11 +0100)]
CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)

* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
offsets in memory.

* Remove unnecessary include from top-k.cu

Adrien Gallouët [Tue, 20 Jan 2026 10:42:49 +0000 (11:42 +0100)]
ggml : cleanup path_str() (#18928)

- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <redacted>
Georgi Gerganov [Tue, 20 Jan 2026 10:21:28 +0000 (12:21 +0200)]
metal : enable FA for MLA heads (#18950)

Daniel Bevenius [Tue, 20 Jan 2026 05:55:24 +0000 (06:55 +0100)]
convert : use n_groups instead of hardcoded values in reshape (#18929)

* convert : use n_groups instead of hardcoded values in reshape

This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allows Python to calculate the last
dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.

* use self.n_group instead of self.hparams["n_groups"]

Xuan-Son Nguyen [Mon, 19 Jan 2026 22:28:01 +0000 (23:28 +0100)]
server : refactor oai_parser_opt, move it to server_chat_params (#18937)

* server_chat_params

* move chat format into CLI

* use meta whenever possible

* clean up, no more chatml fallback

ddh0 [Mon, 19 Jan 2026 22:09:20 +0000 (16:09 -0600)]
convert : support Glm4MoeLite (#18936)

* initial commit for branch

* add glm-4.7-flash, move tokenizer hash

* use `glm4` pretok

* silence flake8 E302 (CI)

* apply review feedback

* add <|user|> as eog

* also add EOG `<|observation|>`

* revert llama-vocab

* inherit vocab from glm4

---------

Co-authored-by: Xuan Son Nguyen <redacted>
Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:43 +0000 (20:29 +0100)]
jinja : fix undefined keys and attributes and int/float as bool (#18924)

* fix undefined keys and attributes

* add falsy tests

* as_bool for integers and floats

* more falsy/truthy tests

* --typo

Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:15 +0000 (20:29 +0100)]
ci : run test-jinja -py on high perf [no ci] (#18916)

Lennart Austenfeld [Mon, 19 Jan 2026 18:13:31 +0000 (19:13 +0100)]
server: fix memory reservations in populate_token_probs (#18787)

Georgi Gerganov [Mon, 19 Jan 2026 18:03:19 +0000 (20:03 +0200)]
ggml : add ggml_build_forward_select (#18550)

* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment

Daniel Bevenius [Mon, 19 Jan 2026 12:12:38 +0000 (13:12 +0100)]
model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)

This commit adds a BUILD_DIR variable to the scripts used for running
converted models.

The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.

Julius Tischbein [Sun, 18 Jan 2026 16:35:57 +0000 (17:35 +0100)]
llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887)

Francisco Herrera [Sun, 18 Jan 2026 10:03:35 +0000 (05:03 -0500)]
docs: add linux to index (#18907)

Xuan-Son Nguyen [Sun, 18 Jan 2026 07:14:27 +0000 (08:14 +0100)]
tests : add test-jinja -py option for cross-checking (#18906)

* tests : add test-jinja -py option for cross-checking

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix + add source

* SandboxedEnvironment

* fix array.map case

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Sigbjørn Skjæret [Sun, 18 Jan 2026 02:40:06 +0000 (03:40 +0100)]
jinja : fix object item order (and properly implement dictsort) (#18904)

* fix object item order

* as_ordered_object

* copy whole object

Sigbjørn Skjæret [Sun, 18 Jan 2026 01:53:01 +0000 (02:53 +0100)]
jinja : attribute support for join, map and sort (#18883)

* support negative array index and default value

* attribute support (int and str) for join, map and sort

* add tests

* update CODEOWNERS

* improve fixme sorting comment

Sigbjørn Skjæret [Sun, 18 Jan 2026 00:05:09 +0000 (01:05 +0100)]
jinja : add missing tojson filter for bool (#18900)

* add missing tojson for bool

* add more literal tests

Sigbjørn Skjæret [Sat, 17 Jan 2026 23:57:51 +0000 (00:57 +0100)]
jinja : fix lexing of float literals with sign (#18901)

* fix lexing of float literals with sign

* add test

* consume_numeric

Xuan-Son Nguyen [Sat, 17 Jan 2026 23:48:55 +0000 (00:48 +0100)]
jinja: correct member access rule (#18905)

lhez [Sat, 17 Jan 2026 21:50:32 +0000 (13:50 -0800)]
opencl: fix q6_K mv for m=1 (#18893)

Sigbjørn Skjæret [Sat, 17 Jan 2026 20:52:02 +0000 (21:52 +0100)]
ci : add label for jinja changes (#18903)

Georgi Gerganov [Sat, 17 Jan 2026 13:42:42 +0000 (15:42 +0200)]
kv-cache : optimize KQ mask construction (#18842)

* kv-cache : optimize KQ mask construction

* cont : add explanation + improve

* cont : fix

Reese Levine [Sat, 17 Jan 2026 00:12:43 +0000 (16:12 -0800)]
ggml webgpu: support for backend sampling (#18880)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
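The overflow hazard noted above is easy to demonstrate: naive log(1 + exp(x)) blows up as soon as exp(x) overflows, while a rearranged form keeps every exp() argument non-positive. The shader addresses this by widening intermediates to f32; this Python sketch (not the WGSL code) shows the equivalent stable identity.

```python
import math

def softplus_naive(x):
    # direct log(1 + exp(x)); exp() overflows for large x
    return math.log(1.0 + math.exp(x))

def softplus_stable(x):
    # softplus(x) == max(x, 0) + log1p(exp(-|x|)):
    # exp() only ever sees non-positive arguments, so it cannot overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(softplus_stable(1000.0))  # 1000.0
try:
    softplus_naive(1000.0)
except OverflowError:
    print("naive form overflows")
```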

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Add argmax

* Add argmax,cumsum,sum,sum_rows

* Add necessary CPY/GET_ROWS operators

* Support for argsort using multi-pass strategy

* Update set_rows for i32 indices, move to pre-wgsl

* Port unary operators to pre-wgsl and support FILL

* Implement PAD

* Add support for top-k

* clean up, scope pipeline init mutex

* fix newline

* Add support for log

* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <redacted>
Thore Koritzius [Fri, 16 Jan 2026 14:59:56 +0000 (15:59 +0100)]
ggml : extend ggml_pool_1d + metal (#16429)

* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <redacted>
hipudding [Fri, 16 Jan 2026 12:32:17 +0000 (20:32 +0800)]
docs : update ops.md for CANN backend (#18654)

Perry Naseck [Fri, 16 Jan 2026 11:38:25 +0000 (06:38 -0500)]
ggml-blas: hide warnings from included BLAS headers (#18818)

* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers

Tarek Dakhran [Fri, 16 Jan 2026 10:23:08 +0000 (11:23 +0100)]
mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876)

Xuan-Son Nguyen [Fri, 16 Jan 2026 10:22:06 +0000 (11:22 +0100)]
common : implement new jinja template engine (#18462)

* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support bound functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inference

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null -- > none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* implement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Julius Tischbein [Fri, 16 Jan 2026 08:46:51 +0000 (09:46 +0100)]
Setting mmap and direct_io to false by default in llama-bench.cpp (#18841)

Raul Torres [Fri, 16 Jan 2026 08:34:09 +0000 (08:34 +0000)]
CANN: Remove unused `ggml_cann_get_device` function (#18625)

Chenguang Li [Fri, 16 Jan 2026 08:24:04 +0000 (16:24 +0800)]
CANN: fix an issue where get_env was not fully renamed (#18796)

* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph

hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]
CANN: support gated linear attn (#18653)

* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.

Co-authored-by: YushengZhao <redacted>
Co-authored-by: hipudding <redacted>
* CANN: optimize OP gla

Optimize gla for higher performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <redacted>
Co-authored-by: YushengZhao <redacted>
shaofeiqi [Thu, 15 Jan 2026 19:17:17 +0000 (11:17 -0800)]
OpenCL: add SOLVE_TRI op support (#18846)

Georgi Gerganov [Thu, 15 Jan 2026 18:53:01 +0000 (20:53 +0200)]
cuda : print less debug logs when disabling cuda graphs (#18868)

Georgi Gerganov [Thu, 15 Jan 2026 17:35:57 +0000 (19:35 +0200)]
context : do not reserve scheduler for warmups (#18867)

ddh0 [Thu, 15 Jan 2026 17:16:29 +0000 (11:16 -0600)]
llama : add adaptive-p sampler (#17927)

* initial commit for branch

* simplify constants

* add params to `struct common_params_sampling`, add reference to PR

* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`

* add args, rename `queue_size` -> `window_size`

* improved comments

* minor

* remove old unused code from algorithm

* minor

* add power law case to `common_sampler_init`, add sampler name mappings

* clarify behaviour when `window_size = 0`

* add missing enums

* remove `target_range` param, make `target == 1` no-op, cleanup code

* oops, straggler

* add missing parameters in `server-task.cpp`

* copy from author

ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069

* remove old debug log, style nit

* fix compiler warning, add commented-out logging per token

* re-write + change parameters + simplify

* oops forgot args.cpp

* fix leftover `window_size`

* add missing values to `common_params_sampling::print()`

* with logging

* does this fix it?

* no, but does this?

* update default decay

* optimize

* fix bad merge

my git skills are lacking

* silence `missing initializer for member`

* update default decay to 0.9

* fix logging

* format (double)

* add power law to the new `samplers` vector

* log sampler init values

* improve logging messages in llama_sampler_power_law

* remove extraneous logging

* simplify target computation

last commit with debug logging!

* remove debug logging, explicitly clamp params at init

* add `use_power_law` flag + logic, minor cleanup

* update `power-law` -> `adaptive-p`

* fix cold start EMA

- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`

this fixes a "cold start" problem with the moving average
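A minimal sketch of that cold-start fix (the accumulator names follow the commit; the decayed-update rule is an assumed standard exponential weighting, not the actual sampler code): seeding both accumulators at their steady-state values makes the very first average already equal the target, instead of drifting in from an empty window.

```python
def make_ema(target: float, decay: float):
    # seed both accumulators at their fixed point: if every observation
    # were exactly `target`, weighted_sum would converge to
    # target / (1 - decay) and total_weight to 1 / (1 - decay)
    state = {
        "weighted_sum": target / (1.0 - decay),
        "total_weight": 1.0 / (1.0 - decay),
    }

    def update(x: float) -> float:
        # standard exponentially-decayed weighted average
        state["weighted_sum"] = state["weighted_sum"] * decay + x
        state["total_weight"] = state["total_weight"] * decay + 1.0
        return state["weighted_sum"] / state["total_weight"]

    return update

ema = make_ema(target=0.5, decay=0.9)
print(ema(0.5))  # ~0.5: no cold-start drift on the first observation
```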

* update `SHARPNESS` constant to `10.0f`

* minor style fixes

no functional changes

* minor style fixes cont.

* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)

* separate into `apply` + `accept` functions

* `pending_token_idx`: switch from `llama_token` to `int32`

functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now

* don't transform logits <= -1e9f

* fix masking in backend top-p, min-p

* address review comments

* typo in comments `RND` -> `RNG`

* add docs

* add recommended values in completion docs

* address PR feedback

* remove trailing whitespace (for CI `editorconfig`)

* add to adaptive-p to `common_sampler_types_from_chars`

Xuan-Son Nguyen [Thu, 15 Jan 2026 16:10:28 +0000 (17:10 +0100)]
server: improve slots scheduling for n_cmpl (#18789)

* server : make sure child tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task

7 weeks agocontext : reserve new scheduler when graph topology changes (#18547)
Georgi Gerganov [Thu, 15 Jan 2026 14:39:17 +0000 (16:39 +0200)]
context : reserve new scheduler when graph topology changes (#18547)

* context : reserve new scheduler when graph topology changes

* cont : fix

* cont : fix reserve

* cont : reserve only when changes occur + timing

* context : add comments

* llama : reserve on sampler changes

* common : allow null common_sampler

* server : task declares needs (embd, logits, sampling)

* server : do not init sampler if not needed

* llama : fix need_reserve when unsetting a sampler

* server : consolidate slot reset/clear logic

7 weeks agoCUDA: fix alignment on register spill for FA (#18815)
Johannes Gäßler [Thu, 15 Jan 2026 14:14:50 +0000 (15:14 +0100)]
CUDA: fix alignment on register spill for FA (#18815)

7 weeks agoggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837)
shalinib-ibm [Thu, 15 Jan 2026 09:31:18 +0000 (15:01 +0530)]
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837)

7 weeks agolora: make sure model keeps track of associated adapters (#18490)
Xuan-Son Nguyen [Thu, 15 Jan 2026 09:24:28 +0000 (10:24 +0100)]
lora: make sure model keeps track of associated adapters (#18490)

* lora: make sure model keeps track of associated adapters

* deprecate llama_adapter_lora_free

* minor : std::unordered_set over std::set

---------

Co-authored-by: Georgi Gerganov <redacted>
7 weeks agomodel-loader : support bool array sliding window pattern (#18850)
Sigbjørn Skjæret [Thu, 15 Jan 2026 09:12:46 +0000 (10:12 +0100)]
model-loader : support bool array sliding window pattern (#18850)

7 weeks agotests : download models only when running ctest (#18843)
Adrien Gallouët [Thu, 15 Jan 2026 08:47:29 +0000 (09:47 +0100)]
tests : download models only when running ctest (#18843)

Signed-off-by: Adrien Gallouët <redacted>
7 weeks agohexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and...
Max Krasnyansky [Thu, 15 Jan 2026 05:46:12 +0000 (21:46 -0800)]
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)

* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Updated the HTP ops infra to support OP_CPY.

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use the `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.
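
A scalar reference sketch of the reduction being refactored above may help; the real HVX version vectorizes the same math. This is illustrative only, but it shows the `dst, src, ..., n` argument order the commits standardize on:

```cpp
#include <cstddef>

// Scalar reference for the sum-of-squares reduction (the HVX implementation
// computes the same result with vector lanes). Note the dst-first signature.
void sum_of_squares_f32(float * dst, const float * src, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += src[i] * src[i];
    }
    *dst = acc;
}
```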

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param

8 weeks agoCUDA: Factor out and re-use `block_reduce` function (#18785)
Oliver Simons [Thu, 15 Jan 2026 02:44:54 +0000 (03:44 +0100)]
CUDA: Factor out and re-use `block_reduce` function (#18785)

* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when not
accounting for it once for each call to two_stage_warp_reduce).
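
The two-stage reduction idea can be sketched in scalar form (this is not the CUDA kernel; names and structure are illustrative): each "warp" reduces its own slice, the partial results land in a buffer standing in for shared memory, and a second stage reduces the partials. Hoisting that buffer into the caller, as the commit does with smem, lets callers size and reuse it explicitly.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Scalar model of a two-stage block reduction.
float block_reduce_sum(const std::vector<float> & vals, size_t warp_size) {
    std::vector<float> partials; // stand-in for shared memory
    // Stage 1: each "warp" reduces its own contiguous slice.
    for (size_t i = 0; i < vals.size(); i += warp_size) {
        const size_t end = std::min(vals.size(), i + warp_size);
        partials.push_back(std::accumulate(vals.begin() + i, vals.begin() + end, 0.0f));
    }
    // Stage 2: reduce the per-warp partials.
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}
```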

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <redacted>
* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
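
A minimal sketch of the pattern (names are illustrative, not the kernel code): a plain `static_assert(false, ...)` in an unreachable branch can fire while the compiler merely parses the template, whereas a condition that depends on the template parameter is only evaluated at instantiation time.

```cpp
#include <type_traits>

// Type-dependent false: only evaluated once T is known.
template <typename T>
inline constexpr bool dependent_false_v = false;

template <typename T>
int element_size() {
    if constexpr (std::is_same_v<T, float>) {
        return 4;
    } else if constexpr (std::is_same_v<T, double>) {
        return 8;
    } else {
        // static_assert(false, ...) here could be rejected at parse time by
        // some compilers; the dependent condition defers the check.
        static_assert(dependent_false_v<T>, "unsupported type");
        return 0;
    }
}
```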

* Inline definitions

---------

Co-authored-by: Aman Gupta <redacted>
8 weeks agoRestore clip's cb() to its rightful glory - extract common debugging elements in...
Piotr Wilkin (ilintar) [Wed, 14 Jan 2026 19:29:35 +0000 (20:29 +0100)]
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)

* Extract common debugging functions; wire eval-callback and mtmd's MTMD_DEBUG_GRAPH to the same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>
* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>
* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
8 weeks agomodel : clean up and fix EXAONE-MoE configuration (#18840)
Junwon Hwang [Wed, 14 Jan 2026 18:38:21 +0000 (03:38 +0900)]
model : clean up and fix EXAONE-MoE configuration (#18840)

* Fix mismatch of EXAONE-MoE configuration

* ensure gating func is set, cleanup

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
8 weeks agorefactor : remove libcurl, use OpenSSL when available (#18828)
Adrien Gallouët [Wed, 14 Jan 2026 17:02:47 +0000 (18:02 +0100)]
refactor : remove libcurl, use OpenSSL when available (#18828)

8 weeks agovulkan: Check maxStorageBufferRange in supports_op (#18709)
Jeff Bolz [Wed, 14 Jan 2026 09:59:05 +0000 (03:59 -0600)]
vulkan: Check maxStorageBufferRange in supports_op (#18709)

* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled

8 weeks agollama-model: fix unfortunate typo (#18832)
Aman Gupta [Wed, 14 Jan 2026 09:55:15 +0000 (17:55 +0800)]
llama-model: fix unfortunate typo (#18832)

8 weeks agoCUDA : fix typo in clang pragma comment [no ci] (#18830)
Daniel Bevenius [Wed, 14 Jan 2026 09:31:49 +0000 (10:31 +0100)]
CUDA : fix typo in clang pragma comment [no ci] (#18830)

8 weeks agovulkan: work around Intel fp16 bug in mmq (#18814)
Ruben Ortlam [Wed, 14 Jan 2026 08:41:23 +0000 (09:41 +0100)]
vulkan: work around Intel fp16 bug in mmq (#18814)

8 weeks agoggml-metal: do not copy headers for embedded, use current binary dir for embedded...
Perry Naseck [Wed, 14 Jan 2026 07:22:25 +0000 (02:22 -0500)]
ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705)

8 weeks agommap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)
Daniel Benjaminsson [Wed, 14 Jan 2026 07:11:05 +0000 (08:11 +0100)]
mmap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)

Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.

Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
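
A hedged sketch of the platform guard described above (the actual llama.cpp code differs; the function name is illustrative): platforms without `RLIMIT_MEMLOCK` simply skip the resource-limit check.

```cpp
#include <cstddef>

#if !defined(__HAIKU__)
#include <sys/resource.h>
#endif

// Returns true if locking `need` bytes is permitted (or unlimited).
bool memlock_limit_ok(size_t need) {
#if defined(__HAIKU__)
    (void) need;
    return true; // Haiku has no RLIMIT_MEMLOCK to query; skip the check
#else
    rlimit lim{};
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false;
    }
    return lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur >= (rlim_t) need;
#endif
}
```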

8 weeks agoci, tests : use cmake to download models and remove libcurl dependency (#18791)
Adrien Gallouët [Wed, 14 Jan 2026 06:46:27 +0000 (07:46 +0100)]
ci, tests : use cmake to download models and remove libcurl dependency (#18791)

* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <redacted>
8 weeks agollama : print_info alignment fix (#18708)
ddh0 [Tue, 13 Jan 2026 23:05:11 +0000 (17:05 -0600)]
llama : print_info alignment fix (#18708)

* fix text spacing in print_info

* align all

8 weeks agomodel : add EXAONE MoE (#18543)
Junwon Hwang [Tue, 13 Jan 2026 22:28:38 +0000 (07:28 +0900)]
model : add EXAONE MoE (#18543)

* Add EXAONE MoE implementations

Co-authored-by: Junwon Hwang <redacted>
* Address PR feedback

* Address PR feedback

* [WIP] Add MTP for EXAONE-MoE

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

---------

Co-authored-by: LG-AI-EXAONE <redacted>
8 weeks agovocab : fix attribute overrides for harmony (#18806)
Georgi Gerganov [Tue, 13 Jan 2026 15:40:13 +0000 (17:40 +0200)]
vocab : fix attribute overrides for harmony (#18806)

* vocab : fix attribute overrides for harmony

* cont : add warning log

8 weeks agollama-mmap: fix direct-io loading fallback EOF exception (#18801)
Ruben Ortlam [Tue, 13 Jan 2026 14:57:07 +0000 (15:57 +0100)]
llama-mmap: fix direct-io loading fallback EOF exception (#18801)

8 weeks agomodel-conversion : remove -c 0 from model card template [no ci] (#18807)
Daniel Bevenius [Tue, 13 Jan 2026 13:13:10 +0000 (14:13 +0100)]
model-conversion : remove -c 0 from model card template [no ci] (#18807)

This commit removes the `-c, --ctx-size N` option from the llama-server
command in the model card template for causal models.

The motivation for this is that -c 0 is the default and specifying it
is redundant.

8 weeks agoHIP: add fattn-mma-f16 for RDNA4 (#18481)
yulo [Tue, 13 Jan 2026 12:52:16 +0000 (20:52 +0800)]
HIP: add fattn-mma-f16 for RDNA4 (#18481)

* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <redacted>
Co-authored-by: Johannes Gäßler <redacted>
8 weeks agodoc: ban AI-generated PR descriptions [no ci] (#18765)
Johannes Gäßler [Tue, 13 Jan 2026 12:43:12 +0000 (13:43 +0100)]
doc: ban AI-generated PR descriptions [no ci] (#18765)

8 weeks agomtmd: fix use_non_causal being reported incorrectly (#18793)
Xuan-Son Nguyen [Tue, 13 Jan 2026 11:19:38 +0000 (12:19 +0100)]
mtmd: fix use_non_causal being reported incorrectly (#18793)

* mtmd: fix use_non_causal being reported incorrectly

* move clip_is_mrope to mtmd_decode_use_mrope

* fix sloppy code ggml_cpy

8 weeks agoCUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800)
Georgi Gerganov [Tue, 13 Jan 2026 10:25:53 +0000 (12:25 +0200)]
CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800)

8 weeks agograph : clean up t5 input builders (#18795)
Gabe Goodhart [Tue, 13 Jan 2026 08:43:51 +0000 (01:43 -0700)]
graph : clean up t5 input builders (#18795)

* fix: Remove unnecessary `h` loops where `h` was only ever 0

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove unnecessary padding loop that is never hit anymore

The upper bound used to be GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but the
padding was removed in https://github.com/ggml-org/llama.cpp/pull/17910,
leaving the loop dead.

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>