git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

commit | commitdiff | tree

Aaron Teo [Thu, 22 Jan 2026 13:38:02 +0000 (21:38 +0800)]

release: update github api (#19022)

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 22 Jan 2026 13:36:32 +0000 (14:36 +0100)]

mtmd : update docs to use llama_model_n_embd_inp (#18999)

commit | commitdiff | tree

손희준 [Thu, 22 Jan 2026 13:36:04 +0000 (22:36 +0900)]

server: Reorder methods in `server-task.cpp` (#19016)

* Move `task_result_state::update_chat_msg` to match with header

* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header

---------

Co-authored-by: openingnow <>

commit | commitdiff | tree

Aman Gupta [Thu, 22 Jan 2026 10:51:53 +0000 (18:51 +0800)]

CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953)

commit | commitdiff | tree

shaofeiqi [Thu, 22 Jan 2026 06:05:54 +0000 (22:05 -0800)]

opencl: add TRI op support (#18979)

commit | commitdiff | tree

Aleksei Nikiforov [Thu, 22 Jan 2026 00:16:21 +0000 (01:16 +0100)]

ggml-zdnn : mark zDNN buffers as non-host (#18967)

While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848

commit | commitdiff | tree

Pádraic Slattery [Wed, 21 Jan 2026 23:57:18 +0000 (00:57 +0100)]

ci : update GitHub Actions versions [no ci] (#18935)

commit | commitdiff | tree

Mariusz Woloszyn [Wed, 21 Jan 2026 23:55:55 +0000 (00:55 +0100)]

convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972)

* Add Ministral3ForCausalLM architecture

This adds support for newer architectures like Devstral-2

* removed blank line found after function decorator

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 18:24:37 +0000 (19:24 +0100)]

jinja: support none|string (#18995)

* jinja: support none|string

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Add as_string()

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Hendrik Erz [Wed, 21 Jan 2026 17:46:01 +0000 (18:46 +0100)]

fix: Use `tabular-nums` for chat message statistics (#18915)

* fix: Use `tabular-nums` for chat message statistics

* fix: Rebuild WebUI

commit | commitdiff | tree

Daniel Bevenius [Wed, 21 Jan 2026 17:31:34 +0000 (18:31 +0100)]

llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997)

This commit removes the mention of RoPE in the comment for the Q and K
computation as RoPE is not applied.

commit | commitdiff | tree

Jeff Bolz [Wed, 21 Jan 2026 17:01:40 +0000 (11:01 -0600)]

vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)

* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger

commit | commitdiff | tree

Adrien Gallouët [Wed, 21 Jan 2026 16:58:38 +0000 (17:58 +0100)]

common : improve error message when HTTPS is missing but required (#18987)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

손희준 [Wed, 21 Jan 2026 16:47:23 +0000 (01:47 +0900)]

server: /v1/responses (partial) (#18486)

* from previous PR

* Make instruction(system) as first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>

commit | commitdiff | tree

Jeff Bolz [Wed, 21 Jan 2026 16:43:43 +0000 (10:43 -0600)]

vulkan: support flash attention GQA/split_k with small batches (#18938)

commit | commitdiff | tree

Masato Nakasaka [Wed, 21 Jan 2026 16:13:43 +0000 (01:13 +0900)]

Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)

This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.

commit | commitdiff | tree

Jeff Bolz [Wed, 21 Jan 2026 15:22:02 +0000 (09:22 -0600)]

vulkan: Use mul_mat_vec_id for small values of n (#18918)

Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
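
A rough CPU-side sketch of the idea described above: looping a matrix-vector product over the batch dimension, with one expert id per input vector, so the cost grows linearly in n. This is not the Vulkan shader code. The function name `mul_mat_vec_id`, the row-major layout, and the one-id-per-vector setup are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration, not ggml_vk_mul_mat_vec_id_q_f16.
// experts[e] is a rows x cols row-major weight matrix.
// x holds n input vectors of length cols; ids[i] selects the expert for vector i.
std::vector<float> mul_mat_vec_id(const std::vector<std::vector<float>> & experts,
                                  std::size_t rows, std::size_t cols,
                                  const std::vector<float> & x,
                                  const std::vector<int> & ids) {
    const std::size_t n = ids.size();
    assert(x.size() == n * cols);

    std::vector<float> y(n * rows, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {  // one mat-vec per batch entry
        assert(ids[i] >= 0 && static_cast<std::size_t>(ids[i]) < experts.size());
        const std::vector<float> & w = experts[ids[i]];
        assert(w.size() == rows * cols);

        for (std::size_t r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < cols; ++c) {
                acc += w[r * cols + c] * x[i * cols + c];
            }
            y[i * rows + r] = acc;
        }
    }
    return y;
}
```

Each batch entry may hit a different expert, so the weights are not reused across entries the way they are in a plain mat-mat product; the work is simply n independent mat-vec passes.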

commit | commitdiff | tree

Tarek Dakhran [Wed, 21 Jan 2026 12:30:23 +0000 (13:30 +0100)]

memory : add llama_memory_hybrid_iswa (#18601)

* memory : add llama_memory_hybrid_iswa

* Update src/llama-memory-hybrid-iswa.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 11:35:20 +0000 (12:35 +0100)]

Fix GLM 4.7 Lite MoE gating func (#18980)

* Fix GLM 4.7 MoE gating func

* Update src/models/deepseek2.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>

commit | commitdiff | tree

Matthieu Coudron [Wed, 21 Jan 2026 06:52:46 +0000 (07:52 +0100)]

gguf: display strerrno when can't load a model (#18884)

I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'

and I was sure it could access the file. Seems like --models-dir and
--models-presets don't interact like I thought they would, but I salvaged
this snippet that helps with troubleshooting:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
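
A minimal sketch of the technique, assuming a plain `fopen` call: capture `errno` right after the failed open and append its `strerror` text to the message. The helper name `describe_open_failure` is hypothetical and this is not the gguf code itself; the message layout just mirrors the log line quoted above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: returns an empty string if the file opens,
// otherwise an error message that includes the errno description.
std::string describe_open_failure(const char * fname) {
    assert(fname != nullptr);

    errno = 0;
    std::FILE * f = std::fopen(fname, "rb");
    if (f != nullptr) {
        std::fclose(f);
        return "";
    }

    const int err = errno;  // save immediately, later calls may overwrite errno
    return std::string("failed to open GGUF file '") + fname +
           "' (errno " + std::strerror(err) + ")";
}
```

The exact `strerror` wording is platform-dependent, so scripts should not match on it.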

commit | commitdiff | tree

Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]

CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)

* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request

commit | commitdiff | tree

Adrien Gallouët [Tue, 20 Jan 2026 17:28:43 +0000 (18:28 +0100)]

common, server : use the same User-Agent by default (#18957)

This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Xuan-Son Nguyen [Tue, 20 Jan 2026 17:23:25 +0000 (18:23 +0100)]

cli : fix reasoning responses in CLI (#18961)

* cli : fix reasoning responses in CLI

* fix build

* fix build (2)

commit | commitdiff | tree

Oliver Simons [Tue, 20 Jan 2026 12:11:01 +0000 (13:11 +0100)]

CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)

* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu

commit | commitdiff | tree

Adrien Gallouët [Tue, 20 Jan 2026 10:42:49 +0000 (11:42 +0100)]

ggml : cleanup path_str() (#18928)

- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 20 Jan 2026 10:21:28 +0000 (12:21 +0200)]

metal : enable FA for MLA heads (#18950)

commit | commitdiff | tree

Daniel Bevenius [Tue, 20 Jan 2026 05:55:24 +0000 (06:55 +0100)]

convert : use n_groups instead of hardcoded values in reshape (#18929)

* convert : use n_groups instead of hardcoded values in reshape

This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allow Python to calculate the
last dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.

* use self.n_group instead of self.hparams["n_groups"]
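
For reference, the arithmetic behind a -1 dimension in a reshape is a single division. The sketch below is written in C++ rather than in the conversion script's Python, and `infer_last_dim` is a hypothetical name, not something from the script.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the size that a "-1" dimension takes when a tensor
// with n_elements entries is reshaped to (n_groups, -1).
int64_t infer_last_dim(int64_t n_elements, int64_t n_groups) {
    // The reshape is only valid if the elements divide evenly into groups.
    assert(n_groups > 0 && n_elements % n_groups == 0);
    return n_elements / n_groups;
}
```

Deriving the last dimension this way avoids hardcoding a value that only holds for one model size.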

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 19 Jan 2026 22:28:01 +0000 (23:28 +0100)]

server : refactor oai_parser_opt, move it to server_chat_params (#18937)

* server_chat_params

* move chat format into CLI

* use meta whenever possible

* clean up, no more chatml fallback

commit | commitdiff | tree

ddh0 [Mon, 19 Jan 2026 22:09:20 +0000 (16:09 -0600)]

convert : support Glm4MoeLite (#18936)

* initial commit for branch

* add glm-4.7-flash, move tokenizer hash

* use `glm4` pretok

* silence flake8 E302 (CI)

* apply review feedback

* add <|user|> as eog

* also add EOG `<|observation|>`

* revert llama-vocab

* inherit vocab from glm4

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:43 +0000 (20:29 +0100)]

jinja : fix undefined keys and attributes and int/float as bool (#18924)

* fix undefined keys and attributes

* add falsy tests

* as_bool for integers and floats

* more falsy/truthy tests

* --typo

commit | commitdiff | tree

Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:15 +0000 (20:29 +0100)]

ci : run test-jinja -py on high perf [no ci] (#18916)

commit | commitdiff | tree

Lennart Austenfeld [Mon, 19 Jan 2026 18:13:31 +0000 (19:13 +0100)]

server: fix memory reservations in populate_token_probs (#18787)

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Jan 2026 18:03:19 +0000 (20:03 +0200)]

ggml : add ggml_build_forward_select (#18550)

* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment

commit | commitdiff | tree

Daniel Bevenius [Mon, 19 Jan 2026 12:12:38 +0000 (13:12 +0100)]

model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)

This commit adds a BUILD_DIR variable to the scripts used for running
converted models.

The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.

commit | commitdiff | tree

Julius Tischbein [Sun, 18 Jan 2026 16:35:57 +0000 (17:35 +0100)]

llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887)

commit | commitdiff | tree

Francisco Herrera [Sun, 18 Jan 2026 10:03:35 +0000 (05:03 -0500)]

docs: add linux to index (#18907)

commit | commitdiff | tree

Xuan-Son Nguyen [Sun, 18 Jan 2026 07:14:27 +0000 (08:14 +0100)]

tests : add test-jinja -py option for cross-checking (#18906)

* tests : add test-jinja -py option for cross-checking

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix + add source

* SandboxedEnvironment

* fix array.map case

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 18 Jan 2026 02:40:06 +0000 (03:40 +0100)]

jinja : fix object item order (and properly implement dictsort) (#18904)

* fix object item order

* as_ordered_object

* copy whole object

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 18 Jan 2026 01:53:01 +0000 (02:53 +0100)]

jinja : attribute support for join, map and sort (#18883)

* support negative array index and default value

* attribute support (int and str) for join, map and sort

* add tests

* update CODEOWNERS

* improve fixme sorting comment

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 18 Jan 2026 00:05:09 +0000 (01:05 +0100)]

jinja : add missing tojson filter for bool (#18900)

* add missing tojson for bool

* add more literal tests

commit | commitdiff | tree

Sigbjørn Skjæret [Sat, 17 Jan 2026 23:57:51 +0000 (00:57 +0100)]

jinja : fix lexing of float literals with sign (#18901)

* fix lexing of float literals with sign

* add test

* consume_numeric

commit | commitdiff | tree

Xuan-Son Nguyen [Sat, 17 Jan 2026 23:48:55 +0000 (00:48 +0100)]

jinja: correct member access rule (#18905)

commit | commitdiff | tree

lhez [Sat, 17 Jan 2026 21:50:32 +0000 (13:50 -0800)]

opencl: fix q6_K mv for m=1 (#18893)

commit | commitdiff | tree

Sigbjørn Skjæret [Sat, 17 Jan 2026 20:52:02 +0000 (21:52 +0100)]

ci : add label for jinja changes (#18903)

commit | commitdiff | tree

Georgi Gerganov [Sat, 17 Jan 2026 13:42:42 +0000 (15:42 +0200)]

kv-cache : optimize KQ mask construction (#18842)

* kv-cache : optimize KQ mask construction

* cont : add explanation + improve

* cont : fix

commit | commitdiff | tree

Reese Levine [Sat, 17 Jan 2026 00:12:43 +0000 (16:12 -0800)]

ggml webgpu: support for backend sampling (#18880)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Add argmax

* Add argmax,cumsum,sum,sum_rows

* Add necessary CPY/GET_ROWS operators

* Support for argsort using multi-pass strategy

* Update set_rows for i32 indices, move to pre-wgsl

* Port unary operators to pre-wgsl and support FILL

* Implement PAD

* Add support for top-k

* clean up, scope pipeline init mutex

* fix newline

* Add support for log

* Update LOG for better precision, and ops doc

---------

Co-authored-by: Abhijit Ramesh <redacted>
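
On the SOFTPLUS overflow point above: f16 tops out at 65504, so exp(x) no longer fits once x exceeds roughly 11, which is why the intermediates are kept in f32. The sketch below shows one common overflow-safe formulation in plain C++. It is an illustration only: the function name is hypothetical, and whether the WebGPU shader or the Vulkan backend uses this exact rewrite is an assumption.

```cpp
#include <cmath>

// Sketch: softplus(x) = log(1 + exp(x)), evaluated in f32.
// Rewritten as max(x, 0) + log1p(exp(-|x|)) so that exp() only ever
// receives a non-positive argument and cannot overflow.
float softplus_f32(float x) {
    return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}
```

For large positive x the second term vanishes and the result approaches x; for large negative x it approaches exp(x).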

commit | commitdiff | tree

Thore Koritzius [Fri, 16 Jan 2026 14:59:56 +0000 (15:59 +0100)]

ggml : extend ggml_pool_1d + metal (#16429)

* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

hipudding [Fri, 16 Jan 2026 12:32:17 +0000 (20:32 +0800)]

docs : update ops.md for CANN backend (#18654)

commit | commitdiff | tree

Perry Naseck [Fri, 16 Jan 2026 11:38:25 +0000 (06:38 -0500)]

ggml-blas: hide warnings from included BLAS headers (#18818)

* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers

commit | commitdiff | tree

Tarek Dakhran [Fri, 16 Jan 2026 10:23:08 +0000 (11:23 +0100)]

mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876)

commit | commitdiff | tree

Xuan-Son Nguyen [Fri, 16 Jan 2026 10:22:06 +0000 (11:22 +0100)]

common : implement new jinja template engine (#18462)

* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support bound functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inference

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null --> none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* implement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Julius Tischbein [Fri, 16 Jan 2026 08:46:51 +0000 (09:46 +0100)]

Setting mmap and direct_io to false as default in llama-bench.cpp (#18841)

commit | commitdiff | tree

Raul Torres [Fri, 16 Jan 2026 08:34:09 +0000 (08:34 +0000)]

CANN: Remove unused `ggml_cann_get_device` function (#18625)

commit | commitdiff | tree

Chenguang Li [Fri, 16 Jan 2026 08:24:04 +0000 (16:24 +0800)]

CANN: fix an issue where get_env was not fully renamed (#18796)

* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph

commit | commitdiff | tree

hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]

CANN: support gated linear attn (#18653)

* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <redacted>
Co-authored-by: hipudding <redacted>
* CANN: optimize OP gla

Optimize gla for high performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <redacted>
Co-authored-by: YushengZhao <redacted>

commit | commitdiff | tree

shaofeiqi [Thu, 15 Jan 2026 19:17:17 +0000 (11:17 -0800)]

OpenCL: add SOLVE_TRI op support (#18846)

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Jan 2026 18:53:01 +0000 (20:53 +0200)]

cuda : print less debug logs when disabling cuda graphs (#18868)

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Jan 2026 17:35:57 +0000 (19:35 +0200)]

context : do not reserve scheduler for warmups (#18867)

commit | commitdiff | tree

ddh0 [Thu, 15 Jan 2026 17:16:29 +0000 (11:16 -0600)]

llama : add adaptive-p sampler (#17927)

* initial commit for branch

* simplify constants

* add params to `struct common_params_sampling`, add reference to PR

* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`

* add args, rename `queue_size` -> `window_size`

* improved comments

* minor

* remove old unused code from algorithm

* minor

* add power law case to `common_sampler_init`, add sampler name mappings

* clarify behaviour when `window_size = 0`

* add missing enums

* remove `target_range` param, make `target == 1` no-op, cleanup code

* oops, straggler

* add missing parameters in `server-task.cpp`

* copy from author

ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069

* remove old debug log, style nit

* fix compiler warning, add commented-out logging per token

* re-write + change parameters + simplify

* oops forgot args.cpp

* fix leftover `window_size`

* add missing values to `common_params_sampling::print()`

* with logging

* does this fix it?

* no, but does this?

* update default decay

* optimize

* fix bad merge

my git skills are lacking

* silence `missing initializer for member`

* update default decay to 0.9

* fix logging

* format (double)

* add power law to the new `samplers` vector

* log sampler init values

* improve logging messages in llama_sampler_power_law

* remove extraneous logging

* simplify target computation

last commit with debug logging!

* remove debug logging, explicitly clamp params at init

* add `use_power_law` flag + logic, minor cleanup

* update `power-law` -> `adaptive-p`

* fix cold start EMA

- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`

this fixes a "cold start" problem with the moving average

* update `SHARPNESS` constant to `10.0f`

* minor style fixes

no functional changes

* minor style fixes cont.

* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)

* separate into `apply` + `accept` functions

* `pending_token_idx`: switch from `llama_token` to `int32`

functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now

* don't transform logits <= -1e9f

* fix masking in backend top-p, min-p

* address review comments

* typo in comments `RND` -> `RNG`

* add docs

* add recommended values in completion docs

* address PR feedback

* remove trailing whitespace (for CI `editorconfig`)

* add adaptive-p to `common_sampler_types_from_chars`
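
On the "fix cold start EMA" item above: the two initial values are the steady-state sums of a geometric series with ratio `decay`, so the average starts out exactly at `target` instead of being dominated by the first few samples. The sketch below illustrates this. Only the two initial values come from the commit message; the struct, the function names, and the update rule are assumptions about how such a decayed average is typically written.

```cpp
#include <cassert>

// Hypothetical state, not the sampler's real context struct.
struct ema_state {
    float weighted_sum;
    float total_weight;
};

// Initial values as given in the commit message (decay assumed clamped below 1).
ema_state ema_init(float target, float decay) {
    assert(decay >= 0.0f && decay < 1.0f);
    return { target / (1.0f - decay), 1.0f / (1.0f - decay) };
}

// Assumed update rule: decay the history, then add the new sample with weight 1.
void ema_update(ema_state & s, float value, float decay) {
    s.weighted_sum = value + decay * s.weighted_sum;
    s.total_weight = 1.0f  + decay * s.total_weight;
}

float ema_mean(const ema_state & s) {
    return s.weighted_sum / s.total_weight;
}
```

With `decay = 0.9` and `target = 0.5`, the state starts at a mean of 0.5 with a total weight of 10. Under the assumed update rule, one sample of 1.0 moves the mean to about 0.55 and the total weight stays at 10, since `1 / (1 - decay)` is the fixed point of `w -> 1 + decay * w`.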

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 15 Jan 2026 16:10:28 +0000 (17:10 +0100)]

server: improve slots scheduling for n_cmpl (#18789)

* server : make sure children tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Jan 2026 14:39:17 +0000 (16:39 +0200)]

context : reserve new scheduler when graph topology changes (#18547)

* context : reserve new scheduler when graph topology changes

* cont : fix

* cont : fix reserve

* cont : reserve only when changes occur + timing

* context : add comments

* llama : reserve on sampler changes

* common : allow null common_sampler

* server : task declares needs (embd, logits, sampling)

* server : do not init sampler if not needed

* llama : fix need_reserve when unsetting a sampler

* server : consolidate slot reset/clear logic

commit | commitdiff | tree

Johannes Gäßler [Thu, 15 Jan 2026 14:14:50 +0000 (15:14 +0100)]

CUDA: fix alignment on register spill for FA (#18815)

commit | commitdiff | tree

shalinib-ibm [Thu, 15 Jan 2026 09:31:18 +0000 (15:01 +0530)]

ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837)

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 15 Jan 2026 09:24:28 +0000 (10:24 +0100)]

lora: make sure model keep track of associated adapters (#18490)

* lora: make sure model keep track of associated adapters

* deprecate llama_adapter_lora_free

* minor : std::unordered_set over std::set

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Thu, 15 Jan 2026 09:12:46 +0000 (10:12 +0100)]

model-loader : support bool array sliding window pattern (#18850)

commit | commitdiff | tree

Adrien Gallouët [Thu, 15 Jan 2026 08:47:29 +0000 (09:47 +0100)]

tests : download models only when running ctest (#18843)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Max Krasnyansky [Thu, 15 Jan 2026 05:46:12 +0000 (21:46 -0800)]

hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)

* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factor out all hvx_copy functions into hvx-copy.h header and reduce code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.

Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param

commit | commitdiff | tree

Oliver Simons [Thu, 15 Jan 2026 02:44:54 +0000 (03:44 +0100)]

CUDA: Factor out and re-use `block_reduce` function (#18785)

* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <redacted>
* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <redacted>
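
A minimal host-side sketch of two points from the commit above: selecting a reduction with an `enum class` template parameter, and making the `static_assert` condition type-dependent so it is only evaluated on instantiation. All names here (`reduce_method`, `dependent_false_v`, `host_reduce`) are hypothetical and are not taken from `ggml/src/ggml-cuda/common.cuh`. The real `block_reduce` runs on the device, while this sketch reduces a `std::vector` on the CPU. It assumes C++17.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical stand-in for block_reduce_method. A scoped enum does not
// convert implicitly to int, so passing a plain integer is a compile error.
enum class reduce_method { sum, max };

// Always false, but dependent on the template parameter, so the compiler
// cannot evaluate it while merely parsing the template.
template <reduce_method M>
inline constexpr bool dependent_false_v = false;

// Host-side illustration only. Precondition for max: v is not empty.
template <reduce_method M>
float host_reduce(const std::vector<float> & v) {
    if constexpr (M == reduce_method::sum) {
        return std::accumulate(v.begin(), v.end(), 0.0f);
    } else if constexpr (M == reduce_method::max) {
        return *std::max_element(v.begin(), v.end());
    } else {
        // Fires only if the template is instantiated with an unhandled value.
        static_assert(dependent_false_v<M>, "unsupported reduce_method");
    }
}
```

Calling `host_reduce<reduce_method::sum>(values)` selects the branch at compile time. Writing `static_assert(false, ...)` in the last branch instead could be rejected by some compilers even when that branch is never instantiated, which is the kind of build error the commit describes.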

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Wed, 14 Jan 2026 19:29:35 +0000 (20:29 +0100)]

Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)

* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>
* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <redacted>
* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>

commit | commitdiff | tree

Junwon Hwang [Wed, 14 Jan 2026 18:38:21 +0000 (03:38 +0900)]

model : clean up and fix EXAONE-MoE configuration (#18840)

* Fix mismatch of EXAONE-MoE configuration

* ensure gating func is set, cleanup

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Adrien Gallouët [Wed, 14 Jan 2026 17:02:47 +0000 (18:02 +0100)]

refactor : remove libcurl, use OpenSSL when available (#18828)

commit | commitdiff | tree

Jeff Bolz [Wed, 14 Jan 2026 09:59:05 +0000 (03:59 -0600)]

vulkan: Check maxStorageBufferRange in supports_op (#18709)

* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled

commit | commitdiff | tree

Aman Gupta [Wed, 14 Jan 2026 09:55:15 +0000 (17:55 +0800)]

llama-model: fix unfortunate typo (#18832)

commit | commitdiff | tree

Daniel Bevenius [Wed, 14 Jan 2026 09:31:49 +0000 (10:31 +0100)]

CUDA : fix typo in clang pragma comment [no ci] (#18830)

commit | commitdiff | tree

Ruben Ortlam [Wed, 14 Jan 2026 08:41:23 +0000 (09:41 +0100)]

vulkan: work around Intel fp16 bug in mmq (#18814)

commit | commitdiff | tree

Perry Naseck [Wed, 14 Jan 2026 07:22:25 +0000 (02:22 -0500)]

ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705)

commit | commitdiff | tree

Daniel Benjaminsson [Wed, 14 Jan 2026 07:11:05 +0000 (08:11 +0100)]

mmap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)

Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.

Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
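
As a rough illustration of the approach described above, the resource limit query can be compiled out on platforms that lack `RLIMIT_MEMLOCK`. This is a sketch for POSIX systems, not the code from the commit: the function name `exceeds_memlock_limit` is hypothetical, and it assumes the `__HAIKU__` macro is predefined by the Haiku toolchain.

```cpp
#include <cstddef>
#include <sys/resource.h>

// Hypothetical helper: returns true when locking `size` bytes would exceed
// the soft RLIMIT_MEMLOCK limit. On platforms without that limit the check
// is skipped and the function reports "no limit exceeded".
bool exceeds_memlock_limit(size_t size) {
#if defined(__HAIKU__) || !defined(RLIMIT_MEMLOCK)
    (void) size; // nothing to query on this platform
    return false;
#else
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false; // limit unknown, do not block the caller
    }
    return lim.rlim_cur != RLIM_INFINITY && size > lim.rlim_cur;
#endif
}
```

Guarding the whole query rather than only the constant keeps the translation unit free of references to `RLIMIT_MEMLOCK` where it is not declared.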

commit | commitdiff | tree

Adrien Gallouët [Wed, 14 Jan 2026 06:46:27 +0000 (07:46 +0100)]

ci, tests : use cmake to download models and remove libcurl dependency (#18791)

* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

ddh0 [Tue, 13 Jan 2026 23:05:11 +0000 (17:05 -0600)]

llama : print_info alignment fix (#18708)

* fix text spacing in print_info

* align all

commit | commitdiff | tree

Junwon Hwang [Tue, 13 Jan 2026 22:28:38 +0000 (07:28 +0900)]

model : add EXAONE MoE (#18543)

* Add EXAONE MoE implementations

Co-authored-by: Junwon Hwang <redacted>
* Address PR feedback

* Address PR feedback

* [WIP] Add MTP for EXAONE-MoE

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

---------

Co-authored-by: LG-AI-EXAONE <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Jan 2026 15:40:13 +0000 (17:40 +0200)]

vocab : fix attribute overrides for harmony (#18806)

* vocab : fix attribute overrides for harmony

* cont : add warning log

commit | commitdiff | tree

Ruben Ortlam [Tue, 13 Jan 2026 14:57:07 +0000 (15:57 +0100)]

llama-mmap: fix direct-io loading fallback EOF exception (#18801)

commit | commitdiff | tree

Daniel Bevenius [Tue, 13 Jan 2026 13:13:10 +0000 (14:13 +0100)]

model-conversion : remove -c 0 from model card template [no ci] (#18807)

This commit removes the `-c, --ctx-size N` option from the llama-server
command in the model card template for causal models.

The motivation for this is that -c 0 is the default and specifying it
is redundant.

commit | commitdiff | tree

yulo [Tue, 13 Jan 2026 12:52:16 +0000 (20:52 +0800)]

HIP: add fattn-mma-f16 for RDNA4 (#18481)

* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update per the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <redacted>
Co-authored-by: Johannes Gäßler <redacted>

commit | commitdiff | tree

Johannes Gäßler [Tue, 13 Jan 2026 12:43:12 +0000 (13:43 +0100)]

doc: ban AI-generated PR descriptions [no ci] (#18765)

commit | commitdiff | tree

Xuan-Son Nguyen [Tue, 13 Jan 2026 11:19:38 +0000 (12:19 +0100)]

mtmd: fix use_non_causal being reported incorrectly (#18793)

* mtmd: fix use_non_causal being reported incorrectly

* move clip_is_mrope to mtmd_decode_use_mrope

* fix sloppy code ggml_cpy

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Jan 2026 10:25:53 +0000 (12:25 +0200)]

CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800)

commit | commitdiff | tree

Gabe Goodhart [Tue, 13 Jan 2026 08:43:51 +0000 (01:43 -0700)]

graph : clean up t5 input builders (#18795)

* fix: Remove unnecessary `h` loops where `h` was only ever 0

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove unnecessary padding loop that is never hit anymore

The upper bound used to be GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but that padding
was removed in https://github.com/ggml-org/llama.cpp/pull/17910 leaving the
loop dead.

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
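
To see why the padding loop mentioned above became dead, the sketch below counts the iterations of a loop running from `n_tokens` up to an upper bound, with and without rounding that bound up. `pad_to` uses the same round-up arithmetic as ggml's `GGML_PAD` macro for power-of-two `n`. The function names and the pad value 64 are assumptions chosen for illustration.

```cpp
#include <cstdint>

// Round x up to the next multiple of n (n must be a power of two).
constexpr int64_t pad_to(int64_t x, int64_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Number of iterations of a loop of the form
//   for (i = n_tokens; i < upper; ++i) { /* fill padding */ }
constexpr int64_t padding_iterations(int64_t n_tokens, int64_t upper) {
    int64_t count = 0;
    for (int64_t i = n_tokens; i < upper; ++i) {
        ++count;
    }
    return count;
}

// With a padded upper bound the loop has work to do ...
static_assert(padding_iterations(5, pad_to(5, 64)) == 59, "padded bound");
// ... but once the bound is n_tokens itself, it never executes.
static_assert(padding_iterations(5, 5) == 0, "unpadded bound");
```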

commit | commitdiff | tree

Ruben Ortlam [Tue, 13 Jan 2026 07:49:10 +0000 (08:49 +0100)]

llama-bench: add direct_io parameter (#18778)

commit | commitdiff | tree

Adrien Gallouët [Mon, 12 Jan 2026 20:43:02 +0000 (21:43 +0100)]

ci : remove libcurl in releases (#18775)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Radoslav Gerganov [Mon, 12 Jan 2026 17:21:34 +0000 (19:21 +0200)]

server : add arg for disabling prompt caching (#18776)

* server : add arg for disabling prompt caching

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.

* address review comments

* address review comments

commit | commitdiff | tree

Adrien Gallouët [Mon, 12 Jan 2026 16:29:00 +0000 (17:29 +0100)]

ci : use openssl for openEuler-latest-cmake-cann (#18779)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Adrien Gallouët [Mon, 12 Jan 2026 14:58:52 +0000 (15:58 +0100)]

vendor : update cpp-httplib to 0.30.1 (#18771)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Daniel Bevenius [Mon, 12 Jan 2026 12:47:58 +0000 (13:47 +0100)]

examples : add --kv-unified to batched example (#18774)

This commit adds the --kv-unified flag to the batched example. This flag
is currently specified in the README.md as required, but is currently
not available as a command line option for the batched example.

The motivation for this is that specifying this flag as the README
instructs leads to an error about the flag not being recognized,
and without this option the example fails with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```

commit | commitdiff | tree

Jeff Bolz [Mon, 12 Jan 2026 12:32:55 +0000 (06:32 -0600)]

vulkan: change memory_logger to be controlled by an env var (#18769)

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 12 Jan 2026 12:01:24 +0000 (13:01 +0100)]

server: update docs for sleeping [no ci] (#18777)

commit | commitdiff | tree

Jeff Bolz [Mon, 12 Jan 2026 11:32:13 +0000 (05:32 -0600)]

vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)

This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
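
The arithmetic behind the message above can be checked with a few constants. This sketch assumes the usual Q8_0 layout of 32 elements per block stored in 34 bytes (a 2-byte scale plus 32 int8 values). The helper `needs_64bit_indexing` is a hypothetical name that restates the selection rule from the last bullet and is not the actual Vulkan backend code.

```cpp
#include <cstdint>

constexpr uint64_t QUANT_K          = 32; // elements per Q8_0 block (assumed)
constexpr uint64_t q8_0_block_bytes = 34; // bytes per Q8_0 block (assumed)

// A matrix from the commit message: 8192 x 5120 x 128.
constexpr uint64_t n_elements = 8192ull * 5120ull * 128ull;
constexpr uint64_t n_blocks   = n_elements / QUANT_K;
constexpr uint64_t n_bytes    = n_blocks * q8_0_block_bytes;
constexpr uint64_t u32_range  = 1ull << 32;

// Element indices overflow 32 bits, block indices do not. This is why
// dividing batch_stride by QUANT_K early keeps the block index in 32 bits.
static_assert(n_elements > u32_range, "element count exceeds 2^32");
static_assert(n_blocks   < u32_range, "block count fits in 32 bits");

// Hypothetical restatement of the rule: use the 64-bit indexing variant
// only when the tensor is larger than the device's maxStorageBufferRange.
constexpr bool needs_64bit_indexing(uint64_t tensor_bytes,
                                    uint64_t max_storage_buffer_range) {
    return tensor_bytes > max_storage_buffer_range;
}
```

Since `maxStorageBufferRange` is a 32-bit field, this tensor exceeds it on any device. Smaller tensors keep the 32-bit path and avoid the 3-5% cost mentioned above.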

commit | commitdiff | tree

Ruben Ortlam [Mon, 12 Jan 2026 06:29:35 +0000 (07:29 +0100)]

vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)

* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size

commit | commitdiff | tree

Xuan-Son Nguyen [Sun, 11 Jan 2026 20:00:10 +0000 (21:00 +0100)]

model: fix qwen3next broken due to #18683 (#18762)

commit | commitdiff | tree

Ruben Ortlam [Sun, 11 Jan 2026 16:33:33 +0000 (17:33 +0100)]

Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)

* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l

Packaging of ggml-org/llama.cpp
