git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Johannes Gäßler [Mon, 26 Jan 2026 22:24:58 +0000 (23:24 +0100)]
CUDA: fix padding of GQA to power of 2 in FA (#19115)
Georgi Gerganov [Mon, 26 Jan 2026 18:18:34 +0000 (20:18 +0200)]
graph : fix nkvo offload with FA (#19105)
Sigbjørn Skjæret [Mon, 26 Jan 2026 14:22:49 +0000 (15:22 +0100)]
ci : use new 1vCPU runner for lightweight jobs (#19107)
* use new 1vCPU runner for lightweight jobs
* pyright is too heavy, look into ty some day
use new pip-install input
Georgi Gerganov [Mon, 26 Jan 2026 09:24:30 +0000 (11:24 +0200)]
model : add correct type for GLM 4.7 Flash (#19106)
Johannes Gäßler [Sun, 25 Jan 2026 20:19:47 +0000 (21:19 +0100)]
CUDA: faster FA for GQA > 1 but not power of 2 (#19092)
ccbinn [Sun, 25 Jan 2026 18:07:19 +0000 (02:07 +0800)]
metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088)
Co-authored-by: chenbin11 <redacted>
Sigbjørn Skjæret [Sun, 25 Jan 2026 17:03:34 +0000 (18:03 +0100)]
convert : yield Gemma3N custom_map tensors directly (#19091)
Aman Gupta [Sun, 25 Jan 2026 15:25:58 +0000 (23:25 +0800)]
ggml-cpu: Use tiled FA for prompt-processing (#19012)
* ggml-cpu: Use tiled FA for prompt-processing
The FA performance is poor on CPU at long contexts because it essentially uses a vector kernel. This PR adds a tiled FA path for prompt processing. Perf tuning for tile sizes was done on an AMD EPYC single-socket 64-core machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
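The tiled approach above hinges on the online softmax update: each key/value tile contributes to a running maximum, a running normalizer, and an output accumulator, and earlier partial results are rescaled whenever a new tile raises the maximum. A minimal Python sketch of the idea (illustrative names, not the actual ggml-cpu kernel):

```python
import math

def attention_tiled(q, K, V, tile=2):
    # q: one query row; K, V: lists of key/value rows.
    # Online softmax state: running max m, normalizer l, output accumulator acc.
    m, l = -math.inf, 0.0
    acc = [0.0] * len(V[0])
    for t0 in range(0, len(K), tile):                # process keys tile by tile
        ks, vs = K[t0:t0 + tile], V[t0:t0 + tile]
        s = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]  # tile scores
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)                  # rescale previous partials
        l = l * scale + sum(math.exp(x - m_new) for x in s)
        acc = [a * scale for a in acc]
        for x, v in zip(s, vs):
            w = math.exp(x - m_new)
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The `exp(m - m_new)` rescale is what lets already-processed tiles stay valid when a later tile contains a larger score, so only one pass over the keys is needed.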
Georgi Gerganov [Sun, 25 Jan 2026 13:48:56 +0000 (15:48 +0200)]
kv-cache : support V-less cache (#19067)
* kv-cache : support V-less cache
* cuda : better check for V_is_K_view
* cuda : improve V_is_K_view check
* graph : add comments
* hparams : refactor
Sigbjørn Skjæret [Sun, 25 Jan 2026 12:05:05 +0000 (13:05 +0100)]
convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084)
* fix Gemma3N and Ernie4.5Moe
* fix GraniteMoe
Georgi Gerganov [Sun, 25 Jan 2026 07:12:50 +0000 (09:12 +0200)]
completion : fix prompt cache for recurrent models (#19045)
Molly Sophia [Sun, 25 Jan 2026 07:11:19 +0000 (15:11 +0800)]
readme: update RWKV7 model links (#19061)
Signed-off-by: Molly Sophia <redacted>
Jakkala Mahesh [Sun, 25 Jan 2026 07:10:52 +0000 (12:40 +0530)]
llama: fix integer type consistency in split helpers (#18894)
* llama: fix integer type consistency in split helpers
* llama: apply minor style fixes
* llama: remove trailing whitespace
Daniel Bevenius [Sun, 25 Jan 2026 06:31:42 +0000 (07:31 +0100)]
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this change is that having only one decimal place
means that values generated using --help or llama-gen-docs will not
display the correct defaults.
For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually `0.95`. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.
* docs : run llama-gen-docs to update docs
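The rounding issue this commit describes is easy to reproduce: printf-style formatting with one decimal place turns the 0.95 default into 0.9. A sketch of the effect (not the actual arg.cpp code):

```python
top_p = 0.95  # the actual default

# one decimal place loses the real default value
assert "%.1f" % top_p == "0.9"
# two decimal places preserve it, matching what the docs should show
assert "%.2f" % top_p == "0.95"
```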
Bartowski [Sun, 25 Jan 2026 01:36:47 +0000 (20:36 -0500)]
convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064)
* Add undo_permute = False where needed
* Replace super().modify_tensors with ModelBase
* Add one more ModelBase.modify_tensors
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Johannes Gäßler [Sat, 24 Jan 2026 21:13:08 +0000 (22:13 +0100)]
llama-fit-params: keep explicit --ctx-size 0 (#19070)
Johannes Gäßler [Sat, 24 Jan 2026 20:57:51 +0000 (21:57 +0100)]
GGUF: check that tensor size is representable (#19072)
Xuan-Son Nguyen [Sat, 24 Jan 2026 16:58:45 +0000 (17:58 +0100)]
chat: fix language input for translategemma (#19052)
* chat: fix language input for translategemma
* Update common/chat.cpp
Co-authored-by: Aldehir Rojas <redacted>
---------
Co-authored-by: Aldehir Rojas <redacted>
Johannes Gäßler [Sat, 24 Jan 2026 09:09:36 +0000 (10:09 +0100)]
CUDA: re-use MLA K data for V in MMA FA (#19057)
Aman Gupta [Sat, 24 Jan 2026 06:25:20 +0000 (14:25 +0800)]
ggml-cuda: enable cuda-graphs for `n-cpu-moe` (#18934)
* ggml-cuda: add split-wise cuda graph
* add n-cpu-moe compare_llama_bench.py
* fix hip/musa builds
nullname [Sat, 24 Jan 2026 06:02:07 +0000 (14:02 +0800)]
ggml-hexagon: flash-attn opt (#19025)
* optimize flash attention kernel by improving score computation and online softmax update
* wip
* Refactor online softmax update in flash attention kernel for improved performance
* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation
* wip
Georgi Gerganov [Fri, 23 Jan 2026 16:22:34 +0000 (18:22 +0200)]
graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898)
* graph : avoid branches between embedding and token inputs
* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology
* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI
* cont : pad token embeddings to n_embd_inp
Neo Zhang [Fri, 23 Jan 2026 12:54:10 +0000 (20:54 +0800)]
[SYCL] use malloc to support both iGPU and dGPU at the same time (#18992)
* use malloc to support both iGPU and dGPU at the same time
* support windows
---------
Co-authored-by: Neo Zhang Jianyu <redacted>
Xuan-Son Nguyen [Fri, 23 Jan 2026 11:03:42 +0000 (12:03 +0100)]
chat : fix translategemma crash on common_chat_format_example (#19019)
Daniel Bevenius [Fri, 23 Jan 2026 08:01:36 +0000 (09:01 +0100)]
model-conversion : use BUILD_DIR variable in all scripts (#19015)
This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.
The motivation for this is that Commit
3d55846a5c626e2e608db8c24fa9ee6defaacca9 ("model-conversion : add
BUILD_DIR variable to run-converted-model scripts") introduced this
variable to the causal and embeddings scripts, but I missed the scripts
in the utils directory.
Alberto Cabrera Pérez [Fri, 23 Jan 2026 07:55:08 +0000 (07:55 +0000)]
ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback
Signed-off-by: Alberto Cabrera <redacted>
* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8
Signed-off-by: Alberto Cabrera <redacted>
* q5_K repack gemm and gemv generics
* Gemm and Gemv ARM implementations (i8mm)
* Improved qh manipulation looking at non-repack vec_dot implementation
* Full unroll
* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.
Signed-off-by: Alberto Cabrera <redacted>
* Fix wrong fallback definitions of Q5_K
Signed-off-by: Alberto Cabrera <redacted>
* Fixed comments. Reverted unnecessary formatting
Signed-off-by: Alberto Cabrera <redacted>
* Fixed typo in generic definitions
* Switching AND + Shift with Shift Insert. Better op interleaving.
* Vectorize + unroll the block scales
* Apply gemm optimizations to gemv
* Improve bias calculation
---------
Signed-off-by: Alberto Cabrera <redacted>
Aldehir Rojas [Fri, 23 Jan 2026 02:31:22 +0000 (20:31 -0600)]
cli : load parser definition (#19031)
* cli : load parser definition
* cont : only unload if a parser is defined
Xuan-Son Nguyen [Thu, 22 Jan 2026 20:30:06 +0000 (21:30 +0100)]
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input
* report template caps to webui
* add docs
* rm commented code
Georgi Gerganov [Thu, 22 Jan 2026 20:09:01 +0000 (22:09 +0200)]
mla : make the V tensor a view of K (#18986)
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Johannes Gäßler [Thu, 22 Jan 2026 19:39:25 +0000 (20:39 +0100)]
CUDA: fix alignment check for FA (#19023)
Aman Gupta [Thu, 22 Jan 2026 18:58:07 +0000 (02:58 +0800)]
convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866)
lhez [Thu, 22 Jan 2026 18:29:25 +0000 (10:29 -0800)]
opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels
* opencl: only copy to cont for f32 and f16 tensors
* opencl: use cont mm for fallback when dst is large
* opencl: use nb local to copy-to-cont
* opencl: use local offset as well
Xuan-Son Nguyen [Thu, 22 Jan 2026 18:24:37 +0000 (19:24 +0100)]
server: do not log certain endpoints (avoid log spam) (#19028)
Georgi Gerganov [Thu, 22 Jan 2026 14:17:06 +0000 (16:17 +0200)]
quant : manual overrides of tensor types take precedence (#18952)
Aaron Teo [Thu, 22 Jan 2026 13:38:02 +0000 (21:38 +0800)]
release: update github api (#19022)
Xuan-Son Nguyen [Thu, 22 Jan 2026 13:36:32 +0000 (14:36 +0100)]
mtmd : update docs to use llama_model_n_embd_inp (#18999)
손희준 [Thu, 22 Jan 2026 13:36:04 +0000 (22:36 +0900)]
server: Reorder methods in `server-task.cpp` (#19016)
* Move `task_result_state::update_chat_msg` to match with header
* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header
---------
Co-authored-by: openingnow <>
Aman Gupta [Thu, 22 Jan 2026 10:51:53 +0000 (18:51 +0800)]
CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953)
shaofeiqi [Thu, 22 Jan 2026 06:05:54 +0000 (22:05 -0800)]
opencl: add TRI op support (#18979)
Aleksei Nikiforov [Thu, 22 Jan 2026 00:16:21 +0000 (01:16 +0100)]
ggml-zdnn : mark zDNN buffers as non-host (#18967)
While the buffers reside in host memory,
an additional transformation is needed to use them with zDNN.
Fixes #18848
Pádraic Slattery [Wed, 21 Jan 2026 23:57:18 +0000 (00:57 +0100)]
ci : update GitHub Actions versions [no ci] (#18935)
Mariusz Woloszyn [Wed, 21 Jan 2026 23:55:55 +0000 (00:55 +0100)]
convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972)
* Add Ministral3ForCausalLM architecture
This adds support for newer architectures like Devstral-2
* removed blank line found after function decorator
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 18:24:37 +0000 (19:24 +0100)]
jinja: support none|string (#18995)
* jinja: support none|string
* Update common/jinja/value.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update tests/test-jinja.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Add as_string()
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Hendrik Erz [Wed, 21 Jan 2026 17:46:01 +0000 (18:46 +0100)]
fix: Use `tabular-nums` for chat message statistics (#18915)
* fix: Use `tabular-nums` for chat message statistics
* fix: Rebuild WebUI
Daniel Bevenius [Wed, 21 Jan 2026 17:31:34 +0000 (18:31 +0100)]
llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997)
This commit removes the mention of RoPE in the comment for the Q and K
computation as RoPE is not applied.
Jeff Bolz [Wed, 21 Jan 2026 17:01:40 +0000 (11:01 -0600)]
vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Adrien Gallouët [Wed, 21 Jan 2026 16:58:38 +0000 (17:58 +0100)]
common : improve error message when HTTPS is missing but required (#18987)
Signed-off-by: Adrien Gallouët <redacted>
손희준 [Wed, 21 Jan 2026 16:47:23 +0000 (01:47 +0900)]
server: /v1/responses (partial) (#18486)
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
Jeff Bolz [Wed, 21 Jan 2026 16:43:43 +0000 (10:43 -0600)]
vulkan: support flash attention GQA/split_k with small batches (#18938)
Masato Nakasaka [Wed, 21 Jan 2026 16:13:43 +0000 (01:13 +0900)]
Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831)
This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.
Jeff Bolz [Wed, 21 Jan 2026 15:22:02 +0000 (09:22 -0600)]
vulkan: Use mul_mat_vec_id for small values of n (#18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
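The per-column strategy is simple to picture: running the mat-vec kernel once per column of the activation matrix reproduces the mat-mat result, with cost growing linearly in n. A toy sketch with plain Python lists (not the Vulkan kernel):

```python
def matvec(W, x):
    # one output element per row of W
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmat_via_matvec(W, X):
    # loop the mat-vec kernel over the n columns of X, as the commit does
    n = len(X[0])
    cols = [matvec(W, [row[j] for row in X]) for j in range(n)]
    # re-assemble the columns into the result matrix
    return [[cols[j][i] for j in range(n)] for i in range(len(W))]

W = [[1, 2], [3, 4]]
X = [[5, 6], [7, 8]]  # two columns -> n == 2
assert matmat_via_matvec(W, X) == [[19, 22], [43, 50]]
```

Each column pays the full weight-read cost (no reuse across columns), which is why this only wins for small n, as the entry notes.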
Tarek Dakhran [Wed, 21 Jan 2026 12:30:23 +0000 (13:30 +0100)]
memory : add llama_memory_hybrid_iswa (#18601)
* memory : add llama_memory_hybrid_iswa
* Update src/llama-memory-hybrid-iswa.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Piotr Wilkin (ilintar) [Wed, 21 Jan 2026 11:35:20 +0000 (12:35 +0100)]
Fix GLM 4.7 Lite MoE gating func (#18980)
* Fix GLM 4.7 MoE gating func
* Update src/models/deepseek2.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Matthieu Coudron [Wed, 21 Jan 2026 06:52:46 +0000 (07:52 +0100)]
gguf: display strerrno when cant load a model (#18884)
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. It seems --models-dir and
--models-presets don't interact the way I thought they would, but I
salvaged this snippet, which helps with troubleshooting:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
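The gist of the change, sketched in Python (`open_gguf` is a hypothetical stand-in for gguf_init_from_file, not the real API): appending the strerror text for errno to the failure message immediately distinguishes a missing file from, say, a permissions problem.

```python
import errno
import os

def open_gguf(path):
    try:
        return open(path, "rb")
    except OSError as e:
        # include the human-readable errno, as the commit does
        raise RuntimeError(
            f"failed to open GGUF file '{path}' (errno {os.strerror(e.errno)})"
        )

# os.strerror turns the raw errno into the familiar message
assert os.strerror(errno.ENOENT) == "No such file or directory"
```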
Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator
Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0),
which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)
* Unindent as per code review request
Adrien Gallouët [Tue, 20 Jan 2026 17:28:43 +0000 (18:28 +0100)]
common, server : use the same User-Agent by default (#18957)
This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.
Signed-off-by: Adrien Gallouët <redacted>
Xuan-Son Nguyen [Tue, 20 Jan 2026 17:23:25 +0000 (18:23 +0100)]
cli : fix reasoning responses in CLI (#18961)
* cli : fix reasoning responses in CLI
* fix build
* fix build (2)
Oliver Simons [Tue, 20 Jan 2026 12:11:01 +0000 (13:11 +0100)]
CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930)
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu
Adrien Gallouët [Tue, 20 Jan 2026 10:42:49 +0000 (11:42 +0100)]
ggml : cleanup path_str() (#18928)
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.
Signed-off-by: Adrien Gallouët <redacted>
Georgi Gerganov [Tue, 20 Jan 2026 10:21:28 +0000 (12:21 +0200)]
metal : enable FA for MLA heads (#18950)
Daniel Bevenius [Tue, 20 Jan 2026 05:55:24 +0000 (06:55 +0100)]
convert : use n_groups instead of hardcoded values in reshape (#18929)
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter and allow Python to calculate the last
dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
Xuan-Son Nguyen [Mon, 19 Jan 2026 22:28:01 +0000 (23:28 +0100)]
server : refactor oai_parser_opt, move it to server_chat_params (#18937)
* server_chat_params
* move chat format into CLI
* use meta whenever possible
* clean up, no more chatml fallback
ddh0 [Mon, 19 Jan 2026 22:09:20 +0000 (16:09 -0600)]
convert : support Glm4MoeLite (#18936)
* initial commit for branch
* add glm-4.7-flash, move tokenizer hash
* use `glm4` pretok
* silence flake8 E302 (CI)
* apply review feedback
* add <|user|> as eog
* also add EOG `<|observation|>`
* revert llama-vocab
* inherit vocab from glm4
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:43 +0000 (20:29 +0100)]
jinja : fix undefined keys and attributes and int/float as bool (#18924)
* fix undefined keys and attributes
* add falsy tests
* as_bool for integers and floats
* more falsy/truthy tests
* --typo
Sigbjørn Skjæret [Mon, 19 Jan 2026 19:29:15 +0000 (20:29 +0100)]
ci : run test-jinja -py on high perf [no ci] (#18916)
Lennart Austenfeld [Mon, 19 Jan 2026 18:13:31 +0000 (19:13 +0100)]
server: fix memory reservations in populate_token_probs (#18787)
Georgi Gerganov [Mon, 19 Jan 2026 18:03:19 +0000 (20:03 +0200)]
ggml : add ggml_build_forward_select (#18550)
* ggml : add ggml_build_forward_select
* cuda : adapt CUDA graph compat to new feature
* vulkan : update logic to handle command buffer closing
* ggml : check compute for fusion
* ggml : add comment
Daniel Bevenius [Mon, 19 Jan 2026 12:12:38 +0000 (13:12 +0100)]
model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)
This commit adds a BUILD_DIR variable to the scripts used for running
converted models.
The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.
Julius Tischbein [Sun, 18 Jan 2026 16:35:57 +0000 (17:35 +0100)]
llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887)
Francisco Herrera [Sun, 18 Jan 2026 10:03:35 +0000 (05:03 -0500)]
docs: add linux to index (#18907)
Xuan-Son Nguyen [Sun, 18 Jan 2026 07:14:27 +0000 (08:14 +0100)]
tests : add test-jinja -py option for cross-checking (#18906)
* tests : add test-jinja -py option for cross-checking
* Update tests/test-jinja.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* fix + add source
* SandboxedEnvironment
* fix array.map case
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Sigbjørn Skjæret [Sun, 18 Jan 2026 02:40:06 +0000 (03:40 +0100)]
jinja : fix object item order (and properly implement dictsort) (#18904)
* fix object item order
* as_ordered_object
* copy whole object
Sigbjørn Skjæret [Sun, 18 Jan 2026 01:53:01 +0000 (02:53 +0100)]
jinja : attribute support for join, map and sort (#18883)
* support negative array index and default value
* attribute support (int and str) for join, map and sort
* add tests
* update CODEOWNERS
* improve fixme sorting comment
Sigbjørn Skjæret [Sun, 18 Jan 2026 00:05:09 +0000 (01:05 +0100)]
jinja : add missing tojson filter for bool (#18900)
* add missing tojson for bool
* add more literal tests
Sigbjørn Skjæret [Sat, 17 Jan 2026 23:57:51 +0000 (00:57 +0100)]
jinja : fix lexing of float literals with sign (#18901)
* fix lexing of float literals with sign
* add test
* consume_numeric
Xuan-Son Nguyen [Sat, 17 Jan 2026 23:48:55 +0000 (00:48 +0100)]
jinja: correct member access rule (#18905)
lhez [Sat, 17 Jan 2026 21:50:32 +0000 (13:50 -0800)]
opencl: fix q6_K mv for m=1 (#18893)
Sigbjørn Skjæret [Sat, 17 Jan 2026 20:52:02 +0000 (21:52 +0100)]
ci : add label for jinja changes (#18903)
Georgi Gerganov [Sat, 17 Jan 2026 13:42:42 +0000 (15:42 +0200)]
kv-cache : optimize KQ mask construction (#18842)
* kv-cache : optimize KQ mask construction
* cont : add explanation + improve
* cont : fix
Reese Levine [Sat, 17 Jan 2026 00:12:43 +0000 (16:12 -0800)]
ggml webgpu: support for backend sampling (#18880)
* ggml webgpu: add SOFTPLUS unary operator
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
* ggml webgpu: add EXPM1 unary operator
Implements EXPM1 (exp(x) - 1) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add FLOOR unary operator
Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add CEIL unary operator
Implements CEIL (rounds up to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add ROUND unary operator
Implements ROUND (rounds to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add TRUNC unary operator
Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
* Updates to webgpu get_memory
* Add argmax
* Add argmax,cumsum,sum,sum_rows
* Add necessary CPY/GET_ROWS operators
* Support for argsort using multi-pass strategy
* Update set_rows for i32 indices, move to pre-wgsl
* Port unary operators to pre-wgsl and support FILL
* Implement PAD
* Add support for top-k
* clean up, scope pipeline init mutex
* fix newline
* Add support for log
* Update LOG for better precision, and ops doc
---------
Co-authored-by: Abhijit Ramesh <redacted>
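The f16 overflow concern for SOFTPLUS noted above is the usual one for log(1 + exp(x)): exp(x) overflows long before the result does. The standard rewrite max(x, 0) + log1p(exp(-|x|)) avoids it; a sketch of the idea (not the WGSL shader itself):

```python
import math

def softplus(x):
    # log(1 + exp(x)), computed so that exp never sees a large
    # positive argument: softplus(x) = max(x, 0) + log1p(exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

assert abs(softplus(0.0) - math.log(2.0)) < 1e-12
assert softplus(1000.0) == 1000.0  # naive log(1 + exp(1000)) would overflow
```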
Thore Koritzius [Fri, 16 Jan 2026 14:59:56 +0000 (15:59 +0100)]
ggml : extend ggml_pool_1d + metal (#16429)
* chore: resolve conflicts
* feat: ggml metal impl
* fix: ggml_metal_kargs_pool_1d struct
* fix: require contiguous input
* chore: test pool_1d
* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts
* chore: add p0 and s0 to testing
* fix: allow padding for cpu and metal
* Update ggml/src/ggml-metal/ggml-metal.metal
* fix: correct single-threaded loop
* ggml : cleanup
* tests : add ne[1] != 1 tests
* fix: ne[1] handling in np
* cont : fixes
---------
Co-authored-by: Georgi Gerganov <redacted>
hipudding [Fri, 16 Jan 2026 12:32:17 +0000 (20:32 +0800)]
docs : update ops.md for CANN backend (#18654)
Perry Naseck [Fri, 16 Jan 2026 11:38:25 +0000 (06:38 -0500)]
ggml-blas: hide warnings from included BLAS headers (#18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set
* ggml-blas: hide warnings from included BLAS headers
Tarek Dakhran [Fri, 16 Jan 2026 10:23:08 +0000 (11:23 +0100)]
mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876)
Xuan-Son Nguyen [Fri, 16 Jan 2026 10:22:06 +0000 (11:22 +0100)]
common : implement new jinja template engine (#18462)
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support bound functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inference
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson output correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporary disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null -- > none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* implement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Julius Tischbein [Fri, 16 Jan 2026 08:46:51 +0000 (09:46 +0100)]
Setting mmap and direct_io to false as default in llama-bench.cpp (#18841)
Raul Torres [Fri, 16 Jan 2026 08:34:09 +0000 (08:34 +0000)]
CANN: Remove unused `ggml_cann_get_device` function (#18625)
Chenguang Li [Fri, 16 Jan 2026 08:24:04 +0000 (16:24 +0800)]
CANN: fix an issue where get_env was not fully renamed (#18796)
* CANN: fix an issue where get_env was not fully renamed
* ci: add cann with acl group
* ci: define use_acl_graph using GitHub Action
* ci: update cann dockerfile with acl graph
hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]
CANN: support gated linear attn (#18653)
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.
Co-authored-by: YushengZhao <redacted>
Co-authored-by: hipudding <redacted>
* CANN: optimize OP gla
Optimize gla for high performance
* Remove unused comments
---------
Co-authored-by: 赵禹昇 <redacted>
Co-authored-by: YushengZhao <redacted>
shaofeiqi [Thu, 15 Jan 2026 19:17:17 +0000 (11:17 -0800)]
OpenCL: add SOLVE_TRI op support (#18846)
Georgi Gerganov [Thu, 15 Jan 2026 18:53:01 +0000 (20:53 +0200)]
cuda : print less debug logs when disabling cuda graphs (#18868)
Georgi Gerganov [Thu, 15 Jan 2026 17:35:57 +0000 (19:35 +0200)]
context : do not reserve scheduler for warmups (#18867)
ddh0 [Thu, 15 Jan 2026 17:16:29 +0000 (11:16 -0600)]
llama : add adaptive-p sampler (#17927)
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author
ref: https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`
this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add adaptive-p to `common_sampler_types_from_chars`
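The cold-start fix described in this entry seeds the EMA state so that the very first average already equals the target: weighted_sum starts at target / (1 - decay) and total_weight at 1 / (1 - decay). A sketch under the commit's stated initialization (`make_ema` and the field names are illustrative, not the llama.cpp sampler API):

```python
def make_ema(target, decay):
    # cold-start fix: seed the sums so the initial average equals `target`
    state = {
        "weighted_sum": target / (1.0 - decay),
        "total_weight": 1.0 / (1.0 - decay),
    }
    def update(x):
        # standard exponentially-decayed running average
        state["weighted_sum"] = state["weighted_sum"] * decay + x
        state["total_weight"] = state["total_weight"] * decay + 1.0
        return state["weighted_sum"] / state["total_weight"]
    return update

ema = make_ema(target=0.5, decay=0.9)
# before any observations drift it, the average sits at the target
assert abs(ema(0.5) - 0.5) < 1e-12
```

Without this seeding, the first few observations dominate the average and the sampler overreacts early, which is exactly the "cold start" problem the commit fixes.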
Xuan-Son Nguyen [Thu, 15 Jan 2026 16:10:28 +0000 (17:10 +0100)]
server: improve slots scheduling for n_cmpl (#18789)
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
Georgi Gerganov [Thu, 15 Jan 2026 14:39:17 +0000 (16:39 +0200)]
context : reserve new scheduler when graph topology changes (#18547)
* context : reserve new scheduler when graph topology changes
* cont : fix
* cont : fix reserve
* cont : reserve only when changes occur + timing
* context : add comments
* llama : reserve on sampler changes
* common : allow null common_sampler
* server : task declares needs (embd, logits, sampling)
* server : do not init sampler if not needed
* llama : fix need_reserve when unsetting a sampler
* server : consolidate slot reset/clear logic
Johannes Gäßler [Thu, 15 Jan 2026 14:14:50 +0000 (15:14 +0100)]
CUDA: fix alignment on register spill for FA (#18815)
shalinib-ibm [Thu, 15 Jan 2026 09:31:18 +0000 (15:01 +0530)]
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837)
Xuan-Son Nguyen [Thu, 15 Jan 2026 09:24:28 +0000 (10:24 +0100)]
lora: make sure model keep track of associated adapters (#18490)
* lora: make sure model keep track of associated adapters
* deprecate llama_adapter_lora_free
* minor : std::unordered_set over std::set
---------
Co-authored-by: Georgi Gerganov <redacted>
Sigbjørn Skjæret [Thu, 15 Jan 2026 09:12:46 +0000 (10:12 +0100)]
model-loader : support bool array sliding window pattern (#18850)
Adrien Gallouët [Thu, 15 Jan 2026 08:47:29 +0000 (09:47 +0100)]
tests : download models only when running ctest (#18843)
Signed-off-by: Adrien Gallouët <redacted>