git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Ruben Ortlam [Thu, 2 Apr 2026 16:19:20 +0000 (18:19 +0200)]

tests: allow exporting graph ops from HF file without downloading weights (#21182)

* tests: allow exporting graph ops from HF file without downloading weights

* use unique_ptr for llama_context in HF metadata case

* fix missing non-required tensors falling back to type f32

* use unique pointers where possible

* use no_alloc instead of fixing f32 fallback

* fix missing space

Xuan-Son Nguyen [Thu, 2 Apr 2026 15:10:32 +0000 (17:10 +0200)]

model, mtmd: fix gguf conversion for audio/vision mmproj (#21309)

* fix gguf conversion for audio/vision mmproj

* fix test

Aldehir Rojas [Thu, 2 Apr 2026 13:59:59 +0000 (08:59 -0500)]

common : add commentary rules for gpt-oss-20b (#21286)

Piotr Wilkin (ilintar) [Thu, 2 Apr 2026 09:29:11 +0000 (11:29 +0200)]

Relax prefill parser to allow space. (#21240)

* Relax prefill parser to allow space.

* Move changes from prefix() to parser generation

* Only allow spaces if we're not having a pure content parser next

Jesus Talavera [Thu, 2 Apr 2026 09:28:56 +0000 (11:28 +0200)]

chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804)

* chat : add Granite 4.0 chat template with correct tool_call role mapping

Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).

The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.

Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
(preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)
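
The detection rule listed above can be sketched as follows. This is a minimal illustration, not the actual llama.cpp code: the `granite_template` enum and the `detect_granite_template` helper are hypothetical names, and only the substring checks are taken from the commit message.

```cpp
#include <string>

// Hypothetical stand-ins for the two template kinds named in the commit.
enum class granite_template { NONE, GRANITE_3_X, GRANITE_4_0 };

// Classify a chat template string by the markers described above.
inline granite_template detect_granite_template(const std::string & tmpl) {
    const auto contains = [&](const char * needle) {
        return tmpl.find(needle) != std::string::npos;
    };
    if (!contains("<|start_of_role|>")) {
        return granite_template::NONE; // not a Granite-style template
    }
    if (contains("<tool_call>") || contains("<tools>")) {
        return granite_template::GRANITE_4_0; // tool markers imply 4.0
    }
    return granite_template::GRANITE_3_X; // everything else stays on 3.x
}
```

Checking the role marker first keeps the 3.x path as the default, which is how the commit describes preserving existing behavior.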

Co-Authored-By: Claude Opus 4.6 <redacted>
* Code review: follow standard format and use common logic in test-chat-template.cpp

* Rename custom_conversation variable to extra_conversation to give it a more meaningful name

---------

Co-authored-by: Claude Opus 4.6 <redacted>

Georgi Gerganov [Thu, 2 Apr 2026 08:54:05 +0000 (11:54 +0300)]

kv-cache : do not quantize SWA KV cache (#21277)

Roger Chen [Thu, 2 Apr 2026 08:41:19 +0000 (16:41 +0800)]

Ignore Transfer-Encoding header. (#20269)

Georgi Gerganov [Thu, 2 Apr 2026 07:38:24 +0000 (10:38 +0300)]

sync : ggml

Georgi Gerganov [Thu, 2 Apr 2026 07:37:26 +0000 (10:37 +0300)]

ggml : bump version to 0.9.11 (ggml/1456)

Neo Zhang [Thu, 2 Apr 2026 07:08:32 +0000 (15:08 +0800)]

sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283)

Todor Boinovski [Thu, 2 Apr 2026 00:44:02 +0000 (17:44 -0700)]

hexagon : add cumsum op support (#21246)

* hexagon : add cumsum op support

* hexagon: enable dma for cumsum op

* Fix line-ending

---------

Co-authored-by: Max Krasnyansky <redacted>

Xuan-Son Nguyen [Wed, 1 Apr 2026 21:31:51 +0000 (23:31 +0200)]

contrib : rewrite AGENTS.md, make it more clear about project values (#21270)

* contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage

* permit AI for writing code

lhez [Wed, 1 Apr 2026 19:54:58 +0000 (12:54 -0700)]

opencl: fix leak in Adreno q8_0 path (#21212)

Aleksander Grygier [Wed, 1 Apr 2026 19:32:15 +0000 (21:32 +0200)]

server: Bypass API Key validation for WebUI static bundle assets (#21269)

* fix: Bypass API Key validation for static bundle assets

* refactor: All bypassed routes in `public_endpoints`

* test: Update static assets API Key test

Johannes Gäßler [Wed, 1 Apr 2026 19:28:19 +0000 (21:28 +0200)]

CUDA: fix FA kernel selection logic (#21271)

Martin Klacer [Wed, 1 Apr 2026 17:02:41 +0000 (18:02 +0100)]

kleidiai: add CPU feature detection to CI run script (#20394)

* kleidiai: add cpu feature detection to CI run script

Signed-off-by: Martin Klacer <redacted>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a

* kleidiai: revert unrelated requirements change

Signed-off-by: Martin Klacer <redacted>
* kleidiai: removed cpu feature detection from CI run script

* As per the maintainers' suggestion, removed cpu feature detection
from CI run script as CMake handles it already

Signed-off-by: Martin Klacer <redacted>
---------

Signed-off-by: Martin Klacer <redacted>

Nikhil Jain [Wed, 1 Apr 2026 16:53:05 +0000 (09:53 -0700)]

Update Dawn version in WebGPU CI (#20784)

* Pin Dawn version

* Update docs with new Dawn commit hash

Aparna M P [Wed, 1 Apr 2026 15:43:08 +0000 (21:13 +0530)]

hexagon: improve RMS_NORM and DIV accuracy (#21251)

* hexagon-rms_norm: fix RMS_NORM for non-aligned tensor sizes

Co-authored-by: Krishna Sridhar <redacted>
* hexagon-div: perform DIV in fp16 domain for lower dsp archs

---------

Co-authored-by: Krishna Sridhar <redacted>

Jonathan [Wed, 1 Apr 2026 14:22:44 +0000 (07:22 -0700)]

fix: tool call parsing for LFM2 and LFM2.5 models (#21242)

* fix: tool call parsing for LFM2 and LFM2.5 models

* refactor: add test / break out lfm2 and lfm2.5 parsing logic

Georgi Gerganov [Wed, 1 Apr 2026 13:58:01 +0000 (16:58 +0300)]

llama : rotate activations for better quantization (#21038)

* llama : rotate activations for better quantization

* cont : rotate V more + refactor

* cont : rotate caches separately + support non-power-of-2 head sizes

* cont : simplify

* cont : add reference for V rotation

* cont : refactor

* cont : support context shift

* cont : consolidate

* cont : dedup + allow different types for the rotation matrix

* cont : add env variable to disable rotation

* cont : simplify attn rot kv cache logic + rename env

* cont : pre-compute the Hadamard matrices

Xuan-Son Nguyen [Wed, 1 Apr 2026 13:31:58 +0000 (15:31 +0200)]

scripts: add function call test script (#21234)

* scripts: add function call test script

* add reasoning_content

* fix lint

Georgi Gerganov [Wed, 1 Apr 2026 13:02:34 +0000 (16:02 +0300)]

sync : ggml

Georgi Gerganov [Wed, 1 Apr 2026 13:01:45 +0000 (16:01 +0300)]

ggml : bump version to 0.9.10 (ggml/1454)

Neo Zhang [Wed, 1 Apr 2026 10:54:15 +0000 (18:54 +0800)]

sycl : support nvfp4 type in mul_mat (#21227)

Michael Wand [Wed, 1 Apr 2026 10:04:58 +0000 (03:04 -0700)]

ggml-cuda: Add generic NVFP4 MMQ kernel (#21074)

* Introduced NVFP4 generic MMQ kernel

* Added extra FP8 guard, hope to solve ci HIP failure

* Rename tiles and use HIP_FP8_AVAILABLE

* Removed remaining FP8 straggler and added const int

* Const

* Removed DECL_MMQ_CASE artifact

* Removed newline

* Removed space after else

* Changed HIP FP8 NVFP4 conversion gate

* Added new line to bottom of mmq.cu 270

* Removed extra spaces

* Removed single space in front of else on line 814

* Added NVFP4 to generate cu script so HIP can see it, further tightened logic

* Include generated mmq-instance-nvfp4.cu

* Added NVFP4 mmq to HIP Check ignore list

* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/mmq.cuh

Added function name ending for end if

Co-authored-by: Johannes Gäßler <redacted>
* Added function names to closing endif

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>

Ettore Di Giacinto [Wed, 1 Apr 2026 09:50:17 +0000 (11:50 +0200)]

memory: respect unified KV cache in hybrid memory for eval tasks (#21224)

The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

  split_equal: sequential split is not supported when there are
  coupled sequences in the input batch (you may need to use the
  -kvu flag)

PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.

This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.
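
The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `split_mode` and `choose_equal_split` are hypothetical names, not the functions used in the hybrid memory code; the only fact taken from the commit is that `n_stream == 1` denotes a unified KV cache and selects the non-sequential split.

```cpp
#include <cstdint>

// Hypothetical labels for the two ways of splitting a batch into equal ubatches.
enum class split_mode { SEQUENTIAL, NON_SEQUENTIAL };

// Hypothetical helper: n_stream == 1 means the KV cache is unified, so coupled
// sequences are allowed and the non-sequential split must be used.
inline split_mode choose_equal_split(uint32_t n_stream) {
    return n_stream == 1 ? split_mode::NON_SEQUENTIAL : split_mode::SEQUENTIAL;
}
```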

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):
- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%
All previously failed with llama_decode() error.

uvos [Wed, 1 Apr 2026 08:21:20 +0000 (10:21 +0200)]

CUDA/HIP: Fix kernel selection for mmvq mmid kernel to align host selection with device launch bounds (#21238)

The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-NVIDIA devices. As a result, on HIP devices we attempt to launch the kernel for batch sizes whose configurations exceed our launch bounds. This PR fixes the conditionals in get_mmvq_mmid_max_batch.

Fixes #21191
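
Why a bare `cc >= ...` comparison can match non-NVIDIA devices is easiest to see with numbers. The sketch below assumes that non-NVIDIA compute capabilities are encoded at a large numeric offset above the NVIDIA ones; all constants and helper names here are illustrative assumptions, not the values or functions defined in ggml-cuda.

```cpp
// Illustrative constants only; the real values live in ggml-cuda and may differ.
constexpr int CC_VOLTA        = 700;
constexpr int CC_TURING       = 750;
constexpr int CC_ADA_LOVELACE = 890;
constexpr int CC_OFFSET_AMD   = 0x1000000; // assumed: non-NVIDIA ids start at a large offset

// Hypothetical vendor check: everything below the offset is treated as NVIDIA.
constexpr bool cc_is_nvidia(int cc) { return cc < CC_OFFSET_AMD; }

// Unguarded check: also true for every id at or above the AMD offset.
constexpr bool turing_or_newer_unguarded(int cc) { return cc >= CC_TURING; }

// Guarded check: only NVIDIA devices can satisfy it.
constexpr bool turing_or_newer_guarded(int cc) { return cc_is_nvidia(cc) && cc >= CC_TURING; }
```

With this encoding, any AMD id passes the unguarded test regardless of its actual architecture, which is the kind of mismatch between host-side selection and device launch bounds that the commit describes.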

Georgi Gerganov [Wed, 1 Apr 2026 08:10:25 +0000 (11:10 +0300)]

ggml : fix RWKV ops thread assignment (#21226)

Taimur Ahmad [Wed, 1 Apr 2026 08:10:03 +0000 (13:10 +0500)]

ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)

* ggml-cpu: refactor sgemm; fix rvv checks

* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off

Anav Prasad [Wed, 1 Apr 2026 07:07:24 +0000 (07:07 +0000)]

CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)

* flash attention support for head dimension 512 added

* FA D=512 - match 576 configs, limit ncols2, revert vec cap

* fix HIP tile kernel build for D=512

* fix HIP tile kernel occupancy for D=512 on AMD

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>
* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <redacted>

Ed Addario [Wed, 1 Apr 2026 05:43:00 +0000 (06:43 +0100)]

llama : refactor llama_model_quantize_params to expose a pure C interface (#20346)

* Refactor llama_model_quantize_params to expose a pure C interface

* Restore comment and cleanup struct def

* Code review refactoring

Co-authored-by: Georgi Gerganov <redacted>
* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <redacted>

Reese Levine [Wed, 1 Apr 2026 05:38:24 +0000 (22:38 -0700)]

ggml webgpu: quantized buffers to u32 + wider browser/device support (#21046)

* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

Abhijit Ramesh [Tue, 31 Mar 2026 22:38:16 +0000 (15:38 -0700)]

ggml-webgpu: port all AOT operators to JIT (#20728)

* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion

Aleksander Grygier [Tue, 31 Mar 2026 15:47:46 +0000 (17:47 +0200)]

fix: Use lower-case proxy headers naming (#21235)

Adrien Gallouët [Tue, 31 Mar 2026 14:18:00 +0000 (16:18 +0200)]

common : cleanup logs and modernize the progress bar (#21215)

```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <redacted>

hipudding [Tue, 31 Mar 2026 14:00:51 +0000 (22:00 +0800)]

CANN: fix multi-thread set_tensor race conditions (#20151)

* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
   before uploading to device. Per-chunk transforms produced corrupt data.

2. ND-to-NZ weight conversion requires complete tensor data on device.
   Per-chunk conversion operated on incomplete data.

3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.
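
The deferred-processing idea can be sketched as below. This is a simplified illustration, not the CANN backend's TensorSetTracker: the `chunk_tracker` class, its members, and the assumption that callers write disjoint, in-bounds chunks exactly once each are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical tracker: stages chunks on the host and reports when the tensor is complete.
class chunk_tracker {
public:
    explicit chunk_tracker(size_t total_bytes) : staging_(total_bytes), received_(0) {}

    // Copy one chunk into the staging buffer; returns true once all bytes have arrived.
    // Assumes callers write disjoint, in-bounds chunks exactly once each.
    bool add_chunk(const void * data, size_t offset, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::memcpy(staging_.data() + offset, data, size);
        received_ += size;
        return received_ == staging_.size();
    }

    // Full tensor contents; only meaningful after add_chunk has returned true.
    const std::vector<uint8_t> & staged() const { return staging_; }

private:
    std::mutex           mutex_;
    std::vector<uint8_t> staging_;
    size_t               received_;
};
```

The caller that sees `true` is the one that would run the full-tensor transform and upload, so the transform runs once and only on complete data.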

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
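
A plain C++ sketch of the reference formula quoted above, scale = 1/fmaxf(norm, eps). The function name `l2_norm` and the vector-based interface are assumptions made for illustration; this is not the ggml CPU kernel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference-style L2 normalization: y = x / max(||x||_2, eps).
inline std::vector<float> l2_norm(const std::vector<float> & x, float eps) {
    float sum = 0.0f;
    for (float v : x) {
        sum += v * v;
    }
    // Clamp the norm to at least eps before dividing.
    const float scale = 1.0f / std::fmax(std::sqrt(sum), eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
    return y;
}
```

For x = (3, 4) the norm is 5, so with eps = 10.0 the clamp is active and the result is x / 10 rather than x / 5, which is the case where ignoring eps gives wrong results.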

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.

Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
  so the cache can distinguish tensors with different types but
  identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
  SCALE/UNARY/GLU/ROPE/POOL_2D
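
The cache-key problem can be illustrated with a reduced node description. The struct and function names below are hypothetical and much smaller than ggml_graph_node_properties; the sketch only shows why shapes and strides alone cannot tell f16 from bf16.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class elem_type { F16, BF16, F32 }; // hypothetical subset of tensor types

// Hypothetical, reduced node description used as a cache key.
struct node_props {
    elem_type              type;
    std::array<int64_t, 4> ne;        // elements per dimension
    std::array<size_t, 4>  nb;        // byte strides
    std::array<int32_t, 4> op_params; // operator parameters
};

// Old-style comparison: shapes and strides only.
inline bool shape_only_match(const node_props & a, const node_props & b) {
    return a.ne == b.ne && a.nb == b.nb;
}

// Stricter comparison: also require equal type and equal op_params for every op.
inline bool full_match(const node_props & a, const node_props & b) {
    return shape_only_match(a, b) && a.type == b.type && a.op_params == b.op_params;
}
```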

Xuan-Son Nguyen [Tue, 31 Mar 2026 13:44:26 +0000 (15:44 +0200)]

server: (webui) no more gzip compression (#21073)

* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a3531593d87b724d404c8727a96becab3ab07.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <redacted>

Aldehir Rojas [Tue, 31 Mar 2026 11:52:42 +0000 (06:52 -0500)]

common : gpt-oss handle builtin and unsolicited tool calls (#21213)

lainon1 [Tue, 31 Mar 2026 11:50:51 +0000 (12:50 +0100)]

fix: correct misspellings in code comments (#21217)

- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)

Seungmin Kim [Tue, 31 Mar 2026 11:02:56 +0000 (20:02 +0900)]

CI: Enable CPU and Vulkan ARM64 Release (#21207)

Georgi Gerganov [Tue, 31 Mar 2026 10:08:13 +0000 (13:08 +0300)]

sync : ggml

Georgi Gerganov [Mon, 30 Mar 2026 15:34:29 +0000 (18:34 +0300)]

ggml : bump version to 0.9.9 (ggml/1449)

Adrien Gallouët [Tue, 31 Mar 2026 10:53:41 +0000 (12:53 +0200)]

common : move up common_init() and fix Windows UTF-8 logs (#21176)

The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <redacted>

Neo Zhang [Tue, 31 Mar 2026 10:31:50 +0000 (18:31 +0800)]

sycl : enhance fattn perf (#21185)

mtmcp [Tue, 31 Mar 2026 10:04:42 +0000 (07:04 -0300)]

common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082)

* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e584cae3fa8cee1d739a858a2d780f1de009

* Add regression test

* Remove regression test for init-fail sampler check

SATISH K C [Tue, 31 Mar 2026 08:52:34 +0000 (03:52 -0500)]

fix: include API key in CORS proxy requests for MCP connections (#21193)

* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes #21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.

Piotr Wilkin (ilintar) [Tue, 31 Mar 2026 08:42:06 +0000 (10:42 +0200)]

server/webui: cleanup dual representation approach, simplify to openai-compat (#21090)

* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>

Adrien Gallouët [Tue, 31 Mar 2026 07:21:54 +0000 (09:21 +0200)]

vendor : update BoringSSL to 0.20260327.0 (#21211)

Signed-off-by: Adrien Gallouët <redacted>

Galunid [Tue, 31 Mar 2026 07:14:01 +0000 (09:14 +0200)]

common : Disable backend sampling if reasoning budget is enabled (#21209)

shaofeiqi [Mon, 30 Mar 2026 19:19:16 +0000 (12:19 -0700)]

opencl: add q4_K gemm and gemv kernels for Adreno (#20919)

* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <redacted>

Seungmin Kim [Mon, 30 Mar 2026 18:24:37 +0000 (03:24 +0900)]

CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)

* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <redacted>
* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Ts-sound <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

Zhihao "Zephyr" Yao [Mon, 30 Mar 2026 18:08:46 +0000 (14:08 -0400)]

jinja : handle empty expressions correctly (#20913)

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <redacted>
* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Oliver Simons [Mon, 30 Mar 2026 14:20:00 +0000 (16:20 +0200)]

CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
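
The off-by-one can be checked with a small arithmetic sketch. It assumes one thread writes one offset entry, so a launch of `grid` blocks covers `grid * block_size` entries; the helper names are hypothetical, and only the two `ceildiv` expressions come from the commit message.

```cpp
#include <cstddef>

constexpr size_t ceildiv(size_t a, size_t b) { return (a + b - 1) / b; }

// Number of offset entries a launch with `grid` blocks of `block_size` threads can write
// (assumes one thread per entry).
constexpr size_t covered(size_t grid, size_t block_size) { return grid * block_size; }

// A segmented sort over nrows rows needs nrows + 1 offsets (indices 0..nrows).
constexpr bool covers_all_offsets(size_t grid, size_t block_size, size_t nrows) {
    return covered(grid, block_size) >= nrows + 1;
}
```

With nrows = 256 and block_size = 256, `ceildiv(nrows, block_size)` gives one block and 256 entries, one short of the 257 needed, while `ceildiv(nrows + 1, block_size)` gives two blocks. When nrows is not a multiple of block_size, the rounding slack already covers the extra entry, which is why only the exact-multiple case was affected.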

Fixes #21162

* Reduce nrows in test case to 256, don't need 768

Radoslav Gerganov [Mon, 30 Mar 2026 14:05:11 +0000 (17:05 +0300)]

rpc : fix misleading error log (#21184)

When RPC is running with a remote backend that has no init_tensor
function (such as CPU and Metal), the server log fills up with error
messages claiming that init_tensor is being called with a null buffer,
which is incorrect. This patch fixes this.

Aleksander Grygier [Mon, 30 Mar 2026 12:40:50 +0000 (14:40 +0200)]

webui: Fix branching logic on edit message (#21175)

* fix: Branching logic + small refactor

* chore: update webui build output

Aman Gupta [Mon, 30 Mar 2026 09:40:17 +0000 (17:40 +0800)]

llama-model-loader: print warning when using overrides with mmap (#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning

Sigbjørn Skjæret [Mon, 30 Mar 2026 07:29:15 +0000 (09:29 +0200)]

ci : bump ty to 0.0.26 (#21156)

* fix incorrect type ignore comments

* bump ty to 0.0.26

Xuan-Son Nguyen [Mon, 30 Mar 2026 06:59:16 +0000 (08:59 +0200)]

server: wrap headers for mcp proxy (#21072)

* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <redacted>
* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Aleksander Grygier <redacted>

Sigbjørn Skjæret [Sun, 29 Mar 2026 17:45:40 +0000 (19:45 +0200)]

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)

Gaurav Garg [Sun, 29 Mar 2026 16:35:18 +0000 (22:05 +0530)]

Optimize MOE GEMV kernel for BS > 1. (#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. Each block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).

This change does not increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <redacted>

Max Krasnyansky [Sun, 29 Mar 2026 13:40:13 +0000 (06:40 -0700)]

hexagon: dma optimizations (mostly fixing regressions) (#21137)

* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.

* hex-rope: update comment to clarify that we don't need in-order DMA completions

Davi Henrique Linhares [Sun, 29 Mar 2026 05:34:03 +0000 (02:34 -0300)]

devops: including compute-runtime for intel.Dockerfile (#21076)

Neo Zhang [Sun, 29 Mar 2026 01:02:45 +0000 (09:02 +0800)]

[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)

* use half cores to build, avoid OS hang

* reduce the amount of output text to shorten test time

* avoid returning 0

Sigbjørn Skjæret [Sat, 28 Mar 2026 21:27:38 +0000 (22:27 +0100)]

fix **/x glob matching (#21129)

Piotr Wilkin (ilintar) [Sat, 28 Mar 2026 19:41:32 +0000 (20:41 +0100)]

common/parser: fix handling of tool definition with missing properties key (#21128)

Sigbjørn Skjæret [Sat, 28 Mar 2026 18:57:37 +0000 (19:57 +0100)]

common : add character class support to glob_match (#21111)

* add character class support to glob_match

* remove pointless reference

BlueMöhre [Sat, 28 Mar 2026 16:57:59 +0000 (17:57 +0100)]

WebUI: Replace illegal nested button elements (#21026)

* remove/replace nested button elements

* map rest props to outer element

* solve TODO

* chore: update webui build output

Adrien [Sat, 28 Mar 2026 16:55:38 +0000 (17:55 +0100)]

common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
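
The first two points of the fix can be sketched as a standalone helper. This is an illustration under stated assumptions, not the code in _visit_pattern(): the function name `handle_group_prefix`, its signature, and the `skipped` flag are hypothetical, and the sketch does not account for parentheses inside character classes.

```cpp
#include <cstddef>
#include <string>

// Given `i` pointing just past an opening '(', decide how to continue.
// Returns the index at which normal parsing should resume and sets `skipped`
// when an unsupported group was dropped entirely.
inline size_t handle_group_prefix(const std::string & p, size_t i, bool & skipped) {
    skipped = false;
    if (i >= p.size() || p[i] != '?') {
        return i; // ordinary capturing group
    }
    if (i + 1 < p.size() && p[i + 1] == ':') {
        return i + 2; // non-capturing group: parse its body like a regular group
    }
    // Unsupported syntax such as (?= or (?!: skip past the matching ')'.
    skipped = true;
    int depth = 1;
    while (i < p.size() && depth > 0) {
        if (p[i] == '\\' && i + 1 < p.size()) {
            i += 2; // an escaped character never changes the depth
            continue;
        }
        if (p[i] == '(') { ++depth; }
        if (p[i] == ')') { --depth; }
        ++i;
    }
    return i;
}
```

Because the '?' is consumed before the group body is parsed, it can no longer be mistaken for a quantifier applied to an empty sequence, which is the failure mode described in the root cause above.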

Aldehir Rojas [Sat, 28 Mar 2026 14:33:39 +0000 (09:33 -0500)]

common : add reasoning_format = none support to gpt-oss (#21094)

Georgi Gerganov [Sat, 28 Mar 2026 14:27:36 +0000 (16:27 +0200)]

server : fix processing of multiple back-to-back mtmd chunks (#21107)

Adrien Gallouët [Sat, 28 Mar 2026 13:49:57 +0000 (14:49 +0100)]

ci : gracefully shut down the server (#21110)

Signed-off-by: Adrien Gallouët <redacted>

Woof Dog [Sat, 28 Mar 2026 13:19:16 +0000 (13:19 +0000)]

Document custom default webui preferences in server README (#19771)

Aleksander Grygier [Sat, 28 Mar 2026 12:38:15 +0000 (13:38 +0100)]

webui: Conversation forking + branching improvements (#21021)

* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output

Adrien Gallouët [Sat, 28 Mar 2026 07:59:44 +0000 (08:59 +0100)]

vendor : update cpp-httplib to 0.40.0 (#21100)

Signed-off-by: Adrien Gallouët <redacted>

Ruben Ortlam [Sat, 28 Mar 2026 07:44:56 +0000 (08:44 +0100)]

vulkan: add noncontiguous GLU support (#21081)

* vulkan: add noncontiguous GLU support

* fix compile issue

Piotr Wilkin (ilintar) [Sat, 28 Mar 2026 06:29:26 +0000 (07:29 +0100)]

common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)

* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker

Sigbjørn Skjæret [Sat, 28 Mar 2026 01:33:04 +0000 (02:33 +0100)]

cli : add /glob command (#21084)

* add /glob command

* output error when max files reached

* support globbing outside curdir

Ts-sound [Sat, 28 Mar 2026 00:45:09 +0000 (08:45 +0800)]

docker : fix and enable ARM64 image build (#20929)

* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>

Adrien Gallouët [Sat, 28 Mar 2026 00:12:43 +0000 (01:12 +0100)]

server : add custom socket options to disable SO_REUSEPORT (#21056)

* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <redacted>
* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

Signed-off-by: Adrien Gallouët <redacted>
* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <redacted>
* Fix windows

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
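
The two strace runs above differ only in the extra SO_REUSEPORT call. A minimal sketch of that sequence with plain POSIX sockets follows. The helper names `make_listen_socket` and `option_enabled` are hypothetical and are not taken from llama-server or cpp-httplib. `IPPROTO_TCP` is used in place of `SOL_TCP`, which has the same value on Linux.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: create a TCP socket and apply the options seen in
// the trace. SO_REUSEPORT is set only when reuse_port is true.
inline int make_listen_socket(bool reuse_port) {
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    const int on = 1;
    bool ok = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0 &&
              setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == 0;
#ifdef SO_REUSEPORT
    if (ok && reuse_port) {
        ok = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) == 0;
    }
#endif
    if (!ok) {
        close(fd);
        return -1;
    }
    return fd; // the caller is expected to bind() and listen() next
}

// Hypothetical helper: report whether an integer socket option is non-zero.
inline bool option_enabled(int fd, int level, int name) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, level, name, &value, &len) != 0) {
        return false;
    }
    return value != 0;
}
```

With SO_REUSEPORT left unset, a second process that tries to bind the same address and port fails instead of silently sharing it, which is presumably the reason for making the option opt-in.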

commit | commitdiff | tree

Aldehir Rojas [Fri, 27 Mar 2026 17:30:40 +0000 (12:30 -0500)]

common : inhibit lazy grammar sampler while reasoning is active (#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <redacted>

commit | commitdiff | tree

Kusha Gharahi [Fri, 27 Mar 2026 16:25:55 +0000 (11:25 -0500)]

server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158)

* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* Missed this

* Add useWebUi to package.nix

commit | commitdiff | tree

Yiwei Shao [Fri, 27 Mar 2026 16:22:41 +0000 (09:22 -0700)]

hexagon: support for IQ4_NL and MXFP4 (#21018)

* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support

- Add IQ4_NL quantization type support to Hexagon backend (buffer
  set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
  LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
  including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
  Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path

* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models

* Fix the pragma indent

commit | commitdiff | tree

Aleksander Grygier [Fri, 27 Mar 2026 16:01:36 +0000 (17:01 +0100)]

webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading with transitions to content blocks (#20999)

* refactor: Always use agentic content renderer for Assistant Message

* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks

* chore: update webui build output

commit | commitdiff | tree

AN Long [Fri, 27 Mar 2026 11:36:13 +0000 (19:36 +0800)]

server: remove the verbose_prompt parameter (#21059)

* server: respect the verbose_prompt parameter

* Revert "server: respect the verbose_prompt parameter"

This reverts commit 8ed885cf375b2c8ba641c661f3667df70b9797f4.

* Remove --verbose-prompt parameter from llama-server

* Using set_examples instead of set_excludes

commit | commitdiff | tree

Xuan-Son Nguyen [Fri, 27 Mar 2026 10:00:52 +0000 (11:00 +0100)]

mtmd: add more sanity checks (#21047)

commit | commitdiff | tree

Xuan-Son Nguyen [Fri, 27 Mar 2026 09:07:11 +0000 (10:07 +0100)]

server: add built-in tools backend support (#20898)

* wip: server_tools

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* change arg to --tools all

* add readme mention

* llama-gen-docs

commit | commitdiff | tree

Radoslav Gerganov [Fri, 27 Mar 2026 08:59:35 +0000 (10:59 +0200)]

rpc : proper handling of data pointers to CPU buffers (#21030)

The compute graph may contain tensors pointing to CPU buffers. In these
cases the buffer address is serialized as 0 and sent over the wire.
However, the data pointer is serialized as-is and this prevents proper
validation on the server side. This patch fixes this by serializing
the data pointer as 0 for non-RPC buffers and doing proper validation on
the server side.

closes: #21006
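
The rule described in this message can be sketched as follows. This is an illustration only: `wire_tensor`, `serialize_tensor` and `validate_tensor` are hypothetical names, the two-field struct is not the real RPC wire format, and the bounds check is an assumed form of the "proper validation", which the message does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wire record holding only the two fields discussed above.
struct wire_tensor {
    std::uint64_t buffer; // remote buffer handle, 0 for non-RPC (e.g. CPU) buffers
    std::uint64_t data;   // data pointer, 0 whenever buffer is 0
};

// Client side: a tensor outside an RPC buffer is sent with both fields
// zeroed, so no local address reaches the server.
inline wire_tensor serialize_tensor(bool in_rpc_buffer, std::uint64_t remote_buffer, std::uint64_t data) {
    if (!in_rpc_buffer) {
        return {0, 0};
    }
    return {remote_buffer, data};
}

// Server side (assumed check): a null buffer must come with a null data
// pointer; otherwise the pointer must fall inside [base, base + size).
inline bool validate_tensor(const wire_tensor & t, std::uint64_t base, std::uint64_t size) {
    if (t.buffer == 0) {
        return t.data == 0;
    }
    return t.data >= base && t.data - base < size;
}
```

Zeroing the pointer on the client lets the server treat any non-zero data pointer with a zero buffer as malformed input.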

commit | commitdiff | tree

mtmcp [Fri, 27 Mar 2026 08:25:58 +0000 (05:25 -0300)]

completion : session_tokens insert range in completion tool (no-op → correct) (#20917)

The range embd.begin(), embd.begin() is empty and inserts nothing, so session_tokens is never updated after
decoding. The range should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
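
The difference between the two iterator ranges can be shown with a plain `std::vector`. This is a minimal sketch: `token` stands in for the real token type, and both helper functions are hypothetical, written only to contrast the empty range with the full one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real token type (an assumption for this sketch).
using token = int;

// Buggy form: [begin, begin) is an empty range, so insert() adds nothing.
// Returns the number of tokens appended.
inline std::size_t append_empty_range(std::vector<token> & session_tokens, const std::vector<token> & embd) {
    const std::size_t before = session_tokens.size();
    session_tokens.insert(session_tokens.end(), embd.begin(), embd.begin());
    return session_tokens.size() - before;
}

// Corrected form: [begin, end) covers every element of embd.
// Returns the number of tokens appended.
inline std::size_t append_full_range(std::vector<token> & session_tokens, const std::vector<token> & embd) {
    const std::size_t before = session_tokens.size();
    session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
    assert(session_tokens.size() == before + embd.size());
    return session_tokens.size() - before;
}
```

The empty-range call compiles and runs without any warning, which is why a bug of this kind is easy to miss.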

commit | commitdiff | tree

mtmcp [Fri, 27 Mar 2026 08:01:13 +0000 (05:01 -0300)]

completion : Fix segfault on model load failure (#21049)

commit | commitdiff | tree

Pascal [Fri, 27 Mar 2026 07:17:35 +0000 (08:17 +0100)]

Send reasoning content back to the model across turns via the reasoning_content API field (#21036)

* webui: send reasoning_content back to model in context

Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).

Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.

* webui: add syncable parameter for excludeReasoningFromContext

* chore: update webui build output

commit | commitdiff | tree

ren [Fri, 27 Mar 2026 07:05:21 +0000 (00:05 -0700)]

metal : Fix dimension constraint violation in matmul2d descriptor (#21048)

Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).

commit | commitdiff | tree

KokerZhou [Fri, 27 Mar 2026 00:53:00 +0000 (08:53 +0800)]

CANN: update docker images to 8.5.0 and improve CANN.md (#20801)

* cann: update docker images to 8.5.0

- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0

Move to newer stable releases.

* cann: update CANN.md

* Update CANN.md to include BF16 support

Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.

* Fix formatting issues in CANN.md

Fix 234: Trailing whitespace

commit | commitdiff | tree

Saba Fallah [Thu, 26 Mar 2026 23:07:55 +0000 (00:07 +0100)]

mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027)

* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

uvos [Thu, 26 Mar 2026 22:06:33 +0000 (23:06 +0100)]

hip: use fnuz fp8 for conversion on CDNA3 (#21040)

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 26 Mar 2026 19:44:00 +0000 (20:44 +0100)]

ci: pin external actions to exact commit SHA (#21033)

commit | commitdiff | tree

Adrien Gallouët [Thu, 26 Mar 2026 19:34:23 +0000 (20:34 +0100)]

common : add getpwuid fallback for HF cache when HOME is not set (#21035)

Signed-off-by: Adrien Gallouët <redacted>
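
One way such a fallback can look on POSIX systems is sketched below. It is not the code from this commit: `home_dir_or_empty` is a hypothetical name, and treating an empty HOME the same as an unset one is an assumption. `getenv`, `getuid` and `getpwuid` are the standard POSIX calls.

```cpp
#include <cassert>
#include <cstdlib>
#include <pwd.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: prefer $HOME, then fall back to the home directory
// recorded in the password database for the current user.
inline std::string home_dir_or_empty() {
    const char * home = std::getenv("HOME");
    if (home != nullptr && home[0] != '\0') {
        return home;
    }
    const struct passwd * pw = getpwuid(getuid());
    if (pw != nullptr && pw->pw_dir != nullptr) {
        return pw->pw_dir;
    }
    return ""; // no usable home directory; the caller picks another location
}
```

HOME is commonly missing in minimal containers and under some service managers, where the password database may still have an entry for the user.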

commit | commitdiff | tree

Xuan-Son Nguyen [Thu, 26 Mar 2026 18:49:20 +0000 (19:49 +0100)]

mtmd: refactor image preprocessing (#21031)

* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size

commit | commitdiff | tree

lhez [Thu, 26 Mar 2026 15:52:21 +0000 (08:52 -0700)]

opencl: allow large buffer for adreno (#20997)

commit | commitdiff | tree

Michael Wand [Thu, 26 Mar 2026 15:52:06 +0000 (08:52 -0700)]

convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505)

* convert : fix Qwen3.5 NVFP4 conversion

* Updated copilot concerns and rebased

* move into _LinearAttentionVReorderBase and simplify

* --flake

* new_name not needed

* Added input_scale to gguf

* Fixed input_scale addition as tensor

* Added input scale to loader and named _in_s

* Update convert_hf_to_gguf.py

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Pavel Zloi [Thu, 26 Mar 2026 15:49:09 +0000 (18:49 +0300)]

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011)

* Support of ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* chkhsh for ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Packaging of ggml-org/llama.cpp
