git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Aldehir Rojas [Mon, 6 Apr 2026 14:08:37 +0000 (09:08 -0500)]

vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488)

Sigbjørn Skjæret [Mon, 6 Apr 2026 12:05:18 +0000 (14:05 +0200)]

convert : fix block_ff_dim retrieval for lfm2 (#21508)

lainon1 [Mon, 6 Apr 2026 12:03:02 +0000 (13:03 +0100)]

server : handle unsuccessful sink.write in chunked stream provider (#21478)

Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
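
A minimal sketch of the contract described above, written in Python for brevity rather than in the server's C++. `FakeSink` and `stream_chunks` are hypothetical names, not cpp-httplib or llama.cpp APIs. The only behavior taken from the commit message is that a failed write makes the provider report failure and that the rejected chunk is not logged as sent.

```python
class FakeSink:
    """Hypothetical stand-in for an HTTP data sink.

    write() starts returning False after `fail_after` successful writes,
    which simulates the peer closing the connection mid-stream.
    """

    def __init__(self, fail_after: int):
        self.fail_after = fail_after
        self.sent = []

    def write(self, data: bytes) -> bool:
        if len(self.sent) >= self.fail_after:
            return False
        self.sent.append(data)
        return True


def stream_chunks(sink, chunks):
    """Send chunks in order; stop on the first rejected write.

    Returns (ok, logged), where `logged` holds only the chunks the sink
    actually accepted.
    """
    logged = []
    for chunk in chunks:
        if not sink.write(chunk):
            return False, logged  # abort the stream, do not log this chunk
        logged.append(chunk)
    return True, logged
```

With `FakeSink(fail_after=2)` and three chunks, the call returns a failure flag and a log that contains only the first two chunks.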

Xuan-Son Nguyen [Mon, 6 Apr 2026 12:02:37 +0000 (14:02 +0200)]

docs: add hunyuan-ocr gguf, also add test [no ci] (#21490)

Georgi Gerganov [Mon, 6 Apr 2026 10:52:07 +0000 (13:52 +0300)]

convert : set "add bos" == True for Gemma 4 (#21500)

* convert : set "add bos" == True for Gemma 4

* cont : handle old GGUFs

Neo Zhang [Mon, 6 Apr 2026 10:28:00 +0000 (18:28 +0800)]

sycl : handle other FA case (#21377)

Yarden Tal [Mon, 6 Apr 2026 01:30:25 +0000 (04:30 +0300)]

hexagon: slight optimization for argsort output init (#21463)

anchortense [Sun, 5 Apr 2026 23:40:38 +0000 (09:40 +1000)]

llama : correct platform-independent loading of BOOL metadata (#21428)

* model-loader : fix GGUF bool array conversion

* model-loader : fix remaining GGUF bool pointer uses

Richard Davison [Sun, 5 Apr 2026 21:32:14 +0000 (23:32 +0200)]

model : add HunyuanOCR support (#21395)

* HunyuanOCR: add support for text and vision models

- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion

* fix proper mapping

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Xuan-Son Nguyen <redacted>
* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* address comments

* update

* Fix typecheck

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

Ludovic Henry [Sun, 5 Apr 2026 18:29:48 +0000 (20:29 +0200)]

ci : use default RISE RISC-V Runners (#21263)

ddh0 [Sun, 5 Apr 2026 14:14:02 +0000 (09:14 -0500)]

server : fix logging of build + system info (#21460)

This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.

M1DNYT3 [Sun, 5 Apr 2026 01:04:00 +0000 (04:04 +0300)]

ci: lower cuda12 floor to 12.8.1 for broader host compatibility (#21438)

Co-authored-by: M1DNYT3 <redacted>

Nicholas Sparks [Sun, 5 Apr 2026 00:59:51 +0000 (20:59 -0400)]

ci: fix vulkan workflow referencing non-existent action (#21442)

Aldehir Rojas [Sat, 4 Apr 2026 18:39:00 +0000 (13:39 -0500)]

common : add gemma 4 specialized parser (#21418)

* common : add gemma4 dedicated parser

* cont : add '<|tool_response>' as eog

* cont : emit JSON from Gemma4 tool call AST

* cont : more fixes

* cont : refactor convert function

* cont : refine rules and mapping

* cont : add more tests

* cont : clean up

* cont : remove autoparser gemma4 implementation

* cont : more cleanup

* cont : rename gemma4.jinja to match the others

* cont : add custom template to support interleaved thinking

* cont : preserve reasoning in model turns

* cont : fix initializer error

* cont : fix unused vars

* cont : fix accidental static

* cont : fix specialized_template signature

* fix extra semicolon

* remove debug line and extra space [no ci]

Dan Hoffman [Sat, 4 Apr 2026 14:11:19 +0000 (07:11 -0700)]

server: Fix undefined timing measurement errors in server context (#21201)

Co-authored-by: Dan Hoffman <redacted>

Adrien Gallouët [Sat, 4 Apr 2026 13:08:03 +0000 (15:08 +0200)]

common : respect specified tag, only fallback when tag is empty (#21413)

Signed-off-by: Adrien Gallouët <redacted>

SamareshSingh [Sat, 4 Apr 2026 11:05:10 +0000 (06:05 -0500)]

llama-model: read final_logit_softcapping for Gemma 4 (#21390)

Aman Gupta [Sat, 4 Apr 2026 07:06:34 +0000 (15:06 +0800)]

llama: add custom newline split for Gemma 4 (#21406)

Reese Levine [Fri, 3 Apr 2026 18:40:14 +0000 (11:40 -0700)]

ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278)

* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

* Start work on removing parameter buffer pools

* Simplify and optimize further

* simplify profile futures

* Fix stride

* Try using a single command buffer per batch

* formatting

Masato Nakasaka [Fri, 3 Apr 2026 17:16:44 +0000 (02:16 +0900)]

ci: Add Windows Vulkan backend testing on Intel (#21292)

* experimenting CI

* Experimenting CI fix for MinGW

* experimenting CI on Windows

* modified script for integration with VisualStudio

* added proxy handling

* adding python version for Windows execution

* fix iterator::end() dereference

* fixed proxy handling

* Fix errors occurring on Windows

* fixed ci script

* Reverted to master

* Stripping test items to simplify Windows test

* adjusting script for windows testing

* Changed shell

* Fixed shell

* Fixed shell

* Fix CI setting

* Fix CI setting

* Fix CI setting

* Experimenting ci fix

* Experimenting ci fix

* Experimenting ci fix

* Experimenting ci fix

* experimenting fix for unit test error

* Changed to use BUILD_LOW_PERF to skip python tests

* Fix CI

* Added option to specify Ninja generator

* Reverted proxy related changes

Yes You Can Have Your Own [Fri, 3 Apr 2026 17:02:27 +0000 (20:02 +0300)]

server: save and clear idle slots on new task (`--clear-idle`) (#20993)

* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)

* server: move idle slot KV clearing to slot release

The save "cost" is now paid by the finishing request.

* server: add --kv-clear-idle flag, enable by default

* server: skip clearing last idle slot, clear on launch

* server: test --no-kv-clear-idle flag

* server: simplify on-release clearing loop

* server: remove on-release KV clearing, keep launch-only

* cont : clean-up

* tests: update log strings after --clear-idle rename

* tests: use debug tags instead of log message matching

* test: fix Windows CI by dropping temp log file unlink

---------

Co-authored-by: Georgi Gerganov <redacted>

Piotr Wilkin (ilintar) [Fri, 3 Apr 2026 15:51:52 +0000 (17:51 +0200)]

common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230)

* Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers

* Rename

* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Samanvya Tripathi [Fri, 3 Apr 2026 15:51:23 +0000 (11:51 -0400)]

common : fix tool call type detection for nullable and enum schemas (#21327)

* common : fix tool call type detection for nullable and enum schemas

* common, tests : fix grammar delegation for nullable/enum schemas and add tests

Fix enum type inference to scan all enum values (not just index 0) so
schemas like {"enum": [0, "celsius"]} correctly detect string type.

Fix schema_delegates in peg-parser to handle nullable type arrays
(["string", "null"]) and typeless enum schemas in raw mode, allowing
the tagged parser to use raw text instead of JSON-formatted strings.

Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
- nullable string ["string", "null"]
- nullable string with null first ["null", "string"]
- nullable integer ["integer", "null"]
- enum without explicit type key
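
The type-detection rules described in this commit can be sketched roughly as follows. This is an illustrative Python version, not the project's C++ code: `infer_schema_type` is a hypothetical name, and the choice to return the first non-null entry of a nullable type array is an assumption made for the sketch.

```python
from typing import Optional


def infer_schema_type(schema: dict) -> Optional[str]:
    """Guess which JSON type a tool-call argument schema describes."""
    declared = schema.get("type")
    if isinstance(declared, str):
        return declared
    if isinstance(declared, list):
        # Nullable form such as ["string", "null"] or ["null", "string"]:
        # ignore "null" and use the remaining type (assumption: first match).
        non_null = [t for t in declared if t != "null"]
        return non_null[0] if non_null else "null"
    # Typeless enum: scan every value, not only the one at index 0.
    values = schema.get("enum", [])
    if any(isinstance(v, str) for v in values):
        return "string"
    return None
```

Under these assumptions, `{"enum": [0, "celsius"]}` is detected as a string schema even though its first value is an integer, and both orderings of a nullable string array resolve to `"string"`.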

M1DNYT3 [Fri, 3 Apr 2026 13:06:45 +0000 (16:06 +0300)]

docker : bump cuda12 to 12.9.1 (#20920)

Co-authored-by: M1DNYT3 <redacted>
Co-authored-by: CISC <redacted>

jeromew [Fri, 3 Apr 2026 13:05:14 +0000 (15:05 +0200)]

docs: Update build.md: HSA_OVERRIDE_GFX_VERSION clarification (#21331)

The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture.

This does not work on Windows and never has. The clarification should keep Windows users from being steered toward a workaround that cannot work for them.

Sigbjørn Skjæret [Fri, 3 Apr 2026 13:03:33 +0000 (15:03 +0200)]

jinja: coerce input for string-specific filters (#21370)

Aaron Teo [Fri, 3 Apr 2026 12:50:00 +0000 (20:50 +0800)]

ci: add more binary checks (#21349)

Piotr Wilkin (ilintar) [Fri, 3 Apr 2026 11:40:41 +0000 (13:40 +0200)]

fix: remove stale assert (#21369)

uvos [Fri, 3 Apr 2026 09:38:22 +0000 (11:38 +0200)]

HIP: build each CI build test for a different architecture (#21337)

This helps improve our chances of finding build failures before the release workflow
builds for all architectures.

Tillerino [Fri, 3 Apr 2026 09:21:07 +0000 (11:21 +0200)]

fix: add openssl to nix dependencies (#21353) (#21355)

Vishal Singh [Fri, 3 Apr 2026 09:19:08 +0000 (14:49 +0530)]

ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315)

* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op falls back to the CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13

* ggml-zendnn : add braces to sgemm failure condition for consistency

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>

Piotr Wilkin (ilintar) [Fri, 3 Apr 2026 08:33:03 +0000 (10:33 +0200)]

vocab: fix Gemma4 tokenizer (#21343)

* seems to work

* fix case with new line

Co-authored-by: sayap <redacted>
* gemma 4: fix pre tok regex

---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: sayap <redacted>

Radoslav Gerganov [Fri, 3 Apr 2026 07:28:09 +0000 (10:28 +0300)]

rpc : reuse compute graph buffers (#21299)

Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.

ref: #21265
ref: #20315
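
To illustrate why reusing a buffer helps when a cache is keyed by buffer address, here is a toy Python model. Everything in it (`GraphCache`, `run`, the address values) is hypothetical and only mimics the pattern the commit message describes; it does not reproduce the RPC server or the CUDA backend.

```python
class GraphCache:
    """Toy cache keyed by a buffer address; entries are never evicted."""

    def __init__(self):
        self.entries = {}

    def lookup(self, buffer_addr: int) -> str:
        # A new address always creates a new entry.
        return self.entries.setdefault(buffer_addr, f"graph@{buffer_addr:#x}")


def run(requests: int, reuse_buffer: bool) -> int:
    """Return how many cache entries exist after `requests` graph builds."""
    cache = GraphCache()
    fixed_addr = 0x1000
    for i in range(requests):
        # Without reuse, every request sees a freshly allocated buffer
        # (modelled here as a different address each time).
        addr = fixed_addr if reuse_buffer else fixed_addr + i * 0x100
        cache.lookup(addr)
    return len(cache.entries)
```

In this model the cache grows with every request when buffers are freshly allocated, and stays at a single entry when the same buffer is reused.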

Georgi Gerganov [Fri, 3 Apr 2026 06:07:59 +0000 (09:07 +0300)]

chat : avoid including json in chat.h (#21306)

Georgi Gerganov [Fri, 3 Apr 2026 06:07:01 +0000 (09:07 +0300)]

(revert) kv-cache : do not quantize SWA KV cache (#21332)

This reverts commit 17193cce34036a6488b092ca79313d4ee1f895f5.

Vishal Singh [Fri, 3 Apr 2026 02:35:15 +0000 (08:05 +0530)]

ci : add AMD ZenDNN label to PR labeler (#21345)

* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files

* ci : rename label AMD CPU to AMD ZenDNN in labeler config

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>

Slobodan Josic [Thu, 2 Apr 2026 22:59:20 +0000 (00:59 +0200)]

[HIP] Bump ROCm version to 7.2.1 (#21066)

Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Piotr Wilkin (ilintar) [Thu, 2 Apr 2026 21:31:02 +0000 (23:31 +0200)]

fix: gemma 4 template (#21326)

Bartowski [Thu, 2 Apr 2026 20:53:58 +0000 (16:53 -0400)]

tests : add unit test coverage for llama_tensor_get_type (#20112)

* Add unit test coverage for llama_tensor_get_type

* Fix merge conflicts, add more schemas

* clang formatter changes

* Trailing whitespace

* Update name

* Start rebase

* Updating files with upstream changes prior to rebase

* Changes needed from rebase

* Update attn_qkv schema, change throw behaviour

* Fix merge conflicts

* White space

* Update with latest changes to state counters

* Revert accidental personal CLAUDE.md changes

* Change quotation mark

* Reuse metadata.name since we have it

* Move test-only stuff out of llama-quant.cpp

* Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns

* cont : initial deslop guidelines

* Cleanup based on review comments

* Continue cleanup

* Small cleanup

* Manually set proper ordering of tensors, mostly applies to gemma

* Formatting

* Update tests/test-quant-type-selection.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fix merge conflicts

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

Zheyuan Chen [Thu, 2 Apr 2026 17:40:42 +0000 (10:40 -0700)]

ggml-webgpu: add vectorized flash attention (#20709)

* naive vectorized version

* add vectorized flash attention

* update vec version

* remove unused path and shader

* remove unused helper functions

* add comments

* remove pad path

* ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization

* change back to vec4

* enable multi split

* enable vec path when:
- Q->ne[1] < 20
- Q->ne[0] % 32 == 0
- V->ne[0] % 4 == 0
- K->type == f16

* update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select

* enable vec path for q4 and q8

* flash-attn vec nwg=1 fast path (skip tmp/reduce staging)

* use packed f16 K loads in flash-attn vec split

* use packed f16 K loads in flash-attn vec split on host side

* tune flash-attn vec f16 VEC_NE by head dim

* cleanup

* cleanup

* keep host side clean

* cleanup host side

* change back to original host wait/submit behavior

* formatting

* reverted param-buffer pool refactor

* add helper functions

* ggml-webgpu: move flash-attn vec pipeline caching back into shader lib

* ggml-webgpu: remove duplicate functions

* ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation

* ggml-webgpu: revert unrelated change

* ggml-webgpu: revert deleted comment

* disable uniformity check

* remove unnecessary change

* Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl

* Update ggml/src/ggml-webgpu/ggml-webgpu.cpp

---------

Co-authored-by: Reese Levine <redacted>
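
The vec-path conditions listed in one of the bullets above can be read as a single predicate. The Python sketch below encodes only those four conditions as written; `use_vec_path` and its argument layout are hypothetical, and a later bullet in the same commit ("enable vec path for q4 and q8") suggests the K-type condition was subsequently relaxed, so this is not the final selection logic.

```python
def use_vec_path(q_ne, v_ne, k_type: str) -> bool:
    """Check the four conditions from the commit bullet.

    q_ne and v_ne are shape tuples indexed like ggml's ne[] arrays;
    k_type is a plain string label used only for this sketch.
    """
    return (
        q_ne[1] < 20            # Q->ne[1] < 20
        and q_ne[0] % 32 == 0   # Q->ne[0] % 32 == 0
        and v_ne[0] % 4 == 0    # V->ne[0] % 4 == 0
        and k_type == "f16"     # K->type == f16
    )
```

For example, a query of shape `(128, 1)` with a value head size of 128 and an f16 K cache satisfies all four conditions, while a query with 32 rows in `ne[1]` does not.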

Ruben Ortlam [Thu, 2 Apr 2026 16:19:20 +0000 (18:19 +0200)]

tests: allow exporting graph ops from HF file without downloading weights (#21182)

* tests: allow exporting graph ops from HF file without downloading weights

* use unique_ptr for llama_context in HF metadata case

* fix missing non-required tensors falling back to type f32

* use unique pointers where possible

* use no_alloc instead of fixing f32 fallback

* fix missing space

Xuan-Son Nguyen [Thu, 2 Apr 2026 15:10:32 +0000 (17:10 +0200)]

model, mtmd: fix gguf conversion for audio/vision mmproj (#21309)

* fix gguf conversion for audio/vision mmproj

* fix test

Aldehir Rojas [Thu, 2 Apr 2026 13:59:59 +0000 (08:59 -0500)]

common : add commentary rules for gpt-oss-20b (#21286)

Piotr Wilkin (ilintar) [Thu, 2 Apr 2026 09:29:11 +0000 (11:29 +0200)]

Relax prefill parser to allow space. (#21240)

* Relax prefill parser to allow space.

* Move changes from prefix() to parser generation

* Only allow spaces if we're not having a pure content parser next

Jesus Talavera [Thu, 2 Apr 2026 09:28:56 +0000 (11:28 +0200)]

chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804)

* chat : add Granite 4.0 chat template with correct tool_call role mapping

Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).

The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.

Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
(preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)

Co-Authored-By: Claude Opus 4.6 <redacted>
* Code review: follow standard format and use common logic in test-chat-template.cpp

* Rename custom_conversation variable for extra_conversation to give it a more meaningful name

---------

Co-authored-by: Claude Opus 4.6 <redacted>
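
The detection rule stated in the Granite commit above (`<|start_of_role|>` plus either `<tool_call>` or `<tools>` selects 4.0, otherwise 3.x) can be sketched as a small substring check. The function name and the returned labels are hypothetical; the actual handler lives in the project's C++ chat-template code and may check additional markers.

```python
def detect_granite_template(template_src: str) -> str:
    """Classify a chat template string using the markers named in the commit."""
    if "<|start_of_role|>" not in template_src:
        return "not-granite"  # label invented for this sketch
    if "<tool_call>" in template_src or "<tools>" in template_src:
        return "granite-4.0"
    return "granite-3.x"
```

A template that contains `<|start_of_role|>` but neither tool marker keeps the existing 3.x behavior, which matches the commit's note that the 3.x path is unchanged.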

Georgi Gerganov [Thu, 2 Apr 2026 08:54:05 +0000 (11:54 +0300)]

kv-cache : do not quantize SWA KV cache (#21277)

Roger Chen [Thu, 2 Apr 2026 08:41:19 +0000 (16:41 +0800)]

Ignore Transfer-Encoding header. (#20269)

Georgi Gerganov [Thu, 2 Apr 2026 07:38:24 +0000 (10:38 +0300)]

sync : ggml

Georgi Gerganov [Thu, 2 Apr 2026 07:37:26 +0000 (10:37 +0300)]

ggml : bump version to 0.9.11 (ggml/1456)

Neo Zhang [Thu, 2 Apr 2026 07:08:32 +0000 (15:08 +0800)]

sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283)

Todor Boinovski [Thu, 2 Apr 2026 00:44:02 +0000 (17:44 -0700)]

hexagon : add cumsum op support (#21246)

* hexagon : add cumsum op support

* hexagon: enable dma for cumsum op

* Fix line-ending

---------

Co-authored-by: Max Krasnyansky <redacted>

Xuan-Son Nguyen [Wed, 1 Apr 2026 21:31:51 +0000 (23:31 +0200)]

contrib : rewrite AGENTS.md, make it more clear about project values (#21270)

* contrib : rewrite AGENTS.md, make it more clear about types of permitted AI usage

* permit AI for writing code

lhez [Wed, 1 Apr 2026 19:54:58 +0000 (12:54 -0700)]

opencl: fix leak in Adreno q8_0 path (#21212)

Aleksander Grygier [Wed, 1 Apr 2026 19:32:15 +0000 (21:32 +0200)]

server: Bypass API Key validation for WebUI static bundle assets (#21269)

* fix: Bypass API Key validation for static bundle assets

* refactor: All bypassed routes in `public_endpoints`

* test: Update static assets API Key test

Johannes Gäßler [Wed, 1 Apr 2026 19:28:19 +0000 (21:28 +0200)]

CUDA: fix FA kernel selection logic (#21271)

Martin Klacer [Wed, 1 Apr 2026 17:02:41 +0000 (18:02 +0100)]

kleidiai: add CPU feature detection to CI run script (#20394)

* kleidiai: add cpu feature detection to CI run script

Signed-off-by: Martin Klacer <redacted>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a

* kleidiai: revert unrelated requirements change

Signed-off-by: Martin Klacer <redacted>
* kleidiai: removed cpu feature detection from CI run script

* As per the maintainers' suggestion, removed cpu feature detection
from CI run script as CMake handles it already

Signed-off-by: Martin Klacer <redacted>
---------

Signed-off-by: Martin Klacer <redacted>

Nikhil Jain [Wed, 1 Apr 2026 16:53:05 +0000 (09:53 -0700)]

Update Dawn version in WebGPU CI (#20784)

* Pin Dawn version

* Update docs with new Dawn commit hash

Aparna M P [Wed, 1 Apr 2026 15:43:08 +0000 (21:13 +0530)]

hexagon: improve RMS_NORM and DIV accuracy (#21251)

* hexagon-rms_norm: fix RMS_NORM for non-aligned tensor sizes

Co-authored-by: Krishna Sridhar <redacted>
* hexagon-div: perform DIV in fp16 domain for lower dsp archs

---------

Co-authored-by: Krishna Sridhar <redacted>

Jonathan [Wed, 1 Apr 2026 14:22:44 +0000 (07:22 -0700)]

fix: tool call parsing for LFM2 and LFM2.5 models (#21242)

* fix: tool call parsing for LFM2 and LFM2.5 models

* refactor: add test / break out lfm2 and lfm2.5 parsing logic

Georgi Gerganov [Wed, 1 Apr 2026 13:58:01 +0000 (16:58 +0300)]

llama : rotate activations for better quantization (#21038)

* llama : rotate activations for better quantization

* cont : rotate V more + refactor

* cont : rotate caches separately + support non-power-of-2 head sizes

* cont : simplify

* cont : add reference for V rotation

* cont : refactor

* cont : support context shift

* cont : consolidate

* cont : dedup + allow different types for the rotation matrix

* cont : add env variable to disable rotation

* cont : simplify attn rot kv cache logic + rename env

* cont : pre-compute the Hadamard matrices
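
As background for the Hadamard matrices mentioned in this commit, the sketch below shows the general idea of rotating vectors with an orthonormal Hadamard matrix: dot products are preserved while a single large outlier is spread across all components, which tends to be friendlier to quantization. This is a generic illustration under stated assumptions, not the llama.cpp implementation; it uses the Sylvester construction, handles only power-of-two sizes (the commit also mentions non-power-of-2 head sizes, which are not covered here), and all names are hypothetical.

```python
import math


def hadamard(n: int):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n = 2^k)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubling step
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


H = hadamard(4)
q = [8.0, 0.0, 0.0, 0.0]  # one large outlier
k = [1.0, 2.0, 3.0, 4.0]
q_rot, k_rot = rotate(H, q), rotate(H, k)
```

Because the matrix is orthonormal, `dot(q_rot, k_rot)` equals `dot(q, k)` up to floating-point error, while the largest magnitude in `q_rot` is smaller than the outlier in `q`.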

Xuan-Son Nguyen [Wed, 1 Apr 2026 13:31:58 +0000 (15:31 +0200)]

scripts: add function call test script (#21234)

* scripts: add function call test script

* add reasoning_content

* fix lint

Georgi Gerganov [Wed, 1 Apr 2026 13:02:34 +0000 (16:02 +0300)]

sync : ggml

Georgi Gerganov [Wed, 1 Apr 2026 13:01:45 +0000 (16:01 +0300)]

ggml : bump version to 0.9.10 (ggml/1454)

Neo Zhang [Wed, 1 Apr 2026 10:54:15 +0000 (18:54 +0800)]

sycl : support nvfp4 type in mul_mat (#21227)

Michael Wand [Wed, 1 Apr 2026 10:04:58 +0000 (03:04 -0700)]

ggml-cuda: Add generic NVFP4 MMQ kernel (#21074)

* Introduced NVFP4 generic MMQ kernel

* Added extra FP8 guard, hope to solve ci HIP failure

* Rename tiles and use HIP_FP8_AVAILABLE

* Removed remaining FP8 straggler and added const int

* Const

* Removed DECL_MMQ_CASE artifact

* Removed newline

* Removed space after else

* Changed HIP FP8 NVFP4 conversion gate

* Added new line to bottom of mmq.cu 270

* Removed extra spaces

* Removed single space in front of else on line 814

* Added NVFP4 to generate cu script so HIP can see it, further tightened logic

* Include generated mmq-instance-nvfp4.cu

* Added NVFP4 mmq to HIP Check ignore list

* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/mmq.cuh

Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/mmq.cuh

Added function name ending for end if

Co-authored-by: Johannes Gäßler <redacted>
* Added function names to closing endif

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>

Ettore Di Giacinto [Wed, 1 Apr 2026 09:50:17 +0000 (11:50 +0200)]

memory: respect unified KV cache in hybrid memory for eval tasks (#21224)

The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

  split_equal: sequential split is not supported when there are
  coupled sequences in the input batch (you may need to use the
  -kvu flag)

PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.

This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):
- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%
All previously failed with llama_decode() error.

uvos [Wed, 1 Apr 2026 08:21:20 +0000 (10:21 +0200)]

CUDA/HIP: Fix kernel selection for mmvq mmid kernel to align host selection with device launch bounds (#21238)

The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-NVIDIA devices. As a result, on HIP devices the host attempts to launch the kernel for batch sizes that need larger configurations than the device launch bounds allow. This PR fixes the conditionals in get_mmvq_mmid_max_batch.

Fixes #21191
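
The pitfall described above can be reproduced with a toy model in which non-NVIDIA devices are encoded as compute-capability values above a large vendor offset. The constants and function names below are assumptions chosen for illustration and are not copied from the ggml sources; the point is only that an open-ended `>=` comparison also matches every value above the offset unless the vendor is checked first.

```python
# Illustrative values, assumed for this sketch only.
CC_VOLTA = 700
CC_TURING = 750
CC_ADA_LOVELACE = 890
VENDOR_OFFSET = 0x1000000  # values at or above this are "not NVIDIA" here


def naive_matches(cc: int) -> bool:
    """Open-ended comparison: unintentionally true for offset-encoded devices."""
    return cc == CC_VOLTA or cc >= CC_ADA_LOVELACE


def guarded_matches(cc: int) -> bool:
    """Same condition, but only evaluated for NVIDIA-range values."""
    is_nvidia = cc < VENDOR_OFFSET
    return is_nvidia and (cc == CC_VOLTA or cc >= CC_ADA_LOVELACE)


amd_like_cc = VENDOR_OFFSET + 0x908  # hypothetical non-NVIDIA device
```

Here `naive_matches(amd_like_cc)` is true even though the device is not an NVIDIA part, while `guarded_matches` rejects it and still accepts a genuine Ada Lovelace value.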

Georgi Gerganov [Wed, 1 Apr 2026 08:10:25 +0000 (11:10 +0300)]

ggml : fix RWKV ops thread assignment (#21226)

Taimur Ahmad [Wed, 1 Apr 2026 08:10:03 +0000 (13:10 +0500)]

ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)

* ggml-cpu: refactor sgemm; fix rvv checks

* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off

Anav Prasad [Wed, 1 Apr 2026 07:07:24 +0000 (07:07 +0000)]

CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)

* flash attention support for head dimension 512 added

* FA D=512 - match 576 configs, limit ncols2, revert vec cap

* fix HIP tile kernel build for D=512

* fix HIP tile kernel occupancy for D=512 on AMD

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>
* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <redacted>

Ed Addario [Wed, 1 Apr 2026 05:43:00 +0000 (06:43 +0100)]

llama : refactor llama_model_quantize_params to expose a pure C interface (#20346)

* Refactor llama_model_quantize_params to expose a pure C interface

* Restore comment and cleanup struct def

* Code review refactoring

Co-authored-by: Georgi Gerganov <redacted>
* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <redacted>

Reese Levine [Wed, 1 Apr 2026 05:38:24 +0000 (22:38 -0700)]

ggml webgpu: quantized buffers to u32 + wider browser/device support (#21046)

* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

Abhijit Ramesh [Tue, 31 Mar 2026 22:38:16 +0000 (15:38 -0700)]

ggml-webgpu: port all AOT operators to JIT (#20728)

* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion

Aleksander Grygier [Tue, 31 Mar 2026 15:47:46 +0000 (17:47 +0200)]

fix: Use lower-case proxy headers naming (#21235)

Adrien Gallouët [Tue, 31 Mar 2026 14:18:00 +0000 (16:18 +0200)]

common : cleanup logs and modernize the progress bar (#21215)

```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <redacted>

hipudding [Tue, 31 Mar 2026 14:00:51 +0000 (22:00 +0800)]

CANN: fix multi-thread set_tensor race conditions (#20151)

* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
   before uploading to device. Per-chunk transforms produced corrupt data.

2. ND-to-NZ weight conversion requires complete tensor data on device.
   Per-chunk conversion operated on incomplete data.

3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.
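
The deferred-transform idea can be sketched as follows. This Python class borrows the name `TensorSetTracker` from the commit message, but its fields, methods, and the `finalize` callback are hypothetical; the sketch also assumes that chunks never overlap, so counting written bytes is enough to know when the tensor is complete.

```python
import threading


class TensorSetTracker:
    """Stage chunked writes in a host buffer and finalize once complete."""

    def __init__(self, total_size: int, finalize):
        self.staging = bytearray(total_size)
        self.written = 0
        self.finalize = finalize       # e.g. transform + upload to device
        self.lock = threading.Lock()   # writers may run on several threads

    def write_chunk(self, offset: int, data: bytes) -> bool:
        """Record one chunk; return True if this chunk completed the tensor."""
        with self.lock:
            self.staging[offset:offset + len(data)] = data
            self.written += len(data)
            done = self.written == len(self.staging)
        if done:
            # Only now is the full tensor available for a whole-tensor transform.
            self.finalize(bytes(self.staging))
        return done
```

Chunks may arrive in any order; the callback runs exactly once, after the last missing chunk has been staged.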

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
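
The reference formula quoted above, scale = 1/fmaxf(norm, eps), can be written out as a short Python function. The function name is hypothetical; only the clamping formula comes from the commit message.

```python
import math


def l2_norm(xs, eps: float):
    """Scale xs by 1 / max(||xs||_2, eps), mirroring the quoted CPU reference."""
    norm = math.sqrt(sum(x * x for x in xs))
    scale = 1.0 / max(norm, eps)  # clamp the norm to at least eps
    return [x * scale for x in xs]
```

For the vector `[3.0, 4.0]` (norm 5), a tiny eps gives roughly `[0.6, 0.8]`, while eps = 10.0 dominates the norm and gives roughly `[0.3, 0.4]`. Ignoring eps would return the first result in both cases, which is the kind of mismatch the fix addresses.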

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.

Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
  so the cache can distinguish tensors with different types but
  identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
  SCALE/UNARY/GLU/ROPE/POOL_2D
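
The cache-matching change can be illustrated with a small value type. The Python dataclass below is a hypothetical stand-in for `ggml_graph_node_properties`: the field names `node_type` and `src_types` echo the fields named in the commit message, but the layout and the type labels are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProps:
    op: str
    ne: tuple          # element counts per dimension
    nb: tuple          # byte strides per dimension
    node_type: str     # added: output tensor type
    src_types: tuple   # added: source tensor types
    op_params: tuple   # now compared for every op


def has_matching_properties(cached: NodeProps, current: NodeProps) -> bool:
    """A cached graph node may be reused only if every property matches."""
    return cached == current


# f16 and bf16 both use 2-byte elements, so ne and nb are identical.
cpy_f16 = NodeProps("CPY", (64, 4), (2, 128), "f16", ("f32",), ())
cpy_bf16 = NodeProps("CPY", (64, 4), (2, 128), "bf16", ("f32",), ())
```

Comparing only `op`, `ne`, and `nb` would treat the two nodes as interchangeable; including the type fields tells them apart.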

Xuan-Son Nguyen [Tue, 31 Mar 2026 13:44:26 +0000 (15:44 +0200)]

server: (webui) no more gzip compression (#21073)

* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a3531593d87b724d404c8727a96becab3ab07.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <redacted>

commit | commitdiff | tree

Aldehir Rojas [Tue, 31 Mar 2026 11:52:42 +0000 (06:52 -0500)]

common : gpt-oss handle builtin and unsolicited tool calls (#21213)

commit | commitdiff | tree

lainon1 [Tue, 31 Mar 2026 11:50:51 +0000 (12:50 +0100)]

fix: correct misspellings in code comments (#21217)

- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)

commit | commitdiff | tree

Seungmin Kim [Tue, 31 Mar 2026 11:02:56 +0000 (20:02 +0900)]

CI: Enable CPU and Vulkan ARM64 Release (#21207)

commit | commitdiff | tree

Georgi Gerganov [Tue, 31 Mar 2026 10:08:13 +0000 (13:08 +0300)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Mon, 30 Mar 2026 15:34:29 +0000 (18:34 +0300)]

ggml : bump version to 0.9.9 (ggml/1449)

commit | commitdiff | tree

Adrien Gallouët [Tue, 31 Mar 2026 10:53:41 +0000 (12:53 +0200)]

common : move up common_init() and fix Windows UTF-8 logs (#21176)

The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Neo Zhang [Tue, 31 Mar 2026 10:31:50 +0000 (18:31 +0800)]

sycl : enhance fattn perf (#21185)

commit | commitdiff | tree

mtmcp [Tue, 31 Mar 2026 10:04:42 +0000 (07:04 -0300)]

common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082)

* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e584cae3fa8cee1d739a858a2d780f1de009

* Add regression test

* Remove regression test for init-fail sampler check

commit | commitdiff | tree

SATISH K C [Tue, 31 Mar 2026 08:52:34 +0000 (03:52 -0500)]

fix: include API key in CORS proxy requests for MCP connections (#21193)

* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes #21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Tue, 31 Mar 2026 08:42:06 +0000 (10:42 +0200)]

server/webui: cleanup dual representation approach, simplify to openai-compat (#21090)

* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>

commit | commitdiff | tree

Adrien Gallouët [Tue, 31 Mar 2026 07:21:54 +0000 (09:21 +0200)]

vendor : update BoringSSL to 0.20260327.0 (#21211)

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Galunid [Tue, 31 Mar 2026 07:14:01 +0000 (09:14 +0200)]

common : Disable backend sampling if reasoning budget is enabled (#21209)

commit | commitdiff | tree

shaofeiqi [Mon, 30 Mar 2026 19:19:16 +0000 (12:19 -0700)]

opencl: add q4_K gemm and gemv kernels for Adreno (#20919)

* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <redacted>

commit | commitdiff | tree

Seungmin Kim [Mon, 30 Mar 2026 18:24:37 +0000 (03:24 +0900)]

CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)

* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <redacted>

* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <redacted>

---------

Co-authored-by: Ts-sound <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Zhihao "Zephyr" Yao [Mon, 30 Mar 2026 18:08:46 +0000 (14:08 -0400)]

jinja : handle empty expressions correctly (#20913)

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <redacted>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Oliver Simons [Mon, 30 Mar 2026 14:20:00 +0000 (16:20 +0200)]

CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
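
The off-by-one can be illustrated on the host. The sketch below assumes an initialization kernel in which each thread writes one offset entry guarded by `i <= nrows`; that kernel shape and the helper `offsets_written` are assumptions for illustration, not the actual CUDA code:

```cpp
#include <algorithm>
#include <cassert>

inline int ceildiv(int a, int b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of offset entries written when `grid` blocks of `block_size`
// threads each write offsets[i] for their global index i, guarded by
// i <= nrows. nrows segments need nrows + 1 offsets in total.
inline int offsets_written(int nrows, int block_size, int grid) {
    return std::min(grid * block_size, nrows + 1);
}
```

With nrows = 256 and block_size = 256, `ceildiv(nrows, block_size)` launches a single block and writes only 256 of the 257 required entries, while `ceildiv(nrows + 1, block_size)` launches two blocks and covers all of them. When nrows is not a multiple of block_size, the last block has spare threads, so the old formula happens to work.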

Fixes #21162

* Reduce nrows in test case to 256, don't need 768

commit | commitdiff | tree

Radoslav Gerganov [Mon, 30 Mar 2026 14:05:11 +0000 (17:05 +0300)]

rpc : fix misleading error log (#21184)

When RPC is running with a remote backend that doesn't have an
init_tensor function (like CPU and Metal), the server log fills up with
error messages saying that init_tensor is being called with a null
buffer, which is incorrect. This patch fixes that.

commit | commitdiff | tree

Aleksander Grygier [Mon, 30 Mar 2026 12:40:50 +0000 (14:40 +0200)]

webui: Fix branching logic on edit message (#21175)

* fix: Branching logic + small refactor

* chore: update webui build output

commit | commitdiff | tree

Aman Gupta [Mon, 30 Mar 2026 09:40:17 +0000 (17:40 +0800)]

llama-model-loader: print warning when using overrides with mmap (#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning

commit | commitdiff | tree

Sigbjørn Skjæret [Mon, 30 Mar 2026 07:29:15 +0000 (09:29 +0200)]

ci : bump ty to 0.0.26 (#21156)

* fix incorrect type ignore comments

* bump ty to 0.0.26

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 30 Mar 2026 06:59:16 +0000 (08:59 +0200)]

server: wrap headers for mcp proxy (#21072)

* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <redacted>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Aleksander Grygier <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 29 Mar 2026 17:45:40 +0000 (19:45 +0200)]

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)

commit | commitdiff | tree

Gaurav Garg [Sun, 29 Mar 2026 16:35:18 +0000 (22:05 +0530)]

Optimize MOE GEMV kernel for BS > 1. (#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 launched too many thread blocks (nrows_x, nchannels_dst, ncols_dst), each with very little work: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case and uses grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).

This change does not increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
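
As a rough host-side illustration of the launch shapes named above, the sketch below counts thread blocks for both grids. It assumes `rpb` means rows per block, which the message does not define, and the helper names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

inline int64_t ceildiv(int64_t a, int64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Old launch: grid (nrows_x, nchannels_dst, ncols_dst).
inline int64_t blocks_old(int64_t nrows_x, int64_t nchannels_dst, int64_t ncols_dst) {
    return nrows_x * nchannels_dst * ncols_dst;
}

// New launch: grid (ceil(nrows_x / rpb), nchannels_dst).
// rpb is assumed here to be the number of rows handled per block.
inline int64_t blocks_new(int64_t nrows_x, int64_t nchannels_dst, int64_t rpb) {
    return ceildiv(nrows_x, rpb) * nchannels_dst;
}
```

For example, with nrows_x = 4096, nchannels_dst = 8, ncols_dst = 4 and an assumed rpb = 2, the block count drops from 131072 to 16384, with each block doing correspondingly more work.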

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <redacted>

Packaging of ggml-org/llama.cpp
