git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
2 months agofix: UI single model selection in router mode (#19767)
crsawyer [Sat, 21 Feb 2026 08:28:39 +0000 (02:28 -0600)]
fix: UI single model selection in router mode (#19767)

2 months agohexagon : fix build release (#19444) (#19587)
Mengsheng Wu [Sat, 21 Feb 2026 00:40:00 +0000 (16:40 -0800)]
hexagon : fix build release (#19444) (#19587)

2 months agocommon : merge qwen3-coder and nemotron nano 3 parsers (#19765)
Aldehir Rojas [Fri, 20 Feb 2026 22:22:22 +0000 (16:22 -0600)]
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test

2 months agoggml-cpu: add RVV vec dot kernels for quantization types (#18784)
Taimur Ahmad [Fri, 20 Feb 2026 11:30:07 +0000 (16:30 +0500)]
ggml-cpu: add RVV vec dot kernels for quantization types (#18784)

* ggml-cpu: add rvv vec_dot for iq2_s

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for iq3_s

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for tq1_0, tq2_0

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for iq1_s, iq1_m

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add vlen switch for rvv vec_dot

---------

Co-authored-by: Rehan Qasim <redacted>
2 months agoquantize : add --dry-run option (#19526)
ddh0 [Fri, 20 Feb 2026 08:20:16 +0000 (02:20 -0600)]
quantize : add --dry-run option (#19526)

* clean slate for branch

* use 6 characters for tensor dims

* add --dry-run to llama-quantize

* use 6 characters for tensor dims (cont.)

* no need to re-calculate ggml_nbytes for tensor

* fix indent

* show model and quant BPW when quant completes

* add example to --help

* new function `tensor_requires_imatrix`, add courtesy warning about imatrix

* missing __func__, move imatrix flag set

* logic error

* fixup tensor_requires_imatrix

* add missing `GGML_TYPE`s

* simplify and rename `tensor_type_requires_imatrix`

* simplify for style

* add back Q2_K edge case for imatrix

* guard ftype imatrix warning

* comment ref #12557

* remove per @compilade

* remove unused `params` parameter

* move `bool dry_run` per GG

* move `bool dry_run` per GG

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agotest: mul_mat tests with huge batch size (#19519)
Jeff Bolz [Fri, 20 Feb 2026 02:08:25 +0000 (18:08 -0800)]
test: mul_mat tests with huge batch size (#19519)

2 months agoWebUI hide models in router mode (#19374)
crsawyer [Thu, 19 Feb 2026 21:53:42 +0000 (15:53 -0600)]
WebUI hide models in router mode (#19374)

2 months agocommon : fix Step-3.5-Flash format detection and thinking support (#19635)
Jesse Posner [Thu, 19 Feb 2026 21:40:52 +0000 (13:40 -0800)]
common : fix Step-3.5-Flash format detection and thinking support (#19635)

* common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: https://github.com/ggml-org/llama.cpp/pull/19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.

2 months agocommon : fix gpt-oss Jinja error when assistant message has both content and thinking...
abhijitb11 [Thu, 19 Feb 2026 20:59:20 +0000 (12:59 -0800)]
common : fix gpt-oss Jinja error when assistant message has both content and thinking with tool calls (#19704)

2 months agoggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (#19700)
Masashi Yoshimura [Thu, 19 Feb 2026 16:18:30 +0000 (01:18 +0900)]
ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (#19700)

* ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support.

* Fix to cast the src value to f32 before sin/cos computing.

2 months agomodel: Add PaddleOCR-VL model support (#18825)
megemini [Thu, 19 Feb 2026 16:05:25 +0000 (00:05 +0800)]
model: Add PaddleOCR-VL model support (#18825)

* support PaddleOCR-VL

* clip: update PaddleOCR model loader parameters to prevent OOM during warmup

* [update] add paddleocr vl text model instead of ernie4.5

* [update] restore change of minicpmv

* [update] format

* [update] format

* [update] positions and patch merge permute

* [update] mtmd_decode_use_mrope for paddleocr

* [update] image min/max pixels

* [update] remove set_limit_image_tokens

* update: preprocess without padding

* clean up

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agovulkan: fix MMQ shader push constants and multi-dispatch (#19732)
Ruben Ortlam [Thu, 19 Feb 2026 13:59:16 +0000 (14:59 +0100)]
vulkan: fix MMQ shader push constants and multi-dispatch (#19732)

2 months agomodels : fix qwen3.5 beta/gate shapes (#19730)
Georgi Gerganov [Thu, 19 Feb 2026 13:19:53 +0000 (15:19 +0200)]
models : fix qwen3.5 beta/gate shapes (#19730)

* models : fix qwen3.5 beta/gate shapes

* cont : avoid extra reshapes

2 months agomtmd: build_attn modified, flash_attn on/off via ctx_params (#19729)
Saba Fallah [Thu, 19 Feb 2026 12:50:29 +0000 (13:50 +0100)]
mtmd: build_attn modified, flash_attn on/off via ctx_params (#19729)

2 months agomodel : add JAIS-2 architecture support (#19488)
3 a l i [Thu, 19 Feb 2026 12:30:17 +0000 (16:30 +0400)]
model : add JAIS-2 architecture support (#19488)

* model: add JAIS-2 architecture support

Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer

Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)

Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.

* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash

* fix: use NEOX RoPE type for JAIS2

* fix: remove Q/K permutation (NEOX RoPE doesn't need it)

* fix: enable flash attention for JAIS2 (fixed by #19115)

* fix: add dedicated JAIS2 pre-tokenizer type and control vector support

- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support

* no longer necessary to override set_vocab

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoCUDA: fix kernel selection logic for tile FA (#19686)
Johannes Gäßler [Thu, 19 Feb 2026 11:42:58 +0000 (12:42 +0100)]
CUDA: fix kernel selection logic for tile FA (#19686)

* CUDA: fix kernel selection logic for tile FA

* add comment

2 months agomtmd : chat : Fix extra \n between text and media marker (#19595)
Tarek Dakhran [Thu, 19 Feb 2026 11:18:57 +0000 (12:18 +0100)]
mtmd : chat : Fix extra \n between text and media marker (#19595)

* mtmd : chat : Fix extra \n between text and media marker

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces output identical to the HF implementation.

However `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.

This PR introduces a new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change the number of input tokens is identical to the HF
implementation and, as a result, the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

* Refactor to use explicit per-type ifs

* Update common/chat.cpp

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
* Update common_chat_templates_apply_legacy

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
2 months agowebui: Fix Attachments not being included in completion request (#19731)
Aleksander Grygier [Thu, 19 Feb 2026 09:27:38 +0000 (10:27 +0100)]
webui: Fix Attachments not being included in completion request (#19731)

* fix: Add missing argument

* chore: update webui build output

2 months agomodel : add tokenizer from LFM2.5-Audio-1.5B (#19687)
Tarek Dakhran [Thu, 19 Feb 2026 08:54:48 +0000 (09:54 +0100)]
model : add tokenizer from LFM2.5-Audio-1.5B (#19687)

* model : Add tokenizer from LFM2.5-Audio-1.5B

[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) introduced a lightweight audio tokenizer.

The tokenizer is based on the LFM2 architecture and acts as an "embedding"
model with different input `n_embd` and output `n_embd_out`.

To be used in https://github.com/ggml-org/llama.cpp/pull/18641.

To convert, use:

```shell
python3 convert_hf_to_gguf.py /path/to/LFM2.5-Audio-1.5B/audio_detokenizer
```

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Formatting

* Rework check for attention layers

* Add LFM2 SWA model support

* Address PR feedback

* Set vocab to none

* Move helper function definitions to cpp file

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agollama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)
Daniel Bevenius [Thu, 19 Feb 2026 08:48:08 +0000 (09:48 +0100)]
llama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)

This commit updates get_logits_ith(), and get_embeddings_ith() to use
output_resolve_row() to resolve the batch index to output row index.

The motivation for this is to remove some code duplication between these
functions.

2 months agomodel : full modern bert support (#18330)
Ryan Mangeno [Thu, 19 Feb 2026 07:52:21 +0000 (02:52 -0500)]
model : full modern bert support (#18330)

* full modern bert support

* added gelu op in rank pooling for modern bert

* still working on stuff, added mean calculation before classifier head

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* first layer is dense, as per modern bert research paper

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agollamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)
shalinib-ibm [Thu, 19 Feb 2026 06:28:53 +0000 (11:58 +0530)]
llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)

Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.

Performance Impact:
1.5x to 2x improvement in PP speed for Q4 and Q8 models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)

llama-bench Q8 Model Results:
 model                                  size       params   backend      threads              test  Base t/s Patch t/s
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10               pp8           64.48 ± 4.72           73.99 ± 0.27
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp16           80.11 ± 0.32          112.53 ± 0.40
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp32           89.10 ± 0.27          152.95 ± 0.68
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp64           93.65 ± 0.25          187.83 ± 0.83
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp128           99.93 ± 0.02          201.32 ± 0.11
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp256          102.32 ± 0.40          208.32 ± 0.41
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp512          103.42 ± 0.40          209.98 ± 0.14
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             tg128           20.35 ± 0.01           19.57 ± 0.01

llama-bench Q4 Model Results:
 model                                  size       params   backend      threads              test  Base t/s Patch t/s
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10               pp8           34.77 ± 0.10           41.23 ± 0.08
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp16           40.81 ± 0.04           64.55 ± 0.15
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp32           44.65 ± 0.05           90.84 ± 0.22
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp64           47.49 ± 0.03          114.39 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp128           49.29 ± 0.24          120.13 ± 0.19
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp256           49.77 ± 0.23          121.51 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp512           49.89 ± 0.23          117.52 ± 0.10
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             tg128           13.40 ± 0.01           13.37 ± 0.00

Llama perplexity Results:

Model                       Base Final PPL Estimate   Patch Final PPL Estimate
granite-4.0-h-micro-Q8_0    1.3862 +/- 0.04424        1.3868 +/- 0.04432
Meta-Llama3-8b Q4           1.3801 +/- 0.04116        1.3803 +/- 0.04116

Signed-off-by: Shalini.Salomi.Bodapati <redacted>
2 months agomodels : dedup qwen35 graphs (#19660)
Georgi Gerganov [Thu, 19 Feb 2026 06:17:49 +0000 (08:17 +0200)]
models : dedup qwen35 graphs (#19660)

* models : dedup qwen35 graphs

* cont : add missing sigmoid

2 months agomodels : dedup Kimi Linear delta net implementation (#19668)
ymcki [Thu, 19 Feb 2026 06:15:17 +0000 (14:15 +0800)]
models : dedup Kimi Linear delta net implementation (#19668)

* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

---------

Co-authored-by: Georgi Gerganov <redacted>
2 months agoAdd Jinja support for "indent" string filter (#19529)
Piotr Wilkin (ilintar) [Wed, 18 Feb 2026 23:25:52 +0000 (00:25 +0100)]
Add Jinja support for "indent" string filter (#19529)

* Add partial Jinja support for "indent" string filter

* Fully implement indent

* Add tests for all width variants.

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fix getline ignoring trailing newlines

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix first indent condition

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoggml webgpu: Fix bug in dispatching large matrix-vector multiplication (#19535)
Reese Levine [Wed, 18 Feb 2026 23:06:29 +0000 (16:06 -0700)]
ggml webgpu: Fix bug in dispatching large matrix-vector multiplication (#19535)

* Fix bug in dispatching large matrix-vector multiplication

2 months agoserver: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG...
matteo [Wed, 18 Feb 2026 17:53:37 +0000 (18:53 +0100)]
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) (#19622)

* save generated text for the /slots endpoint

* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0

* Apply suggestions from code review

---------

Co-authored-by: Matteo <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agomodel: support GLM-OCR (#19677)
Xuan-Son Nguyen [Wed, 18 Feb 2026 16:51:40 +0000 (17:51 +0100)]
model: support GLM-OCR (#19677)

* model: support GLM-OCR

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agodocs: Fix broken links for preparing models in Backends (#19684)
Maciej Lisowski [Wed, 18 Feb 2026 15:50:23 +0000 (16:50 +0100)]
docs: Fix broken links for preparing models in Backends (#19684)

2 months agoggml webgpu: shader library organization (#19530)
Reese Levine [Wed, 18 Feb 2026 14:51:02 +0000 (07:51 -0700)]
ggml webgpu: shader library organization (#19530)

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <redacted>
Co-authored-by: Reese Levine <redacted>
* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* flashattention and matrix multiplication moved to new format

* clean up preprocessing

* Formatting

* remove duplicate constants

* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <redacted>
2 months agoPre-MCP UI and architecture cleanup (#19689)
Aleksander Grygier [Wed, 18 Feb 2026 11:02:02 +0000 (12:02 +0100)]
Pre-MCP UI and architecture cleanup (#19689)

2 months agovulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)
Jeff Bolz [Wed, 18 Feb 2026 09:47:10 +0000 (01:47 -0800)]
vulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)

* vulkan: split mul_mat into multiple dispatches to avoid overflow

The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.

Fall back for the less common p021 and nc variants.

* address feedback

2 months agocommon : make small string helpers as inline functions (#19693)
Adrien Gallouët [Wed, 18 Feb 2026 07:03:01 +0000 (08:03 +0100)]
common : make small string helpers as inline functions (#19693)

Also use string_view where it makes sense and fix some corner cases.

Signed-off-by: Adrien Gallouët <redacted>
2 months agoopencl: refactor expm1 and softplus (#19404)
shaofeiqi [Tue, 17 Feb 2026 22:47:18 +0000 (14:47 -0800)]
opencl: refactor expm1 and softplus (#19404)

* opencl: refactor expm1

* opencl: refactor softplus

* opencl: use h for half literals

---------

Co-authored-by: Li He <redacted>
2 months agoopencl: optimize mean and sum_row kernels (#19614)
shaofeiqi [Tue, 17 Feb 2026 21:56:09 +0000 (13:56 -0800)]
opencl: optimize mean and sum_row kernels (#19614)

* opencl: optimize mean and sum_row kernels

* opencl: add comment for max subgroups

* opencl: format

---------

Co-authored-by: Li He <redacted>
2 months agomodel-conversion : add option to print tensor values (#19692)
Daniel Bevenius [Tue, 17 Feb 2026 19:43:22 +0000 (20:43 +0100)]
model-conversion : add option to print tensor values (#19692)

This commit updates the tensor-info.py script to support the option to
print the first N values of a tensor when displaying its information.

The motivation for this is that it can be useful to inspect some actual
values in addition to the shapes of the tensors.

2 months agoPre-MCP UI and architecture cleanup (#19685)
Aleksander Grygier [Tue, 17 Feb 2026 12:47:45 +0000 (13:47 +0100)]
Pre-MCP UI and architecture cleanup (#19685)

* webui: extract non-MCP changes from mcp-mvp review split

* webui: extract additional pre-MCP UI and architecture cleanup

* chore: update webui build output

2 months agoggml: ggml-cpu: force-no-lto-for-cpu-feats (#19609)
Talha Can Havadar [Tue, 17 Feb 2026 11:22:46 +0000 (12:22 +0100)]
ggml: ggml-cpu: force-no-lto-for-cpu-feats (#19609)

When LTO is enabled in the build environment, it forces all builds to
have LTO in place. But the feature detection logic is fragile, and was
causing illegal-instruction errors with LTO. This disables LTO for the
feature detection code to prevent cross-module optimization from
inlining architecture-specific instructions into the score function.
Without this, LTO can cause SIGILL when loading backends on older CPUs
(e.g., loading the power10 backend on power9 crashes before the feature
check runs).

2 months agocuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)
Georgi Gerganov [Tue, 17 Feb 2026 10:31:49 +0000 (12:31 +0200)]
cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)

* cuda : enable CUDA graphs for MMID BS <= 4

* cont : add stream capture check

Co-authored-by: Oliver Simons <redacted>
* cont : add MMVQ_MMID_MAX_BATCH_SIZE

---------

Co-authored-by: Oliver Simons <redacted>
2 months agomodel-conversion : make printing of config values optional (#19681)
Daniel Bevenius [Tue, 17 Feb 2026 09:46:53 +0000 (10:46 +0100)]
model-conversion : make printing of config values optional (#19681)

* model-conversion : make printing of config values optional

This commit updates run-org-model.py to make the printing of model
configuration values optional.

The motivation for this change is that not all models have these
configuration values defined, and those that do not will error when
running this script. With these changes we print the values only if
they exist, or fall back to a default value.

We could optionally just remove them but it can be useful to see these
values when running the original model.

2 months agoci : bump komac version (#19682)
Sigbjørn Skjæret [Tue, 17 Feb 2026 08:30:31 +0000 (09:30 +0100)]
ci : bump komac version (#19682)

2 months agobuild : link ws2_32 as PUBLIC on Windows (#19666)
Adrien Gallouët [Tue, 17 Feb 2026 07:37:07 +0000 (08:37 +0100)]
build : link ws2_32 as PUBLIC on Windows (#19666)

Signed-off-by: Adrien Gallouët <redacted>
2 months agobuild : cleanup library linking logic (#19665)
Adrien Gallouët [Tue, 17 Feb 2026 07:36:45 +0000 (08:36 +0100)]
build : cleanup library linking logic (#19665)

Signed-off-by: Adrien Gallouët <redacted>
2 months agoconvert : add JoyAI-LLM-Flash (#19651)
DAN™ [Mon, 16 Feb 2026 21:49:57 +0000 (16:49 -0500)]
convert : add JoyAI-LLM-Flash (#19651)

* convert_hf_to_gguf: add JoyAI-LLM-Flash tokenizer hash mapping to deepseek-v3

* llama-vocab: create a new pre-tokenizer name for joyai-llm.

* add missing vocab type section

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoperplexity: add proper batching (#19661)
AesSedai [Mon, 16 Feb 2026 16:44:44 +0000 (08:44 -0800)]
perplexity: add proper batching (#19661)

2 months agocommon : inline functions (#18639)
Ivan Chikish [Mon, 16 Feb 2026 15:52:24 +0000 (18:52 +0300)]
common : inline functions (#18639)

2 months agoggml : make `ggml_is_view` as API (#19539)
Judd [Mon, 16 Feb 2026 15:43:34 +0000 (23:43 +0800)]
ggml : make `ggml_is_view` as API (#19539)

* make `ggml_is_view` as API

* introduce `ggml_aux_is_view` as an inline version for internal use.

* change `ggml_aux_is_view` to `ggml_impl_is_view`

2 months agomodel: Add support for Tiny Aya Models (#19611)
Saurabh Dash [Mon, 16 Feb 2026 15:28:46 +0000 (10:28 -0500)]
model: Add support for Tiny Aya Models (#19611)

* changes for tiny aya

* changes to hash

* changes to vocab

* fix some tokenizer regex edge cases

* update comment

* add some comments for regex

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agobuild : rework llama_option_depr to handle LLAMA_CURL (#19658)
Adrien Gallouët [Mon, 16 Feb 2026 15:06:48 +0000 (16:06 +0100)]
build : rework llama_option_depr to handle LLAMA_CURL (#19658)

Signed-off-by: Adrien Gallouët <redacted>
2 months agoAdjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)
Mario Limonciello [Mon, 16 Feb 2026 13:46:08 +0000 (07:46 -0600)]
Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)

Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")
Signed-off-by: Mario Limonciello (AMD) <redacted>
2 months agomodels : deduplicate delta-net graphs for Qwen family (#19597)
Georgi Gerganov [Mon, 16 Feb 2026 12:35:04 +0000 (14:35 +0200)]
models : deduplicate delta-net graphs for Qwen family (#19597)

* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments

2 months agograph : fix KQ mask, lora, cvec reuse checks (#19644)
Georgi Gerganov [Mon, 16 Feb 2026 07:21:11 +0000 (09:21 +0200)]
graph : fix KQ mask, lora, cvec reuse checks (#19644)

* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse

2 months agoggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)
abhijain1204fujitsu [Mon, 16 Feb 2026 06:38:43 +0000 (12:08 +0530)]
ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)

* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used to compute, hence slowing down performance.

So added code to fall back to the NEON path when SVE 256 is not present.

* Code format change suggestion

---------

Co-authored-by: Vithule, Prashant <redacted>
2 months agosync : ggml upstream/0.0.8067
Georgi Gerganov [Sun, 15 Feb 2026 20:23:13 +0000 (22:23 +0200)]
sync : ggml

2 months agoggml : bump version to 0.9.7 (ggml/1425)
Georgi Gerganov [Sun, 15 Feb 2026 20:21:04 +0000 (22:21 +0200)]
ggml : bump version to 0.9.7 (ggml/1425)

2 months agoggml : bump version to 0.9.6 (ggml/1423)
Georgi Gerganov [Sat, 7 Feb 2026 07:58:02 +0000 (09:58 +0200)]
ggml : bump version to 0.9.6 (ggml/1423)

2 months agocuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
David Friehs [Sun, 15 Feb 2026 17:08:42 +0000 (18:08 +0100)]
cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)

* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined

2 months agodocs: update s390x build docs (#19643)
Aaron Teo [Sun, 15 Feb 2026 16:33:34 +0000 (00:33 +0800)]
docs: update s390x build docs (#19643)

2 months agobuild : remove LLAMA_HTTPLIB option (#19623)
Adrien Gallouët [Sun, 15 Feb 2026 14:38:50 +0000 (15:38 +0100)]
build : remove LLAMA_HTTPLIB option (#19623)

This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.

Signed-off-by: Adrien Gallouët <redacted>
2 months agocmake : check if KleidiAI API has been fetched (#19640)
Daniel Bevenius [Sun, 15 Feb 2026 12:59:38 +0000 (13:59 +0100)]
cmake : check if KleidiAI API has been fetched (#19640)

This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).

I missed this during my review and I should not have committed without
verifying the CI failure, sorry about that.

2 months agocontext : fix output reorder with backend sampling (#19638)
Georgi Gerganov [Sun, 15 Feb 2026 12:57:40 +0000 (14:57 +0200)]
context : fix output reorder with backend sampling (#19638)

2 months agoggml : avoid UB in gemm ukernel (#19642)
Georgi Gerganov [Sun, 15 Feb 2026 12:56:35 +0000 (14:56 +0200)]
ggml : avoid UB in gemm ukernel (#19642)

Aaron Teo [Sun, 15 Feb 2026 10:20:35 +0000 (18:20 +0800)]
ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399)

Aman Gupta [Sun, 15 Feb 2026 05:39:24 +0000 (11:09 +0530)]
ggml-cpu: FA add GEMM microkernel (#19422)

* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations

SamareshSingh [Sun, 15 Feb 2026 05:22:53 +0000 (23:22 -0600)]
cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)

* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both the build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments

Sigbjørn Skjæret [Sat, 14 Feb 2026 21:22:32 +0000 (22:22 +0100)]
convert : ensure all models handle new experts count (#19621)

* ensure all models handle new experts count

* revert removal for PhiMoeModel, does not inherit from base

Anav Prasad [Sat, 14 Feb 2026 13:07:00 +0000 (05:07 -0800)]
mtmd : Add Nemotron Nano 12B v2 VL support (#19547)

* nemotron nano v2 vlm support added

* simplified code; addressed reviews

* pre-downsample position embeddings during GGUF conversion for fixed input size

Georgi Gerganov [Sat, 14 Feb 2026 10:57:36 +0000 (12:57 +0200)]
models : optimize qwen3next graph (#19375)

* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs

Adrien Gallouët [Sat, 14 Feb 2026 10:22:57 +0000 (11:22 +0100)]
ggml : fix GGML_DEBUG with OpenMP (#19599)

last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <redacted>
iMil [Sat, 14 Feb 2026 08:47:01 +0000 (09:47 +0100)]
NetBSD build support (#19589)

Aleksander Grygier [Sat, 14 Feb 2026 08:06:41 +0000 (09:06 +0100)]
webui: Architecture and UI improvements (#19596)

agent-enemy-2 [Sat, 14 Feb 2026 08:06:27 +0000 (03:06 -0500)]
llama : update LoRA API. + fix excessive graph reserves (#19280)

* Refactoring to use new llama_put_adapter_loras

* cont : alternative lora API

---------

Co-authored-by: Jake Chavis <redacted>
Co-authored-by: Georgi Gerganov <redacted>
George [Sat, 14 Feb 2026 08:05:12 +0000 (10:05 +0200)]
mmap: Fix Windows handle lifetime (#19598)

* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, in ggml_quantize_free.

* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.

* Update llama-mmap.cpp

* Update llama-mmap.cpp

Remove trailing whitespace from line 567

Georgi Gerganov [Sat, 14 Feb 2026 07:54:03 +0000 (09:54 +0200)]
metal : fix ACC op (#19427)

Adrien Gallouët [Sat, 14 Feb 2026 07:41:16 +0000 (08:41 +0100)]
scripts : use official split.py for cpp-httplib (#19588)

* scripts : use official split.py for cpp-httplib

Using the official script is safer and ensures the generated code aligns
with the library's standards.

Signed-off-by: Adrien Gallouët <redacted>
* Catch generic errors

Signed-off-by: Adrien Gallouët <redacted>
* Allow print()

Signed-off-by: Adrien Gallouët <redacted>
* Ensure robust cleanup

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
Sigbjørn Skjæret [Sat, 14 Feb 2026 07:17:43 +0000 (08:17 +0100)]
convert : store ffn_gate_inp_shexp as F32 (#19606)

Adrien Gallouët [Sat, 14 Feb 2026 05:48:37 +0000 (06:48 +0100)]
build : fix libtool call in build-xcframework.sh (#19605)

Run libtool via xcrun like strip and dsymutil, to have proper tool resolution.

Signed-off-by: Adrien Gallouët <redacted>
Jeff Bolz [Sat, 14 Feb 2026 05:42:04 +0000 (21:42 -0800)]
vulkan: support L2_NORM with contiguous rows (#19604)

Jeff Bolz [Sat, 14 Feb 2026 05:36:38 +0000 (21:36 -0800)]
vulkan: support GGML_OP_SET (#19584)

Sophon [Sat, 14 Feb 2026 05:29:17 +0000 (13:29 +0800)]
vulkan: Add vendor id for Qualcomm drivers (#19569)

This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.

Max Krasnyansky [Sat, 14 Feb 2026 00:27:30 +0000 (16:27 -0800)]
hexagon: further optimizations and refactoring for flash attention (#19583)

* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <redacted>
Mengsheng Wu [Fri, 13 Feb 2026 23:56:53 +0000 (15:56 -0800)]
github : add missing backends to issue templates (#19603)

Jeff Bolz [Fri, 13 Feb 2026 19:35:29 +0000 (11:35 -0800)]
vulkan: restore -inf check in FA shaders (#19582)

Adrien Gallouët [Fri, 13 Feb 2026 14:10:46 +0000 (15:10 +0100)]
common : update download code (#19573)

* common : remove legacy .json to .etag migration code

Signed-off-by: Adrien Gallouët <redacted>
* common : simplify common_download_file_single_online

This commit also forces a redownload if the file exists
but has no .etag file.

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
Xuan-Son Nguyen [Fri, 13 Feb 2026 13:56:53 +0000 (14:56 +0100)]
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)

* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Alberto Cabrera Pérez [Fri, 13 Feb 2026 12:32:14 +0000 (12:32 +0000)]
Fix wrong memcpy length for block_interleave == 4 (#19575)

ymcki [Fri, 13 Feb 2026 12:31:37 +0000 (20:31 +0800)]
fix vulkan ggml_acc only works in 3d but not 4d (#19426)

* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <redacted>
Sigbjørn Skjæret [Fri, 13 Feb 2026 11:49:10 +0000 (12:49 +0100)]
support --verbose-prompt (#19576)

Aman Gupta [Fri, 13 Feb 2026 11:31:40 +0000 (17:01 +0530)]
CUDA: loop over ne2*ne3 in case it overflows (#19538)

* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv

Aleksander Grygier [Fri, 13 Feb 2026 11:31:00 +0000 (12:31 +0100)]
webui: UI and routing fixes (#19586)

* chore: update webui build output

* chore: update webui build output

* fix: Scroll issues in DropdownMenuSearchable

* webui: fix redirect to root ignoring base path

* fix: Word wrapping

* fix: remove obsolete modality UI tests causing CI failures

- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)

* feat: Improve formatting performance time

---------

Co-authored-by: Pascal <redacted>
Oliver Simons [Fri, 13 Feb 2026 09:37:55 +0000 (10:37 +0100)]
CUDA: Do not mutate cgraph for fused ADDs (#19566)

* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph-capture
   step, as we store the graph properties before modifying the graph
   in-place in the CUDA backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <redacted>
---------

Co-authored-by: Aman Gupta <redacted>
Pavan Shinde [Fri, 13 Feb 2026 08:38:09 +0000 (14:08 +0530)]
docs : fix broken link and typo (#19560)

ymcki [Fri, 13 Feb 2026 08:10:18 +0000 (16:10 +0800)]
model : Kimi Linear fix conv state update (#19531)

* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
Adrien Gallouët [Fri, 13 Feb 2026 05:43:53 +0000 (06:43 +0100)]
llama : remove deprecated codecvt (#19565)

Using the same conversion function ensures consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <redacted>
Adrien Gallouët [Fri, 13 Feb 2026 05:43:26 +0000 (06:43 +0100)]
vendor : update BoringSSL to 0.20260211.0 (#19562)

Signed-off-by: Adrien Gallouët <redacted>
Georgi Gerganov [Fri, 13 Feb 2026 05:36:24 +0000 (07:36 +0200)]
memory : fix kv cache size for hybrid models (#19559)

Georgi Gerganov [Fri, 13 Feb 2026 05:35:57 +0000 (07:35 +0200)]
metal : improve concurrency (#19555)

Georgi Gerganov [Fri, 13 Feb 2026 05:34:52 +0000 (07:34 +0200)]
metal : support GGML_OP_SET (#19548)

Shupei Fan [Thu, 12 Feb 2026 23:07:49 +0000 (07:07 +0800)]
hexagon: fix typo in vtcm_needs_release (#19545)

lhez [Thu, 12 Feb 2026 22:52:37 +0000 (14:52 -0800)]
opencl: add basic support for q4_1 (#19534)

* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm