git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
2 months agofix: UI single model selection in router mode (#19767)
crsawyer [Sat, 21 Feb 2026 08:28:39 +0000 (02:28 -0600)]
fix: UI single model selection in router mode (#19767)

2 months agohexagon : fix build release (#19444) (#19587)
Mengsheng Wu [Sat, 21 Feb 2026 00:40:00 +0000 (16:40 -0800)]
hexagon : fix build release (#19444) (#19587)

2 months agocommon : merge qwen3-coder and nemotron nano 3 parsers (#19765)
Aldehir Rojas [Fri, 20 Feb 2026 22:22:22 +0000 (16:22 -0600)]
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

* common : migrate qwen3-coder to PEG parsing variant

* cont : add JSON parameter test

2 months agoggml-cpu: add RVV vec dot kernels for quantization types (#18784)
Taimur Ahmad [Fri, 20 Feb 2026 11:30:07 +0000 (16:30 +0500)]
ggml-cpu: add RVV vec dot kernels for quantization types (#18784)

* ggml-cpu: add rvv vec_dot for iq2_s

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for iq3_s

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for tq1_0, tq2_0

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add rvv vec_dot for iq1_s, iq1_m

Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: add vlen switch for rvv vec_dot

---------

Co-authored-by: Rehan Qasim <redacted>
2 months agoquantize : add --dry-run option (#19526)
ddh0 [Fri, 20 Feb 2026 08:20:16 +0000 (02:20 -0600)]
quantize : add --dry-run option (#19526)

* clean slate for branch

* use 6 characters for tensor dims

* add --dry-run to llama-quantize

* use 6 characters for tensor dims (cont.)

* no need to re-calculate ggml_nbytes for tensor

* fix indent

* show model and quant BPW when quant completes

* add example to --help

* new function `tensor_requires_imatrix`, add courtesy warning about imatrix

* missing __func__, move imatrix flag set

* logic error

* fixup tensor_requires_imatrix

* add missing `GGML_TYPE`s

* simplify and rename `tensor_type_requires_imatrix`

* simplify for style

* add back Q2_K edge case for imatrix

* guard ftype imatrix warning

* comment ref #12557

* remove per @compilade

* remove unused `params` parameter

* move `bool dry_run` per GG

* move `bool dry_run` per GG

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agotest: mul_mat tests with huge batch size (#19519)
Jeff Bolz [Fri, 20 Feb 2026 02:08:25 +0000 (18:08 -0800)]
test: mul_mat tests with huge batch size (#19519)

2 months agoWebUI hide models in router mode (#19374)
crsawyer [Thu, 19 Feb 2026 21:53:42 +0000 (15:53 -0600)]
WebUI hide models in router mode (#19374)

2 months agocommon : fix Step-3.5-Flash format detection and thinking support (#19635)
Jesse Posner [Thu, 19 Feb 2026 21:40:52 +0000 (13:40 -0800)]
common : fix Step-3.5-Flash format detection and thinking support (#19635)

* common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: https://github.com/ggml-org/llama.cpp/pull/19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.

2 months agocommon : fix gpt-oss Jinja error when assistant message has both content and thinking...
abhijitb11 [Thu, 19 Feb 2026 20:59:20 +0000 (12:59 -0800)]
common : fix gpt-oss Jinja error when assistant message has both content and thinking with tool calls (#19704)

2 months agoggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (#19700)
Masashi Yoshimura [Thu, 19 Feb 2026 16:18:30 +0000 (01:18 +0900)]
ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (#19700)

* ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support.

* Fix to cast the src value to f32 before sin/cos computing.

2 months agomodel: Add PaddleOCR-VL model support (#18825)
megemini [Thu, 19 Feb 2026 16:05:25 +0000 (00:05 +0800)]
model: Add PaddleOCR-VL model support (#18825)

* support PaddleOCR-VL

* clip: update PaddleOCR model loader parameters to prevent OOM during warmup

* [update] add paddleocr vl text model instead of ernie4.5

* [update] restore change of minicpmv

* [update] format

* [update] format

* [update] positions and patch merge permute

* [update] mtmd_decode_use_mrope for paddleocr

* [update] image min/max pixels

* [update] remove set_limit_image_tokens

* update: preprocess without padding

* clean up

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agovulkan: fix MMQ shader push constants and multi-dispatch (#19732)
Ruben Ortlam [Thu, 19 Feb 2026 13:59:16 +0000 (14:59 +0100)]
vulkan: fix MMQ shader push constants and multi-dispatch (#19732)

2 months agomodels : fix qwen3.5 beta/gate shapes (#19730)
Georgi Gerganov [Thu, 19 Feb 2026 13:19:53 +0000 (15:19 +0200)]
models : fix qwen3.5 beta/gate shapes (#19730)

* models : fix qwen3.5 beta/gate shapes

* cont : avoid extra reshapes

2 months agomtmd: build_attn modified, flash_attn on/off via ctx_params (#19729)
Saba Fallah [Thu, 19 Feb 2026 12:50:29 +0000 (13:50 +0100)]
mtmd: build_attn modified, flash_attn on/off via ctx_params (#19729)

2 months agomodel : add JAIS-2 architecture support (#19488)
3 a l i [Thu, 19 Feb 2026 12:30:17 +0000 (16:30 +0400)]
model : add JAIS-2 architecture support (#19488)

* model: add JAIS-2 architecture support

Add support for the JAIS-2 family of Arabic-English bilingual models
from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

Architecture characteristics:
- LayerNorm (not RMSNorm) with biases
- ReLU² (ReLU squared) activation function
- Separate Q/K/V projections with biases
- Simple MLP without gate projection (up -> act -> down)
- RoPE positional embeddings
- GPT-2 BPE tokenizer

Supported model sizes:
- Jais-2-8B (32 layers, 26 heads, 3328 hidden)
- Jais-2-70B (68 layers, 56 heads, 7168 hidden)

Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.

* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash

* fix: use NEOX RoPE type for JAIS2

* fix: remove Q/K permutation (NEOX RoPE doesn't need it)

* fix: enable flash attention for JAIS2 (fixed by #19115)

* fix: add dedicated JAIS2 pre-tokenizer type and control vector support

- Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
- Include original regex from tokenizer.json as comment
- Add build_cvec call for control vector support

* no longer necessary to override set_vocab

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoCUDA: fix kernel selection logic for tile FA (#19686)
Johannes Gäßler [Thu, 19 Feb 2026 11:42:58 +0000 (12:42 +0100)]
CUDA: fix kernel selection logic for tile FA (#19686)

* CUDA: fix kernel selection logic for tile FA

* add comment

2 months agomtmd : chat : Fix extra \n between text and media marker (#19595)
Tarek Dakhran [Thu, 19 Feb 2026 11:18:57 +0000 (12:18 +0100)]
mtmd : chat : Fix extra \n between text and media marker (#19595)

* mtmd : chat : Fix extra \n between text and media marker

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces output identical to the HF implementation.

However `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.

This PR introduces a new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change the number of input tokens is identical to the HF
implementation and, as a result, the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

* Refactor to use explicit per-type ifs

* Update common/chat.cpp

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
* Update common_chat_templates_apply_legacy

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
2 months agowebui: Fix Attachments not being included in completion request (#19731)
Aleksander Grygier [Thu, 19 Feb 2026 09:27:38 +0000 (10:27 +0100)]
webui: Fix Attachments not being included in completion request (#19731)

* fix: Add missing argument

* chore: update webui build output

2 months agomodel : add tokenizer from LFM2.5-Audio-1.5B (#19687)
Tarek Dakhran [Thu, 19 Feb 2026 08:54:48 +0000 (09:54 +0100)]
model : add tokenizer from LFM2.5-Audio-1.5B (#19687)

* model : Add tokenizer from LFM2.5-Audio-1.5B

[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) introduced a lightweight audio tokenizer.

The tokenizer is based on the LFM2 architecture and acts as an "embedding"
model with different input `n_embd` and output `n_embd_out`.

To be used in https://github.com/ggml-org/llama.cpp/pull/18641.

To convert, use:

```shell
python3 convert_hf_to_gguf.py /path/to/LFM2.5-Audio-1.5B/audio_detokenizer
```

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Formatting

* Rework check for attention layers

* Add LFM2 SWA model support

* Address PR feedback

* Set vocab to none

* Move helper function definitions to cpp file

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agollama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)
Daniel Bevenius [Thu, 19 Feb 2026 08:48:08 +0000 (09:48 +0100)]
llama : use output_resolve_row() in get_logits_ith/get_embeddings_ith (#19663)

This commit updates get_logits_ith(), and get_embeddings_ith() to use
output_resolve_row() to resolve the batch index to output row index.

The motivation for this is to remove some code duplication between these
functions.

2 months agomodel : full modern bert support (#18330)
Ryan Mangeno [Thu, 19 Feb 2026 07:52:21 +0000 (02:52 -0500)]
model : full modern bert support (#18330)

* full modern bert support

* added gelu op in rank pooling for modern bert

* still working on stuff, added mean calculation before classifier head

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* first layer is dense, as per modern bert research paper

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agollamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)
shalinib-ibm [Thu, 19 Feb 2026 06:28:53 +0000 (11:58 +0530)]
llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)

Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.

Performance Impact:
1.5x to 2x improvement in PP speed for Q4 and Q8 models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)

llama-bench Q8 Model Results:
 model                                  size       params   backend      threads              test  Base t/s Patch t/s
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10               pp8           64.48 ± 4.72           73.99 ± 0.27
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp16           80.11 ± 0.32          112.53 ± 0.40
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp32           89.10 ± 0.27          152.95 ± 0.68
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp64           93.65 ± 0.25          187.83 ± 0.83
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp128           99.93 ± 0.02          201.32 ± 0.11
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp256          102.32 ± 0.40          208.32 ± 0.41
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp512          103.42 ± 0.40          209.98 ± 0.14
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             tg128           20.35 ± 0.01           19.57 ± 0.01

llama-bench Q4 Model Results:
 model                                  size       params   backend      threads              test  Base t/s Patch t/s
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10               pp8           34.77 ± 0.10           41.23 ± 0.08
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp16           40.81 ± 0.04           64.55 ± 0.15
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp32           44.65 ± 0.05           90.84 ± 0.22
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp64           47.49 ± 0.03          114.39 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp128           49.29 ± 0.24          120.13 ± 0.19
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp256           49.77 ± 0.23          121.51 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp512           49.89 ± 0.23          117.52 ± 0.10
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             tg128           13.40 ± 0.01           13.37 ± 0.00

Llama perplexity Results:

Model                       Base Final PPL Estimate   Patch Final PPL Estimate
granite-4.0-h-micro-Q8_0    1.3862 +/- 0.04424        1.3868 +/- 0.04432
Meta-Llama3-8b Q4           1.3801 +/- 0.04116        1.3803 +/- 0.04116

Signed-off-by: Shalini.Salomi.Bodapati <redacted>
2 months agomodels : dedup qwen35 graphs (#19660)
Georgi Gerganov [Thu, 19 Feb 2026 06:17:49 +0000 (08:17 +0200)]
models : dedup qwen35 graphs (#19660)

* models : dedup qwen35 graphs

* cont : add missing sigmoid

2 months agomodels : dedup Kimi Linear delta net implementation (#19668)
ymcki [Thu, 19 Feb 2026 06:15:17 +0000 (14:15 +0800)]
models : dedup Kimi Linear delta net implementation (#19668)

* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

---------

Co-authored-by: Georgi Gerganov <redacted>
2 months agoAdd Jinja support for "indent" string filter (#19529)
Piotr Wilkin (ilintar) [Wed, 18 Feb 2026 23:25:52 +0000 (00:25 +0100)]
Add Jinja support for "indent" string filter (#19529)

* Add partial Jinja support for "indent" string filter

* Fully implement indent

* Add tests for all width variants.

* Update tests/test-jinja.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fix getline ignoring trailing newlines

* Update common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix first indent condition

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoggml webgpu: Fix bug in dispatching large matrix-vector multiplication (#19535)
Reese Levine [Wed, 18 Feb 2026 23:06:29 +0000 (16:06 -0700)]
ggml webgpu: Fix bug in dispatching large matrix-vector multiplication (#19535)

* Fix bug in dispatching large matrix-vector multiplication

2 months agoserver: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG...
matteo [Wed, 18 Feb 2026 17:53:37 +0000 (18:53 +0100)]
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) (#19622)

* save generated text for the /slots endpoint

* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0

* Apply suggestions from code review

---------

Co-authored-by: Matteo <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agomodel: support GLM-OCR (#19677)
Xuan-Son Nguyen [Wed, 18 Feb 2026 16:51:40 +0000 (17:51 +0100)]
model: support GLM-OCR (#19677)

* model: support GLM-OCR

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agodocs: Fix broken links for preparing models in Backends (#19684)
Maciej Lisowski [Wed, 18 Feb 2026 15:50:23 +0000 (16:50 +0100)]
docs: Fix broken links for preparing models in Backends (#19684)

2 months agoggml webgpu: shader library organization (#19530)
Reese Levine [Wed, 18 Feb 2026 14:51:02 +0000 (07:51 -0700)]
ggml webgpu: shader library organization (#19530)

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <redacted>
Co-authored-by: Reese Levine <redacted>
* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* flashattention and matrix multiplication moved to new format

* clean up preprocessing

* Formatting

* remove duplicate constants

* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <redacted>
2 months agoPre-MCP UI and architecture cleanup (#19689)
Aleksander Grygier [Wed, 18 Feb 2026 11:02:02 +0000 (12:02 +0100)]
Pre-MCP UI and architecture cleanup (#19689)

2 months agovulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)
Jeff Bolz [Wed, 18 Feb 2026 09:47:10 +0000 (01:47 -0800)]
vulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)

* vulkan: split mul_mat into multiple dispatches to avoid overflow

The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.

Fall back for the less common p021 and nc variants.

* address feedback

2 months agocommon : make small string helpers as inline functions (#19693)
Adrien Gallouët [Wed, 18 Feb 2026 07:03:01 +0000 (08:03 +0100)]
common : make small string helpers as inline functions (#19693)

Also use string_view where it makes sense and fix some corner cases.

Signed-off-by: Adrien Gallouët <redacted>
2 months agoopencl: refactor expm1 and softplus (#19404)
shaofeiqi [Tue, 17 Feb 2026 22:47:18 +0000 (14:47 -0800)]
opencl: refactor expm1 and softplus (#19404)

* opencl: refactor expm1

* opencl: refactor softplus

* opencl: use h for half literals

---------

Co-authored-by: Li He <redacted>
2 months agoopencl: optimize mean and sum_row kernels (#19614)
shaofeiqi [Tue, 17 Feb 2026 21:56:09 +0000 (13:56 -0800)]
opencl: optimize mean and sum_row kernels (#19614)

* opencl: optimize mean and sum_row kernels

* opencl: add comment for max subgroups

* opencl: format

---------

Co-authored-by: Li He <redacted>
2 months agomodel-conversion : add option to print tensor values (#19692)
Daniel Bevenius [Tue, 17 Feb 2026 19:43:22 +0000 (20:43 +0100)]
model-conversion : add option to print tensor values (#19692)

This commit updates the tensor-info.py script to support the option to
print the first N values of a tensor when displaying its information.

The motivation for this is that it can be useful to inspect some actual
values in addition to the shapes of the tensors.

2 months agoPre-MCP UI and architecture cleanup (#19685)
Aleksander Grygier [Tue, 17 Feb 2026 12:47:45 +0000 (13:47 +0100)]
Pre-MCP UI and architecture cleanup (#19685)

* webui: extract non-MCP changes from mcp-mvp review split

* webui: extract additional pre-MCP UI and architecture cleanup

* chore: update webui build output

2 months agoggml: ggml-cpu: force-no-lto-for-cpu-feats (#19609)
Talha Can Havadar [Tue, 17 Feb 2026 11:22:46 +0000 (12:22 +0100)]
ggml: ggml-cpu: force-no-lto-for-cpu-feats (#19609)

When LTO is enabled in the build environment, it forces all builds to
have LTO in place. But the feature detection logic is fragile, and was
causing illegal-instruction errors with LTO. This disables LTO for the
feature detection code to prevent cross-module optimization from
inlining architecture-specific instructions into the score function.
Without this, LTO can cause SIGILL when loading backends on older CPUs
(e.g., loading the power10 backend on power9 crashes before the feature
check runs).

2 months agocuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)
Georgi Gerganov [Tue, 17 Feb 2026 10:31:49 +0000 (12:31 +0200)]
cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)

* cuda : enable CUDA graphs for MMID BS <= 4

* cont : add stream capture check

Co-authored-by: Oliver Simons <redacted>
* cont : add MMVQ_MMID_MAX_BATCH_SIZE

---------

Co-authored-by: Oliver Simons <redacted>
2 months agomodel-conversion : make printing of config values optional (#19681)
Daniel Bevenius [Tue, 17 Feb 2026 09:46:53 +0000 (10:46 +0100)]
model-conversion : make printing of config values optional (#19681)

* model-conversion : make printing of config values optional

This commit updates run-org-model.py to make the printing of model
configuration values optional.

The motivation for this change is that not all models have these
configuration values defined, and those that do not will error when
running this script. With these changes we print the values only if
they exist, or fall back to a default value.

We could optionally just remove them but it can be useful to see these
values when running the original model.

2 months agoci : bump komac version (#19682)
Sigbjørn Skjæret [Tue, 17 Feb 2026 08:30:31 +0000 (09:30 +0100)]
ci : bump komac version (#19682)

2 months agobuild : link ws2_32 as PUBLIC on Windows (#19666)
Adrien Gallouët [Tue, 17 Feb 2026 07:37:07 +0000 (08:37 +0100)]
build : link ws2_32 as PUBLIC on Windows (#19666)

Signed-off-by: Adrien Gallouët <redacted>
2 months agobuild : cleanup library linking logic (#19665)
Adrien Gallouët [Tue, 17 Feb 2026 07:36:45 +0000 (08:36 +0100)]
build : cleanup library linking logic (#19665)

Signed-off-by: Adrien Gallouët <redacted>
2 months agoconvert : add JoyAI-LLM-Flash (#19651)
DAN™ [Mon, 16 Feb 2026 21:49:57 +0000 (16:49 -0500)]
convert : add JoyAI-LLM-Flash (#19651)

* convert_hf_to_gguf: add JoyAI-LLM-Flash tokenizer hash mapping to deepseek-v3

* llama-vocab: create a new pre-tokenizer name for joyai-llm.

* add missing vocab type section

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
2 months agoperplexity: add proper batching (#19661)
AesSedai [Mon, 16 Feb 2026 16:44:44 +0000 (08:44 -0800)]
perplexity: add proper batching (#19661)

2 months agocommon : inline functions (#18639)
Ivan Chikish [Mon, 16 Feb 2026 15:52:24 +0000 (18:52 +0300)]
common : inline functions (#18639)

2 months agoggml : make `ggml_is_view` as API (#19539)
Judd [Mon, 16 Feb 2026 15:43:34 +0000 (23:43 +0800)]
ggml : make `ggml_is_view` as API (#19539)

* make `ggml_is_view` as API

* introduce `ggml_aux_is_view` as an inline version for internal use.

* change `ggml_aux_is_view` to `ggml_impl_is_view`

2 months agomodel: Add support for Tiny Aya Models (#19611)
Saurabh Dash [Mon, 16 Feb 2026 15:28:46 +0000 (10:28 -0500)]
model: Add support for Tiny Aya Models (#19611)

* changes for tiny aya

* changes to hash

* changes to vocab

* fix some tokenizer regex edge cases

* update comment

* add some comments for regex

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agobuild : rework llama_option_depr to handle LLAMA_CURL (#19658)
Adrien Gallouët [Mon, 16 Feb 2026 15:06:48 +0000 (16:06 +0100)]
build : rework llama_option_depr to handle LLAMA_CURL (#19658)

Signed-off-by: Adrien Gallouët <redacted>
2 months agoAdjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)
Mario Limonciello [Mon, 16 Feb 2026 13:46:08 +0000 (07:46 -0600)]
Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591)

Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")
Signed-off-by: Mario Limonciello (AMD) <redacted>
2 months agomodels : deduplicate delta-net graphs for Qwen family (#19597)
Georgi Gerganov [Mon, 16 Feb 2026 12:35:04 +0000 (14:35 +0200)]
models : deduplicate delta-net graphs for Qwen family (#19597)

* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments

2 months agograph : fix KQ mask, lora, cvec reuse checks (#19644)
Georgi Gerganov [Mon, 16 Feb 2026 07:21:11 +0000 (09:21 +0200)]
graph : fix KQ mask, lora, cvec reuse checks (#19644)

* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse

2 months agoggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)
abhijain1204fujitsu [Mon, 16 Feb 2026 06:38:43 +0000 (12:08 +0530)]
ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132)

* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used to compute, hence slowing down performance.

So added code to fall back to the NEON path when SVE 256 is not present.

* Code format change suggestion

---------

Co-authored-by: Vithule, Prashant <redacted>
2 months agosync : ggml upstream/0.0.8067
Georgi Gerganov [Sun, 15 Feb 2026 20:23:13 +0000 (22:23 +0200)]
sync : ggml

2 months agoggml : bump version to 0.9.7 (ggml/1425)
Georgi Gerganov [Sun, 15 Feb 2026 20:21:04 +0000 (22:21 +0200)]
ggml : bump version to 0.9.7 (ggml/1425)

2 months agoggml : bump version to 0.9.6 (ggml/1423)
Georgi Gerganov [Sat, 7 Feb 2026 07:58:02 +0000 (09:58 +0200)]
ggml : bump version to 0.9.6 (ggml/1423)

2 months agocuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
David Friehs [Sun, 15 Feb 2026 17:08:42 +0000 (18:08 +0100)]
cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)

* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined

2 months agodocs: update s390x build docs (#19643)
Aaron Teo [Sun, 15 Feb 2026 16:33:34 +0000 (00:33 +0800)]
docs: update s390x build docs (#19643)

2 months agobuild : remove LLAMA_HTTPLIB option (#19623)
Adrien Gallouët [Sun, 15 Feb 2026 14:38:50 +0000 (15:38 +0100)]
build : remove LLAMA_HTTPLIB option (#19623)

This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.

Signed-off-by: Adrien Gallouët <redacted>
2 months agocmake : check if KleidiAI API has been fetched (#19640)
Daniel Bevenius [Sun, 15 Feb 2026 12:59:38 +0000 (13:59 +0100)]
cmake : check if KleidiAI API has been fetched (#19640)

This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).

I missed this during my review and I should not have committed without
verifying the CI failure, sorry about that.

2 months agocontext : fix output reorder with backend sampling (#19638)
Georgi Gerganov [Sun, 15 Feb 2026 12:57:40 +0000 (14:57 +0200)]
context : fix output reorder with backend sampling (#19638)

2 months agoggml : avoid UB in gemm ukernel (#19642)
Georgi Gerganov [Sun, 15 Feb 2026 12:56:35 +0000 (14:56 +0200)]
ggml : avoid UB in gemm ukernel (#19642)

Aaron Teo [Sun, 15 Feb 2026 10:20:35 +0000 (18:20 +0800)]
ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399)

Aman Gupta [Sun, 15 Feb 2026 05:39:24 +0000 (11:09 +0530)]
ggml-cpu: FA add GEMM microkernel (#19422)

* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations

SamareshSingh [Sun, 15 Feb 2026 05:22:53 +0000 (23:22 -0600)]
cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)

* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both the build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments

Sigbjørn Skjæret [Sat, 14 Feb 2026 21:22:32 +0000 (22:22 +0100)]
convert : ensure all models handle new experts count (#19621)

* ensure all models handle new experts count

* revert removal for PhiMoeModel, does not inherit from base

Anav Prasad [Sat, 14 Feb 2026 13:07:00 +0000 (05:07 -0800)]
mtmd : Add Nemotron Nano 12B v2 VL support (#19547)

* nemotron nano v2 vlm support added

* simplified code; addressed reviews

* pre-downsample position embeddings during GGUF conversion for fixed input size

Georgi Gerganov [Sat, 14 Feb 2026 10:57:36 +0000 (12:57 +0200)]
models : optimize qwen3next graph (#19375)

* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs

Adrien Gallouët [Sat, 14 Feb 2026 10:22:57 +0000 (11:22 +0100)]
ggml : fix GGML_DEBUG with OpenMP (#19599)

last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <redacted>
iMil [Sat, 14 Feb 2026 08:47:01 +0000 (09:47 +0100)]
NetBSD build support (#19589)

Aleksander Grygier [Sat, 14 Feb 2026 08:06:41 +0000 (09:06 +0100)]
webui: Architecture and UI improvements (#19596)

agent-enemy-2 [Sat, 14 Feb 2026 08:06:27 +0000 (03:06 -0500)]
llama : update LoRA API. + fix excessive graph reserves (#19280)

* Refactoring to use new llama_put_adapter_loras

* cont : alternative lora API

---------

Co-authored-by: Jake Chavis <redacted>
Co-authored-by: Georgi Gerganov <redacted>
George [Sat, 14 Feb 2026 08:05:12 +0000 (10:05 +0200)]
mmap: Fix Windows handle lifetime (#19598)

* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, in ggml_quantize_free.

* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.

* Update llama-mmap.cpp

* Update llama-mmap.cpp

Remove trailing whitespace from line 567

Georgi Gerganov [Sat, 14 Feb 2026 07:54:03 +0000 (09:54 +0200)]
metal : fix ACC op (#19427)

Adrien Gallouët [Sat, 14 Feb 2026 07:41:16 +0000 (08:41 +0100)]
scripts : use official split.py for cpp-httplib (#19588)

* scripts : use official split.py for cpp-httplib

Using the official script is safer and ensures the generated code aligns
with the library's standards.

Signed-off-by: Adrien Gallouët <redacted>
* Catch generic errors

Signed-off-by: Adrien Gallouët <redacted>
* Allow print()

Signed-off-by: Adrien Gallouët <redacted>
* Ensure robust cleanup

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
Sigbjørn Skjæret [Sat, 14 Feb 2026 07:17:43 +0000 (08:17 +0100)]
convert : store ffn_gate_inp_shexp as F32 (#19606)

Adrien Gallouët [Sat, 14 Feb 2026 05:48:37 +0000 (06:48 +0100)]
build : fix libtool call in build-xcframework.sh (#19605)

Run libtool via xcrun like strip and dsymutil, to have proper tool resolution.

Signed-off-by: Adrien Gallouët <redacted>
Jeff Bolz [Sat, 14 Feb 2026 05:42:04 +0000 (21:42 -0800)]
vulkan: support L2_NORM with contiguous rows (#19604)

Jeff Bolz [Sat, 14 Feb 2026 05:36:38 +0000 (21:36 -0800)]
vulkan: support GGML_OP_SET (#19584)

Sophon [Sat, 14 Feb 2026 05:29:17 +0000 (13:29 +0800)]
vulkan: Add vendor id for Qualcomm drivers (#19569)

This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.

Max Krasnyansky [Sat, 14 Feb 2026 00:27:30 +0000 (16:27 -0800)]
hexagon: further optimizations and refactoring for flash attention (#19583)

* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <redacted>
Mengsheng Wu [Fri, 13 Feb 2026 23:56:53 +0000 (15:56 -0800)]
github : add missing backends to issue templates (#19603)

Jeff Bolz [Fri, 13 Feb 2026 19:35:29 +0000 (11:35 -0800)]
vulkan: restore -inf check in FA shaders (#19582)

Adrien Gallouët [Fri, 13 Feb 2026 14:10:46 +0000 (15:10 +0100)]
common : update download code (#19573)

* common : remove legacy .json to .etag migration code

Signed-off-by: Adrien Gallouët <redacted>
* common : simplify common_download_file_single_online

This commit also forces a redownload if the file exists
but has no .etag file.

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
Xuan-Son Nguyen [Fri, 13 Feb 2026 13:56:53 +0000 (14:56 +0100)]
model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)

* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Alberto Cabrera Pérez [Fri, 13 Feb 2026 12:32:14 +0000 (12:32 +0000)]
Fix wrong memcpy length for block_interleave == 4 (#19575)

ymcki [Fri, 13 Feb 2026 12:31:37 +0000 (20:31 +0800)]
fix vulkan ggml_acc only works in 3d but not 4d (#19426)

* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <redacted>
Sigbjørn Skjæret [Fri, 13 Feb 2026 11:49:10 +0000 (12:49 +0100)]
support --verbose-prompt (#19576)

Aman Gupta [Fri, 13 Feb 2026 11:31:40 +0000 (17:01 +0530)]
CUDA: loop over ne2*ne3 in case it overflows (#19538)

* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv

Aleksander Grygier [Fri, 13 Feb 2026 11:31:00 +0000 (12:31 +0100)]
webui: UI and routing fixes (#19586)

* chore: update webui build output

* chore: update webui build output

* fix: Scroll issues in DropdownMenuSearchable

* webui: fix redirect to root ignoring base path

* fix: Word wrapping

* fix: remove obsolete modality UI tests causing CI failures

- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)

* feat: Improve formatting performance time

---------

Co-authored-by: Pascal <redacted>
Oliver Simons [Fri, 13 Feb 2026 09:37:55 +0000 (10:37 +0100)]
CUDA: Do not mutate cgraph for fused ADDs (#19566)

* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph-capture
   step, as we store the graph properties before modifying the graph
   in-place in the CUDA backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <redacted>
---------

Co-authored-by: Aman Gupta <redacted>
Pavan Shinde [Fri, 13 Feb 2026 08:38:09 +0000 (14:08 +0530)]
docs : fix broken link and typo (#19560)

ymcki [Fri, 13 Feb 2026 08:10:18 +0000 (16:10 +0800)]
model : Kimi Linear fix conv state update (#19531)

* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>
Adrien Gallouët [Fri, 13 Feb 2026 05:43:53 +0000 (06:43 +0100)]
llama : remove deprecated codecvt (#19565)

Using the same conversion function ensures consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <redacted>
Adrien Gallouët [Fri, 13 Feb 2026 05:43:26 +0000 (06:43 +0100)]
vendor : update BoringSSL to 0.20260211.0 (#19562)

Signed-off-by: Adrien Gallouët <redacted>
Georgi Gerganov [Fri, 13 Feb 2026 05:36:24 +0000 (07:36 +0200)]
memory : fix kv cache size for hybrid models (#19559)

Georgi Gerganov [Fri, 13 Feb 2026 05:35:57 +0000 (07:35 +0200)]
metal : improve concurrency (#19555)

Georgi Gerganov [Fri, 13 Feb 2026 05:34:52 +0000 (07:34 +0200)]
metal : support GGML_OP_SET (#19548)

Shupei Fan [Thu, 12 Feb 2026 23:07:49 +0000 (07:07 +0800)]
hexagon: fix typo in vtcm_needs_release (#19545)

lhez [Thu, 12 Feb 2026 22:52:37 +0000 (14:52 -0800)]
opencl: add basic support for q4_1 (#19534)

* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm