git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
3 weeks ago common : Disable backend sampling if reasoning budget is enabled (#21209)
Galunid [Tue, 31 Mar 2026 07:14:01 +0000 (09:14 +0200)]
common : Disable backend sampling if reasoning budget is enabled (#21209)

3 weeks ago opencl: add q4_K gemm and gemv kernels for Adreno (#20919)
shaofeiqi [Mon, 30 Mar 2026 19:19:16 +0000 (12:19 -0700)]
opencl: add q4_K gemm and gemv kernels for Adreno (#20919)

* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <redacted>
3 weeks ago CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)
Seungmin Kim [Mon, 30 Mar 2026 18:24:37 +0000 (03:24 +0900)]
CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122)

* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <redacted>
* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Ts-sound <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
3 weeks ago jinja : handle empty expressions correctly (#20913)
Zhihao "Zephyr" Yao [Mon, 30 Mar 2026 18:08:46 +0000 (14:08 -0400)]
jinja : handle empty expressions correctly (#20913)

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <redacted>
* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
3 weeks ago CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)
Oliver Simons [Mon, 30 Mar 2026 14:20:00 +0000 (16:20 +0200)]
CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes #21162

* Reduce nrows in test case to 256, don't need 768
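The off-by-one described above can be sketched in a few lines (a Python stand-in for the CUDA grid math; `ceildiv` and `covered_offsets` are illustrative names, not the actual code):

```python
# Hypothetical reconstruction of the off-by-one: a segmented sort over
# `nrows` rows needs `nrows + 1` segment offsets, so the grid filling
# them must be sized with ceildiv(nrows + 1, block_size).

def ceildiv(a, b):
    return -(-a // b)

def covered_offsets(nrows, block_size, plus_one):
    # number of offset slots written by the launched threads
    n = nrows + 1 if plus_one else nrows
    return ceildiv(n, block_size) * block_size

nrows, block_size = 256, 128  # nrows % block_size == 0, the failing case
# old sizing leaves offset_iterator[nrows] uninitialized:
assert covered_offsets(nrows, block_size, plus_one=False) < nrows + 1
# fixed sizing covers all nrows + 1 offsets:
assert covered_offsets(nrows, block_size, plus_one=True) >= nrows + 1
```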

3 weeks ago rpc : fix misleading error log (#21184)
Radoslav Gerganov [Mon, 30 Mar 2026 14:05:11 +0000 (17:05 +0300)]
rpc : fix misleading error log (#21184)

When RPC is running with a remote backend which doesn't have an init_tensor
function (like CPU and Metal), the server log fills up with error
messages incorrectly saying that init_tensor is being called with a null
buffer. This patch fixes that.

3 weeks ago webui: Fix branching logic on edit message (#21175)
Aleksander Grygier [Mon, 30 Mar 2026 12:40:50 +0000 (14:40 +0200)]
webui: Fix branching logic on edit message (#21175)

* fix: Branching logic + small refactor

* chore: update webui build output

3 weeks ago llama-model-loader: print warning when using overrides with mmap (#20978)
Aman Gupta [Mon, 30 Mar 2026 09:40:17 +0000 (17:40 +0800)]
llama-model-loader: print warning when using overrides with mmap (#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning

3 weeks ago ci : bump ty to 0.0.26 (#21156)
Sigbjørn Skjæret [Mon, 30 Mar 2026 07:29:15 +0000 (09:29 +0200)]
ci : bump ty to 0.0.26 (#21156)

* fix incorrect type ignore comments

* bump ty to 0.0.26

3 weeks ago server: wrap headers for mcp proxy (#21072)
Xuan-Son Nguyen [Mon, 30 Mar 2026 06:59:16 +0000 (08:59 +0200)]
server: wrap headers for mcp proxy (#21072)

* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <redacted>
* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Aleksander Grygier <redacted>
3 weeks ago add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)
Sigbjørn Skjæret [Sun, 29 Mar 2026 17:45:40 +0000 (19:45 +0200)]
add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)

3 weeks ago Optimize MOE GEMV kernel for BS > 1. (#20905)
Gaurav Garg [Sun, 29 Mar 2026 16:35:18 +0000 (22:05 +0530)]
Optimize MOE GEMV kernel for BS > 1. (#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 launched too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block: each (32, 4) block computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently using warp-level reduction only (no shared-memory sync).

This change doesn't increase compilation time, as a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <redacted>
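To make the launch-geometry change above concrete, here is a rough block-count comparison (a Python sketch; the helper names and all numeric values are made-up examples, not the CUDA code):

```python
import math

def old_grid_blocks(nrows_x, nchannels_dst, ncols_dst):
    # one tiny block per (row, channel, col) triple
    return nrows_x * nchannels_dst * ncols_dst

def new_grid_blocks(nrows_x, nchannels_dst, rpb):
    # rpb rows batched per block, ncols_dst folded into the block shape
    return math.ceil(nrows_x / rpb) * nchannels_dst

nrows_x, nchannels_dst, ncols_dst, rpb = 4096, 8, 4, 2
# far fewer, fatter blocks than before:
assert new_grid_blocks(nrows_x, nchannels_dst, rpb) < old_grid_blocks(nrows_x, nchannels_dst, ncols_dst)
```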
4 weeks ago hexagon: dma optimizations (mostly fixing regressions) (#21137)
Max Krasnyansky [Sun, 29 Mar 2026 13:40:13 +0000 (06:40 -0700)]
hexagon: dma optimizations (mostly fixing regressions) (#21137)

* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token generation.

* hex-rope: update comment to clarify that we don't need in-order DMA completions

4 weeks ago devops: including compute-runtime for intel.Dockerfile (#21076)
Davi Henrique Linhares [Sun, 29 Mar 2026 05:34:03 +0000 (02:34 -0300)]
devops: including compute-runtime for intel.Dockerfile (#21076)

4 weeks ago [SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)
Neo Zhang [Sun, 29 Mar 2026 01:02:45 +0000 (09:02 +0800)]
[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)

* use half of the cores to build, to avoid OS hang

* reduce the amount of output text to shorten test time

* avoid returning 0

4 weeks ago fix **/x glob matching (#21129)
Sigbjørn Skjæret [Sat, 28 Mar 2026 21:27:38 +0000 (22:27 +0100)]
fix **/x glob matching (#21129)

4 weeks ago common/parser: fix handling of tool definition with missing properties key (#21128)
Piotr Wilkin (ilintar) [Sat, 28 Mar 2026 19:41:32 +0000 (20:41 +0100)]
common/parser: fix handling of tool definition with missing properties key (#21128)

4 weeks ago common : add character class support to glob_match (#21111)
Sigbjørn Skjæret [Sat, 28 Mar 2026 18:57:37 +0000 (19:57 +0100)]
common : add character class support to glob_match (#21111)

* add character class support to glob_match

* remove pointless reference
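The character-class semantics being added here ([0-9]-style sets, ranges, negation) mirror what Python's `fnmatch` already implements, which makes for a quick illustration of the expected behavior (this is not the glob_match code itself):

```python
from fnmatch import fnmatchcase

assert fnmatchcase("file1.txt", "file[0-9].txt")       # digit in range matches
assert not fnmatchcase("fileA.txt", "file[0-9].txt")   # letter does not
assert fnmatchcase("fileA.txt", "file[!0-9].txt")      # negated class matches non-digits
```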

4 weeks ago WebUI: Replace illegal nested button elements (#21026)
BlueMöhre [Sat, 28 Mar 2026 16:57:59 +0000 (17:57 +0100)]
WebUI: Replace illegal nested button elements (#21026)

* remove/replace nested button elements

* map rest props to outer element

* solve TODO

* chore: update webui build output

4 weeks ago common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern...
Adrien [Sat, 28 Mar 2026 16:55:38 +0000 (17:55 +0100)]
common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
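A minimal sketch of the parsing rule the fix describes (illustrative Python, not the converter itself): after '(' the scanner consumes a leading '?:' so the group body is parsed like a regular group, and escaped characters are skipped so they cannot throw off the parenthesis depth.

```python
def split_top_level_groups(pattern):
    """Collect the bodies of top-level groups, treating (?:...) like (...)."""
    groups, depth, start = [], 0, None
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '\\':                      # skip escaped char so ')' inside \) is not counted
            i += 2
            continue
        if c == '(':
            if pattern[i + 1:i + 3] == '?:':
                i += 2                     # non-capturing group -> treat as regular group
            if depth == 0:
                start = i + 1
            depth += 1
        elif c == ')':
            depth -= 1
            if depth == 0:
                groups.append(pattern[start:i])
        i += 1
    return groups

assert split_top_level_groups(r"(?:ab|cd)") == ["ab|cd"]
assert split_top_level_groups(r"(a)(?:b)") == ["a", "b"]
```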

4 weeks ago common : add reasoning_format = none support to gpt-oss (#21094)
Aldehir Rojas [Sat, 28 Mar 2026 14:33:39 +0000 (09:33 -0500)]
common : add reasoning_format = none support to gpt-oss (#21094)

4 weeks ago server : fix processing of multiple back-to-back mtmd chunks (#21107)
Georgi Gerganov [Sat, 28 Mar 2026 14:27:36 +0000 (16:27 +0200)]
server : fix processing of multiple back-to-back mtmd chunks (#21107)

4 weeks ago ci : gracefully shut down the server (#21110)
Adrien Gallouët [Sat, 28 Mar 2026 13:49:57 +0000 (14:49 +0100)]
ci : gracefully shut down the server (#21110)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago Document custom default webui preferences in server README (#19771)
Woof Dog [Sat, 28 Mar 2026 13:19:16 +0000 (13:19 +0000)]
Document custom default webui preferences in server README (#19771)

4 weeks ago webui: Conversation forking + branching improvements (#21021)
Aleksander Grygier [Sat, 28 Mar 2026 12:38:15 +0000 (13:38 +0100)]
webui: Conversation forking + branching improvements (#21021)

* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output

4 weeks ago vendor : update cpp-httplib to 0.40.0 (#21100)
Adrien Gallouët [Sat, 28 Mar 2026 07:59:44 +0000 (08:59 +0100)]
vendor : update cpp-httplib to 0.40.0 (#21100)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago vulkan: add noncontiguous GLU support (#21081)
Ruben Ortlam [Sat, 28 Mar 2026 07:44:56 +0000 (08:44 +0100)]
vulkan: add noncontiguous GLU support (#21081)

* vulkan: add noncontiguous GLU support

* fix compile issue

4 weeks ago common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
Piotr Wilkin (ilintar) [Sat, 28 Mar 2026 06:29:26 +0000 (07:29 +0100)]
common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)

* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker

4 weeks ago cli : add /glob command (#21084)
Sigbjørn Skjæret [Sat, 28 Mar 2026 01:33:04 +0000 (02:33 +0100)]
cli : add /glob command (#21084)

* add /glob command

* output error when max files reached

* support globbing outside curdir

4 weeks ago docker : fix and enable ARM64 image build (#20929)
Ts-sound [Sat, 28 Mar 2026 00:45:09 +0000 (08:45 +0800)]
docker : fix and enable ARM64 image build (#20929)

* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI: cpu.Dockerfile, update pip install

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>
4 weeks ago server : add custom socket options to disable SO_REUSEPORT (#21056)
Adrien Gallouët [Sat, 28 Mar 2026 00:12:43 +0000 (01:12 +0100)]
server : add custom socket options to disable SO_REUSEPORT (#21056)

* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <redacted>
* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

Signed-off-by: Adrien Gallouët <redacted>
* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <redacted>
* Fix windows

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
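The setsockopt sequence visible in the strace output above can be reproduced with Python's socket module (Linux-specific; SO_REUSEPORT may be absent on other platforms, hence the hasattr guard):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)   # always set
if hasattr(socket, "SO_REUSEPORT"):                          # only with --reuse-port
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    assert sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT) != 0
sock.close()
```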
4 weeks ago common : inhibit lazy grammar sampler while reasoning is active (#20970)
Aldehir Rojas [Fri, 27 Mar 2026 17:30:40 +0000 (12:30 -0500)]
common : inhibit lazy grammar sampler while reasoning is active (#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <redacted>
4 weeks ago server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web...
Kusha Gharahi [Fri, 27 Mar 2026 16:25:55 +0000 (11:25 -0500)]
server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158)

* introduce LLAMA_SERVER_NO_WEBUI

* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI

* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE

* Missed this

* Add useWebUi to package.nix

4 weeks ago hexagon: support for IQ4_NL and MXFP4 (#21018)
Yiwei Shao [Fri, 27 Mar 2026 16:22:41 +0000 (09:22 -0700)]
hexagon: support for IQ4_NL and MXFP4 (#21018)

* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support

- Add IQ4_NL quantization type support to Hexagon backend (buffer
  set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
  LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
  including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
  Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path

* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models

* Fix the pragma indent

4 weeks ago webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading...
Aleksander Grygier [Fri, 27 Mar 2026 16:01:36 +0000 (17:01 +0100)]
webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading with transitions to content blocks (#20999)

* refactor: Always use agentic content renderer for Assistant Message

* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks

* chore: update webui build output

4 weeks ago server: remove the verbose_prompt parameter (#21059)
AN Long [Fri, 27 Mar 2026 11:36:13 +0000 (19:36 +0800)]
server: remove the verbose_prompt parameter (#21059)

* server: respect the verbose_prompt parameter

* Revert "server: respect the verbose_prompt parameter"

This reverts commit 8ed885cf375b2c8ba641c661f3667df70b9797f4.

* Remove --verbose-prompt parameter from llama-server

* Using set_examples instead of set_excludes

4 weeks ago mtmd: add more sanity checks (#21047)
Xuan-Son Nguyen [Fri, 27 Mar 2026 10:00:52 +0000 (11:00 +0100)]
mtmd: add more sanity checks (#21047)

4 weeks ago server: add built-in tools backend support (#20898)
Xuan-Son Nguyen [Fri, 27 Mar 2026 09:07:11 +0000 (10:07 +0100)]
server: add built-in tools backend support (#20898)

* wip: server_tools

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* change arg to --tools all

* add readme mention

* llama-gen-docs

4 weeks ago rpc : proper handling of data pointers to CPU buffers (#21030)
Radoslav Gerganov [Fri, 27 Mar 2026 08:59:35 +0000 (10:59 +0200)]
rpc : proper handling of data pointers to CPU buffers (#21030)

The compute graph may contain tensors pointing to CPU buffers. In these
cases the buffer address is serialized as 0 and sent over the wire.
However, the data pointer is serialized as-is and this prevents proper
validation on the server side. This patch fixes that by serializing
the data pointer as 0 for non-RPC buffers and doing proper validation on
the server side.

closes: #21006
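The serialization rule described above boils down to one branch (a hypothetical sketch, not the RPC code):

```python
def serialize_data_ptr(buffer_is_rpc, data_ptr):
    # tensors backed by non-RPC (e.g. CPU) buffers: send 0 so the server
    # can validate the pointer instead of trusting a raw host address
    return data_ptr if buffer_is_rpc else 0

assert serialize_data_ptr(True, 0xdeadbeef) == 0xdeadbeef
assert serialize_data_ptr(False, 0xdeadbeef) == 0
```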

4 weeks ago completion : session_tokens insert range in completion tool (no-op → correct) (#20917)
mtmcp [Fri, 27 Mar 2026 08:25:58 +0000 (05:25 -0300)]
completion : session_tokens insert range in completion tool (no-op → correct) (#20917)

The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. It should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8.
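The bug class is easy to reproduce with Python list slicing standing in for the C++ iterator pair:

```python
embd = [101, 102, 103]
session_tokens = []

session_tokens.extend(embd[0:0])          # mirrors (embd.begin(), embd.begin()): a no-op
assert session_tokens == []

session_tokens.extend(embd[0:len(embd)])  # mirrors (embd.begin(), embd.end())
assert session_tokens == [101, 102, 103]
```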

4 weeks ago completion : Fix segfault on model load failure (#21049)
mtmcp [Fri, 27 Mar 2026 08:01:13 +0000 (05:01 -0300)]
completion : Fix segfault on model load failure (#21049)

4 weeks ago Send reasoning content back to the model across turns via the reasoning_content API...
Pascal [Fri, 27 Mar 2026 07:17:35 +0000 (08:17 +0100)]
Send reasoning content back to the model across turns via the reasoning_content API field (#21036)

* webui: send reasoning_content back to model in context

Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).

Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.

* webui: add syncable parameter for excludeReasoningFromContext

* chore: update webui build output

4 weeks ago metal : Fix dimension constraint violation in matmul2d descriptor (#21048)
ren [Fri, 27 Mar 2026 07:05:21 +0000 (00:05 -0700)]
metal : Fix dimension constraint violation in matmul2d descriptor (#21048)

Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).

4 weeks ago CANN: update docker images to 8.5.0 and improve CANN.md (#20801)
KokerZhou [Fri, 27 Mar 2026 00:53:00 +0000 (08:53 +0800)]
CANN: update docker images to 8.5.0 and improve CANN.md (#20801)

* cann: update docker images to 8.5.0

- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0

Move to newer stable releases.

* cann: update CANN.md

* Update CANN.md to include BF16 support

Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.

* Fix formatting issues in CANN.md

Fix 234: Trailing whitespace

4 weeks ago mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr...
Saba Fallah [Thu, 26 Mar 2026 23:07:55 +0000 (00:07 +0100)]
mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027)

* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks ago hip: use fnuz fp8 for conversion on CDNA3 (#21040)
uvos [Thu, 26 Mar 2026 22:06:33 +0000 (23:06 +0100)]
hip: use fnuz fp8 for conversion on CDNA3 (#21040)

4 weeks ago ci: pin external actions to exact commit SHA (#21033)
Xuan-Son Nguyen [Thu, 26 Mar 2026 19:44:00 +0000 (20:44 +0100)]
ci: pin external actions to exact commit SHA (#21033)

4 weeks ago common : add getpwuid fallback for HF cache when HOME is not set (#21035)
Adrien Gallouët [Thu, 26 Mar 2026 19:34:23 +0000 (20:34 +0100)]
common : add getpwuid fallback for HF cache when HOME is not set (#21035)

Signed-off-by: Adrien Gallouët <redacted>
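The fallback order described in the title looks roughly like this in Python (`os`/`pwd` standing in for getenv/getpwuid; POSIX-only, and `home_dir` is an illustrative name, not the actual API):

```python
import os
import pwd

def home_dir():
    home = os.environ.get("HOME")
    if home:
        return home
    # fallback when HOME is unset: look up the passwd entry for our uid
    return pwd.getpwuid(os.getuid()).pw_dir

assert isinstance(home_dir(), str) and home_dir() != ""
```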
4 weeks ago mtmd: refactor image preprocessing (#21031)
Xuan-Son Nguyen [Thu, 26 Mar 2026 18:49:20 +0000 (19:49 +0100)]
mtmd: refactor image preprocessing (#21031)

* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size

4 weeks ago opencl: allow large buffer for adreno (#20997)
lhez [Thu, 26 Mar 2026 15:52:21 +0000 (08:52 -0700)]
opencl: allow large buffer for adreno (#20997)

4 weeks ago convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505)
Michael Wand [Thu, 26 Mar 2026 15:52:06 +0000 (08:52 -0700)]
convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505)

* convert : fix Qwen3.5 NVFP4 conversion

* Addressed Copilot review concerns and rebased

* move into _LinearAttentionVReorderBase and simplify

* --flake

* new_name not needed

* Added input_scale to gguf

* Fixed input_scale addition as tensor

* Added input scale to loader and named _in_s

* Update convert_hf_to_gguf.py

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks ago convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011)
Pavel Zloi [Thu, 26 Mar 2026 15:49:09 +0000 (18:49 +0300)]
convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011)

* Support of ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* chkhsh for ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks ago common : filter out imatrix when finding models (#21023)
Adrien Gallouët [Thu, 26 Mar 2026 14:37:18 +0000 (15:37 +0100)]
common : filter out imatrix when finding models (#21023)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (#20888)
ihb2032 [Thu, 26 Mar 2026 11:08:41 +0000 (19:08 +0800)]
fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (#20888)

Signed-off-by: ihb2032 <redacted>
4 weeks ago common : make LLAMA_CACHE the one cache for everything (#21009)
Adrien Gallouët [Thu, 26 Mar 2026 11:04:57 +0000 (12:04 +0100)]
common : make LLAMA_CACHE the one cache for everything (#21009)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago common : fix split model migration (#21019)
Adrien Gallouët [Thu, 26 Mar 2026 11:04:37 +0000 (12:04 +0100)]
common : fix split model migration (#21019)

Sadly the manifest does not list all required files; I honestly thought
it did.

Without the files listed we don't have their sha256, so if the first file
is valid and all the others have the correct size, we assume we are good
and do the migration...

Here my test:

    $ find /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ build/bin/llama-server
    ================================================================================
    WARNING: Migrating cache to HuggingFace cache directory
      Old cache: /home/angt/.cache/llama.cpp/
      New cache: /home/angt/.cache/huggingface/hub
    This one-time migration moves models previously downloaded with -hf
    from the legacy llama.cpp cache to the standard HuggingFace cache.
    Models downloaded with --model-url are not affected.
    ================================================================================
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    migrate_old_cache_to_hf_cache: migration complete, deleting manifest: /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ find /home/angt/.cache/llama.cpp /home/angt/.cache/huggingface
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/huggingface
    /home/angt/.cache/huggingface/hub
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/50d019817c2626eb9e8a41f361ff5bfa538757e6f708a3076cd3356354a75694
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/7b273e1dbfab11dc67dce479deb5923fef27c39cbf56a20b3a928a47b77dab3c
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs/main
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago ggml-cuda: Add NVFP4 dp4a kernel (#20644)
Michael Wand [Thu, 26 Mar 2026 08:54:03 +0000 (01:54 -0700)]
ggml-cuda: Add NVFP4 dp4a kernel (#20644)

Added check for dst_t to cuda_cast template for float
Restored ggml_cuda_ue4m3_to_fp32, changed vecdot ints to int32ts
Added CUDART/HIP Check and HIP/fp8 include
Added NVFP4 to Test-backend-ops
Added hip_fp8_e4m3 to __nv_fp8_e4m3 typedef

---------

Co-authored-by: Johannes Gäßler <redacted>
4 weeks ago imatrix : fix crash when using --show-statistics with zero counts (#19532)
SamareshSingh [Thu, 26 Mar 2026 07:14:36 +0000 (02:14 -0500)]
imatrix : fix crash when using --show-statistics with zero counts (#19532)

* imatrix: fix crash when using --show-statistics with zero counts

Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.

Fixes #19190

* imatrix: lower log level for zero-count skip message to DBG
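The guard described above amounts to checking the count (and the activation vector) before dividing (illustrative Python, not the imatrix code):

```python
def mean_activation(sums, count):
    # skip instead of dividing by zero on empty or zero-count entries
    if count == 0 or not sums:
        return None
    return sum(sums) / (count * len(sums))

assert mean_activation([2.0, 4.0], 1) == 3.0
assert mean_activation([], 0) is None       # empty activation vector
assert mean_activation([1.0], 0) is None    # zero count
```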

4 weeks ago CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094)
Yihao Wang [Thu, 26 Mar 2026 02:19:14 +0000 (19:19 -0700)]
CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094)

* Refactor CUDA 2D transpose implementation to support multiple kernel types and improve parameter handling

- Introduced a `conv2d_transpose_params` struct for better parameter management.
- Updated `conv2d_transpose_kernel` to be templated for different kernel types (float and half).
- Modified `ggml_cuda_conv_2d_transpose_p0` to handle both F16 and F32 kernel types.
- Enhanced test cases to validate functionality for both kernel types.

* Refactor test cases for 2D convolution transpose to support dynamic kernel types

- Updated `test_conv_transpose_2d` structure to improve parameter handling by reordering constructor arguments.
- Enhanced test case generation to iterate over kernel types, allowing for flexible testing of different configurations.
- Removed hardcoded kernel type instances in favor of a loop for better maintainability and scalability.

* Refactor ggml_compute_forward_conv_transpose_2d to support both F16 and F32 tensor types.

* Refactor conv2d transpose kernel to use a template for kernel type, enhancing flexibility for different data types.
Update test cases to include both F16 and F32 tensor types for comprehensive coverage.

* Update ggml/src/ggml-cuda/conv2d-transpose.cu

Co-authored-by: Aman Gupta <redacted>
* Update ggml/src/ggml-cpu/ggml-cpu.c

Co-authored-by: Aman Gupta <redacted>
* Refactor conv2d transpose implementation by removing the conv2d_transpose_params struct and dispatching with direct kernel launch.

* Enhance cpu conv2d transpose implementation by introducing a templated kernel type for improved flexibility with F16 and F32 data types.

---------

Co-authored-by: Aman Gupta <redacted>
4 weeks ago common : do not delete old files from the old cache when updating (#21000)
Adrien Gallouët [Wed, 25 Mar 2026 21:28:04 +0000 (22:28 +0100)]
common : do not delete old files from the old cache when updating (#21000)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks ago mtmd: Add DeepSeekOCR Support (#17400)
Saba Fallah [Wed, 25 Mar 2026 18:57:40 +0000 (19:57 +0100)]
mtmd: Add DeepSeekOCR Support (#17400)

* mtmd: llama.cpp DeepSeekOCR support
init commit

* loading sam tensors

* mtmd: fix vision model processing

* deepseek-ocr clip-vit model impl

* mtmd: add DeepSeek-OCR LM support with standard attention

* mtmd: successfully runs DeepSeek-OCR LM in llama-cli

* mtmd: Fix RoPE type for DeepSeek-OCR LM.

* loading LM
testing Vision model loading

* sam warmup working

* sam erroneous return corrected

* clip-vit:  corrected cls_embd concat

* clip-vit: model convert  qkv_proj split

* corrected combining of image encoders' results

* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model

* concat image_newline and image_seperator tokens

* visual_model warmup (technically) works

* window partitioning using standard ggml ops

* sam implementation without using CPU only ops

* clip: fixed warnings

* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr

* mtmd: fix get_rel_pos

* mtmd: fixed the wrong scalar for get_rel_pos

* image encoding technically works but the output can't be checked since image decoding fails

* mtmd: minor changes

* mtmd: add native resolution support

* - image encoding debugged
- issues fixed, mainly related to wrong config like n_patches etc.
- configs need to be corrected in the converter

* mtmd: correct token order

* - dynamic resizing
- changes concern PR https://github.com/sfallah/llama.cpp/pull/4

* mtmd: quick fix token order

* mtmd: fix dangling pointer

* mtmd: SAM numerically works

* mtmd: debug CLIP-L (vit_pre_ln)

* mtmd: debug CLIP-L & first working DeepSeek-OCR model

* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work

* mtmd: simplify SAM patch embedding

* mtmd: adapt Pillow image resizing function

* mtmd:  simplify DeepSeek-OCR dynamic resolution preprocessing

* mtmd: remove --dsocr-mode argument

* mtmd: refactor code & remove unused helper functions

* mtmd: fix tensor names for image newlines and view separator

* clean up

* reverting automatically removed spaces

* reverting automatically removed spaces

* mtmd: fixed bad ocr check in Deepseek2 (LM)

* mtmd: support combined QKV projection in build_vit

* using common build_attn in sam

* corrected code branch when flash-attn is disabled,
enabling usage of the --flash-attn option

* mtmd: minor fix

* minor formatting and style

* fixed flake8 lint issues

* minor editorconfig-check fixes

* minor editorconfig-check fixes

* mtmd: simplify get_rel_pos

* mtmd: make sam hparams configurable

* mtmd: add detailed comments for resize_bicubic_pillow

* mtmd: fixed wrong input setting

* mtmd: convert model in FP16

* mtmd: minor fix

* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template

* fix: test-1.jpg OCR issue with small (640) resolution
set base min-resolution (1024) and large max-resolution (1280) for dynamic resolution

* minor: editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn

* minor: editorconfig-check fix

* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR

* quick and (potentially) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909

* refactoring, one single builder function and static helpers

* added deepseek-ocr test to tests.sh

* minor formatting fixes

* check with fixed expected results

* minor formatting

* editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042

* minor
- added GLM-4.6V to big tests
- added missing deps for python test

* convert: minor fix

* mtmd: format code

* convert: quick fix

* convert: quick fix

* minor python formatting

* fixed merge build issue

* merge resolved
- fixed issues in convert
- tested several deepseek models

* minor fix

* minor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions

* - cleaning commented out code

* fix instability issues by reintroducing resize_bicubic_pillow

* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr

* rename fc_w --> mm_fc_w

* add links to OCR discussion

* cleaner loading code

* add missing .weight to some tensors

* add default jinja template (to be used by server)

* move test model to ggml-org

* rolling back upscale change

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: bluebread <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks agocommon : fix verbosity setup (#20989)
Adrien Gallouët [Wed, 25 Mar 2026 18:41:01 +0000 (19:41 +0100)]
common : fix verbosity setup (#20989)

The verbosity threshold was set at the end of common_params_parse_ex(),
after many other steps (like downloading files) had already run.

Signed-off-by: Adrien Gallouët <redacted>
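The fix amounts to applying the threshold as soon as the flag is parsed, not after parsing finishes. A hedged sketch with illustrative names only (`g_log_verbosity` and `parse_args` are not the actual llama.cpp symbols):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch only: set the verbosity threshold the moment the flag is seen,
// so the later steps of argument handling (e.g. file downloads) already
// log at the requested level. Names are illustrative, not the real
// common_params API.
static int g_log_verbosity = 0;

static void parse_args(const std::vector<std::string> & args) {
    for (size_t i = 0; i < args.size(); i++) {
        if (args[i] == "--verbosity" && i + 1 < args.size()) {
            g_log_verbosity = std::stoi(args[++i]); // applied immediately
        }
        // ... remaining arguments (and any downloads) run after this point
    }
}
```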
4 weeks agocommon : fix gguf selection in common_list_cached_models (#20996)
Adrien Gallouët [Wed, 25 Mar 2026 18:18:06 +0000 (19:18 +0100)]
common : fix gguf selection in common_list_cached_models (#20996)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks agoci : fix parsing of vgpr counts in hip-quality-check (#20987)
uvos [Wed, 25 Mar 2026 18:00:37 +0000 (19:00 +0100)]
ci : fix parsing of vgpr counts in hip-quality-check (#20987)

* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interleaved with another from a different process

* Return warning ignore

* obey PEP 8 double space before inline comments

* add # noqa: NP100 for other prints too

* Add script changes to cause autotrigger

4 weeks agomodel: codefuse-ai/F2LLM-v2 support
Saba Fallah [Wed, 25 Mar 2026 17:33:42 +0000 (18:33 +0100)]
model: codefuse-ai/F2LLM-v2 support

4 weeks agomodel : allow causal_attn and pooling_type on all architectures (#20973)
Dowon [Wed, 25 Mar 2026 17:12:38 +0000 (02:12 +0900)]
model : allow causal_attn and pooling_type on all architectures (#20973)

* models : allow causal_attn and pooling_type on all architectures

* fix: move location

4 weeks agosnapdragon: add missing features to WoS scripts to achieve parity with ADB scripts...
Aparna M P [Wed, 25 Mar 2026 16:43:12 +0000 (22:13 +0530)]
snapdragon: add missing features to WoS scripts to achieve parity with ADB scripts (#20884)

* Add missing features to WoS scripts to achieve parity with ADB scripts

* Fix line-ending in run-mtmd.ps1

Signed-off-by: Max Krasnyansky <redacted>
---------

Signed-off-by: Max Krasnyansky <redacted>
Co-authored-by: Max Krasnyansky <redacted>
4 weeks agoUse docker in build-android.yml (#20928)
Shreya Jain [Wed, 25 Mar 2026 16:36:27 +0000 (09:36 -0700)]
Use docker in build-android.yml (#20928)

* use docker instead of SDK separately

* fix whitespaces

* Update .github/workflows/build-android.yml

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Max Krasnyansky <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks agollama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984)
Aman Gupta [Wed, 25 Mar 2026 13:17:27 +0000 (21:17 +0800)]
llama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984)

4 weeks agoci: Allow ninja to be used during unit test (#20742)
Masato Nakasaka [Wed, 25 Mar 2026 13:00:49 +0000 (06:00 -0700)]
ci: Allow ninja to be used during unit test (#20742)

* Remove make dependency

* Added option to specify Ninja generator

* use ninja-build as default for several CI

* Revert "use ninja-build as default for several CI"

This reverts commit f552c4559b85e222aab37f654da764af4283fee7.

* changed to use plain strings rather than arrays

* Enabled ninja build by default for experimentation

* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners

Signed-off-by: Aaron Teo <redacted>
* Enabled ninja build by default on self-hosted envs for experimentation

* ci: revert generator to ninja instead of ninja multi-config

Signed-off-by: Aaron Teo <redacted>
* ci: install ninja-build for self-hosted workflows

Signed-off-by: Aaron Teo <redacted>
* ci: revert ninja from self-hosted runners

Signed-off-by: Aaron Teo <redacted>
* ci: missed one self-hosted step

Signed-off-by: Aaron Teo <redacted>
* ci: fix windows ci errors from an erroneous revert

Signed-off-by: Aaron Teo <redacted>
* Added explicit build types for Ninja

Also reverted some needless change

* ci: use ninja multi-config for vulkan-x64 build

Signed-off-by: Aaron Teo <redacted>
* added time command to measure build time

* Keeping some configs to use Ninja which show improvement

* minor fix based on review

Co-authored-by: Aaron Teo <redacted>
* ci: rm `time` from custom containers

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Aaron Teo <redacted>
Co-authored-by: Aaron Teo <redacted>
4 weeks agoci : disable self-hosted mac jobs (#20985)
Georgi Gerganov [Wed, 25 Mar 2026 12:46:40 +0000 (14:46 +0200)]
ci : disable self-hosted mac jobs (#20985)

4 weeks agojinja: fix macro with kwargs (#20960)
Xuan-Son Nguyen [Wed, 25 Mar 2026 11:22:48 +0000 (12:22 +0100)]
jinja: fix macro with kwargs (#20960)

* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks agogguf-split : clarify operation of gguf-split (#19749)
Francisco Herrera [Wed, 25 Mar 2026 11:12:50 +0000 (06:12 -0500)]
gguf-split : clarify operation of gguf-split (#19749)

* clarify operation of gguf-split

so that you don't have to find out by trial and error

* formatting

4 weeks agollama: fix llama-model-saver (#20503)
Johannes Gäßler [Wed, 25 Mar 2026 10:53:16 +0000 (11:53 +0100)]
llama: fix llama-model-saver (#20503)

* llama : add fd-based model loading via llama_model_load_from_fd

* llama : address review feedback for fd-based model loading

* llama : use FILE pointer instead of fd in public API

* llama : use FILE pointer consistently, address review feedback

* fixup

* fix tensor names

* fix llama-model-saver

* roundtrip tests

* fixup

* refactor tests

* fix prints

* fix model saving

* fix CI, disable Chameleon

* print seed

---------

Co-authored-by: Siddhesh2377 <redacted>
4 weeks agowebui: Fix editing assistant message without branching (#20944)
Aleksander Grygier [Wed, 25 Mar 2026 10:47:33 +0000 (11:47 +0100)]
webui: Fix editing assistant message without branching (#20944)

* fix: Editing assistant response without branching

* chore: update webui build output

4 weeks agoAdd SLEEPING status to the WebUI model selector (#20949)
Pascal [Wed, 25 Mar 2026 10:02:32 +0000 (11:02 +0100)]
Add SLEEPING status to the WebUI model selector (#20949)

* webui: handle sleeping model status, fix favourite -> favorite

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <redacted>
* webui: fix optional event parameter in sleeping model onclick

* typo

* webui: restore orange sleeping indicator dot with hover unload

* chore: update webui build output

* webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore

* chore: update webui build output

* webui: fix favourite -> favorite (UK -> US spelling) everywhere

Address review feedback from WhyNotHugo

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>
4 weeks agoandroid : fix-pointer-dangling (#20974)
yikechayedan [Wed, 25 Mar 2026 09:51:26 +0000 (17:51 +0800)]
android : fix-pointer-dangling (#20974)

4 weeks agosycl : fix wrong variable check by assert (#20903)
Neo Zhang [Wed, 25 Mar 2026 09:48:37 +0000 (17:48 +0800)]
sycl : fix wrong variable check by assert (#20903)

* fix wrong variable check by assert

* use GGML api

4 weeks agoci : bump gguf publish python version (#20982)
Sigbjørn Skjæret [Wed, 25 Mar 2026 09:04:59 +0000 (10:04 +0100)]
ci : bump gguf publish python version (#20982)

4 weeks agoci : limit requirements versions (#20980)
Sigbjørn Skjæret [Wed, 25 Mar 2026 08:55:37 +0000 (09:55 +0100)]
ci : limit requirements versions (#20980)

* set requests version

* limit versions outside requirements

4 weeks agoconvert : register Qwen3Model architecture (#20967)
Dowon [Wed, 25 Mar 2026 08:37:59 +0000 (17:37 +0900)]
convert : register Qwen3Model architecture (#20967)

4 weeks agodocs : Update OpenVINO backend docs (#20968)
Ravi Panchumarthy [Wed, 25 Mar 2026 08:33:51 +0000 (01:33 -0700)]
docs : Update OpenVINO backend docs (#20968)

* OpenVINO doc updates

* Update docs/backend/OPENVINO.md

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>
4 weeks agomodels : move the token embedding norms to the first layer (#20943)
Georgi Gerganov [Tue, 24 Mar 2026 15:00:30 +0000 (17:00 +0200)]
models : move the token embedding norms to the first layer (#20943)

* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing

4 weeks agoggml-backend: re-enable graph reuse with pipeline parallelism (#20927)
Aman Gupta [Tue, 24 Mar 2026 12:47:00 +0000 (20:47 +0800)]
ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)

4 weeks agovendor : update cpp-httplib to 0.39.0 (#20933)
Alessandro de Oliveira Faria (A.K.A.CABELO) [Tue, 24 Mar 2026 12:33:33 +0000 (09:33 -0300)]
vendor : update cpp-httplib to 0.39.0 (#20933)

4 weeks agocommon : fix get_gguf_split_info (#20946)
Adrien Gallouët [Tue, 24 Mar 2026 12:33:14 +0000 (13:33 +0100)]
common : fix get_gguf_split_info (#20946)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks agoWebUI: fix edit msg form textarea height (#20830)
BlueMöhre [Tue, 24 Mar 2026 12:17:45 +0000 (13:17 +0100)]
WebUI: fix edit msg form textarea height (#20830)

* autoresize textarea on mount

* allow textarea to grow to same height as rendered messages

* add UI build file

4 weeks agoreadme : clarify MODEL_ENDPOINT usage (#20941)
Adrien Gallouët [Tue, 24 Mar 2026 09:35:07 +0000 (10:35 +0100)]
readme : clarify MODEL_ENDPOINT usage (#20941)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks agocommon : add a WARNING for HF cache migration (#20935)
Adrien Gallouët [Tue, 24 Mar 2026 08:24:39 +0000 (09:24 +0100)]
common : add a WARNING for HF cache migration (#20935)

Signed-off-by: Adrien Gallouët <redacted>
4 weeks agometal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)
nuri [Tue, 24 Mar 2026 08:13:07 +0000 (17:13 +0900)]
metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)

Co-authored-by: nryoo <redacted>
4 weeks agometal : add FA instantiations for HSK=512, HSV=512 (#20902)
Georgi Gerganov [Tue, 24 Mar 2026 08:03:09 +0000 (10:03 +0200)]
metal : add FA instantiations for HSK=512, HSV=512 (#20902)

4 weeks agoissues: add openvino backends (#20932)
Aaron Teo [Tue, 24 Mar 2026 06:41:10 +0000 (14:41 +0800)]
issues: add openvino backends (#20932)

Signed-off-by: Aaron Teo <redacted>
4 weeks agocommon : add standard Hugging Face cache support (#20775)
Adrien Gallouët [Tue, 24 Mar 2026 06:30:33 +0000 (07:30 +0100)]
common : add standard Hugging Face cache support (#20775)

* common : add standard Hugging Face cache support

- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup

Signed-off-by: Adrien Gallouët <redacted>
* Check with the quant tag

Signed-off-by: Adrien Gallouët <redacted>
* Cleanup

Signed-off-by: Adrien Gallouët <redacted>
* Improve error handling and report API errors

Signed-off-by: Adrien Gallouët <redacted>
* Restore common_cached_model_info and align mmproj filtering

Signed-off-by: Adrien Gallouët <redacted>
* Prefer main when getting cached ref

Signed-off-by: Adrien Gallouët <redacted>
* Use cached files when HF API fails

Signed-off-by: Adrien Gallouët <redacted>
* Use final_path..

Signed-off-by: Adrien Gallouët <redacted>
* Check all inputs

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
4 weeks agollama-fit: fix regex pattern for gate_up tensors (#20910)
Aman Gupta [Tue, 24 Mar 2026 04:57:57 +0000 (12:57 +0800)]
llama-fit: fix regex pattern for gate_up tensors (#20910)

* llama-fit: fix regex pattern for gate_up tensors

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
4 weeks agocommon : replace wrap_for_generation with a prefix convenience function and fix gpt...
Aldehir Rojas [Tue, 24 Mar 2026 03:21:47 +0000 (22:21 -0500)]
common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912)

4 weeks agohexagon: general DMA and Binary Op fixes for large strides (#20918)
Max Krasnyansky [Mon, 23 Mar 2026 22:33:49 +0000 (15:33 -0700)]
hexagon: general DMA and Binary Op fixes for large strides (#20918)

* hex-dma: make chained dma the default to handle newer models

This also includes some new instrumentation that we can remove later.

* hexagon: add uint32 dump helper

* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv

ssm-conv uses the HVX gather instruction, and that instruction cannot handle cases where the base+offset
spans page boundaries.

* hexagon: update ssm-conv to make base-addr compute a bit easier to read

* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)

* hex-bin: fix incorrect stride logic

* hexagon: make sure repack buffs are dumped for verbose > 2

* hex-bin: consistently use dma_queue_push even for dummy dst transactions

* hex-dma: start using 2d-wide mode on v75 and up

This removes the need to deal with the 16-bit limitation for the strides.

* hex-bin: cleanup kernel selection logic

* hex-bin: cleanup binary op core and fix transposed tensor handling

* snapdragon: update run-bench to use larger ubatch and fa-on
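The single-page constraint mentioned for the ssm-conv gather can be illustrated with a small check. This is a sketch assuming a 4 KiB page; the real VTCM page size and the actual allocator logic may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the constraint described above: an HVX gather cannot handle a
// base+offset access that spans a page boundary, so buffers that feed it
// must sit entirely within one page. PAGE_SIZE is illustrative.
static const uintptr_t PAGE_SIZE = 4096;

static bool fits_in_one_page(uintptr_t base, size_t offset, size_t len) {
    const uintptr_t first = base + offset;
    const uintptr_t last  = first + len - 1;
    // same page iff the page indices of the first and last byte match
    return (first / PAGE_SIZE) == (last / PAGE_SIZE);
}
```

A single-page VTCM allocation makes this check hold by construction for every offset inside the buffer.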

4 weeks agoAdd codeowners for scripts/snapdragon and docs/snapdragon (#20915)
Max Krasnyansky [Mon, 23 Mar 2026 21:57:18 +0000 (14:57 -0700)]
Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)

* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon

4 weeks agoopencl: add q6_K gemm and gemv kernels for Adreno (#20089)
lhez [Mon, 23 Mar 2026 19:44:18 +0000 (12:44 -0700)]
opencl: add q6_K gemm and gemv kernels for Adreno (#20089)

* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast

4 weeks agorpc : RCE patch (#20908)
las7 [Mon, 23 Mar 2026 17:54:57 +0000 (10:54 -0700)]
rpc : RCE patch (#20908)

4 weeks agocontrib: add "Requirements" section to PR template (#20841)
Xuan-Son Nguyen [Mon, 23 Mar 2026 15:59:02 +0000 (16:59 +0100)]
contrib: add "Requirements" section to PR template (#20841)

* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>