git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
4 weeks ago vulkan: mark IM2COL as supporting non-contig (#13783)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (#13783)

4 weeks ago CANN: Add the basic supports of Flash Attention kernel (#13627)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (#13627)

* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline

4 weeks ago `server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_think...
Olivier Chafik [Sun, 25 May 2025 23:30:51 +0000 (00:30 +0100)]
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks ago webui : bump max upload file size to 500MB (#13779)
Xuan-Son Nguyen [Sun, 25 May 2025 17:02:18 +0000 (19:02 +0200)]
webui : bump max upload file size to 500MB (#13779)

4 weeks ago tests : improve UGM tokenizer test coverage (#13773)
Sigbjørn Skjæret [Sun, 25 May 2025 14:22:29 +0000 (16:22 +0200)]
tests : improve UGM tokenizer test coverage (#13773)

4 weeks ago kv-cache : rework kv_cell (#13706)
Georgi Gerganov [Sun, 25 May 2025 13:34:36 +0000 (16:34 +0300)]
kv-cache : rework kv_cell (#13706)

* kv-cache : rework kv_cell

ggml-ci

* kv-cells : use "shift" instead of "delta" consistently

ggml-ci

* llama : add llama_max_parallel_sequences()

ggml-ci

* kv-cells : update comments [no ci]

* context : fail upon construction if sequences exceed max value

ggml-ci

* kv-cells : get_pos() -> pos_get() + comments

ggml-ci

* kv-cells : fix tracking of "used" cells

ggml-ci

4 weeks ago rpc : Fix build on OpenBSD (#13541)
Percy Piper [Sun, 25 May 2025 12:35:53 +0000 (13:35 +0100)]
rpc : Fix build on OpenBSD (#13541)

4 weeks ago mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
Xuan-Son Nguyen [Sun, 25 May 2025 12:06:32 +0000 (14:06 +0200)]
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)

* mtmd : add Qwen2-Audio support

* small clean up

* update discussion link

* clarify mtmd_get_output_embd

* clarification in multimodal.md

* fix ultravox bug

* ggml_cont

4 weeks ago docs : add Moondream2 pre-quantized link (#13745)
ddpasa [Sun, 25 May 2025 12:04:49 +0000 (14:04 +0200)]
docs : add Moondream2 pre-quantized link (#13745)

* Multimodal: Added Moondream2 model and fixed ggml.org link

* Apply suggestions from code review

---------

Co-authored-by: name <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks ago server: fix/test add_generation_prompt (#13770) upstream/latest
Olivier Chafik [Sun, 25 May 2025 09:45:49 +0000 (10:45 +0100)]
server: fix/test add_generation_prompt (#13770)

Co-authored-by: ochafik <redacted>
4 weeks ago llama : add support for Qwen3 MoE tied word embeddings (#13768)
Piotr Jasiukajtis [Sun, 25 May 2025 08:29:43 +0000 (10:29 +0200)]
llama : add support for Qwen3 MoE tied word embeddings (#13768)

5 weeks ago SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)

Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci

5 weeks ago `server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
Olivier Chafik [Sun, 25 May 2025 00:48:08 +0000 (01:48 +0100)]
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)

* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Olivier Chafik <redacted>
5 weeks ago releases : bundle llvm omp library in windows release (#13763)
Diego Devesa [Sat, 24 May 2025 22:55:16 +0000 (15:55 -0700)]
releases : bundle llvm omp library in windows release (#13763)

5 weeks ago releases : enable openmp in windows cpu backend build (#13756)
Diego Devesa [Sat, 24 May 2025 20:27:03 +0000 (13:27 -0700)]
releases : enable openmp in windows cpu backend build (#13756)

5 weeks ago ggml-cpu : set openmp wait time if not set (#13758)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (#13758)

5 weeks ago Move GLM4 f32 attention fix to the correct function (#13750)
0cc4m [Sat, 24 May 2025 14:49:12 +0000 (16:49 +0200)]
Move GLM4 f32 attention fix to the correct function (#13750)

5 weeks ago ggml : add ggml_gelu_erf() CUDA kernel (#13719)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (#13719)

* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon

5 weeks ago vocab : fix ugm tokenizer precision (#13743)
Sigbjørn Skjæret [Sat, 24 May 2025 10:29:09 +0000 (12:29 +0200)]
vocab : fix ugm tokenizer precision (#13743)

5 weeks ago CUDA: fix race condition in FA vector kernels (#13742)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (#13742)

5 weeks ago ci : enable winget package updates (#13734)
Diego Devesa [Fri, 23 May 2025 20:14:00 +0000 (13:14 -0700)]
ci : enable winget package updates (#13734)

5 weeks ago ci : add winget package updater (#13732)
Diego Devesa [Fri, 23 May 2025 20:09:38 +0000 (13:09 -0700)]
ci : add winget package updater (#13732)

5 weeks ago hparams : initialize arrays (#13728)
Georgi Gerganov [Fri, 23 May 2025 17:16:13 +0000 (20:16 +0300)]
hparams : initialize arrays (#13728)

ggml-ci

5 weeks ago llama : allow custom list of swa_layers (#13726)
Xuan-Son Nguyen [Fri, 23 May 2025 15:07:04 +0000 (17:07 +0200)]
llama : allow custom list of swa_layers (#13726)

5 weeks ago server : support audio input (#13714)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)

* server : support audio input

* add audio support on webui

5 weeks ago CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)

* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <redacted>
* codestyle adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
5 weeks ago ggml : fix the order of ggml_unary_op (#13718)
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)

5 weeks ago vulkan: support CPY from any type to itself (#13695)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
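
The idea described above can be sketched roughly as follows. This is a minimal illustration, not code from the Vulkan backend: `raw_copy_plan` and `plan_same_type_copy` are hypothetical names, and the choice between 4-byte and 2-byte units is an assumption about how "scale the number of elements" might be applied. A same-type copy does not need to decode the data, so the buffer can be treated as a run of f32-sized or f16-sized units.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration; not taken from the Vulkan backend.
struct raw_copy_plan {
    std::size_t unit_size; // bytes moved per shader "element": 4 (f32 path) or 2 (f16 path)
    std::size_t n_units;   // element count handed to the copy shader
};

// n_elements:  logical element count of the tensor
// block_elems: elements stored per block of the type
// block_bytes: bytes occupied by one block of the type
inline raw_copy_plan plan_same_type_copy(std::size_t n_elements,
                                         std::size_t block_elems,
                                         std::size_t block_bytes) {
    assert(block_elems > 0 && n_elements % block_elems == 0);
    const std::size_t n_bytes = n_elements / block_elems * block_bytes;
    // Assumption of this sketch: the total size is a multiple of 2 bytes.
    assert(n_bytes % 2 == 0);
    if (n_bytes % 4 == 0) {
        return {4, n_bytes / 4}; // reuse the f32 copy path
    }
    return {2, n_bytes / 2};     // fall back to the f16 copy path
}
```

For example, with a block layout of 32 elements in 18 bytes (used here only as sample input), 64 elements occupy 36 bytes and would be copied as nine 4-byte units, while 32 elements occupy 18 bytes and would be copied as nine 2-byte units.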

5 weeks ago vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (...
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)

5 weeks ago use LOG_WARN to replace `std::cerr` (#13657)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)

5 weeks ago release : fix windows hip release (#13707)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)

* release : fix windows hip release

* make single hip release with multiple targets

5 weeks ago tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)

ggml-ci

5 weeks ago mtmd : add ultravox audio input (#13623)
Xuan-Son Nguyen [Thu, 22 May 2025 18:42:48 +0000 (20:42 +0200)]
mtmd : add ultravox audio input (#13623)

* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()

5 weeks ago common: Include torch package for s390x (#13699)
Aaron Teo [Thu, 22 May 2025 18:31:29 +0000 (02:31 +0800)]
common: Include torch package for s390x (#13699)

* common: update requirements.txt to include pytorch nightly for s390x

Signed-off-by: Aaron Teo <redacted>
* common: fix torch installation via pip for s390x

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
5 weeks ago server : pad small embedding batches (#13692)
Georgi Gerganov [Thu, 22 May 2025 13:33:39 +0000 (16:33 +0300)]
server : pad small embedding batches (#13692)

ggml-ci

5 weeks ago gguf-py : correct charsmap parameter typing (#13701)
Sigbjørn Skjæret [Thu, 22 May 2025 12:25:05 +0000 (14:25 +0200)]
gguf-py : correct charsmap parameter typing (#13701)

5 weeks ago sycl : Remove waits from function calls (#13702)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (#13702)

* removes the waits in async memcpy functions

5 weeks ago SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)

Currently, when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` with SYCL
on a CUDA backend, two operations throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
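
A compatibility check of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the method added to `ggml-sycl.cpp`: `op_kind`, `op_blocks_while_recording`, and `can_use_graph` are hypothetical names, and the real backend inspects ggml graph nodes rather than a plain enum.

```cpp
#include <cassert>
#include <vector>

// Hypothetical op tags for illustration only.
enum class op_kind { ADD, MUL_MAT, CONCAT, MUL_MAT_ID };

// The two operations named above issue blocking waits, which are not
// allowed while a queue is being recorded into a graph.
inline bool op_blocks_while_recording(op_kind op) {
    return op == op_kind::CONCAT || op == op_kind::MUL_MAT_ID;
}

// Returns true only if the user asked for graphs AND every node is safe
// to record. A single unsupported node disables the graph path.
inline bool can_use_graph(const std::vector<op_kind> & nodes, bool graph_requested) {
    if (!graph_requested) {
        return false;
    }
    for (op_kind op : nodes) {
        if (op_blocks_while_recording(op)) {
            return false;
        }
    }
    return true;
}
```

The point of the design is that the user's request is treated as a preference: when a graph contains an unsupported node, execution falls back to the non-graph path instead of throwing during recording.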

5 weeks ago opencl: Add support for multiple devices (#12622)
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (#12622)

* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.

5 weeks ago opencl: fix couple crashes (#12795)
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (#12795)

* opencl: fix couple crashes

* fix kernel launches failing on devices that do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let the driver choose the
  work-group sizes). This patch does not cover everything, just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failing because `cl_buffer_region::origin` was
  not aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
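
The two fixes in this commit can be illustrated with the following sketch. The helper names are hypothetical and the code does not come from the OpenCL backend. Unlike the commit text, which sets `local_work_size` to NULL whenever non-uniform work-groups are unsupported, this variant falls back to NULL only when the global size is not a multiple of the local size; that refinement is an assumption of the sketch. Rounding the origin down is likewise just one possible way to satisfy the alignment requirement, and the OpenCL specification reports `CL_DEVICE_MEM_BASE_ADDR_ALIGN` in bits, so a caller would divide by 8 first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers; names are not from the OpenCL backend.

// Returns the pointer to pass as local_work_size when enqueuing a kernel.
// nullptr lets the driver choose the work-group sizes.
inline const std::size_t * pick_local_work_size(const std::size_t * global,
                                                const std::size_t * local,
                                                std::size_t dims,
                                                bool non_uniform_supported) {
    if (non_uniform_supported) {
        return local;
    }
    for (std::size_t i = 0; i < dims; ++i) {
        if (global[i] % local[i] != 0) {
            return nullptr; // a non-uniform launch would be rejected
        }
    }
    return local;
}

// Rounds a sub-buffer origin down to a multiple of the alignment (in bytes).
// The caller must then address its data at (origin - aligned origin) inside
// the sub-buffer.
inline std::size_t align_origin_down(std::size_t origin, std::size_t align_bytes) {
    assert(align_bytes > 0);
    return origin - origin % align_bytes;
}
```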

5 weeks ago releases : build CPU backend separately (windows) (#13642)
Diego Devesa [Wed, 21 May 2025 20:09:57 +0000 (13:09 -0700)]
releases : build CPU backend separately (windows) (#13642)

5 weeks ago hparams : support models for which all layers use SWA (#13682)
Georgi Gerganov [Wed, 21 May 2025 17:00:49 +0000 (20:00 +0300)]
hparams : support models for which all layers use SWA (#13682)

ggml-ci

5 weeks ago server : improve error reporting (#13680)
Georgi Gerganov [Wed, 21 May 2025 16:46:56 +0000 (19:46 +0300)]
server : improve error reporting (#13680)

5 weeks ago convert : add qwen2vl support for unsloth merges (#13686)
antichristHater [Wed, 21 May 2025 16:40:35 +0000 (19:40 +0300)]
convert : add qwen2vl support for unsloth merges (#13686)

5 weeks ago examples : switch retrieval to llama_encode (#13685)
Sigbjørn Skjæret [Wed, 21 May 2025 14:57:38 +0000 (16:57 +0200)]
examples : switch retrieval to llama_encode (#13685)

* switch retrieval to llama_encode

* enable --no-warmup for retrieval

5 weeks ago gguf-py : display the invalid gguf type (#13687)
Emmanuel Ferdman [Wed, 21 May 2025 14:33:54 +0000 (17:33 +0300)]
gguf-py : display the invalid gguf type (#13687)

Signed-off-by: Emmanuel Ferdman <redacted>
5 weeks ago ggml : add ggml_gelu_erf() (#13667)
Xuan-Son Nguyen [Wed, 21 May 2025 14:26:33 +0000 (16:26 +0200)]
ggml : add ggml_gelu_erf() (#13667)

* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggestions

* revert naming order

5 weeks ago server : Add the endpoints /api/tags and /api/chat (#13659)
Robin Davidsson [Wed, 21 May 2025 13:15:27 +0000 (15:15 +0200)]
server : Add the endpoints /api/tags and /api/chat (#13659)

* Add the endpoints /api/tags and /api/chat

Add the endpoints /api/tags and /api/chat, and improved the model metadata response

* Remove trailing whitespaces

* Removed code that is not needed for copilot to work.

5 weeks ago server : fix first message identification (#13634)
Dorin-Andrei Geman [Wed, 21 May 2025 13:07:57 +0000 (16:07 +0300)]
server : fix first message identification (#13634)

* server : fix first message identification

When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.

Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
* server : Fix checks for first role message for stream=True

Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
---------

Signed-off-by: Dorin Geman <redacted>
Co-authored-by: Piotr Stankiewicz <redacted>
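
The behaviour this commit describes, where the first streamed message carries the assistant role, can be sketched as below. The types and the function are hypothetical and do not reflect the server's actual structures; the sketch assumes the "first message" is identified by counting chunks already sent rather than by inspecting the chunk's content.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical types for illustration; not the server's real structures.
struct stream_delta {
    bool        has_role; // true only for the first chunk of a completion
    std::string role;     // "assistant" when has_role is set, otherwise empty
    std::string content;
};

struct stream_state {
    std::size_t n_sent = 0; // chunks already sent for this completion
};

// Builds the next streamed delta. The role is attached based on the chunk
// index, so it is present even if the first chunk has empty content.
inline stream_delta next_delta(stream_state & st, const std::string & content) {
    const bool first = (st.n_sent == 0);
    ++st.n_sent;
    return stream_delta{first, first ? "assistant" : "", content};
}
```

Keying the decision on the chunk counter means a client that waits for the role in the first message receives it regardless of what the model emits first.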
5 weeks ago kv-cache : simplify the interface (#13660)
Georgi Gerganov [Wed, 21 May 2025 12:11:13 +0000 (15:11 +0300)]
kv-cache : simplify the interface (#13660)

* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci

5 weeks ago model : disable SWA for Phi models (#13676)
Georgi Gerganov [Wed, 21 May 2025 10:09:21 +0000 (13:09 +0300)]
model : disable SWA for Phi models (#13676)

* model : disable SWA for Phi models

ggml-ci

* model : update warning message

* model : print warning only if n_swa > 0

* model : fix typo

5 weeks ago musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accele...
R0CKSTAR [Wed, 21 May 2025 01:58:49 +0000 (09:58 +0800)]
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)

* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <redacted>
* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <redacted>
* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <redacted>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
5 weeks ago vulkan: fix warnings (#13626)
Eve [Tue, 20 May 2025 21:35:16 +0000 (21:35 +0000)]
vulkan: fix warnings (#13626)

* small fixes

* remove ifdef

5 weeks ago mtmd-helper : bug fix to token batching in mtmd (#13650)
l3utterfly [Tue, 20 May 2025 16:55:30 +0000 (00:55 +0800)]
mtmd-helper : bug fix to token batching in mtmd (#13650)

* Update mtmd-helper.cpp

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
5 weeks ago model : fix llama4 graph (#13663)
Georgi Gerganov [Tue, 20 May 2025 16:21:04 +0000 (19:21 +0300)]
model : fix llama4 graph (#13663)

ggml-ci

5 weeks ago llama : remove llama_kv_cache_view API + remove deprecated (#13653)
Georgi Gerganov [Tue, 20 May 2025 13:13:16 +0000 (16:13 +0300)]
llama : remove llama_kv_cache_view API + remove deprecated (#13653)

ggml-ci

5 weeks ago CUDA: skip fully masked-out KV in FA vec kernel (#13584)
Johannes Gäßler [Tue, 20 May 2025 12:45:07 +0000 (14:45 +0200)]
CUDA: skip fully masked-out KV in FA vec kernel (#13584)

* CUDA: skip fully masked-out KV in FA vec kernel

5 weeks ago tests : avoid github urls due to throttling (#13654)
Sigbjørn Skjæret [Tue, 20 May 2025 10:03:17 +0000 (12:03 +0200)]
tests : avoid github urls due to throttling (#13654)

5 weeks ago sycl: disable reorder for sycl mulmat (#13536)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (#13536)

5 weeks ago Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity...
0cc4m [Tue, 20 May 2025 08:11:56 +0000 (10:11 +0200)]
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)

5 weeks ago metal : fix typo in FA kernel comments (#13651)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (#13651)

5 weeks ago kv-cache : add SWA support (#13194)
Georgi Gerganov [Tue, 20 May 2025 05:05:46 +0000 (08:05 +0300)]
kv-cache : add SWA support (#13194)

* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci

5 weeks ago CANN: Update CANN model support (#13162)
Xinpeng Dou [Tue, 20 May 2025 03:43:43 +0000 (11:43 +0800)]
CANN: Update CANN model support (#13162)

* Update CANN model support status

* Update of model support

* update

* update

* update

* fix format of CANN.md

* fix format of CANN.md

* fix format of CANN.md

5 weeks ago sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)

* Remove mmap workaround on windows

After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.

* Update llama-bench README

SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag

5 weeks ago common : add load_progress_callback (#13617)
psocolovsky [Mon, 19 May 2025 19:17:36 +0000 (21:17 +0200)]
common : add load_progress_callback (#13617)

5 weeks ago Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence...
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)

5 weeks ago sycl : backend documentation review (#13544)
Alberto Cabrera Pérez [Mon, 19 May 2025 13:38:20 +0000 (14:38 +0100)]
sycl : backend documentation review (#13544)

* sycl: reviewing and updating docs

* Updates Runtime error codes

* Improves OOM troubleshooting entry

* Added a llama 3 sample

* Updated supported models

* Updated releases table

5 weeks ago mtmd : add vision support for llama 4 (#13282)
Xuan-Son Nguyen [Mon, 19 May 2025 11:04:14 +0000 (13:04 +0200)]
mtmd : add vision support for llama 4 (#13282)

* wip llama 4 conversion

* rm redundant __init__

* fix conversion

* fix conversion

* test impl

* try this

* reshape patch_embeddings_0

* fix view

* rm ffn_post_norm

* cgraph ok

* f32 for pos embd

* add image marker tokens

* Llama4UnfoldConvolution

* correct pixel shuffle

* fix merge conflicts

* correct

* add debug_graph

* logits matched, but it still perceives the image incorrectly

* fix style

* add image_grid_pinpoints

* handle llama 4 preprocessing

* rm load_image_size

* rm unused line

* fix

* small fix 2

* add test & docs

* fix llava-1.6 test

* test: add notion of huge models

* add comment

* add warn about degraded quality

5 weeks ago ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)
Alberto Cabrera Pérez [Mon, 19 May 2025 10:46:09 +0000 (11:46 +0100)]
ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)

5 weeks ago sync : ggml
Georgi Gerganov [Mon, 19 May 2025 09:50:29 +0000 (12:50 +0300)]
sync : ggml

ggml-ci

5 weeks ago mnist: fix segmentation fault (ggml/1227)
Johannes Gäßler [Mon, 19 May 2025 07:33:35 +0000 (09:33 +0200)]
mnist: fix segmentation fault (ggml/1227)

5 weeks ago ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
Diego Devesa [Mon, 19 May 2025 01:30:13 +0000 (18:30 -0700)]
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)

5 weeks ago ggml : Fix missing backtrace on Linux (ggml/1228)
Daniel Tang [Sat, 17 May 2025 23:06:26 +0000 (19:06 -0400)]
ggml : Fix missing backtrace on Linux (ggml/1228)

* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols

5 weeks ago fix: check model pointer validity before use (#13631)
Nick [Mon, 19 May 2025 10:25:41 +0000 (18:25 +0800)]
fix: check model pointer validity before use (#13631)

5 weeks ago CANN: Support MOE Model MUL_MAT_ID (#13042)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (#13042)

Signed-off-by: noemotiovon <redacted>
6 weeks ago server : added --no-prefill-assistant flag (#13608)
Isaac McFadyen [Sat, 17 May 2025 21:59:48 +0000 (17:59 -0400)]
server : added --no-prefill-assistant flag (#13608)

* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md

6 weeks ago cmake: use the current build config for vulkan-shaders-gen (#13595)
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (#13595)

* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`

6 weeks ago parallel : add option for non-shared and larger prompts (#13598)
Georgi Gerganov [Sat, 17 May 2025 09:58:55 +0000 (12:58 +0300)]
parallel : add option for non-shared and larger prompts (#13598)

* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci

6 weeks ago vulkan: move common FA code to flash_attn_base.comp (#13556)
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (#13556)

* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix

6 weeks ago vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)

6 weeks ago llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)
Z [Fri, 16 May 2025 20:56:28 +0000 (14:56 -0600)]
llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)

6 weeks ago server : do not return error out of context (with ctx shift disabled) (#13577)
Xuan-Son Nguyen [Fri, 16 May 2025 19:50:00 +0000 (21:50 +0200)]
server : do not return error out of context (with ctx shift disabled) (#13577)

6 weeks ago webui : improve accessibility for visually impaired people (#13551)
Xuan-Son Nguyen [Fri, 16 May 2025 19:49:01 +0000 (21:49 +0200)]
webui : improve accessibility for visually impaired people (#13551)

* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content

6 weeks ago readme : add list of dependencies and their license (#13591)
Xuan-Son Nguyen [Fri, 16 May 2025 18:04:18 +0000 (20:04 +0200)]
readme : add list of dependencies and their license (#13591)

6 weeks ago releases : use arm version of curl for arm releases (#13592)
Diego Devesa [Fri, 16 May 2025 17:36:51 +0000 (10:36 -0700)]
releases : use arm version of curl for arm releases (#13592)

6 weeks ago metal : add FA-vec kernel for head size 64 (#13583)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (#13583)

ggml-ci

6 weeks ago llama : print hint when loading a model when no backends are loaded (#13589)
Diego Devesa [Fri, 16 May 2025 14:38:07 +0000 (07:38 -0700)]
llama : print hint when loading a model when no backends are loaded (#13589)

6 weeks ago ci : add ppc64el to build-linux-cross (#13575)
Sigbjørn Skjæret [Fri, 16 May 2025 12:54:23 +0000 (14:54 +0200)]
ci : add ppc64el to build-linux-cross (#13575)

6 weeks ago sycl : fixed compilation warnings (#13582)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (#13582)

6 weeks ago minja: sync (qwen3) (#13573)
Olivier Chafik [Thu, 15 May 2025 22:29:10 +0000 (23:29 +0100)]
minja: sync (qwen3) (#13573)

* minja: sync https://github.com/google/minja/commit/f06140fa52fd140fe38e531ec373d8dc9c86aa06

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <redacted>
6 weeks ago gguf : use ggml log system (#13571)
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (#13571)

* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages

6 weeks ago gguf-py : fix disconnect-before-connect in editor-gui (#13569)
Daniel Tang [Thu, 15 May 2025 16:47:10 +0000 (12:47 -0400)]
gguf-py : fix disconnect-before-connect in editor-gui (#13569)

The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.

6 weeks ago convert : fix conversion for llama 4 (#13567)
Xuan-Son Nguyen [Thu, 15 May 2025 15:40:07 +0000 (17:40 +0200)]
convert : fix conversion for llama 4 (#13567)

6 weeks ago sycl: simplify bin_bcast_kernel (#13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (#13383)

6 weeks ago sycl: reordered Q4_K MMVQ (#13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (#13109)

6 weeks ago sycl: use oneDNN for matrices multiplication (#12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (#12972)

6 weeks ago llama-bench : fix -ot with dl backends (#13563)
Diego Devesa [Thu, 15 May 2025 13:46:55 +0000 (06:46 -0700)]
llama-bench : fix -ot with dl backends (#13563)

6 weeks ago webui : handle PDF input (as text or image) + convert pasted long content to file...
Xuan-Son Nguyen [Thu, 15 May 2025 12:24:50 +0000 (14:24 +0200)]
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)

* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages

6 weeks ago server : proper error handling for missing elements in messages array (OpenAI compati...
Piotr Wilkin (ilintar) [Thu, 15 May 2025 06:40:58 +0000 (08:40 +0200)]
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)

6 weeks ago bench : handle decode errors (#13548)
Georgi Gerganov [Thu, 15 May 2025 02:57:02 +0000 (05:57 +0300)]
bench : handle decode errors (#13548)

ggml-ci