git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
4 months ago  server : support audio input (#13714)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)

* server : support audio input

* add audio support on webui

4 months ago  CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)

* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <redacted>
* codestyle adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
4 months ago  ggml : fix the order of ggml_unary_op (#13718)
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)

4 months ago  vulkan: support CPY from any type to itself (#13695)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.

4 months ago  vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)

4 months ago  use LOG_WARN to replace `std::cerr` (#13657)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)

4 months ago  release : fix windows hip release (#13707)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)

* release : fix windows hip release

* make single hip release with multiple targets

4 months ago  tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)

ggml-ci

4 months ago  mtmd : add ultravox audio input (#13623)
Xuan-Son Nguyen [Thu, 22 May 2025 18:42:48 +0000 (20:42 +0200)]
mtmd : add ultravox audio input (#13623)

* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()

4 months ago  common: Include torch package for s390x (#13699)
Aaron Teo [Thu, 22 May 2025 18:31:29 +0000 (02:31 +0800)]
common: Include torch package for s390x (#13699)

* common: update requirements.txt to include pytorch nightly for s390x

Signed-off-by: Aaron Teo <redacted>
* common: fix torch installation via pip for s390x

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
4 months ago  server : pad small embedding batches (#13692)
Georgi Gerganov [Thu, 22 May 2025 13:33:39 +0000 (16:33 +0300)]
server : pad small embedding batches (#13692)

ggml-ci

4 months ago  gguf-py : correct charsmap parameter typing (#13701)
Sigbjørn Skjæret [Thu, 22 May 2025 12:25:05 +0000 (14:25 +0200)]
gguf-py : correct charsmap parameter typing (#13701)

4 months ago  sycl : Remove waits from function calls (#13702)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (#13702)

* removes the waits in async memcpy functions

4 months ago  SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)

Currently, on the CUDA backend to SYCL, running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0`
hits two operations that throw an exception due to blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking whether a graph can be used even when graphs are enabled.
This PR takes a similar approach by adding a method to `ggml-sycl.cpp` that
checks whether a graph can be used for the operations even when the user has
asked for it to be enabled.

4 months ago  opencl: Add support for multiple devices (#12622)
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (#12622)

* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.

4 months ago  opencl: fix couple crashes (#12795)
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (#12795)

* opencl: fix couple crashes

* fix kernel launches that failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let the driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failing due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+

4 months ago  releases : build CPU backend separately (windows) (#13642)
Diego Devesa [Wed, 21 May 2025 20:09:57 +0000 (13:09 -0700)]
releases : build CPU backend separately (windows) (#13642)

4 months ago  hparams : support models for which all layers use SWA (#13682)
Georgi Gerganov [Wed, 21 May 2025 17:00:49 +0000 (20:00 +0300)]
hparams : support models for which all layers use SWA (#13682)

ggml-ci

4 months ago  server : improve error reporting (#13680)
Georgi Gerganov [Wed, 21 May 2025 16:46:56 +0000 (19:46 +0300)]
server : improve error reporting (#13680)

4 months ago  convert : add qwen2vl support for unsloth merges (#13686)
antichristHater [Wed, 21 May 2025 16:40:35 +0000 (19:40 +0300)]
convert : add qwen2vl support for unsloth merges (#13686)

4 months ago  examples : switch retrieval to llama_encode (#13685)
Sigbjørn Skjæret [Wed, 21 May 2025 14:57:38 +0000 (16:57 +0200)]
examples : switch retrieval to llama_encode (#13685)

* switch retrieval to llama_encode

* enable --no-warmup for retrieval

4 months ago  gguf-py : display the invalid gguf type (#13687)
Emmanuel Ferdman [Wed, 21 May 2025 14:33:54 +0000 (17:33 +0300)]
gguf-py : display the invalid gguf type (#13687)

Signed-off-by: Emmanuel Ferdman <redacted>
4 months ago  ggml : add ggml_gelu_erf() (#13667)
Xuan-Son Nguyen [Wed, 21 May 2025 14:26:33 +0000 (16:26 +0200)]
ggml : add ggml_gelu_erf() (#13667)

* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggestions

* revert naming order

4 months ago  server : Add the endpoints /api/tags and /api/chat (#13659)
Robin Davidsson [Wed, 21 May 2025 13:15:27 +0000 (15:15 +0200)]
server : Add the endpoints /api/tags and /api/chat (#13659)

* Add the endpoints /api/tags and /api/chat

Add the endpoints /api/tags and /api/chat, and improve the model metadata response

* Remove trailing whitespaces

* Removed code that is not needed for copilot to work.

4 months ago  server : fix first message identification (#13634)
Dorin-Andrei Geman [Wed, 21 May 2025 13:07:57 +0000 (16:07 +0300)]
server : fix first message identification (#13634)

* server : fix first message identification

When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.

Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
* server : Fix checks for first role message for stream=True

Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
---------

Signed-off-by: Dorin Geman <redacted>
Co-authored-by: Piotr Stankiewicz <redacted>
4 months ago  kv-cache : simplify the interface (#13660)
Georgi Gerganov [Wed, 21 May 2025 12:11:13 +0000 (15:11 +0300)]
kv-cache : simplify the interface (#13660)

* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci

4 months ago  model : disable SWA for Phi models (#13676)
Georgi Gerganov [Wed, 21 May 2025 10:09:21 +0000 (13:09 +0300)]
model : disable SWA for Phi models (#13676)

* model : disable SWA for Phi models

ggml-ci

* model : update warning message

* model : print warning only if n_swa > 0

* model : fix typo

4 months ago  musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
R0CKSTAR [Wed, 21 May 2025 01:58:49 +0000 (09:58 +0800)]
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)

* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <redacted>
* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <redacted>
* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <redacted>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
4 months ago  vulkan: fix warnings (#13626)
Eve [Tue, 20 May 2025 21:35:16 +0000 (21:35 +0000)]
vulkan: fix warnings (#13626)

* small fixes

* remove ifdef

4 months ago  mtmd-helper : bug fix to token batching in mtmd (#13650)
l3utterfly [Tue, 20 May 2025 16:55:30 +0000 (00:55 +0800)]
mtmd-helper : bug fix to token batching in mtmd (#13650)

* Update mtmd-helper.cpp

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
4 months ago  model : fix llama4 graph (#13663)
Georgi Gerganov [Tue, 20 May 2025 16:21:04 +0000 (19:21 +0300)]
model : fix llama4 graph (#13663)

ggml-ci

4 months ago  llama : remove llama_kv_cache_view API + remove deprecated (#13653)
Georgi Gerganov [Tue, 20 May 2025 13:13:16 +0000 (16:13 +0300)]
llama : remove llama_kv_cache_view API + remove deprecated (#13653)

ggml-ci

4 months ago  CUDA: skip fully masked-out KV in FA vec kernel (#13584)
Johannes Gäßler [Tue, 20 May 2025 12:45:07 +0000 (14:45 +0200)]
CUDA: skip fully masked-out KV in FA vec kernel (#13584)

* CUDA: skip fully masked-out KV in FA vec kernel

4 months ago  tests : avoid github urls due to throttling (#13654)
Sigbjørn Skjæret [Tue, 20 May 2025 10:03:17 +0000 (12:03 +0200)]
tests : avoid github urls due to throttling (#13654)

4 months ago  sycl: disable reorder for sycl mulmat (#13536)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (#13536)

4 months ago  Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)
0cc4m [Tue, 20 May 2025 08:11:56 +0000 (10:11 +0200)]
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)

4 months ago  metal : fix typo in FA kernel comments (#13651)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (#13651)

4 months ago  kv-cache : add SWA support (#13194)
Georgi Gerganov [Tue, 20 May 2025 05:05:46 +0000 (08:05 +0300)]
kv-cache : add SWA support (#13194)

* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci

4 months ago  CANN: Update CANN model support (#13162)
Xinpeng Dou [Tue, 20 May 2025 03:43:43 +0000 (11:43 +0800)]
CANN: Update CANN model support (#13162)

* Update CANN model support status

* Update of model support

* update

* update

* update

* fix format of CANN.md

* fix format of CANN.md

* fix format of CANN.md

4 months ago  sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)

* Remove mmap workaround on windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. Therefore I removed the workaround for Windows since
it is not necessary.

* Update llama-bench README

With the workaround removed, llama-bench can also run without
specifying the `--mmp 0` flag

4 months ago  common : add load_progress_callback (#13617)
psocolovsky [Mon, 19 May 2025 19:17:36 +0000 (21:17 +0200)]
common : add load_progress_callback (#13617)

4 months ago  Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)

4 months ago  sycl : backend documentation review (#13544)
Alberto Cabrera Pérez [Mon, 19 May 2025 13:38:20 +0000 (14:38 +0100)]
sycl : backend documentation review (#13544)

* sycl: reviewing and updating docs

* Updates Runtime error codes

* Improves OOM troubleshooting entry

* Added a llama 3 sample

* Updated supported models

* Updated releases table

4 months ago  mtmd : add vision support for llama 4 (#13282)
Xuan-Son Nguyen [Mon, 19 May 2025 11:04:14 +0000 (13:04 +0200)]
mtmd : add vision support for llama 4 (#13282)

* wip llama 4 conversion

* rm redundant __init__

* fix conversion

* fix conversion

* test impl

* try this

* reshape patch_embeddings_0

* fix view

* rm ffn_post_norm

* cgraph ok

* f32 for pos embd

* add image marker tokens

* Llama4UnfoldConvolution

* correct pixel shuffle

* fix merge conflicts

* correct

* add debug_graph

* logits matched, but it still perceives the image incorrectly

* fix style

* add image_grid_pinpoints

* handle llama 4 preprocessing

* rm load_image_size

* rm unused line

* fix

* small fix 2

* add test & docs

* fix llava-1.6 test

* test: add notion of huge models

* add comment

* add warn about degraded quality

4 months ago  ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)
Alberto Cabrera Pérez [Mon, 19 May 2025 10:46:09 +0000 (11:46 +0100)]
ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)

4 months ago  sync : ggml
Georgi Gerganov [Mon, 19 May 2025 09:50:29 +0000 (12:50 +0300)]
sync : ggml

ggml-ci

4 months ago  mnist: fix segmentation fault (ggml/1227)
Johannes Gäßler [Mon, 19 May 2025 07:33:35 +0000 (09:33 +0200)]
mnist: fix segmentation fault (ggml/1227)

4 months ago  ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
Diego Devesa [Mon, 19 May 2025 01:30:13 +0000 (18:30 -0700)]
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)

4 months ago  ggml : Fix missing backtrace on Linux (ggml/1228)
Daniel Tang [Sat, 17 May 2025 23:06:26 +0000 (19:06 -0400)]
ggml : Fix missing backtrace on Linux (ggml/1228)

* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols

4 months ago  fix: check model pointer validity before use (#13631)
Nick [Mon, 19 May 2025 10:25:41 +0000 (18:25 +0800)]
fix: check model pointer validity before use (#13631)

4 months ago  CANN: Support MOE Model MUL_MAT_ID (#13042)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (#13042)

Signed-off-by: noemotiovon <redacted>
4 months ago  server : added --no-prefill-assistant flag (#13608)
Isaac McFadyen [Sat, 17 May 2025 21:59:48 +0000 (17:59 -0400)]
server : added --no-prefill-assistant flag (#13608)

* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md

4 months ago  cmake: use the current build config for vulkan-shaders-gen (#13595)
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (#13595)

* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`

4 months ago  parallel : add option for non-shared and larger prompts (#13598)
Georgi Gerganov [Sat, 17 May 2025 09:58:55 +0000 (12:58 +0300)]
parallel : add option for non-shared and larger prompts (#13598)

* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci

4 months ago  vulkan: move common FA code to flash_attn_base.comp (#13556)
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (#13556)

* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix

4 months ago  vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)

4 months ago  llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)
Z [Fri, 16 May 2025 20:56:28 +0000 (14:56 -0600)]
llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)

4 months ago  server : do not return error out of context (with ctx shift disabled) (#13577)
Xuan-Son Nguyen [Fri, 16 May 2025 19:50:00 +0000 (21:50 +0200)]
server : do not return error out of context (with ctx shift disabled) (#13577)

4 months ago  webui : improve accessibility for visually impaired people (#13551)
Xuan-Son Nguyen [Fri, 16 May 2025 19:49:01 +0000 (21:49 +0200)]
webui : improve accessibility for visually impaired people (#13551)

* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content

4 months ago  readme : add list of dependencies and their license (#13591)
Xuan-Son Nguyen [Fri, 16 May 2025 18:04:18 +0000 (20:04 +0200)]
readme : add list of dependencies and their license (#13591)

4 months ago  releases : use arm version of curl for arm releases (#13592)
Diego Devesa [Fri, 16 May 2025 17:36:51 +0000 (10:36 -0700)]
releases : use arm version of curl for arm releases (#13592)

4 months ago  metal : add FA-vec kernel for head size 64 (#13583)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (#13583)

ggml-ci

4 months ago  llama : print hint when loading a model when no backends are loaded (#13589)
Diego Devesa [Fri, 16 May 2025 14:38:07 +0000 (07:38 -0700)]
llama : print hint when loading a model when no backends are loaded (#13589)

4 months ago  ci : add ppc64el to build-linux-cross (#13575)
Sigbjørn Skjæret [Fri, 16 May 2025 12:54:23 +0000 (14:54 +0200)]
ci : add ppc64el to build-linux-cross (#13575)

4 months ago  sycl : fixed compilation warnings (#13582)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (#13582)

4 months ago  minja: sync (qwen3) (#13573)
Olivier Chafik [Thu, 15 May 2025 22:29:10 +0000 (23:29 +0100)]
minja: sync (qwen3) (#13573)

* minja: sync https://github.com/google/minja/commit/f06140fa52fd140fe38e531ec373d8dc9c86aa06

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <redacted>
4 months ago  gguf : use ggml log system (#13571)
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (#13571)

* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages

4 months ago  gguf-py : fix disconnect-before-connect in editor-gui (#13569)
Daniel Tang [Thu, 15 May 2025 16:47:10 +0000 (12:47 -0400)]
gguf-py : fix disconnect-before-connect in editor-gui (#13569)

The bug caused a crash on load in venvs created with
--system-site-packages that use
python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.

4 months ago  convert : fix conversion for llama 4 (#13567)
Xuan-Son Nguyen [Thu, 15 May 2025 15:40:07 +0000 (17:40 +0200)]
convert : fix conversion for llama 4 (#13567)

4 months ago  sycl: simplify bin_bcast_kernel (#13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (#13383)

4 months ago  sycl: reordered Q4_K MMVQ (#13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (#13109)

4 months ago  sycl: use oneDNN for matrices multiplication (#12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (#12972)

4 months ago  llama-bench : fix -ot with dl backends (#13563)
Diego Devesa [Thu, 15 May 2025 13:46:55 +0000 (06:46 -0700)]
llama-bench : fix -ot with dl backends (#13563)

4 months ago  webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
Xuan-Son Nguyen [Thu, 15 May 2025 12:24:50 +0000 (14:24 +0200)]
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)

* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages

4 months ago  server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)
Piotr Wilkin (ilintar) [Thu, 15 May 2025 06:40:58 +0000 (08:40 +0200)]
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)

4 months ago  bench : handle decode errors (#13548)
Georgi Gerganov [Thu, 15 May 2025 02:57:02 +0000 (05:57 +0300)]
bench : handle decode errors (#13548)

ggml-ci

4 months ago  `server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
Olivier Chafik [Thu, 15 May 2025 01:39:51 +0000 (02:39 +0100)]
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)

* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <redacted>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
4 months ago  kv-cache : fix out-of-bounds view during reserve graph (#13547)
Georgi Gerganov [Wed, 14 May 2025 20:15:15 +0000 (23:15 +0300)]
kv-cache : fix out-of-bounds view during reserve graph (#13547)

* kv-cache : fix reserve graph out-of-bounds access

ggml-ci

* cont : add comment

* cont : fix comments [no ci]

* cont : more correct comment [no ci]

4 months ago  arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)

This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on neoverse-n2 with a llama3 8b q6_k quantized model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```

4 months ago  `common`: add partial regex support (#12808)
Olivier Chafik [Wed, 14 May 2025 18:50:57 +0000 (19:50 +0100)]
`common`: add partial regex support (#12808)

* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <redacted>
* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Georgi Gerganov <redacted>
4 months ago  editorconfig : fix trailing whitespace from #13542 (#13546)
Sigbjørn Skjæret [Wed, 14 May 2025 18:22:49 +0000 (20:22 +0200)]
editorconfig : fix trailing whitespace from #13542 (#13546)

4 months ago  fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)
Gilad S. [Wed, 14 May 2025 16:18:18 +0000 (19:18 +0300)]
fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)

4 months ago  CUDA: fix crash on large batch size for quant. MoE (#13537)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (#13537)

4 months ago  llama : fix quantize with dl backends (#13539)
Diego Devesa [Wed, 14 May 2025 14:12:36 +0000 (07:12 -0700)]
llama : fix quantize with dl backends (#13539)

4 months ago  CUDA: faster Deepseek FA, add Turing support (#13435)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (#13435)

4 months ago  fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
Gabe Goodhart [Wed, 14 May 2025 12:53:59 +0000 (06:53 -0600)]
fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)

This matches how others do it, but will still avoid the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <redacted>
4 months ago  server : passthrough the /models endpoint during loading (#13535)
Georgi Gerganov [Wed, 14 May 2025 12:42:10 +0000 (15:42 +0300)]
server : passthrough the /models endpoint during loading (#13535)

* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field

4 months ago  server : fix cache_tokens bug with no cache_prompt (#13533)
Xuan-Son Nguyen [Wed, 14 May 2025 11:35:07 +0000 (13:35 +0200)]
server : fix cache_tokens bug with no cache_prompt (#13533)

4 months ago  cmake: simplify vulkan shader test logic (#13263)
bandoti [Wed, 14 May 2025 10:53:57 +0000 (07:53 -0300)]
cmake: simplify vulkan shader test logic (#13263)

4 months ago  vulkan: KHR_coopmat flash attention (#13506)
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (#13506)

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.

4 months ago  webui : use fflate for more deterministic gzip compress (#13525)
Xuan-Son Nguyen [Wed, 14 May 2025 08:26:12 +0000 (10:26 +0200)]
webui : use fflate for more deterministic gzip compress (#13525)

* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako

4 months ago  webui: Allow pasting file from clipboard (#13526)
Luca Stefani [Wed, 14 May 2025 08:07:31 +0000 (10:07 +0200)]
webui: Allow pasting file from clipboard (#13526)

* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <redacted>
4 months ago  docs: Update link to ggml-org in multimodal.md (#13513)
ddpasa [Wed, 14 May 2025 07:59:12 +0000 (09:59 +0200)]
docs: Update link to ggml-org in multimodal.md (#13513)

* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
4 months ago  scripts : fix compare-llama-bench.py show parameter (#13514)
Sigbjørn Skjæret [Wed, 14 May 2025 06:41:01 +0000 (08:41 +0200)]
scripts : fix compare-llama-bench.py show parameter (#13514)

4 months ago  vulkan: workaround FA compile failures on macos (#13517)
Jeff Bolz [Wed, 14 May 2025 04:15:50 +0000 (13:15 +0900)]
vulkan: workaround FA compile failures on macos (#13517)

4 months ago  quantize : improve tensor-type pattern matching (#13033)
Ed Addario [Tue, 13 May 2025 17:12:31 +0000 (18:12 +0100)]
quantize : improve tensor-type pattern matching (#13033)

4 months ago  clip : clip.h become private API (⚠️ breaking change) (#13510)
Xuan-Son Nguyen [Tue, 13 May 2025 15:07:21 +0000 (17:07 +0200)]
clip : clip.h become private API (⚠️ breaking change) (#13510)

4 months ago  metal : use FA-vec kernel up to batch size 20 (#13496)
Georgi Gerganov [Tue, 13 May 2025 15:04:39 +0000 (18:04 +0300)]
metal : use FA-vec kernel up to batch size 20 (#13496)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci

4 months ago  metal : optimize multi-sequence FA vec kernel (#13493)
Georgi Gerganov [Tue, 13 May 2025 15:04:00 +0000 (18:04 +0300)]
metal : optimize multi-sequence FA vec kernel (#13493)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

4 months ago  ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Dan Johansson [Tue, 13 May 2025 15:02:28 +0000 (17:02 +0200)]
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)

Signed-off-by: Dan Johansson <redacted>