git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Yuanhao Ji [Mon, 9 Jun 2025 03:20:06 +0000 (11:20 +0800)]
CANN: Enable labeler for Ascend NPU (#13914)

Diego Devesa [Sun, 8 Jun 2025 18:39:56 +0000 (11:39 -0700)]
cuda : fix buffer type check with integrated GPUs (#14069)

吴小白 [Sat, 7 Jun 2025 13:39:11 +0000 (21:39 +0800)]
ci: add LoongArch cross-compile build (#13944)

Akarshan Biswas [Sat, 7 Jun 2025 13:28:20 +0000 (18:58 +0530)]
SYCL: Implement few same quantized type copy kernels (#13739)

* SYCL: Implement few same quantized type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
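
For illustration, a minimal C++ sketch of the two changes described above; `copy_tensor` and its parameters are hypothetical stand-ins, while `cpy_blck_q_q` echoes the template named in the message but its body is schematic:

```cpp
#include <cstddef>
#include <cstring>

// A single templated block copy standing in for the per-type variants
// (cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0, ...); block_t is any ggml
// block type such as block_q8_0.
template <typename block_t>
static void cpy_blck_q_q(const char * src, char * dst) {
    *reinterpret_cast<block_t *>(dst) = *reinterpret_cast<const block_t *>(src);
}

// Hypothetical dispatch illustrating the fast path: contiguous tensors of
// the same type can be moved with one memcpy instead of a copy kernel.
static void copy_tensor(const void * src, void * dst, size_t nbytes,
                        bool same_type, bool contiguous) {
    if (same_type && contiguous) {
        memcpy(dst, src, nbytes); // memcpy path from the commit above
        return;
    }
    // ... otherwise launch the templated kernel over
    // ceil_div(n_elements, block_size) work-groups ...
}
```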

Sigbjørn Skjæret [Sat, 7 Jun 2025 12:13:12 +0000 (14:13 +0200)]
llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050)

Georgi Gerganov [Fri, 6 Jun 2025 11:11:15 +0000 (14:11 +0300)]
llama : deprecate llama_kv_self_ API (#14030)

* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci

Georgi Gerganov [Fri, 6 Jun 2025 10:29:18 +0000 (13:29 +0300)]
context : fix SWA-related warning for multiple sequences (#14045)

Sigbjørn Skjæret [Fri, 6 Jun 2025 07:03:25 +0000 (09:03 +0200)]
llama : support multiple classifier outputs and labels (#13940)

Sigbjørn Skjæret [Thu, 5 Jun 2025 15:42:31 +0000 (17:42 +0200)]
gguf-py : add add_classifier_output_labels method to writer (#14031)

* add add_classifier_output_labels

* use add_classifier_output_labels

Masato Nakasaka [Thu, 5 Jun 2025 14:00:29 +0000 (23:00 +0900)]
vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)

* allowing B580 and U9-288V

* experimental code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check

pockers21 [Thu, 5 Jun 2025 13:25:29 +0000 (06:25 -0700)]
ci: fix CUDA build failure on autodl cloud machines (#14005)

Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <redacted>

Georgi Gerganov [Thu, 5 Jun 2025 12:29:22 +0000 (15:29 +0300)]
memory : migrate from llama_kv_cache to more generic llama_memory (#14006)

* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci

Diego Devesa [Thu, 5 Jun 2025 09:57:42 +0000 (02:57 -0700)]
llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013)

Olexandr88 [Thu, 5 Jun 2025 07:50:55 +0000 (10:50 +0300)]
readme : add badge (#13938)

Sigbjørn Skjæret [Thu, 5 Jun 2025 07:29:18 +0000 (09:29 +0200)]
vocab : warn about missing mask token (#14022)

Georgi Gerganov [Thu, 5 Jun 2025 06:06:29 +0000 (09:06 +0300)]
context : fix pos_min initialization upon error decode (#14008)

ggml-ci

Jeff Bolz [Thu, 5 Jun 2025 05:17:58 +0000 (00:17 -0500)]
vulkan: automatically deduce size of push constants (#13936)

Ervin Áron Tasnádi [Wed, 4 Jun 2025 20:02:00 +0000 (22:02 +0200)]
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)

* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position

Georgi Gerganov [Wed, 4 Jun 2025 15:58:20 +0000 (18:58 +0300)]
kv-cache : refactor the update/defrag mechanism (#13988)

* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci

Diego Devesa [Wed, 4 Jun 2025 13:37:40 +0000 (06:37 -0700)]
ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997)

Diego Devesa [Wed, 4 Jun 2025 11:15:54 +0000 (04:15 -0700)]
releases : use dl backend for linux release, remove arm64 linux release (#13996)

Xuan-Son Nguyen [Wed, 4 Jun 2025 08:11:26 +0000 (10:11 +0200)]
llama-graph : use ggml_repeat_4d (#13998)

Johannes Gäßler [Wed, 4 Jun 2025 06:57:05 +0000 (08:57 +0200)]
CUDA: fix FTZ in FA for Gemma 3 (#13991)

Georgi Gerganov [Wed, 4 Jun 2025 06:50:32 +0000 (09:50 +0300)]
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)

ggml-ci

Jeff Bolz [Tue, 3 Jun 2025 18:30:22 +0000 (13:30 -0500)]
vulkan: fix warnings in perf logger querypool code (#13937)

Xuan-Son Nguyen [Tue, 3 Jun 2025 11:09:36 +0000 (13:09 +0200)]
docs : add "Quick start" section for new users (#13862)

* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md

lhez [Mon, 2 Jun 2025 23:54:58 +0000 (16:54 -0700)]
opencl: add `backend_synchronize` (#13939)

* This is not needed in normal use, where the result is read
  using `tensor_get`, but it allows the perf mode of `test-backend-ops`
  to properly measure performance.
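
A minimal sketch of what such a synchronize hook boils down to; the function name and wiring here are illustrative, only `clFinish` is the real OpenCL call:

```cpp
#include <CL/cl.h>

// Block until every command enqueued so far has completed, so a perf
// harness stops its clock only after the GPU work is actually done.
static void backend_synchronize(cl_command_queue queue) {
    clFinish(queue); // returns once all enqueued commands have finished
}
```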

rmatif [Mon, 2 Jun 2025 23:53:36 +0000 (23:53 +0000)]
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)

* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes

Georgi Gerganov [Mon, 2 Jun 2025 18:34:40 +0000 (21:34 +0300)]
server : disable speculative decoding for SWA models (#13970)

* server : use swa-full for draft context

ggml-ci

* server : disable speculative decoding for SWA models

Georgi Gerganov [Mon, 2 Jun 2025 18:33:40 +0000 (21:33 +0300)]
metal : use F32 accumulators in FA kernels (#13975)

ggml-ci

Georgi Gerganov [Mon, 2 Jun 2025 17:54:26 +0000 (20:54 +0300)]
gemma : more consistent attention scaling for v2 and v3 (#13951)

* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling

Olivier Chafik [Mon, 2 Jun 2025 17:15:44 +0000 (10:15 -0700)]
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)

* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts

Xuan-Son Nguyen [Mon, 2 Jun 2025 14:29:28 +0000 (16:29 +0200)]
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)

* mtmd : fix memory leak in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

shalinib-ibm [Mon, 2 Jun 2025 12:18:36 +0000 (17:48 +0530)]
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)

Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".

This patch provides a fix by first converting the string to uppercase before applying the regex.

Signed-off-by: root <redacted>
Co-authored-by: root <redacted>
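
The fix itself is in CMake; the following C++ sketch only illustrates the normalize-then-match logic, with `power_generation` as a hypothetical helper:

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Uppercase the reported implementation first, then match, so "Power11"
// and "POWER11" both resolve to generation 11.
static int power_generation(std::string impl) {
    std::transform(impl.begin(), impl.end(), impl.begin(),
                   [](unsigned char c) { return std::toupper(c); });
    std::smatch m;
    if (std::regex_search(impl, m, std::regex("POWER([0-9]+)"))) {
        return std::stoi(m[1].str());
    }
    return 0; // not recognized as a POWER CPU
}
```
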
Atharva Dubey [Mon, 2 Jun 2025 09:12:20 +0000 (10:12 +0100)]
sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)

* [WIP]: fuse q8 quantization and reorder

* wip2: fuse q8 quantization and reorder

* working q8 reorder commit

* restored common.hpp

* remove debug prints

* remove unnecessary headers and remove trailing whitespace

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <redacted>
---------

Co-authored-by: Alberto Cabrera Pérez <redacted>

Johannes Gäßler [Sun, 1 Jun 2025 16:08:05 +0000 (18:08 +0200)]
gguf: fix failure on version == 0 (#13956)

Sigbjørn Skjæret [Sun, 1 Jun 2025 16:07:21 +0000 (18:07 +0200)]
convert : fix nomic-bert-moe mask token (#13757)

Sigbjørn Skjæret [Sun, 1 Jun 2025 15:23:11 +0000 (17:23 +0200)]
convert : fix vocab padding code for bert models (#13954)

Aaron Teo [Sun, 1 Jun 2025 14:53:57 +0000 (22:53 +0800)]
ggml: check if non-native endian model is being loaded (#13943)

* gguf: prevent non-native endian models from being loaded

Signed-off-by: Aaron Teo <redacted>
* gguf: update error message

Signed-off-by: Aaron Teo <redacted>
* gguf: make the non-native endian check more verbose

Signed-off-by: Aaron Teo <redacted>
* ggml: move ggml_assert location

Signed-off-by: Aaron Teo <redacted>
* ggml: reword the endianness check error message

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
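
A hedged sketch of the underlying idea; `looks_byteswapped` is a hypothetical helper, not the loader's actual check:

```cpp
#include <cstdint>

// GGUF fields are little-endian, so a small version number read on a
// host with mismatched endianness comes out with its bytes reversed.
static bool looks_byteswapped(uint32_t version) {
    uint32_t swapped = ((version & 0x000000FFu) << 24) |
                       ((version & 0x0000FF00u) <<  8) |
                       ((version & 0x00FF0000u) >>  8) |
                       ((version & 0xFF000000u) >> 24);
    // version 3 misread as 0x03000000 trips this; a native read does not
    return version > 0xFFFFu && swapped <= 0xFFFFu;
}
```
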
Georgi Gerganov [Sun, 1 Jun 2025 09:23:14 +0000 (12:23 +0300)]
sync : ggml

ggml-ci

Kai Pastor [Sat, 31 May 2025 10:49:55 +0000 (12:49 +0200)]
vulkan : Remove unexpected ; (ggml/1253)

Kai Pastor [Sat, 31 May 2025 10:39:19 +0000 (12:39 +0200)]
cmake : Fix broken CMake error messages (ggml/1252)

Radoslav Gerganov [Fri, 30 May 2025 06:11:09 +0000 (09:11 +0300)]
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)

The implementation was already deleted in commit 9d0762e.

closes: #1235

Georgi Gerganov [Thu, 29 May 2025 10:29:50 +0000 (13:29 +0300)]
sync : whisper.cpp (ggml/1250)

* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <redacted>

Radoslav Gerganov [Thu, 29 May 2025 05:34:46 +0000 (08:34 +0300)]
ggml : install dynamic backends (ggml/1240)

* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR

Daniel Tang [Wed, 28 May 2025 00:58:46 +0000 (20:58 -0400)]
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)

The goal is for what users call "full logs" to contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
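
A minimal sketch of the technique, assuming a glibc target; the names are illustrative and ggml's real handler differs in detail:

```cpp
#include <cstdlib>
#include <exception>
#include <execinfo.h> // glibc backtrace facilities
#include <unistd.h>

// A terminate handler installed once at init time prints a backtrace to
// stderr before aborting, so logs of uncaught C++ throws include it.
static void terminate_with_backtrace() {
    void * frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    std::abort();
}

static void install_terminate_handler() { // e.g. called from ggml_init
    std::set_terminate(terminate_with_backtrace);
}
```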

ddh0 [Sun, 1 Jun 2025 08:44:30 +0000 (03:44 -0500)]
readme : update bindings (#13950)

Georgi Gerganov [Sun, 1 Jun 2025 08:42:16 +0000 (11:42 +0300)]
parallel : fix n_junk == 0 (#13952)

Georgi Gerganov [Sun, 1 Jun 2025 08:39:27 +0000 (11:39 +0300)]
kv-cache : split implementation in separate sources (#13920)

ggml-ci

Max Krasnyansky [Sat, 31 May 2025 22:39:19 +0000 (15:39 -0700)]
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)

* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that, we now disable Power Throttling for our threads for the NORMAL
and higher priorities (see the sketch at the end of this entry).

Co-authored-by: Diego Devesa <redacted>
* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <redacted>
---------

Co-authored-by: Diego Devesa <redacted>
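
A sketch of the Windows API involved, assuming Windows 10 1709 or later; placement and naming are illustrative, not the threadpool's exact code:

```cpp
#include <windows.h>

// Opt the current thread out of Power Throttling so aggressive core
// parking does not slow NORMAL-and-above priority worker threads.
static void thread_disable_power_throttling() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0; // 0 with the bit set in ControlMask = off
    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                         &state, sizeof(state));
}
```
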
Jiří Podivín [Sat, 31 May 2025 16:58:35 +0000 (18:58 +0200)]
docs : Note about necessity of having libcurl installed for standard build. (#13945)

Signed-off-by: Jiri Podivin <redacted>

Olivier Chafik [Sat, 31 May 2025 15:26:10 +0000 (08:26 -0700)]
server: allow unclosed thinking tags (#13931)

Georgi Gerganov [Sat, 31 May 2025 12:58:33 +0000 (15:58 +0300)]
llama : deprecate explicit kv_self defrag/update calls (#13921)

ggml-ci

Georgi Gerganov [Sat, 31 May 2025 12:57:44 +0000 (15:57 +0300)]
llama : use n_swa + n_ubatch cells for SWA cache (#13833)

* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts

igardev [Sat, 31 May 2025 09:56:08 +0000 (12:56 +0300)]
webui : Replace alert and confirm with custom modals. (#13711)

* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>

Georgi Gerganov [Sat, 31 May 2025 09:55:57 +0000 (12:55 +0300)]
llama : auto-batch preparation (#13845)

* llama : auto-batch

ggml-ci

* context : simplify if branching

Xuan-Son Nguyen [Sat, 31 May 2025 08:14:29 +0000 (10:14 +0200)]
mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)

* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir

Georgi Gerganov [Sat, 31 May 2025 07:24:04 +0000 (10:24 +0300)]
kv-cache : refactor + add llama_memory_state_i (#13746)

* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci

Shawn yang [Sat, 31 May 2025 06:48:04 +0000 (14:48 +0800)]
CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)

* 1. add "integrated" in ggml_cuda_device_info to distinguish whether the device is an integrated GPU or a discrete GPU
2. Adjust the func "ggml_backend_cuda_device_supports_buft" for this new feature (see the sketch at the end of this entry)

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <redacted>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <redacted>
* revert the suggested commit changes since they are not applicable on Jetson devices

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence

Co-authored-by: Diego Devesa <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix CI bug: add a space

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: yangxiao <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: yangxiao <redacted>
Co-authored-by: Diego Devesa <redacted>
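
A hedged sketch of the idea; `device_info` and `query_device` are illustrative stand-ins rather than the real ggml_cuda_device_info layout, and only `cudaDeviceProp::integrated` is the actual CUDA field:

```cpp
#include <cuda_runtime.h>

// CUDA reports whether a device is an integrated GPU (shares physical
// memory with the host, e.g. Jetson); the buffer-type support check can
// consult this to allow host buffers on iGPUs.
struct device_info {
    bool integrated;
};

static device_info query_device(int dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    return { prop.integrated != 0 }; // 1 for iGPUs, 0 for dGPUs
}
```
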
Johannes Gäßler [Fri, 30 May 2025 19:22:03 +0000 (21:22 +0200)]
CUDA: fix typo in FlashAttention code (#13926)

Diego Devesa [Fri, 30 May 2025 16:56:19 +0000 (09:56 -0700)]
sched : avoid changing cur_copy when a graph is already allocated (#13922)

Georgi Gerganov [Fri, 30 May 2025 16:38:07 +0000 (19:38 +0300)]
parallel : increase the variability of the prompt lengths (#13927)

ggml-ci

Diego Devesa [Fri, 30 May 2025 14:37:18 +0000 (07:37 -0700)]
cuda : prevent using split buffers with 3d/4d matrices (#13919)

Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (#13755)

* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
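
A sketch of the vectorized load/store pattern described above; `rope_pair` is a hypothetical helper showing the idea outside a kernel:

```cpp
#include <sycl/sycl.hpp>

// RoPE rotates elements in pairs, so one sycl::vec<float, 2> load/store
// replaces two scalar memory accesses.
static void rope_pair(const float * src, float * dst,
                      float cos_theta, float sin_theta) {
    using f2 = sycl::vec<float, 2>;
    const f2 v = *reinterpret_cast<const f2 *>(src); // one vector load
    f2 r;
    r[0] = v[0] * cos_theta - v[1] * sin_theta;
    r[1] = v[0] * sin_theta + v[1] * cos_theta;
    *reinterpret_cast<f2 *>(dst) = r;                // one vector store
}
```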

Georgi Gerganov [Fri, 30 May 2025 13:25:45 +0000 (16:25 +0300)]
sync : vendor (#13901)

* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci

Sigbjørn Skjæret [Fri, 30 May 2025 12:50:43 +0000 (14:50 +0200)]
convert : fix rwkv bos/eos token (#13844)

Xuan-Son Nguyen [Fri, 30 May 2025 10:24:37 +0000 (12:24 +0200)]
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)

* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import

Đinh Trọng Huy [Fri, 30 May 2025 09:56:02 +0000 (18:56 +0900)]
llama : add support for DistilBert (#13907)

* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <redacted>

zhangkaihuo [Fri, 30 May 2025 08:31:48 +0000 (16:31 +0800)]
llama : use llm_build_granite for minicpm (#13911)

Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)

Sigbjørn Skjæret [Thu, 29 May 2025 19:42:31 +0000 (21:42 +0200)]
llama : add support for jina-reranker-v2 (#13900)

Sigbjørn Skjæret [Thu, 29 May 2025 13:36:05 +0000 (15:36 +0200)]
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)

Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)

This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on Neoverse-N2 with the Llama 3 8B q4_k_m quantized model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
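
A sketch of the core instruction involved, assuming an i8mm-capable target; `tile_mma` is an illustrative wrapper, not the kernel itself:

```cpp
#include <arm_neon.h>

// SMMLA multiplies a 2x8 int8 tile by another 2x8 int8 tile (taken as
// transposed) and accumulates a 2x2 int32 result, letting the GEMM
// process two rows and two columns per instruction.
static inline int32x4_t tile_mma(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b); // acc(2x2) += A(2x8) * B(2x8)^T
}
```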

Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (#13883)

* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function

Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)

* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <redacted>

Georgi Gerganov [Thu, 29 May 2025 09:17:16 +0000 (12:17 +0300)]
tests : remove json.hpp from a test (#13880)

ggml-ci

Sigbjørn Skjæret [Thu, 29 May 2025 08:00:57 +0000 (10:00 +0200)]
convert : workaround for AutoConfig dummy labels (#13881)

Sigbjørn Skjæret [Thu, 29 May 2025 06:15:01 +0000 (08:15 +0200)]
llama : add RobertaForSequenceClassification reranker support (#13875)

Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)

* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE

Beinsezii [Wed, 28 May 2025 21:50:20 +0000 (14:50 -0700)]
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)

Xuan-Son Nguyen [Wed, 28 May 2025 20:35:31 +0000 (22:35 +0200)]
llama : fix KV shift for qwen2vl (#13870)

* llama : fix KV shift for qwen2vl

* add ref to the PR

Xuan-Son Nguyen [Wed, 28 May 2025 20:35:22 +0000 (22:35 +0200)]
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)

* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code

bandoti [Wed, 28 May 2025 18:46:47 +0000 (15:46 -0300)]
ci: disable LLAMA_CURL for Linux cross-builds (#13871)

Đinh Trọng Huy [Wed, 28 May 2025 17:01:58 +0000 (02:01 +0900)]
llama : add support for BertForSequenceClassification reranker (#13858)

* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <redacted>

Đinh Trọng Huy [Wed, 28 May 2025 14:34:18 +0000 (23:34 +0900)]
convert: small addition to support LlamaModel (#13838)

Co-authored-by: dinhhuy <redacted>

Sky [Wed, 28 May 2025 14:33:54 +0000 (22:33 +0800)]
server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853)

[fix]: actually remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal model mode

Xuan-Son Nguyen [Wed, 28 May 2025 14:12:35 +0000 (16:12 +0200)]
convert : fix qwen omni conversion (#13859)

* convert : fix qwen omni conversion

* fix typo

Alex Fanthome [Wed, 28 May 2025 13:49:28 +0000 (14:49 +0100)]
tests : change umlaut test (#11600)

Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)

Xuan-Son Nguyen [Wed, 28 May 2025 08:05:54 +0000 (10:05 +0200)]
convert : fix tensor naming conflict for llama 4 vision (#13836)

* convert : fix tensor naming conflict for llama 4 vision

* add comment

leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (#13837)

lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)

* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`

lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)

Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)

Also change it to be controlled by an env var rather than a cmake flag
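
A hedged sketch of the mechanism (per the message above, enabled at runtime via the GGML_VULKAN_PERF env var); pool creation, submission, and synchronization are omitted and `measure_ms` is hypothetical:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Bracket the work with two timestamps, read them back, and scale by
// the device's timestampPeriod (nanoseconds per tick).
static double measure_ms(VkDevice dev, VkCommandBuffer cmd, VkQueryPool pool,
                         float timestamp_period_ns) {
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,    pool, 0);
    // ... record the dispatches being measured ...
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);

    // (queue submission + wait happens between recording and readback)
    uint64_t ts[2];
    vkGetQueryPoolResults(dev, pool, 0, 2, sizeof(ts), ts, sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
    return double(ts[1] - ts[0]) * timestamp_period_ns * 1e-6; // ticks -> ms
}
```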

Georgi Gerganov [Tue, 27 May 2025 16:08:44 +0000 (19:08 +0300)]
cmake : add llama-cparams.cpp to build (#13832)

Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (#13749)

* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <redacted>

Georgi Gerganov [Tue, 27 May 2025 15:04:38 +0000 (18:04 +0300)]
sync : ggml

Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (#13824)

xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (#13720)

* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage

Xuan-Son Nguyen [Tue, 27 May 2025 12:06:10 +0000 (14:06 +0200)]
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)

* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* stricter validation of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens