git.djapps.eu Git - pkg/ggml/sources/ggml/log
3 weeks ago cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)

3 weeks ago arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)

This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
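The uplift ranges quoted above can be re-derived from the table. The following is a minimal sketch that copies the table values by hand; the helper name `uplift_percent` is made up for illustration and is not part of llama-batched-bench.

```python
# Throughput in tokens/s, copied from the table above.
# Key: batch size B, value: (original, this PR).
S_PP = {1: (110.12, 147.83), 2: (121.16, 172.42), 4: (120.15, 169.75),
        8: (130.97, 196.81), 16: (131.01, 196.88), 32: (130.85, 196.51)}
S_TG = {1: (24.36, 24.28), 2: (46.36, 47.93), 4: (74.68, 84.00),
        8: (91.04, 114.74), 16: (101.43, 135.79), 32: (106.97, 147.29)}


def uplift_percent(original, patched):
    """Relative speedup of `patched` over `original`, in percent."""
    return (patched / original - 1.0) * 100.0


pp_uplift = {b: uplift_percent(*v) for b, v in S_PP.items()}
# The commit message only claims a TG uplift for batch size 4 and above.
tg_uplift = {b: uplift_percent(*v) for b, v in S_TG.items() if b >= 4}

for b in sorted(pp_uplift):
    print(f"B={b:2d}  S_PP uplift {pp_uplift[b]:5.1f}%")
for b in sorted(tg_uplift):
    print(f"B={b:2d}  S_TG uplift {tg_uplift[b]:5.1f}%")
```

The smallest and largest values land inside the quoted "34% ~ 50%" and "12% ~ 37%" ranges once the percentages are truncated to whole numbers.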

3 weeks ago cmake: Factor out CPU architecture detection (llama/13883)
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (llama/13883)

* cmake: Define function for querying architecture

The tests and results match exactly those of src/CMakeLists.txt

* Switch arch detection over to new function

3 weeks ago ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama...
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882)

* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <redacted>
3 weeks ago ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)

* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE

3 weeks ago CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)

3 weeks ago CANN: Add SOC TYPE printing in cmake configuration (llama/13837)
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (llama/13837)

3 weeks ago opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm...
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787)

* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`

3 weeks ago opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)

3 weeks ago vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)

Also change it to be controlled by an env var rather than cmake flag

3 weeks ago SYCL: add gelu_erf kernel (llama/13749)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (llama/13749)

* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <redacted>
3 weeks ago ggml : add ggml_repeat_4d (llama/13824)
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (llama/13824)

4 weeks ago vulkan : Remove unexpected ; (#1253)
Kai Pastor [Sat, 31 May 2025 10:49:55 +0000 (12:49 +0200)]
vulkan : Remove unexpected ; (#1253)

4 weeks ago cmake : Fix broken CMake error messages (#1252)
Kai Pastor [Sat, 31 May 2025 10:39:19 +0000 (12:39 +0200)]
cmake : Fix broken CMake error messages (#1252)

4 weeks ago ggml : remove ggml_graph_import and ggml_graph_export declarations (#1247)
Radoslav Gerganov [Fri, 30 May 2025 06:11:09 +0000 (09:11 +0300)]
ggml : remove ggml_graph_import and ggml_graph_export declarations (#1247)

The implementation is already deleted with commit 9d0762e.

closes: #1235

4 weeks ago ggml : fix pkg-config include path (#1248)
Radoslav Gerganov [Fri, 30 May 2025 06:10:09 +0000 (09:10 +0300)]
ggml : fix pkg-config include path (#1248)

CMake is installing public headers in CMAKE_INSTALL_INCLUDEDIR, not
CMAKE_INSTALL_INCLUDEDIR/ggml.

This patch fixes building external programs which use
'pkg-config --cflags ggml'

4 weeks ago ggml : skip tests, examples incompatible with GGML_BACKEND_DL (#1242)
Christian Kastner [Thu, 29 May 2025 12:43:43 +0000 (14:43 +0200)]
ggml : skip tests, examples incompatible with GGML_BACKEND_DL (#1242)

4 weeks ago ci : Add Windows (#1249)
Daniel Tang [Thu, 29 May 2025 11:13:35 +0000 (07:13 -0400)]
ci : Add Windows (#1249)

4 weeks ago sync : whisper.cpp (#1250)
Georgi Gerganov [Thu, 29 May 2025 10:29:50 +0000 (13:29 +0300)]
sync : whisper.cpp (#1250)

* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <redacted>
4 weeks ago sync : whisper.cpp
Georgi Gerganov [Thu, 29 May 2025 06:57:18 +0000 (09:57 +0300)]
sync : whisper.cpp

4 weeks ago ggml : install dynamic backends (#1240)
Radoslav Gerganov [Thu, 29 May 2025 05:34:46 +0000 (08:34 +0300)]
ggml : install dynamic backends (#1240)

* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR

4 weeks ago ggml : Print backtrace on uncaught C++ exceptions (#1232)
Daniel Tang [Wed, 28 May 2025 00:58:46 +0000 (20:58 -0400)]
ggml : Print backtrace on uncaught C++ exceptions (#1232)

The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.

4 weeks ago sync : whisper.cpp
Georgi Gerganov [Tue, 27 May 2025 15:03:34 +0000 (18:03 +0300)]
sync : whisper.cpp

4 weeks ago examples : add --print-confidence option to cli (whisper/3150)
Daniel Bevenius [Wed, 14 May 2025 17:21:48 +0000 (19:21 +0200)]
examples : add --print-confidence option to cli (whisper/3150)

* examples : add --print-confidence option to cli

This commit adds a new command-line option `--print-confidence` to the
whisper-cli. When enabled, this option prints the confidence level of each
token in the transcribed text using ANSI formatting codes.

The confidence levels are represented using different styles:
```console
main: confidence: highlighted (low confidence), underlined (medium), dim (high confidence)
```

Refs: https://github.com/ggml-org/whisper.cpp/issues/3135

4 weeks ago sync : llama.cpp
Georgi Gerganov [Tue, 27 May 2025 13:28:55 +0000 (16:28 +0300)]
sync : llama.cpp

ggml-ci

4 weeks ago ggml : riscv: add xtheadvector support (llama/13720)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (llama/13720)

* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage

4 weeks ago ggml-cpu: x86 feature detection is specific to x86 (llama/13811)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (llama/13811)

4 weeks ago ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)

4 weeks ago cuda : avoid cuGetErrorString (llama/13791)
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (llama/13791)

ggml-ci

4 weeks ago SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)

* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div

4 weeks ago sycl: Add more debug prints (llama/13640)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (llama/13640)

4 weeks ago vulkan: mark IM2COL as supporting non-contig (llama/13783)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (llama/13783)

4 weeks ago CANN: Add the basic supports of Flash Attention kernel (llama/13627)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (llama/13627)

* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline

4 weeks ago sync : llama.cpp upstream/latest
Georgi Gerganov [Sun, 25 May 2025 07:09:01 +0000 (10:09 +0300)]
sync : llama.cpp

ggml-ci

4 weeks ago SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (llama/13752)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (llama/13752)

Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci

4 weeks ago ggml-cpu : set openmp wait time if not set (llama/13758)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (llama/13758)

4 weeks ago sync : llama.cpp
Georgi Gerganov [Sat, 24 May 2025 13:55:46 +0000 (16:55 +0300)]
sync : llama.cpp

ggml-ci

4 weeks ago ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)

* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon

4 weeks ago CUDA: fix race condition in FA vector kernels (llama/13742)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (llama/13742)

4 weeks ago CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)

* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <redacted>
* codestyle adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
4 weeks ago ggml : fix the order of ggml_unary_op (llama/13718)
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (llama/13718)

4 weeks ago vulkan: support CPY from any type to itself (llama/13695)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (llama/13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
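The idea of reusing a 2-byte or 4-byte copy kernel for an arbitrary type can be sketched on raw bytes. This is only an illustration in Python, not the Vulkan shader code: `scaled_copy_plan` and `copy_same_type` are hypothetical names, and the rule used here to choose between the 4-byte and 2-byte unit is an assumption. The 18-byte figure is the size of one `q4_0` block (a 2-byte scale plus 16 bytes of packed 4-bit values).

```python
def scaled_copy_plan(n_bytes):
    """Return (unit_size, n_units) for copying n_bytes with a 4- or 2-byte copy kernel.

    Assumption: prefer the wider unit when the byte count allows it.
    """
    if n_bytes % 4 == 0:
        return 4, n_bytes // 4
    if n_bytes % 2 == 0:
        return 2, n_bytes // 2
    raise ValueError("byte count is not a multiple of 2")


def copy_same_type(src):
    """Copy a tensor's raw bytes unit by unit, ignoring the element type."""
    unit, count = scaled_copy_plan(len(src))
    dst = bytearray(len(src))
    for i in range(count):
        lo = i * unit
        dst[lo:lo + unit] = src[lo:lo + unit]
    return bytes(dst)


Q4_0_BLOCK_BYTES = 18                      # one q4_0 block of 32 weights
tensor = bytes(range(3 * Q4_0_BLOCK_BYTES))  # three blocks, 54 bytes
print(scaled_copy_plan(len(tensor)))
```

Because source and destination share the same type, the copy never has to interpret the data, so only the scaled unit count depends on the type size.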

4 weeks ago vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama...
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)

4 weeks ago use LOG_WARN to replace `std::cerr` (llama/13657)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (llama/13657)

4 weeks ago sycl : Remove waits from function calls (llama/13702)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (llama/13702)

* removes the waits in async memcpy functions

4 weeks ago SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)

Currently, when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` on a CUDA
backend to SYCL, two operations throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
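The check described above amounts to scanning the graph for operations that cannot be recorded. A minimal sketch follows, assuming the graph is available as a list of op names; `UNSUPPORTED_FOR_RECORDING` and `can_use_graph` are hypothetical names and do not mirror the actual function added to `ggml-sycl.cpp`.

```python
# The two ops named in the commit message as performing blocking waits.
UNSUPPORTED_FOR_RECORDING = {"CONCAT", "MUL_MAT_ID"}


def can_use_graph(node_ops, graph_requested):
    """Decide whether to record the compute graph.

    The user's request is necessary but not sufficient: a single
    unsupported node forces a fallback to eager execution.
    """
    if not graph_requested:
        return False
    return all(op not in UNSUPPORTED_FOR_RECORDING for op in node_ops)


print(can_use_graph(["MUL_MAT", "ADD", "SOFT_MAX"], graph_requested=True))
print(can_use_graph(["MUL_MAT", "CONCAT"], graph_requested=True))
```

Falling back silently keeps the run correct at the cost of losing the graph speedup for those workloads.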

4 weeks ago opencl: Add support for multiple devices (llama/12622)
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (llama/12622)

* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
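The selection policy described in this commit (one platform only, GPU platforms preferred, incapable devices filtered out) can be sketched as follows. The dictionaries, field names, and the order of filtering versus platform choice are all assumptions made for illustration; the real backend queries these properties through the OpenCL API.

```python
def usable(device):
    """Assumed capability filter: half-precision support and OpenCL 2.0+."""
    return device["has_half"] and device["cl_version"] >= (2, 0)


def select_devices(platforms):
    """Return (platform_name, devices) for a single chosen platform."""
    candidates = []
    for platform in platforms:
        devices = [d for d in platform["devices"] if usable(d)]
        if devices:
            candidates.append((platform["name"], devices))
    if not candidates:
        return None, []
    # Prefer the first platform that still offers a GPU after filtering.
    for name, devices in candidates:
        if any(d["type"] == "gpu" for d in devices):
            return name, devices
    return candidates[0]


platforms = [
    {"name": "cpu-only", "devices": [
        {"type": "cpu", "has_half": True, "cl_version": (3, 0)}]},
    {"name": "with-gpu", "devices": [
        {"type": "gpu", "has_half": True, "cl_version": (3, 0)},
        {"type": "gpu", "has_half": False, "cl_version": (1, 2)}]},
]
name, devices = select_devices(platforms)
print(name, len(devices))
```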

4 weeks ago opencl: fix couple crashes (llama/12795)
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (llama/12795)

* opencl: fix couple crashes

* fix kernel launches failing on devices that do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let the driver choose the
  work-group sizes). This patch does not cover everything, just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failing because `cl_buffer_region::origin` was
  not aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
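The alignment constraint on sub-buffers can be illustrated with plain arithmetic. OpenCL reports `CL_DEVICE_MEM_BASE_ADDR_ALIGN` in bits, and a sub-buffer origin must be a multiple of that value in bytes. The sketch below rounds the origin down and carries the remainder as an extra offset; whether the actual patch uses this strategy is an assumption, and `aligned_subbuffer_region` is a hypothetical helper.

```python
def aligned_subbuffer_region(offset, size, base_addr_align_bits):
    """Return (origin, region_size, residual) for an aligned sub-buffer.

    origin      -- offset rounded down to the device alignment (bytes)
    region_size -- size grown so the region still covers the requested bytes
    residual    -- extra byte offset to apply inside the sub-buffer
    """
    align_bytes = base_addr_align_bits // 8   # the device reports bits
    origin = (offset // align_bytes) * align_bytes
    residual = offset - origin
    return origin, size + residual, residual


# A device reporting 1024 bits requires 128-byte aligned origins.
print(aligned_subbuffer_region(1000, 256, 1024))
```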

4 weeks ago ggml : add ggml_gelu_erf() (llama/13667)
Xuan-Son Nguyen [Wed, 21 May 2025 14:26:33 +0000 (16:26 +0200)]
ggml : add ggml_gelu_erf() (llama/13667)

* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggestions

* revert naming order
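For context, the "not approximated" GELU is the textbook definition based on the error function, as opposed to the common tanh approximation. The sketch below uses the standard formulas in Python; it is not taken from the ggml source, and the function names are chosen only for this example.

```python
import math


def gelu_erf(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  erf={gelu_erf(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree closely over typical activation ranges, so the choice mainly matters for models whose reference implementation specifies the erf form.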

4 weeks ago musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accele...
R0CKSTAR [Wed, 21 May 2025 01:58:49 +0000 (09:58 +0800)]
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)

* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <redacted>
* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <redacted>
* Update src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <redacted>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
4 weeks ago vulkan: fix warnings (llama/13626)
Eve [Tue, 20 May 2025 21:35:16 +0000 (21:35 +0000)]
vulkan: fix warnings (llama/13626)

* small fixes

* remove ifdef

4 weeks ago CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
Johannes Gäßler [Tue, 20 May 2025 12:45:07 +0000 (14:45 +0200)]
CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)

* CUDA: skip fully masked-out KV in FA vec kernel

4 weeks ago sycl: disable reorder for sycl mulmat (llama/13536)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (llama/13536)

4 weeks ago metal : fix typo in FA kernel comments (llama/13651)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (llama/13651)

4 weeks ago sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)

* Remove mmap workaround on Windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. I have therefore removed the Windows workaround,
since it is not necessary.

* Update llama-bench README

The SYCL backend introduced a workaround that allows llama-bench to
run without specifying the `--mmp 0` flag

4 weeks ago Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence...
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)

5 weeks ago sync : llama.cpp
Georgi Gerganov [Mon, 19 May 2025 10:30:48 +0000 (13:30 +0300)]
sync : llama.cpp

ggml-ci

5 weeks ago CANN: Support MOE Model MUL_MAT_ID (llama/13042)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (llama/13042)

Signed-off-by: noemotiovon <redacted>
5 weeks ago cmake: use the current build config for vulkan-shaders-gen (llama/13595)
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (llama/13595)

* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`

5 weeks ago vulkan: move common FA code to flash_attn_base.comp (llama/13556)
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (llama/13556)

* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix

5 weeks ago vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)

5 weeks ago metal : add FA-vec kernel for head size 64 (llama/13583)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (llama/13583)

ggml-ci

5 weeks ago sycl : fixed compilation warnings (llama/13582)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (llama/13582)

5 weeks ago gguf : use ggml log system (llama/13571)
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (llama/13571)

* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages

5 weeks ago sycl: simplify bin_bcast_kernel (llama/13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (llama/13383)

5 weeks ago sycl: reordered Q4_K MMVQ (llama/13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (llama/13109)

5 weeks ago sycl: use oneDNN for matrices multiplication (llama/12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (llama/12972)

5 weeks ago arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)

This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```

5 weeks ago CUDA: fix crash on large batch size for quant. MoE (llama/13537)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (llama/13537)

5 weeks ago CUDA: faster Deepseek FA, add Turing support (llama/13435)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (llama/13435)

5 weeks ago cmake: simplify vulkan shader test logic (llama/13263)
bandoti [Wed, 14 May 2025 10:53:57 +0000 (07:53 -0300)]
cmake: simplify vulkan shader test logic (llama/13263)

5 weeks ago vulkan: KHR_coopmat flash attention (llama/13506)
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (llama/13506)

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.

5 weeks ago vulkan: workaround FA compile failures on macos (llama/13517)
Jeff Bolz [Wed, 14 May 2025 04:15:50 +0000 (13:15 +0900)]
vulkan: workaround FA compile failures on macos (llama/13517)

5 weeks ago metal : use FA-vec kernel up to batch size 20 (llama/13496)
Georgi Gerganov [Tue, 13 May 2025 15:04:39 +0000 (18:04 +0300)]
metal : use FA-vec kernel up to batch size 20 (llama/13496)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci

5 weeks ago metal : optimize multi-sequence FA vec kernel (llama/13493)
Georgi Gerganov [Tue, 13 May 2025 15:04:00 +0000 (18:04 +0300)]
metal : optimize multi-sequence FA vec kernel (llama/13493)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

5 weeks ago ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)
Dan Johansson [Tue, 13 May 2025 15:02:28 +0000 (17:02 +0200)]
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)

Signed-off-by: Dan Johansson <redacted>
5 weeks ago mnist: fix segmentation fault (#1227)
Johannes Gäßler [Mon, 19 May 2025 07:33:35 +0000 (09:33 +0200)]
mnist: fix segmentation fault (#1227)

5 weeks ago ggml : fix apple OS check in ggml_print_backtrace (#1229)
Diego Devesa [Mon, 19 May 2025 01:30:13 +0000 (18:30 -0700)]
ggml : fix apple OS check in ggml_print_backtrace (#1229)

6 weeks ago ggml : Fix missing backtrace on Linux (#1228)
Daniel Tang [Sat, 17 May 2025 23:06:26 +0000 (19:06 -0400)]
ggml : Fix missing backtrace on Linux (#1228)

* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols

6 weeks ago sync : whisper.cpp
Georgi Gerganov [Tue, 13 May 2025 11:00:29 +0000 (14:00 +0300)]
sync : whisper.cpp

ggml-ci

6 weeks ago examples : update link to Paul Tol's color scheme [no ci] (whisper/3140)
Daniel Bevenius [Mon, 12 May 2025 07:02:06 +0000 (09:02 +0200)]
examples : update link to Paul Tol's color scheme [no ci] (whisper/3140)

This commit updates the link to Paul Tol's color scheme in the
`examples/common.h` file. The previous link was outdated and
pointed to a non-existent page.

6 weeks ago examples : update to ggml-opt and ggml-backend changes (#0)
Georgi Gerganov [Tue, 13 May 2025 09:52:39 +0000 (12:52 +0300)]
examples : update to ggml-opt and ggml-backend changes (#0)

ggml-ci

6 weeks ago sync : llama.cpp
Georgi Gerganov [Tue, 13 May 2025 09:43:02 +0000 (12:43 +0300)]
sync : llama.cpp

ggml-ci

6 weeks ago opencl: remove unnecessary assert for `add` (llama/13257)
lhez [Mon, 12 May 2025 20:13:49 +0000 (13:13 -0700)]
opencl: remove unnecessary assert for `add` (llama/13257)

6 weeks ago llama/ggml: add LLM training support (llama/10544)
Johannes Gäßler [Mon, 12 May 2025 12:44:49 +0000 (14:44 +0200)]
llama/ggml: add LLM training support (llama/10544)

* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period

6 weeks ago ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053)
Dan Johansson [Mon, 12 May 2025 11:06:19 +0000 (13:06 +0200)]
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053)

* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <redacted>
* * code review fixes

Signed-off-by: Dan Johansson <redacted>
* * adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <redacted>
---------

Signed-off-by: Dan Johansson <redacted>
Co-authored-by: Charles Xu <redacted>
6 weeks ago CUDA: fix misaligned synchronization in FA (llama/13469)
Johannes Gäßler [Mon, 12 May 2025 08:51:21 +0000 (10:51 +0200)]
CUDA: fix misaligned synchronization in FA (llama/13469)

6 weeks ago ggml : add mrope kernel for metal (llama/13457)
Xuan-Son Nguyen [Mon, 12 May 2025 08:29:13 +0000 (10:29 +0200)]
ggml : add mrope kernel for metal (llama/13457)

6 weeks ago enable dpcpp nightly builds with libraries (llama/13406)
Atharva Dubey [Mon, 12 May 2025 05:15:32 +0000 (06:15 +0100)]
enable dpcpp nightly builds with libraries (llama/13406)

6 weeks ago CUDA: fix crash with partial offloading of MoE (llama/13439)
Johannes Gäßler [Sun, 11 May 2025 14:09:33 +0000 (16:09 +0200)]
CUDA: fix crash with partial offloading of MoE (llama/13439)

6 weeks ago Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama...
David Huang [Sun, 11 May 2025 12:18:39 +0000 (20:18 +0800)]
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386)

6 weeks ago CUDA: fix race conditions FlashAttention kernels (llama/13438)
Johannes Gäßler [Sat, 10 May 2025 20:22:48 +0000 (22:22 +0200)]
CUDA: fix race conditions FlashAttention kernels (llama/13438)

6 weeks ago CUDA: fix FlashAttention on Turing (llama/13415)
Johannes Gäßler [Sat, 10 May 2025 07:16:52 +0000 (09:16 +0200)]
CUDA: fix FlashAttention on Turing (llama/13415)

6 weeks ago vulkan: scalar flash attention implementation (llama/13324)
Jeff Bolz [Sat, 10 May 2025 06:07:07 +0000 (23:07 -0700)]
vulkan: scalar flash attention implementation (llama/13324)

* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA

6 weeks ago sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858)
Alberto Cabrera Pérez [Fri, 9 May 2025 15:34:08 +0000 (16:34 +0100)]
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858)

* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <redacted>
* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <redacted>
* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <redacted>
Co-authored-by: romain.biessy <redacted>
6 weeks ago metal : optimize MoE for large batches (llama/13388)
Georgi Gerganov [Fri, 9 May 2025 12:14:56 +0000 (15:14 +0300)]
metal : optimize MoE for large batches (llama/13388)

ggml-ci

6 weeks ago CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
Johannes Gäßler [Fri, 9 May 2025 11:34:58 +0000 (13:34 +0200)]
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)

* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template

6 weeks ago CUDA: fix crash on large batch size for MoE models (llama/13384)
Johannes Gäßler [Fri, 9 May 2025 10:14:04 +0000 (12:14 +0200)]
CUDA: fix crash on large batch size for MoE models (llama/13384)

6 weeks ago rpc : add rpc_msg_set_tensor_hash_req (llama/13353)
Radoslav Gerganov [Fri, 9 May 2025 07:31:07 +0000 (10:31 +0300)]
rpc : add rpc_msg_set_tensor_hash_req (llama/13353)

* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix

6 weeks ago vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326)
Jeff Bolz [Fri, 9 May 2025 07:23:41 +0000 (02:23 -0500)]
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
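The arithmetic behind the failing assert is small enough to check directly. In this sketch the variable names `nei0` and `nei1` come from the assert quoted above, while `OLD_LIMIT` and `NEW_LIMIT` are labels introduced here for the two bounds mentioned in the commit.

```python
nei0, nei1 = 8, 512           # tensor shape reported in the commit message
OLD_LIMIT = 3072              # bound in the assert that fired
NEW_LIMIT = 4096              # bound named in the commit title

n_row_ids = nei0 * nei1
print(n_row_ids, n_row_ids <= OLD_LIMIT, n_row_ids <= NEW_LIMIT)
```

The product equals the new bound exactly, so the reported tensor fits with no headroom to spare.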