git.djapps.eu Git - pkg/ggml/sources/ggml/log
3 months agoggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture...
Srihari-mcw [Thu, 20 Mar 2025 11:35:34 +0000 (17:05 +0530)]
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332)

* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments

3 months agoCUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
Gaurav Garg [Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)]
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)

- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to pick the optimal parallel_blocks value (sketched below).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1

Fixes Issue: #12182
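
A minimal host-side sketch of the occupancy idea, assuming a hypothetical `flash_decode_kernel` and a fixed block size; the actual llama.cpp heuristic differs in detail:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

__global__ void flash_decode_kernel() {}  // hypothetical kernel, for illustration

int pick_parallel_blocks(int nheads) {
    int dev = 0, nsm = 0, blocks_per_sm = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, dev);
    // ask the runtime how many blocks of this kernel fit on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, (const void *) flash_decode_kernel,
        /*blockSize=*/256, /*dynamicSMemSize=*/0);
    const int max_blocks = nsm * blocks_per_sm;  // blocks the GPU can run at once
    // split each head's work into enough parallel blocks to fill the device
    return std::max(1, max_blocks / std::max(1, nheads));
}
```
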
---------

Co-authored-by: Johannes Gäßler <redacted>
3 months agovulkan: optimize iq1 coopmat2 dequant functions (llama/12427)
Jeff Bolz [Wed, 19 Mar 2025 18:56:23 +0000 (13:56 -0500)]
vulkan: optimize iq1 coopmat2 dequant functions (llama/12427)

3 months agoFix visionOS build and add CI (llama/12415)
Guus Waals [Wed, 19 Mar 2025 10:15:23 +0000 (10:15 +0000)]
Fix visionOS build and add CI (llama/12415)

* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <redacted>
3 months agovulkan: Submit once enough matmul work has been recorded (llama/12406)
Jeff Bolz [Wed, 19 Mar 2025 07:26:26 +0000 (02:26 -0500)]
vulkan: Submit once enough matmul work has been recorded (llama/12406)

I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
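
A rough sketch of such a heuristic, with illustrative names and ramp factors (not the actual ggml-vulkan code):

```cpp
#include <cstddef>

struct submit_state {
    size_t bytes_since_submit = 0;  // weight-matrix bytes recorded so far
    int    num_submits        = 0;
};

// returns true when enough matmul work has been recorded to flush the queue
bool should_submit(submit_state & s, size_t weight_bytes) {
    const size_t full = 100u * 1024 * 1024;  // ~100MB steady-state threshold
    // ramp up: use a smaller threshold for the first few submits
    const size_t threshold = s.num_submits < 3 ? full >> (3 - s.num_submits) : full;
    s.bytes_since_submit += weight_bytes;
    if (s.bytes_since_submit < threshold) {
        return false;
    }
    s.bytes_since_submit = 0;
    s.num_submits++;
    return true;  // caller records a vkQueueSubmit here
}
```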

3 months agoopencl: improve profiling (llama/12442)
lhez [Tue, 18 Mar 2025 19:54:55 +0000 (12:54 -0700)]
opencl: improve profiling (llama/12442)

* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing

3 months agomusa: override warp_size of musa device to 32 (llama/12445)
R0CKSTAR [Tue, 18 Mar 2025 18:28:26 +0000 (02:28 +0800)]
musa: override warp_size of musa device to 32 (llama/12445)

Signed-off-by: Xiaodong Ye <redacted>
3 months agoSYCL: using graphs is configurable by environment variable and compile option (llama...
Łukasz Ślusarczyk [Tue, 18 Mar 2025 10:16:31 +0000 (11:16 +0100)]
SYCL: using graphs is configurable by environment variable and compile option (llama/12371)

* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <redacted>
* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <redacted>
3 months agoggml : add SVE support for q6_K_q8_K (llama/12361)
fj-y-saito [Tue, 18 Mar 2025 08:14:39 +0000 (17:14 +0900)]
ggml : add SVE support for q6_K_q8_K (llama/12361)

3 months agoVulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver...
0cc4m [Tue, 18 Mar 2025 06:21:40 +0000 (07:21 +0100)]
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)

3 months agofixed compilation warnings in ggml-sycl (llama/12424)
Łukasz Ślusarczyk [Tue, 18 Mar 2025 00:51:25 +0000 (01:51 +0100)]
fixed compilation warnings in ggml-sycl (llama/12424)

3 months agollama: Add support for RWKV v7 architecture (llama/12412)
Molly Sophia [Mon, 17 Mar 2025 23:27:50 +0000 (07:27 +0800)]
llama: Add support for RWKV v7 architecture (llama/12412)

* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <redacted>
* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <redacted>
* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <redacted>
* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <redacted>
* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <redacted>
* Apply code-format changes

Signed-off-by: Molly Sophia <redacted>
* fix MUSA build

Signed-off-by: Molly Sophia <redacted>
* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
3 months agocuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
Gaurav Garg [Mon, 17 Mar 2025 18:25:13 +0000 (23:55 +0530)]
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)

* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API changed in CUDA 12.x; for this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by falling back to the older API there (see the sketch at the end of this entry).

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
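
A minimal sketch of the compile-time split (hypothetical wrapper; the real code lives in ggml-cuda):

```cpp
#include <cuda_runtime.h>

// try to update an instantiated graph in place; the signature of
// cudaGraphExecUpdate changed in CUDA 12.x, so pre-12 toolkits use the old form
static bool try_graph_update(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info) == cudaSuccess;
#else
    cudaGraphNode_t           err_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &err_node, &result) == cudaSuccess;
#endif
}
```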

3 months agoggml-vulkan: remove unused find_program(glslc) (llama/12416)
Guus Waals [Mon, 17 Mar 2025 16:35:43 +0000 (00:35 +0800)]
ggml-vulkan: remove unused find_program(glslc) (llama/12416)

It's already found by FindVulkan.cmake in the parent CMakeLists

3 months agovulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)
Jeff Bolz [Mon, 17 Mar 2025 14:26:18 +0000 (09:26 -0500)]
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)

3 months agovulkan: subgroup size tuning (llama/12087)
Daniele [Mon, 17 Mar 2025 11:42:33 +0000 (12:42 +0100)]
vulkan: subgroup size tuning (llama/12087)

* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <redacted>
3 months agovulkan: use fp32 in coopmat2 q4_k dequant function (llama/12309)
Jeff Bolz [Mon, 17 Mar 2025 09:43:35 +0000 (04:43 -0500)]
vulkan: use fp32 in coopmat2 q4_k dequant function (llama/12309)

3 months agovulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking ...
Jeff Bolz [Mon, 17 Mar 2025 09:41:59 +0000 (04:41 -0500)]
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273)

* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking

3 months agovulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258)
Jeff Bolz [Mon, 17 Mar 2025 09:35:00 +0000 (04:35 -0500)]
vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258)

3 months agocmake : enable building llama.cpp using system libggml (llama/12321)
Christian Kastner [Mon, 17 Mar 2025 09:05:23 +0000 (10:05 +0100)]
cmake : enable building llama.cpp using system libggml (llama/12321)

* cmake: Factor out compiler flag function from ggml

llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.

3 months agoSYCL: set extras only on GGML_TYPE_Q4_0 (llama/12366)
Akarshan Biswas [Mon, 17 Mar 2025 01:45:12 +0000 (07:15 +0530)]
SYCL: set extras only on GGML_TYPE_Q4_0 (llama/12366)

* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface

3 months agoSYCL: Delete redundant plus sign and space (llama/12391)
aubreyli [Sat, 15 Mar 2025 14:49:03 +0000 (22:49 +0800)]
SYCL: Delete redundant plus sign and space (llama/12391)

3 months agoSYCL : support non-contiguous tensors in binary ops (add, sub, etc) (llama/12399)
fairydreaming [Sat, 15 Mar 2025 14:19:30 +0000 (15:19 +0100)]
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (llama/12399)

* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <redacted>
3 months agoMUL_MAT optimization (llama/12382)
Chenguang Li [Sat, 15 Mar 2025 01:31:08 +0000 (09:31 +0800)]
MUL_MAT optimization (llama/12382)

3 months agosycl : variable sg_size support for mmvq kernels (llama/12336)
Alberto Cabrera Pérez [Wed, 12 Mar 2025 09:57:32 +0000 (09:57 +0000)]
sycl : variable sg_size support for mmvq kernels (llama/12336)

3 months agoCUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
uvos [Wed, 12 Mar 2025 09:14:11 +0000 (10:14 +0100)]
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)

When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were
converted to a selectable warp size. However, the fattn-vec kernels don't work with
64-wide warps for now, so we need to avoid launching them with warp64 parameters
(sketched below).
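
A tiny sketch of the guard, with illustrative names (not the actual ggml-cuda code):

```cpp
// fattn-vec kernels currently assume 32-wide warps, so never launch them with
// the device's 64-wide warp parameters
constexpr int FATTN_VEC_WARP_SIZE = 32;

int fattn_launch_warp_size(int device_warp_size, bool is_fattn_vec) {
    return is_fattn_vec ? FATTN_VEC_WARP_SIZE : device_warp_size;
}
```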

3 months agovulkan: fix bug in coopmat1 mul_mat_id (llama/12316)
Jeff Bolz [Wed, 12 Mar 2025 05:59:19 +0000 (00:59 -0500)]
vulkan: fix bug in coopmat1 mul_mat_id (llama/12316)

* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id

3 months agoCUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block betwee...
uvos [Tue, 11 Mar 2025 19:16:03 +0000 (20:16 +0100)]
CUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. (llama/12177)

refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <redacted>
3 months agoggml-backend : fix backend search path (llama/12330)
jklincn [Tue, 11 Mar 2025 13:25:17 +0000 (21:25 +0800)]
ggml-backend : fix backend search path (llama/12330)

* Fix backend search path

* replace .native() with '/'

* reverted .native()

3 months agometal : Cache the Metal library at the device context level (llama/12265)
BB-fat [Tue, 11 Mar 2025 11:45:02 +0000 (19:45 +0800)]
metal : Cache the Metal library at the device context level (llama/12265)

3 months agomat vec double buffer (llama/12188)
Eve [Mon, 10 Mar 2025 19:28:11 +0000 (19:28 +0000)]
mat vec double buffer (llama/12188)

3 months agomusa: support new arch mp_31 and update doc (llama/12296)
R0CKSTAR [Mon, 10 Mar 2025 17:18:25 +0000 (01:18 +0800)]
musa: support new arch mp_31 and update doc (llama/12296)

Signed-off-by: Xiaodong Ye <redacted>
3 months agoopencl: use OpenCL C standard supported by the device (llama/12221)
Henry Linjamäki [Mon, 10 Mar 2025 16:57:00 +0000 (18:57 +0200)]
opencl: use OpenCL C standard supported by the device (llama/12221)

This patch nudges llama.cpp a bit so that it runs on PoCL, which
doesn't support OpenCL C 2.0. The issue is solved by querying the
device for the OpenCL C versions it supports and using the highest
one available (sketched below).
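
A minimal sketch of the query, assuming the single-version CL_DEVICE_OPENCL_C_VERSION string (OpenCL 3.0 also offers CL_DEVICE_OPENCL_C_ALL_VERSIONS):

```cpp
#include <CL/cl.h>
#include <string>

// returns e.g. "-cl-std=CL1.2" or "-cl-std=CL3.0" for clBuildProgram
std::string cl_std_build_flag(cl_device_id dev) {
    char buf[64] = {0};
    // the returned string has the form "OpenCL C <major>.<minor> ..."
    clGetDeviceInfo(dev, CL_DEVICE_OPENCL_C_VERSION, sizeof(buf) - 1, buf, nullptr);
    return std::string("-cl-std=CL") + std::string(buf).substr(9, 3);
}
```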

3 months agotests : fix test-quantize-fns to init the CPU backend (llama/12306)
Georgi Gerganov [Mon, 10 Mar 2025 12:07:15 +0000 (14:07 +0200)]
tests : fix test-quantize-fns to init the CPU backend (llama/12306)

ggml-ci

3 months agoggml-backend : make path_str compatible with C++20 (llama/12269)
Jason C.H [Sat, 8 Mar 2025 16:02:39 +0000 (00:02 +0800)]
ggml-backend : make path_str compatible with C++20 (llama/12269)

3 months agoggml : skip intermediate .air file when compiling .metallib (llama/12247)
Daniel Bevenius [Fri, 7 Mar 2025 13:15:27 +0000 (14:15 +0100)]
ggml : skip intermediate .air file when compiling .metallib (llama/12247)

This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.

3 months agoci: disable test-opt for now (#1158)
Akarshan Biswas [Wed, 26 Mar 2025 07:51:18 +0000 (13:21 +0530)]
ci: disable test-opt for now (#1158)

* ci: disable test-opt for now

* Use CTEXT_EXTRA to disable tests

3 months agoci: Initial SYCL setup (#1157)
Akarshan Biswas [Tue, 25 Mar 2025 09:38:14 +0000 (15:08 +0530)]
ci: Initial SYCL setup (#1157)

3 months agoCreate CONTRIBUTING.md (#1146)
cmdr2 [Thu, 13 Mar 2025 18:29:48 +0000 (23:59 +0530)]
Create CONTRIBUTING.md (#1146)

* Create CONTRIBUTING.md

* Update CONTRIBUTING.md

3 months agogpt-2 : add comment about KV cache type (#1142)
bssrdf [Thu, 13 Mar 2025 18:29:19 +0000 (14:29 -0400)]
gpt-2 : add comment about KV cache type (#1142)

* change KV cache to fp16 to take advantage of tensor cores

* added a note/comment to indicate kv can be FP16

3 months agocmake: Enable specifying exact PowerPC CPU architecture (#1138)
Christian Kastner [Mon, 10 Mar 2025 18:19:58 +0000 (19:19 +0100)]
cmake: Enable specifying exact PowerPC CPU architecture (#1138)

In the process, guard automatic CPU detection with GGML_NATIVE.

https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#index-mcpu-10

3 months agocmake: Comment out GGML_BIN_DIR for now (#1139)
Christian Kastner [Mon, 10 Mar 2025 12:06:21 +0000 (13:06 +0100)]
cmake: Comment out GGML_BIN_DIR for now (#1139)

Nothing installs to it yet, so when attempting to use the cmake package,
set_and_check() triggers an error unless the directory already happens
to exist for other reasons.

3 months agospm : remove (#1135)
Georgi Gerganov [Sat, 8 Mar 2025 13:18:24 +0000 (15:18 +0200)]
spm : remove (#1135)

ggml-ci

3 months agosync : whisper.cpp upstream/0.0.1802
Georgi Gerganov [Sat, 8 Mar 2025 13:14:03 +0000 (15:14 +0200)]
sync : whisper.cpp

ggml-ci

3 months agocommon : fix audio loading by miniaudio (whisper/2862)
Dmitry Atamanov [Tue, 4 Mar 2025 17:05:21 +0000 (22:05 +0500)]
common : fix audio loading by miniaudio (whisper/2862)

3 months agosync : llama.cpp
Georgi Gerganov [Fri, 7 Mar 2025 12:50:30 +0000 (14:50 +0200)]
sync : llama.cpp

ggml-ci

3 months agoggml-cpu: faster AVX2 variant for IQ1_M (llama/12216)
Rémy O [Fri, 7 Mar 2025 11:54:22 +0000 (12:54 +0100)]
ggml-cpu: faster AVX2 variant for IQ1_M (llama/12216)

3 months agometal : simplify kernel arguments using a struct (#3229) (llama/12194)
BB-fat [Fri, 7 Mar 2025 07:35:57 +0000 (15:35 +0800)]
metal : simplify kernel arguments using a struct (#3229) (llama/12194)

* metal : refactor im2col parameters into a struct (pattern sketched at the end of this entry)

* metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets

* metal : refactor sum_rows parameters into a struct

* metal : refactor soft_max parameters into a struct

* metal : refactor diag_mask_inf parameters into a struct

* metal : refactor ssm_conv parameters into a struct

* metal : refactor ssm_scan parameters into a struct

* metal : refactor get_rows parameters into a struct

* metal : refactor group_norm parameters into a struct

* metal : refactor conv_transpose_1d parameters into a struct

* metal : refactor upscale parameters into a struct

* metal : refactor pad parameters into a struct

* metal : refactor pad_reflect_1d parameters into a struct

* metal : refactor arange parameters into a struct

* metal : refactor timestep_embedding parameters into a struct

* metal : refactor argsort parameters into a struct

* metal : refactor leaky_relu parameters into a struct

* metal : refactor pool_2d parameters into a struct

* metal : fix trailing whitespace
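
The pattern, sketched with a hypothetical struct (the real structs are defined in a header shared between the host code and the .metal kernels):

```cpp
#include <cstdint>

// all scalar arguments of one kernel gathered into a single struct, so the
// host binds them with one setBytes call instead of many individual indices
struct kargs_im2col {            // hypothetical layout, for illustration
    uint64_t ofs0, ofs1;         // uint64_t offsets, per the change noted above
    int32_t  IW, IH, CHW;
    int32_t  s0, s1, p0, p1, d0, d1;
};

// host side (Objective-C++), roughly:
//   kargs_im2col args = {/*...*/};
//   [encoder setBytes:&args length:sizeof(args) atIndex:0];
```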

---------

Co-authored-by: alexju <redacted>
3 months agometal : fix default.metallib build (llama/12224)
Daniel Bevenius [Fri, 7 Mar 2025 05:23:16 +0000 (06:23 +0100)]
metal : fix default.metallib build (llama/12224)

This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.

The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'.  Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```

With the above change the build could progress, but there was a
follow-on error about not being able to find the ggml-common.h file in
ggml-metal.metal, where it was included as a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
         ^~~~~~~~~~~~~~~~~~
         "ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.

3 months agoopencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (llama/12217)
lhez [Fri, 7 Mar 2025 00:20:35 +0000 (16:20 -0800)]
opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (llama/12217)

* opencl: support noncontiguous `norm`

* opencl: support noncontiguous `rms_norm`

* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`

3 months agocmake : fix undefined reference errors for std::filesystem in ggml (#12092) (llama...
xiaofei [Thu, 6 Mar 2025 22:58:25 +0000 (06:58 +0800)]
cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (llama/12094)

Signed-off-by: Ray Lee <redacted>
Co-authored-by: Ray Lee <redacted>
3 months agoCUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (llama/12222)
Johannes Gäßler [Thu, 6 Mar 2025 17:45:09 +0000 (18:45 +0100)]
CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (llama/12222)

3 months agoHIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it...
uvos [Thu, 6 Mar 2025 07:20:52 +0000 (08:20 +0100)]
HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it. (llama/12209)

This avoids conflicts with the internal CUDA/HIP runtime's memory management behavior.

3 months agoopencl : fix buffer alignment (llama/12197)
Henry Linjamäki [Thu, 6 Mar 2025 01:33:40 +0000 (03:33 +0200)]
opencl : fix buffer alignment (llama/12197)

Fix the following error:

```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```

which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).

Also fix `ggml_backend_opencl_context::alignment`: it was set from
`CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which was treated as bytes even though
the value is reported in bits (see the sketch below).
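
A sketch of the unit conversion (illustrative helper, not the actual ggml-opencl code):

```cpp
#include <CL/cl.h>
#include <cstddef>

// CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits, so divide by 8
// before using it as a byte alignment for buffer offsets
size_t device_alignment_bytes(cl_device_id dev) {
    cl_uint align_bits = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                    sizeof(align_bits), &align_bits, nullptr);
    return align_bits / 8;
}
```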

3 months agoopencl : fix `ulong` kernel args were set from `int` variables (llama/12174)
Henry Linjamäki [Thu, 6 Mar 2025 01:31:14 +0000 (03:31 +0200)]
opencl : fix `ulong` kernel args were set from `int` variables (llama/12174)

... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running on PoCL (see the sketch below).
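
A sketch of the bug class (hypothetical helper; the offending call sites are in ggml-opencl):

```cpp
#include <CL/cl.h>

// clSetKernelArg copies exactly `size` bytes, so passing an int for a `ulong`
// kernel parameter leaves the upper 4 bytes undefined; widen it first
void set_ulong_arg(cl_kernel k, cl_uint idx, int value) {
    // buggy: clSetKernelArg(k, idx, sizeof(value), &value);
    cl_ulong v64 = (cl_ulong) value;
    clSetKernelArg(k, idx, sizeof(v64), &v64);
}
```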

3 months agoopencl : fix profile-related errors (llama/12095)
simon886212 [Thu, 6 Mar 2025 01:30:05 +0000 (09:30 +0800)]
opencl : fix profile-related errors (llama/12095)

Co-authored-by: ubuntu <redacted>
3 months agoggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (llama/12154)
Rémy O [Thu, 6 Mar 2025 01:26:10 +0000 (02:26 +0100)]
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (llama/12154)

* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (illustrated below)

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC
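
An illustration of why BMI2 helps here (a generic example, not the IQ1 kernel itself): _pdep_u64 scatters packed low bits into a strided pattern in one instruction, replacing chains of shifts and masks when expanding sub-byte quantized data.

```cpp
#include <immintrin.h>
#include <cstdint>

// expand 8 packed 1-bit values into 8 bytes of 0x00/0x01 with one BMI2 pdep;
// bit i of `packed` lands in bit 0 of byte i (requires -mbmi2)
uint64_t expand_bits_to_bytes(uint8_t packed) {
    return _pdep_u64(packed, 0x0101010101010101ull);
}
```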

3 months agoSYCL: Disable f16 Unary OPs as not supported by the kernels (llama/12201)
Akarshan Biswas [Wed, 5 Mar 2025 15:58:23 +0000 (21:28 +0530)]
SYCL: Disable f16 Unary OPs as not supported by the kernels (llama/12201)

3 months agoggml : fix GGMLMetalClass ODR (llama/12200)
Plamen Minev [Wed, 5 Mar 2025 15:16:01 +0000 (17:16 +0200)]
ggml : fix GGMLMetalClass ODR (llama/12200)

This might happen if ggml is loaded from two separate libraries, since each of them will expose the class. This is more of a guard, since we want to use Metal only as an embedded library and don't care about the other case.

3 months agoggml : ggml_compute_forward_concat() for arbitrary tensor type (#1118)
vmobilis [Fri, 7 Mar 2025 08:11:40 +0000 (11:11 +0300)]
ggml : ggml_compute_forward_concat() for arbitrary tensor type (#1118)

* ggml_compute_forward_concat() for arbitrary tensor type

* Check that tensors' type match

* ggml-cpu.c: check type of source tensors

* ggml-cpu.c: move tensor type check to ggml_compute_forward_concat()

* ggml.c: check concatenated tensor type

* Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c

..., as it was moved to ggml.c.

3 months agopkg-config: Use CMake install paths for lib, include (#1133)
Christian Kastner [Thu, 6 Mar 2025 19:01:02 +0000 (20:01 +0100)]
pkg-config: Use CMake install paths for lib, include (#1133)

3 months agovulkan : sync (llama/0)
Georgi Gerganov [Tue, 4 Mar 2025 19:08:15 +0000 (21:08 +0200)]
vulkan : sync (llama/0)

ggml-ci

3 months agosync : llama.cpp
Georgi Gerganov [Tue, 4 Mar 2025 19:07:04 +0000 (21:07 +0200)]
sync : llama.cpp

3 months agoggml : portability fixes for VS 2017 (llama/12150)
mgroeber9110 [Tue, 4 Mar 2025 16:53:26 +0000 (17:53 +0100)]
ggml : portability fixes for VS 2017 (llama/12150)

* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <redacted>
3 months agoHIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
David Huang [Mon, 3 Mar 2025 21:10:54 +0000 (05:10 +0800)]
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)

Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16

3 months agotest-backend-ops : add option -p to filter by op params (llama/12155)
Diego Devesa [Mon, 3 Mar 2025 13:00:46 +0000 (14:00 +0100)]
test-backend-ops : add option -p to filter by op params (llama/12155)

3 months agoggml : fix kleidiai build (llama/12159)
ag2s20150909 [Mon, 3 Mar 2025 12:54:08 +0000 (20:54 +0800)]
ggml : fix kleidiai build (llama/12159)

The libggml API has changed, but this has not been updated.

3 months agoSYCL: Move CPY kernels to a separate file and add few missing kernels (llama/12133)
Akarshan Biswas [Mon, 3 Mar 2025 10:07:22 +0000 (15:37 +0530)]
SYCL: Move CPY kernels to a separate file and add few missing kernels (llama/12133)

* SYCL: refactor and move cpy kernels to a separate file

* Add few missing cpy kernels

* refactor and add debug logs

3 months agoggml-backend : keep paths in native string type when possible (llama/12144)
Diego Devesa [Sun, 2 Mar 2025 21:11:00 +0000 (22:11 +0100)]
ggml-backend : keep paths in native string type when possible (llama/12144)

3 months agoCUDA: compress mode option and default to size (llama/12029)
Erik Scholz [Sat, 1 Mar 2025 11:57:22 +0000 (12:57 +0100)]
CUDA: compress mode option and default to size (llama/12029)

CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".

3 months agoggml : upgrade init_tensor API to return a ggml_status (llama/11854)
William Tambellini [Fri, 28 Feb 2025 13:41:47 +0000 (05:41 -0800)]
ggml : upgrade init_tensor API to return a ggml_status (llama/11854)

* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml that does not abort on OOM but returns an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status (see the sketch at the end of this entry).

* misc fixes
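
A sketch of the new shape of the interface (hypothetical backend function; the status values mirror ggml's public enum):

```cpp
enum ggml_status_sketch {        // mirrors ggml's status codes, for illustration
    GGML_STATUS_ALLOC_FAILED = -2,
    GGML_STATUS_FAILED       = -1,
    GGML_STATUS_SUCCESS      =  0,
};

// previously returned void and aborted on failure; now the caller decides
ggml_status_sketch buffer_init_tensor(void * buf, void * tensor) {
    if (buf == nullptr || tensor == nullptr) {
        return GGML_STATUS_ALLOC_FAILED;  // e.g. propagate OOM to the caller
    }
    // ... set up the tensor's views/extras inside the buffer ...
    return GGML_STATUS_SUCCESS;
}
```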

---------

Co-authored-by: slaren <redacted>
3 months agovulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595)
Rémy O [Fri, 28 Feb 2025 08:42:52 +0000 (09:42 +0100)]
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595)

* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants

3 months agoCUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (llama/12098)
Johannes Gäßler [Fri, 28 Feb 2025 08:26:43 +0000 (09:26 +0100)]
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (llama/12098)

3 months agoggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064)
Prashant Vithule [Fri, 28 Feb 2025 07:36:12 +0000 (13:06 +0530)]
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064)

* Added SVE Support for Q2_K Quantized Models

* Use 4-space indentation in the switch cases

* removed comment lines

* Remove the loop; retain the curly braces for better readability

* Remove the comment line added for the q3_k_q8_k kernel

---------

Co-authored-by: vithulep <redacted>
3 months agoCANN: Fix build error with GCC 13 (llama/11990)
hipudding [Fri, 28 Feb 2025 07:23:47 +0000 (15:23 +0800)]
CANN: Fix build error with GCC 13 (llama/11990)

Remove unused header file that causes compilation failure on ARM
platform with GCC 13.

3 months agovulkan: matmul dequantization improvements (llama/12015)
Eve [Fri, 28 Feb 2025 07:20:08 +0000 (07:20 +0000)]
vulkan: matmul dequantization improvements (llama/12015)

* faster dequant for old quants

* don't use unpack for iq4_nl

* vec2 unpack for q8

3 months agovulkan: improve im2col (llama/11826)
Daniele [Fri, 28 Feb 2025 06:52:51 +0000 (06:52 +0000)]
vulkan: improve im2col (llama/11826)

* vulkan: improve im2col performance

3 months agocmake: Fix ggml backend dependencies and installation (llama/11818)
Vladimir Vuksanovic [Thu, 27 Feb 2025 07:42:48 +0000 (08:42 +0100)]
cmake: Fix ggml backend dependencies and installation (llama/11818)

* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends

3 months agovulkan: fix assertion when qy_needs_dequant (llama/12068)
Jeff Bolz [Tue, 25 Feb 2025 15:30:21 +0000 (09:30 -0600)]
vulkan: fix assertion when qy_needs_dequant (llama/12068)

Looks like a copy/paste bug from qx_needs_dequant.

3 months agoggml-cpu: Fix build with sve (llama/12059)
Molly Sophia [Tue, 25 Feb 2025 11:28:22 +0000 (19:28 +0800)]
ggml-cpu: Fix build with sve (llama/12059)

* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <redacted>
* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
3 months agocuda: unary ops as float + de-duplicate (#1130)
cmdr2 [Mon, 3 Mar 2025 15:21:31 +0000 (20:51 +0530)]
cuda: unary ops as float + de-duplicate (#1130)

3 months agocuda/vulkan: specify fp32-only support for some operations in supports_op (#1129)
cmdr2 [Fri, 28 Feb 2025 10:29:55 +0000 (15:59 +0530)]
cuda/vulkan: specify fp32-only support for some operations in supports_op (#1129)

* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold

* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)

* f32 sigmoid in vulkan supports op

* Revert "f32 sigmoid in vulkan supports op"

This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.

3 months agocuda/cpu: Increase support for fp16 unary operations (#1125)
cmdr2 [Fri, 28 Feb 2025 07:04:39 +0000 (12:34 +0530)]
cuda/cpu: Increase support for fp16 unary operations (#1125)

* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests

4 months agosync : whisper.cpp
Georgi Gerganov [Thu, 27 Feb 2025 12:43:20 +0000 (14:43 +0200)]
sync : whisper.cpp

ggml-ci

4 months agowhisper : support GGML_BACKEND_DL (whisper/2843)
Diego Devesa [Thu, 27 Feb 2025 12:35:07 +0000 (13:35 +0100)]
whisper : support GGML_BACKEND_DL (whisper/2843)

* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <redacted>
4 months agoci : fix workflow name
Georgi Gerganov [Thu, 27 Feb 2025 11:11:33 +0000 (13:11 +0200)]
ci : fix workflow name

4 months agoexamples : remove dr_wab.h (#1127)
Georgi Gerganov [Thu, 27 Feb 2025 10:53:37 +0000 (12:53 +0200)]
examples : remove dr_wab.h (#1127)

ggml-ci

4 months agosync : whisper.cpp
Georgi Gerganov [Thu, 27 Feb 2025 10:52:45 +0000 (12:52 +0200)]
sync : whisper.cpp

4 months agocommon : separate whisper sources (whisper/2846)
Georgi Gerganov [Thu, 27 Feb 2025 10:50:32 +0000 (12:50 +0200)]
common : separate whisper sources (whisper/2846)

* common : separate whisper sources

* examples : add chrono

* examples : add more headers

4 months agocommon : fix build min/max (whisper/2845)
Georgi Gerganov [Thu, 27 Feb 2025 08:39:13 +0000 (10:39 +0200)]
common : fix build min/max (whisper/2845)

* common : try to fix build

* cont : try another fix

4 months agoexamples : use miniaudio for direct decoding flac, mp3, ogg and wav (whisper/2759)
Dmitry Atamanov [Thu, 27 Feb 2025 07:06:54 +0000 (12:06 +0500)]
examples : use miniaudio for direct decoding flac, mp3, ogg and wav (whisper/2759)

4 months agocmake : fix compile assumptions for power9/etc (whisper/2777)
midnight [Wed, 5 Feb 2025 12:41:10 +0000 (04:41 -0800)]
cmake : fix compile assumptions for power9/etc (whisper/2777)

* Add small comment re: VSX to readme

Co-authored-by: midnight <redacted>
4 months agoTold cmake to install ggml-cpp.h as a public header file. (#1126)
petterreinholdtsen [Wed, 26 Feb 2025 20:44:00 +0000 (21:44 +0100)]
Told cmake to install ggml-cpp.h as a public header file. (#1126)

It is used by the Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <redacted>
4 months agoSupport pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (#1121)
cmdr2 [Tue, 25 Feb 2025 12:36:34 +0000 (18:06 +0530)]
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (#1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div

4 months agosync : llama.cpp
Georgi Gerganov [Tue, 25 Feb 2025 09:44:48 +0000 (11:44 +0200)]
sync : llama.cpp

ggml-ci

4 months agometal : copy kernels for quant to F32/F16 conversions (llama/12017)
Gian-Carlo Pascutto [Tue, 25 Feb 2025 09:27:58 +0000 (10:27 +0100)]
metal : copy kernels for quant to F32/F16 conversions (llama/12017)

metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <redacted>
4 months agoopencl: fix for small models (llama/11950)
lhez [Mon, 24 Feb 2025 21:47:07 +0000 (13:47 -0800)]
opencl: fix for small models (llama/11950)

* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <redacted>
Co-authored-by: Skyler Szot <redacted>
4 months agoOptimize mul_mat for Q4_0 on Intel GPU (llama/12035)
Neo Zhang Jianyu [Mon, 24 Feb 2025 14:33:23 +0000 (22:33 +0800)]
Optimize mul_mat for Q4_0 on Intel GPU (llama/12035)

* optimize performance by reordering data for Intel GPU

* detect hw type, save the opt feature, and print it

* correct name

* support optimizing the graph once when computing the graph; record the opt status in tensor->extra; make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debugging

* use syclex::architecture to replace the custom hw define; update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* move getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <redacted>
4 months agoSYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)
Akarshan Biswas [Mon, 24 Feb 2025 10:18:25 +0000 (15:48 +0530)]
SYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)

4 months agoggml-cpu: Support s390x SIMD Instruction Set (llama/12019)
Aaron Teo [Sat, 22 Feb 2025 21:39:24 +0000 (05:39 +0800)]
ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)

* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <redacted>
* ggml: add SIMD for s390x using vector intrinsics (see the sketch at the end of this entry)

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <redacted>
* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <redacted>
* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <redacted>
* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <redacted>
* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <redacted>
* ggml: remove test.py

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <redacted>
* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <redacted>
* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <redacted>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <redacted>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <redacted>
* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <redacted>
* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <redacted>
* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <redacted>
* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <redacted>
* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <redacted>
* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <redacted>
* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <redacted>
* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <redacted>
* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <redacted>
* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <redacted>
* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <redacted>
* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Jinyang He <redacted>
Co-authored-by: junchao-zhao <redacted>
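
A flavor of the kind of code this series adds, as a hedged sketch using the z/Architecture vector intrinsics from <vecintrin.h> (compile with -mzvector on s390x; not the actual ggml kernels):

```cpp
#include <vecintrin.h>

// 4-wide f32 dot product on s390x VXE: unaligned vector loads + fused multiply-add
float vec_dot_f32_sketch(const float * a, const float * b, int n) {
    __vector float acc = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vec_madd(vec_xl(0, a + i), vec_xl(0, b + i), acc);
    }
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i) {
        sum += a[i] * b[i];  // scalar tail
    }
    return sum;
}
```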