git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
10 months ago fix the mul_mat_id ut issues (llama/8427)
Chen Xi [Fri, 12 Jul 2024 00:52:04 +0000 (00:52 +0000)]
fix the mul_mat_id ut issues (llama/8427)

* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <redacted>
---------

Signed-off-by: Chen Xi <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Chen Xi <redacted>
10 months ago ggml : add NVPL BLAS support (ggml/8329) (llama/8425)
Nicholai Tukanov [Thu, 11 Jul 2024 16:49:15 +0000 (11:49 -0500)]
ggml : add NVPL BLAS support (ggml/8329) (llama/8425)

* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <redacted>
10 months ago cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
Daniel Bevenius [Thu, 11 Jul 2024 15:53:42 +0000 (17:53 +0200)]
cuda : suppress 'noreturn' warn in no_device_code (llama/8414)

* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <redacted>
* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.
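
For context, a minimal sketch of the pattern with stand-in names (not the actual common.cuh code): a function declared noreturn must provably never return, so ending its body with a call the compiler knows cannot fall through, such as an abort-style trap, silences -Winvalid-noreturn.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the CUDA/HIP __trap() intrinsic.
#define EXAMPLE_TRAP() abort()

[[noreturn]] static void no_device_code_example(const char * fn) {
    std::fprintf(stderr, "%s: no device code available\n", fn);
    // abort() is itself noreturn, so control cannot fall off the end
    // of the function and the 'noreturn' contract is satisfied.
    EXAMPLE_TRAP();
}

int main() {
    no_device_code_example("example_kernel");
}
```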

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
10 months ago CUDA: optimize and refactor MMQ (llama/8416)
Johannes Gäßler [Thu, 11 Jul 2024 14:47:47 +0000 (16:47 +0200)]
CUDA: optimize and refactor MMQ (llama/8416)

* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation

10 months ago Use multi_ptr to clean up deprecated warnings (llama/8256)
AidanBeltonS [Wed, 10 Jul 2024 15:10:49 +0000 (16:10 +0100)]
Use multi_ptr to clean up deprecated warnings (llama/8256)

10 months ago ggml : move sgemm sources to llamafile subfolder (llama/8394)
Georgi Gerganov [Wed, 10 Jul 2024 12:23:29 +0000 (15:23 +0300)]
ggml : move sgemm sources to llamafile subfolder (llama/8394)

ggml-ci

10 months ago ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)
Dibakar Gope [Wed, 10 Jul 2024 12:14:51 +0000 (07:14 -0500)]
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)

* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
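
For readers unfamiliar with the kernels listed above, here is a simplified scalar GEMV in float (illustrative only; the commit's reference and assembly kernels operate on quantized q4_0/q8_0 blocks rather than raw floats):

```cpp
#include <cstdio>

// y = A * x for a row-major rows x cols matrix. The quantized kernels
// compute the same per-row reduction, but over dequantized blocks.
static void gemv_ref(const float * A, const float * x, float * y,
                     int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {
            sum += A[r * cols + c] * x[c];
        }
        y[r] = sum;
    }
}

int main() {
    const float A[2 * 3] = {1, 2, 3, 4, 5, 6};
    const float x[3]     = {1, 0, 1};
    float y[2];
    gemv_ref(A, x, y, 2, 3);
    std::printf("%g %g\n", y[0], y[1]); // prints: 4 10
}
```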

10 months ago sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)
Alberto Cabrera Pérez [Tue, 9 Jul 2024 14:03:15 +0000 (15:03 +0100)]
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)

* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment

10 months ago sycl : fix powf call in device code (llama/8368)
Alberto Cabrera Pérez [Mon, 8 Jul 2024 13:22:41 +0000 (14:22 +0100)]
sycl : fix powf call in device code (llama/8368)

10 months ago ggml : loop tiling optimizations for scalar path (ggml/898)
Mahesh Madhav [Thu, 25 Jul 2024 07:54:08 +0000 (00:54 -0700)]
ggml : loop tiling optimizations for scalar path (ggml/898)

Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
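
A minimal sketch of the technique (assuming a tile width of 4; not the actual ggml code): keeping several independent accumulators live in registers shortens the dependency chain and gives the compiler room to unroll and pipeline.

```cpp
#include <cstddef>
#include <cstdio>

static float dot_tiled(const float * a, const float * b, size_t n) {
    const size_t TILE = 4;               // assumed tile width
    float acc[TILE] = {};                // independent accumulators
    size_t i = 0;
    for (; i + TILE <= n; i += TILE) {   // tiled main loop
        for (size_t t = 0; t < TILE; ++t) {
            acc[t] += a[i + t] * b[i + t];
        }
    }
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i) {                 // scalar tail
        sum += a[i] * b[i];
    }
    return sum;
}

int main() {
    const float a[5] = {1, 2, 3, 4, 5};
    const float b[5] = {1, 1, 1, 1, 1};
    std::printf("%g\n", dot_tiled(a, b, 5)); // prints: 15
}
```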

10 months ago ggml: add support for float16 input tensors in pooling operations (ggml/895)
Ivan Filipov [Mon, 22 Jul 2024 11:32:02 +0000 (14:32 +0300)]
ggml: add support for float16 input tensors in pooling operations (ggml/895)

* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <redacted>
10 months ago vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
Tony Wasserka [Sat, 20 Jul 2024 18:49:44 +0000 (20:49 +0200)]
vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)

This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.
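
A minimal sketch of the pattern with stand-in types (the real struct holds Vulkan handles and frees them with vkDestroyBuffer/vkFreeMemory):

```cpp
// Stand-ins so the sketch builds without the Vulkan headers.
using ExampleHandle = void *;
constexpr ExampleHandle EXAMPLE_NULL_HANDLE = nullptr;

struct vk_buffer_example {
    // Default member initializers: even a partially initialized object
    // (e.g. when a later allocation fails) holds only null handles.
    ExampleHandle buffer = EXAMPLE_NULL_HANDLE;
    ExampleHandle memory = EXAMPLE_NULL_HANDLE;

    ~vk_buffer_example() {
        // Teardown can now safely skip handles that were never created.
        if (buffer != EXAMPLE_NULL_HANDLE) { /* vkDestroyBuffer(...) */ }
        if (memory != EXAMPLE_NULL_HANDLE) { /* vkFreeMemory(...) */ }
    }
};

int main() {
    vk_buffer_example buf; // destructor is a safe no-op: nothing to free
}
```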

Co-authored-by: Tony Wasserka <redacted>
10 months ago cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)
Borislav Stanimirov [Fri, 12 Jul 2024 14:24:20 +0000 (17:24 +0300)]
cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)

10 months ago scripts : sync new files (#0)
Georgi Gerganov [Thu, 8 Aug 2024 11:00:51 +0000 (14:00 +0300)]
scripts : sync new files (#0)

10 months ago cmake : fix compile in xcode (#2311)
Daven Sanassy [Mon, 5 Aug 2024 06:48:26 +0000 (07:48 +0100)]
cmake : fix compile in xcode (#2311)

11 months ago whisper : handle empty mel (#2324)
Georgi Gerganov [Sat, 27 Jul 2024 17:35:04 +0000 (20:35 +0300)]
whisper : handle empty mel (#2324)

11 months ago whisper : use vulkan as gpu backend when available (#2302)
Matt Stephenson [Tue, 16 Jul 2024 07:21:09 +0000 (03:21 -0400)]
whisper : use vulkan as gpu backend when available (#2302)

* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <redacted>
* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <redacted>
---------

Signed-off-by: Matt Stephenson <redacted>
11 months ago whisper : fix DTW assert (#2299)
arizhih [Mon, 15 Jul 2024 12:50:36 +0000 (14:50 +0200)]
whisper : fix DTW assert (#2299)

11 months ago cmake : use WHISPER_EXTRA_FLAGS (#2294)
Georgi Gerganov [Tue, 9 Jul 2024 15:54:18 +0000 (18:54 +0300)]
cmake : use WHISPER_EXTRA_FLAGS (#2294)

11 months ago cmake : allow external ggml
Borislav Stanimirov [Mon, 8 Jul 2024 14:08:55 +0000 (17:08 +0300)]
cmake : allow external ggml

11 months ago cmake : try to fix openvino build (#2281)
Georgi Gerganov [Mon, 8 Jul 2024 12:36:51 +0000 (15:36 +0300)]
cmake : try to fix openvino build (#2281)

11 months ago cmake : remove install of llama convert script [no ci] (#2266)
Georgi Gerganov [Mon, 8 Jul 2024 11:21:04 +0000 (14:21 +0300)]
cmake : remove install of llama convert script [no ci] (#2266)

11 months ago make : remove llama prints [no ci] (#2265)
Georgi Gerganov [Mon, 8 Jul 2024 11:19:36 +0000 (14:19 +0300)]
make : remove llama prints [no ci] (#2265)

11 months ago talk-llama : sync llama.cpp
Georgi Gerganov [Mon, 8 Jul 2024 11:14:17 +0000 (14:14 +0300)]
talk-llama : sync llama.cpp

11 months ago examples : fix compile warnings [no ci] (#0)
Georgi Gerganov [Mon, 8 Jul 2024 11:09:09 +0000 (14:09 +0300)]
examples : fix compile warnings [no ci] (#0)

11 months ago sync : ggml
Georgi Gerganov [Mon, 8 Jul 2024 10:50:28 +0000 (13:50 +0300)]
sync : ggml

11 months ago ggml : sync sycl (skip) (#0)
Georgi Gerganov [Mon, 8 Jul 2024 10:50:14 +0000 (13:50 +0300)]
ggml : sync sycl (skip) (#0)

11 months ago scripts : fix sync scripts
Georgi Gerganov [Mon, 8 Jul 2024 10:48:14 +0000 (13:48 +0300)]
scripts : fix sync scripts

11 months ago ggml : remove unnecessary UNUSED macro call (ggml/880)
Daniel Bevenius [Mon, 8 Jul 2024 10:03:42 +0000 (12:03 +0200)]
ggml : remove unnecessary UNUSED macro call (ggml/880)

This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.
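
For context, the usual shape of such a macro (a sketch; ggml's exact definition may differ) and why the call becomes redundant once the variable is referenced:

```cpp
// Casting to void silences -Wunused-variable / -Wunused-parameter.
#define UNUSED(x) (void)(x)

static int scale(int n0, int factor) {
    // UNUSED(n0);      // only needed if n0 were otherwise unreferenced
    return n0 * factor; // n0 is used, so the macro call would be redundant
}

int main() {
    return scale(2, 3) == 6 ? 0 : 1;
}
```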

Signed-off-by: Daniel Bevenius <redacted>
11 months ago cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281)
Natsu [Fri, 5 Jul 2024 14:29:35 +0000 (22:29 +0800)]
cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281)

11 months ago Enabled more data types for oneMKL gemm_batch (llama/8236)
Ouadie EL FAROUKI [Fri, 5 Jul 2024 12:23:25 +0000 (13:23 +0100)]
Enabled more data types for oneMKL gemm_batch (llama/8236)

11 months ago CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278)
Johannes Gäßler [Fri, 5 Jul 2024 07:06:31 +0000 (09:06 +0200)]
CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278)

11 months ago CUDA: revert part of the RDNA1 optimizations (llama/8309)
Daniele [Fri, 5 Jul 2024 07:06:09 +0000 (07:06 +0000)]
CUDA: revert part of the RDNA1 optimizations (llama/8309)

The change to launch_bounds was causing a small performance drop of 25 t/s in perplexity tests
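
For context, launch bounds are a per-kernel compile-time occupancy hint. A hedged sketch (requires nvcc or hipcc; the kernel and values are illustrative, not the ones the commit touched):

```cpp
#include <cuda_runtime.h>

// __launch_bounds__(max_threads_per_block, min_blocks_per_multiprocessor)
// caps register use per thread so more blocks can be resident at once.
// An overly aggressive cap causes register spills and costs throughput
// on some architectures, which is what this revert addresses for RDNA1.
__global__ void __launch_bounds__(256, 2) scale_kernel(float * x, float s, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= s;
    }
}

int main() {
    float * x = nullptr;
    cudaMalloc(&x, 256 * sizeof(float));
    scale_kernel<<<1, 256>>>(x, 2.0f, 256);
    cudaDeviceSynchronize();
    cudaFree(x);
}
```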

11 months ago CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311)
Johannes Gäßler [Fri, 5 Jul 2024 07:05:34 +0000 (09:05 +0200)]
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311)

11 months ago Fix WARP_SIZE=16 bug of Intel GPU (llama/8266)
luoyu-intel [Fri, 5 Jul 2024 05:06:13 +0000 (05:06 +0000)]
Fix WARP_SIZE=16 bug of Intel GPU (llama/8266)

* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp

11 months ago rm get_work_group_size() by local cache for performance (llama/8286)
Neo Zhang Jianyu [Fri, 5 Jul 2024 02:32:29 +0000 (10:32 +0800)]
rm get_work_group_size() by local cache for performance (llama/8286)

Co-authored-by: arthw <redacted>
11 months ago Define and optimize RDNA1 (llama/8085)
Daniele [Wed, 3 Jul 2024 23:02:58 +0000 (23:02 +0000)]
Define and optimize RDNA1 (llama/8085)

11 months ago fix typo (llama/8267)
Judd [Wed, 3 Jul 2024 12:40:16 +0000 (20:40 +0800)]
fix typo (llama/8267)

Co-authored-by: Judd <redacted>
11 months ago Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (llama/8258)
Clint Herron [Tue, 2 Jul 2024 16:18:10 +0000 (12:18 -0400)]
Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (llama/8258)

11 months ago cuda : update supports_op for matrix multiplication (llama/8245)
slaren [Tue, 2 Jul 2024 06:39:38 +0000 (08:39 +0200)]
cuda : update supports_op for matrix multiplication (llama/8245)

11 months ago Fix win build conflict of math library (llama/8230)
luoyu-intel [Tue, 2 Jul 2024 04:50:07 +0000 (04:50 +0000)]
Fix win build conflict of math library (llama/8230)

* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16

11 months ago Fix the sub group size of Intel (llama/8106)
luoyu-intel [Tue, 2 Jul 2024 02:16:00 +0000 (02:16 +0000)]
Fix the sub group size of Intel (llama/8106)

* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size

11 months ago CUDA: refactor and optimize IQ MMVQ (llama/8215)
Johannes Gäßler [Mon, 1 Jul 2024 18:39:06 +0000 (20:39 +0200)]
CUDA: refactor and optimize IQ MMVQ (llama/8215)

* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix

11 months ago Update SYCL-Rope op and Refactor (llama/8157)
zhentaoyu [Mon, 1 Jul 2024 11:39:06 +0000 (19:39 +0800)]
Update SYCL-Rope op and Refactor (llama/8157)

* align with rope.cu and move sycl-op to a single file

11 months ago CUDA: fix MMQ stream-k for --split-mode row (llama/8167)
Johannes Gäßler [Thu, 27 Jun 2024 14:26:05 +0000 (16:26 +0200)]
CUDA: fix MMQ stream-k for --split-mode row (llama/8167)

11 months ago feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)
John Balis [Tue, 2 Jul 2024 16:09:52 +0000 (11:09 -0500)]
feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)

* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <redacted>
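
For readers new to the op, a scalar reference for single-channel 1D transposed convolution (a sketch of the standard definition; the commit's CUDA kernel additionally handles channels and is validated in test-backend-ops):

```cpp
#include <cstdio>

// Each input element scatters a scaled copy of the kernel into the output:
// out[i*stride + j] += in[i] * k[j], with n_out = (n_in - 1)*stride + n_k.
static void conv_transpose_1d_ref(const float * in, int n_in,
                                  const float * k,  int n_k,
                                  float * out, int stride) {
    const int n_out = (n_in - 1) * stride + n_k;
    for (int o = 0; o < n_out; ++o) {
        out[o] = 0.0f;
    }
    for (int i = 0; i < n_in; ++i) {
        for (int j = 0; j < n_k; ++j) {
            out[i * stride + j] += in[i] * k[j];
        }
    }
}

int main() {
    const float in[2] = {1, 2};
    const float k[3]  = {1, 1, 1};
    float out[4]; // (2 - 1)*1 + 3 = 4 outputs
    conv_transpose_1d_ref(in, 2, k, 3, out, 1);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // 1 3 3 2
}
```
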
11 months ago ci : disable java build
Georgi Gerganov [Mon, 8 Jul 2024 11:26:59 +0000 (14:26 +0300)]
ci : disable java build

11 months ago server : add inference path to make OAI API compatible (#2270)
Emmanuel Schmidbauer [Mon, 8 Jul 2024 11:24:58 +0000 (07:24 -0400)]
server : add inference path to make OAI API compatible (#2270)

12 months ago sync : ggml + fix sync script
Georgi Gerganov [Wed, 26 Jun 2024 20:20:19 +0000 (23:20 +0300)]
sync : ggml + fix sync script

12 months ago make : disable CUDA graphs
Georgi Gerganov [Wed, 26 Jun 2024 20:20:13 +0000 (23:20 +0300)]
make : disable CUDA graphs

12 months ago ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)
slaren [Wed, 26 Jun 2024 19:34:14 +0000 (21:34 +0200)]
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)

12 months ago make : disable CUDA mel build
Georgi Gerganov [Wed, 26 Jun 2024 19:25:25 +0000 (22:25 +0300)]
make : disable CUDA mel build

12 months ago cmake : minor fixes
Georgi Gerganov [Wed, 26 Jun 2024 18:42:39 +0000 (21:42 +0300)]
cmake : minor fixes

12 months ago make : fix missing -O3
Georgi Gerganov [Wed, 26 Jun 2024 18:20:45 +0000 (21:20 +0300)]
make : fix missing -O3

same as https://github.com/ggerganov/llama.cpp/pull/8143

12 months ago whisper : disable CUDA mel + fix FFMPEG
Georgi Gerganov [Wed, 26 Jun 2024 17:11:38 +0000 (20:11 +0300)]
whisper : disable CUDA mel + fix FFMPEG

12 months ago sync : ggml
Georgi Gerganov [Wed, 26 Jun 2024 16:40:23 +0000 (19:40 +0300)]
sync : ggml

12 months ago whisper : reorganize source code + improve CMake (#2256)
Georgi Gerganov [Wed, 26 Jun 2024 16:34:09 +0000 (19:34 +0300)]
whisper : reorganize source code + improve CMake (#2256)

* scripts : update sync [no ci]

* files : reorganize [no ci]

* sync : llama.cpp

* cmake : link math library

* cmake : build normal ggml library

* files : move headers to include

* objc : fix path to ggml-metal.h

* ci : fix WHISPER_CUDA -> GGML_CUDA

* scripts : sync LICENSE [no ci]

12 months ago whisper : optimize fft() function (#2242)
mky_coder [Tue, 18 Jun 2024 15:10:33 +0000 (23:10 +0800)]
whisper : optimize fft() function (#2242)

Co-authored-by: Mike Fan <redacted>
12 months ago talk-llama : sync llama.cpp
Georgi Gerganov [Tue, 18 Jun 2024 06:45:37 +0000 (09:45 +0300)]
talk-llama : sync llama.cpp

12 months ago whisper : use ggml_backend_sched (#2239)
Georgi Gerganov [Tue, 18 Jun 2024 06:37:20 +0000 (09:37 +0300)]
whisper : use ggml_backend_sched (#2239)

* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <redacted>
12 months ago fix : remove extra files
Georgi Gerganov [Sun, 16 Jun 2024 16:23:55 +0000 (19:23 +0300)]
fix : remove extra files

12 months ago scripts : sync ggml-blas
Georgi Gerganov [Sun, 16 Jun 2024 16:23:32 +0000 (19:23 +0300)]
scripts : sync ggml-blas

12 months ago build : update make / cmake
Georgi Gerganov [Sun, 16 Jun 2024 16:10:20 +0000 (19:10 +0300)]
build : update make / cmake

12 months ago sync : ggml
Georgi Gerganov [Sun, 16 Jun 2024 15:40:07 +0000 (18:40 +0300)]
sync : ggml

12 months ago move BLAS to a separate backend (cont) (llama/6210)
slaren [Sun, 16 Jun 2024 10:57:37 +0000 (13:57 +0300)]
move BLAS to a separate backend (cont) (llama/6210)

ggml-ci

12 months ago Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
0cc4m [Sun, 16 Jun 2024 05:17:31 +0000 (07:17 +0200)]
Vulkan Shader Refactor, Memory Debugging Option (llama/7947)

* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory

* Improve debug log code

* Add memory debug output option

* Fix flake8

* Fix unnecessarily high llama-3 VRAM use

12 months ago scripts : stop sync whisper example from ggml
Georgi Gerganov [Sun, 16 Jun 2024 15:38:46 +0000 (18:38 +0300)]
scripts : stop sync whisper example from ggml

12 months ago cmake : fix sycl build (#0)
Georgi Gerganov [Sun, 16 Jun 2024 14:57:35 +0000 (17:57 +0300)]
cmake : fix sycl build (#0)

12 months ago ggml : remove OpenCL (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:46:12 +0000 (13:46 +0300)]
ggml : remove OpenCL (#0)

12 months ago sycl : sync (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:24:17 +0000 (13:24 +0300)]
sycl : sync (#0)

12 months ago cuda : enable CUDA graphs (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:20:19 +0000 (13:20 +0300)]
cuda : enable CUDA graphs (#0)

12 months ago talk-llama : sync llama.cpp
Georgi Gerganov [Sun, 16 Jun 2024 10:10:54 +0000 (13:10 +0300)]
talk-llama : sync llama.cpp

12 months ago cmake : fix CUDA build (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:07:43 +0000 (13:07 +0300)]
cmake : fix CUDA build (#0)

12 months ago sync : ggml
Georgi Gerganov [Sun, 16 Jun 2024 09:43:14 +0000 (12:43 +0300)]
sync : ggml

ggml-ci

12 months ago ggml : fix and optimize ppc64le (ggml/849)
Hong Bo PENG [Sun, 16 Jun 2024 08:53:11 +0000 (16:53 +0800)]
ggml : fix and optimize ppc64le (ggml/849)

* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le

12 months ago ggml : remove duplicate include of ggml-common.h (ggml/853)
Daniel Bevenius [Sun, 16 Jun 2024 08:51:18 +0000 (10:51 +0200)]
ggml : remove duplicate include of ggml-common.h (ggml/853)

Signed-off-by: Daniel Bevenius <redacted>
12 months ago remove global variables (llama/7710)
Meng, Hengyu [Sat, 15 Jun 2024 06:05:10 +0000 (14:05 +0800)]
remove global variables (llama/7710)

* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related

12 months ago CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
Johannes Gäßler [Fri, 14 Jun 2024 16:41:49 +0000 (18:41 +0200)]
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores

* try CI fix

* try CI fix

* try CI fix

* fix data race

* revert q2_K precision related changes

12 months ago metal : utilize max shared memory for mul_mat_id (llama/7935)
Georgi Gerganov [Fri, 14 Jun 2024 14:14:09 +0000 (17:14 +0300)]
metal : utilize max shared memory for mul_mat_id (llama/7935)

12 months ago rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
Radoslav Gerganov [Thu, 13 Jun 2024 12:18:44 +0000 (15:18 +0300)]
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)

12 months ago move BLAS to a separate backend (llama/6210)
slaren [Thu, 13 Jun 2024 01:11:35 +0000 (03:11 +0200)]
move BLAS to a separate backend (llama/6210)

* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type is used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <redacted>
12 months ago CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
Johannes Gäßler [Wed, 12 Jun 2024 15:41:51 +0000 (17:41 +0200)]
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)

12 months ago tests : add non-cont unary tests (llama/7857)
Georgi Gerganov [Wed, 12 Jun 2024 13:00:22 +0000 (16:00 +0300)]
tests : add non-cont unary tests (llama/7857)

* tests : add non-cont unary tests

* ggml : update unary asserts and "supports_op"

ggml-ci

12 months ago ggml : improve ggml_is_contiguous logic (llama/7856)
Georgi Gerganov [Wed, 12 Jun 2024 12:24:20 +0000 (15:24 +0300)]
ggml : improve ggml_is_contiguous logic (llama/7856)

* ggml : improve ggml_is_contiguous logic

ggml-ci

* ggml : support more contiguous cases

ggml-ci
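
For context, a simplified sketch of what contiguity means for a ggml-style tensor (illustrative and reduced to two dimensions; the real ggml_is_contiguous covers more cases):

```cpp
#include <cstddef>
#include <cstdint>

// ne[i] = number of elements in dimension i,
// nb[i] = stride in bytes for dimension i.
struct tensor2d_example {
    int64_t ne[2];
    size_t  nb[2];
    size_t  type_size;
};

// Contiguous means each stride equals the size of everything below it:
// no gaps and no permuted dimensions.
static bool is_contiguous(const tensor2d_example & t) {
    return t.nb[0] == t.type_size &&
           t.nb[1] == t.nb[0] * (size_t) t.ne[0];
}

int main() {
    const tensor2d_example t = {{4, 3}, {4, 16}, 4}; // packed 4x3 f32 tensor
    return is_contiguous(t) ? 0 : 1;
}
```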

12 months ago vulkan: select only one device for single gpu with multiple drivers (llama/7582)
k.h.lai [Tue, 11 Jun 2024 19:26:05 +0000 (03:26 +0800)]
vulkan: select only one device for single gpu with multiple drivers (llama/7582)

12 months ago Update Vulkan RoPE implementation (llama/7818)
0cc4m [Tue, 11 Jun 2024 19:20:29 +0000 (21:20 +0200)]
Update Vulkan RoPE implementation (llama/7818)

* Update Vulkan RoPE implementation

* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception

Minor fixes

* Fix segfault when running out of VRAM

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
12 months ago CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
Johannes Gäßler [Tue, 11 Jun 2024 06:26:07 +0000 (08:26 +0200)]
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)

12 months ago CUDA: use tensor cores for MMQ (llama/7676)
Johannes Gäßler [Mon, 10 Jun 2024 09:45:13 +0000 (11:45 +0200)]
CUDA: use tensor cores for MMQ (llama/7676)

* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early

12 months ago use the correct SYCL context for host USM allocations (llama/7777)
Ben Ashbaugh [Mon, 10 Jun 2024 09:21:31 +0000 (02:21 -0700)]
use the correct SYCL context for host USM allocations (llama/7777)

Signed-off-by: Ben Ashbaugh <redacted>
12 months ago CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)
Johannes Gäßler [Sun, 9 Jun 2024 07:42:25 +0000 (09:42 +0200)]
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)

12 months ago vulkan : reuse parent extra for views (llama/7806)
slaren [Fri, 7 Jun 2024 17:47:49 +0000 (19:47 +0200)]
vulkan : reuse parent extra for views (llama/7806)

* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <redacted>
12 months ago fix softmax r2r result wrong issue (llama/7811)
pengxin99 [Fri, 7 Jun 2024 06:28:26 +0000 (14:28 +0800)]
fix softmax r2r result wrong issue (llama/7811)

12 months ago CUDA: refactor mmq, dmmv, mmvq (llama/7716)
Johannes Gäßler [Wed, 5 Jun 2024 14:53:00 +0000 (16:53 +0200)]
CUDA: refactor mmq, dmmv, mmvq (llama/7716)

* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits

12 months ago ggml : refactor rope norm/neox (llama/7634)
Georgi Gerganov [Wed, 5 Jun 2024 08:29:20 +0000 (11:29 +0300)]
ggml : refactor rope norm/neox (llama/7634)

* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci

12 months ago Allow number of nodes in CUDA graph to change (llama/7738)
agray3 [Tue, 4 Jun 2024 20:06:49 +0000 (21:06 +0100)]
Allow number of nodes in CUDA graph to change (llama/7738)

Previously the code would have failed to cope with the number of nodes
changing in an existing CUDA graph. This fixes the issue by removing an
unnecessary conditional.
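
A hedged sketch of the surrounding pattern (not the actual ggml-cuda logic; the no-op kernel stands in for the model's kernels): when a re-capture changes the graph topology, the cached executable graph is rebuilt instead of reused.

```cpp
#include <cuda_runtime.h>

__global__ void noop_kernel() {} // stand-in for the real workload

static void run_with_graph(cudaStream_t stream, cudaGraphExec_t & graph_exec,
                           size_t & prev_n_nodes) {
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    noop_kernel<<<1, 1, 0, stream>>>(); // node count may differ per capture
    cudaStreamEndCapture(stream, &graph);

    size_t n_nodes = 0;
    cudaGraphGetNodes(graph, nullptr, &n_nodes); // query node count only
    if (graph_exec == nullptr || n_nodes != prev_n_nodes) {
        if (graph_exec != nullptr) {
            cudaGraphExecDestroy(graph_exec); // topology changed: rebuild
        }
        cudaGraphInstantiateWithFlags(&graph_exec, graph, 0); // CUDA >= 11.4
        prev_n_nodes = n_nodes;
    }
    cudaGraphLaunch(graph_exec, stream);
    cudaGraphDestroy(graph);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphExec_t exec = nullptr;
    size_t prev_n_nodes = 0;
    run_with_graph(stream, exec, prev_n_nodes);
    run_with_graph(stream, exec, prev_n_nodes); // same topology: exec reused
    cudaStreamSynchronize(stream);
}
```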

12 months ago ggml : remove OpenCL (llama/7735)
Georgi Gerganov [Tue, 4 Jun 2024 18:23:20 +0000 (21:23 +0300)]
ggml : remove OpenCL (llama/7735)

ggml-ci

12 months ago ggml : prevent builds with -ffinite-math-only (llama/7726)
Georgi Gerganov [Tue, 4 Jun 2024 07:01:09 +0000 (10:01 +0300)]
ggml : prevent builds with -ffinite-math-only (llama/7726)

This enforces a check that -fno-finite-math-only was set and that the compiler
is not in finite-math mode. This is because, during the rewrite of silu and
softmax for CPU (#7154), an issue emerged where the result observed with >1
slot was nondeterministic, as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to cause SiLU, instead of flushing small values to 0, to return NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
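
A minimal sketch of such a compile-time guard (GCC and clang predefine __FINITE_MATH_ONLY__ as 1 under -ffinite-math-only; the message text is illustrative):

```cpp
#if defined(__FINITE_MATH_ONLY__) && (__FINITE_MATH_ONLY__ == 1)
#error "compiler assumes no NaN/inf, which breaks SiLU/softmax; build with -fno-finite-math-only"
#endif

int main() {} // builds normally; fails at compile time under -ffinite-math-only
```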

12 months ago llama : offload to RPC in addition to other backends (llama/7640)
Radoslav Gerganov [Mon, 3 Jun 2024 17:03:26 +0000 (20:03 +0300)]
llama : offload to RPC in addition to other backends (llama/7640)

* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
12 months ago ggml : use OpenMP as a thread pool (llama/7606)
Masaya, Kato [Mon, 3 Jun 2024 15:14:15 +0000 (00:14 +0900)]
ggml : use OpenMP as a thread pool (llama/7606)

* ggml: Added OpenMP for multi-threads processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
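
A minimal sketch of the approach (illustrative, not the actual ggml code; build with -fopenmp): a single parallel region acts as the thread pool, and the shared n_threads value is updated from inside the region, as in the "update shared state n_threads in parallel region" item above.

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    int n_threads = 4; // requested; the runtime may grant fewer
    #pragma omp parallel num_threads(n_threads)
    {
        #pragma omp single
        n_threads = omp_get_num_threads(); // shared state updated in-region
        // implicit barrier after 'single': every worker sees the final value
        const int ith = omp_get_thread_num();
        std::printf("worker %d of %d\n", ith, n_threads);
    }
}
```
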
12 months ago Vulkan Mixture of Experts (MoE) support (llama/7628)
0cc4m [Mon, 3 Jun 2024 08:59:14 +0000 (10:59 +0200)]
Vulkan Mixture of Experts (MoE) support (llama/7628)

* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU