git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
Johannes Gäßler [Sat, 20 Jul 2024 20:25:26 +0000 (22:25 +0200)]
CUDA: MMQ code deduplication + iquant support (llama/8495)
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
Georgi Gerganov [Sat, 20 Jul 2024 14:15:42 +0000 (17:15 +0300)]
gguf : handle null name during init (llama/8587)
slaren [Fri, 19 Jul 2024 15:17:27 +0000 (17:17 +0200)]
ggml : fix quant dot product with odd number of blocks (llama/8549)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix odd blocks for ARM_NEON (llama/8556)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix q4_1
* ggml : fix q5_0
* ggml : fix q5_1
* ggml : fix iq4_nl metal
ggml-ci
* ggml : fix q4_0
* ggml : fix q8_0
ggml-ci
* ggml : remove special Q4_0 code for first 2 blocks
* ggml : fix sumf redefinition
---------
Co-authored-by: slaren <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
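For illustration, a minimal C++ sketch of the pattern behind these fixes, using a hypothetical block format (not an actual ggml quant type): a dot product that consumes blocks two at a time needs a scalar tail to handle the last block when the block count is odd.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical quantized block: 16 int8 weights sharing one float scale.
struct BlockQ {
    float  scale;
    int8_t q[16];
};

// Dot product between quantized blocks and float activations.
// The main loop consumes blocks in pairs (mimicking a 2-block SIMD path);
// the tail handles the last block when nblocks is odd.
static float dot_q(const BlockQ * x, const float * y, size_t nblocks) {
    float sumf = 0.0f;

    size_t ib = 0;
    for (; ib + 1 < nblocks; ib += 2) {            // paired "vector" path
        for (int pair = 0; pair < 2; ++pair) {
            const BlockQ & b = x[ib + pair];
            float s = 0.0f;
            for (int j = 0; j < 16; ++j) {
                s += b.q[j] * y[(ib + pair)*16 + j];
            }
            sumf += b.scale * s;
        }
    }
    for (; ib < nblocks; ++ib) {                   // scalar tail for the odd block
        const BlockQ & b = x[ib];
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) {
            s += b.q[j] * y[ib*16 + j];
        }
        sumf += b.scale * s;
    }
    return sumf;
}
```

Without the tail loop, the final block would simply be skipped whenever the block count is odd.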
Clint Herron [Fri, 19 Jul 2024 11:05:45 +0000 (07:05 -0400)]
ggml : add friendlier error message to fopen errors (llama/8575)
* Add additional error information when model files fail to load.
* Adding additional error information to most instances of fopen.
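A rough sketch of the kind of diagnostic this adds (the exact wording and call sites in the sources may differ): reporting the path and strerror(errno) turns a bare "failed to load model" into an actionable message.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>

// Open a model file and, on failure, report which file failed and why,
// rather than just "failed to load model".
static FILE * open_model(const char * path) {
    FILE * f = std::fopen(path, "rb");
    if (f == nullptr) {
        std::fprintf(stderr, "failed to open '%s': %s\n", path, std::strerror(errno));
    }
    return f;
}

int main() {
    FILE * f = open_model("does-not-exist.gguf");
    if (f) {
        std::fclose(f);
    }
    return 0;
}
```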
Johannes Gäßler [Thu, 18 Jul 2024 21:48:47 +0000 (23:48 +0200)]
CUDA: fix partial offloading for ne0 % 256 != 0 (llama/8572)
65a [Thu, 18 Jul 2024 14:47:12 +0000 (07:47 -0700)]
cmake : install all ggml public headers (llama/8480)
Co-authored-by: 65a <redacted>
hipudding [Wed, 17 Jul 2024 11:23:50 +0000 (19:23 +0800)]
Add Ascend NPU backend (llama/6035)
* [CANN] Add Ascend NPU backend
Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.
CANN (Compute Architecture for Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI.
Co-authored-by: wangshuai09 <redacted>
* delete trailing whitespaces
* Modify the code based on review comment
* Rename LLAMA_CANN to GGML_CANN
* Make ggml-common.h private
* add ggml_cann prefix for acl funcs
* Add logging for CANN backend
* Delete Trailing whitespace
---------
Co-authored-by: wangshuai09 <redacted>
Johannes Gäßler [Tue, 16 Jul 2024 19:20:59 +0000 (21:20 +0200)]
make/cmake: add missing force MMQ/cuBLAS for HIP (llama/8515)
Xuan Son Nguyen [Mon, 15 Jul 2024 18:50:47 +0000 (20:50 +0200)]
Refactor lora adapter support (llama/8332)
* lora: load to device buft
* add patch tensor function
* correct tensor patch
* llama_lora_adapter_apply
* correct ggml_backend_tensor_copy
* add llm_build_mm
* fix auto merge
* update based on review comments
* add convert script
* no more transpose A
* add f16 convert
* add metadata check
* add sanity check
* fix ftype
* add requirements
* fix requirements
* fix outfile
* conversion: only allow selected models
* fix types
* cuda : do not use dmmv if the tensor does not have enough cols
* llama : lora fixes
* do not disable mmap with lora
Co-authored-by: slaren <redacted>
* llm_build_lora_mm_id
* convert_lora : MoE LoRA conversion support
* convert_lora : prefer safetensors, similarly to convert_hf
* convert_hf : simplify modify_tensors for InternLM2
* convert_lora : lazy conversion
* llama : load and use alpha from LoRA adapters
* llama : use llm_build_lora_mm in most model graphs
* auto scale
* Revert "auto scale"
This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.
* remove redundant params
* Apply suggestions from code review
Co-authored-by: slaren <redacted>
* change kv metadata
* move add_type to __init__
* convert_hf : move add_type to main()
* convert_lora : use the GGUFWriter from Model instead of overwriting it
---------
Co-authored-by: slaren <redacted>
Co-authored-by: Francis Couture-Harpin <redacted>
Meng, Hengyu [Mon, 15 Jul 2024 11:32:15 +0000 (19:32 +0800)]
add concat through dim 1/2 (llama/8483)
* add concat through dim 1/2
0cc4m [Mon, 15 Jul 2024 07:38:52 +0000 (09:38 +0200)]
Vulkan MMQ Fix (llama/8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter
* Fix Vulkan op result checker build error
bandoti [Sat, 13 Jul 2024 16:12:39 +0000 (13:12 -0300)]
vulkan : cmake integration (llama/8119)
* Add Vulkan to CMake pkg
* Add Sycl to CMake pkg
* Add OpenMP to CMake pkg
* Split generated shader file into separate translation unit
* Add CMake target for Vulkan shaders
* Update README.md
* Add make target for Vulkan shaders
* Use pkg-config to locate vulkan library
* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow
* Clean up tabs
* Move sudo to apt-key invocation
* Forward GGML_EXTRA_LIBS to CMake config pkg
* Update vulkan obj file paths
* Add shaderc to nix pkg
* Add python3 to Vulkan nix build
* Link against ggml in cmake pkg
* Remove Python dependency from Vulkan build
* code review changes
* Remove trailing newline
* Add cflags from pkg-config to fix w64devkit build
* Update README.md
* Remove trailing whitespace
* Update README.md
* Remove trailing whitespace
* Fix doc heading
* Make glslc required Vulkan component
* remove clblast from nix pkg
Georgi Gerganov [Sat, 13 Jul 2024 15:32:33 +0000 (18:32 +0300)]
metal : template-ify some of the kernels (llama/8447)
ggml-ci
Georgi Gerganov [Fri, 12 Jul 2024 07:46:02 +0000 (10:46 +0300)]
ggml : minor naming changes (llama/8433)
* ggml : minor naming changes
ggml-ci
* ggml : use PRId64 [no ci]
* ggml : revert FA K/Q names
Chen Xi [Fri, 12 Jul 2024 00:52:04 +0000 (00:52 +0000)]
fix the mul_mat_id ut issues (llama/8427)
* fix part of mul_mat_id
* skip the bfloat 16 sycl ut
Signed-off-by: Chen Xi <redacted>
---------
Signed-off-by: Chen Xi <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Chen Xi <redacted>
Nicholai Tukanov [Thu, 11 Jul 2024 16:49:15 +0000 (11:49 -0500)]
ggml : add NVPL BLAS support (ggml/8329) (llama/8425)
* ggml : add NVPL BLAS support
* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`
---------
Co-authored-by: ntukanov <redacted>
Daniel Bevenius [Thu, 11 Jul 2024 15:53:42 +0000 (17:53 +0200)]
cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
* cuda : suppress 'noreturn' warn in no_device_code
This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:
```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
346 | }
| ^
```
The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.
Signed-off-by: Daniel Bevenius <redacted>
* squash! cuda : suppress 'noreturn' warn in no_device_code
Update __trap macro instead of using a while loop to suppress the
warning.
Signed-off-by: Daniel Bevenius <redacted>
---------
Signed-off-by: Daniel Bevenius <redacted>
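A minimal host-side C++ analog of the warning described above (the real code is device code in ggml-cuda/common.cuh, and the final fix updated the __trap macro; the function body below is illustrative only):

```cpp
#include <cstdio>

// A function declared [[noreturn]] must provably never return; if the last
// statement is an ordinary call, clang warns with -Winvalid-noreturn. Ending
// the function with an infinite loop (or a builtin trap) guarantees no path
// falls off the end and silences the warning.
[[noreturn]] static void no_device_code(const char * file, int line) {
    std::fprintf(stderr, "%s:%d: no device code available for this architecture\n", file, line);
    // Without one of the two lines below the compiler cannot prove that the
    // function never returns:
    __builtin_trap();   // GCC/Clang intrinsic, itself marked noreturn
    while (true) { }    // the loop alone would also be enough
}
```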
Johannes Gäßler [Thu, 11 Jul 2024 14:47:47 +0000 (16:47 +0200)]
CUDA: optimize and refactor MMQ (llama/8416)
* CUDA: optimize and refactor MMQ
* explicit q8_1 memory layouts, add documentation
AidanBeltonS [Wed, 10 Jul 2024 15:10:49 +0000 (16:10 +0100)]
Use multi_ptr to clean up deprecation warnings (llama/8256)
Georgi Gerganov [Wed, 10 Jul 2024 12:23:29 +0000 (15:23 +0300)]
ggml : move sgemm sources to llamafile subfolder (llama/8394)
ggml-ci
Dibakar Gope [Wed, 10 Jul 2024 12:14:51 +0000 (07:14 -0500)]
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files
* Arm AArch64: minor code refactoring for rebase
* Arm AArch64: minor code refactoring for resolving a build issue with cmake
* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code change for resolving a build issue with server-windows
* retrigger checks
* Arm AArch64: minor code changes for rebase
* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits
* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig
* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code refactoring
* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat
* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat
* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types
* Arm AArch64: minor code refactoring
* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure
* Arm AArch64: remove a redundant comment
* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off
* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels
* Arm AArch64: update docs/build.md README to include compile-time flags for building the Q4_0_4_4 quant type
Alberto Cabrera Pérez [Tue, 9 Jul 2024 14:03:15 +0000 (15:03 +0100)]
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend
* Reduced verbosity of comment
Alberto Cabrera Pérez [Mon, 8 Jul 2024 13:22:41 +0000 (14:22 +0100)]
sycl : fix powf call in device code (llama/8368)
Mahesh Madhav [Thu, 25 Jul 2024 07:54:08 +0000 (00:54 -0700)]
ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
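A minimal sketch of the idea, assuming a plain float dot product rather than the actual ggml routines: splitting the loop into small fixed-size tiles with independent accumulators gives an ISA with enough registers room to keep all partial sums live and shortens the dependency chain of adds.

```cpp
#include <cstdio>

// Untiled reference: one accumulator, one long dependency chain of adds.
static float dot_plain(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i]*y[i];
    }
    return sum;
}

// Tiled version: process 4 elements per iteration into 4 independent
// accumulators, so the compiler can keep them in registers and schedule
// the multiplies/adds in parallel; a scalar tail handles the remainder.
static float dot_tiled(const float * x, const float * y, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i+0]*y[i+0];
        s1 += x[i+1]*y[i+1];
        s2 += x[i+2]*y[i+2];
        s3 += x[i+3]*y[i+3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {
        sum += x[i]*y[i];
    }
    return sum;
}

int main() {
    const float x[7] = {1, 2, 3, 4, 5, 6, 7};
    const float y[7] = {1, 1, 1, 1, 1, 1, 1};
    // Both versions agree; the tiled one illustrates the scalar-path optimization.
    std::printf("%f %f\n", dot_plain(x, y, 7), dot_tiled(x, y, 7));
    return 0;
}
```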
Ivan Filipov [Mon, 22 Jul 2024 11:32:02 +0000 (14:32 +0300)]
ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations
* Add support for float16 input tensors in 2d pooling operations
* code cleanup
remove unnecessary casting during srow ptr initialization
---------
Co-authored-by: vanaka11 <redacted>
Tony Wasserka [Sat, 20 Jul 2024 18:49:44 +0000 (20:49 +0200)]
vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.
Co-authored-by: Tony Wasserka <redacted>
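A simplified C++ sketch of the pattern (the actual vk_buffer_struct and the ggml-vulkan allocation paths differ): defaulting the handles to VK_NULL_HANDLE lets a destroy routine called on a partially initialized object free only what was actually created.

```cpp
#include <vulkan/vulkan.h>

// Simplified stand-in for a buffer wrapper. Defaulting the handles to
// VK_NULL_HANDLE means a partially initialized instance (e.g. one created
// just before an out-of-device-memory failure) carries no garbage handles.
struct vk_buffer {
    VkBuffer       buffer = VK_NULL_HANDLE;
    VkDeviceMemory memory = VK_NULL_HANDLE;
    VkDevice       device = VK_NULL_HANDLE;
};

// Destroy path: only free what was actually created. With the defaults above
// this is safe to call on a partially constructed vk_buffer.
static void vk_buffer_destroy(vk_buffer & buf) {
    if (buf.device == VK_NULL_HANDLE) {
        return;
    }
    if (buf.buffer != VK_NULL_HANDLE) {
        vkDestroyBuffer(buf.device, buf.buffer, nullptr);
        buf.buffer = VK_NULL_HANDLE;
    }
    if (buf.memory != VK_NULL_HANDLE) {
        vkFreeMemory(buf.device, buf.memory, nullptr);
        buf.memory = VK_NULL_HANDLE;
    }
}
```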
Borislav Stanimirov [Fri, 12 Jul 2024 14:24:20 +0000 (17:24 +0300)]
cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)
Georgi Gerganov [Thu, 8 Aug 2024 11:00:51 +0000 (14:00 +0300)]
scripts : sync new files (#0)
Daven Sanassy [Mon, 5 Aug 2024 06:48:26 +0000 (07:48 +0100)]
cmake : fix compile in xcode (#2311)
Georgi Gerganov [Sat, 27 Jul 2024 17:35:04 +0000 (20:35 +0300)]
whisper : handle empty mel (#2324)
Matt Stephenson [Tue, 16 Jul 2024 07:21:09 +0000 (03:21 -0400)]
whisper : use vulkan as gpu backend when available (#2302)
* ggml: use vulkan as gpu backend when available
Signed-off-by: Matt Stephenson <redacted>
* whisper: enable using vk as default buffer type
Signed-off-by: Matt Stephenson <redacted>
---------
Signed-off-by: Matt Stephenson <redacted>
arizhih [Mon, 15 Jul 2024 12:50:36 +0000 (14:50 +0200)]
whisper : fix DTW assert (#2299)
Georgi Gerganov [Tue, 9 Jul 2024 15:54:18 +0000 (18:54 +0300)]
cmake : use WHISPER_EXTRA_FLAGS (#2294)
Borislav Stanimirov [Mon, 8 Jul 2024 14:08:55 +0000 (17:08 +0300)]
cmake : allow external ggml
Georgi Gerganov [Mon, 8 Jul 2024 12:36:51 +0000 (15:36 +0300)]
cmake : try to fix openvino build (#2281)
Georgi Gerganov [Mon, 8 Jul 2024 11:21:04 +0000 (14:21 +0300)]
cmake : remove install of llama convert script [no ci] (#2266)
Georgi Gerganov [Mon, 8 Jul 2024 11:19:36 +0000 (14:19 +0300)]
make : remove llama prints [no ci] (#2265)
Georgi Gerganov [Mon, 8 Jul 2024 11:14:17 +0000 (14:14 +0300)]
talk-llama : sync llama.cpp
Georgi Gerganov [Mon, 8 Jul 2024 11:09:09 +0000 (14:09 +0300)]
examples : fix compile warnings [no ci] (#0)
Georgi Gerganov [Mon, 8 Jul 2024 10:50:28 +0000 (13:50 +0300)]
sync : ggml
Georgi Gerganov [Mon, 8 Jul 2024 10:50:14 +0000 (13:50 +0300)]
ggml : sync sycl (skip) (#0)
Georgi Gerganov [Mon, 8 Jul 2024 10:48:14 +0000 (13:48 +0300)]
scripts : fix sync scripts
Daniel Bevenius [Mon, 8 Jul 2024 10:03:42 +0000 (12:03 +0200)]
ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed, since the variable n0 is already used in the code and therefore does not trigger an unused-variable warning.
Signed-off-by: Daniel Bevenius <redacted>
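For context, a sketch of the usual form of such a macro (not necessarily the exact ggml definition): it casts its argument to void purely to silence unused-variable warnings, so it becomes redundant once the variable is genuinely used.

```cpp
// Typical definition: evaluate the name and discard it, which silences
// -Wunused-variable / -Wunused-parameter.
#define UNUSED(x) (void)(x)

static int scale(int n0, int unused_flag) {
    UNUSED(unused_flag);   // needed: unused_flag is never read otherwise
    // UNUSED(n0) would be redundant here, because n0 is used below.
    return 2*n0;
}
```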
Natsu [Fri, 5 Jul 2024 14:29:35 +0000 (22:29 +0800)]
cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281)
Ouadie EL FAROUKI [Fri, 5 Jul 2024 12:23:25 +0000 (13:23 +0100)]
Enabled more data types for oneMKL gemm_batch (llama/8236)
Johannes Gäßler [Fri, 5 Jul 2024 07:06:31 +0000 (09:06 +0200)]
CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278)
Daniele [Fri, 5 Jul 2024 07:06:09 +0000 (07:06 +0000)]
CUDA: revert part of the RDNA1 optimizations (llama/8309)
The launch_bounds change was causing a small performance drop of about 25 t/s when running perplexity.
Johannes Gäßler [Fri, 5 Jul 2024 07:05:34 +0000 (09:05 +0200)]
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311)
luoyu-intel [Fri, 5 Jul 2024 05:06:13 +0000 (05:06 +0000)]
Fix WARP_SIZE=16 bug on Intel GPUs (llama/8266)
* fix group_norm ut
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp
Neo Zhang Jianyu [Fri, 5 Jul 2024 02:32:29 +0000 (10:32 +0800)]
replace get_work_group_size() with a local cache for performance (llama/8286)
Co-authored-by: arthw <redacted>
Daniele [Wed, 3 Jul 2024 23:02:58 +0000 (23:02 +0000)]
Define and optimize RDNA1 (llama/8085)
Judd [Wed, 3 Jul 2024 12:40:16 +0000 (20:40 +0800)]
fix typo (llama/8267)
Co-authored-by: Judd <redacted>
Clint Herron [Tue, 2 Jul 2024 16:18:10 +0000 (12:18 -0400)]
Remove multiple newlines at the end of files that were breaking the editorconfig step of CI (llama/8258)
slaren [Tue, 2 Jul 2024 06:39:38 +0000 (08:39 +0200)]
cuda : update supports_op for matrix multiplication (llama/8245)
luoyu-intel [Tue, 2 Jul 2024 04:50:07 +0000 (04:50 +0000)]
Fix win build conflict of math library (llama/8230)
* fix win build conflict of math library
* fix the condition: !(win32 & SYCL)
* revert warp_size=16
luoyu-intel [Tue, 2 Jul 2024 02:16:00 +0000 (02:16 +0000)]
Fix the sub group size of Intel (llama/8106)
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size
Johannes Gäßler [Mon, 1 Jul 2024 18:39:06 +0000 (20:39 +0200)]
CUDA: refactor and optimize IQ MMVQ (llama/8215)
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
zhentaoyu [Mon, 1 Jul 2024 11:39:06 +0000 (19:39 +0800)]
Update SYCL-Rope op and Refactor (llama/8157)
* align with rope.cu and move sycl-op to a single file
Johannes Gäßler [Thu, 27 Jun 2024 14:26:05 +0000 (16:26 +0200)]
CUDA: fix MMQ stream-k for --split-mode row (llama/8167)
John Balis [Tue, 2 Jul 2024 16:09:52 +0000 (11:09 -0500)]
feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)
* conv transpose 1d passing test for 1d input and kernel
* working for different input and output channel counts, added test for variable stride
* initial draft appears to work with stride other than 1
* working with all old and new conv1d tests
* added a test for large tensors
* removed hardcoded CUDA usage
* restored test-conv-transpose.c
* removed unused arguments and fixed a bug where a test failure would cause subsequent tests to fail
* fixed accumulator bug
* added test to test-backend-ops
* fixed mistake
* addressed review
* fixed includes
* removed blank lines
* style and warning fixes
* return failure when test fails
* fix supports_op
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Mon, 8 Jul 2024 11:26:59 +0000 (14:26 +0300)]
ci : disable java build
Emmanuel Schmidbauer [Mon, 8 Jul 2024 11:24:58 +0000 (07:24 -0400)]
server : add inference path to make OAI API compatible (#2270)
Georgi Gerganov [Wed, 26 Jun 2024 20:20:19 +0000 (23:20 +0300)]
sync : ggml + fix sync script
Georgi Gerganov [Wed, 26 Jun 2024 20:20:13 +0000 (23:20 +0300)]
make : disable CUDA graphs
slaren [Wed, 26 Jun 2024 19:34:14 +0000 (21:34 +0200)]
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)
Georgi Gerganov [Wed, 26 Jun 2024 19:25:25 +0000 (22:25 +0300)]
make : disable CUDA mel build
Georgi Gerganov [Wed, 26 Jun 2024 18:42:39 +0000 (21:42 +0300)]
cmake : minor fixes
Georgi Gerganov [Wed, 26 Jun 2024 18:20:45 +0000 (21:20 +0300)]
make : fix missing -O3
same as https://github.com/ggerganov/llama.cpp/pull/8143
Georgi Gerganov [Wed, 26 Jun 2024 17:11:38 +0000 (20:11 +0300)]
whisper : disable CUDA mel + fix FFMPEG
Georgi Gerganov [Wed, 26 Jun 2024 16:40:23 +0000 (19:40 +0300)]
sync : ggml
Georgi Gerganov [Wed, 26 Jun 2024 16:34:09 +0000 (19:34 +0300)]
whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]
* files : reorganize [no ci]
* sync : llama.cpp
* cmake : link math library
* cmake : build normal ggml library
* files : move headers to include
* objc : fix path to ggml-metal.h
* ci : fix WHISPER_CUDA -> GGML_CUDA
* scripts : sync LICENSE [no ci]
mky_coder [Tue, 18 Jun 2024 15:10:33 +0000 (23:10 +0800)]
whisper : optimize fft() function (#2242)
Co-authored-by: Mike Fan <redacted>
Georgi Gerganov [Tue, 18 Jun 2024 06:45:37 +0000 (09:45 +0300)]
talk-llama : sync llama.cpp
Georgi Gerganov [Tue, 18 Jun 2024 06:37:20 +0000 (09:37 +0300)]
whisper : use ggml_backend_sched (#2239)
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Sun, 16 Jun 2024 16:23:55 +0000 (19:23 +0300)]
fix : remove extra files
Georgi Gerganov [Sun, 16 Jun 2024 16:23:32 +0000 (19:23 +0300)]
scripts : sync ggml-blas
Georgi Gerganov [Sun, 16 Jun 2024 16:10:20 +0000 (19:10 +0300)]
build : update make / cmake
Georgi Gerganov [Sun, 16 Jun 2024 15:40:07 +0000 (18:40 +0300)]
sync : ggml
slaren [Sun, 16 Jun 2024 10:57:37 +0000 (13:57 +0300)]
move BLAS to a separate backend (cont) (llama/6210)
ggml-ci
0cc4m [Sun, 16 Jun 2024 05:17:31 +0000 (07:17 +0200)]
Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
* Improve debug log code
* Add memory debug output option
* Fix flake8
* Fix unnecessarily high llama-3 VRAM use
Georgi Gerganov [Sun, 16 Jun 2024 15:38:46 +0000 (18:38 +0300)]
scripts : stop sync whisper example from ggml
Georgi Gerganov [Sun, 16 Jun 2024 14:57:35 +0000 (17:57 +0300)]
cmake : fix sycl build (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:46:12 +0000 (13:46 +0300)]
ggml : remove OpenCL (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:24:17 +0000 (13:24 +0300)]
sycl : sync (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:20:19 +0000 (13:20 +0300)]
cuda : enable CUDA graphs (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:10:54 +0000 (13:10 +0300)]
talk-llama : sync llama.cpp
Georgi Gerganov [Sun, 16 Jun 2024 10:07:43 +0000 (13:07 +0300)]
cmake : fix CUDA build (#0)
Georgi Gerganov [Sun, 16 Jun 2024 09:43:14 +0000 (12:43 +0300)]
sync : ggml
ggml-ci
Hong Bo PENG [Sun, 16 Jun 2024 08:53:11 +0000 (16:53 +0800)]
ggml : fix and optimize ppc64le (ggml/849)
* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le
Daniel Bevenius [Sun, 16 Jun 2024 08:51:18 +0000 (10:51 +0200)]
ggml : remove duplicate include of ggml-common.h (ggml/853)
Signed-off-by: Daniel Bevenius <redacted>
Meng, Hengyu [Sat, 15 Jun 2024 06:05:10 +0000 (14:05 +0800)]
remove global variables (llama/7710)
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
Johannes Gäßler [Fri, 14 Jun 2024 16:41:49 +0000 (18:41 +0200)]
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision-related changes
Georgi Gerganov [Fri, 14 Jun 2024 14:14:09 +0000 (17:14 +0300)]
metal : utilize max shared memory for mul_mat_id (llama/7935)
Radoslav Gerganov [Thu, 13 Jun 2024 12:18:44 +0000 (15:18 +0300)]
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
slaren [Thu, 13 Jun 2024 01:11:35 +0000 (03:11 +0200)]
move BLAS to a separate backend (llama/6210)
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Wed, 12 Jun 2024 15:41:51 +0000 (17:41 +0200)]
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
Georgi Gerganov [Wed, 12 Jun 2024 13:00:22 +0000 (16:00 +0300)]
tests : add non-cont unary tests (llama/7857)
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
Georgi Gerganov [Wed, 12 Jun 2024 12:24:20 +0000 (15:24 +0300)]
ggml : improve ggml_is_contiguous logic (llama/7856)
* ggml : improve ggml_is_contiguous logic
ggml-ci
* ggml : support more contiguous cases
ggml-ci
k.h.lai [Tue, 11 Jun 2024 19:26:05 +0000 (03:26 +0800)]
vulkan: select only one device for single gpu with multiple drivers (llama/7582)
0cc4m [Tue, 11 Jun 2024 19:20:29 +0000 (21:20 +0200)]
Update Vulkan RoPE implementation (llama/7818)
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
Minor fixes
* Fix segfault when running out of VRAM
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>