git.djapps.eu Git - pkg/ggml/sources/ggml/log
Georgi Gerganov [Sat, 20 Jul 2024 14:15:42 +0000 (17:15 +0300)]
gguf : handle null name during init (llama/8587)
slaren [Fri, 19 Jul 2024 15:17:27 +0000 (17:17 +0200)]
ggml : fix quant dot product with odd number of blocks (llama/8549)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix odd blocks for ARM_NEON (llama/8556)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix q4_1
* ggml : fix q5_0
* ggml : fix q5_1
* ggml : fix iq4_nl metal
ggml-ci
* ggml : fix q4_0
* ggml : fix q8_0
ggml-ci
* ggml : remove special Q4_0 code for first 2 blocks
* ggml : fix sumf redefinition
---------
Co-authored-by: slaren <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
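The common thread in the fixes above is the handling of a leftover block when a dot-product loop consumes quant blocks two at a time. A minimal, hypothetical C sketch of that pattern (illustrative only; the actual iq4_nl/q4_0 kernels operate on packed block structs, not plain floats):
```c
// Illustrative only: a dot product that walks blocks in pairs must
// finish with a scalar tail, otherwise an odd block count drops data.
static float dot_blocks_sketch(const float * x, const float * y, int nb) {
    float sumf = 0.0f;
    int ib = 0;
    for (; ib + 1 < nb; ib += 2) {  // fast path: two blocks per iteration
        sumf += x[ib + 0]*y[ib + 0] + x[ib + 1]*y[ib + 1];
    }
    for (; ib < nb; ++ib) {         // tail: the odd leftover block
        sumf += x[ib]*y[ib];
    }
    return sumf;
}
```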
Clint Herron [Fri, 19 Jul 2024 11:05:45 +0000 (07:05 -0400)]
ggml : add friendlier error message to fopen errors (llama/8575)
* Add additional error information when model files fail to load.
* Adding additional error information to most instances of fopen.
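A minimal sketch of the pattern this commit describes, with a hypothetical helper name and message format (the actual wording in the commit may differ):
```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

// Report why fopen failed instead of printing a bare "failed to open".
static FILE * open_model_file_sketch(const char * path) {
    FILE * f = fopen(path, "rb");
    if (f == NULL) {
        fprintf(stderr, "failed to open '%s': %s\n", path, strerror(errno));
    }
    return f;
}
```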
Johannes Gäßler [Thu, 18 Jul 2024 21:48:47 +0000 (23:48 +0200)]
CUDA: fix partial offloading for ne0 % 256 != 0 (llama/8572)
65a [Thu, 18 Jul 2024 14:47:12 +0000 (07:47 -0700)]
cmake : install all ggml public headers (llama/8480)
Co-authored-by: 65a <redacted>
hipudding [Wed, 17 Jul 2024 11:23:50 +0000 (19:23 +0800)]
Add Ascend NPU backend (llama/6035)
* [CANN] Add Ascend NPU backend
Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.
CANN (Compute Architecture for Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.
Co-authored-by: wangshuai09 <redacted>
* delete trailing whitespaces
* Modify the code based on review comment
* Rename LLAMA_CANN to GGML_CANN
* Make ggml-common.h private
* add ggml_cann prefix for acl funcs
* Add logging for CANN backend
* Delete Trailing whitespace
---------
Co-authored-by: wangshuai09 <redacted>
Johannes Gäßler [Tue, 16 Jul 2024 19:20:59 +0000 (21:20 +0200)]
make/cmake: add missing force MMQ/cuBLAS for HIP (llama/8515)
Xuan Son Nguyen [Mon, 15 Jul 2024 18:50:47 +0000 (20:50 +0200)]
Refactor lora adapter support (llama/8332)
* lora: load to device buft
* add patch tensor function
* correct tensor patch
* llama_lora_adapter_apply
* correct ggml_backend_tensor_copy
* add llm_build_mm
* fix auto merge
* update based on review comments
* add convert script
* no more transpose A
* add f16 convert
* add metadata check
* add sanity check
* fix ftype
* add requirements
* fix requirements
* fix outfile
* conversion: only allow selected models
* fix types
* cuda : do not use dmmv if the tensor does not have enough cols
* llama : lora fixes
* do not disable mmap with lora
Co-authored-by: slaren <redacted>
* llm_build_lora_mm_id
* convert_lora : MoE LoRA conversion support
* convert_lora : prefer safetensors, similarly to convert_hf
* convert_hf : simplify modify_tensors for InternLM2
* convert_lora : lazy conversion
* llama : load and use alpha from LoRA adapters
* llama : use llm_build_lora_mm in most model graphs
* auto scale
* Revert "auto scale"
This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.
* remove redundant params
* Apply suggestions from code review
Co-authored-by: slaren <redacted>
* change kv metadata
* move add_type to __init__
* convert_hf : move add_type to main()
* convert_lora : use the GGUFWriter from Model instead of overwriting it
---------
Co-authored-by: slaren <redacted>
Co-authored-by: Francis Couture-Harpin <redacted>
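For context, a hedged sketch of the idea behind llm_build_lora_mm using the public ggml API (illustrative only; the real helper lives in llama.cpp and also handles adapter lookup and per-adapter scaling):
```c
#include "ggml.h"

// y = W*x + scale * B*(A*x), where scale is typically alpha/rank.
static struct ggml_tensor * lora_mm_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,       // base weight
        struct ggml_tensor  * lora_a,  // low-rank factor A
        struct ggml_tensor  * lora_b,  // low-rank factor B
        struct ggml_tensor  * x,       // input activations
        float                 scale) {
    struct ggml_tensor * y  = ggml_mul_mat(ctx, w, x);
    struct ggml_tensor * ax = ggml_mul_mat(ctx, lora_a, x);
    struct ggml_tensor * bx = ggml_mul_mat(ctx, lora_b, ax);
    return ggml_add(ctx, y, ggml_scale(ctx, bx, scale));
}
```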
Daniel Bevenius [Mon, 15 Jul 2024 12:48:17 +0000 (14:48 +0200)]
ggml : suppress unknown pragma 'GCC' on windows (llama/8460)
This commit adds a macro guard around the GCC pragma to avoid the
following warning on Windows:
```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\llama.cpp\build\ggml\src\ggml.vcxproj]
```
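A hedged sketch of the kind of guard described above (the exact pragma wrapped in ggml-aarch64.c may differ):
```c
// Only emit GCC-specific pragmas when the compiler understands them,
// so MSVC does not warn about an unknown pragma.
#if defined(__GNUC__)
#pragma GCC diagnostic ignored "-Woverlength-strings"
#endif
```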
Meng, Hengyu [Mon, 15 Jul 2024 11:32:15 +0000 (19:32 +0800)]
add concat through dim 1/2 (llama/8483)
* add concat through dim 1/2
0cc4m [Mon, 15 Jul 2024 07:38:52 +0000 (09:38 +0200)]
Vulkan MMQ Fix (llama/8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter
* Fix Vulkan op result checker build error
bandoti [Sat, 13 Jul 2024 16:12:39 +0000 (13:12 -0300)]
vulkan : cmake integration (llama/8119)
* Add Vulkan to CMake pkg
* Add Sycl to CMake pkg
* Add OpenMP to CMake pkg
* Split generated shader file into separate translation unit
* Add CMake target for Vulkan shaders
* Update README.md
* Add make target for Vulkan shaders
* Use pkg-config to locate vulkan library
* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow
* Clean up tabs
* Move sudo to apt-key invocation
* Forward GGML_EXTRA_LIBS to CMake config pkg
* Update vulkan obj file paths
* Add shaderc to nix pkg
* Add python3 to Vulkan nix build
* Link against ggml in cmake pkg
* Remove Python dependency from Vulkan build
* code review changes
* Remove trailing newline
* Add cflags from pkg-config to fix w64devkit build
* Update README.md
* Remove trailing whitespace
* Update README.md
* Remove trailing whitespace
* Fix doc heading
* Make glslc required Vulkan component
* remove clblast from nix pkg
Georgi Gerganov [Sat, 13 Jul 2024 15:32:33 +0000 (18:32 +0300)]
metal : template-ify some of the kernels (llama/8447)
ggml-ci
Georgi Gerganov [Fri, 12 Jul 2024 07:46:02 +0000 (10:46 +0300)]
ggml : minor naming changes (llama/8433)
* ggml : minor naming changes
ggml-ci
* ggml : use PRId64 [no ci]
* ggml : revert FA K/Q names
Chen Xi [Fri, 12 Jul 2024 00:52:04 +0000 (00:52 +0000)]
fix the mul_mat_id ut issues (llama/8427)
* fix part of mul_mat_id
* skip the bfloat 16 sycl ut
Signed-off-by: Chen Xi <redacted>
---------
Signed-off-by: Chen Xi <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Chen Xi <redacted>
Nicholai Tukanov [Thu, 11 Jul 2024 16:49:15 +0000 (11:49 -0500)]
ggml : add NVPL BLAS support (#8329) (llama/8425)
* ggml : add NVPL BLAS support
* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`
---------
Co-authored-by: ntukanov <redacted>
Daniel Bevenius [Thu, 11 Jul 2024 15:53:42 +0000 (17:53 +0200)]
cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
* cuda : suppress 'noreturn' warn in no_device_code
This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:
```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
346 | }
| ^
```
The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.
Signed-off-by: Daniel Bevenius <redacted>
* squash! cuda : suppress 'noreturn' warn in no_device_code
Update __trap macro instead of using a while loop to suppress the
warning.
Signed-off-by: Daniel Bevenius <redacted>
---------
Signed-off-by: Daniel Bevenius <redacted>
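A plain-C stand-in for the two approaches mentioned above (illustrative; the real code is a CUDA/HIP device function in common.cuh and uses __trap()):
```c
#include <stdlib.h>

// A function declared noreturn must not be able to reach its closing
// brace. Ending it with a call the compiler already knows is noreturn
// (abort() here stands in for __trap()) silences -Winvalid-noreturn;
// an infinite loop, as in the first version of the fix, works too.
__attribute__((noreturn)) static void no_device_code_sketch(void) {
    abort();
}
```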
Johannes Gäßler [Thu, 11 Jul 2024 14:47:47 +0000 (16:47 +0200)]
CUDA: optimize and refactor MMQ (llama/8416)
* CUDA: optimize and refactor MMQ
* explicit q8_1 memory layouts, add documentation
AidanBeltonS [Wed, 10 Jul 2024 15:10:49 +0000 (16:10 +0100)]
Use multi_ptr to clean up deprecated warnings (llama/8256)
Georgi Gerganov [Wed, 10 Jul 2024 12:23:29 +0000 (15:23 +0300)]
ggml : move sgemm sources to llamafile subfolder (llama/8394)
ggml-ci
Dibakar Gope [Wed, 10 Jul 2024 12:14:51 +0000 (07:14 -0500)]
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files
* Arm AArch64: minor code refactoring for rebase
* Arm AArch64: minor code refactoring for resolving a build issue with cmake
* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code change for resolving a build issue with server-windows
* retrigger checks
* Arm AArch64: minor code changes for rebase
* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits
* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig
* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code refactoring
* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat
* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat
* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types
* Arm AArch64: minor code refactoring
* Arm AArch64: minor code refactoring
* Arm AArch64: minor code refactoring
* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure
* Arm AArch64: remove a redundant comment
* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off
* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels
* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
Alberto Cabrera Pérez [Tue, 9 Jul 2024 14:03:15 +0000 (15:03 +0100)]
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend
* Reduced verbosity of comment
Alberto Cabrera Pérez [Mon, 8 Jul 2024 13:22:41 +0000 (14:22 +0100)]
sycl : fix powf call in device code (llama/8368)
Mahesh Madhav [Thu, 25 Jul 2024 07:54:08 +0000 (00:54 -0700)]
ggml : loop tiling optimizations for scalar path (#898)
Apply a loop tiling technique to the generic path, which provides a
performance upside for ISAs with enough registers to take advantage of
it, and also helps the compiler optimize this path.
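A minimal sketch of the loop-tiling idea on a plain dot product (illustrative; the commit applies it to the generic quantized matmul path):
```c
// Independent accumulators let the compiler keep partial sums in
// registers and overlap the multiply-adds.
static float dot_tiled_sketch(const float * x, const float * y, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0]*y[i + 0];
        s1 += x[i + 1]*y[i + 1];
        s2 += x[i + 2]*y[i + 2];
        s3 += x[i + 3]*y[i + 3];
    }
    for (; i < n; ++i) {  // scalar tail
        s0 += x[i]*y[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```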
Ivan Filipov [Mon, 22 Jul 2024 11:32:02 +0000 (14:32 +0300)]
ggml: add support for float16 input tensors in pooling operations (#895)
* Add support for float16 tensors in 1d pooling operations
* Add support for float16 input tensors in 2d pooling operations
* code cleanup
remove unnecessary casting during srow ptr initialization
---------
Co-authored-by: vanaka11 <redacted>
Brian [Mon, 22 Jul 2024 10:25:01 +0000 (20:25 +1000)]
gguf.md: naming convention synced to llama.cpp (#896)
The naming convention is now updated to this form:
`<BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf`
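For example, a hypothetical file following this pattern might be named `Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf` (illustrative only; see gguf.md for the authoritative field definitions).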
Brian [Sun, 21 Jul 2024 08:20:30 +0000 (18:20 +1000)]
gguf.md: kv store has new authorship metadata keys (#897)
Tony Wasserka [Sat, 20 Jul 2024 18:49:44 +0000 (20:49 +0200)]
vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (#893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.
Co-authored-by: Tony Wasserka <redacted>
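A hedged C sketch of the pattern (the actual struct in the Vulkan backend is C++ and has more members): handles start at VK_NULL_HANDLE, so a cleanup path that runs after a partial initialization never frees a garbage handle.
```c
#include <vulkan/vulkan.h>

struct vk_buffer_sketch {
    VkBuffer       buffer;
    VkDeviceMemory device_memory;
};

// Destroying a VK_NULL_HANDLE is a no-op, so cleanup stays safe even
// if buffer creation aborted before every member was assigned.
static void vk_buffer_sketch_init(struct vk_buffer_sketch * b) {
    b->buffer        = VK_NULL_HANDLE;
    b->device_memory = VK_NULL_HANDLE;
}
```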
Georgi Gerganov [Sat, 20 Jul 2024 13:38:56 +0000 (16:38 +0300)]
py : update packages + fix yolo warning
Borislav Stanimirov [Fri, 12 Jul 2024 14:24:20 +0000 (17:24 +0300)]
cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (#885)
Georgi Gerganov [Mon, 8 Jul 2024 11:54:35 +0000 (14:54 +0300)]
sync : whisper.cpp
Georgi Gerganov [Mon, 8 Jul 2024 11:09:09 +0000 (14:09 +0300)]
examples : fix compile warnings [no ci] (whisper/0)
Daniel Bevenius [Mon, 8 Jul 2024 10:03:42 +0000 (12:03 +0200)]
ggml : remove unnecessary UNUSED macro call (#880)
This commit removes an UNUSED macro call that is not needed, since the
variable n0 is used in the code and therefore does not produce an
unused-variable warning.
Signed-off-by: Daniel Bevenius <redacted>
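For context, GGML_UNUSED is commonly defined as a void cast; a hedged illustration of when the call is (and is not) needed:
```c
#include <stdio.h>

#define GGML_UNUSED(x) (void)(x)

static void example(int n0, int n1) {
    GGML_UNUSED(n1);          // n1 is never read: silence -Wunused-parameter
    printf("n0 = %d\n", n0);  // n0 is used, so no UNUSED call is needed
}
```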
Georgi Gerganov [Mon, 8 Jul 2024 09:23:45 +0000 (12:23 +0300)]
sync : llama.cpp
ggml-ci
Georgi Gerganov [Mon, 8 Jul 2024 07:39:36 +0000 (10:39 +0300)]
tests : fix whitespace (llama/0)
Natsu [Fri, 5 Jul 2024 14:29:35 +0000 (22:29 +0800)]
cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281)
Ouadie EL FAROUKI [Fri, 5 Jul 2024 12:23:25 +0000 (13:23 +0100)]
Enabled more data types for oneMKL gemm_batch (llama/8236)
Johannes Gäßler [Fri, 5 Jul 2024 07:06:31 +0000 (09:06 +0200)]
CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278)
Daniele [Fri, 5 Jul 2024 07:06:09 +0000 (07:06 +0000)]
CUDA: revert part of the RDNA1 optimizations (llama/8309)
The change to the launch_bounds was causing a small performance drop of 25 t/s during perplexity evaluation.
Johannes Gäßler [Fri, 5 Jul 2024 07:05:34 +0000 (09:05 +0200)]
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311)
luoyu-intel [Fri, 5 Jul 2024 05:06:13 +0000 (05:06 +0000)]
Fix WARP_SIZE=16 bug of Intel GPU (llama/8266)
* fix group_norm ut
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp
Neo Zhang Jianyu [Fri, 5 Jul 2024 02:32:29 +0000 (10:32 +0800)]
rm get_work_group_size() calls by caching the value locally, for performance (llama/8286)
Co-authored-by: arthw <redacted>
AidanBeltonS [Thu, 4 Jul 2024 01:07:19 +0000 (02:07 +0100)]
Remove unneeded semicolons (llama/8280)
Daniele [Wed, 3 Jul 2024 23:02:58 +0000 (23:02 +0000)]
Define and optimize RDNA1 (llama/8085)
Judd [Wed, 3 Jul 2024 12:40:16 +0000 (20:40 +0800)]
fix typo (llama/8267)
Co-authored-by: Judd <redacted>
AidanBeltonS [Wed, 3 Jul 2024 01:55:34 +0000 (02:55 +0100)]
Dequant improvements rebase (llama/8255)
* Single load for half2
* Store scales in local mem
* Vec load quantized values
Clint Herron [Tue, 2 Jul 2024 16:18:10 +0000 (12:18 -0400)]
Remove multiple newlines at the end of files that were breaking the editorconfig step of CI (llama/8258)
slaren [Tue, 2 Jul 2024 06:39:38 +0000 (08:39 +0200)]
cuda : update supports_op for matrix multiplication (llama/8245)
luoyu-intel [Tue, 2 Jul 2024 04:50:07 +0000 (04:50 +0000)]
Fix win build conflict of math library (llama/8230)
* fix win build conflict of math library
* fix the condition: !(win32 & SYCL)
* revert warp_size=16
luoyu-intel [Tue, 2 Jul 2024 02:16:00 +0000 (02:16 +0000)]
Fix the sub group size of Intel (llama/8106)
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size
Johannes Gäßler [Mon, 1 Jul 2024 18:39:06 +0000 (20:39 +0200)]
CUDA: refactor and optimize IQ MMVQ (llama/8215)
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
zhentaoyu [Mon, 1 Jul 2024 11:39:06 +0000 (19:39 +0800)]
Update SYCL-Rope op and Refactor (llama/8157)
* align with rope.cu and move sycl-op to a single file
Johannes Gäßler [Thu, 27 Jun 2024 14:26:05 +0000 (16:26 +0200)]
CUDA: fix MMQ stream-k for --split-mode row (llama/8167)
slaren [Tue, 2 Jul 2024 17:11:52 +0000 (19:11 +0200)]
fix uses of GGML_USE_CUBLAS in tests and examples (#879)
* fix uses of GGML_USE_CUBLAS in tests and examples
* fix ci/run.sh
ggml-ci
John Balis [Tue, 2 Jul 2024 16:09:52 +0000 (11:09 -0500)]
feat: cuda implementation for `ggml_conv_transpose_1d` (#854)
* conv transpose 1d passing test for 1d input and kernel
* working for different input and output channel counts, added test for variable stride
* initial draft appears to work with stride other than 1
* working with all old and new conv1d tests
* added a test for large tensors
* removed use cuda hardcoding
* restored test-conv-transpose.c
* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail
* fixed accumulator bug
* added test to test-backend-ops
* fixed mistake
* addressed review
* fixed includes
* removed blank lines
* style and warning fixes
* return failure when test fails
* fix supports_op
---------
Co-authored-by: slaren <redacted>
Yilong Guo [Sun, 30 Jun 2024 16:05:44 +0000 (09:05 -0700)]
sycl : add build instruction (#870)
John Balis [Sun, 30 Jun 2024 15:14:31 +0000 (10:14 -0500)]
update "Using cuBLAS" to use correct update cuda compile flag (#876)
It seems like the previous `-DGGML_CUBLAS=ON` compile flag was deprecated.
Georgi Gerganov [Wed, 26 Jun 2024 20:26:16 +0000 (23:26 +0300)]
sync : whisper.cpp
Georgi Gerganov [Wed, 26 Jun 2024 17:11:38 +0000 (20:11 +0300)]
whisper : disable CUDA mel + fix FFMPEG
Georgi Gerganov [Wed, 26 Jun 2024 19:42:27 +0000 (22:42 +0300)]
sync : llama.cpp
slaren [Wed, 26 Jun 2024 19:34:14 +0000 (21:34 +0200)]
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)
Georgi Gerganov [Wed, 26 Jun 2024 16:40:53 +0000 (19:40 +0300)]
sync : llama.cpp, whisper.cpp
Georgi Gerganov [Wed, 26 Jun 2024 16:33:53 +0000 (19:33 +0300)]
ggml : reorganize source code + improve CMake (#865)
* scripts : update sync [no ci]
* ggml : move headers one up [no ci]
* files : reorganize + update CMake
ggml-ci
* cmake : build normal ggml library
ggml-ci
* cmake : link math library to test + remove ci for code cov
ggml-ci
* files : move public headers to include
ggml-ci
Georgi Gerganov [Fri, 21 Jun 2024 07:25:14 +0000 (10:25 +0300)]
files : remove old (#0)
Georgi Gerganov [Tue, 18 Jun 2024 06:48:08 +0000 (09:48 +0300)]
sync : whisper.cpp
Georgi Gerganov [Tue, 18 Jun 2024 06:37:20 +0000 (09:37 +0300)]
whisper : use ggml_backend_sched (whisper/2239)
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Sun, 16 Jun 2024 17:31:33 +0000 (20:31 +0300)]
sync : whisper.cpp
Georgi Gerganov [Tue, 11 Jun 2024 14:39:01 +0000 (17:39 +0300)]
cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)
* cuda : fix bounds check for src0 rows in MMVQ kernel
* Update ggml-cuda/mmvq.cu
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Borislav Stanimirov [Fri, 31 May 2024 08:37:29 +0000 (11:37 +0300)]
whisper : remove `speed_up` and `phase_vocoder*` functions (whisper/2198)
* whisper : fix cast warning
* whisper : remove phase_vocoder functions, ref #2195
* whisper : remove speed_up from whisper_full_params, closes #2195
William Tambellini [Tue, 21 May 2024 15:31:41 +0000 (08:31 -0700)]
examples : add support for decoding input with ffmpeg (Linux) (whisper/2133)
- search for ffmpeg libs/headers at cmake time
- added ffmpeg-transcode.cpp to libcommon when ffmpeg is enabled
- hooked ffmpeg transcoding into common read_wav(...)
- passed test:
./main -m ggml-base.en.bin -f samples/jfk.mp3
Georgi Gerganov [Sun, 16 Jun 2024 16:10:54 +0000 (19:10 +0300)]
examples : remove whisper (#860)
ggml-ci
slaren [Sun, 16 Jun 2024 10:57:37 +0000 (13:57 +0300)]
move BLAS to a separate backend (cont) (llama/6210)
ggml-ci
Georgi Gerganov [Sun, 16 Jun 2024 10:56:06 +0000 (13:56 +0300)]
scripts : sync ggml-blas
0cc4m [Sun, 16 Jun 2024 05:17:31 +0000 (07:17 +0200)]
Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory
* Improve debug log code
* Add memory debug output option
* Fix flake8
* Fix unnecessarily high llama-3 VRAM use
Georgi Gerganov [Sun, 16 Jun 2024 10:42:57 +0000 (13:42 +0300)]
ggml : remove OpenCL (#0)
Georgi Gerganov [Sun, 16 Jun 2024 10:05:11 +0000 (13:05 +0300)]
cmake : fix cuda vars (#0)
Georgi Gerganov [Sun, 16 Jun 2024 09:40:38 +0000 (12:40 +0300)]
scripts : update sync
Hong Bo PENG [Sun, 16 Jun 2024 08:53:11 +0000 (16:53 +0800)]
ggml : fix and optimize ppc64le (#849)
* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le
Daniel Bevenius [Sun, 16 Jun 2024 08:51:18 +0000 (10:51 +0200)]
ggml : remove duplicate include of ggml-common.h (#853)
Signed-off-by: Daniel Bevenius <redacted>
Yilong Guo [Sun, 16 Jun 2024 07:51:38 +0000 (00:51 -0700)]
sycl : remove global variables (cont) (llama/7710)
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
Co-authored-by: Meng, Hengyu <redacted>
Yilong Guo [Sun, 16 Jun 2024 07:40:35 +0000 (00:40 -0700)]
scripts : add ggml-sycl to sync scripts (#857)
Georgi Gerganov [Sat, 15 Jun 2024 18:12:18 +0000 (21:12 +0300)]
ci : add GG_BUILD_NO_DOWNLOAD
ggml-ci
Georgi Gerganov [Sat, 15 Jun 2024 17:54:22 +0000 (20:54 +0300)]
ggml : remove opencl (#0)
ggml-ci
Georgi Gerganov [Sat, 15 Jun 2024 17:53:02 +0000 (20:53 +0300)]
cuda : update build (#0)
ggml-ci
Georgi Gerganov [Sat, 15 Jun 2024 17:16:55 +0000 (20:16 +0300)]
sync : llama.cpp
ggml-ci
Georgi Gerganov [Sat, 15 Jun 2024 17:16:42 +0000 (20:16 +0300)]
tests : adapt to changes (#0)
Meng, Hengyu [Sat, 15 Jun 2024 06:05:10 +0000 (14:05 +0800)]
remove global variables (llama/7710)
* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related
Johannes Gäßler [Fri, 14 Jun 2024 16:41:49 +0000 (18:41 +0200)]
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision-related changes
Georgi Gerganov [Fri, 14 Jun 2024 14:14:09 +0000 (17:14 +0300)]
metal : utilize max shared memory for mul_mat_id (llama/7935)
Radoslav Gerganov [Thu, 13 Jun 2024 12:18:44 +0000 (15:18 +0300)]
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
slaren [Thu, 13 Jun 2024 01:11:35 +0000 (03:11 +0200)]
move BLAS to a separate backend (llama/6210)
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Wed, 12 Jun 2024 15:41:51 +0000 (17:41 +0200)]
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
Georgi Gerganov [Wed, 12 Jun 2024 13:00:22 +0000 (16:00 +0300)]
tests : add non-cont unary tests (llama/7857)
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
Georgi Gerganov [Wed, 12 Jun 2024 12:24:20 +0000 (15:24 +0300)]
ggml : improve ggml_is_contiguous logic (llama/7856)
* ggml : improve ggml_is_contiguous logic
ggml-ci
* ggml : support more contiguous cases
ggml-ci
k.h.lai [Tue, 11 Jun 2024 19:26:05 +0000 (03:26 +0800)]
vulkan: select only one device for single gpu with multiple drivers (llama/7582)
0cc4m [Tue, 11 Jun 2024 19:20:29 +0000 (21:20 +0200)]
Update Vulkan RoPE implementation (llama/7818)
* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
Minor fixes
* Fix segfault when running out of VRAM
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Johannes Gäßler [Tue, 11 Jun 2024 06:26:07 +0000 (08:26 +0200)]
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
Johannes Gäßler [Mon, 10 Jun 2024 09:45:13 +0000 (11:45 +0200)]
CUDA: use tensor cores for MMQ (llama/7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early
Ben Ashbaugh [Mon, 10 Jun 2024 09:21:31 +0000 (02:21 -0700)]
use the correct SYCL context for host USM allocations (llama/7777)
Signed-off-by: Ben Ashbaugh <redacted>
Johannes Gäßler [Sun, 9 Jun 2024 07:42:25 +0000 (09:42 +0200)]
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)