git.djapps.eu Git - pkg/ggml/sources/ggml/log
2 years ago Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Mathijs de Bruin [Sat, 3 Feb 2024 18:00:11 +0000 (18:00 +0000)]
Resolve ErrorIncompatibleDriver with Vulkan on MacOS.

Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
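
On macOS the usual cause of VK_ERROR_INCOMPATIBLE_DRIVER is that MoltenVK is a portability (non-conformant) driver, which newer Vulkan loaders hide unless portability enumeration is explicitly requested. A minimal sketch of that technique, assuming the Vulkan SDK headers (it illustrates the general fix, not necessarily the exact patch):

```cpp
#include <vulkan/vulkan.h>

VkInstance create_instance_macos() {
    const char * extensions[] = {
        VK_KHR_PORTABILITY_ENUMERATION_EXTENSION_NAME, // "VK_KHR_portability_enumeration"
    };

    VkInstanceCreateInfo info = {};
    info.sType                   = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    // without this flag the loader refuses to enumerate MoltenVK and
    // vkCreateInstance fails with VK_ERROR_INCOMPATIBLE_DRIVER
    info.flags                   = VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;
    info.enabledExtensionCount   = 1;
    info.ppEnabledExtensionNames = extensions;

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&info, nullptr, &instance);
    return instance;
}
```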

2 years ago Allow for Vulkan build with Accelerate.
Mathijs de Bruin [Sat, 3 Feb 2024 17:56:46 +0000 (17:56 +0000)]
Allow for Vulkan build with Accelerate.

Closes #5304

2 years ago cuda : ignore peer access already enabled errors (llama/5597)
slaren [Mon, 19 Feb 2024 22:40:26 +0000 (23:40 +0100)]
cuda : ignore peer access already enabled errors (llama/5597)

* cuda : ignore peer access already enabled errors

* fix hip
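
The pattern here is small but easy to get wrong: cudaDeviceEnablePeerAccess returns cudaErrorPeerAccessAlreadyEnabled if another component in the process already enabled the link, and the sticky error state must then be cleared. A hedged sketch of the technique (illustrative function, not the ggml-cuda code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

static bool enable_peer_access(int device, int peer) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, device, peer);
    if (!can_access) {
        return false;
    }
    cudaSetDevice(device);
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError(); // already enabled is not a failure; clear the error state
        return true;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "peer access %d -> %d failed: %s\n", device, peer, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```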

2 years ago ggml : compute forward no longer pass src tensors (#729)
Siddharth Ramakrishnan [Wed, 21 Feb 2024 12:34:53 +0000 (04:34 -0800)]
ggml : compute forward no longer pass src tensors (#729)

* refactored compute forward to not pass in the src tensors each time

* fix merge issues with flags

* missed one place in the last commit to fix the is_param / flags issue

* minor spacing fix

* fixed some variable assignments so all tests locally are passing

* new change after merge fix

---------

Co-authored-by: siddharthvader <redacted>
2 years ago ggml : fix conv_2d batch mode (#737)
bssrdf [Tue, 20 Feb 2024 19:17:09 +0000 (14:17 -0500)]
ggml : fix conv_2d batch mode (#737)

Co-authored-by: bssrdf <redacted>
2 years ago sync : whisper.cpp
Georgi Gerganov [Mon, 19 Feb 2024 13:56:03 +0000 (15:56 +0200)]
sync : whisper.cpp

2 years ago build : update CBLAS flags + fix unused var warning (whisper/0)
Georgi Gerganov [Mon, 19 Feb 2024 12:44:46 +0000 (14:44 +0200)]
build : update CBLAS flags + fix unused var warning (whisper/0)

2 years ago main : check if input files exist before proceeding (whisper/1872)
Davidson Francis [Mon, 19 Feb 2024 08:51:26 +0000 (05:51 -0300)]
main : check if input files exist before proceeding (whisper/1872)

Until the most recent commit (3d42463), the main.cpp example did not
check whether the input files exist. As a consequence, the model was
loaded first, before any failure to process a file was reported. In
environments with an HDD, this can take about 50 seconds or more,
depending on the loaded model.

This commit addresses the issue by checking in advance whether the
input files exist.
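
A minimal sketch of the fail-fast check, assuming C++17 (the structure is illustrative, not the actual main.cpp):

```cpp
#include <cstdio>
#include <filesystem>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    const std::vector<std::string> inputs(argv + 1, argv + argc);

    // verify all inputs before paying the (possibly ~50 s) model-load cost
    for (const auto & fname : inputs) {
        if (!std::filesystem::exists(fname)) {
            fprintf(stderr, "error: input file not found '%s'\n", fname.c_str());
            return 1;
        }
    }

    // ... only now load the model and process the files ...
    return 0;
}
```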

2 years ago examples : clean up common code (whisper/1871)
Felix [Mon, 19 Feb 2024 08:50:15 +0000 (09:50 +0100)]
examples : clean up common code (whisper/1871)

move some utility functions into common.h

2 years ago whisper : fix external encoder (whisper/1860)
Georgi Gerganov [Mon, 12 Feb 2024 17:53:51 +0000 (19:53 +0200)]
whisper : fix external encoder (whisper/1860)

2 years ago ggml : resolve merge conflicts (#0)
Georgi Gerganov [Mon, 19 Feb 2024 13:33:51 +0000 (15:33 +0200)]
ggml : resolve merge conflicts (#0)

ggml-ci

2 years ago common : add IQ1_S (#0)
Georgi Gerganov [Mon, 19 Feb 2024 13:27:37 +0000 (15:27 +0200)]
common : add IQ1_S (#0)

ggml-ci

2 years ago sync : llama.cpp
Georgi Gerganov [Mon, 19 Feb 2024 13:19:26 +0000 (15:19 +0200)]
sync : llama.cpp

2 years ago ci : enable -Werror for CUDA builds (llama/5579)
Georgi Gerganov [Mon, 19 Feb 2024 12:45:41 +0000 (14:45 +0200)]
ci : enable -Werror for CUDA builds (llama/5579)

* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci

2 years ago cuda, metal : fix nans in soft_max (llama/5574)
slaren [Mon, 19 Feb 2024 08:04:45 +0000 (09:04 +0100)]
cuda, metal : fix nans in soft_max (llama/5574)

* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <redacted>
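
The classic source of NaNs in soft_max is exponent overflow: exp() of a large score yields inf, and inf/inf during normalization yields NaN. Subtracting the row maximum first keeps every exponent at or below zero. A scalar sketch of that formulation (the commit fixes the CUDA and Metal kernels; this only illustrates the math):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

void softmax(std::vector<float> & x) {
    float max_val = -INFINITY;
    for (float v : x) {
        max_val = std::max(max_val, v);
    }
    float sum = 0.0f;
    for (float & v : x) {
        v = std::exp(v - max_val); // exponent <= 0, so exp() cannot overflow
        sum += v;
    }
    for (float & v : x) {
        v /= sum; // sum >= 1 by construction, so no inf/inf or 0/0 here
    }
}
```
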
2 years ago ggml : android and old glibc NUMA incompatibility bugfixes (llama/5557)
bmwl [Mon, 19 Feb 2024 07:38:32 +0000 (23:38 -0800)]
ggml : android and old glibc NUMA incompatibility bugfixes (llama/5557)

* #ifdef out some NUMA code blocks for Android due to lack of support

* added __ANDROID__ #ifdef gates around the NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu() instead of the wrapper

* changed the gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc

* harmonized the #if defined blocks for the NUMA code to __gnu_linux__, since that is the only model being followed anyway

---------

Co-authored-by: root <redacted>
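
A sketch of the gating described above, under the stated assumptions (glibc only gained the getcpu() wrapper in 2.29, and __gnu_linux__ excludes Android/bionic and other non-glibc platforms); this illustrates the technique rather than reproducing the patch:

```c
#define _GNU_SOURCE
#if defined(__gnu_linux__)
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

static void get_cpu_and_node(unsigned int * cpu, unsigned int * node) {
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 29))
    getcpu(cpu, node);                    // wrapper exists on glibc >= 2.29
#else
    syscall(SYS_getcpu, cpu, node, NULL); // raw syscall on older glibc
#endif
}
#endif // __gnu_linux__
```
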
2 years ago ggml : restore vec dot stride arg names (llama/5453)
Georgi Gerganov [Sun, 18 Feb 2024 20:58:57 +0000 (22:58 +0200)]
ggml : restore vec dot stride arg names (llama/5453)

2 years ago ci : fix wikitext url + compile warnings (llama/5569)
Georgi Gerganov [Sun, 18 Feb 2024 20:39:30 +0000 (22:39 +0200)]
ci : fix wikitext url + compile warnings (llama/5569)

ggml-ci

2 years ago metal : fix unused warnings (llama/0)
Georgi Gerganov [Sun, 18 Feb 2024 19:39:58 +0000 (21:39 +0200)]
metal : fix unused warnings (llama/0)

2 years ago ggml, common, examples, tests : fixed type arguments in printf (llama/5528)
Herman Semenov [Sun, 18 Feb 2024 16:20:12 +0000 (16:20 +0000)]
ggml, common, examples, tests : fixed type arguments in printf (llama/5528)

2 years ago 1.5 bit quantization (llama/5453)
Kawrakow [Sun, 18 Feb 2024 16:16:55 +0000 (18:16 +0200)]
1.5 bit quantization (llama/5453)

* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
2 years ago ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)
Georgi Gerganov [Mon, 19 Feb 2024 13:18:09 +0000 (15:18 +0200)]
ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)
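
For reference, ALiBi replaces positional embeddings with a per-head linear penalty on attention scores, which ggml_soft_max_ext can now apply fused with the scale and mask. A sketch of the underlying math (standard ALiBi, not the ggml kernel):

```cpp
#include <cmath>
#include <vector>

// slope for each of H heads: m_h = 2^(-8h/H), h = 1..H (H a power of two)
std::vector<float> alibi_slopes(int n_head) {
    std::vector<float> m(n_head);
    for (int h = 0; h < n_head; ++h) {
        m[h] = std::pow(2.0f, -8.0f * (h + 1) / n_head);
    }
    return m;
}

// bias added to the score of query i attending to key j, before softmax;
// zero on the diagonal and increasingly negative further into the past
inline float alibi_bias(float slope, int i_query, int j_key) {
    return slope * (j_key - i_query);
}
```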

2 years ago sync : llama.cpp
Georgi Gerganov [Mon, 19 Feb 2024 13:14:20 +0000 (15:14 +0200)]
sync : llama.cpp

2 years ago ci : add an option to fail on compile warning (llama/3952)
Ananta Bastola [Sat, 17 Feb 2024 21:03:14 +0000 (16:03 -0500)]
ci : add an option to fail on compile warning (llama/3952)

* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <redacted>
2 years ago cmake : fix VULKAN and ROCm builds (llama/5525)
Georgi Gerganov [Fri, 16 Feb 2024 17:05:56 +0000 (19:05 +0200)]
cmake : fix VULKAN and ROCm builds (llama/5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

2 years ago ggml : add numa options (llama/5377)
bmwl [Fri, 16 Feb 2024 09:31:07 +0000 (01:31 -0800)]
ggml : add numa options (llama/5377)

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode not being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remove enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <redacted>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c

align parameters

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <redacted>
* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <redacted>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <redacted>
* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
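
The user-visible result of all of the above is a strategy enum instead of the old boolean, plus a separate NUMA init step. A sketch of the shape of the interface (names per this PR; treat the details as approximate, and note MIRROR is declared but not yet implemented):

```c
enum ggml_numa_strategy {
    GGML_NUMA_STRATEGY_DISABLED   = 0,
    GGML_NUMA_STRATEGY_DISTRIBUTE = 1, // spread execution evenly over all nodes
    GGML_NUMA_STRATEGY_ISOLATE    = 2, // only spawn threads on the current node
    GGML_NUMA_STRATEGY_NUMACTL    = 3, // inherit the policy set by numactl
    GGML_NUMA_STRATEGY_MIRROR     = 4, // planned: mirror the model across nodes
    GGML_NUMA_STRATEGY_COUNT
};

// callers now initialize NUMA separately, instead of llama_backend_init(bool):
//   llama_backend_init();
//   llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);
```
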
2 years ago cuda : print message when initialization fails (llama/5512)
slaren [Thu, 15 Feb 2024 15:49:01 +0000 (16:49 +0100)]
cuda : print message when initialization fails (llama/5512)

* cuda : print message when initialization fails

* use CUDA_NAME both times

2 years ago vulkan: Find optimal memory type but with fallback (llama/5381)
Neuman Vong [Thu, 15 Feb 2024 06:11:15 +0000 (17:11 +1100)]
vulkan: Find optimal memory type but with fallback (llama/5381)

* @0cc4m feedback

* More feedback @0cc4m

2 years ago Early return for zero size calls to get_tensor. (llama/5482)
AT [Tue, 13 Feb 2024 21:44:25 +0000 (15:44 -0600)]
Early return for zero size calls to get_tensor. (llama/5482)

* Early return for zero size calls to get_tensor.

Signed-off-by: Adam Treat <redacted>
* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Add an early return to the get/set tensor when the size is zero.

Signed-off-by: Adam Treat <redacted>
* Early return after the assertions.

Signed-off-by: Adam Treat <redacted>
* Since we now do the early return in the generic backend, there is no reason to do it here as well.

Signed-off-by: Adam Treat <redacted>
---------

Signed-off-by: Adam Treat <redacted>
Co-authored-by: Georgi Gerganov <redacted>
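
The pattern, sketched with illustrative names (the assertions stay first, then the no-op case returns before any backend work):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

void buffer_get_tensor(const uint8_t * buf_base, size_t buf_size,
                       void * dst, size_t offset, size_t size) {
    assert(offset + size <= buf_size); // keep the assertions first
    if (size == 0) {
        return; // zero-size get/set is a no-op: skip the device round-trip
    }
    memcpy(dst, buf_base + offset, size);
}
```
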
2 years ago tests : disable moe test (llama/5473)
Georgi Gerganov [Tue, 13 Feb 2024 09:20:24 +0000 (11:20 +0200)]
tests : disable moe test (llama/5473)

2 years ago ggml-quants : fix compiler warnings (shadow variable) (llama/5472)
Kawrakow [Tue, 13 Feb 2024 07:07:57 +0000 (09:07 +0200)]
ggml-quants : fix compiler warnings (shadow variable) (llama/5472)

Co-authored-by: Iwan Kawrakow <redacted>
2 years ago ggml-sycl: Replace 3d ops with macro (llama/5458)
Abhilash Majumder [Mon, 12 Feb 2024 14:52:05 +0000 (20:22 +0530)]
ggml-sycl: Replace 3d ops with macro (llama/5458)

* use macro

* use macro

* fix format

2 years ago cmake : update CBLAS build flags (#0)
Georgi Gerganov [Mon, 19 Feb 2024 12:41:31 +0000 (14:41 +0200)]
cmake : update CBLAS build flags (#0)

2 years ago ggml-alloc : allocate all leafs as if they were inputs (#731)
slaren [Mon, 12 Feb 2024 17:07:14 +0000 (18:07 +0100)]
ggml-alloc : allocate all leafs as if they were inputs (#731)

* ggml-alloc : allocate all leafs as if they were inputs

* ensure static leafs are allocated

* gpt-2-backend : remove unnecessary ggml_new_tensor

* update other gpt-2 examples to remove ggml_new_tensor calls in the graph

2 years ago sync : whisper.cpp
Georgi Gerganov [Mon, 12 Feb 2024 07:32:58 +0000 (09:32 +0200)]
sync : whisper.cpp

2 years ago examples : added audio_ctx argument to main and server (whisper/1857)
dscripka [Mon, 12 Feb 2024 07:19:07 +0000 (02:19 -0500)]
examples : added audio_ctx argument to main and server (whisper/1857)

* added audio_ctx argument to main and server examples

* Better default value

Co-authored-by: Georgi Gerganov <redacted>
* better default value (again)

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
2 years ago metal : option to embed MSL source into compiled binary (whisper/1842)
Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]
metal : option to embed MSL source into compiled binary (whisper/1842)

* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate the Metal library embedding assembly on the fly during the build process

2 years ago examples : initialize context params properly (whisper/1852)
Georgi Gerganov [Sun, 11 Feb 2024 14:39:12 +0000 (16:39 +0200)]
examples : initialize context params properly (whisper/1852)

2 years ago sync : llama.cpp
Georgi Gerganov [Mon, 12 Feb 2024 07:30:12 +0000 (09:30 +0200)]
sync : llama.cpp

2 years ago ggml-backend : sync remnant
Georgi Gerganov [Mon, 12 Feb 2024 07:27:57 +0000 (09:27 +0200)]
ggml-backend : sync remnant

2 years ago CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)
Johannes Gäßler [Sun, 11 Feb 2024 18:08:39 +0000 (19:08 +0100)]
CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
2 years ago vulkan: only use M-sized matmul on Apple GPUs (llama/5412)
Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]
vulkan: only use M-sized matmul on Apple GPUs (llama/5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <redacted>
* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs; force using
the M-size variant with this vendor.

Signed-off-by: Sergio Lopez <redacted>
---------

Signed-off-by: Sergio Lopez <redacted>
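
A hedged sketch of the vendor gate (0x106B is Apple's Vulkan vendor ID; the pipeline names and size thresholds here are illustrative, not the actual heuristic):

```cpp
#include <vulkan/vulkan.h>

enum matmul_pipeline { MATMUL_S, MATMUL_M, MATMUL_L };

matmul_pipeline guess_matmul_pipeline(VkPhysicalDevice dev, uint32_t m, uint32_t n) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(dev, &props);

    if (props.vendorID == 0x106B) {
        // Apple GPUs: the S and L variants miscompute, always take the M path
        return MATMUL_M;
    }
    if (m <= 32 || n <= 32) { return MATMUL_S; }
    if (m <= 64 || n <= 64) { return MATMUL_M; }
    return MATMUL_L;
}
```
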
2 years ago ggml : fix compile warnings (unused vars) (llama/4966)
Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]
ggml : fix compile warnings (unused vars) (llama/4966)

2 years ago ggml : add mmla kernels for quantized GEMM (llama/4966)
snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]
ggml : add mmla kernels for quantized GEMM (llama/4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

Armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
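
For illustration, the instruction behind these kernels: SMMLA multiplies a 2x8 int8 matrix by an 8x2 int8 matrix and accumulates into a 2x2 int32 tile, i.e. twice the multiply-accumulates of an SDOT per instruction. A sketch under the assumption that rows and columns are pre-packed in pairs (not the actual ggml kernels):

```cpp
#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>
#include <stdint.h>

// a: two rows of k int8 values, interleaved in blocks of 8 per row;
// b: two columns of k int8 values, packed the same way
static int32x4_t gemm_2x2_tile(const int8_t * a, const int8_t * b, int k) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < k; i += 8) {
        const int8x16_t va = vld1q_s8(a + 2 * i); // 2x8 block of A
        const int8x16_t vb = vld1q_s8(b + 2 * i); // 8x2 block of B
        acc = vmmlaq_s32(acc, va, vb);            // acc += A_block * B_block
    }
    return acc; // { c00, c01, c10, c11 }
}
#endif
```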

2 years ago metal : use autoreleasepool to avoid memory leaks (llama/5437)
Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]
metal : use autoreleasepool to avoid memory leaks (llama/5437)

There appears to be a known memory leak when using
`MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in
[1,2].

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

2 years ago ggml-alloc : v3 (#727)
slaren [Sun, 11 Feb 2024 12:37:58 +0000 (13:37 +0100)]
ggml-alloc : v3 (#727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

2 years ago examples : remove old stuff (#728)
Georgi Gerganov [Sat, 10 Feb 2024 14:04:18 +0000 (16:04 +0200)]
examples : remove old stuff (#728)

* examples : remove old stuff

ggml-ci

* readme : remove examples links

2 years ago sync : whisper.cpp
Georgi Gerganov [Sat, 10 Feb 2024 08:09:09 +0000 (10:09 +0200)]
sync : whisper.cpp

2 years ago whisper : expose CUDA device setting in public API (whisper/1840)
Didzis Gosko [Fri, 9 Feb 2024 15:27:47 +0000 (17:27 +0200)]
whisper : expose CUDA device setting in public API (whisper/1840)

* Makefile : allow to override CUDA_ARCH_FLAG

* whisper : allow to select GPU (CUDA) device from public API
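
A sketch of how the new setting is used (field names per this PR; treat the exact shape as approximate):

```cpp
#include "whisper.h"

struct whisper_context * init_on_second_gpu(const char * model_path) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu    = true;
    cparams.gpu_device = 1; // pick CUDA device 1 instead of the default 0
    return whisper_init_from_file_with_params(model_path, cparams);
}
```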

2 years ago sync : ggml (whisper/0)
Georgi Gerganov [Tue, 30 Jan 2024 19:30:26 +0000 (21:30 +0200)]
sync : ggml (whisper/0)

2 years ago src : relocate new backend sources
Georgi Gerganov [Sat, 10 Feb 2024 07:50:24 +0000 (09:50 +0200)]
src : relocate new backend sources

2 years ago sync : llama.cpp
Georgi Gerganov [Sat, 10 Feb 2024 07:46:12 +0000 (09:46 +0200)]
sync : llama.cpp

2 years ago ci : fix mpt test
Georgi Gerganov [Sat, 10 Feb 2024 07:46:00 +0000 (09:46 +0200)]
ci : fix mpt test

2 years ago tests : fix im2col usage
Georgi Gerganov [Sat, 10 Feb 2024 07:45:40 +0000 (09:45 +0200)]
tests : fix im2col usage

2 years ago ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)
Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]
ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)

2 years ago Fix Vulkan crash on APUs with very little device memory (llama/5424)
0cc4m [Fri, 9 Feb 2024 05:52:33 +0000 (06:52 +0100)]
Fix Vulkan crash on APUs with very little device memory (llama/5424)

* Fix Vulkan crash on APUs with very little device memory

* Fix debug output function names

2 years ago CUDA: more warps for mmvq on NVIDIA (llama/5394)
Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]
CUDA: more warps for mmvq on NVIDIA (llama/5394)

2 years ago Fix f16_sycl cpy call from Arc (llama/5411)
Abhilash Majumder [Thu, 8 Feb 2024 17:09:10 +0000 (22:39 +0530)]
Fix f16_sycl cpy call from Arc (llama/5411)

* fix f16_sycl cpy call

* rm old logic

* add fp16 build CI

* use macro

* format fix

2 years ago CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)
Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)

2 years ago Basic Vulkan Multi-GPU implementation (llama/5321)
0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]
Basic Vulkan Multi-GPU implementation (llama/5321)

* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only print device info at the beginning, and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <redacted>
2 years ago CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)
Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)

2 years ago Slight quantization improvement for Q4_K and Q5_K (llama/5361)
Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]
Slight quantization improvement for Q4_K and Q5_K (llama/5361)

* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <redacted>
2 years ago CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)
Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]
CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)

2 years ago ggml : make use of ggml-quants.h possible in C++ code (llama/5338)
Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]
ggml : make use of ggml-quants.h possible in C++ code (llama/5338)

* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <redacted>
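
The gist of the fix, sketched: C needs static_assert mapped onto _Static_assert, but C++ already has static_assert as a keyword, so the shim must be compiled out under C++:

```c
#if !defined(__cplusplus)
    #if !defined(static_assert)
        // C11 spelling; harmless if <assert.h> already provided the macro
        #define static_assert(cond, msg) _Static_assert(cond, msg)
    #endif
#endif
```
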
2 years ago ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)
Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]
ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)

* Avoid duplicating function calls when using MIN/MAX macros.

Since these macros textually copy "a" and "b", the compiler is asked to evaluate one of them twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls simply happen twice.
By explicitly evaluating the expression once beforehand, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <redacted>
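
The hazard, made concrete (a self-contained C example; expensive() is a stand-in for a real call such as those in ggml_rope_yarn_corr_dims):

```c
#include <math.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

static float expensive(float x) {
    return sinf(x) * cosf(x); // stand-in for a non-trivial call
}

float demo(float x) {
    // expands to ((expensive(x)) > (0.0f) ? (expensive(x)) : (0.0f)):
    // the call executes twice whenever it wins the comparison
    float bad = MAX(expensive(x), 0.0f);

    // evaluate once into a local first; the macro then duplicates only
    // the cheap variable read
    float tmp  = expensive(x);
    float good = MAX(tmp, 0.0f);

    return bad + good;
}
```
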
2 years ago iq2_xxs: tune quantization (llama/5320)
Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]
iq2_xxs: tune quantization (llama/5320)

We get slightly better PPL, and we cut quantization time nearly
in half.

The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale obtained
that way.

Co-authored-by: Iwan Kawrakow <redacted>
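
A generic sketch of the two-pass idea, with illustrative constants and a hypothetical project() callback standing in for the E8-lattice projection (the real iq2_xxs code is considerably more involved):

```cpp
#include <algorithm>
#include <cmath>

// project(): hypothetical projection of x/d onto the constrained grid
float find_scale(const float * x, int n, int (*project)(float)) {
    // pass 1: cheap unconstrained scale, plain round-to-nearest
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.0f) { return 0.0f; }
    const float d0 = amax / 3.0f; // 3 = illustrative max grid magnitude

    // pass 2: search only a narrow band around d0, now with projection
    float best_d = d0, best_err = INFINITY;
    for (int s = -4; s <= 4; ++s) {
        const float d = d0 * (1.0f + 0.05f * s);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            const float e = x[i] - d * project(x[i] / d);
            err += e * e;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```
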
2 years ago Fix cpy with dims of 3 (llama/5289)
AidanBeltonS [Mon, 5 Feb 2024 07:08:24 +0000 (07:08 +0000)]
Fix cpy with dims of 3 (llama/5289)

* Fix cpy with dims of 3

* rm asserts

---------

Co-authored-by: Abhilash Majumder <redacted>
2 years ago Vulkan Intel Fixes, Optimizations and Debugging Flags (llama/5301)
0cc4m [Sat, 3 Feb 2024 17:15:00 +0000 (18:15 +0100)]
Vulkan Intel Fixes, Optimizations and Debugging Flags (llama/5301)

* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt

2 years ago Fix im2col with 32fp (llama/5286)
AidanBeltonS [Sat, 3 Feb 2024 08:11:37 +0000 (08:11 +0000)]
Fix im2col with 32fp (llama/5286)

2 years ago Tidy ggml-sycl (llama/5261)
AidanBeltonS [Fri, 2 Feb 2024 08:39:48 +0000 (08:39 +0000)]
Tidy ggml-sycl (llama/5261)

* Tidy some code in ggml-sycl

* Remove blank space

* Remove std::printf comments

---------

Co-authored-by: Abhilash Majumder <redacted>
2 years ago get MAX_MEM_ALLOC from device property (llama/5270)
Meng, Hengyu [Fri, 2 Feb 2024 07:54:14 +0000 (15:54 +0800)]
get MAX_MEM_ALLOC from device property (llama/5270)

* get max alloc size from device prop

* fix macro typo

2 years ago add --no-mmap in llama-bench (llama/5257)
Neo Zhang Jianyu [Thu, 1 Feb 2024 19:48:53 +0000 (03:48 +0800)]
add --no-mmap in llama-bench (llama/5257)

* add --no-mmap, show sycl backend

* fix conflict

* fix code format, change print for --no-mmap

* rename no_mmap to mmap; show mmap in the printer when it is not the default value

* update guide for mmap

* move position to reduce model reloads

2 years ago Vulkan Phi Fix for AMD Proprietary Drivers (llama/5260)
0cc4m [Thu, 1 Feb 2024 18:25:24 +0000 (19:25 +0100)]
Vulkan Phi Fix for AMD Proprietary Drivers (llama/5260)

* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver

* Fix another Vulkan CPY buffer size bug
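
The tanh replacement works because of how the naive formulation overflows. A sketch of the math (the actual change is in the Vulkan GLSL gelu shader):

```cpp
#include <cmath>

// naive: (e^{2x} - 1) / (e^{2x} + 1) becomes inf/inf = NaN once exp overflows
float tanh_naive(float x) {
    const float e = std::exp(2.0f * x);
    return (e - 1.0f) / (e + 1.0f);
}

// rewrite: 1 - 2 / (e^{2x} + 1) saturates cleanly instead: overflow gives
// 1 - 0 = 1, and e^{2x} -> 0 gives 1 - 2 = -1, so the surrounding
// gelu(x) = 0.5 * x * (1 + tanh(...)) stays finite
float tanh_safe(float x) {
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}
```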

2 years ago cuda : fix LLAMA_CUDA_F16 (llama/5262)
slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]
cuda : fix LLAMA_CUDA_F16 (llama/5262)

2 years ago metal : add im2col F32 dst support (llama/5132)
Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]
metal : add im2col F32 dst support (llama/5132)

2 years ago llava : add MobileVLM support (llama/5132)
JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]
llava : add MobileVLM support (llama/5132)

* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <redacted>
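
Reference formulas for the two new activation ops, as a scalar sketch (the commit implements them as CUDA kernels):

```cpp
#include <algorithm>

float hardsigmoid(float x) {
    // relu6(x + 3) / 6 : piecewise-linear approximation of sigmoid
    return std::min(std::max(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}

float hardswish(float x) {
    // x * hardsigmoid(x) : cheap approximation of x * sigmoid(x) (SiLU)
    return x * hardsigmoid(x);
}
```
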
2 years ago format license text, restore apache license by legal suggestion (llama/5233)
Neo Zhang Jianyu [Wed, 31 Jan 2024 13:04:46 +0000 (21:04 +0800)]
format license text, restore apache license by legal suggestion (llama/5233)

2 years ago ggml : limit n_threads to the max n_tasks (llama/5238)
slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]
ggml : limit n_threads to the max n_tasks (llama/5238)

2 years ago Vulkan Fixes (llama/5223)
0cc4m [Wed, 31 Jan 2024 10:44:19 +0000 (11:44 +0100)]
Vulkan Fixes (llama/5223)

* Fix Vulkan F16 models

* Fix Vulkan context shift crash

* Add Vulkan to common.cpp dump_non_result_info_yaml function

* Fix bug in Vulkan CPY op

* Fix small matrix multiplication errors on AMD GPUs on Windows or with amdvlk

Co-authored-by: Engininja2 <redacted>
---------

Co-authored-by: Engininja2 <redacted>
2 years ago kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)
Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]
kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)

2 years ago ggml : add abort_callback for cpu backend (#725)
Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]
ggml : add abort_callback for cpu backend (#725)

* a way to use abort_callback with the cpu backend

* whisper update
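
A sketch of the intended usage (callback name and signature per this PR; hedged): the CPU backend polls the callback during graph computation and aborts when it returns true.

```cpp
#include <atomic>
#include "ggml-backend.h"

static std::atomic<bool> g_should_abort{false};

static bool my_abort_cb(void * /*user_data*/) {
    return g_should_abort.load(); // true => stop computing the graph
}

void install_abort_hook(ggml_backend_t cpu_backend) {
    ggml_backend_cpu_set_abort_callback(cpu_backend, my_abort_cb, nullptr);
    // another thread may set g_should_abort = true to cancel a long
    // computation; whisper.cpp routes its own abort flag through this hook
}
```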

2 years ago sync : whisper.cpp
Georgi Gerganov [Tue, 30 Jan 2024 19:27:42 +0000 (21:27 +0200)]
sync : whisper.cpp

2 years ago common : fix wav buffer detection (whisper/1819)
JacobLinCool [Tue, 30 Jan 2024 17:35:08 +0000 (01:35 +0800)]
common : fix wav buffer detection (whisper/1819)

2 years ago sync : llama.cpp
Georgi Gerganov [Tue, 30 Jan 2024 19:27:12 +0000 (21:27 +0200)]
sync : llama.cpp

2 years ago ggml : fix IQ3_XXS on Metal (llama/5219)
Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]
ggml : fix IQ3_XXS on Metal (llama/5219)

Co-authored-by: Iwan Kawrakow <redacted>
2 years ago sync : ggml (llama/0)
Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]
sync : ggml (llama/0)

2 years ago Faster AVX2 dot product for IQ2_XS (llama/5187)
Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]
Faster AVX2 dot product for IQ2_XS (llama/5187)

* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 improvement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>
2 years ago SOTA 3-bit quants (llama/5196)
Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]
SOTA 3-bit quants (llama/5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <redacted>
2 years ago Vulkan Windows APU Memory Handling (llama/5199)
0cc4m [Tue, 30 Jan 2024 12:59:30 +0000 (13:59 +0100)]
Vulkan Windows APU Memory Handling (llama/5199)

* Add basic UMA memory handling

Improve memory OOM behavior

Fix tests

* Fix UMA handling

* Also fix UMA handling for prealloc buffers

* Remove unnecessary warning message

* Remove outdated comment

2 years ago ggml alloc: Fix for null dereference on alloc failure (llama/5200)
Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]
ggml alloc: Fix for null dereference on alloc failure (llama/5200)

* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix
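
The shape of the fix, sketched against the public ggml-backend API (the error handling here is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include "ggml-alloc.h"
#include "ggml-backend.h"

bool alloc_tensors(struct ggml_context * ctx, ggml_backend_t backend) {
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
    if (buf == NULL) {
        // e.g. Metal refused the allocation: report instead of dereferencing
        return false;
    }
    // ... use the tensors ...
    ggml_backend_buffer_free(buf); // free the buffer object, not a raw pointer
    return true;
}
```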

2 years ago Nomic Vulkan backend (llama/4456)
Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]
Nomic Vulkan backend (llama/4456)

Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
2 years ago ggml : add max buffer sizes to opencl and metal backends (llama/5181)
slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]
ggml : add max buffer sizes to opencl and metal backends (llama/5181)

2 years ago metal : free metal objects (llama/5161)
Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]
metal : free metal objects (llama/5161)

* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix

2 years ago gguf : fix comparison (#715)
Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]
gguf : fix comparison (#715)

ggml-ci

2 years ago `ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (#686)
John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]
`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (#686)

* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcasting and 4d tensor CUDA copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <redacted>
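
A minimal sketch of the upcasting path (an illustrative contiguous kernel; the real ggml_cuda_cpy also handles non-contiguous 4d strides):

```cpp
#include <cuda_fp16.h>
#include <cstdint>

__global__ void cpy_f16_to_f32(const __half * src, float * dst, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __half2float(src[i]); // per-element upcast
    }
}

// launch with one thread per element:
//   cpy_f16_to_f32<<<(n + 255) / 256, 256>>>(src, dst, n);
```
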
2 years ago gguf : add input validation, prevent integer overflows (#709)
Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]
gguf : add input validation, prevent integer overflows (#709)

* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
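
The core validation pattern, sketched (the limit constant is illustrative, not the one in the commit):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_N_TENSORS (1u << 20) /* illustrative hard limit */

void * alloc_tensor_infos(uint64_t n_tensors, size_t info_size) {
    if (n_tensors == 0 || n_tensors > MAX_N_TENSORS) {
        return NULL; /* reject absurd counts read from a hostile file */
    }
    /* overflow-checked multiply: refuse if n * size would wrap */
    if (info_size != 0 && n_tensors > SIZE_MAX / info_size) {
        return NULL;
    }
    void * p = malloc((size_t) n_tensors * info_size);
    /* the commit also asserts that allocations like this succeed */
    return p;
}
```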

2 years ago ci : fix yolo URLs + fix metal capture (#712)
Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]
ci : fix yolo URLs + fix metal capture (#712)

2 years ago metal : add debug capture backend function (#694)
Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]
metal : add debug capture backend function (#694)

Co-authored-by: Georgi Gerganov <redacted>
2 years ago sync : llama.cpp
Georgi Gerganov [Sun, 28 Jan 2024 17:49:41 +0000 (19:49 +0200)]
sync : llama.cpp

2 years ago sync : whisper.cpp
Georgi Gerganov [Sun, 28 Jan 2024 17:45:08 +0000 (19:45 +0200)]
sync : whisper.cpp