git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]
vulkan: only use M-sized matmul on Apple GPUs (llama/5412)
* vulkan: refactor guess_matmul_pipeline for vendor
Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.
Signed-off-by: Sergio Lopez <redacted>
* vulkan: only use M-sized matmul on Apple GPUs
L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.
Signed-off-by: Sergio Lopez <redacted>
---------
Signed-off-by: Sergio Lopez <redacted>
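A minimal sketch of the per-vendor dispatch this refactor makes room for (the enum, function name, and size heuristic below are illustrative, not the actual ggml_vk_guess_matmul_pipeline internals): on Apple GPUs only the M-sized pipeline is returned, other vendors keep a shape-based heuristic.

```cpp
// Illustrative sketch only: names and thresholds are hypothetical.
#include <cstdint>

enum matmul_pipeline { MATMUL_S, MATMUL_M, MATMUL_L };

static matmul_pipeline guess_matmul_pipeline(uint32_t vendor_id, uint64_t m, uint64_t n) {
    // Apple GPUs (Vulkan vendor ID 0x106B): L- and S-sized matmul shaders are broken,
    // so always fall back to the M-sized pipeline.
    if (vendor_id == 0x106B) {
        return MATMUL_M;
    }
    // Other vendors: keep the usual size heuristic based on the problem shape.
    if (m <= 32 || n <= 32) return MATMUL_S;
    if (m <= 64 || n <= 64) return MATMUL_M;
    return MATMUL_L;
}
```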
Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]
ggml : fix compile warnings (unused vars) (llama/4966)
snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]
ggml : add mmla kernels for quantized GEMM (llama/4966)
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q8_0_q8_0 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel yields up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_0_q8_0 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel yields up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_1_q8_1 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".
On AWS Graviton3 processors this kernel yields up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.
* ggml: update unit tests for the new vec_dot interface
* llama.cpp: add MATMUL_INT8 capability to system_info
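A hedged sketch of the compile-time guard and the core intrinsic the MMLA kernels above rely on (the wrapper function is illustrative; the real vec_dot kernels tile and accumulate whole quantized blocks):

```cpp
#include <arm_neon.h>

// Only compiled where the int8 matrix-multiply extension (i8mm) is available.
#if defined(__ARM_FEATURE_MATMUL_INT8)
// SMMLA: multiply a 2x8 int8 tile by an 8x2 int8 tile and accumulate the 2x2
// int32 result into acc. Higher throughput than the sdot-based path.
static inline int32x4_t mmla_tile(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
#endif
```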
Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]
metal : use autoreleasepool to avoid memory leaks (llama/5437)
There appears to be a known memory leak when using
`MTLCommandBuffer`; it is suggested to use `@autoreleasepool`, see
[1,2].
[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931
This change-set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.
This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436
slaren [Sun, 11 Feb 2024 12:37:58 +0000 (13:37 +0100)]
ggml-alloc : v3 (ggml/727)
* ggml-alloc v3
ggml-ci
* fix ci
ggml-ci
* whisper : check for backend buffer allocation failures
* whisper : avoid leaks when initialization fails
* cleanup
ggml-ci
* style fixes
ggml-ci
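On the whisper side, the change amounts to checking allocation results instead of assuming success. A minimal sketch of that pattern (hedged: the helper wrapper and error path here are illustrative, not the exact whisper.cpp code; it assumes the ggml-backend allocation API of this period):

```cpp
#include <cstdio>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch: allocate all tensors of a context on a backend and fail gracefully.
static bool alloc_model_buffer(struct ggml_context * ctx, ggml_backend_t backend,
                               ggml_backend_buffer_t & buf) {
    buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
    if (buf == nullptr) {
        // previously assumed to succeed; the v3 allocator makes the failure visible
        fprintf(stderr, "%s: failed to allocate backend buffer\n", __func__);
        return false;
    }
    return true;
}
```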
dscripka [Mon, 12 Feb 2024 07:19:07 +0000 (02:19 -0500)]
examples : added audio_ctx argument to main and server (#1857)
* added audio_ctx argument to main and server examples
* Better default value
Co-authored-by: Georgi Gerganov <redacted>
* better default value (again)
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
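For reference, wiring the new argument through amounts to setting the existing `audio_ctx` field of `whisper_full_params`; a minimal sketch with command-line parsing omitted:

```cpp
#include "whisper.h"

// Sketch: a smaller audio_ctx truncates the encoder context, trading accuracy for speed.
static whisper_full_params make_params(int audio_ctx) {
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.audio_ctx = audio_ctx; // 0 (the default) keeps the full 1500-frame context
    return wparams;
}
```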
Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]
metal : option to embed MSL source into compiled binary (#1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate Metal library embedding assembly on-fly during build process
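The embedding trick generates a small assembly stub at build time that pulls the .metal source into the binary between two symbols, which the loader then reads instead of the filesystem. A hedged sketch of both halves (symbol and section names are illustrative):

```cpp
#include <string>

// Generated at build time (sketch of the assembly stub):
//   .section __DATA,__ggml_metallib
//   .globl _ggml_metallib_start
//   _ggml_metallib_start:
//   .incbin "ggml-metal.metal"
//   .globl _ggml_metallib_end
//   _ggml_metallib_end:

// C++ side: read the library source from the embedded byte range.
extern const char ggml_metallib_start;
extern const char ggml_metallib_end;

static std::string embedded_metal_source() {
    return std::string(&ggml_metallib_start, &ggml_metallib_end - &ggml_metallib_start);
}
```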
Georgi Gerganov [Sun, 11 Feb 2024 14:39:12 +0000 (16:39 +0200)]
examples : initialize context params properly (#1852)
Georgi Gerganov [Sat, 10 Feb 2024 08:10:59 +0000 (10:10 +0200)]
talk-llama : sync llama.cpp
Georgi Gerganov [Sat, 10 Feb 2024 07:56:47 +0000 (09:56 +0200)]
sync : ggml
Georgi Gerganov [Sat, 10 Feb 2024 07:50:24 +0000 (09:50 +0200)]
src : relocate new backend sources
Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]
ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)
Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]
CUDA: more warps for mmvq on NVIDIA (llama/5394)
Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)
0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]
Basic Vulkan Multi-GPU implementation (llama/5321)
* Initial Vulkan multi-gpu implementation
Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only print device info at the beginning and initialize one backend for CPU assist
Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <redacted>
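A hedged sketch of the layer-to-device assignment that the de-duplicated split logic above performs (illustrative, not the llama.cpp internals): each repeating layer is mapped to a device in proportion to a user-supplied tensor split.

```cpp
#include <vector>

// Sketch: map layer i_layer to a device index according to cumulative split
// fractions, e.g. {0.6f, 0.4f} over two GPUs (fractions assumed to sum to 1).
static int device_for_layer(int i_layer, int n_layer, const std::vector<float> & split) {
    const float frac = float(i_layer) / float(n_layer);
    float acc = 0.0f;
    for (size_t d = 0; d < split.size(); ++d) {
        acc += split[d];
        if (frac < acc) return (int) d;
    }
    return (int) split.size() - 1;
}
```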
Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)
Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]
Slight quantization improvement for Q4_K and Q5_K (llama/5361)
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <redacted>
Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]
CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)
Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]
ggml : make use of ggml-quants.h possible in C++ code (llama/5338)
* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <redacted>
Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]
ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)
* Avoid duplicating function calls when using MIN/MAX macros.
Since these macros textually duplicate "a" and "b", they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression first we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
Code behaves exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <redacted>
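The issue generalizes to any function-style macro: each argument is substituted textually, so a call passed into MAX runs twice. A small illustration of the problem and the fix described above:

```cpp
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int expensive(int x) { return x * x; } // stand-in for a non-trivial call

static int bad(int x)  { return MAX(0, expensive(x)); }           // expensive(x) may run twice
static int good(int x) { int v = expensive(x); return MAX(0, v); } // evaluate once, then compare
```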
Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]
iq2_xxs: tune quantization (llama/5320)
We get slightly better PPL, and we cut quantization time nearly
in half.
The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.
Co-authored-by: Iwan Kawrakow <redacted>
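In outline, this is a two-pass scale search; a schematic, hedged sketch of the idea only (the real code operates on whole quantized blocks and the E8 grid, and the window width here is made up):

```cpp
// Schematic: find an initial block scale without the lattice constraint,
// then search a narrow window around it with the constraint applied.
static float tuned_scale(float initial_guess, float (*score)(float)) {
    float best_s = initial_guess;
    float best   = score(initial_guess);
    for (int step = -4; step <= 4; ++step) {              // narrow window around the first pass
        const float s = initial_guess * (1.0f + 0.01f * step);
        const float v = score(s);
        if (v > best) { best = v; best_s = s; }
    }
    return best_s;
}
```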
slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]
cuda : fix LLAMA_CUDA_F16 (llama/5262)
Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]
metal : add im2col F32 dst support (llama/5132)
JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]
llava : add MobileVLM support (llama/5132)
* New Feature:
1. Sum_Rows:
fix cuda kernel overflow
fix block shape error when nrows too big
2. Im2Col:
Support Batch in cuda
Support f32 to f32 both in cpu && cuda
3. DepthWiseConv:
Support by Im2Col && MulMat
4. Pool_2d:
Support avg pooling in cuda
5. HardSigmoid:
Implement in cuda
6. HardSwish:
Implement in cuda
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* Add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <redacted>
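The DepthWiseConv item above expresses depthwise convolution as im2col followed by a matrix multiply. A minimal standalone sketch of that decomposition for a single channel (illustrative, not the ggml graph code; no padding or stride):

```cpp
#include <vector>

// Sketch: depthwise conv for one channel = im2col (unroll each k x k patch into a row)
// followed by a matrix-vector product with that channel's k x k kernel.
static std::vector<float> depthwise_conv_1ch(const std::vector<float> & src, int w, int h,
                                             const std::vector<float> & kern, int k) {
    const int ow = w - k + 1, oh = h - k + 1;
    std::vector<float> cols((size_t) ow * oh * k * k); // one row of k*k values per output pixel
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    cols[((size_t)(y * ow + x) * k + ky) * k + kx] = src[(y + ky) * w + (x + kx)];
    std::vector<float> dst((size_t) ow * oh, 0.0f);     // dst = cols * kern
    for (int p = 0; p < ow * oh; ++p)
        for (int i = 0; i < k * k; ++i)
            dst[p] += cols[(size_t) p * k * k + i] * kern[i];
    return dst;
}
```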
slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]
ggml : limit n_threads to the max n_tasks (llama/5238)
Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]
kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)
Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]
ggml : add abort_callback for cpu backend (ggml/725)
* a way to use abort_callback with the cpu backend
* whisper update
Georgi Gerganov [Sat, 10 Feb 2024 07:55:19 +0000 (09:55 +0200)]
extra : update sync scripts
Valentin Gosu [Fri, 9 Feb 2024 15:42:41 +0000 (16:42 +0100)]
server : allow CORS request with authorization headers (#1850)
The Whisper plugin in Obsidian requires an API key which is
then sent as an authorization header.
However, the presence of an authorization header triggers
a CORS preflight, so both the OPTIONS method and
the Access-Control-Allow-Headers: authorization header must be
handled.
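A hedged sketch of the server-side handling with cpp-httplib (the route and header values are illustrative, not necessarily what the example server registers):

```cpp
#include "httplib.h"

// Sketch: answer the CORS preflight so browsers may send an Authorization header.
static void register_cors(httplib::Server & svr) {
    svr.Options("/inference", [](const httplib::Request &, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin",  "*");
        res.set_header("Access-Control-Allow-Methods", "POST, OPTIONS");
        res.set_header("Access-Control-Allow-Headers", "content-type, authorization");
    });
}
```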
Neuman Vong [Fri, 9 Feb 2024 15:39:05 +0000 (02:39 +1100)]
whisper.android : how to build with CLBlast (#1809)
* FetchContent
* OpenCL
* Documentation and make optional
* Specify GGML build options in build.gradle
* Use gradle properties
* @ggerganov
Co-authored-by: Georgi Gerganov <redacted>
* @gpokat
---------
Co-authored-by: Georgi Gerganov <redacted>
Didzis Gosko [Fri, 9 Feb 2024 15:27:47 +0000 (17:27 +0200)]
whisper : expose CUDA device setting in public API (#1840)
* Makefile : allow to override CUDA_ARCH_FLAG
* whisper : allow to select GPU (CUDA) device from public API
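From the caller's side, the new setting is a field on the context params; a minimal hedged sketch (the gpu_device field name is the one introduced by this change):

```cpp
#include "whisper.h"

// Sketch: select the second CUDA device when initializing a context.
static struct whisper_context * init_on_device(const char * model_path) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu    = true;
    cparams.gpu_device = 1; // index of the CUDA device to use
    return whisper_init_from_file_with_params(model_path, cparams);
}
```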
Didzis Gosko [Fri, 9 Feb 2024 15:26:29 +0000 (17:26 +0200)]
make : add macOS deployment target option (#1839)
Georgi Gerganov [Tue, 6 Feb 2024 17:56:12 +0000 (19:56 +0200)]
talk-llama : stream response (#1121)
Georgi Gerganov [Tue, 30 Jan 2024 19:30:26 +0000 (21:30 +0200)]
sync : ggml (#0)
Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]
ggml : fix IQ3_XXS on Metal (llama/5219)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]
sync : ggml (llama/0)
Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]
Faster AVX2 dot product for IQ2_XS (llama/5187)
* iq2xs: faster AVX2 dot product
* iq2xs: small AVX2 improvement
* Speed up computing sign bits in AVX2 iq2_xs dot product
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>
Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]
SOTA 3-bit quants (llama/5196)
* iq3_xxs: quantize/dequantize
RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717
This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
* iq3_xxs: CUDA dot product
We have
PP-512: 5891 t/s
TG-128: 143.9 t/s
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
Metal performance is decent, ARM_NEON is pathetic
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
This time the dot product accuracy did find an actual bug
in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops
---------
Co-authored-by: Iwan Kawrakow <redacted>
Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]
ggml alloc: Fix for null dereference on alloc failure (llama/5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated
* Freeing the allocated buffers rather than the pointer in ggml-alloc.c
* Fixed the fix of the fix
Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]
Nomic Vulkan backend (llama/4456)
Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]
ggml : add max buffer sizes to opencl and metal backends (llama/5181)
Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]
metal : free metal objects (llama/5161)
* Releasing MTLFunction references after Metal pipeline construction
* Keeping the `ggml_metal_kernel` structure
* Spacing fix
* Whitespace fix
Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]
gguf : fix comparison (ggml/715)
ggml-ci
John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]
`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy
* added ability to copy 4d tensors with the cuda backend
* added tests for float16->float32 upcast and 4d tensor cuda copies
* added 4d copy test for float32->float16 copy
* applied patch suggested by @iamlemec
* simplify cpy tests
---------
Co-authored-by: slaren <redacted>
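Conceptually, the new path is a strided 4-D copy with an element-type conversion; a plain C++ sketch of that loop structure (the real kernel parallelizes this on the GPU and uses the CUDA half type rather than a conversion callback):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: copy a 4-D tensor given per-dimension element counts (ne) and byte strides (nb),
// converting half-precision source elements to float on the fly.
static void cpy_f16_to_f32_4d(const uint8_t * src, float * dst,
                              const int64_t ne[4], const size_t nb[4],
                              float (*f16_to_f32)(uint16_t)) {
    int64_t i = 0;
    for (int64_t i3 = 0; i3 < ne[3]; ++i3)
    for (int64_t i2 = 0; i2 < ne[2]; ++i2)
    for (int64_t i1 = 0; i1 < ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < ne[0]; ++i0) {
        const uint8_t * p = src + i0*nb[0] + i1*nb[1] + i2*nb[2] + i3*nb[3];
        dst[i++] = f16_to_f32(*(const uint16_t *) p);
    }
}
```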
Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]
gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows
ggml-ci
* gguf : fix switch default case
* gguf : sanitize info->n_dims and info->type
ggml-ci
* gguf : assert GGUF_TYPE_SIZE access
ggml-ci
* ggml : assert mallocs are successful
ggml-ci
* gguf : prevent integer overflow
* gguf : sanitize tensor info
ggml-ci
* gguf : stricter limit on the number of items
ggml-ci
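The core of the hardening is checking arithmetic before trusting it; a hedged sketch of an overflow-checked size computation of the kind described above (the helper name is illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: refuse to compute n_elements * type_size if the multiplication would wrap,
// instead of passing a too-small, wrapped-around size to malloc.
static bool checked_mul(size_t a, size_t b, size_t & out) {
    if (a != 0 && b > SIZE_MAX / a) {
        return false; // would overflow
    }
    out = a * b;
    return true;
}
```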
Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]
ci : fix yolo URLs + fix metal capture (ggml/712)
Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]
metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <redacted>
JacobLinCool [Tue, 30 Jan 2024 17:35:08 +0000 (01:35 +0800)]
common : fix wav buffer detection (#1819)
JacobLinCool [Tue, 30 Jan 2024 12:15:55 +0000 (20:15 +0800)]
server : add fields to `verbose_json` response (#1802)
* server: include additional fields in the verbose_json response as OpenAI does
* server: show request examples on home page
* server: todo note for compression_ratio and no_speech_prob
* server: add simple demo form to the homepage
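A hedged sketch of what one segment of the extended response looks like when built with nlohmann::json (field names follow OpenAI's verbose_json; the exact set and values produced by the server may differ, and the numeric values here are placeholders):

```cpp
#include <string>
#include "json.hpp"

using json = nlohmann::json;

// Sketch: one segment of a verbose_json response.
static json make_segment(int id, double t0, double t1, const std::string & text) {
    return json{
        {"id", id}, {"start", t0}, {"end", t1}, {"text", text},
        {"temperature", 0.0}, {"avg_logprob", -0.25},
        {"compression_ratio", 1.2}, {"no_speech_prob", 0.01} // see the TODO note above
    };
}
```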
jwijffels [Tue, 30 Jan 2024 12:13:49 +0000 (13:13 +0100)]
make : update MSYS_NT (#1813)
I just upgraded the R wrapper at https://github.com/bnosac/audio.whisper to use whisper.cpp 1.5.4
I'm working on Windows and noticed while doing so that it did not pick up the relevant CFLAGS/CXXFLAGS, as my system showed
```
I whisper.cpp build info:
I UNAME_S: MSYS_NT-10.0-19045
I UNAME_P: unknown
I UNAME_M: x86_64
```
Many thanks for all the tremendous hard work on maintaining whisper.cpp!
Georgi Gerganov [Sun, 28 Jan 2024 17:44:10 +0000 (19:44 +0200)]
talk-llama : sync llama.cpp
Georgi Gerganov [Sun, 28 Jan 2024 17:30:32 +0000 (19:30 +0200)]
sync : ggml
0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]
ggml : add Vulkan backend (llama/2059)
* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication
Use semaphores for synchronization instead of fences or waitidle
Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
Co-authored-by: Concedo <redacted>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
Co-authored-by: Georgi Gerganov <redacted>
* llama : fix trailing whitespace
---------
Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
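One of the later items above caps single allocations at min(maxMemoryAllocationSize, maxBufferSize); a hedged sketch of querying those limits (the real code also checks that the maintenance4 extension is supported before chaining its properties struct):

```cpp
#include <algorithm>
#include <vulkan/vulkan.h>

// Sketch: query the two device limits and use the smaller one as the backend's
// maximum single-buffer size.
static VkDeviceSize max_alloc_size(VkPhysicalDevice dev) {
    VkPhysicalDeviceMaintenance4Properties m4 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_4_PROPERTIES };
    VkPhysicalDeviceMaintenance3Properties m3 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES };
    m3.pNext = &m4;
    VkPhysicalDeviceProperties2 props = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
    props.pNext = &m3;
    vkGetPhysicalDeviceProperties2(dev, &props);
    return std::min(m3.maxMemoryAllocationSize, m4.maxBufferSize);
}
```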
Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]
ggml : add unified SYCL backend for Intel GPUs (llama/2690)
* first update for migration
* update init_cublas
* add debug function, commit all helper code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue
* support non-zero main device
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in
658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add known issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export function print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on macOS, because the macOS CI depends on external ggml instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]
ggml : minor type fix (int64_t -> size_t)
Georgi Gerganov [Sat, 27 Jan 2024 15:33:09 +0000 (17:33 +0200)]
common : fix input buffer check (#1812)
Georgi Gerganov [Sat, 27 Jan 2024 15:24:53 +0000 (17:24 +0200)]
talk-llama : sync llama.cpp
Georgi Gerganov [Sat, 27 Jan 2024 15:23:25 +0000 (17:23 +0200)]
sync : ggml
0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]
Add OpenCL add kernel (llama/5151)
* Add OpenCL add kernel
* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]
cuda : fix tensor size calculation for non-split buffer (llama/5145)
slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]
ggml-alloc : add 10% margin to the buffer sizes (llama/5149)
snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]
ggml : update softmax n_task calculation (llama/5126)
Updated the n_task calculation to use the max number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.
Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]
metal : remove unused `n_buffers` and `buffers` (llama/5129)
Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]
metal : show compile log messages
Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]
cuda : fix 2-bit quants on amd hip (llama/5105)
* cuda : fix 2-bit quants on amd hip
* use __low2float intrinsic function for new quants
slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]
llama : pre-allocate input tensors in a separate buffer (llama/5100)
Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]
metal : disable support for MUL_MAT F32 x F16
Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]
CUDA: more info when no device code (llama/5088)
Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]
minor : clean-up some warnings and style (llama/5094)
* minor : clean-up some warnings and style
ggml-ci
* ggml : add comment
Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]
ggml : parallelize FP32 conversion when using BLAS (llama/5045)
* make the GGML_TASK_INIT phase able to run multithreaded
* multithreaded dequantize in mul_mat when using a BLAS library
* minor fixes
* update outdated comment
* fix coding style
* simplify code
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
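The idea is to let the INIT phase (the dequantize-to-F32 step that feeds BLAS) run on all threads instead of one. A schematic sketch of splitting the rows across threads (not the ggml task scheduler itself, which reuses its existing worker pool):

```cpp
#include <thread>
#include <vector>

// Schematic: dequantize rows [0, n_rows) into a float buffer using n_threads workers.
static void dequantize_parallel(int n_rows, int n_threads,
                                void (*dequantize_row)(int row)) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            for (int r = t; r < n_rows; r += n_threads) { // interleaved row split
                dequantize_row(r);
            }
        });
    }
    for (auto & w : workers) w.join();
}
```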
XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]
llava : MobileVLM support (llama/4954)
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy related code, replace the two with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks
---------
Co-authored-by: Chenxiaotao03 <redacted>
slaren [Sat, 20 Jan 2024 15:05:49 +0000 (16:05 +0100)]
llama : run all KQV ops on the CPU with no KV offload (llama/5049)
ggml-ci
Kylin [Sat, 20 Jan 2024 07:01:46 +0000 (15:01 +0800)]
cuda : fix compile error in jetson platform (llama/4975)
* cuda: fix compile error in jetson platform
* cuda: update comment in ggml-cuda.cu
* cuda: update ggml-cuda.cu comment
Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]
ggml : check ggml_add src1 type (ggml/708)
Co-authored-by: Judd <redacted>
Michael Rienstra [Fri, 26 Jan 2024 15:39:54 +0000 (07:39 -0800)]
docs : make model options / model install methods clearer (#1806)
* Make models more "discoverable"
* Clean up code block language identifiers
* make 3 options clearer
* undo Prettier formatter change
* docs: `$` shell prompt, consistently
* docs: minor changes
trixirt [Mon, 22 Jan 2024 13:02:35 +0000 (05:02 -0800)]
cmake : make libwhisper.so position independent (#1792)
This is similar to how libllama.so is built.
Signed-off-by: Tom Rix <redacted>
Georgi Gerganov [Mon, 22 Jan 2024 12:51:42 +0000 (14:51 +0200)]
cmake : temporary remove VLA check (#1795)
Neuman Vong [Fri, 19 Jan 2024 14:17:38 +0000 (01:17 +1100)]
whisper.android : return output from benchmarks (#1785)
Benchmarks are failing because JNI expects a jstring and the benchmarks
are missing a return statement (i.e., returning null). The functions
actually build a jstring but don't return it, so this seems to have been
an oversight.
This patch returns the jstring and now the benchmarks run successfully.
Fixes #1783.
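For reference, the shape of the fix: the JNI function must actually return the jstring it builds. A hedged sketch (the real function name and benchmark call differ):

```cpp
#include <jni.h>
#include <string>

// Sketch: build the benchmark report and return it to Java instead of falling off the end.
extern "C" JNIEXPORT jstring JNICALL
Java_com_example_WhisperLib_benchMemcpy(JNIEnv * env, jclass /*clazz*/, jint n_threads) {
    std::string result = "memcpy benchmark, threads=" + std::to_string(n_threads);
    return env->NewStringUTF(result.c_str()); // previously the jstring was built but never returned
}
```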
Ryan Hitchman [Thu, 18 Jan 2024 20:58:42 +0000 (13:58 -0700)]
server : implement "verbose_json" format with token details (#1781)
* examples/server: implement "verbose_json" format with token details.
This is intended to mirror the format of openai's Python
whisper.transcribe() return values.
* server: don't write WAV to a temporary file if not converting
* server: use std::lock_guard instead of manual lock/unlock
Georgi Gerganov [Thu, 18 Jan 2024 09:03:13 +0000 (11:03 +0200)]
ggml : sync ggml-metal.m
Georgi Gerganov [Wed, 17 Jan 2024 19:23:33 +0000 (21:23 +0200)]
sync : llama.cpp
Georgi Gerganov [Wed, 17 Jan 2024 19:22:38 +0000 (21:22 +0200)]
sync : ggml
Georgi Gerganov [Wed, 17 Jan 2024 16:54:56 +0000 (18:54 +0200)]
ggml : add IQ2 to test-backend-ops + refactoring (llama/4990)
* ggml : add IQ2 to test-backend-ops + refactoring
ggml-ci
* cuda : update supports_op for IQ2
ggml-ci
* ci : enable LLAMA_CUBLAS=1 for CUDA nodes
ggml-ci
* cuda : fix out-of-bounds-access in `mul_mat_vec_q`
ggml-ci
* tests : avoid creating RNGs for each Q tensor
ggml-ci
* tests : avoid creating RNGs for each tensor
ggml-ci
Georgi Gerganov [Wed, 17 Jan 2024 16:46:30 +0000 (18:46 +0200)]
imatrix : offload to GPU support (llama/4957)
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when the user doesn't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id handling
ggml-ci
* ci : add imatrix test
ggml-ci
* ci : rearrange output
ggml-ci
Georgi Gerganov [Wed, 17 Jan 2024 16:39:41 +0000 (18:39 +0200)]
backend : add eval callback (llama/4935)
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when the user doesn't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* simple : no need for ggml_is_contiguous + fix bool parse
* llama : fix callback placement in llama_context_params
* backend : avoid double-ask callback calls
* simple : restore examples, imatrix will serve as a demo
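A hedged sketch of what such an eval callback looks like from the user side (the exact typedef lives in ggml-backend.h; treat the signature and the two-phase ask/observe protocol described here as my reading of this change, not a verbatim copy):

```cpp
#include <cstdio>
#include "ggml.h"
#include "ggml-backend.h"

// Sketch of a scheduler eval callback: first queried with ask=true per node to see
// whether the user wants it, then called again with ask=false once the data is ready.
static bool observe_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;
    if (ask) {
        return t->op == GGML_OP_MUL_MAT; // only interested in matrix multiplications
    }
    fprintf(stderr, "node %s: %s\n", t->name, ggml_op_name(t->op));
    return true; // returning false aborts the graph computation
}
```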
Georgi Gerganov [Wed, 17 Jan 2024 16:38:39 +0000 (18:38 +0200)]
metal : create autorelease pool during library build (llama/4970)
* metal : create autorelease pool during library build
ggml-ci
* test : simplify
ggml-ci
Kawrakow [Tue, 16 Jan 2024 17:51:26 +0000 (19:51 +0200)]
ggml : importance matrix support for legacy quants (llama/4969)
* imatrix: adding support for legacy quants
* imatrix: guard Q4_0/Q5_0 against ffn_down craziness
---------
Co-authored-by: Iwan Kawrakow <redacted>
Alex Azarov [Tue, 16 Jan 2024 13:33:02 +0000 (14:33 +0100)]
metal : log `recommendedMaxWorkingSetSize` on iOS 16+ (llama/4936)
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <redacted>
Justine Tunney [Tue, 16 Jan 2024 11:16:33 +0000 (03:16 -0800)]
ggml : introduce GGML_CALL function annotation (llama/4850)
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.
This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
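A hedged sketch of the annotation mechanism described above (the exact preprocessor conditions in ggml.h may differ; the declared function is a made-up example):

```cpp
// Sketch: GGML_CALL pins cross-module (app <-> GPU backend) entry points to a single
// calling convention so the objects can be built separately and linked at runtime.
#if defined(GGML_MULTIPLATFORM) && !defined(_WIN32)
#    define GGML_CALL __attribute__((__ms_abi__))
#else
#    define GGML_CALL // MS ABI is already the default on Windows
#endif

// Example declaration of a function that crosses the app/backend boundary:
GGML_CALL void ggml_backend_example_free(void * backend);
```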
Georgi Gerganov [Mon, 15 Jan 2024 11:27:00 +0000 (13:27 +0200)]
cuda : fix dequantize kernel names (llama/4938)
Kawrakow [Mon, 15 Jan 2024 05:48:06 +0000 (07:48 +0200)]
CUDA: faster dequantize kernels for Q4_0 and Q4_1 (llama/4938)
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Sun, 14 Jan 2024 14:21:12 +0000 (16:21 +0200)]
Add ability to use importance matrix for all k-quants (llama/4930)
Co-authored-by: Iwan Kawrakow <redacted>
Benjamin Heiniger [Tue, 16 Jan 2024 13:52:01 +0000 (14:52 +0100)]
talk-llama : optional wake-up command and audio confirmation (#1765)
* talk-llama: add optional wake-word detection from command
* talk-llama: add optional audio confirmation before generating answer
* talk-llama: fix small formatting issue in output
* talk-llama.cpp: fix Windows build
Przemysław Pawełczyk [Mon, 15 Jan 2024 13:48:13 +0000 (14:48 +0100)]
server : fix building and simplify lib deps on Windows (#1772)
* make : fix server example building on MSYS2 environments (Windows)
It was not working since commit
eff3570f78742dfd56024328ed93d4f442434280
when server was introduced.
* cmake : simplify server example lib deps on Windows
server uses httplib::Server, not httplib::SSLServer, so there is no need
to mention cryptographic libraries in target_link_libraries.
Winsock (ws2_32) suffices here.
Also use plain library names like we use in other places.
Georgi Gerganov [Sun, 14 Jan 2024 16:08:20 +0000 (18:08 +0200)]
talk-llama : sync llama.cpp
Georgi Gerganov [Sun, 14 Jan 2024 09:06:28 +0000 (11:06 +0200)]
talk-llama : llama.cpp
Georgi Gerganov [Sun, 14 Jan 2024 08:55:18 +0000 (10:55 +0200)]
sync : ggml
Alex Azarov [Sun, 14 Jan 2024 08:44:39 +0000 (09:44 +0100)]
metal : correctly set SIMD support flags on iOS (llama/4923)
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad
* log a little bit more info on iOS
Kawrakow [Sun, 14 Jan 2024 07:45:56 +0000 (09:45 +0200)]
2-bit quantizations (llama/4897)
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse
---------
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Sun, 14 Jan 2024 08:53:19 +0000 (10:53 +0200)]
scripts : sync-ggml-am.sh add option to skip commits