git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log

commit | commitdiff | tree

Kawrakow [Tue, 13 Feb 2024 07:07:57 +0000 (09:07 +0200)]

ggml-quants : fix compiler warnings (shadow variable) (llama/5472)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Abhilash Majumder [Mon, 12 Feb 2024 14:52:05 +0000 (20:22 +0530)]

ggml-sycl: Replace 3d ops with macro (llama/5458)

* use macro

* use macro

* fix format

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:44:46 +0000 (14:44 +0200)]

build : update CBLAS flags + fix unused var warning (#0)

commit | commitdiff | tree

Davidson Francis [Mon, 19 Feb 2024 08:51:26 +0000 (05:51 -0300)]

main : check if input files exist before proceeding (#1872)

Until the most recent commit (3d42463), the main.cpp sample file does
not check whether the input files exist or not. Consequently, the
model is loaded first before reporting whether there was a failure or
not when processing a file. In environments with HDD, this can take
about 50 seconds or more, depending on the loaded model.

This commit addresses this issue by checking in advance whether the
input files exist or not.
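
The up-front check described above can be sketched roughly as follows. This is not the code from main.cpp: the helper name `find_missing_inputs` is hypothetical, and treating "-" as stdin is an assumption made for this sketch. The idea is only that every input path is probed before the (slow) model load starts.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the input paths that cannot be opened for reading.
// Assumption: "-" stands for stdin and is therefore never reported as missing.
std::vector<std::string> find_missing_inputs(const std::vector<std::string> & paths) {
    std::vector<std::string> missing;
    for (const auto & path : paths) {
        if (path == "-") {
            continue; // stdin, nothing to check on disk
        }
        std::ifstream f(path, std::ios::binary);
        if (!f.good()) {
            missing.push_back(path);
        }
    }
    return missing;
}
```

A caller would run this right after argument parsing and exit with an error if the returned list is non-empty, so a typo in a file name is reported before any model data is read.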

commit | commitdiff | tree

Felix [Mon, 19 Feb 2024 08:50:15 +0000 (09:50 +0100)]

examples : clean up common code (#1871)

move some utility functions into common.h

commit | commitdiff | tree

Jumper775 [Mon, 19 Feb 2024 02:19:47 +0000 (21:19 -0500)]

models : fix openvino setup info (#1874)

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Feb 2024 09:51:32 +0000 (11:51 +0200)]

models : add update py requirements

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 17:54:11 +0000 (19:54 +0200)]

swift : package no longer uses ggml dependency (#1861)

* Revert "swift : update Package.swift to use ggml as package dependency (#1701)"

This reverts commit 993acb5d410cd8eaebaa3fc54d4b153e04bbefce.

* spm : add ggml.h

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 17:53:51 +0000 (19:53 +0200)]

whisper : fix external encoder (#1860)

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 17:07:56 +0000 (19:07 +0200)]

sync : ggml

commit | commitdiff | tree

slaren [Mon, 12 Feb 2024 17:07:14 +0000 (18:07 +0100)]

ggml-alloc : allocate all leafs as if they were inputs (ggml/731)

* ggml-alloc : allocate all leafs as if they were inputs

* ensure static leafs are allocated

* gpt-2-backend : remove unnecessary ggml_new_tensor

* update other gpt-2 examples to remove ggml_new_tensor calls in the graph

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 08:39:58 +0000 (10:39 +0200)]

talk-llama : sync llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 07:32:15 +0000 (09:32 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 07:27:57 +0000 (09:27 +0200)]

ggml-backend : sync remnant

commit | commitdiff | tree

Johannes Gäßler [Sun, 11 Feb 2024 18:08:39 +0000 (19:08 +0100)]

CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]

vulkan: only use M-sized matmul on Apple GPUs (llama/5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <redacted>

* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, so force the
M-sized variant for this vendor.

Signed-off-by: Sergio Lopez <redacted>

---------

Signed-off-by: Sergio Lopez <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]

ggml : fix compile warnings (unused vars) (llama/4966)

commit | commitdiff | tree

snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]

ggml : add mmla kernels for quantized GEMM (llama/4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q8_0_q8_0 GEMM. The feature is enabled if the platform defines
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel gave up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_0_q8_0 GEMM. The feature is enabled if the platform defines
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel gave up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_1_q8_1 GEMM. The feature is enabled if the platform defines
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel gave up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
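
For readers unfamiliar with the MMLA instructions mentioned above, here is a scalar sketch of what a single SMMLA step computes, as I understand the instruction: a 2x8 int8 matrix times the transpose of another 2x8 int8 matrix, accumulated into a 2x2 int32 tile. This is not the ggml kernel; `has_matmul_int8` and `smmla_ref` are hypothetical names for illustration, and only the `__ARM_FEATURE_MATMUL_INT8` macro comes from the commit message.

```cpp
#include <array>
#include <cstdint>

// True when the compiler targets a CPU with the int8 matrix-multiply extension.
constexpr bool has_matmul_int8() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return true;
#else
    return false;
#endif
}

// Scalar reference for one SMMLA step (illustration only):
//   acc (2x2, int32) += A (2x8, int8) * B^T, where B is also stored as 2x8 int8.
// Each output element is an 8-way dot product of one row of A with one row of B.
void smmla_ref(std::array<int32_t, 4> & acc,
               const std::array<int8_t, 16> & a,
               const std::array<int8_t, 16> & b) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            int32_t sum = 0;
            for (int k = 0; k < 8; ++k) {
                sum += int32_t(a[8*i + k]) * int32_t(b[8*j + k]);
            }
            acc[2*i + j] += sum;
        }
    }
}
```

One such step produces four 8-way dot products at once, which is the kind of per-instruction work the commit message credits for the higher throughput relative to the DOT-based kernels.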

commit | commitdiff | tree

Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]

metal : use autoreleasepool to avoid memory leaks (llama/5437)

There appears to be a known memory leak when using
`MTLCommandBuffer`. Using `@autoreleasepool` is suggested in
[1,2].

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

commit | commitdiff | tree

slaren [Sun, 11 Feb 2024 12:37:58 +0000 (13:37 +0100)]

ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

commit | commitdiff | tree

dscripka [Mon, 12 Feb 2024 07:19:07 +0000 (02:19 -0500)]

examples : added audio_ctx argument to main and server (#1857)

* added audio_ctx argument to main and server examples

* Better default value

Co-authored-by: Georgi Gerganov <redacted>

* better default value (again)

Co-authored-by: Georgi Gerganov <redacted>

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]

metal : option to embed MSL source into compiled binary (#1842)

* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on the fly during build process

commit | commitdiff | tree

Georgi Gerganov [Sun, 11 Feb 2024 14:39:12 +0000 (16:39 +0200)]

examples : initialize context params properly (#1852)

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 08:10:59 +0000 (10:10 +0200)]

talk-llama : sync llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 07:56:47 +0000 (09:56 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 07:50:24 +0000 (09:50 +0200)]

src : relocate new backend sources

commit | commitdiff | tree

Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]

ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)

commit | commitdiff | tree

Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]

CUDA: more warps for mmvq on NVIDIA (llama/5394)

commit | commitdiff | tree

Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]

CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)

commit | commitdiff | tree

0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]

Basic Vulkan Multi-GPU implementation (llama/5321)

* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]

CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)

commit | commitdiff | tree

Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]

Slight quantization improvement for Q4_K and Q5_K (llama/5361)

* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]

CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)

commit | commitdiff | tree

Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]

ggml : make use of ggml-quants.h possible in C++ code (llama/5338)

* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]

ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325)

* Avoid duplicating function calls when using MIN/MAX macros.

Since these macros copy "a" and "b", they ask the compiler to evaluate one of them twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <redacted>
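
The double evaluation described in the MIN/MAX commit above can be reproduced with a small sketch. The macro name `SKETCH_MAX`, the function `expensive`, and the call counter are all hypothetical and exist only to make the effect observable; this is not the code from ggml.c.

```cpp
// Classic function-like macro: each argument is pasted textually wherever it appears.
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

static int g_calls = 0; // counts how often expensive() actually runs

// Hypothetical stand-in for a non-trivial function passed as a macro argument.
static int expensive(int x) {
    ++g_calls;
    return x * x;
}

// expensive(x) appears twice after expansion: once in the comparison and
// once more in the selected branch, so it is called twice here.
int max_direct(int x) {
    return SKETCH_MAX(0, expensive(x));
}

// Evaluating the call once into a local removes the duplicate call.
int max_hoisted(int x) {
    const int v = expensive(x);
    return SKETCH_MAX(0, v);
}
```

Both functions return the same value; the difference is only in how many times the argument expression runs, which matches the commit's note that behavior is unchanged.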

commit | commitdiff | tree

Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]

iq2_xxs: tune quantization (llama/5320)

We get slightly better PPL, and we cut quantization time
nearly in half.

The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]

cuda : fix LLAMA_CUDA_F16 (llama/5262)

commit | commitdiff | tree

Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]

metal : add im2col F32 dst support (llama/5132)

commit | commitdiff | tree

JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]

llava : add MobileVLM support (llama/5132)

* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]

ggml : limit n_threads to the max n_tasks (llama/5238)

commit | commitdiff | tree

Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]

kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)

commit | commitdiff | tree

Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]

ggml : add abort_callback for cpu backend (ggml/725)

* a way to use abort_callback with the cpu backend

* whisper update
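
As a rough illustration of the abort-callback idea, the sketch below polls a user-supplied callback between units of work and stops early when it returns true. The names `abort_callback_t`, `run_steps` and `abort_if_flag_set` are hypothetical; this is a generic pattern under those assumptions, not the ggml CPU backend implementation.

```cpp
#include <atomic>

// Callback type assumed for this sketch: return true to request that work stops.
using abort_callback_t = bool (*)(void * user_data);

// Hypothetical compute loop: polls the callback before each step.
// Returns the number of steps that were completed.
int run_steps(int n_steps, abort_callback_t cb, void * user_data) {
    int done = 0;
    for (int i = 0; i < n_steps; ++i) {
        if (cb != nullptr && cb(user_data)) {
            break; // caller asked to abort
        }
        ++done; // a real implementation would do one unit of work here
    }
    return done;
}

// Example callback: aborts once an atomic flag (set from another thread
// or a signal-safe context) becomes true.
bool abort_if_flag_set(void * user_data) {
    return static_cast<std::atomic<bool> *>(user_data)->load();
}
```

Passing the flag through `user_data` keeps the callback a plain function pointer, which is convenient for C-style APIs.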

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 07:55:19 +0000 (09:55 +0200)]

extra : update sync scripts

commit | commitdiff | tree

Valentin Gosu [Fri, 9 Feb 2024 15:42:41 +0000 (16:42 +0100)]

server : allow CORS request with authorization headers (#1850)

Whisper plugin in Obsidian requires an API key which is
then sent as an authorization header.
However, the presence of an authorization header requires
a CORS Preflight, so both the OPTIONS method and
the Access-Control-Allow-Headers: authorization must be
handled.
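
A minimal sketch of the preflight handling described above, independent of any HTTP library. The `Response` struct and `handle_preflight` are hypothetical, and the wildcard origin, the 204 status and the method list are assumptions for illustration; only the OPTIONS method and the `Access-Control-Allow-Headers: authorization` header come from the commit message.

```cpp
#include <map>
#include <string>

// Hypothetical, library-independent response type for this sketch.
struct Response {
    int status = 200;
    std::map<std::string, std::string> headers;
};

// Builds the CORS-related part of a response for the given request method.
// For an OPTIONS preflight, the allowed methods and request headers are
// listed explicitly so that a browser may then send the authorization header.
Response handle_preflight(const std::string & method) {
    Response res;
    res.headers["Access-Control-Allow-Origin"] = "*"; // assumption: any origin
    if (method == "OPTIONS") {
        res.status = 204; // assumption: empty body for the preflight reply
        res.headers["Access-Control-Allow-Methods"] = "POST, OPTIONS";
        res.headers["Access-Control-Allow-Headers"] = "authorization";
    }
    return res;
}
```

The header name is listed explicitly because a wildcard in `Access-Control-Allow-Headers` does not cover `Authorization`.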

commit | commitdiff | tree

Neuman Vong [Fri, 9 Feb 2024 15:39:05 +0000 (02:39 +1100)]

whisper.android : how to build with CLBlast (#1809)

* FetchContent

* OpenCL

* Documentation and make optional

* Specify GGML build options in build.gradle

* Use gradle properties

* @ggerganov

Co-authored-by: Georgi Gerganov <redacted>

* @gpokat

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Didzis Gosko [Fri, 9 Feb 2024 15:27:47 +0000 (17:27 +0200)]

whisper : expose CUDA device setting in public API (#1840)

* Makefile : allow to override CUDA_ARCH_FLAG

* whisper : allow to select GPU (CUDA) device from public API

commit | commitdiff | tree

Didzis Gosko [Fri, 9 Feb 2024 15:26:29 +0000 (17:26 +0200)]

make : add macOS deployment target option (#1839)

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Feb 2024 17:56:12 +0000 (19:56 +0200)]

talk-llama : stream response (#1121)

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 19:30:26 +0000 (21:30 +0200)]

sync : ggml (#0)

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]

ggml : fix IQ3_XXS on Metal (llama/5219)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]

sync : ggml (llama/0)

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]

Faster AVX2 dot product for IQ2_XS (llama/5187)

* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 improvement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]

SOTA 3-bit quants (llama/5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]

ggml alloc: Fix for null dereference on alloc failure (llama/5200)

* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix

commit | commitdiff | tree

Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]

Nomic Vulkan backend (llama/4456)

Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>

commit | commitdiff | tree

slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]

ggml : add max buffer sizes to opencl and metal backends (llama/5181)

commit | commitdiff | tree

Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]

metal : free metal objects (llama/5161)

* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]

gguf : fix comparison (ggml/715)

ggml-ci

commit | commitdiff | tree

John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]

`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)

* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcast and 4d tensor cuda copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]

gguf : add input validation, prevent integer overflows (ggml/709)

* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
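
The kind of check this commit is about can be sketched as follows: before sizing a buffer from counts read out of an untrusted file, cap the item count and verify that the multiplication cannot wrap around. The helper names `checked_mul` and `array_bytes` and the `max_items` parameter are hypothetical; this is not the actual gguf code.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Stores a * b in out and returns true, or returns false if the product
// would overflow size_t (out is left untouched in that case).
bool checked_mul(size_t a, size_t b, size_t & out) {
    if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
        return false;
    }
    out = a * b;
    return true;
}

// Hypothetical validation of an array described by an untrusted header:
// rejects item counts above a caller-chosen limit, then computes the byte
// size with an overflow check.
bool array_bytes(uint64_t n_items, size_t item_size, size_t max_items, size_t & n_bytes) {
    if (n_items > max_items) {
        return false; // stricter limit on the number of items
    }
    return checked_mul(static_cast<size_t>(n_items), item_size, n_bytes);
}
```

A loader would call `array_bytes` before any allocation and treat a false return as a malformed file.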

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]

ci : fix yolo URLs + fix metal capture (ggml/712)

commit | commitdiff | tree

Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]

metal : add debug capture backend function (ggml/694)

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

JacobLinCool [Tue, 30 Jan 2024 17:35:08 +0000 (01:35 +0800)]

common : fix wav buffer detection (#1819)

commit | commitdiff | tree

JacobLinCool [Tue, 30 Jan 2024 12:15:55 +0000 (20:15 +0800)]

server : add fields to `verbose_json` response (#1802)

* server: include additional fields in the verbose_json response as OpenAI does

* server: show request examples on home page

* server: todo note for compression_ratio and no_speech_prob

* server: add simple demo form to the homepage

commit | commitdiff | tree

jwijffels [Tue, 30 Jan 2024 12:13:49 +0000 (13:13 +0100)]

make : update MSYS_NT (#1813)

I just upgraded the R wrapper at https://github.com/bnosac/audio.whisper to use whisper.cpp 1.5.4
I'm working on Windows and noticed while doing that that it did not pick up the relevant CFLAGS/CXXFLAGS as my system showed

```
I whisper.cpp build info:
I UNAME_S:  MSYS_NT-10.0-19045
I UNAME_P:  unknown
I UNAME_M:  x86_64
```

Many thanks for all the tremendous hard work on maintaining whisper.cpp!

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 17:44:10 +0000 (19:44 +0200)]

talk-llama : sync llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 17:30:32 +0000 (19:30 +0200)]

sync : ggml

commit | commitdiff | tree

0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]

ggml : add Vulkan backend (llama/2059)

* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <redacted>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <redacted>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]

ggml : add unified SYCL backend for Intel GPUs (llama/2690)

* first update for migration

* update init_cublas

* add debug function, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add known issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export function print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]

ggml : minor type fix (int64_t -> size_t)

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 Jan 2024 15:33:09 +0000 (17:33 +0200)]

common : fix input buffer check (#1812)

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 Jan 2024 15:24:53 +0000 (17:24 +0200)]

talk-llama : sync llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 Jan 2024 15:23:25 +0000 (17:23 +0200)]

sync : ggml

commit | commitdiff | tree

0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]

Add OpenCL add kernel (llama/5151)

* Add OpenCL add kernel

* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results

commit | commitdiff | tree

slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]

cuda : fix tensor size calculation for non-split buffer (llama/5145)

commit | commitdiff | tree

slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]

ggml-alloc : add 10% margin to the buffer sizes (llama/5149)

commit | commitdiff | tree

snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]

ggml : update softmax n_task calculation (llama/5126)

Update the n_task calculation to use the maximum number of
threads available. This improves prompt eval performance
by around 5% for DOT kernels and by around 10% for MMLA
kernels on AWS Graviton3.
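
A rough sketch of the idea, not the actual ggml code: the task count for a row-wise op follows the number of available threads instead of a fixed small cap. The helper name `soft_max_n_tasks` and the clamp to the row count are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ggml): pick how many tasks a row-wise
// op such as softmax is split into.
// Assumption: every available thread gets a task, but there are never
// more tasks than rows, since a task with no row would have nothing to do.
int soft_max_n_tasks(int n_threads, int64_t n_rows) {
    assert(n_threads > 0 && n_rows >= 0);
    return static_cast<int>(std::min<int64_t>(n_threads, n_rows));
}
```

With 8 threads and 1000 rows this yields 8 tasks; with 8 threads and 3 rows it yields 3.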

commit | commitdiff | tree

Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]

metal : remove unused `n_buffers` and `buffers` (llama/5129)

commit | commitdiff | tree

Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]

metal : show compile log messages

commit | commitdiff | tree

Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]

cuda : fix 2-bit quants on amd hip (llama/5105)

* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants

commit | commitdiff | tree

slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]

llama : pre-allocate input tensors in a separate buffer (llama/5100)

commit | commitdiff | tree

Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]

metal : disable support for MUL_MAT F32 x F16

commit | commitdiff | tree

Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]

CUDA: more info when no device code (llama/5088)

commit | commitdiff | tree

Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]

minor : clean-up some warnings and style (llama/5094)

* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment

commit | commitdiff | tree

Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]

ggml : parallelize FP32 conversion when using BLAS (llama/5045)

* allow the GGML_TASK_INIT phase to run multithreaded

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment

* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <redacted>

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]

llava : MobileVLM support (llama/4954)

* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy related code, replace both with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <redacted>

commit | commitdiff | tree

slaren [Sat, 20 Jan 2024 15:05:49 +0000 (16:05 +0100)]

llama : run all KQV ops on the CPU with no KV offload (llama/5049)

ggml-ci

commit | commitdiff | tree

Kylin [Sat, 20 Jan 2024 07:01:46 +0000 (15:01 +0800)]

cuda : fix compile error in jetson platform (llama/4975)

* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment

commit | commitdiff | tree

Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]

ggml : check ggml_add src1 type (ggml/708)

Co-authored-by: Judd <redacted>

commit | commitdiff | tree

Michael Rienstra [Fri, 26 Jan 2024 15:39:54 +0000 (07:39 -0800)]

docs : make model options / model install methods clearer (#1806)

* Make models more "discoverable"

* Clean up code block language identifiers

* make 3 options clearer

* undo Prettier formatter change

* docs: `$` shell prompt, consistently

* docs: minor changes

commit | commitdiff | tree

trixirt [Mon, 22 Jan 2024 13:02:35 +0000 (05:02 -0800)]

cmake : make libwhisper.so position independent (#1792)

This is similar to how libllama.so is built.

Signed-off-by: Tom Rix <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 22 Jan 2024 12:51:42 +0000 (14:51 +0200)]

cmake : temporarily remove VLA check (#1795)

commit | commitdiff | tree

Neuman Vong [Fri, 19 Jan 2024 14:17:38 +0000 (01:17 +1100)]

whisper.android : return output from benchmarks (#1785)

Benchmarks are failing because JNI expects a jstring and the benchmarks
are missing a return statement (i.e., returning null). The functions
actually build a jstring but don't return it, so this seems to have been
an oversight.

This patch returns the jstring and now the benchmarks run successfully.

Fixes #1783.

commit | commitdiff | tree

Ryan Hitchman [Thu, 18 Jan 2024 20:58:42 +0000 (13:58 -0700)]

server : implement "verbose_json" format with token details (#1781)

* examples/server: implement "verbose_json" format with token details.

This is intended to mirror the format of OpenAI's Python
whisper.transcribe() return values.

* server: don't write WAV to a temporary file if not converting

* server: use std::lock_guard instead of manual lock/unlock
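
The last item above swaps manual lock()/unlock() calls for std::lock_guard. A minimal sketch of that pattern, not the server's actual code; `g_mutex`, `g_requests_served`, and `handle_request` are hypothetical names chosen for illustration:

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical shared state (illustrative names, not from examples/server).
static std::mutex g_mutex;
static int g_requests_served = 0;

// With manual g_mutex.lock() / g_mutex.unlock(), an early return or an
// exception between the two calls would leave the mutex locked.
// std::lock_guard releases it in its destructor on every exit path.
int handle_request(const std::string & body) {
    std::lock_guard<std::mutex> lock(g_mutex);
    assert(g_requests_served >= 0); // sanity check on the guarded counter
    if (body.empty()) {
        throw std::invalid_argument("empty request body"); // mutex still released
    }
    return ++g_requests_served;
}
```

A call that throws does not block later calls, because the guard unlocks the mutex during stack unwinding.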

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 09:03:13 +0000 (11:03 +0200)]

ggml : sync ggml-metal.m

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 19:23:33 +0000 (21:23 +0200)]

sync : llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 19:22:38 +0000 (21:22 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:54:56 +0000 (18:54 +0200)]

ggml : add IQ2 to test-backend-ops + refactoring (llama/4990)

* ggml : add IQ2 to test-backend-ops + refactoring

ggml-ci

* cuda : update supports_op for IQ2

ggml-ci

* ci : enable LLAMA_CUBLAS=1 for CUDA nodes

ggml-ci

* cuda : fix out-of-bounds-access in `mul_mat_vec_q`

ggml-ci

* tests : avoid creating RNGs for each Q tensor

ggml-ci

* tests : avoid creating RNGs for each tensor

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:46:30 +0000 (18:46 +0200)]

imatrix : offload to GPU support (llama/4957)

* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when the user doesn't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* imatrix : offload to GPU support

* imatrix : fix ggml_mul_mat_id handling

ggml-ci

* ci : add imatrix test

ggml-ci

* ci : rearrange output

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:39:41 +0000 (18:39 +0200)]

backend : add eval callback (llama/4935)

* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when the user doesn't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo

Packaging of ggerganov/whisper.cpp
