git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
Vladimir Vuksanovic [Thu, 27 Feb 2025 07:42:48 +0000 (08:42 +0100)]
cmake: Fix ggml backend dependencies and installation (llama/11818)

* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends

Jeff Bolz [Tue, 25 Feb 2025 15:30:21 +0000 (09:30 -0600)]
vulkan: fix assertion when qy_needs_dequant (llama/12068)

Looks like a copy/paste bug from qx_needs_dequant.

Molly Sophia [Tue, 25 Feb 2025 11:28:22 +0000 (19:28 +0800)]
ggml-cpu: Fix build with sve (llama/12059)

* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <redacted>
* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
cmdr2 [Mon, 3 Mar 2025 15:21:31 +0000 (20:51 +0530)]
cuda: unary ops as float + de-duplicate (ggml/1130)

cmdr2 [Fri, 28 Feb 2025 10:29:55 +0000 (15:59 +0530)]
cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)

* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold (see the sketch after this list)

* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)

* f32 sigmoid in vulkan supports op

* Revert "f32 sigmoid in vulkan supports op"

This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.
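
A minimal sketch of the supports_op pattern these commits describe, using made-up enum and function names rather than the real ggml backend interface: the backend advertises fp32-only support for ops whose fp16 variants exceed the test threshold.

    #include <cstdio>

    enum op_type   { OP_SILU_BACK, OP_SIGMOID, OP_RELU };
    enum data_type { TYPE_F32, TYPE_F16 };

    // Hypothetical stand-in for a backend's supports_op callback.
    static bool supports_op(op_type op, data_type type) {
        switch (op) {
            case OP_SILU_BACK:
                return type == TYPE_F32; // fp16 exceeds the test threshold
            default:
                return type == TYPE_F32 || type == TYPE_F16;
        }
    }

    int main() {
        std::printf("SILU_BACK f16: %d\n", supports_op(OP_SILU_BACK, TYPE_F16)); // 0
        std::printf("RELU      f16: %d\n", supports_op(OP_RELU,      TYPE_F16)); // 1
        return 0;
    }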

cmdr2 [Fri, 28 Feb 2025 07:04:39 +0000 (12:34 +0530)]
cuda/cpu: Increase support for fp16 unary operations (ggml/1125)

* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: address PR comments for unary op support after the fp16 unary tests

petterreinholdtsen [Wed, 26 Feb 2025 20:44:00 +0000 (21:44 +0100)]
Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)

It is used by the Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <redacted>
Ivy233 [Fri, 7 Mar 2025 08:10:03 +0000 (16:10 +0800)]
common : more general m_audio_len update logic (#2855)

Co-authored-by: Ivy233 <redacted>
Ryan Johnson [Fri, 7 Mar 2025 08:03:51 +0000 (02:03 -0600)]
go : improve model download (#2756)

* Updated models download URL

* Updated list of models available

All of the high-efficiency quantized models were rejected when trying to download, even though they exist on the server. Let's allow them.

* added a path prefix for whisper-cli in the message to the user. The message is misleading if this script is called from another script in a different folder, so it has to be fixed.

* undid download URL change I made earlier. Fixed filepath.Join(urlPath, model) bug.

* Undid download URL change I made earlier.

It seems the old URL works, but only when a model to download is provided. That still doesn't explain why there's a different download URL that also works. Please elucidate in the docs.

* Fixed URLForModel Function's bug

filepath.Join is designed for filesystem paths, and it uses backslashes (\) on Windows. URLs, however, require forward slashes (/), so the use of filepath.Join is inappropriate for constructing URLs.

The fmt.Sprintf function ensures that forward slashes are used (see the sketch after this change list).

* Fixed URL trailing / double slash bug

Ensure no double slash by trimming trailing '/' from srcUrl if present

* Fixed bad download URL, missing ggml prefix

Not sure if that was a bug I introduced but it was trying to download without the prefix.

* Added question before downloading all models. Added download size estimate

HEAD requests:
Efficiently fetch file sizes without downloading the content.
Interactive workflow:
Allows the user to make informed decisions about downloading all models.
Safe defaults:
Aborts if the user does not explicitly confirm.

* Fixed unbuffered channel warning.

warning in context.go: misuse of unbuffered os.Signal channel as argument to signal.Notify.

The warning indicates that the unbuffered channel used in signal.Notify in context.go may be misused. In Go, the signal package does not block when sending to the channel, so with an unbuffered channel a signal can be missed if the receiver is not ready for it.

* Fixed download size calculation, download URL prefix bug, added link to models URL for user.

The URL formatter was prepending the model name to the already-formatted model name in the URL.

* Added logs and exes to gitignore

* Delete bindings/go/examples/go-model-download/go-model-download.exe

* Delete whisper_build.log
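
A hedged C++ rendering of the URL fix from the filepath.Join bullet above (the real change is in the Go bindings, and the host below is purely illustrative): build download URLs with explicit forward slashes and trim any trailing slash so no double slash appears.

    #include <cstdio>
    #include <string>

    // Join a base URL and a model file name with exactly one '/'.
    static std::string url_for_model(std::string base, const std::string &model) {
        while (!base.empty() && base.back() == '/') {
            base.pop_back(); // avoid "host//file" double slashes
        }
        return base + "/" + model; // URLs always use forward slashes
    }

    int main() {
        std::printf("%s\n",
                    url_for_model("https://example.com/models/", "ggml-base.en.bin").c_str());
        return 0;
    }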

Dmitry Atamanov [Tue, 4 Mar 2025 17:05:21 +0000 (22:05 +0500)]
common : fix audio loading by miniaudio (#2862)

Lin Xiaodong [Sun, 2 Mar 2025 18:55:11 +0000 (02:55 +0800)]
fix: missing include common-whisper (#2858)

KITAITI Makoto [Fri, 28 Feb 2025 06:09:02 +0000 (15:09 +0900)]
ruby : follow audio library change (#2851)

* Enable CPU

* Follow audio lib change

Diego Devesa [Thu, 27 Feb 2025 12:35:07 +0000 (13:35 +0100)]
whisper : support GGML_BACKEND_DL (#2843)

* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 27 Feb 2025 10:50:32 +0000 (12:50 +0200)]
common : separate whisper sources (#2846)

* common : separate whisper sources

* examples : add chrono

* examples : add more headers

Georgi Gerganov [Thu, 27 Feb 2025 08:39:13 +0000 (10:39 +0200)]
common : fix build min/max (#2845)

* common : try to fix build

* cont : try another fix

Dmitry Atamanov [Thu, 27 Feb 2025 07:06:54 +0000 (12:06 +0500)]
examples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759)

petterreinholdtsen [Thu, 27 Feb 2025 06:59:51 +0000 (07:59 +0100)]
stream : stop on ^C when no audio is received (#2822)

Add a check for Ctrl-C in the potentially endless loop that calls audio.get()
to receive sound; a sketch of the pattern follows below.

Co-authored-by: Petter Reinholdtsen <redacted>
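
A minimal sketch of the Ctrl-C check described above, with a placeholder where the example's real audio.get() call would sit: install a SIGINT handler that sets a flag, and poll the flag inside the otherwise endless receive loop.

    #include <atomic>
    #include <csignal>
    #include <cstdio>

    static std::atomic<bool> g_interrupted{false};

    static void on_sigint(int /*signo*/) {
        g_interrupted = true;
    }

    int main() {
        std::signal(SIGINT, on_sigint);
        while (!g_interrupted) {
            // audio.get(...) would be polled here; without the flag check,
            // this loop never ends when no audio is ever received
        }
        std::printf("got ^C, exiting cleanly\n");
        return 0;
    }
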
Georgi Gerganov [Wed, 26 Feb 2025 20:39:12 +0000 (22:39 +0200)]
sync : ggml

cmdr2 [Tue, 25 Feb 2025 12:36:34 +0000 (18:06 +0530)]
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div

Gian-Carlo Pascutto [Tue, 25 Feb 2025 09:27:58 +0000 (10:27 +0100)]
metal : copy kernels for quant to F32/F16 conversions (llama/12017)

metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <redacted>
lhez [Mon, 24 Feb 2025 21:47:07 +0000 (13:47 -0800)]
opencl: fix for small models (llama/11950)

* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <redacted>
Co-authored-by: Skyler Szot <redacted>
Neo Zhang Jianyu [Mon, 24 Feb 2025 14:33:23 +0000 (22:33 +0800)]
Optimize mul_mat for Q4_0 on Intel GPU (llama/12035)

* optimize performance by reordering for Intel GPU

* detect the hardware type, save the opt feature, and print it

* correct name

* optimize the graph once when computing it, record the opt status in tensor->extra, and make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debugging

* use syclex::architecture to replace the custom hw define, and update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* move the getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <redacted>
Akarshan Biswas [Mon, 24 Feb 2025 10:18:25 +0000 (15:48 +0530)]
SYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)

Aaron Teo [Sat, 22 Feb 2025 21:39:24 +0000 (05:39 +0800)]
ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)

* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <redacted>
* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <redacted>
* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <redacted>
* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <redacted>
* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <redacted>
* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <redacted>
* ggml: remove test.py

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <redacted>
* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <redacted>
* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <redacted>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <redacted>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <redacted>
* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <redacted>
* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <redacted>
* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <redacted>
* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <redacted>
* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <redacted>
* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <redacted>
* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <redacted>
* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <redacted>
* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms (see the sketch after this change list)

Signed-off-by: Aaron Teo <redacted>
* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <redacted>
* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <redacted>
* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Jinyang He <redacted>
Co-authored-by: junchao-zhao <redacted>
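
A small sketch of the cache-line alignment change from the ggml_aligned_malloc bullet in the list above, using standard C++17 aligned_alloc rather than ggml's own allocator: over-align heap allocations to the 256-byte s390x cache line.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const std::size_t alignment = 256;  // s390x cache line size
        const std::size_t size      = 4096; // must be a multiple of the alignment
        void *p = std::aligned_alloc(alignment, size);
        if (p == nullptr) {
            return 1;
        }
        std::printf("256-byte aligned: %s\n",
                    ((std::uintptr_t) p % alignment == 0) ? "yes" : "no");
        std::free(p);
        return 0;
    }
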
Johannes Gäßler [Sat, 22 Feb 2025 19:44:34 +0000 (20:44 +0100)]
CUDA: add option to compile without FlashAttention (llama/12025)

Johannes Gäßler [Sat, 22 Feb 2025 11:20:17 +0000 (12:20 +0100)]
CUDA: optimize FA for GQA + large batches (llama/12014)

Gian-Carlo Pascutto [Sat, 22 Feb 2025 08:43:24 +0000 (09:43 +0100)]
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (llama/12000)

PureJourney [Fri, 21 Feb 2025 11:21:05 +0000 (19:21 +0800)]
CUDA: correct the lowest Maxwell supported by CUDA 12 (llama/11984)

* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <redacted>
Bodhi [Fri, 21 Feb 2025 07:46:23 +0000 (15:46 +0800)]
MUSA: support ARM64 and enable dp4a etc. (llama/11843)

* MUSA: support ARM64 and enable __dp4a etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <redacted>
Charles Xu [Thu, 20 Feb 2025 13:06:51 +0000 (14:06 +0100)]
ggml-cpu: Add CPU backend support for KleidiAI library (llama/11390)

* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environmental variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove appending GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list

Prashant Vithule [Thu, 20 Feb 2025 10:08:32 +0000 (15:38 +0530)]
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917)

* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved formatting of code in ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spacing

---------

Co-authored-by: vithulep <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Mon, 17 Feb 2025 13:03:24 +0000 (14:03 +0100)]
CUDA: use async data loading for FlashAttention (llama/11894)

* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <redacted>
Rémy O [Mon, 17 Feb 2025 06:55:57 +0000 (07:55 +0100)]
vulkan: implement several ops relevant for ggml_opt (llama/11769)

* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command

Jeff Bolz [Sun, 16 Feb 2025 07:52:23 +0000 (01:52 -0600)]
vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)

Hale Chan [Sun, 16 Feb 2025 06:50:26 +0000 (14:50 +0800)]
metal : fix the crash caused by the lack of residency set support on Intel Macs. (llama/11904)

Adrian Kretz [Sat, 15 Feb 2025 18:39:20 +0000 (19:39 +0100)]
metal : optimize dequant q6_K kernel (llama/11892)

Georgi Gerganov [Sat, 15 Feb 2025 14:40:57 +0000 (16:40 +0200)]
repo : update links to new url (llama/11886)

* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci

Rémy O [Sat, 15 Feb 2025 08:01:40 +0000 (09:01 +0100)]
vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528)

* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem

lhez [Fri, 14 Feb 2025 19:12:23 +0000 (11:12 -0800)]
opencl: Fix rope and softmax (llama/11833)

* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`

Diego Devesa [Fri, 14 Feb 2025 14:33:52 +0000 (15:33 +0100)]
cuda : add ampere to the list of default architectures (llama/11870)

Jinyang He [Fri, 14 Feb 2025 08:54:27 +0000 (16:54 +0800)]
ggml: optimize some vec dot functions for LoongArch ASX (llama/11842)

* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX

Eve [Fri, 14 Feb 2025 02:59:40 +0000 (02:59 +0000)]
vulkan: linux builds + small subgroup size fixes (llama/11767)

* mm subgroup size

* upload vulkan x86 builds

Jeffrey Morgan [Thu, 13 Feb 2025 17:05:04 +0000 (09:05 -0800)]
llamafile: use member variable instead of constant for iq4nlt (llama/11780)

R0CKSTAR [Thu, 13 Feb 2025 12:28:18 +0000 (20:28 +0800)]
musa: bump MUSA SDK version to rc3.1.1 (llama/11822)

* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
Diego Devesa [Thu, 13 Feb 2025 00:02:38 +0000 (01:02 +0100)]
ggml-cpu : add chunking support to mul_mat_id (llama/11666)

* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes

Xuan-Son Nguyen [Wed, 12 Feb 2025 23:33:45 +0000 (00:33 +0100)]
ggml : x2 speed for WASM by optimizing SIMD (llama/11453)

* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <redacted>

* remove memset that causes buffer overflow

Co-authored-by: camel-cdr <redacted>
---------

Co-authored-by: camel-cdr <redacted>
uvos [Wed, 12 Feb 2025 21:25:28 +0000 (22:25 +0100)]
HIP: Remove GCN from list of devices that avoid MMQ (llama/11831)

uvos [Wed, 12 Feb 2025 16:25:03 +0000 (17:25 +0100)]
HIP: Switch to std::vector in rocblas version check (llama/11820)

bandoti [Wed, 12 Feb 2025 14:06:53 +0000 (10:06 -0400)]
cleanup: fix compile warnings associated with gnu_printf (llama/11811)

Richard [Wed, 12 Feb 2025 13:57:33 +0000 (13:57 +0000)]
ggml : fix multi-threaded clamp_f32 (llama/11824)

* Bug fix for clamp_f32

When using tensors larger than 1-D, the clamp operation does not work, because the kernel returns early whenever ith is not 0 (a sketch of the fix's shape follows this list).

* Bug fix for clamp_f32

* Bug fix for clamp_f32
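
A hedged sketch of the fix's shape (illustrative code, not the actual ggml clamp kernel): instead of every thread except thread 0 returning early, each thread ith of nth clamps its own interleaved subset of rows, so tensors with more than one row are fully processed.

    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void clamp_rows(float *data, int nrows, int ncols,
                           float lo, float hi, int ith, int nth) {
        // rows are interleaved across threads: ith, ith + nth, ith + 2*nth, ...
        for (int r = ith; r < nrows; r += nth) {
            for (int c = 0; c < ncols; ++c) {
                float &v = data[r * ncols + c];
                v = std::min(std::max(v, lo), hi);
            }
        }
    }

    int main() {
        const int nrows = 4, ncols = 3, nth = 2;
        std::vector<float> t = {-5, 0, 5, -5, 0, 5, -5, 0, 5, -5, 0, 5};
        std::vector<std::thread> pool;
        for (int ith = 0; ith < nth; ++ith) {
            pool.emplace_back(clamp_rows, t.data(), nrows, ncols, -1.0f, 1.0f, ith, nth);
        }
        for (auto &th : pool) {
            th.join();
        }
        for (float v : t) {
            std::printf("%g ", v); // every row ends up clamped to [-1, 1]
        }
        std::printf("\n");
        return 0;
    }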

Weizhao Ouyang [Wed, 12 Feb 2025 12:22:58 +0000 (20:22 +0800)]
ggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817)

Signed-off-by: Weizhao Ouyang <redacted>
Johannes Gäßler [Wed, 12 Feb 2025 12:16:39 +0000 (13:16 +0100)]
CUDA: fix CUDART_VERSION checks (llama/11821)

Sheldon Robinson [Tue, 11 Feb 2025 15:55:45 +0000 (10:55 -0500)]
Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx (llama/11803)

* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string

Johannes Gäßler [Mon, 10 Feb 2025 23:17:22 +0000 (00:17 +0100)]
CUDA: use arch list for compatibility check (llama/11775)

* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <redacted>
Maxim Evtush [Mon, 10 Feb 2025 22:21:31 +0000 (23:21 +0100)]
fix: typos in documentation files (llama/11791)

* Update ggml.c

* Update arg.cpp

* Update speculative.h

Danny Milosavljevic [Mon, 10 Feb 2025 06:17:21 +0000 (07:17 +0100)]
vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494)

Co-authored-by: Jeff Bolz <redacted>
Wagner Bruna [Mon, 10 Feb 2025 06:08:22 +0000 (03:08 -0300)]
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)

Jeff Bolz [Sun, 9 Feb 2025 07:43:51 +0000 (01:43 -0600)]
vulkan: account for lookup tables when checking shared memory size (llama/11502)

Karol Kontny [Sat, 8 Feb 2025 14:30:53 +0000 (15:30 +0100)]
ggml: Fix data race in ggml threadpool (llama/11736)

After the barrier in the last iteration is executed, the loop termination
condition is still evaluated. However, the main thread can destroy the cgraph
object and its nodes by then, so another thread would access memory that is
already gone. Trouble can also happen when n_nodes == 0 or abort is called,
but I'm not sure the former situation is possible.

The last synchronization should be done after the loop to ensure the
cgraph/cplan won't be accessed after the main thread exits from the function.

Johannes Gäßler [Sat, 8 Feb 2025 09:46:07 +0000 (10:46 +0100)]
CUDA: fix min. version for movmatrix (llama/11751)

Jeff Bolz [Fri, 7 Feb 2025 10:26:03 +0000 (04:26 -0600)]
vulkan: print shared memory size (llama/11719)

Akarshan Biswas [Fri, 7 Feb 2025 09:27:53 +0000 (14:57 +0530)]
SYCL: remove XMX info from print devices (llama/11712)

Jinyang He [Fri, 7 Feb 2025 07:38:31 +0000 (15:38 +0800)]
ggml : optimize and build warning fix for LoongArch (llama/11709)

* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when running cpu CI locally on LoongArch

Akarshan Biswas [Thu, 6 Feb 2025 11:42:35 +0000 (17:12 +0530)]
SYCL: Adjust support condition for norm operators (llama/11674)

SYCL does not support non-contiguous tensors for norm operations

junchao-zhao [Thu, 6 Feb 2025 09:20:00 +0000 (17:20 +0800)]
ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

Jeff Bolz [Thu, 6 Feb 2025 06:15:30 +0000 (00:15 -0600)]
vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521)

* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes

Rémy O [Thu, 6 Feb 2025 06:09:59 +0000 (07:09 +0100)]
vulkan: initial support for IQ4_XS quantization (llama/11501)

Jeff Bolz [Thu, 6 Feb 2025 06:02:18 +0000 (00:02 -0600)]
vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)

Charles Duffy [Thu, 6 Feb 2025 01:52:31 +0000 (19:52 -0600)]
metal : avoid breaking build when metal API predates TARGET_OS_VISION (llama/11690)

Avoids breakage in the nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9

Georgi Gerganov [Wed, 5 Feb 2025 08:57:42 +0000 (10:57 +0200)]
metal : adjust support conditions for norm operators (llama/11671)

cont #11659

ggml-ci

Johannes Gäßler [Wed, 5 Feb 2025 07:58:31 +0000 (08:58 +0100)]
CUDA: support for mat. mul. with ne03 != ne13 (llama/11656)

Johannes Gäßler [Tue, 4 Feb 2025 21:21:42 +0000 (22:21 +0100)]
CUDA: non-contiguous (RMS) norm support (llama/11659)

* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <redacted>
fxzjshm [Tue, 4 Feb 2025 18:18:38 +0000 (02:18 +0800)]
HIP: force max threads per block to be 1024 (llama/11621)

Some old or vendor-forked versions of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.

Signed-off-by: fxzjshm <redacted>
Jhen-Jie Hong [Tue, 4 Feb 2025 11:07:18 +0000 (19:07 +0800)]
metal : use residency set for other platforms (llama/11648)

Patrick Peng [Thu, 6 Feb 2025 14:29:13 +0000 (09:29 -0500)]
rpc: fix known RCE in rpc-server (ggml/1103)

Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes:
check that `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer's allocated region.
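
A self-contained sketch of that check's logic under hypothetical names (the actual fix operates on ggml tensors via ggml_nbytes inside the rpc-server): validate that the end of the write stays inside the destination buffer, ordering the comparisons so nothing overflows.

    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    // True only if writing src_nbytes bytes at dst_data stays inside
    // the buffer [buf_base, buf_base + buf_size).
    static bool copy_in_bounds(const uint8_t *buf_base, size_t buf_size,
                               const uint8_t *dst_data, size_t src_nbytes) {
        if (dst_data < buf_base) {
            return false;
        }
        const size_t offset = (size_t) (dst_data - buf_base);
        if (offset > buf_size) {
            return false;
        }
        // same as offset + src_nbytes <= buf_size, but cannot overflow
        return src_nbytes <= buf_size - offset;
    }

    int main() {
        uint8_t buf[64];
        assert( copy_in_bounds(buf, sizeof(buf), buf,      sizeof(buf))); // exact fit
        assert(!copy_in_bounds(buf, sizeof(buf), buf + 1,  sizeof(buf))); // one byte past the end
        assert(!copy_in_bounds(buf, sizeof(buf), buf + 32, SIZE_MAX));    // would overflow naively
        return 0;
    }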

masahji [Tue, 25 Feb 2025 09:39:33 +0000 (01:39 -0800)]
stream : add beam size parameter (#2836)

* feat: Add beam size parameter to stream.cpp for beam search configuration

* feat: Add beam size parameter to whisper full params in stream example

* fix: Remove duplicate beam search size assignment in server.cpp

Thomas Fitzsimmons [Tue, 25 Feb 2025 09:38:13 +0000 (09:38 +0000)]
whisper : restore big endian support (#2816)

* whisper : fix BYTESWAP whitespace

* whisper : make byteswap useable with C++17

* cmake : define WHISPER_BIG_ENDIAN for big-endian targets

* ci : fix (again) arm64 build fails

* docker : attempt fixing arm64 build on ci

* qemu v7.0.0-28

[imported from https://github.com/ggml-org/llama.cpp/commit/818a340ea8be55b3706e1772527cb8738e90a8c7 (#11895)]

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
Judd [Thu, 6 Feb 2025 07:37:21 +0000 (15:37 +0800)]
Fixes for Windows (#2790)

Fixes for Windows:

* MSVC defaults to utf-8 without BOM.
* Console output code page changed to utf-8.

---------

Co-authored-by: Judd <redacted>
midnight [Wed, 5 Feb 2025 12:41:10 +0000 (04:41 -0800)]
cmake : fix compile assumptions for power9/etc (#2777)

* Add small comment re: VSX to readme

Co-authored-by: midnight <redacted>
Georgi Gerganov [Tue, 4 Feb 2025 11:03:40 +0000 (13:03 +0200)]
authors : update

Georgi Gerganov [Tue, 4 Feb 2025 11:03:09 +0000 (13:03 +0200)]
sync : ggml

Christian Kastner [Mon, 3 Feb 2025 23:17:15 +0000 (00:17 +0100)]
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)

This makes git an optional dependency, which is useful when ggml is built
not from git but from a tarball or a distribution source package.

This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be using
it, though, so there doesn't seem to be much value in factoring it out, or
even requiring it.

Georgi Gerganov [Tue, 4 Feb 2025 08:50:10 +0000 (10:50 +0200)]
readme : add maintenance roadmap

Georgi Gerganov [Tue, 4 Feb 2025 07:30:08 +0000 (09:30 +0200)]
ci : add stalebot

billyct [Mon, 3 Feb 2025 20:49:06 +0000 (04:49 +0800)]
node : add max_len params in node addon (#2760)

Georgi Gerganov [Mon, 3 Feb 2025 20:42:26 +0000 (22:42 +0200)]
talk-llama : sync llama.cpp

mgrachten [Mon, 3 Feb 2025 20:36:32 +0000 (21:36 +0100)]
coreml : always convert to "neuralnetwork" (#2770)

Georgi Gerganov [Mon, 3 Feb 2025 19:17:33 +0000 (21:17 +0200)]
ci : more git

Georgi Gerganov [Mon, 3 Feb 2025 18:12:37 +0000 (20:12 +0200)]
ci : install git

Georgi Gerganov [Mon, 3 Feb 2025 17:50:24 +0000 (19:50 +0200)]
ci : use ubuntu-22.04 instead of ubuntu-latest

Georgi Gerganov [Mon, 3 Feb 2025 14:24:38 +0000 (16:24 +0200)]
cmake : sync cmake scripts

Georgi Gerganov [Mon, 3 Feb 2025 14:05:34 +0000 (16:05 +0200)]
sync : ggml

Georgi Gerganov [Mon, 3 Feb 2025 14:05:27 +0000 (16:05 +0200)]
scripts : fix sync paths

Johannes Gäßler [Mon, 3 Feb 2025 12:25:56 +0000 (13:25 +0100)]
CUDA: fix Volta FlashAttention logic (llama/11615)

Johannes Gäßler [Sun, 2 Feb 2025 22:48:29 +0000 (23:48 +0100)]
HIP: fix flash_attn_stream_k_fixup warning (llama/11604)

uvos [Sun, 2 Feb 2025 21:40:09 +0000 (22:40 +0100)]
CUDA/HIP: add support for selectable warp size to mmv (llama/11519)

CUDA/HIP: add support for selectable warp size to mmv

uvos [Sun, 2 Feb 2025 21:08:05 +0000 (22:08 +0100)]
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other (llama/11601)

This fixes a bug where RDNA1 gpus other than gfx1010 were not handled correctly

Johannes Gäßler [Sun, 2 Feb 2025 18:31:09 +0000 (19:31 +0100)]
CUDA: use mma PTX instructions for FlashAttention (llama/11583)

* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <redacted>
Olivier Chafik [Fri, 31 Jan 2025 17:12:40 +0000 (17:12 +0000)]
`ci`: use sccache on windows instead of ccache (llama/11545)

* Use sccache on ci for windows

* Detect sccache in cmake

uvos [Wed, 29 Jan 2025 18:36:00 +0000 (19:36 +0100)]
HIP: require at least HIP 5.5