git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
3 months ago ci : use ninja and fix caching for windows-cublas (#2910)
Daniel Bevenius [Thu, 20 Mar 2025 16:01:48 +0000 (17:01 +0100)]
ci : use ninja and fix caching for windows-cublas (#2910)

This commit updates the windows-cublas job to use Ninja as the build
system instead of msbuild/msvc.

The motivation for this is that msbuild/msvc does not seem to handle
ccache/sccache well, for example it ignores the
`CMAKE_C_COMPILER_LAUNCHER` etc. variables. With Ninja, caching works:
the initial build takes about the same time as the current build
(without caching), and subsequent builds are much faster.

Refs: https://github.com/ggerganov/whisper.cpp/issues/2781

3 months ago examples : update wasm examples to include server.py [no ci] (#2908)
Daniel Bevenius [Thu, 20 Mar 2025 08:07:43 +0000 (09:07 +0100)]
examples : update wasm examples to include server.py [no ci] (#2908)

This commit updates the README files for the wasm examples to include
instructions on how to run the examples using the provided server.py
which was included in Commit 6e8242f7fe166b7798bbf49b4c65aba8afe1e131
("examples : command.wasm updates (#2904)").

The motivation for this is consistency with the command.wasm example.

3 months ago examples : command.wasm updates (#2904)
Daniel Bevenius [Thu, 20 Mar 2025 06:02:18 +0000 (07:02 +0100)]
examples : command.wasm updates (#2904)

This commit updates the command.wasm example by adding a server.py script to make it easy to start a local HTTP server to try out the example, updates the build instructions, and addresses some of the compiler warnings that were being generated.

* emscripten : fix TOTAL_STACK for wasm

This commit moves the TOTAL_STACK setting from the compile flags to the
linker flags. This is because the TOTAL_STACK setting is a linker
setting.

The motivation for this change is that currently the following warnings
are generated when building:
```console
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
```

* examples : suppress C++17 deprecation warning for std::codecvt_utf8

This commit suppresses the C++17 deprecation warning for
std::codecvt_utf8 similar to what is done in
examples/talk-llama/unicode.cpp.

The motivation for this change is to suppress these warnings:
```console
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
4 warnings generated.
```
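The suppression pattern referenced here can be sketched as below; this is a hedged example modeled on the approach in examples/talk-llama/unicode.cpp, not the exact diff, and `utf8_to_wstring` is an illustrative name:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Silence only the C++17 deprecation of codecvt_utf8/wstring_convert,
// keeping other -Wdeprecated-declarations warnings active.
#if defined(__GNUC__)
#    pragma GCC diagnostic push
#    pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#endif

static std::wstring utf8_to_wstring(const std::string & s) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(s);
}

#if defined(__GNUC__)
#    pragma GCC diagnostic pop
#endif
```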

* ggml : suppress double-promotion warning in GGML_F16x4_REDUCE

This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro
to suppress a double-promotion warning.

Currently the following warning is generated when compiling the
command.wasm example:
```console
/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1592 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1640 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings generated.
```
wasm_f32x4_extract_lane returns a 32-bit float and this is what the
addition is performed on. But there is an implicit conversion from
32-bit float to 64-bit double when the result is assigned to `res`,
which is of type `ggml_float`. My understanding here is that this is
intentional and adding a cast to `ggml_float` should suppress the
warning.
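In scalar form the fix amounts to doing the lane addition in float and converting once, explicitly. A simplified sketch (the real code is a WASM SIMD macro; `reduce_f32x4` is a hypothetical stand-in):

```cpp
typedef double ggml_float;

// Sum the four lanes as 32-bit floats, then cast the result explicitly
// so the widening to ggml_float (double) no longer triggers
// -Wdouble-promotion.
static ggml_float reduce_f32x4(const float x[4]) {
    return (ggml_float)(x[0] + x[1] + x[2] + x[3]);
}
```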

* emscripten : add -Wno-deprecated for emscripten builds

This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten
builds.

The motivation for this is that currently there are a number of warnings
generated like the following:
```console
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
```

The downside of this is that we might miss other deprecation warnings
in the future, so I'm not sure if this is acceptable. But it makes the
wasm examples cleaner without the warnings.

* examples : fix tautological-compare warning in stb_vorbis.c [no ci]

This commit applies a fix to address a tautological-compare warning
in stb_vorbis.c.

The motivation for this is that currently the following warning is
generated when compiling the command.wasm example:
```console
/Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare]
 1404 |       if (f->stream_start + loc >= f->stream_end || f->stream_start + loc < f->stream_start) {
      |                                                                           ^
1 warning generated.
```

This fix was taken from an open pull request on the stb repository
that addresses this issue:
https://github.com/nothings/stb/pull/1746
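For context, the warning fires because `f->stream_start + loc < f->stream_start` can only be true on pointer overflow, which is undefined behavior in the first place. A well-defined alternative compares the offset against the buffer size instead of forming the pointer first; this is a sketch of the technique, not the exact change from the stb PR:

```cpp
#include <cstddef>
#include <cstdint>

// True when offset `loc` lies inside [start, end): the comparison is
// done on sizes, so no out-of-range pointer is ever formed.
static bool in_stream(const uint8_t * start, const uint8_t * end, size_t loc) {
    return loc < (size_t)(end - start);
}
```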

* squash! examples : update command.wasm instructions [no ci]

This commit adds a Python script to serve the wasm examples built
in the `build-em` directory. Initially I thought that it would be enough
to start a simple Python server, but I did not notice that there was an
error in the browser console when I did that:
```console
command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated.
    at command.js:1:1206224
    at new Promise (<anonymous>)
    at loadWasmModuleToWorker (command.js:1:1204981)
    at Array.map (<anonymous>)
    at Object.loadWasmModuleToAllWorkers (command.js:1:1206428)
    at command.js:1:1204318
    at callRuntimeCallbacks (command.js:1:1202062)
    at preRun (command.js:1:6136)
    at run (command.js:1:1294094)
    at removeRunDependency (command.js:1:7046)
```
We need a few CORS headers to be set, and in order to hopefully make
this easy for users a Python script is added to the examples directory.
This should be able to serve all the wasm examples provided they have
been built. command.wasm's README.md is updated to reflect this change.
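The headers in question are the cross-origin isolation headers that a page needs before `SharedArrayBuffer` can be transferred. A minimal sketch of what such a server.py can do (class and function names are illustrative, not necessarily what the script in the repository uses):

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Response headers required for window.crossOriginIsolated to be true,
# which SharedArrayBuffer transfer depends on.
COI_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}

class COIRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Add the isolation headers to every response before finishing.
        for name, value in COI_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()

def make_server(port: int = 8000) -> ThreadingHTTPServer:
    # Serves the current directory; call .serve_forever() to start.
    return ThreadingHTTPServer(("127.0.0.1", port), COIRequestHandler)
```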

* examples : remove unused functions

This commit removes the unused functions convert_to_utf8 and
convert_to_wstring from examples/common.cpp.

* Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]"

This reverts commit 8e3c47d96141c7675c985562ebdc705e839e338a.

We should not make this change here and instead when the upstream PR is
merged we can sync with it.

Refs: https://github.com/ggerganov/whisper.cpp/issues/2784

3 months ago ci : refactor cuda toolkit installation steps (#2902)
Daniel Bevenius [Wed, 19 Mar 2025 08:41:14 +0000 (09:41 +0100)]
ci : refactor cuda toolkit installation steps (#2902)

This commit updates the CUDA toolkit installation steps to use
variables for the CUDA version and the component versions.

The motivation for this change is that currently the versions for
the components are used in multiple places, which makes them hard to
update and maintain.

3 months ago go : add Encoder Begin Callback (#2900)
Amanda Der Bedrosian [Wed, 19 Mar 2025 07:05:04 +0000 (00:05 -0700)]
go : add Encoder Begin Callback (#2900)

Adding in EncoderBeginCallback to the Context's Process callback.
This optional callback function returns false if computation should
be aborted.

Co-authored-by: Amanda Der Bedrosian <redacted>
3 months ago ci : add ccache action to windows-cublas job (#2893)
Daniel Bevenius [Wed, 19 Mar 2025 03:53:08 +0000 (04:53 +0100)]
ci : add ccache action to windows-cublas job (#2893)

* ci : add ccache action to windows-cublas job

This commit adds the ccache action to the windows-cublas job. This will
allow us to cache the build artifacts and hopefully speed up the build
process.

Refs: https://github.com/ggerganov/whisper.cpp/issues/2781

3 months ago whisper : fix compiler warnings in whisper.cpp (#2895)
Daniel Bevenius [Tue, 18 Mar 2025 12:38:41 +0000 (13:38 +0100)]
whisper : fix compiler warnings in whisper.cpp (#2895)

This commit fixes compiler warnings in whisper.cpp by changing the type
of the loop index variable from int64_t to size_t.

Currently the following warnings are generated by the compiler:
```console
/whisper.cpp/src/whisper.cpp:209:27: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
  209 |     for (int64_t i = 0; i < nels; ++i) {
      |                         ~ ^ ~~~~
/whisper.cpp/src/whisper.cpp:219:27: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
  219 |     for (int64_t i = 0; i < nels; ++i) {
      |                         ~ ^ ~~~~
```
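The change is just the loop-index type; in isolation it looks like this (a simplified stand-in, `sum_all` and the vector are illustrative, not the actual whisper.cpp code):

```cpp
#include <cstddef>
#include <vector>

// nels comes from size(), which is unsigned, so indexing with size_t
// keeps both operands of the comparison the same signedness and
// avoids -Wsign-compare.
static double sum_all(const std::vector<float> & data) {
    const size_t nels = data.size();
    double sum = 0.0;
    for (size_t i = 0; i < nels; ++i) {
        sum += data[i];
    }
    return sum;
}
```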

3 months ago ci : add missing env.branch_name to build.yml (#2896)
Daniel Bevenius [Tue, 18 Mar 2025 12:38:21 +0000 (13:38 +0100)]
ci : add missing env.branch_name to build.yml (#2896)

This commit adds the missing env.branch_name to the build.yml file.

The motivation for this is that currently the build is failing during
the release job because the branch_name is not set and an invalid tag
is being used.

3 months ago whisper : enable compiler warnings for src (#2891)
Daniel Bevenius [Tue, 18 Mar 2025 04:19:18 +0000 (05:19 +0100)]
whisper : enable compiler warnings for src (#2891)

* whisper : enable compiler warnings for src

This commit enables compiler warnings for the src directory. Currently,
when the WHISPER_ALL_WARNINGS flag is set to ON, it only enables
warnings in ggml, by setting GGML_ALL_WARNINGS to ON. This commit adds
the same compiler flags for whisper's src directory.

The motivation for this is to catch potential bugs and issues early on
in the development process.

* squash! whisper : enable compiler warnings for src

Remove GF_C_FLAGS and GF_CXX_FLAGS from add_compile_options.

3 months ago ci : add release job and include xcframework (#2889)
Daniel Bevenius [Tue, 18 Mar 2025 04:18:20 +0000 (05:18 +0100)]
ci : add release job and include xcframework (#2889)

* ci : add release job and include xcframework

This commit adds a release job that uploads the xcframework as an
artifact and creates a release with the xcframework as an asset.

This job can be triggered manually and enables a pre-release tag name to
be specified so that these releases can be distinguished from the
regular releases more easily.

Resolves: https://github.com/ggerganov/whisper.cpp/issues/2886

3 months ago examples : use xcframework in whisper.objc example (#2882)
Daniel Bevenius [Mon, 17 Mar 2025 12:01:24 +0000 (13:01 +0100)]
examples : use xcframework in whisper.objc example (#2882)

* examples : use xcframework in whisper.objc example

This commit updates the whisper.objc example to use the xcframework.

The motivation for this is to be consistent with the swift example and
to also act as a reference for how to use the xcframework in an objc
project.

Resolves: https://github.com/ggerganov/whisper.cpp/issues/2881

* examples : setup audio session in viewDidLoad

This commit adds the setup of the audio session in the viewDidLoad
method of the ViewController.m file. This is necessary to allow the app
to record audio.

The motivation for this is that without this it was not possible to
capture audio from the microphone. It was possible to click on the
Capture button but nothing happened after that, and the button was not
marked red, indicating that it could be clicked again to stop
capturing. With this change it is possible to capture audio from the
microphone and get it transcribed.

3 months ago whisper : add option to use system-installed GGML (#2887)
Peter [Mon, 17 Mar 2025 07:54:48 +0000 (18:54 +1100)]
whisper : add option to use system-installed GGML (#2887)

3 months ago convert : update convert-h5-to-ggml.py (#2840)
Anders Bjarby [Mon, 17 Mar 2025 07:41:05 +0000 (08:41 +0100)]
convert : update convert-h5-to-ggml.py (#2840)

improved handling of missing max_length

3 months ago examples : add GGML_USE_CPU=ON flag to whisper.objc (#2880)
Daniel Bevenius [Fri, 14 Mar 2025 14:40:20 +0000 (15:40 +0100)]
examples : add GGML_USE_CPU=ON flag to whisper.objc (#2880)

This commit adds the GGML_USE_CPU=ON flag to the whisper.objc project in
order to enable the CPU backend for the whisper.objc project.

The motivation for this change is that currently the following error
is generated when running the example:
```console
ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend) {
    return ggml_backend_dev_buffer_type(backend->device); <- Thread 1: EXC_BAD_ACCESS (code=1, address=0x70)
}
```
If we inspect the `backend` variable we can see that it is a `nullptr`.
```console
(lldb) p backend
(ggml_backend_t) nullptr
```
When running in a simulator there will automatically be no GPU, as
there is a check for this in the code. But the CPU backend should still
be present.

The Objective-C code will compile the whisper sources including the ggml
sources. And if `-DGGML_USE_CPU` is not defined then there will be no
CPU backend, and in this particular case no backend at all.

Resolves: https://github.com/ggerganov/whisper.cpp/issues/2870

3 months ago ggml-ci: update input env variables to GG_BUILD_ (#2879)
Benjamin Ryan [Fri, 14 Mar 2025 08:53:29 +0000 (03:53 -0500)]
ggml-ci: update input env variables to GG_BUILD_ (#2879)

3 months ago ggml-ci: add run.sh (#2877)
Benjamin Ryan [Fri, 14 Mar 2025 07:29:55 +0000 (02:29 -0500)]
ggml-ci: add run.sh (#2877)

3 months ago examples : add dl to the list of libraries linked (#2875)
Daniel Bevenius [Fri, 14 Mar 2025 03:42:20 +0000 (04:42 +0100)]
examples : add dl to the list of libraries linked (#2875)

* examples : add dl to the list of libraries linked

This commit adds the dynamic linker library to the list of libraries
linked by the examples.

The motivation for this change is that when building the examples on
Ubuntu 20.04, which uses GCC 9.4.0, the dynamic linking library (libdl)
requires explicit linking or the following error is generated:
```console
[ 64%] Linking CXX executable ../../bin/whisper-cli
cd /app/whisper.cpp/build/examples/cli && /usr/bin/cmake -E cmake_link_script CMakeFiles/whisper-cli.dir/link.txt --verbose=1
/usr/bin/c++  -O3 -DNDEBUG   CMakeFiles/whisper-cli.dir/cli.cpp.o  -o ../../bin/whisper-cli  -Wl,-rpath,/app/whisper.cpp/build/src:/app/whisper.cpp/build/ggml/src: ../libcommon.a ../../src/libwhisper.so.1.7.4 -pthread ../../ggml/src/libggml.so ../../ggml/src/libggml-cpu.so ../../ggml/src/libggml-base.so
/usr/bin/ld: ../libcommon.a(common-whisper.cpp.o): undefined reference to symbol 'dlclose@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libdl.so.2: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [examples/cli/CMakeFiles/whisper-cli.dir/build.make:89: bin/whisper-cli] Error 1
make[2]: Leaving directory '/app/whisper.cpp/build'
make[1]: *** [CMakeFiles/Makefile2:433: examples/cli/CMakeFiles/whisper-cli.dir/all] Error 2
make[1]: Leaving directory '/app/whisper.cpp/build'
make: *** [Makefile:130: all] Error 2
```
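In CMake this kind of fix is usually spelled with the portable `CMAKE_DL_LIBS` variable rather than a hard-coded `dl`. A hedged sketch, with illustrative target and library names:

```cmake
# CMAKE_DL_LIBS expands to the dynamic loading library where one is
# needed (e.g. libdl with glibc before 2.34) and to nothing elsewhere.
target_link_libraries(whisper-cli PRIVATE common whisper ${CMAKE_DL_LIBS})
```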

Resolves: https://github.com/ggerganov/whisper.cpp/issues/2854

3 months ago whisper: add xcframework build script (#2873)
Martin Destagnol [Thu, 13 Mar 2025 12:56:39 +0000 (02:56 -1000)]
whisper: add xcframework build script (#2873)

* whisper: add xcframework build script

* added apple validation scripts

* fixed Readme

* validation script fix

3 months ago objc : fix build, tmp remove GPU support, use C++17
Georgi Gerganov [Sat, 8 Mar 2025 08:46:37 +0000 (10:46 +0200)]
objc : fix build, tmp remove GPU support, use C++17

3 months ago cmake : fix ggml-config (ggml/0)
Georgi Gerganov [Sat, 8 Mar 2025 08:33:33 +0000 (10:33 +0200)]
cmake : fix ggml-config (ggml/0)

3 months ago sync : ggml upstream/1.7.4+203
Georgi Gerganov [Sat, 8 Mar 2025 08:26:00 +0000 (10:26 +0200)]
sync : ggml

3 months ago ggml-cpu: faster AVX2 variant for IQ1_M (llama/12216)
Rémy O [Fri, 7 Mar 2025 11:54:22 +0000 (12:54 +0100)]
ggml-cpu: faster AVX2 variant for IQ1_M (llama/12216)

3 months ago metal : simplify kernel arguments using a struct (ggml/3229) (llama/12194)
BB-fat [Fri, 7 Mar 2025 07:35:57 +0000 (15:35 +0800)]
metal : simplify kernel arguments using a struct (ggml/3229) (llama/12194)

* metal : refactor im2col parameters into a struct

* metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets

* metal : refactor sum_rows parameters into a struct

* metal : refactor soft_max parameters into a struct

* metal : refactor diag_mask_inf parameters into a struct

* metal : refactor ssm_conv parameters into a struct

* metal : refactor ssm_scan parameters into a struct

* metal : refactor get_rows parameters into a struct

* metal : refactor group_norm parameters into a struct

* metal : refactor conv_transpose_1d parameters into a struct

* metal : refactor upscale parameters into a struct

* metal : refactor pad parameters into a struct

* metal : refactor pad_reflect_1d parameters into a struct

* metal : refactor arange parameters into a struct

* metal : refactor timestep_embedding parameters into a struct

* metal : refactor argsort parameters into a struct

* metal : refactor leaky_relu parameters into a struct

* metal : refactor pool_2d parameters into a struct

* metal : fix trailing whitespace

---------

Co-authored-by: alexju <redacted>
3 months ago metal : fix default.metallib build (llama/12224)
Daniel Bevenius [Fri, 7 Mar 2025 05:23:16 +0000 (06:23 +0100)]
metal : fix default.metallib build (llama/12224)

This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.

The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'.  Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```

With the above change the build could progress, but there was a
follow-on error about not being able to find the ggml-common.h file in
ggml-metal.metal, where it was included as a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
         ^~~~~~~~~~~~~~~~~~
         "ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.

3 months ago opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (llama/12217)
lhez [Fri, 7 Mar 2025 00:20:35 +0000 (16:20 -0800)]
opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (llama/12217)

* opencl: support noncontiguous `norm`

* opencl: support noncontiguous `rms_norm`

* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`

3 months ago cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (llama...
xiaofei [Thu, 6 Mar 2025 22:58:25 +0000 (06:58 +0800)]
cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (llama/12094)

Signed-off-by: Ray Lee <redacted>
Co-authored-by: Ray Lee <redacted>
3 months ago CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (llama/12222)
Johannes Gäßler [Thu, 6 Mar 2025 17:45:09 +0000 (18:45 +0100)]
CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (llama/12222)

3 months ago HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it...
uvos [Thu, 6 Mar 2025 07:20:52 +0000 (08:20 +0100)]
HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it. (llama/12209)

This avoids conflicts with the internal CUDA/HIP runtimes' memory management behavior.

3 months ago opencl : fix buffer alignment (llama/12197)
Henry Linjamäki [Thu, 6 Mar 2025 01:33:40 +0000 (03:33 +0200)]
opencl : fix buffer alignment (llama/12197)

Fix the following error:

```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```

which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).

Also, fix that `ggml_backend_opencl_context::alignment` was set from
`CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which was treated as bytes but is
reported in bits.
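The second fix is a unit conversion: a value reported in bits has to be divided by 8 before being used as a byte alignment. A minimal sketch, with an illustrative helper name:

```cpp
#include <cstdint>

// CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; convert to bytes
// before using it as a buffer alignment.
static uint32_t base_addr_align_bytes(uint32_t align_bits) {
    return align_bits / 8;
}
```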

3 months ago opencl : fix `ulong` kernel args were set from `int` variables (llama/12174)
Henry Linjamäki [Thu, 6 Mar 2025 01:31:14 +0000 (03:31 +0200)]
opencl : fix `ulong` kernel args were set from `int` variables (llama/12174)

... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running PoCL.

3 months ago opencl : fix profile-related errors (llama/12095)
simon886212 [Thu, 6 Mar 2025 01:30:05 +0000 (09:30 +0800)]
opencl : fix profile-related errors (llama/12095)

Co-authored-by: ubuntu <redacted>
3 months ago ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (llama/12154)
Rémy O [Thu, 6 Mar 2025 01:26:10 +0000 (02:26 +0100)]
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (llama/12154)

* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC

3 months ago SYCL: Disable f16 Unary OPs as not supported by the kernels (llama/12201)
Akarshan Biswas [Wed, 5 Mar 2025 15:58:23 +0000 (21:28 +0530)]
SYCL: Disable f16 Unary OPs as not supported by the kernels (llama/12201)

3 months ago ggml : fix GGMLMetalClass ODR (llama/12200)
Plamen Minev [Wed, 5 Mar 2025 15:16:01 +0000 (17:16 +0200)]
ggml : fix GGMLMetalClass ODR (llama/12200)

-- it might happen if ggml is loaded from 2 separate libraries, since each one of them will expose the class. This is more of a guard, since we want to use Metal only as an embedded library and don't care about the other case.

3 months ago ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118)
vmobilis [Fri, 7 Mar 2025 08:11:40 +0000 (11:11 +0300)]
ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118)

* ggml_compute_forward_concat() for arbitrary tensor type

* Check that tensors' type match

* ggml-cpu.c: check type of source tensors

* ggml-cpu.c: move tensor type check to ggml_compute_forward_concat()

* ggml.c: check concatenated tensor type

* Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c

..., as it was moved to ggml.c.

3 months ago vulkan : sync (llama/0)
Georgi Gerganov [Tue, 4 Mar 2025 19:08:15 +0000 (21:08 +0200)]
vulkan : sync (llama/0)

ggml-ci

3 months ago ggml : portability fixes for VS 2017 (llama/12150)
mgroeber9110 [Tue, 4 Mar 2025 16:53:26 +0000 (17:53 +0100)]
ggml : portability fixes for VS 2017 (llama/12150)

* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <redacted>
3 months ago HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
David Huang [Mon, 3 Mar 2025 21:10:54 +0000 (05:10 +0800)]
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)

Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16

3 months ago ggml : fix kleidiai build (llama/12159)
ag2s20150909 [Mon, 3 Mar 2025 12:54:08 +0000 (20:54 +0800)]
ggml : fix kleidiai build (llama/12159)

The libggml API has changed, but this has not been updated.

3 months ago SYCL: Move CPY kernels to a separate file and add few missing kernels (llama/12133)
Akarshan Biswas [Mon, 3 Mar 2025 10:07:22 +0000 (15:37 +0530)]
SYCL: Move CPY kernels to a separate file and add few missing kernels (llama/12133)

* SYCL: refactor and move cpy kernels to a separate file

* Add a few missing cpy kernels

* refactor and add debug logs

3 months agoggml-backend : keep paths in native string type when possible (llama/12144)
Diego Devesa [Sun, 2 Mar 2025 21:11:00 +0000 (22:11 +0100)]
ggml-backend : keep paths in native string type when possible (llama/12144)

3 months agoCUDA: compress mode option and default to size (llama/12029)
Erik Scholz [Sat, 1 Mar 2025 11:57:22 +0000 (12:57 +0100)]
CUDA: compress mode option and default to size (llama/12029)

CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".

3 months agoggml : upgrade init_tensor API to return a ggml_status (llama/11854)
William Tambellini [Fri, 28 Feb 2025 13:41:47 +0000 (05:41 -0800)]
ggml : upgrade init_tensor API to return a ggml_status (llama/11854)

* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml should not abort on OOMs but return an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <redacted>
3 months agovulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595)
Rémy O [Fri, 28 Feb 2025 08:42:52 +0000 (09:42 +0100)]
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595)

* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants

3 months agoCUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (llama/12098)
Johannes Gäßler [Fri, 28 Feb 2025 08:26:43 +0000 (09:26 +0100)]
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (llama/12098)

3 months agoggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064)
Prashant Vithule [Fri, 28 Feb 2025 07:36:12 +0000 (13:06 +0530)]
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064)

* Added SVE Support for Q2_K Quantized Models

* Use 4-space indentation in the switch cases

* removed comments lines

* Remove the loop; retain the curly braces for readability

* Remove the comment line added for the q3_k_q8_k kernel

---------

Co-authored-by: vithulep <redacted>
3 months agoCANN: Fix build error with GCC 13 (llama/11990)
hipudding [Fri, 28 Feb 2025 07:23:47 +0000 (15:23 +0800)]
CANN: Fix build error with GCC 13 (llama/11990)

Remove unused header file that causes compilation failure on ARM
platform with GCC 13.

3 months agovulkan: matmul dequantization improvements (llama/12015)
Eve [Fri, 28 Feb 2025 07:20:08 +0000 (07:20 +0000)]
vulkan: matmul dequantization improvements (llama/12015)

* faster dequant for old quants

* dont use unpack for iq4_nl

* vec2 unpack for q8

3 months agovulkan: improve im2col (llama/11826)
Daniele [Fri, 28 Feb 2025 06:52:51 +0000 (06:52 +0000)]
vulkan: improve im2col (llama/11826)

* vulkan: improve im2col performance

3 months agocmake: Fix ggml backend dependencies and installation (llama/11818)
Vladimir Vuksanovic [Thu, 27 Feb 2025 07:42:48 +0000 (08:42 +0100)]
cmake: Fix ggml backend dependencies and installation (llama/11818)

* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends

3 months agovulkan: fix assertion when qy_needs_dequant (llama/12068)
Jeff Bolz [Tue, 25 Feb 2025 15:30:21 +0000 (09:30 -0600)]
vulkan: fix assertion when qy_needs_dequant (llama/12068)

Looks like a copy/paste bug from qx_needs_dequant.

3 months agoggml-cpu: Fix build with sve (llama/12059)
Molly Sophia [Tue, 25 Feb 2025 11:28:22 +0000 (19:28 +0800)]
ggml-cpu: Fix build with sve (llama/12059)

* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <redacted>
* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
3 months agocuda: unary ops as float + de-duplicate (ggml/1130)
cmdr2 [Mon, 3 Mar 2025 15:21:31 +0000 (20:51 +0530)]
cuda: unary ops as float + de-duplicate (ggml/1130)

3 months agocuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
cmdr2 [Fri, 28 Feb 2025 10:29:55 +0000 (15:59 +0530)]
cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)

* cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold

* vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well)

* f32 sigmoid in vulkan supports op

* Revert "f32 sigmoid in vulkan supports op"

This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.

3 months agocuda/cpu: Increase support for fp16 unary operations (ggml/1125)
cmdr2 [Fri, 28 Feb 2025 07:04:39 +0000 (12:34 +0530)]
cuda/cpu: Increase support for fp16 unary operations (ggml/1125)

* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests

3 months agoTold cmake to install ggml-cpp.h as a public header file. (ggml/1126)
petterreinholdtsen [Wed, 26 Feb 2025 20:44:00 +0000 (21:44 +0100)]
Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)

It is used by the whisper.cpp talk-llama example.

Co-authored-by: Petter Reinholdtsen <redacted>
3 months agocommon : more general m_audio_len update logic (#2855)
Ivy233 [Fri, 7 Mar 2025 08:10:03 +0000 (16:10 +0800)]
common : more general m_audio_len update logic (#2855)

Co-authored-by: Ivy233 <redacted>
3 months agogo : improve model download (#2756)
Ryan Johnson [Fri, 7 Mar 2025 08:03:51 +0000 (02:03 -0600)]
go : improve model download (#2756)

* Updated models download URL

* Updated list of models available

All of the high-efficiency quantized models were rejected when trying to download, even though they exist on the server. Allow them.

* Added a path prefix for whisper-cli in the message to the user. The message was misleading when this script was called from another script in a different folder.

* Undid the download URL change made earlier. Fixed the filepath.Join(urlPath, model) bug.

* Undid download URL change I made earlier.

It seems the old URL works, but only when provided a model to download. It is still unclear why there is a different download URL that also works; the docs should clarify this.

* Fixed URLForModel Function's bug

filepath.Join is designed for filesystem paths, and it uses backslashes (\) on Windows. URLs, however, require forward slashes (/), so the use of filepath.Join is inappropriate for constructing URLs.

The fmt.Sprintf function ensures that forward slashes are used.
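The difference can be sketched in Go; the helper name and URLs below are illustrative, not the actual binding code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// joinModelURL builds a download URL with fmt.Sprintf, so the separator is
// always a forward slash regardless of the host OS. (Illustrative helper.)
func joinModelURL(base, model string) string {
	return fmt.Sprintf("%s/%s", strings.TrimSuffix(base, "/"), model)
}

func main() {
	// filepath.Join targets filesystem paths: on Windows it joins with
	// backslashes, which is wrong for a URL.
	fmt.Println(filepath.Join("https:", "host", "models", "ggml-tiny.bin"))
	// fmt.Sprintf keeps forward slashes on every platform, and trimming a
	// trailing '/' from the base avoids the double-slash bug.
	fmt.Println(joinModelURL("https://example.com/models/", "ggml-tiny.bin"))
}
```

Trimming the trailing slash before joining also covers the double-slash case mentioned below.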

* Fixed URL trailing / double slash bug

Ensure no double slash by trimming trailing '/' from srcUrl if present

* Fixed bad download URL, missing ggml prefix

Not sure if that was a bug I introduced but it was trying to download without the prefix.

* Added question before downloading all models. Added download size estimate

HEAD requests: efficiently fetch file sizes without downloading the content.
Interactive workflow: allows the user to make an informed decision about downloading all models.
Safe defaults: aborts if the user does not explicitly confirm.
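A rough Go sketch of the HEAD-request approach (the URL is a placeholder and `remoteSize` an illustrative helper; the real downloader may differ):

```go
package main

import (
	"fmt"
	"net/http"
)

// remoteSize asks the server for a file's size with a HEAD request,
// so nothing is actually downloaded.
func remoteSize(url string) (int64, error) {
	resp, err := http.Head(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("HEAD %s: %s", url, resp.Status)
	}
	// ContentLength is parsed from the Content-Length response header;
	// it is -1 when the server does not report a length.
	return resp.ContentLength, nil
}

func main() {
	if size, err := remoteSize("https://example.com/models/ggml-tiny.bin"); err == nil {
		fmt.Printf("download size: %.1f MiB\n", float64(size)/(1024*1024))
	}
}
```

Summing these sizes over the model list gives the estimate shown before asking for confirmation.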

* Fixed Unbuffered channel warning.

warning in context.go : misuse of unbuffered os.Signal channel as argument to signal.

The warning indicates that the unbuffered channel passed to signal.Notify in context.go may be misused. signal.Notify does not block when sending to the channel, so a signal that arrives while no receiver is ready is silently dropped.
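A minimal Go sketch of the fix (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newInterruptChannel registers for SIGINT on a buffered channel.
// The buffer matters: signal.Notify does not block when delivering, so with
// an unbuffered channel a signal arriving while the receiver is busy is
// silently dropped. A buffer of 1 is the idiomatic fix flagged by `go vet`.
func newInterruptChannel() chan os.Signal {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	return sig
}

func main() {
	sig := newInterruptChannel()
	// POSIX-only demonstration: deliver SIGINT to ourselves to show the
	// buffered channel holds the signal even before anyone is receiving.
	syscall.Kill(os.Getpid(), syscall.SIGINT)
	fmt.Println("received:", <-sig)
}
```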

* Fixed download size calculation, download URL prefix bug, added link to models URL for user.

The URL formatter was prepending the model name to the already formatted model name in the URL.

* Added logs and exes to gitignore

* Delete bindings/go/examples/go-model-download/go-model-download.exe

* Delete whisper_build.log

3 months agocommon : fix audio loading by miniaudio (#2862)
Dmitry Atamanov [Tue, 4 Mar 2025 17:05:21 +0000 (22:05 +0500)]
common : fix audio loading by miniaudio (#2862)

3 months agofix: missing include common-whisper (#2858)
Lin Xiaodong [Sun, 2 Mar 2025 18:55:11 +0000 (02:55 +0800)]
fix: missing include common-whisper (#2858)

3 months agoruby : follow audio library change (#2851)
KITAITI Makoto [Fri, 28 Feb 2025 06:09:02 +0000 (15:09 +0900)]
ruby : follow audio library change (#2851)

* Enable CPU

* Follow audio lib change

4 months agowhisper : support GGML_BACKEND_DL (#2843)
Diego Devesa [Thu, 27 Feb 2025 12:35:07 +0000 (13:35 +0100)]
whisper : support GGML_BACKEND_DL (#2843)

* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <redacted>
4 months agocommon : separate whisper sources (#2846)
Georgi Gerganov [Thu, 27 Feb 2025 10:50:32 +0000 (12:50 +0200)]
common : separate whisper sources (#2846)

* common : separate whisper sources

* examples : add chrono

* examples : add more headers

4 months agocommon : fix build min/max (#2845)
Georgi Gerganov [Thu, 27 Feb 2025 08:39:13 +0000 (10:39 +0200)]
common : fix build min/max (#2845)

* common : try to fix build

* cont : try another fix

4 months agoexamples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759)
Dmitry Atamanov [Thu, 27 Feb 2025 07:06:54 +0000 (12:06 +0500)]
examples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759)

4 months agostream : stop on ^C when no audio is received (#2822)
petterreinholdtsen [Thu, 27 Feb 2025 06:59:51 +0000 (07:59 +0100)]
stream : stop on ^C when no audio is received (#2822)

Add a check for Ctrl-C in the potentially endless loop that calls audio.get()
to receive sound.

Co-authored-by: Petter Reinholdtsen <redacted>
4 months agosync : ggml
Georgi Gerganov [Wed, 26 Feb 2025 20:39:12 +0000 (22:39 +0200)]
sync : ggml

4 months agoSupport pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml...
cmdr2 [Tue, 25 Feb 2025 12:36:34 +0000 (18:06 +0530)]
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div

4 months agometal : copy kernels for quant to F32/F16 conversions (llama/12017)
Gian-Carlo Pascutto [Tue, 25 Feb 2025 09:27:58 +0000 (10:27 +0100)]
metal : copy kernels for quant to F32/F16 conversions (llama/12017)

metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <redacted>
4 months agoopencl: fix for small models (llama/11950)
lhez [Mon, 24 Feb 2025 21:47:07 +0000 (13:47 -0800)]
opencl: fix for small models (llama/11950)

* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <redacted>
Co-authored-by: Skyler Szot <redacted>
4 months agoOptimize mul_mat for Q4_0 on Intel GPU (llama/12035)
Neo Zhang Jianyu [Mon, 24 Feb 2025 14:33:23 +0000 (22:33 +0800)]
Optimize mul_mat for Q4_0 on Intel GPU (llama/12035)

* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* Support optimizing the graph once at compute time; record the optimization status in tensor->extra; make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debug

* Use syclex::architecture to replace the custom hw define; update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* Move getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <redacted>
4 months agoSYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)
Akarshan Biswas [Mon, 24 Feb 2025 10:18:25 +0000 (15:48 +0530)]
SYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)

4 months agoggml-cpu: Support s390x SIMD Instruction Set (llama/12019)
Aaron Teo [Sat, 22 Feb 2025 21:39:24 +0000 (05:39 +0800)]
ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)

* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <redacted>
* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <redacted>
* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <redacted>
* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <redacted>
* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <redacted>
* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <redacted>
* ggml: remove test.py

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <redacted>
* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <redacted>
* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <redacted>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <redacted>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <redacted>
* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <redacted>
* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <redacted>
* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <redacted>
* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <redacted>
* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <redacted>
* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <redacted>
* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <redacted>
* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <redacted>
* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <redacted>
* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <redacted>
* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <redacted>
* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Jinyang He <redacted>
Co-authored-by: junchao-zhao <redacted>
4 months agoCUDA: add option to compile without FlashAttention (llama/12025)
Johannes Gäßler [Sat, 22 Feb 2025 19:44:34 +0000 (20:44 +0100)]
CUDA: add option to compile without FlashAttention (llama/12025)

4 months agoCUDA: optimize FA for GQA + large batches (llama/12014)
Johannes Gäßler [Sat, 22 Feb 2025 11:20:17 +0000 (12:20 +0100)]
CUDA: optimize FA for GQA + large batches (llama/12014)

4 months agocuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (llama/12000)
Gian-Carlo Pascutto [Sat, 22 Feb 2025 08:43:24 +0000 (09:43 +0100)]
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (llama/12000)

4 months agoCUDA: correct the lowest Maxwell supported by CUDA 12 (llama/11984)
PureJourney [Fri, 21 Feb 2025 11:21:05 +0000 (19:21 +0800)]
CUDA: correct the lowest Maxwell supported by CUDA 12 (llama/11984)

* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <redacted>
4 months agoMUSA: support ARM64 and enable dp4a etc. (llama/11843)
Bodhi [Fri, 21 Feb 2025 07:46:23 +0000 (15:46 +0800)]
MUSA: support ARM64 and enable dp4a etc. (llama/11843)

* MUSA: support ARM64 and enable __dp4a etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <redacted>
4 months agoggml-cpu: Add CPU backend support for KleidiAI library (llama/11390)
Charles Xu [Thu, 20 Feb 2025 13:06:51 +0000 (14:06 +0100)]
ggml-cpu: Add CPU backend support for KleidiAI library (llama/11390)

* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environmental variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list

4 months agoggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917)
Prashant Vithule [Thu, 20 Feb 2025 10:08:32 +0000 (15:38 +0530)]
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917)

* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved formatting of code in the ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spacing

---------

Co-authored-by: vithulep <redacted>
Co-authored-by: Georgi Gerganov <redacted>
4 months agoCUDA: use async data loading for FlashAttention (llama/11894)
Johannes Gäßler [Mon, 17 Feb 2025 13:03:24 +0000 (14:03 +0100)]
CUDA: use async data loading for FlashAttention (llama/11894)

* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <redacted>
4 months agovulkan: implement several ops relevant for ggml_opt (llama/11769)
Rémy O [Mon, 17 Feb 2025 06:55:57 +0000 (07:55 +0100)]
vulkan: implement several ops relevant for ggml_opt (llama/11769)

* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command

4 months agovulkan: support multi/vision rope, and noncontiguous rope (llama/11902)
Jeff Bolz [Sun, 16 Feb 2025 07:52:23 +0000 (01:52 -0600)]
vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)

4 months agometal : fix the crash caused by the lack of residency set support on Intel Macs....
Hale Chan [Sun, 16 Feb 2025 06:50:26 +0000 (14:50 +0800)]
metal : fix the crash caused by the lack of residency set support on Intel Macs. (llama/11904)

4 months agometal : optimize dequant q6_K kernel (llama/11892)
Adrian Kretz [Sat, 15 Feb 2025 18:39:20 +0000 (19:39 +0100)]
metal : optimize dequant q6_K kernel (llama/11892)

4 months agorepo : update links to new url (llama/11886)
Georgi Gerganov [Sat, 15 Feb 2025 14:40:57 +0000 (16:40 +0200)]
repo : update links to new url (llama/11886)

* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci

4 months agovulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528)
Rémy O [Sat, 15 Feb 2025 08:01:40 +0000 (09:01 +0100)]
vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528)

* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem

4 months agoopencl: Fix rope and softmax (llama/11833)
lhez [Fri, 14 Feb 2025 19:12:23 +0000 (11:12 -0800)]
opencl: Fix rope and softmax (llama/11833)

* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`

4 months agocuda : add ampere to the list of default architectures (llama/11870)
Diego Devesa [Fri, 14 Feb 2025 14:33:52 +0000 (15:33 +0100)]
cuda : add ampere to the list of default architectures (llama/11870)

4 months agoggml: optimize some vec dot functions for LoongArch ASX (llama/11842)
Jinyang He [Fri, 14 Feb 2025 08:54:27 +0000 (16:54 +0800)]
ggml: optimize some vec dot functions for LoongArch ASX (llama/11842)

* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX

4 months agovulkan: linux builds + small subgroup size fixes (llama/11767)
Eve [Fri, 14 Feb 2025 02:59:40 +0000 (02:59 +0000)]
vulkan: linux builds + small subgroup size fixes (llama/11767)

* mm subgroup size

* upload vulkan x86 builds

4 months agollamafile: use member variable instead of constant for iq4nlt (llama/11780)
Jeffrey Morgan [Thu, 13 Feb 2025 17:05:04 +0000 (09:05 -0800)]
llamafile: use member variable instead of constant for iq4nlt (llama/11780)

4 months agomusa: bump MUSA SDK version to rc3.1.1 (llama/11822)
R0CKSTAR [Thu, 13 Feb 2025 12:28:18 +0000 (20:28 +0800)]
musa: bump MUSA SDK version to rc3.1.1 (llama/11822)

* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
4 months agoggml-cpu : add chunking support to mul_mat_id (llama/11666)
Diego Devesa [Thu, 13 Feb 2025 00:02:38 +0000 (01:02 +0100)]
ggml-cpu : add chunking support to mul_mat_id (llama/11666)

* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes

4 months agoggml : x2 speed for WASM by optimizing SIMD (llama/11453)
Xuan-Son Nguyen [Wed, 12 Feb 2025 23:33:45 +0000 (00:33 +0100)]
ggml : x2 speed for WASM by optimizing SIMD (llama/11453)

* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <redacted>
* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <redacted>
---------

Co-authored-by: camel-cdr <redacted>
4 months agoHIP: Remove GCN from list of devices that avoid MMQ (llama/11831)
uvos [Wed, 12 Feb 2025 21:25:28 +0000 (22:25 +0100)]
HIP: Remove GCN from list of devices that avoid MMQ (llama/11831)

4 months agoHIP: Switch to std::vector in rocblas version check (llama/11820)
uvos [Wed, 12 Feb 2025 16:25:03 +0000 (17:25 +0100)]
HIP: Switch to std::vector in rocblas version check (llama/11820)

4 months agocleanup: fix compile warnings associated with gnu_printf (llama/11811)
bandoti [Wed, 12 Feb 2025 14:06:53 +0000 (10:06 -0400)]
cleanup: fix compile warnings associated with gnu_printf (llama/11811)

4 months agoggml : fix multi-threaded clamp_f32 (llama/11824)
Richard [Wed, 12 Feb 2025 13:57:33 +0000 (13:57 +0000)]
ggml : fix multi-threaded clamp_f32 (llama/11824)

* Bug fix for clamp_f32

When using tensors larger than 1-D, the clamp operation does not work due to the kernel returning early when ith is not 0.

* Bug fix for clamp_f32

* Bug fix for clamp_f32

4 months agoggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817)
Weizhao Ouyang [Wed, 12 Feb 2025 12:22:58 +0000 (20:22 +0800)]
ggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817)

Signed-off-by: Weizhao Ouyang <redacted>