Daniel Bevenius [Mon, 31 Mar 2025 09:34:40 +0000 (11:34 +0200)]
ci : add github pages workflow for wasm examples (#2969)
* ci : add github pages workflow for wasm examples
This commit adds a github workflow to build and deploy the wasm examples
to github pages. The whisper.wasm example is deployed as the main page.
This workflow is trigged by a push to master and will deploy the
examples to: https://ggerganov.github.io/whisper.cpp/.
This requires that the repository has enabled github actions in
`Settings` -> `Pages` -> `Build and deployment` -> `Source` be set to
`GitHub Actions`.
One thing to note is that this commit removes the `talk` example as I'm
not sure how this example is built yet.
Icenowy Zheng [Fri, 28 Mar 2025 17:51:06 +0000 (01:51 +0800)]
vulkan: fix coopmat shader generation when cross-compiling (llama/12272)
* vulkan: fix coopmat shader generation when cross-compiling
Previously the status of coopmat{,2} support isn't passed to the
vulkan-shaders-gen project building on the host, which leads to build
failure because of the cross-compiling code expecting coopmat{,2}
shaders that didn't get generated.
Fix this by passing the coopmat{,2} support status to vulkan-shaders
subproject.
Signed-off-by: Icenowy Zheng <redacted>
* Only call coop-mat shaders once
amritahs-ibm [Fri, 28 Mar 2025 07:43:22 +0000 (13:13 +0530)]
llamafile : ppc64le GEMV forwarding for FP32. (llama/12594)
This patch enables usage of MMA when one of the
dimensions of the matrix(ie either M or N) is 1. This
is useful in case of token generation where N < 2.
The concept of 'GEMV Forwarding' is used where when one
of the matrix has a single row/column, the elements are
broadcasted, instead of using packing routine to prepack
the matrix elements.
This change results in 5% - 15% improvement in total
speed(ie all tokens/total time), across various batch
sizes. This is in comparision with the corresponding
dot product implementation.
The patch is tested with FP32 models of Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine.
Amanda Der Bedrosian [Fri, 28 Mar 2025 11:26:22 +0000 (04:26 -0700)]
bindings.go : add DetectedLanguage to go bindings (#2947)
Adding in DetectedLanguage(), a function to retrieve the detected
language that's populated by processing audio. Also adding in a unit
test to test the success.
Daniel Bevenius [Fri, 28 Mar 2025 08:29:56 +0000 (09:29 +0100)]
ruby : fix test failures in test_whisper (#2955)
* bindings.ruby : fix test failures in test_whisper
This commit updates the parallel tests to use 2 processors instead of
the number of processors on the system. It also comments out the setting
of the log callback to an empty lambda as this causes a segfault when
enabled.
The motivation for the change to the number of processors is that if one
has a large number of processors, for example I have 16 on the machine I
used to test this, this would cause the following warning to be printed:
```console
whisper_full_with_state: input is too short - 680 ms < 1000 ms. consider padding the input audio with silence
```
This is logged from:
```c++
int whisper_full_with_state(
struct whisper_context * ctx,
struct whisper_state * state,
struct whisper_full_params params,
const float * samples,
int n_samples) {
...
if (seek_end < seek_start + 100) {
WHISPER_LOG_WARN("%s: input is too short - %d ms < 1000 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)*10);
return 0;
}
```
This will return early and there will be segment callbacks to be invoked
which in turn will cause the tests to fail.
* bindings.ruby : fix warnings in tests
This commit fixes the following warnings in the Ruby tests:
```console
/whisper/bindings/ruby/tests/test_segment.rb:52:
warning: ambiguity between regexp and two divisions:
wrap regexp in parentheses or add a space after `/' operator
```
And also adds a '_' prefix to some unused variables to avoid warnings.
* bindings.ruby : enable Wisper.log_set in tests
The commit reverts the commenting out of the Whisper.log_set call in
the test_whisper.rb tests.
I'm no longer getting segfaults when running the tests with this
which was the case earlier. One theory could be that I rebased this to
include the latest ggml sync to master to make sure things still worked.
With the latest changes in ggml, I can't reproduce the segfaults.
amritahs-ibm [Thu, 27 Mar 2025 06:51:47 +0000 (12:21 +0530)]
llamafile : ppc64le MMA implementation for Q4_0. (llama/12489)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.
This change results in 5% - 50% improvement
in total speed(ie all tokens/total time), across
various batch sizes.
The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.
Jeff Bolz [Mon, 24 Mar 2025 06:56:17 +0000 (01:56 -0500)]
vulkan: fix mul_mat_vec failure in backend tests (llama/12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
Jeff Bolz [Sat, 22 Mar 2025 08:40:11 +0000 (03:40 -0500)]
vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
Gaurav Garg [Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)]
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Jeff Bolz [Wed, 19 Mar 2025 07:26:26 +0000 (02:26 -0500)]
vulkan: Submit once enough matmul work has been recorded (llama/12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
Gaurav Garg [Mon, 17 Mar 2025 18:25:13 +0000 (23:55 +0530)]
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
* Enable CUDA Graph on CTK < 12.x
`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.
Christian Kastner [Mon, 17 Mar 2025 09:05:23 +0000 (10:05 +0100)]
cmake : enable building llama.cpp using system libggml (llama/12321)
* cmake: Factor out compiler flag function from ggml
llama.cpps's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).
* cmake: Enable building against system ggml
This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
uvos [Wed, 12 Mar 2025 09:14:11 +0000 (10:14 +0100)]
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to
selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need
to avoid launching them with parameters for warp64
Henry Linjamäki [Mon, 10 Mar 2025 16:57:00 +0000 (18:57 +0200)]
opencl: use OpenCL C standard supported by the device (llama/12221)
This patch nudges the llama.cpp a bit to be supported on PoCL which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
Christian Kastner [Mon, 10 Mar 2025 12:06:21 +0000 (13:06 +0100)]
cmake: Comment out GGML_BIN_DIR for now (ggml/1139)
Nothing installs to it yet, so when attempting to use the cmake package,
set_and_check() triggers an error if the directory doesn't already exist
for other reasons.
Daniel Bevenius [Wed, 26 Mar 2025 15:21:07 +0000 (16:21 +0100)]
bindings-go : update Makefile to use cmake (#2952)
This commit updates the Makefile to use cmake instead of make to build
whisper.cpp.
The motivation for this change is that currently the make recipe test
will fail with the following error:
```console
$ make test
Mkdir build
Mkdir models
Build whisper
make[1]: Entering directory '/home/danbev/work/ai/whisper-work'
make[1]: *** No rule to make target 'libwhisper.a'. Stop.
make[1]: Leaving directory '/home/danbev/work/ai/whisper-work'
make: *** [Makefile:33: whisper] Error 2
```
This commit adds a dependency on the copyLibs task to the sourcesJar and
jar tasks. This ensures that the libwhisper.so file is copied to the
correct location before the jar is built.
It also sets the executable bit on the gradlew file.
* bindings.java : add copyLibs dep for processResources [no ci]
This will otherwise cause builds to fail after doing an initial build.
* bindings.java : pass structs by value to native code
This commit refactors the code to pass the structs by value to the
native code. This is done by creating a ByValue class for each struct
and using it in the Java code.
The motivation for this change is that without this application crashes
due to what I believe was memory mis-alignement. When the structs were
passed to the native code they would be att different memory locations.
Passing by value overcomes this issue and considering that the structs
hold parementers (context and full params) it might be alright do to
this. These changes allow all the tests to pass.
* bindings.java : fix javadoc warnings [no ci]
* bindings.java : fix libwhisper.dylib path in build.gradle [no ci]
This commit fixes the copyLibwhisperDynlib task in the build.gradle file
to copy the correct libwhisper.dylib file from build/src.
Daniel Bevenius [Wed, 26 Mar 2025 13:49:12 +0000 (14:49 +0100)]
bindings.javascript : update test instructions [no ci] (#2951)
This commit updates the instructions for running the test in the
JavaScript bindings README file.
The motivation for this is for Node.js versions after v16.4.0 the
`--experimental-wasm-threads` and `--experimental-wasm-simd` flags are
no longer required and they generate the following errors:
```console
$ node --experimental-wasm-threads --experimental-wasm-simd ../tests/test-whisper.js
node: bad option: --experimental-wasm-threads
node: bad option: --experimental-wasm-simd
```
Daniel Bevenius [Tue, 25 Mar 2025 15:01:59 +0000 (16:01 +0100)]
whisper.android.java : update build with ggml source changes (#2942)
* whisper.android.java : update build with ggml source changes
This commit updates the whisper.android.java build to include the
new ggml source files and directories. The gradle build configuration is
also updated to include the aliyun maven repository.
Daniel Bevenius [Mon, 24 Mar 2025 13:42:12 +0000 (14:42 +0100)]
examples : reduce initial memory to 512MB (#2939)
* examples : reduce initial memory to 512MB
This commit reduces the initial memory size to 512MB. This is done to
to avoid WebAssembly memory allocation issues on some platforms. It also
adds a flag to allow the memory to grow dynamically (up to the maximum).
The motivation for this change is that currently the initial memory is
set to 2GB which might be to large for some platforms. This will lead to
an error being thrown from the JavaScript code generated by Emscripten
when trying to allocate memory. More details can be found in the
referenced issue below.
* examples : set MAXIMUM_MEMORY instead of TOTAL_MEMORY
This commit sets MAXIMUM_MEMORY instead of TOTAL_MEMORY in the
whisper.wasm example.
The motivation for this is that TOTAL_MEMORY and INITIAL_MEMORY are
actually the same thing. Instead we want to set MAXIMUM_MEMORY to
2GB.
Daniel Bevenius [Mon, 24 Mar 2025 13:40:00 +0000 (14:40 +0100)]
examples : fix nthread parsing in whisper.wasm (#2938)
This commit fixes the nthread parsing in the whisper.wasm example when
using the `Threads` slider to change the number of threads to be used.
Currently this results in the following error:
```console
main.js:5597 Uncaught TypeError: Cannot convert "5" to int
at checkAssertions (main.js:5597:21)
at Object.toWireType (main.js:5611:15)
at Object.full_default (eval at new_ (main.js:5292:27), <anonymous>:10:26)
at whisper.wasm/:649:42
```
Daniel Bevenius [Mon, 24 Mar 2025 13:33:45 +0000 (14:33 +0100)]
examples : fix request path for local worker files (#2937)
This commit adds a fix to the server.py file to handle requests for
web worker files when running the local python server to test the wasm
examples.
The motivation for this is that currently the server is serving files
from the build-em/bin directory which is where the .worker.js files
exist. But when examples access these resources they do so with the
application context path, for example /whisper.wasm/libmain.worker.js
but this will not be found as it currently works.
Daniel Bevenius [Mon, 24 Mar 2025 08:53:38 +0000 (09:53 +0100)]
ggml : add logging for native build options/vars (#2935)
This commit adds debug level logging for the native build options and
variables to ggml/CMakeLists.txt.
The motivation for this is that it can be useful to see the effective
result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a
cmake build. I've found myself adding similar logging a few times now,
so I thought it might be a good idea to add this.
Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when
running cmake produces the following output:
```console
-- GGML_NATIVE : OFF
-- GGML_NATIVE_DEFAULT : OFF
-- INS_ENB : OFF
```
Peter [Mon, 24 Mar 2025 08:39:50 +0000 (19:39 +1100)]
whisper : enhance model download scripts functionality and resolve compiler warning (#2925)
* whisper : improve whisper-cli executable path detection in model download shell scripts
If whisper-cli is found on the path, do not suggest invoking from build directory. This improves flexibility and usability for distribution and packaging scenarios.
* whisper : enhance Windows model download batch script to have comparable functionality and behaviour as shell scripts
* Download models to the current directory if the script is executed from the \bin\ directory (for future distribution scenarios where the script is in the \bin\ subdirectory of a Windows build)
* Add model_path command line argument
* If whisper-cli is found on the path, do not suggest invoking from build directory
* whisper : resolve compiler warning by removing duplicate definition of NOMINMAX in whisper-cli code
Daniel Bevenius [Mon, 24 Mar 2025 08:36:07 +0000 (09:36 +0100)]
whisper : initialize decoder's rng with unique seed (#2932)
This change initializes each decoder's random number generator with a
unique seed.
The motivation for this is that currently all decoders are initialized
with the same seed value, 0. The result of this is that for the same
state (logits, probs, and logprobs) they will produce the same output.
Daniel Bevenius [Sat, 22 Mar 2025 14:40:28 +0000 (15:40 +0100)]
ci : remove CMAKE_CUDA_ARCHITECTURES in windows-cublas (#2923)
This commit removes the -DCMAKE_CUDA_ARCHITECTURES=all flag from the
windows-cublas job in the build.yml file.
The motivation for this is that building for all architectures is
unnecessary and takes a long time. Without this flag the architectures
will instead be set by ggml-cuda.
Peter [Sat, 22 Mar 2025 14:27:57 +0000 (01:27 +1100)]
whisper : update default model download directory behavior to use current working directory when script is in /bin/ directory (#2924)
This change ensures that when the script is packaged and distributed, models are downloaded to the current directory instead of the script's location, preventing conflicts with system directories. This improves flexibility and usability for distribution and packaging scenarios.
Daniel Bevenius [Fri, 21 Mar 2025 09:31:55 +0000 (10:31 +0100)]
readme : update Python version to 3.11 for Core ML support [no -ci] (#2919)
This commit updates the recommended version of Python to 3.11 for Core
ML conversion support. It also adds the `-e` flag to the
`generate-coreml-model.sh` script to ensure that the script exits on the
first error.
The motivation for this that when following the installation instructions
using Python 3.10 I get the following error:
```console
(venv) $ ./models/generate-coreml-model.sh base.en
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last): File "/whisper-work/models/convert-whisper-to-coreml.py", line 2, in <module>
import torch
File "/whisper-work/venv/lib/python3.10/site-packages/torch/__init__.py", line 870, in <module>
from . import _masked
File "/whisper-work/venv/lib/python3.10/site-packages/torch/_masked/__init__.py", line 420, in <module>
def sum(input: Tensor,
File "/whisper-work/venv/lib/python3.10/site-packages/torch/_masked/__init__.py", line 223, in _apply_docstring_templates
example_input = torch.tensor([[-3, -2, -1], [0, 1, 2]])
/whisper-work/venv/lib/python3.10/site-packages/torch/_masked/__init__.py:223: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /Users/distiller/project/pytorch/torch/csrc/utils/tensor_numpy.cpp:68.)
example_input = torch.tensor([[-3, -2, -1], [0, 1, 2]])
Minimum required torch version for importing coremltools.optimize.torch is 2.1.0. Got torch version 1.11.0.
Traceback (most recent call last):
File "/whisper-work/models/convert-whisper-to-coreml.py", line 4, in <module>
import coremltools as ct
File "/whisper-work/venv/lib/python3.10/site-packages/coremltools/__init__.py", line 120, in <module>
from . import converters, models, optimize, proto
File "/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/__init__.py", line 7, in <module>
from . import libsvm, sklearn, xgboost
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/xgboost/__init__.py", line 6, in <module>
from ._tree import convert
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/xgboost/_tree.py", line 9, in <module>
from ._tree_ensemble import convert_tree_ensemble as _convert_tree_ensemble
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/xgboost/_tree_ensemble.py", line 11, in <module>
from ...models.tree_ensemble import TreeEnsembleClassifier
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/models/__init__.py", line 6, in <module>
from . import (
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/models/ml_program/__init__.py", line 6, in <module>
from . import compression_utils
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/models/ml_program/compression_utils.py", line 8, in <module>
from coremltools.converters.mil.mil import Operation as _Operation
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/__init__.py", line 7, in <module>
from .frontend.tensorflow.tf_op_registry import register_tf_op
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/frontend/__init__.py", line 6, in <module>
from . import tensorflow, tensorflow2, torch
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/__init__.py", line 11, in <module>
from . import ops, quantization_ops
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/ops.py", line 36, in <module>
from .internal_graph import InternalTorchIRGraph, InternalTorchIRNode
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/internal_graph.py", line 15, in <module>
from .exir_utils import extract_io_from_exir_program
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.10/site-packages/coremltools/converters/mil/frontend/torch/exir_utils.py", line 99, in <module>
) -> Dict[str, torch.fx.Node]:
AttributeError: module 'torch' has no attribute 'fx'
```
Using Python3.11 the conversion script runs without any errors.
Daniel Bevenius [Fri, 21 Mar 2025 08:52:53 +0000 (09:52 +0100)]
examples : update whisper.objc README.md (#2916)
This commit updates the hisper.objc README.md to reflect the changes of
using the xcframework and the new build process.
Since whisper.cpp is no longer compiled by the example project, instead
the library from the xframework will be used, the build instructions
have been removed.
Daniel Bevenius [Fri, 21 Mar 2025 07:19:24 +0000 (08:19 +0100)]
ci : increase windows-cublas evict-old-files to 5d (#2915)
This commit updates the evict-old-files parameter for the windows-cublas
build job to 5 days.
The motivation for this change is to avoid the full rebuild which takes
around 1.5 hours for the windows-cublas build job. Considering that
there are periods of low traffic on whisper.cpp (like weekends etc.) it
might be better to have a longer eviction policy to avoid the full
rebuild.
Daniel Bevenius [Thu, 20 Mar 2025 17:39:08 +0000 (18:39 +0100)]
xcframework : add support for CoreML to ios/macOS (#2912)
* xcframework : add support for CoreML to ios/macOS
This commit add support for compiling whisper with CoreML support for
iOS and macOS.
The motivation for this change is it will allow users to use a Core ML
model or fall back to a ggml model if Core ML is not available.
With the updated xcframework, I was able to run the whisper.objc example
and successfully load a Core ML model:
```console
whisper_init_state: loading Core ML model from '/Users/danbev/Library/Developer/CoreSimulator/Devices/25E8C27D-0253-4281-AF17-C3F2A4D1D8F4/data/Containers/Bundle/Application/B81F6FF0-BF1A-40DF-AC2A-3908EC4BCC9A/whisper.objc.app/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
```
* squash! xcframework : add support for CoreML to ios/macOS
Daniel Bevenius [Thu, 20 Mar 2025 17:36:02 +0000 (18:36 +0100)]
examples : add WHISPER_SDL2 check to deprecation executables (#2911)
This commit adds a check for `WHISPER_SDL2` to the deprecation warning
examples. This is to prevent the examples from being built when
WHISPER_SDL2 is not enabled.
The motivation for this is that currently these deprecation executables
are generate and when run they refer the user to examples with other
names, for example `whisper-command` but unless they have built with
`WHISPER_SDL2` those executable will not be present:
```console
$ ls build/bin/
bench command main quantize stream whisper-bench whisper-cli
whisper-server
$ ./build/bin/command
WARNING: The binary 'command' is deprecated.
Please use 'whisper-command' instead.
See https://github.com/ggerganov/whisper.cpp/tree/master/examples/deprecation-warning/README.md for more information.
```
Daniel Bevenius [Thu, 20 Mar 2025 16:01:48 +0000 (17:01 +0100)]
ci : use ninja and fix caching for windows-cublas (#2910)
This commit updates the windows-cublas job to use Ninja as the build
system instead of msbuild/msvc.
The motivation for this is that msbuild/mscv does not seem to handle
ccache/sccache well, for example it ignores the
`CMAKE_C_COMPILER_LAUNCHER` etc. variables. But using Ninja as the build
caching works and the build is initially the same speed as it is
currently (without caching) subsequently builds are much faster.