Jeff Bolz [Sun, 6 Apr 2025 08:47:13 +0000 (03:47 -0500)]
vulkan: Use unclamped loads for flash attention mask (llama/12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
Jeff Bolz [Fri, 4 Apr 2025 05:54:35 +0000 (00:54 -0500)]
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630)
There seems to be a bubble when waking up from waitForFences, which costs a few
percent of performance and also increases variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
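A minimal sketch of the hybrid wait pattern described above, assuming vulkan-hpp handles and an x86 target for `_mm_pause` (fence names are illustrative, not the actual ggml-vulkan symbols):
```c++
#include <immintrin.h>          // _mm_pause (x86 only)
#include <vulkan/vulkan.hpp>

// Block cheaply on the early fence, then spin for the short remainder to
// avoid the wake-up bubble on the final fence.
void hybrid_wait(vk::Device device, vk::Fence almost_ready, vk::Fence final_fence) {
    // almost_ready is submitted when roughly 80% of the graph has executed
    (void)device.waitForFences(almost_ready, VK_TRUE, UINT64_MAX);

    while (device.getFenceStatus(final_fence) == vk::Result::eNotReady) {
        _mm_pause();            // poll without burning issue slots
    }
}
```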
CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738)
* Prefer vector flash decoding kernel for Gemma models
Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <redacted>
---------
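A hedged sketch of the kind of dispatch heuristic this changes (the function and names are hypothetical, not the actual ggml-cuda code):
```c++
// Hypothetical kernel-selection heuristic: in the generation phase (a single
// query token), prefer the vector flash decoding kernel. Previously head
// size 256 (used by Gemma) was excluded and fell back to a slower path.
static bool prefer_vec_flash_decoding(int head_size, int n_query_tokens) {
    const bool is_decode = n_query_tokens == 1;
    return is_decode && (head_size == 64 || head_size == 128 || head_size == 256);
}
```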
Alan Gray [Thu, 3 Apr 2025 01:31:15 +0000 (02:31 +0100)]
Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers
Previously there was complexity in the CUDA graphs implementation due to
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.
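A hedged sketch of the indirection (names hypothetical): the kernel parameter baked into the graph is a fixed pointer to a device-side table, and only the table's contents change between launches:
```c++
#include <cuda_runtime.h>

// The copy kernel dereferences a device-resident table of destination
// pointers, so its parameters never change and the graph needs no update.
__global__ void copy_kernel(const char * src, char ** dest_ptrs, int idx, size_t n) {
    char * dst = dest_ptrs[idx];   // indirect K/V destination
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        dst[i] = src[i];
    }
}

// Before replaying the graph, refresh the table instead of patching nodes.
static void set_dest_ptr(char ** dev_table, int idx, char * dst, cudaStream_t s) {
    cudaMemcpyAsync(dev_table + idx, &dst, sizeof(char *),
                    cudaMemcpyHostToDevice, s);
}
```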
Jeff Bolz [Wed, 2 Apr 2025 19:25:08 +0000 (14:25 -0500)]
vulkan: Implement split_k for coopmat2 flash attention. (llama/12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
Previously we would run 32 workgroups computing 1 result each; now we will
run 8 workgroups computing 4 results each.
This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
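For reference, a hedged sketch of the split_k reduction math (the actual change does this in a coopmat2 shader; this scalar version only illustrates how per-split partials combine):
```c++
#include <algorithm>
#include <cmath>
#include <vector>

// Each split covers a slice of the KV cache and yields, per query row, a
// local max m, an exp-sum l, and an unnormalized output accumulator O.
struct Partial {
    float m;
    float l;
    std::vector<float> O;   // length = head dimension
};

// Rescale every partial to the common max, then normalize once.
std::vector<float> reduce_split_k(const std::vector<Partial> & parts) {
    float m = -INFINITY;
    for (const auto & p : parts) m = std::max(m, p.m);

    std::vector<float> O(parts[0].O.size(), 0.0f);
    float l = 0.0f;
    for (const auto & p : parts) {
        const float scale = std::exp(p.m - m);
        l += p.l * scale;
        for (size_t i = 0; i < O.size(); ++i) O[i] += p.O[i] * scale;
    }
    for (auto & x : O) x /= l;
    return O;
}
```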
Daniel Bevenius [Wed, 23 Apr 2025 06:24:38 +0000 (08:24 +0200)]
coreml : set convert_to="mlprogram" in convert
* coreml : skip model load in convert-whisper-to-coreml.py
This commit updates the conversion process for Whisper models to use the
"mlprogram" format instead of "neuralnetwork".
The motivation for this change is that the "neuralnetwork" format
produces a protobuf-based model, and my understanding is that this
format has limitations, such as limits on string sizes and on the
complexity of the model.
Currently, converting larger models such as large-v3 fails, while
smaller models convert successfully.
The "mlprogram" format is a more recent addition to CoreML and is
designed to be more flexible and powerful, allowing for more complex
models and larger data types. This seems to work for larger and smaller
models alike, and unless there are considerations that I'm not aware of,
I think this is what we should be using moving forward.
The error that is generated for large models is the following:
```console
Running MIL backend_neuralnetwork pipeline: 100%|█████████| 9/9 [00:00<00:00, 35.44 passes/s]
Translating MIL ==> NeuralNetwork Ops: 100%|███████████| 5641/5641 [03:31<00:00, 26.65 ops/s]
Traceback (most recent call last):
File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 322, in <module>
encoder = convert_encoder(hparams, encoder, quantize=args.quantize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/models/convert-whisper-to-coreml.py", line 255, in convert_encoder
model = ct.convert(
^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/_converters_entry.py", line 635, in convert
mlmodel = mil_convert(
^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 186, in mil_convert
return _mil_convert(
^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 245, in _mil_convert
return modelClass(
^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 489, in __init__
self.__proxy__, self._spec, self._framework_error = self._get_proxy_and_spec(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/danbev/work/ai/whisper-work/venv/lib/python3.11/site-packages/coremltools/models/model.py", line 550, in _get_proxy_and_spec
_MLModelProxy(
ValueError: basic_string
```
Daniel Bevenius [Sun, 20 Apr 2025 17:40:25 +0000 (19:40 +0200)]
examples : add HEAPU8 to exported runtime methods (#3062)
This commit adds `HEAPU8` to the list of exported methods.
The motivation for this commit is that currently an error occurs on
Windows systems where HEAPU8 is undefined, which results in the
following error message in the web console:
```console
main.js:1 Uncaught TypeError:
Cannot read properties of undefined (reading 'buffer') at __emval_get_property
(main.js:1:1363125) at 003a453a:0xc4a47 at 003a453a:0xc51cd at
Object.full_default (eval at craftInvokerFunction (main.js:1:1347011),
<anonymous>:9:10) at whisper.cpp/:647:42
```
examples : add FFmpeg v7.0 support to ffmpeg-transcode.cpp (#3038)
FFmpeg introduced a new channel layout API that uses `AVChannelLayout`
interface in v6.0. It subsequently dropped the old bitmask-based API
in v7.0.
This updates decode_audio() to support the new channel layout API,
so that we can compile `whisper-cli` and `whisper-server` with FFmpeg
v7.0 or later.
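A minimal sketch of how the two APIs can coexist behind a version guard (the exact cutover version is an assumption here, and decode_audio() in ffmpeg-transcode.cpp does more than this):
```c++
extern "C" {
#include <libavcodec/avcodec.h>
}

// FFmpeg v6.0 ships the AVChannelLayout API and v7.0 removes the old
// bitmask-based fields, so the major version picks the access path.
static int get_channel_count(const AVCodecContext * ctx) {
#if LIBAVCODEC_VERSION_MAJOR >= 60
    return ctx->ch_layout.nb_channels;   // new AVChannelLayout interface
#else
    return ctx->channels;                // legacy field, dropped in v7.0
#endif
}
```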
Daniel Bevenius [Wed, 9 Apr 2025 14:34:58 +0000 (16:34 +0200)]
xcf : use check for visionos build version (#3021)
This commit adds a check for the visionos build version used with vtool
in build-xcframework.sh. The script now checks the Xcode version and
determines whether to use "xros" or "visionos" for the build version.
This commit also uses xcrun for vtool so that the version of vtool from
the Xcode command line tools is used instead of the one in the system
path.
tests : add script to benchmark whisper.cpp on LibriSpeech corpus (#2999)
* tests : add script to benchmark whisper.cpp on LibriSpeech corpus
LibriSpeech is a widely-used benchmark dataset for training and
testing speech recognition models.
This adds a set of scripts to measure the recognition accuracy of
whisper.cpp models, following the common benchmark standards.
Signed-off-by: Fujimoto Seiji <redacted>
* Document how to prepare `whisper-cli` and model files
Feedback from Daniel Bevenius.
This adds a short code example showing how to prepare the `whisper-cli`
command, to make the initial setup step a little bit clearer.
Signed-off-by: Fujimoto Seiji <redacted>
* tests : Simplify how to set up Python environment
Based on feedback from Georgi Gerganov.
Instead of setting up a virtual environment in Makefile, let users
set up the Python environment. This is better since users may have
their own preferred workflow/toolkit.
whisper : fix "bench-all outputs an invalid result on larger models" (#3002)
The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.
Fix this issue by introducing explicit whitespace in the output lines
of `whisper_print_timings()`.
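To illustrate the failure mode (the format string and values here are hypothetical, not the actual whisper_print_timings() output):
```c++
#include <cstdio>

int main() {
    // Small values leave padding, so '=' and the number split into fields:
    std::printf("total time =%8.2f ms\n", 123.45);     // "total time =  123.45 ms"
    // A long run overflows the width and fuses '=' with the number,
    // shifting every whitespace-delimited field the script counts on:
    std::printf("total time =%8.2f ms\n", 1234567.89); // "total time =1234567.89 ms"
    // An explicit space guarantees a separator regardless of the value:
    std::printf("total time = %8.2f ms\n", 1234567.89);
    return 0;
}
```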
Daniel Bevenius [Fri, 4 Apr 2025 08:23:53 +0000 (10:23 +0200)]
examples : update server.py to match github pages app [no ci] (#3004)
This commit updates examples/server.py which is used to serve the wasm
examples locally. The changes include:
- Added a redirect from the root URL to /whisper.cpp.
So now accessing http://localhost:8000/ will redirect to
http://localhost:8000/whisper.cpp/ which matches the url for the app
deployed to github pages.
- Added custom handling to serve coi-serviceworker.js to avoid an
  error in the console. This file is not strictly necessary for the
  local server to work, as the headers are already provided, but it is
  nice not to have an error in the console.
- Fixed the shutdown of the server to ensure it exits cleanly
on Ctrl+C. Previously it would continue to hang onto the port even
after the process had exited.
Daniel Bevenius [Thu, 3 Apr 2025 17:50:47 +0000 (19:50 +0200)]
whisper.wasm : fix unknown language issue (#3000)
* whisper.wasm : fix unknown language issue
This commit addresses an issue with whisper.wasm where the following
error was being displayed when running the application in github pages:
```
whisper_lang_id: unknown language 'д=␙c'
```
This turned out to be a memory corruption issue and further details
can be found in the reference issue below.
Daniel Bevenius [Thu, 3 Apr 2025 07:06:53 +0000 (09:06 +0200)]
docs : add xcframework section to README.md [no ci] (#2997)
This adds a section to the README.md file that describes how to use the
XCFramework.
The motivation for this is that it is not obvious how to use the
XCFramework, and an example will help.
One thing to note is that the example is using the latest release
including the checksum. We are thinking about how we might automate
this in the future but for now this is a good start.
This commit removes test-whisper-cli-tiny-en from the gh label.
The motivation for this change is that until recently the tests were
disabled. But now that they are enabled, some of the tests, specifically
the CI jobs that use sanitizers (e.g. thread-sanitizer), take a long time
to run as they are instrumented.
Some of these jobs also have build matrices, which means that multiple
jobs are created that all run these tests.
The suggestion here is to limit the number of tests that are run in the
CI jobs to cut down the CI build time.
Daniel Bevenius [Tue, 1 Apr 2025 16:01:23 +0000 (18:01 +0200)]
coreml: fix Whisper to CoreML conversion by disabling SDPA [no ci] (#2979)
* coreml: fix Whisper to CoreML conversion by disabling SDPA
This commit disables the use of PyTorch's
`scaled_dot_product_attention` in the Whisper model to avoid
compatibility issues during CoreML conversion.
The issue occurs because coremltools requires PyTorch 2.5.0, but the
Whisper implementation may expect behavior from newer PyTorch versions.
By setting `MultiHeadAttention.use_sdpa = False`, we force Whisper to
use its fallback manual attention implementation, which works correctly
with PyTorch 2.5.0 during the tracing process.
This commit updates the generated encoder/decoder interfaces for the
Whisper model, which is the result of running the
generate-coreml-interface.sh script.
Daniel Bevenius [Tue, 1 Apr 2025 15:04:32 +0000 (17:04 +0200)]
ci : add coreml job that converts base.en to coreml [no ci] (#2981)
* ci : add coreml job that converts base.en to coreml [no ci]
This commit adds a new job to the CI pipeline that downloads the base.en
model and converts it to CoreML format. The CoreML model is then packed
into a zip file and uploaded as an artifact.
This will only be done for pushes to master, releases, or pre-releases.
Daniel Bevenius [Mon, 31 Mar 2025 15:04:37 +0000 (17:04 +0200)]
tests : re-enable tests [no ci] (#2977)
This commit re-enables the tests in the build process which are
currently commented out.
It is possible to build the tests using `-DWHISPER_BUILD_TESTS=ON` and
then run a single test using:
```console
$ ctest -R test-whisper-cli-tiny.en --test-dir build
Internal ctest changing into directory: /home/danbev/work/ai/whisper-work/build
Test project /home/danbev/work/ai/whisper-work/build
Start 2: test-whisper-cli-tiny.en
1/1 Test #2: test-whisper-cli-tiny.en ......... Passed 4.44 sec
```
Some of the tests take a long time to run so it might not be a good idea
to enable them in CI, or perhaps we could only run a subset of the tests
in CI.
Daniel Bevenius [Mon, 31 Mar 2025 14:14:33 +0000 (16:14 +0200)]
android.java : re-add ggml source updates (#2975)
This commit updates the ggml source to include the new unary and binary
operations. I merged https://github.com/ggerganov/whisper.cpp/pull/2958
which seems to have overwritten the changes to the ggml source which
were added in https://github.com/ggerganov/whisper.cpp/pull/2972.
Daniel Bevenius [Mon, 31 Mar 2025 13:14:24 +0000 (15:14 +0200)]
ci : re-enable android_java job (#2958)
This commit re-enables the android_java job in the CI workflow. The job
was disabled because of a failing build.
The motivation for this is that Commit 226d344f565ea6140e7c6a583bc300a64454af58 ("whisper.android.java : update
build with ggml source changes") addressed build issues and it should
now be possible to re-enable this job.
Daniel Bevenius [Mon, 31 Mar 2025 09:34:40 +0000 (11:34 +0200)]
ci : add github pages workflow for wasm examples (#2969)
* ci : add github pages workflow for wasm examples
This commit adds a github workflow to build and deploy the wasm examples
to github pages. The whisper.wasm example is deployed as the main page.
This workflow is triggered by a push to master and will deploy the
examples to: https://ggerganov.github.io/whisper.cpp/.
This requires that GitHub Pages is enabled for the repository, with
`Settings` -> `Pages` -> `Build and deployment` -> `Source` set to
`GitHub Actions`.
One thing to note is that this commit removes the `talk` example as I'm
not sure how this example is built yet.
Icenowy Zheng [Fri, 28 Mar 2025 17:51:06 +0000 (01:51 +0800)]
vulkan: fix coopmat shader generation when cross-compiling (llama/12272)
* vulkan: fix coopmat shader generation when cross-compiling
Previously the status of coopmat{,2} support wasn't passed to the
vulkan-shaders-gen project built on the host, which led to build
failures because the cross-compiling code expected coopmat{,2}
shaders that were never generated.
Fix this by passing the coopmat{,2} support status to the
vulkan-shaders subproject.
Signed-off-by: Icenowy Zheng <redacted>
* Only call coop-mat shaders once
amritahs-ibm [Fri, 28 Mar 2025 07:43:22 +0000 (13:13 +0530)]
llamafile : ppc64le GEMV forwarding for FP32. (llama/12594)
This patch enables usage of MMA when one of the
dimensions of the matrix (i.e. either M or N) is 1. This
is useful in case of token generation, where N < 2.
The concept of 'GEMV forwarding' is used: when one
of the matrices has a single row/column, the elements are
broadcast instead of using a packing routine to prepack
the matrix elements.
This change results in a 5% - 15% improvement in total
speed (i.e. all tokens / total time), across various batch
sizes, in comparison with the corresponding
dot product implementation.
The patch is tested with FP32 models of Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf on an IBM POWER10 machine.
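A scalar sketch of the idea (the real path uses POWER10 MMA/VSX intrinsics; this only shows what 'forwarding' replaces): with N == 1, B is a single column, so instead of prepacking it for the MMA kernels, each output element is a plain dot product with the column reused across rows:
```c++
#include <cstddef>

// C = A * b for an M x K matrix A and a single column b (the N == 1 case).
void gemv_forward(const float * A, const float * b, float * c,
                  size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            acc += A[i * K + k] * b[k];   // b used directly, no packing step
        }
        c[i] = acc;
    }
}
```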
Amanda Der Bedrosian [Fri, 28 Mar 2025 11:26:22 +0000 (04:26 -0700)]
bindings.go : add DetectedLanguage to go bindings (#2947)
Adding in DetectedLanguage(), a function to retrieve the detected
language that is populated by processing audio. Also adding a unit
test to verify this behavior.
Daniel Bevenius [Fri, 28 Mar 2025 08:29:56 +0000 (09:29 +0100)]
ruby : fix test failures in test_whisper (#2955)
* bindings.ruby : fix test failures in test_whisper
This commit updates the parallel tests to use 2 processors instead of
the number of processors on the system. It also comments out the setting
of the log callback to an empty lambda as this causes a segfault when
enabled.
The motivation for the change to the number of processors is that if one
has a large number of processors, for example the machine I used to test
this has 16, this would cause the following warning to be printed:
```console
whisper_full_with_state: input is too short - 680 ms < 1000 ms. consider padding the input audio with silence
```
This is logged from:
```c++
int whisper_full_with_state(
struct whisper_context * ctx,
struct whisper_state * state,
struct whisper_full_params params,
const float * samples,
int n_samples) {
...
    // seek offsets are in 10 ms units, so 100 corresponds to 1000 ms
    if (seek_end < seek_start + 100) {
WHISPER_LOG_WARN("%s: input is too short - %d ms < 1000 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)*10);
return 0;
}
```
This will return early and no segment callbacks will be invoked, which
in turn will cause the tests to fail.
* bindings.ruby : fix warnings in tests
This commit fixes the following warnings in the Ruby tests:
```console
/whisper/bindings/ruby/tests/test_segment.rb:52:
warning: ambiguity between regexp and two divisions:
wrap regexp in parentheses or add a space after `/' operator
```
And also adds a '_' prefix to some unused variables to avoid warnings.
* bindings.ruby : enable Whisper.log_set in tests
The commit reverts the commenting out of the Whisper.log_set call in
the test_whisper.rb tests.
I'm no longer getting segfaults when running the tests with this
enabled, which was the case earlier. One theory is that rebasing this
on the latest ggml sync to master resolved it: with the latest changes
in ggml, I can't reproduce the segfaults.