Daniel Bevenius [Fri, 4 Apr 2025 08:23:53 +0000 (10:23 +0200)]
examples : update server.py to match github pages app [no ci] (#3004)
This commit updates examples/server.py which is used to serve the wasm
examples locally. The changes include:
- Added a redirect from the root URL to /whisper.cpp/.
So now accessing http://localhost:8000/ will redirect to
http://localhost:8000/whisper.cpp/ which matches the URL of the app
deployed to GitHub Pages.
- Added custom handling for coi-serviceworker.js so that it is served,
which avoids an error in the console. This file is not strictly
necessary for the local server to work, as the headers are already
provided, but it is nice to not have an error in the console.
- Fixed the shutdown of the server to ensure it exits cleanly
on Ctrl+C. Previously it would continue to hang onto the port even
after the process had exited. A sketch of these changes is shown below.
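The following is a minimal, hypothetical sketch of these changes using
Python's http.server. It is not the actual examples/server.py: the names
and the exact header set are assumptions, and the coi-serviceworker.js
handling is omitted.
```python
# Hypothetical sketch, not the actual examples/server.py.
# Assumes the wasm examples live under ./whisper.cpp/.
import http.server
import socketserver

PORT = 8000

class Handler(http.server.SimpleHTTPRequestHandler):
    def end_headers(self):
        # Cross-origin isolation headers; these are why coi-serviceworker.js
        # is not strictly needed when serving locally.
        self.send_header("Cross-Origin-Opener-Policy", "same-origin")
        self.send_header("Cross-Origin-Embedder-Policy", "require-corp")
        super().end_headers()

    def do_GET(self):
        # Redirect the root URL so local paths match the GitHub Pages app.
        if self.path == "/":
            self.send_response(302)
            self.send_header("Location", "/whisper.cpp/")
            self.end_headers()
            return
        super().do_GET()

class Server(socketserver.TCPServer):
    # Release the port promptly when the server is restarted.
    allow_reuse_address = True

if __name__ == "__main__":
    with Server(("", PORT), Handler) as httpd:
        try:
            httpd.serve_forever()
        except KeyboardInterrupt:
            pass  # the with-block closes the socket, so Ctrl+C exits cleanly
```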
Daniel Bevenius [Thu, 3 Apr 2025 17:50:47 +0000 (19:50 +0200)]
whisper.wasm : fix unknown language issue (#3000)
* whisper.wasm : fix unknown language issue
This commit addresses an issue with whisper.wasm where the following
error was being displayed when running the application on GitHub Pages:
```
whisper_lang_id: unknown language 'д=␙c'
```
This turned out to be a memory corruption issue and further details
can be found in the reference issue below.
Daniel Bevenius [Thu, 3 Apr 2025 07:06:53 +0000 (09:06 +0200)]
docs : add xcframework section to README.md [no ci] (#2997)
This adds a section to the README.md file that describes how to use the
XCFramework.
The motivation for this is that it is not obvious how to use the
XCFramework and an example will help.
One thing to note is that the example is using the latest release
including the checksum. We are thinking about how we might automate
this in the future but for now this is a good start.
This commit removes test-whisper-cli-tiny-en from the gh label.
The motivation for this change is that until recently the tests were
disabled. But now that they are enabled, some of the tests, specifically
the CI jobs that use sanitizers (e.g. thread-sanitizer), take a long time
to run as they are instrumented.
Some of these jobs also have matrices, which means that multiple jobs
are created that all run these tests.
The suggestion here is to limit the number of tests that are run in the
CI jobs to cut down the CI build time.
Daniel Bevenius [Tue, 1 Apr 2025 16:01:23 +0000 (18:01 +0200)]
coreml: fix Whisper to CoreML conversion by disabling SDPA [no ci] (#2979)
* coreml: fix Whisper to CoreML conversion by disabling SDPA
This commit disables the use of PyTorch's
`scaled_dot_product_attention` in the Whisper model to avoid
compatibility issues during CoreML conversion.
The issue occurs because coremltools requires PyTorch 2.5.0, but the
Whisper implementation may expect behavior from newer PyTorch versions.
By setting `MultiHeadAttention.use_sdpa = False`, we force Whisper to
use its fallback manual attention implementation, which works correctly
with PyTorch 2.5.0 during the tracing process.
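As a rough sketch (assuming the openai-whisper package layout, where
`MultiHeadAttention` lives in `whisper.model`), the workaround looks
something like this before tracing:
```python
# Hedged sketch of the workaround described above; the import path is an
# assumption based on the openai-whisper package, not the exact
# conversion script.
import whisper
from whisper.model import MultiHeadAttention

# Disable PyTorch's scaled_dot_product_attention so the traced graph uses
# the fallback manual attention implementation, which works with
# coremltools on PyTorch 2.5.0.
MultiHeadAttention.use_sdpa = False

model = whisper.load_model("base.en")
model.eval()
# ... trace the encoder/decoder with torch.jit.trace and convert with
# coremltools as usual ...
```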
This commit updates the generated encoder/decoder interfaces for the
whisper model which is the result of running the
generate-coreml-interface.sh script.
Daniel Bevenius [Tue, 1 Apr 2025 15:04:32 +0000 (17:04 +0200)]
ci : add coreml job that converts base.en to coreml [no ci] (#2981)
* ci : add coreml job that converts base.en to coreml [no ci]
This commit adds a new job to the CI pipeline that downloads the base.en
model and converts it to CoreML format. The CoreML model is then packed
into a zip file and uploaded as an artifact.
This will only be done for pushes to master, releases, or pre-releases.
Daniel Bevenius [Mon, 31 Mar 2025 15:04:37 +0000 (17:04 +0200)]
tests : re-enable tests [no ci] (#2977)
This commit re-enables the tests in the build process which are
currently commented out.
It is possible to build the tests using `-DWHISPER_BUILD_TESTS=ON` and
then run a single test using:
```console
$ ctest -R test-whisper-cli-tiny.en --test-dir build
Internal ctest changing into directory: /home/danbev/work/ai/whisper-work/build
Test project /home/danbev/work/ai/whisper-work/build
Start 2: test-whisper-cli-tiny.en
1/1 Test #2: test-whisper-cli-tiny.en ......... Passed 4.44 sec
```
Some of the tests take a long time to run so it might not be a good idea
to enable them in CI, or perhaps we could only run a subset of the tests
in CI.
Daniel Bevenius [Mon, 31 Mar 2025 14:14:33 +0000 (16:14 +0200)]
android.java : re-add ggml source updates (#2975)
This commit updates the ggml source to include the new unary and binary
operations. I merged https://github.com/ggerganov/whisper.cpp/pull/2958
which seems to have overwritten the changes to the ggml source which
were added in https://github.com/ggerganov/whisper.cpp/pull/2972.
Daniel Bevenius [Mon, 31 Mar 2025 13:14:24 +0000 (15:14 +0200)]
ci : re-enable android_java job (#2958)
This commit re-enables the android_java job in the CI workflow. The job
was disabled because of a failing build.
The motivation for this is that Commit 226d344f565ea6140e7c6a583bc300a64454af58 ("whisper.android.java : update
build with ggml source changes") addressed build issues and it should
now be possible to re-enable this job.
Daniel Bevenius [Mon, 31 Mar 2025 09:34:40 +0000 (11:34 +0200)]
ci : add github pages workflow for wasm examples (#2969)
* ci : add github pages workflow for wasm examples
This commit adds a github workflow to build and deploy the wasm examples
to github pages. The whisper.wasm example is deployed as the main page.
This workflow is triggered by a push to master and will deploy the
examples to: https://ggerganov.github.io/whisper.cpp/.
This requires that GitHub Pages has been enabled for the repository,
with `Settings` -> `Pages` -> `Build and deployment` -> `Source` set to
`GitHub Actions`.
One thing to note is that this commit removes the `talk` example as I'm
not sure how this example is built yet.
Icenowy Zheng [Fri, 28 Mar 2025 17:51:06 +0000 (01:51 +0800)]
vulkan: fix coopmat shader generation when cross-compiling (llama/12272)
* vulkan: fix coopmat shader generation when cross-compiling
Previously the status of coopmat{,2} support wasn't passed to the
vulkan-shaders-gen project built on the host, which led to a build
failure because the cross-compiling code expected coopmat{,2}
shaders that were never generated.
Fix this by passing the coopmat{,2} support status to the vulkan-shaders
subproject.
Signed-off-by: Icenowy Zheng <redacted>
* Only call coop-mat shaders once
amritahs-ibm [Fri, 28 Mar 2025 07:43:22 +0000 (13:13 +0530)]
llamafile : ppc64le GEMV forwarding for FP32. (llama/12594)
This patch enables usage of MMA when one of the
dimensions of the matrix (i.e. either M or N) is 1. This
is useful in the case of token generation, where N < 2.
The concept of 'GEMV forwarding' is used: when one
of the matrices has a single row/column, the elements are
broadcast instead of using a packing routine to prepack
the matrix elements.
This change results in a 5% - 15% improvement in total
speed (i.e. all tokens/total time) across various batch
sizes, in comparison with the corresponding
dot product implementation.
The patch is tested with FP32 models of Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf on an IBM POWER10 machine.
Amanda Der Bedrosian [Fri, 28 Mar 2025 11:26:22 +0000 (04:26 -0700)]
bindings.go : add DetectedLanguage to go bindings (#2947)
Adds DetectedLanguage(), a function to retrieve the detected
language that is populated by processing audio. Also adds a unit
test to verify it.
Daniel Bevenius [Fri, 28 Mar 2025 08:29:56 +0000 (09:29 +0100)]
ruby : fix test failures in test_whisper (#2955)
* bindings.ruby : fix test failures in test_whisper
This commit updates the parallel tests to use 2 processors instead of
the number of processors on the system. It also comments out the setting
of the log callback to an empty lambda as this causes a segfault when
enabled.
The motivation for the change to the number of processors is that with
a large number of processors (for example, I have 16 on the machine I
used to test this) the following warning would be printed:
```console
whisper_full_with_state: input is too short - 680 ms < 1000 ms. consider padding the input audio with silence
```
This is logged from:
```c++
int whisper_full_with_state(
        struct whisper_context * ctx,
        struct whisper_state * state,
        struct whisper_full_params params,
        const float * samples,
        int n_samples) {
    ...
    if (seek_end < seek_start + 100) {
        WHISPER_LOG_WARN("%s: input is too short - %d ms < 1000 ms. consider padding the input audio with silence\n", __func__, (seek_end - seek_start)*10);
        return 0;
    }
```
This will return early and no segment callbacks will be invoked, which
in turn will cause the tests to fail.
* bindings.ruby : fix warnings in tests
This commit fixes the following warnings in the Ruby tests:
```console
/whisper/bindings/ruby/tests/test_segment.rb:52:
warning: ambiguity between regexp and two divisions:
wrap regexp in parentheses or add a space after `/' operator
```
And also adds a '_' prefix to some unused variables to avoid warnings.
* bindings.ruby : enable Whisper.log_set in tests
The commit reverts the commenting out of the Whisper.log_set call in
the test_whisper.rb tests.
I'm no longer getting segfaults when running the tests with this
enabled, which was the case earlier. One theory is that it is related to
rebasing this on the latest ggml sync to master (done to make sure
things still worked); with the latest changes in ggml I can't reproduce
the segfaults.
amritahs-ibm [Thu, 27 Mar 2025 06:51:47 +0000 (12:21 +0530)]
llamafile : ppc64le MMA implementation for Q4_0. (llama/12489)
This change upstreams llamafile's CPU matrix
multiplication kernels for the ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between the quantised datatypes block_q4_0 and
block_q8_0.
This change results in a 5% - 50% improvement
in total speed (i.e. all tokens/total time), across
various batch sizes.
The patch is tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Jeff Bolz [Mon, 24 Mar 2025 06:56:17 +0000 (01:56 -0500)]
vulkan: fix mul_mat_vec failure in backend tests (llama/12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
Jeff Bolz [Sat, 22 Mar 2025 08:40:11 +0000 (03:40 -0500)]
vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, so use that conditionally.
Gaurav Garg [Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)]
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Jeff Bolz [Wed, 19 Mar 2025 07:26:26 +0000 (02:26 -0500)]
vulkan: Submit once enough matmul work has been recorded (llama/12406)
I've been seeing significantly worse performance for token generation (tg)
with flash attention enabled vs disabled, and it seems to be related to
the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
Gaurav Garg [Mon, 17 Mar 2025 18:25:13 +0000 (23:55 +0530)]
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API was changed in 12.x. For this reason CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK versions < 12.x by using the older API when CTK < 12.x.
Christian Kastner [Mon, 17 Mar 2025 09:05:23 +0000 (10:05 +0100)]
cmake : enable building llama.cpp using system libggml (llama/12321)
* cmake: Factor out compiler flag function from ggml
llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).
* cmake: Enable building against system ggml
This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
uvos [Wed, 12 Mar 2025 09:14:11 +0000 (10:14 +0100)]
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were converted to
a selectable warp size. However, the fattn-vec kernels don't work with 64-wide warps for now, so we need
to avoid launching them with parameters for warp64.