Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (llama/13755)
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
Daniel Bevenius [Fri, 30 May 2025 04:28:46 +0000 (06:28 +0200)]
vad : revisit timestamp alignment/mapping (#3173)
* vad : revisit timestamp alignment/mapping
This commit improving the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
binary search for lookups.
The motivation for this changes is to address issues with the currently
solution where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.
This commit changes the type of the `processed_time` and `original_time`
fields in the `vad_time_mapping` struct from `double` to `uint64_t`.
The motivation for this change is made to improve precision and avoid
floating-point inaccuracies and also be consistent with other part of
the code base that use `uint64_t` for time representation.
This is a part of a refactoring where I'm also going to change the
vad_segment_info struct to use `uint64_t` for the start and end times.
This is the reason for the not so pleasant conversion and casts in the
code at the moment.
* vad : change vad_segment_info and whisper_vad_segment to use uint64_t
* vad : use int64_t instead of uint64_t for timestamps
To be consistent with other timestamps in the codebase.
* vad : add centisecond conversion functions
* vad : extract vad processing from whisper_full_with_state
This commit extracts the VAD processing from the
`whisper_full_with_state` function into the `whisper_full` and
`whisper_full_parallel` functions.
The motivation for this is that I did not take into account that when
`whisper_full_parallel` is called with `n_processors > 1`, then the
vad processing would not be applied correctly. Instead the VAD
processing should be done prior to processing in the case of
`whisper_full_parallel`.
* vad : remove filtered_n_samples from whisper_vad
The commit removes the parameter `filtered_n_samples` from the
`whisper_vad` function signature and its usage, as it is no longer
needed since filtered samples is now a vector (previously it was a
float*)
The motivation for this is to simplify the usage of this function.
Daniel Bevenius [Thu, 29 May 2025 06:03:17 +0000 (08:03 +0200)]
whisper : remove whisper_load_backends function (#3196)
* whisper : remove whisper_load_backends function
This commit removes the `whisper_load_backends` function, which was used
to load all GGML backends.
The motivation for this change push the responsibility of loading
backends to user applications to give them more control over which
backends to load and when. See the references below for more context.
Fujimoto Seiji [Wed, 28 May 2025 05:08:44 +0000 (14:08 +0900)]
tests : add a new benchmark test for long-form audio (#3185)
* tests : add a new benchmark test for long-form audio
Based on "Earnings-21" corpus by Del Rio et al.
Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348
This dataset contains 39 hours of long-form speech, sourced from public
earning calls. Each recording contains roughly 50 minutes of English
dialogues between multiple speakers (2-20 persons).
This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.
Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.
Signed-off-by: Fujimoto Seiji <redacted>
* tests : try to interpret the hypothesis as Windows-1252
Based on the discussion in PR#3185.
Evidently Whisper.cpp can represent a quotation mark as '0x93', which
implifies Windows-1252 (Microsoft's ASCII excention), and cannot be
decoded by UTF-8.
Add an explicit decoding loop to address the issue.
Daniel Bevenius [Tue, 27 May 2025 16:01:31 +0000 (18:01 +0200)]
ci : update windows-blas uploads action (#3192)
This commit modifies windows-blas which was updated previously to use
the zip functionality provided by `actions/upload-artifact`. This turned
out to be incorrect and I should not have done that. The reason for
zipping the archives first is that otherwise the artifacts when
downloaded will be unzipped and just be simple directories. In our case
the release task depends on the artifacts having a .zip extension so
that those archives are include in the release.
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (llama/12795)
* opencl: fix couple crashes
* fix kernel launches failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
Daniel Bevenius [Tue, 27 May 2025 08:53:50 +0000 (10:53 +0200)]
docs : convert README_sycl.md to utf8 format [no ci] (#3191)
This commit updates the README_sycl.md file to use UTF-8 encoding.
The motivation for this is that while this file displays correctly in
github it will fail to render with tools that expect UTF-8 encoding.
For example this is the case when using `grip` to view the file locally.
Daniel Bevenius [Tue, 27 May 2025 03:51:47 +0000 (05:51 +0200)]
node : enable no_prints to suppress all output (#3189)
This commit enable the node addon to suppress all output, even the
result of the transcription if the no_prints parameter is set to true.
The motivation for this is that for the node addon there is a
fullfilment handler/success callback to process the transcription
result. And it might be useful to be able to disable the printing of
the transcription result to the console, so that the user can handle
the result in their own way.
Daniel Bevenius [Fri, 23 May 2025 03:48:08 +0000 (05:48 +0200)]
ci : use dynamic libopenblas.dll for window-blas (#3177)
* ci : use dynamic libopenblas.dll for window-blas
This commit updates the windows-blas job to use the dynamic (can load
different kernels depending of the CPU arch) libopenblas.dll instead of
the "static" openblas.dll that get installed by vcpgk.
The motivation for this change is that there have been reports of
performance drops in later version specifically related to blas. Please
see the links below for more details.
Daniel Bevenius [Mon, 19 May 2025 15:59:43 +0000 (17:59 +0200)]
ruby : add GGML_SYCL_DNN option to ruby bindings (#3172)
This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for
the GGML library. This option as added to ggml in
Commit (5e7e07758a5f3172380500e173ca71f679bbef1e "sycl: use oneDNN for
matrices multiplication")
The motivation for this change to enable the CI build to pass.
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (llama/13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.