git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/log
3 weeks ago ci : use mirrors.kernel.org for Ubuntu packages (#3220)
Daniel Bevenius [Mon, 2 Jun 2025 14:46:40 +0000 (16:46 +0200)]
ci : use mirrors.kernel.org for Ubuntu packages (#3220)

This commit updates the Ubuntu jobs to use mirror sites instead of archive.ubuntu.com.

The motivation for this is to make the CI build more stable and avoid errors like:
https://github.com/ggml-org/whisper.cpp/actions/runs/15384056535/job/43291948394?pr=3217

3 weeks ago node : add language detection support (#3190)
Daniel Bevenius [Mon, 2 Jun 2025 12:58:05 +0000 (14:58 +0200)]
node : add language detection support (#3190)

This commit adds support for language detection in the Whisper Node.js
addon example. It also updates the node addon to return an object
instead of an array as the results.

The motivation for this change is to enable the inclusion of the
detected language in the result, in addition to the transcription
segments.

For example, when using the `detect_language` option, the result will
now be:
```console
{ language: 'en' }
```

And if the `language` option is set to "auto", it will also return:
```console
{
  language: 'en',
  transcription: [
    [
      '00:00:00.000',
      '00:00:07.600',
      ' And so my fellow Americans, ask not what your country can do for you,'
    ],
    [
      '00:00:07.600',
      '00:00:10.600',
      ' ask what you can do for your country.'
    ]
  ]
}
```

3 weeks ago talk-llama : sync llama.cpp
Georgi Gerganov [Sun, 1 Jun 2025 11:07:36 +0000 (14:07 +0300)]
talk-llama : sync llama.cpp

ggml-ci

3 weeks ago sync : ggml
Georgi Gerganov [Sun, 1 Jun 2025 11:03:21 +0000 (14:03 +0300)]
sync : ggml

ggml-ci

3 weeks ago threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid...
Max Krasnyansky [Sat, 31 May 2025 22:39:19 +0000 (15:39 -0700)]
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (llama/12995)

* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <redacted>
* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <redacted>
---------

Co-authored-by: Diego Devesa <redacted>
3 weeks ago CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU or dGPU in cuda ...
Shawn yang [Sat, 31 May 2025 06:48:04 +0000 (14:48 +0800)]
CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU or dGPU in cuda (#13856) (llama/13895)

* 1. Add "integrated" to ggml_cuda_device_info to distinguish whether the device is an integrated GPU or a discrete GPU
2. Adjust the function "ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <redacted>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <redacted>
* revert the suggested commit changes because they are not applicable on Jetson devices

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence

Co-authored-by: Diego Devesa <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a space

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: yangxiao <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: yangxiao <redacted>
Co-authored-by: Diego Devesa <redacted>
3 weeks ago CUDA: fix typo in FlashAttention code (llama/13926)
Johannes Gäßler [Fri, 30 May 2025 19:22:03 +0000 (21:22 +0200)]
CUDA: fix typo in FlashAttention code (llama/13926)

3 weeks ago sched : avoid changing cur_copy when a graph is already allocated (llama/13922)
Diego Devesa [Fri, 30 May 2025 16:56:19 +0000 (09:56 -0700)]
sched : avoid changing cur_copy when a graph is already allocated (llama/13922)

3 weeks ago cuda : prevent using split buffers with 3d/4d matrices (llama/13919)
Diego Devesa [Fri, 30 May 2025 14:37:18 +0000 (07:37 -0700)]
cuda : prevent using split buffers with 3d/4d matrices (llama/13919)

3 weeks ago SYCL: Add mrope kernel (llama/13755)
Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (llama/13755)

* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
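
The `ceil_div` helper named above is, in the usual sense, integer division rounded up, which a launch-size calculation needs when the element count is not a multiple of the work-group size. A minimal Python sketch of that arithmetic, assuming non-negative counts; the `num_work_groups` helper and the work-group size of 256 are hypothetical and not taken from this log:

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up; assumes a >= 0 and b > 0."""
    return (a + b - 1) // b


def num_work_groups(n_elements: int, group_size: int = 256) -> int:
    """Work-groups needed when each work-item handles two elements.

    The two-elements-per-item layout mirrors the vectorized load/store
    described in the commit message; group_size is a hypothetical value.
    """
    n_pairs = ceil_div(n_elements, 2)
    return ceil_div(n_pairs, group_size)


if __name__ == "__main__":
    for n in (512, 513, 1000):
        print(n, "elements ->", num_work_groups(n), "work-groups")
```

Rounding up twice (first to pairs, then to groups) keeps the last partial pair and the last partial group from being dropped.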

3 weeks ago cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)

3 weeks ago arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)

This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
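
The uplift ranges quoted in this commit message can be recomputed from the table. A small Python sketch, with the figures copied from the table above; the variable and function names are illustrative only:

```python
# (batch size, S_PP original, S_PP this PR, S_TG original, S_TG this PR)
ROWS = [
    (1, 110.12, 147.83, 24.36, 24.28),
    (2, 121.16, 172.42, 46.36, 47.93),
    (4, 120.15, 169.75, 74.68, 84.00),
    (8, 130.97, 196.81, 91.04, 114.74),
    (16, 131.01, 196.88, 101.43, 135.79),
    (32, 130.85, 196.51, 106.97, 147.29),
]


def uplift_percent(original: float, new: float) -> float:
    """Relative throughput gain in percent."""
    return (new / original - 1.0) * 100.0


# Prompt-processing uplift for every batch size.
pp_uplift = [uplift_percent(o, n) for _, o, n, _, _ in ROWS]
# Text-generation uplift for batch size 4 and above only.
tg_uplift_b4 = [uplift_percent(o, n) for b, _, _, o, n in ROWS if b >= 4]

if __name__ == "__main__":
    print(f"S_PP uplift: {min(pp_uplift):.1f}% .. {max(pp_uplift):.1f}%")
    print(f"S_TG uplift (B >= 4): {min(tg_uplift_b4):.1f}% .. {max(tg_uplift_b4):.1f}%")
```

For batch sizes 1 and 2 the text-generation throughput is essentially unchanged, which is why the message limits that claim to batch size 4 and above.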

3 weeks ago cmake: Factor out CPU architecture detection (llama/13883)
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (llama/13883)

* cmake: Define function for querying architecture

The tests and results match exactly those of src/CMakeLists.txt

* Switch arch detection over to new function

3 weeks ago ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama...
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882)

* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <redacted>
3 weeks ago ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)

* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE

3 weeks ago CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)

3 weeks ago CANN: Add SOC TYPE printing in cmake configuration (llama/13837)
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (llama/13837)

3 weeks ago opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm...
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787)

* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`

3 weeks ago opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)

3 weeks ago vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)

Also change it to be controlled by an env var rather than cmake flag

3 weeks ago SYCL: add gelu_erf kernel (llama/13749)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (llama/13749)

* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <redacted>
3 weeks ago ggml : add ggml_repeat_4d (llama/13824)
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (llama/13824)

3 weeks ago vulkan : Remove unexpected ; (ggml/1253)
Kai Pastor [Sat, 31 May 2025 10:49:55 +0000 (12:49 +0200)]
vulkan : Remove unexpected ; (ggml/1253)

3 weeks ago cmake : Fix broken CMake error messages (ggml/1252)
Kai Pastor [Sat, 31 May 2025 10:39:19 +0000 (12:39 +0200)]
cmake : Fix broken CMake error messages (ggml/1252)

3 weeks ago ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
Radoslav Gerganov [Fri, 30 May 2025 06:11:09 +0000 (09:11 +0300)]
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)

The implementation is already deleted with commit 9d0762e.

closes: #1235

3 weeks ago ruby : add Core ML support (#3214)
KITAITI Makoto [Sun, 1 Jun 2025 09:16:02 +0000 (18:16 +0900)]
ruby : add Core ML support (#3214)

* Prevent overflow

* Fix memsize of Whisper::Context

* Rename xxx_initialize to more Ruby-esque name: xxx_s_new

* Define Whisper::Model::ZipURI

* Define Whisper::Model.coreml_compiled_models

* Make Options' @cmake_options Hash

* Use --{enable,disable}-whisper-coreml option for -I/opt/homebrew/opt/llvm/include

* Prepare Core ML model if enabled

* Add test for ZipURI

* Add signatures for ZipURI

* Add Whisper.system_info_str

* Add test for Whisper.system_info_str

* Add signature for Model.coreml_compiled_models

* Add signature for Whisper.system_info_str

* Add test for Core ML

* Update date

* Maintain .gitignore

4 weeks ago vad : revisit timestamp alignment/mapping (#3173)
Daniel Bevenius [Fri, 30 May 2025 04:28:46 +0000 (06:28 +0200)]
vad : revisit timestamp alignment/mapping (#3173)

* vad : revisit timestamp alignment/mapping

This commit improves the timestamp alignment by introducing a mapping
table, adding intermediate reference points for longer segments, and
using binary search for lookups.

The motivation for this change is to address issues with the current
solution, where zero-length segments are possible, and also to improve
the precision of the VAD timestamps.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3162

* vad : use uint64_t for time mapping

This commit changes the type of the `processed_time` and `original_time`
fields in the `vad_time_mapping` struct from `double` to `uint64_t`.

The motivation for this change is to improve precision and avoid
floating-point inaccuracies, and also to be consistent with other parts of
the code base that use `uint64_t` for time representation.

This is a part of a refactoring where I'm also going to change the
vad_segment_info struct to use `uint64_t` for the start and end times.
This is the reason for the not so pleasant conversion and casts in the
code at the moment.

* vad : change vad_segment_info and whisper_vad_segment to use uint64_t

* vad : use int64_t instead of uint64_t for timestamps

To be consistent with other timestamps in the codebase.

* vad : add centisecond conversion functions

* vad : extract vad processing from whisper_full_with_state

This commit extracts the VAD processing from the
`whisper_full_with_state` function into the `whisper_full` and
`whisper_full_parallel` functions.

The motivation for this is that I did not take into account that when
`whisper_full_parallel` is called with `n_processors > 1`, then the
vad processing would not be applied correctly. Instead the VAD
processing should be done prior to processing in the case of
`whisper_full_parallel`.

* vad : remove filtered_n_samples from whisper_vad

The commit removes the parameter `filtered_n_samples` from the
`whisper_vad` function signature and its usage, as it is no longer
needed since the filtered samples are now a vector (previously a
float*).

The motivation for this is to simplify the usage of this function.

* vad : remove vad_mapping_table_initialized flag

* vad : fix leaning (none) of pointer/references
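
The mapping table, binary-search lookup, and centisecond conversion described in this commit can be illustrated with a short Python sketch. The actual code is C++ inside whisper.cpp; here the `(processed_time, original_time)` pairs follow the field names mentioned above, while the function names, the example table, the linear interpolation between reference points, and the clamping past the last entry are assumptions made for illustration:

```python
from bisect import bisect_right


def seconds_to_cs(seconds: float) -> int:
    """Convert seconds to integer centiseconds, rounding to nearest."""
    return int(round(seconds * 100))


def map_processed_to_original(table, processed_cs: int) -> int:
    """Map a timestamp in the VAD-filtered audio back to the original audio.

    table: list of (processed_time, original_time) pairs in centiseconds,
    sorted by processed_time with strictly increasing keys.
    """
    if not table:
        return processed_cs
    keys = [p for p, _ in table]
    i = bisect_right(keys, processed_cs)  # binary search
    if i == 0:
        return table[0][1]   # before the first reference point
    if i == len(table):
        return table[-1][1]  # at or after the last reference point
    (p0, o0), (p1, o1) = table[i - 1], table[i]
    # Integer linear interpolation between the two surrounding points.
    return o0 + (processed_cs - p0) * (o1 - o0) // (p1 - p0)


# Hypothetical example: speech at 1.00-3.00 s and 10.00-12.00 s in the
# original audio, joined with 0.10 s of padding in the filtered audio.
TABLE = [(0, 100), (200, 300), (210, 1000), (410, 1200)]

if __name__ == "__main__":
    for t in (0, 150, 310, 410):
        print(t, "->", map_processed_to_original(TABLE, t))
```

Keeping every timestamp as an integer avoids the rounding drift that repeated floating-point conversions could introduce, which matches the motivation given for moving away from `double`.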

4 weeks ago ruby : handle build options on installation (#3206)
KITAITI Makoto [Thu, 29 May 2025 16:32:49 +0000 (01:32 +0900)]
ruby : handle build options on installation (#3206)

* Don't pass empty string to cmake command

* Refactor Dependencies

* Use found cmake path for options

* Maintain extsources.rb

* List dependent files by directory separator agnostic way

* Prepend whitespace before '='

* Handle build options on install

* Remove useless test

* Retrieve gem file name and version from spec file

* Bump version to 1.3.3

* Update date

* Add install option examples

* [skip ci]Remove unused module

4 weeks ago ggml : Fix backtrace breaking Windows build (#3203)
Daniel Tang [Thu, 29 May 2025 10:26:58 +0000 (06:26 -0400)]
ggml : Fix backtrace breaking Windows build (#3203)

4 weeks ago sync : ggml
Georgi Gerganov [Thu, 29 May 2025 06:49:46 +0000 (09:49 +0300)]
sync : ggml

ggml-ci

4 weeks ago ggml : install dynamic backends (ggml/1240)
Radoslav Gerganov [Thu, 29 May 2025 06:49:27 +0000 (09:49 +0300)]
ggml : install dynamic backends (ggml/1240)

4 weeks ago ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
Daniel Tang [Wed, 28 May 2025 00:58:46 +0000 (20:58 -0400)]
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)

The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.

4 weeks ago whisper : remove whisper_load_backends function (#3196)
Daniel Bevenius [Thu, 29 May 2025 06:03:17 +0000 (08:03 +0200)]
whisper : remove whisper_load_backends function (#3196)

* whisper : remove whisper_load_backends function

This commit removes the `whisper_load_backends` function, which was used
to load all GGML backends.

The motivation for this change is to push the responsibility of loading
backends to user applications to give them more control over which
backends to load and when. See the references below for more context.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3182
Refs: https://github.com/ggml-org/whisper.cpp/pull/3042#issuecomment-2801778733
Refs: https://github.com/ggml-org/whisper.cpp/pull/3042#issuecomment-2801928990

* ruby : add check for rwc is NULL

This commit adds a check to ensure that the `rwc` pointer is not NULL
before attempting to mark its members in the garbage collector.

The motivation for this is an attempt to see if this fixes the CI build
as I'm not able to reproduce the issue locally.

Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/15299612277/job/43036694928?pr=3196

4 weeks ago ruby : add VAD support, migration to Ruby's newer API (#3197)
KITAITI Makoto [Wed, 28 May 2025 11:05:12 +0000 (20:05 +0900)]
ruby : add VAD support, migration to Ruby's newer API (#3197)

* Add VAD models

* Extract function to normalize model path from ruby_whisper_initialize()

* Define ruby_whisper_vad_params struct

* Add VAD-related features to Whisper::Params

* Add tests for VAD-related features

* Define Whisper::VADParams

* Add Whisper::VAD::Params attributes

* Add test suite for VAD::Params

* Make older test to follow namespace change

* Add test for transcription with VAD

* Add assertion for test_vad_params

* Add signatures for VAD-related methods

* Define VAD::Params#==

* Add test for VAD::Params#==

* Fix Params#vad_params

* Add test for Params#vad_params

* Fix signature of Params#vad_params

* Use macro to define VAD::Params params

* Define VAD::Params#initialize

* Add tests for VAD::Params#initialize

* Add signature for VAD::Params.new

* Add documentation on VAD in README

* Wrap register_callbacks in prepare_transcription for clear meanings

* Set whisper_params.vad_params just before transcription

* Don't touch NULL

* Define ruby_whisper_params_type

* Use TypedData_XXX for ruby_whisper_params instead of Data_XXX

* Remove unused functions

* Define rb_whisper_model_data_type

* Use TypedData_XXX for ruby_whisper_model instead of Data_XXX

* Define ruby_whisper_segment_type

* Use TypedData_XXX for ruby_whisper_segment instead of Data_XXX

* Define ruby_whisper_type

* Use TypedData_XXX for ruby_whisper instead of Data_XXX

* Qualify with const

4 weeks ago whisper : install shared libs when using GGML_BACKEND_DL (#3195)
Simon Booth [Wed, 28 May 2025 08:15:04 +0000 (09:15 +0100)]
whisper : install shared libs when using GGML_BACKEND_DL (#3195)

4 weeks ago tests : add a new benchmark test for long-form audio (#3185)
Fujimoto Seiji [Wed, 28 May 2025 05:08:44 +0000 (14:08 +0900)]
tests : add a new benchmark test for long-form audio (#3185)

* tests : add a new benchmark test for long-form audio

Based on "Earnings-21" corpus by Del Rio et al.

    Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
    https://arxiv.org/abs/2104.11348

This dataset contains 39 hours of long-form speech, sourced from public
earnings calls. Each recording contains roughly 50 minutes of English
dialogues between multiple speakers (2-20 persons).

This benchmark suite should allow us to evaluate the performance of
whisper.cpp on long-form audio data.

Signed-off-by: Fujimoto Seiji <redacted>
* tests : apply PR feedback to 'earnings21/README.md'

Based on feedback from Daniel Bevenius.

 - Simplify how to download & prepare a Silero VAD model.
 - Fix typo: inferece -> inference

Signed-off-by: Fujimoto Seiji <redacted>
* tests : avoid crashing on non-UTF-8 characters

Based on feedback from Daniel Bevenius.

Add 'errors' parameter to open() in order to avoid unhandled
exception on invalid UTF-8 bytes.

Signed-off-by: Fujimoto Seiji <redacted>
* tests : try to interpret the hypothesis as Windows-1252

Based on the discussion in PR#3185.

Evidently Whisper.cpp can represent a quotation mark as '0x93', which
implies Windows-1252 (Microsoft's ASCII extension), and cannot be
decoded as UTF-8.

Add an explicit decoding loop to address the issue.

Signed-off-by: Fujimoto Seiji <redacted>
---------

Signed-off-by: Fujimoto Seiji <redacted>
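
The decoding fallback described in this commit can be sketched in Python as a loop over candidate encodings. This is not the code from the PR: the function name `decode_hypothesis` and the final replacement-character fallback are assumptions made for illustration.

```python
def decode_hypothesis(raw: bytes) -> str:
    """Decode transcript bytes, trying UTF-8 first and Windows-1252 second."""
    for encoding in ("utf-8", "windows-1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Neither encoding fits (a few bytes are undefined in Windows-1252):
    # substitute U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 0x93 and 0x94 are the curly double quotes in Windows-1252.
    print(decode_hypothesis(b"\x93ask not\x94"))
    print(decode_hypothesis("caf\u00e9".encode("utf-8")))
```

Reading the file in binary mode and decoding afterwards keeps the choice of encoding explicit; passing `errors="replace"` to `open()` in text mode is the simpler alternative when losing the original character is acceptable.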
4 weeks ago ci : update windows-blas uploads action (#3192)
Daniel Bevenius [Tue, 27 May 2025 16:01:31 +0000 (18:01 +0200)]
ci : update windows-blas uploads action (#3192)

This commit modifies windows-blas which was updated previously to use
the zip functionality provided by `actions/upload-artifact`. This turned
out to be incorrect and I should not have done that. The reason for
zipping the archives first is that otherwise the artifacts when
downloaded will be unzipped and just be simple directories. In our case
the release task depends on the artifacts having a .zip extension so
that those archives are included in the release.

4 weeks ago sync : fix builds - musa, ruby
Georgi Gerganov [Tue, 27 May 2025 15:02:37 +0000 (18:02 +0300)]
sync : fix builds - musa, ruby

4 weeks ago talk-llama : sync llama.cpp
Georgi Gerganov [Tue, 27 May 2025 14:08:24 +0000 (17:08 +0300)]
talk-llama : sync llama.cpp

ggml-ci

4 weeks ago sync : ggml
Georgi Gerganov [Tue, 27 May 2025 14:07:06 +0000 (17:07 +0300)]
sync : ggml

ggml-ci

4 weeks ago ggml : riscv: add xtheadvector support (llama/13720)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (llama/13720)

* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage

4 weeks ago ggml-cpu: x86 feature detection is specific to x86 (llama/13811)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (llama/13811)

4 weeks ago ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)

4 weeks ago cuda : avoid cuGetErrorString (llama/13791)
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (llama/13791)

ggml-ci

4 weeks ago SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)

* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div

4 weeks ago sycl: Add more debug prints (llama/13640)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (llama/13640)

4 weeks ago vulkan: mark IM2COL as supporting non-contig (llama/13783)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (llama/13783)

4 weeks ago CANN: Add the basic supports of Flash Attention kernel (llama/13627)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (llama/13627)

* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline

4 weeks ago SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)

Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci

4 weeks ago ggml-cpu : set openmp wait time if not set (llama/13758)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (llama/13758)

4 weeks ago ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)

* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon

4 weeks ago CUDA: fix race condition in FA vector kernels (llama/13742)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (llama/13742)

4 weeks ago CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)

* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <redacted>
* codestyle adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
4 weeks ago ggml : fix the order of ggml_unary_op (llama/13718)
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (llama/13718)

4 weeks ago vulkan: support CPY from any type to itself (llama/13695)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (llama/13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.

4 weeks ago vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama...
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)

4 weeks ago use LOG_WARN to replace `std::cerr` (llama/13657)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (llama/13657)

4 weeks ago sycl : Remove waits from function calls (llama/13702)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (llama/13702)

* removes the waits in async memcpy functions

4 weeks ago SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)

Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
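
The compatibility check described in this commit amounts to scanning the compute graph for operations that cannot be recorded. A minimal Python sketch of that idea, assuming the graph is reduced to a list of operation names; the function name, the set name, and the list representation are hypothetical, and the real check lives in the C++ file `ggml-sycl.cpp`:

```python
# Operations the commit message reports as throwing during queue recording.
UNSUPPORTED_WHEN_RECORDING = {"CONCAT", "MUL_MAT_ID"}


def can_use_graph(node_ops, graphs_requested: bool) -> bool:
    """Return True only if graphs were requested and every node can be recorded."""
    if not graphs_requested:
        return False
    return not any(op in UNSUPPORTED_WHEN_RECORDING for op in node_ops)


if __name__ == "__main__":
    print(can_use_graph(["MUL_MAT", "ADD", "SOFT_MAX"], graphs_requested=True))
    print(can_use_graph(["MUL_MAT", "CONCAT"], graphs_requested=True))
```

The point of the design is that a user's request to enable graphs is treated as a preference: a graph containing an unsupported node falls back to regular submission instead of raising an exception.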

4 weeks ago opencl: Add support for multiple devices (llama/12622)
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (llama/12622)

* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.

4 weeks ago opencl: fix couple crashes (llama/12795)
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (llama/12795)

* opencl: fix couple crashes

* fix kernel launches failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
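
The sub-buffer fix listed above depends on the origin offset being a multiple of the device's base address alignment. OpenCL reports `CL_DEVICE_MEM_BASE_ADDR_ALIGN` in bits, so the value has to be converted to bytes before comparing it with a byte offset. A minimal Python sketch of that arithmetic; the helper names are hypothetical, and whether the actual patch rounds offsets up or handles misalignment some other way is not stated in this log:

```python
def is_origin_aligned(origin_bytes: int, base_addr_align_bits: int) -> bool:
    """Check a sub-buffer origin against an alignment given in bits."""
    align_bytes = base_addr_align_bits // 8
    return origin_bytes % align_bytes == 0


def align_up(offset_bytes: int, base_addr_align_bits: int) -> int:
    """Round a byte offset up to the next multiple of the alignment."""
    align_bytes = base_addr_align_bits // 8
    return (offset_bytes + align_bytes - 1) // align_bytes * align_bytes


if __name__ == "__main__":
    align_bits = 1024  # hypothetical device value, i.e. 128 bytes
    for offset in (0, 100, 128, 300):
        print(offset, is_origin_aligned(offset, align_bits), align_up(offset, align_bits))
```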

4 weeks ago ggml : add ggml_gelu_erf() (llama/13667)
Xuan-Son Nguyen [Wed, 21 May 2025 14:26:33 +0000 (16:26 +0200)]
ggml : add ggml_gelu_erf() (llama/13667)

* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggestions

* revert naming order
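
For reference, the "not approximated" GELU uses the error function directly: GELU(x) = 0.5 · x · (1 + erf(x / √2)). The Python sketch below compares it with the widely used tanh approximation. It is a plain-math illustration, not the ggml implementation, and the function names are chosen for this example:

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0):
        print(f"x={x:+.1f}  erf={gelu_erf(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree closely over typical activation ranges, but models trained with the exact form may expect the erf version at inference time.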

4 weeks ago musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accele...
R0CKSTAR [Wed, 21 May 2025 01:58:49 +0000 (09:58 +0800)]
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)

* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <redacted>
* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <redacted>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <redacted>
* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <redacted>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
4 weeks ago vulkan: fix warnings (llama/13626)
Eve [Tue, 20 May 2025 21:35:16 +0000 (21:35 +0000)]
vulkan: fix warnings (llama/13626)

* small fixes

* remove ifdef

4 weeks ago CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
Johannes Gäßler [Tue, 20 May 2025 12:45:07 +0000 (14:45 +0200)]
CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)

* CUDA: skip fully masked-out KV in FA vec kernel

4 weeks ago sycl: disable reorder for sycl mulmat (llama/13536)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (llama/13536)

4 weeks ago metal : fix typo in FA kernel comments (llama/13651)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (llama/13651)

4 weeks ago sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)

* Remove mmap workaround on windows

After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.

* Update llama-bench README

SYCL backend introduced a workaround that allows execution of
llama-bench without specifying the `-mmp 0` flag

4 weeks ago Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence...
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)

4 weeks ago sync : ggml
Georgi Gerganov [Tue, 27 May 2025 14:06:49 +0000 (17:06 +0300)]
sync : ggml

4 weeks ago docs : convert README_sycl.md to utf8 format [no ci] (#3191)
Daniel Bevenius [Tue, 27 May 2025 08:53:50 +0000 (10:53 +0200)]
docs : convert README_sycl.md to utf8 format [no ci] (#3191)

This commit updates the README_sycl.md file to use UTF-8 encoding.

The motivation for this is that while this file displays correctly in
github it will fail to render with tools that expect UTF-8 encoding.
For example this is the case when using `grip` to view the file locally.

4 weeks ago node : enable no_prints to suppress all output (#3189)
Daniel Bevenius [Tue, 27 May 2025 03:51:47 +0000 (05:51 +0200)]
node : enable no_prints to suppress all output (#3189)

This commit enables the node addon to suppress all output, even the
result of the transcription, if the no_prints parameter is set to true.

The motivation for this is that for the node addon there is a
fulfillment handler/success callback to process the transcription
result. It can be useful to disable the printing of the
transcription result to the console, so that the user can handle
the result in their own way.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3176

4 weeks agotalk-llama : fix for swedish umlauts + expose model inference settings in talk-llama...
matteng1 [Mon, 26 May 2025 05:57:39 +0000 (07:57 +0200)]
talk-llama : fix for swedish umlauts + expose model inference settings in talk-llama.cpp (#3187)

Quick fix so that Swedish umlauts are no longer removed.

* Update talk-llama.cpp

Expose model inference settings to the user instead of hard-coding them. The defaults are unchanged.

* Update examples/talk-llama/talk-llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

5 weeks agodocs : fix VAD section heading levels (#3186)
KITAITI Makoto [Fri, 23 May 2025 08:38:26 +0000 (17:38 +0900)]
docs : fix VAD section heading levels (#3186)

5 weeks agoci : use dynamic libopenblas.dll for window-blas (#3177)
Daniel Bevenius [Fri, 23 May 2025 03:48:08 +0000 (05:48 +0200)]
ci : use dynamic libopenblas.dll for window-blas (#3177)

* ci : use dynamic libopenblas.dll for window-blas

This commit updates the windows-blas job to use the dynamic (can load
different kernels depending on the CPU arch) libopenblas.dll instead of
the "static" openblas.dll that gets installed by vcpkg.

The motivation for this change is that there have been reports of
performance drops in later versions specifically related to blas. Please
see the links below for more details.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3166
Refs: https://github.com/ggml-org/whisper.cpp/issues/2666#issuecomment-2885978811

5 weeks agoserver : Add k6 Load Testing Script (#3175)
Sacha Arbonel [Thu, 22 May 2025 08:03:04 +0000 (10:03 +0200)]
server : Add k6 Load Testing Script (#3175)

* add load testing script and update README for k6 integration

5 weeks agodocs : add VAD model download instructions [no ci] (#3180)
Daniel Bevenius [Thu, 22 May 2025 05:49:29 +0000 (07:49 +0200)]
docs : add VAD model download instructions [no ci] (#3180)

5 weeks agodocs : replace typo "]"with ")" in README (#3179)
Alpaim [Thu, 22 May 2025 03:49:44 +0000 (06:49 +0300)]
docs : replace typo "]"with ")" in README (#3179)

5 weeks agowhisper : remove redundant assignments (#3178)
Daniel Bevenius [Wed, 21 May 2025 11:23:20 +0000 (13:23 +0200)]
whisper : remove redundant assignments (#3178)

This commit removes some redundant assignments in the function
`whisper_exp_compute_token_level_timestamps`.

The motivation for this is that `tokens[j]` and `token` are references to
the same object, which can be a little confusing when reading the
code.

5 weeks agowhisper : update CMakeLists.txt to handle deprecated gpu Warnings (#3163)
Jugal Haresh Sheth [Tue, 20 May 2025 09:58:25 +0000 (10:58 +0100)]
whisper : update CMakeLists.txt to handle deprecated gpu Warnings (#3163)

* Fix CMakeLists.txt to handle deprecated gpu Warnings

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled and not MSVC

---------

Co-authored-by: Jugal Sheth <redacted>

5 weeks agoruby : add GGML_SYCL_DNN option to ruby bindings (#3172)
Daniel Bevenius [Mon, 19 May 2025 15:59:43 +0000 (17:59 +0200)]
ruby : add GGML_SYCL_DNN option to ruby bindings (#3172)

This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for
the GGML library. This option was added to ggml in
commit 5e7e07758a5f3172380500e173ca71f679bbef1e ("sycl: use oneDNN for
matrices multiplication").

The motivation for this change is to enable the CI build to pass.

5 weeks agotalk-llama : sync llama.cpp
Georgi Gerganov [Mon, 19 May 2025 10:39:12 +0000 (13:39 +0300)]
talk-llama : sync llama.cpp

ggml-ci

5 weeks agosync : ggml
Georgi Gerganov [Mon, 19 May 2025 10:38:44 +0000 (13:38 +0300)]
sync : ggml

ggml-ci

5 weeks agoCANN: Support MOE Model MUL_MAT_ID (llama/13042)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (llama/13042)

Signed-off-by: noemotiovon <redacted>

5 weeks agocmake: use the current build config for vulkan-shaders-gen (llama/13595)
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (llama/13595)

* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`

5 weeks agovulkan: move common FA code to flash_attn_base.comp (llama/13556)
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (llama/13556)

* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix

5 weeks agovulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)

5 weeks agometal : add FA-vec kernel for head size 64 (llama/13583)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (llama/13583)

ggml-ci

5 weeks agosycl : fixed compilation warnings (llama/13582)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (llama/13582)

5 weeks agogguf : use ggml log system (llama/13571)
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (llama/13571)

* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages

5 weeks agosycl: simplify bin_bcast_kernel (llama/13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (llama/13383)

5 weeks agosycl: reordered Q4_K MMVQ (llama/13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (llama/13109)

5 weeks agosycl: use oneDNN for matrices multiplication (llama/12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (llama/12972)

5 weeks agoarm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)

This PR improves the q6_k_q8_k gemm kernel using the arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
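
The uplift percentages quoted above can be recomputed from the table.
The sketch below is only a sanity check on the published numbers, not
part of the PR; the helper name `uplift_percent` is hypothetical and the
figures are copied from the table as-is:

```python
# Throughput figures (tokens/s) copied from the benchmark table above.
# Each row: (batch size, S_PP original, S_PP this PR, S_TG original, S_TG this PR)
ROWS = [
    (1, 78.52, 109.18, 18.63, 18.88),
    (2, 84.62, 123.94, 34.54, 36.92),
    (4, 84.36, 122.49, 52.65, 61.32),
    (8, 90.52, 138.87, 63.46, 84.41),
    (16, 90.11, 138.56, 71.04, 101.33),
    (32, 89.81, 137.79, 75.14, 110.47),
]


def uplift_percent(original, improved):
    """Relative throughput gain of `improved` over `original`, in percent."""
    return (improved / original - 1.0) * 100.0


# Map batch size -> uplift for prompt processing (PP) and text generation (TG).
pp_uplift = {b: uplift_percent(pp0, pp1) for b, pp0, pp1, _, _ in ROWS}
tg_uplift = {b: uplift_percent(tg0, tg1) for b, _, _, tg0, tg1 in ROWS}

for b in pp_uplift:
    print(f"B={b:>2}  S_PP {pp_uplift[b]:5.1f}%  S_TG {tg_uplift[b]:5.1f}%")
```

Uplift is defined here as `this pr / original - 1`, which is the usual
reading of "uplift" and appears consistent with the ranges stated above.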

5 weeks agoCUDA: fix crash on large batch size for quant. MoE (llama/13537)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (llama/13537)

5 weeks agoCUDA: faster Deepseek FA, add Turing support (llama/13435)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (llama/13435)

5 weeks agocmake: simplify vulkan shader test logic (llama/13263)
bandoti [Wed, 14 May 2025 10:53:57 +0000 (07:53 -0300)]
cmake: simplify vulkan shader test logic (llama/13263)

5 weeks agovulkan: KHR_coopmat flash attention (llama/13506)
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (llama/13506)

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.

5 weeks agovulkan: workaround FA compile failures on macos (llama/13517)
Jeff Bolz [Wed, 14 May 2025 04:15:50 +0000 (13:15 +0900)]
vulkan: workaround FA compile failures on macos (llama/13517)

5 weeks agometal : use FA-vec kernel up to batch size 20 (llama/13496)
Georgi Gerganov [Tue, 13 May 2025 15:04:39 +0000 (18:04 +0300)]
metal : use FA-vec kernel up to batch size 20 (llama/13496)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci