Daniel Bevenius [Mon, 24 Nov 2025 11:51:50 +0000 (12:51 +0100)]
ggml : remove dirty flag from version string (ggml/1391)
This commit removes the "-dirty" suffix from the GGML version string.
The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.
Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
from a modified/dirty state even though the version might match a "real"
version.
For example, the commit can be produces as follows:
```c++
printf("commit: %s\n", ggml_commit());
```
Which would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```
Joseph Sellers [Sat, 6 Dec 2025 11:28:32 +0000 (11:28 +0000)]
vad : fix buffer overflow in sample reduction loop (#3558)
The buffer size calculation loop (line ~6661) uses `n_samples - 1` as
the upper bound for segment_end_samples, but the copy loop (line 6696)
uses `n_samples`. This inconsistency allows the copy loop to compute
segment_length values up to 1 sample larger per segment than what was
allocated, causing heap corruption.
Symptom: `malloc(): corrupted top size` or `malloc(): invalid size
(unsorted)` crashes after VAD completes sample reduction.
Fix: Use consistent bounds (`n_samples - 1`) in both loops.
Daniel Bevenius [Sat, 6 Dec 2025 09:58:58 +0000 (10:58 +0100)]
tests : update VAD tests to use Silero V6.2.0 (#3534)
* tests : update VAD tests to use Silero V6.2.0
This commit updates the VAD tests to use the Silero V6.2.0 instead of
V5.1.2. I'm was not sure if we needed to keep testing for both versions,
but opted to just update to the latest version for simplicity.
* wasm : use C++17 for emscripten builds
This commit updates the CMakeLists.txt file to explicitly set the C++
standard to C++17 when building with Emscripten.
The motivation for this change is that building with Emscripten
will currently fail locally and on CI with the following error:
```console
[ 75%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
In file included from /home/danbev/work/ai/whisper.cpp/examples/stream.wasm/emscripten.cpp:5:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
In file included from /home/danbev/work/ai/whisper.cpp/examples/whisper.wasm/emscripten.cpp:4:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
```
hipudding [Mon, 17 Nov 2025 00:43:59 +0000 (08:43 +0800)]
CANN: Use smart pointers to manage ACL objects (llama/17238)
* CANN: Use smart pointers to manage ACL objects
Previously, ACL objects were managed via manual destruction, which
led to multiple memory-leak issues during runtime. This patch replaces
manual memory management with smart pointers so that ACL objects
are properly released and ownership is clearly defined.
Note that the ownership of an ACL object belongs to the function
that creates it. Other internal functions should operate on these ACL
objects using raw pointers to avoid unintended ownership transfers.
Additionally, since aclTensorList automatically frees its contained
aclTensor objects, any aclTensor added to a tensor list must release
ownership to avoid double free operations.
This PR also removes the asynchronous task submission mechanism.
Due to changes in recent CANN versions, tiling time has significantly
decreased. Even with a dual-thread submission model, the dispatch
overhead still falls on the critical path, making async submission
less beneficial. Moreover, aclGraph support provides a much better
path to reducing operator dispatch latency.
Jeff Bolz [Sat, 15 Nov 2025 08:06:41 +0000 (02:06 -0600)]
vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158)
* vulkan: change graph_compute to be async and enable get_tensor_async
This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.
Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.
The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.
The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
Max Krasnyansky [Tue, 11 Nov 2025 23:25:04 +0000 (15:25 -0800)]
hexagon: various Op fixes (llama/17135)
* hexagon: explicitly check for ops with zero nrows
llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seems to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.
* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL
Co-authored-by: chraac <redacted>
* hexagon: use fastdiv in ADD_ID
* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs
Mike Abbott [Tue, 11 Nov 2025 11:19:50 +0000 (04:19 -0700)]
cmake : add version to all shared object files (llama/17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned. This applies a version to all generated so files, allowing the package to build without errors.
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
iron [Fri, 7 Nov 2025 16:18:14 +0000 (00:18 +0800)]
ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239)
When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.
bssrdf [Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)]
improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)
* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* tranpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true tranposed case; updated with review suggestions
Noah [Tue, 4 Nov 2025 05:04:59 +0000 (05:04 +0000)]
Fix garbled output with REPACK at high thread counts (llama/16956)
* Fix garbled output with REPACK at high thread counts
Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------