* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.
**Proposed solution**
* Obtain `matched_graph` only if graph mode is enabled.
* Place end of graph capture and graph replay inside the same `if` block (see the sketch below).
* Unify graph-related comments.
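The restructuring amounts to the control-flow change sketched below; the helper names (`graph_mode_enabled`, `find_matched_graph`, `end_capture`, `replay`) are hypothetical stand-ins, since the backend's real API is not shown in this log:
```c++
#include <cstdio>
#include <optional>

// Hypothetical stand-ins for the backend's graph objects and helpers.
struct backend_graph {};

static bool graph_mode_enabled()                          { return true; }
static std::optional<backend_graph> find_matched_graph()  { return backend_graph{}; }
static void end_capture(backend_graph &)                  { std::puts("end capture"); }
static void replay(backend_graph &)                       { std::puts("replay"); }
static void run_graph_directly()                          { std::puts("direct run"); }

void compute() {
    if (graph_mode_enabled()) {
        // matched_graph is only looked up when graph mode is actually on ...
        std::optional<backend_graph> matched_graph = find_matched_graph();
        if (matched_graph) {
            // ... and end of capture plus replay live in the same if block.
            end_capture(*matched_graph);
            replay(*matched_graph);
            return;
        }
    }
    run_graph_directly();
}

int main() { compute(); }
```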
Jeff Bolz [Wed, 19 Nov 2025 16:25:50 +0000 (10:25 -0600)]
vulkan: support larger argsort (llama/17313)
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory. When more than 1024 threads are needed,
it runs multiple workgroups and synchronizes through a pipeline barrier
(a CPU sketch of the sorting scheme follows the notes below).
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* Reduce loop overhead
* Run multiple cols per invocation, to reduce barrier overhead
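For illustration, here is a minimal CPU sketch of the bitonic argsort idea (not the GLSL shader itself), keeping a copy of the float value next to each index as described above, and assuming the number of elements is a power of two:
```c++
#include <cstdio>
#include <utility>
#include <vector>

// Each element carries both the value and its original index, mirroring the
// shader layout where a copy of the float travels with the index.
struct kv { float val; int idx; };

// In-place bitonic sort (ascending by val); n must be a power of two.
static void bitonic_argsort(std::vector<kv> & a) {
    const int n = (int) a.size();
    for (int k = 2; k <= n; k <<= 1) {           // size of the bitonic sequences
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare-exchange distance
            // In the shader, each iteration of this loop is a pass separated
            // by a barrier; here it is just a plain loop over all pairs.
            for (int i = 0; i < n; ++i) {
                const int partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i].val > a[partner].val) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}

int main() {
    std::vector<kv> a = { {3.0f,0}, {1.0f,1}, {4.0f,2}, {1.5f,3},
                          {9.0f,4}, {2.0f,5}, {6.0f,6}, {5.0f,7} };
    bitonic_argsort(a);
    for (const kv & e : a) {
        std::printf("%d ", e.idx);  // indices in ascending order of value
    }
    std::printf("\n");
}
```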
Chenguang Li [Tue, 18 Nov 2025 08:41:52 +0000 (16:41 +0800)]
CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (llama/17347)
* cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation
Fix compilation errors in the ASCEND_310P-specific ROPE operation code
by adding .get() calls when passing acl_tensor_ptr smart pointers to
functions expecting raw aclTensor* pointers.
This fixes code that was missed in the previous refactoring commit
(8981848), which changed the ggml_cann_create_tensor() return type from
aclTensor* to acl_tensor_ptr.
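The fix boils down to passing `.get()` wherever a raw pointer is expected; a minimal sketch with simplified stand-in types (`aclTensor`, `acl_tensor_ptr`, and `rope_kernel` here are illustrative, not the real CANN API):
```c++
#include <cstdio>
#include <memory>

// Stand-in for the ACL tensor type and its destroy function.
struct aclTensor { int id; };
static void destroy_acl_tensor(aclTensor * t) { delete t; }

// Stand-in for acl_tensor_ptr: a smart pointer that owns the aclTensor.
using acl_tensor_ptr = std::unique_ptr<aclTensor, void (*)(aclTensor *)>;

static acl_tensor_ptr make_tensor(int id) {
    return acl_tensor_ptr(new aclTensor{id}, destroy_acl_tensor);
}

// Legacy-style function that still expects a raw aclTensor*.
static void rope_kernel(aclTensor * src, aclTensor * dst) {
    std::printf("rope: %d -> %d\n", src->id, dst->id);
}

int main() {
    acl_tensor_ptr src = make_tensor(0);
    acl_tensor_ptr dst = make_tensor(1);
    // The fix: pass .get() so the smart pointer keeps ownership while the
    // callee sees the raw pointer it expects.
    rope_kernel(src.get(), dst.get());
}
```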
Daniel Bevenius [Mon, 24 Nov 2025 11:51:50 +0000 (12:51 +0100)]
ggml : remove dirty flag from version string (ggml/1391)
This commit removes the "-dirty" suffix from the GGML version string.
The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.
Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
a modified/dirty state even though the version might match a "real"
version.
For example, the commit can be printed as follows:
```c++
printf("commit: %s\n", ggml_commit());
```
This would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```
Joseph Sellers [Sat, 6 Dec 2025 11:28:32 +0000 (11:28 +0000)]
vad : fix buffer overflow in sample reduction loop (#3558)
The buffer size calculation loop (line ~6661) uses `n_samples - 1` as
the upper bound for `segment_end_samples`, but the copy loop (line 6696)
uses `n_samples`. This inconsistency allows the copy loop to compute
`segment_length` values up to 1 sample larger per segment than what was
allocated, causing heap corruption.
Symptom: `malloc(): corrupted top size` or `malloc(): invalid size
(unsorted)` crashes after VAD completes sample reduction.
Fix: Use consistent bounds (`n_samples - 1`) in both loops.
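A minimal sketch of the consistent-bounds pattern; the `segment` struct and `reduce_samples` helper are hypothetical, and only `n_samples` and the clamping to `n_samples - 1` come from the description above:
```c++
#include <algorithm>
#include <cstdint>
#include <vector>

struct segment { int64_t start; int64_t end; };  // hypothetical layout

// The size-calculation loop and the copy loop must clamp the segment end with
// the SAME bound; mixing n_samples - 1 and n_samples lets the copy loop
// produce segments one sample longer than what was allocated.
static std::vector<float> reduce_samples(const std::vector<segment> & segs,
                                         const float * samples, int64_t n_samples) {
    int64_t total = 0;
    for (const segment & s : segs) {
        const int64_t seg_end = std::min(s.end, n_samples - 1);
        total += seg_end - s.start;
    }

    std::vector<float> out;
    out.reserve(total);
    for (const segment & s : segs) {
        const int64_t seg_end = std::min(s.end, n_samples - 1);  // same bound as above
        out.insert(out.end(), samples + s.start, samples + seg_end);
    }
    return out;
}

int main() {
    const std::vector<float>   samples = {0, 1, 2, 3, 4, 5, 6, 7};
    const std::vector<segment> segs    = { {1, 3}, {5, 100} };  // second end gets clamped
    const std::vector<float>   out     = reduce_samples(segs, samples.data(),
                                                        (int64_t) samples.size());
    return (int) out.size();  // 2 + 2 = 4 samples kept
}
```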
Daniel Bevenius [Sat, 6 Dec 2025 09:58:58 +0000 (10:58 +0100)]
tests : update VAD tests to use Silero V6.2.0 (#3534)
* tests : update VAD tests to use Silero V6.2.0
This commit updates the VAD tests to use Silero V6.2.0 instead of
V5.1.2. I was not sure if we needed to keep testing both versions,
but opted to just update to the latest version for simplicity.
* wasm : use C++17 for emscripten builds
This commit updates the CMakeLists.txt file to explicitly set the C++
standard to C++17 when building with Emscripten.
The motivation for this change is that building with Emscripten
will currently fail locally and on CI with the following error:
```console
[ 75%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
In file included from /home/danbev/work/ai/whisper.cpp/examples/stream.wasm/emscripten.cpp:5:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
In file included from /home/danbev/work/ai/whisper.cpp/examples/whisper.wasm/emscripten.cpp:4:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
```
hipudding [Mon, 17 Nov 2025 00:43:59 +0000 (08:43 +0800)]
CANN: Use smart pointers to manage ACL objects (llama/17238)
* CANN: Use smart pointers to manage ACL objects
Previously, ACL objects were managed via manual destruction, which
led to multiple memory-leak issues during runtime. This patch replaces
manual memory management with smart pointers so that ACL objects
are properly released and ownership is clearly defined.
Note that the ownership of an ACL object belongs to the function
that creates it. Other internal functions should operate on these ACL
objects using raw pointers to avoid unintended ownership transfers.
Additionally, since aclTensorList automatically frees its contained
aclTensor objects, any aclTensor added to a tensor list must release
ownership to avoid double free operations.
This PR also removes the asynchronous task submission mechanism.
Due to changes in recent CANN versions, tiling time has significantly
decreased. Even with a dual-thread submission model, the dispatch
overhead still falls on the critical path, making async submission
less beneficial. Moreover, aclGraph support provides a much better
path to reducing operator dispatch latency.
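A rough sketch of the ownership rules described above, again with simplified stand-in types rather than the real ACL API: the creating function owns the object through a smart pointer, internal helpers receive raw pointers, and anything handed to a list that frees its elements must be release()d first to avoid a double free:
```c++
#include <cstdio>
#include <memory>
#include <vector>

// Stand-ins for the ACL types; the real ones come from the CANN SDK.
struct aclTensor { int id; };
static void destroy_tensor(aclTensor * t) { delete t; }

using acl_tensor_ptr = std::unique_ptr<aclTensor, void (*)(aclTensor *)>;

static acl_tensor_ptr create_tensor(int id) {
    return acl_tensor_ptr(new aclTensor{id}, destroy_tensor);
}

// Stand-in for aclTensorList: it destroys the tensors it contains.
struct acl_tensor_list {
    std::vector<aclTensor *> items;
    ~acl_tensor_list() { for (aclTensor * t : items) destroy_tensor(t); }
};

// Internal helper: operates on a raw pointer, never takes ownership.
static void fill(aclTensor * t) { std::printf("fill tensor %d\n", t->id); }

int main() {
    // The creating function owns the tensor.
    acl_tensor_ptr a = create_tensor(0);
    fill(a.get());                     // helpers only ever see a raw pointer

    acl_tensor_list list;
    // The list frees its elements, so give up ownership when inserting;
    // keeping the unique_ptr alive here would cause a double free.
    list.items.push_back(a.release());
}
```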
Jeff Bolz [Sat, 15 Nov 2025 08:06:41 +0000 (02:06 -0600)]
vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158)
* vulkan: change graph_compute to be async and enable get_tensor_async
This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.
Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.
The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.
The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
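The submit-then-wait-later shape described above looks roughly like the following; the handles are assumed to be created elsewhere, and this is a condensed illustration rather than the backend's actual ggml_vk_synchronize:
```c++
#include <cstdint>
#include <vulkan/vulkan.h>

// Submit recorded compute work without blocking; the fence is signaled when
// the GPU finishes, so the CPU can keep building/submitting the next graph.
void submit_graph_compute(VkQueue compute_queue, VkCommandBuffer cmd, VkFence fence) {
    VkSubmitInfo info = {};
    info.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    info.commandBufferCount = 1;
    info.pCommandBuffers    = &cmd;
    vkQueueSubmit(compute_queue, 1, &info, fence);
}

// Later, a synchronize step waits on that fence before results are read back.
void synchronize(VkDevice device, VkFence fence) {
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fence);
}
```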
Max Krasnyansky [Tue, 11 Nov 2025 23:25:04 +0000 (15:25 -0800)]
hexagon: various Op fixes (llama/17135)
* hexagon: explicitly check for ops with zero nrows
llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seem to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.
* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL
Co-authored-by: chraac <redacted>
* hexagon: use fastdiv in ADD_ID
* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs
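For background, the fastdiv mentioned in the bullets above replaces a runtime integer division by a precomputed multiply-and-shift; a generic sketch of the idea for 32-bit unsigned division (a Granlund-Montgomery style magic number, not the Hexagon kernel's actual code):
```c++
#include <cassert>
#include <cstdint>
#include <cstdio>

// Precomputed "magic" constants for dividing by a fixed d (d >= 1).
struct fastdiv_consts { uint32_t mp; uint32_t shift; };

static fastdiv_consts fastdiv_init(uint32_t d) {
    uint32_t L = 0;                               // L = ceil(log2(d))
    while (L < 32 && (uint64_t(1) << L) < d) {
        L++;
    }
    // mp = ceil(2^(32+L) / d) - 2^32, kept in 32 bits.
    const uint32_t mp = (uint32_t)(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return { mp, L };
}

static uint32_t fastdiv(uint32_t n, fastdiv_consts c) {
    const uint32_t hi = (uint32_t)(((uint64_t) n * c.mp) >> 32);  // high half of n*mp
    return (uint32_t)(((uint64_t) hi + n) >> c.shift);
}

int main() {
    // Spot-check against real division for a few divisors.
    for (uint32_t d : {1u, 3u, 7u, 11u, 640u, 4096u}) {
        const fastdiv_consts c = fastdiv_init(d);
        for (uint32_t n = 0; n < 100000; ++n) {
            assert(fastdiv(n, c) == n / d);
        }
    }
    std::puts("fastdiv ok");
}
```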
Mike Abbott [Tue, 11 Nov 2025 11:19:50 +0000 (04:19 -0700)]
cmake : add version to all shared object files (llama/17091)
When compiling llama.cpp in Yocto, the build fails QA checks because the generated .so files aren't versioned. This applies a version to all generated .so files, allowing the package to build without errors.
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted into at the
same time, which requires holding the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
iron [Fri, 7 Nov 2025 16:18:14 +0000 (00:18 +0800)]
ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239)
When using GCC 9 or GCC 12 on the arm64 platform of Ubuntu 20.04,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions.
The correct CPU flags can be obtained by using gcc -march instead.