Gabe Goodhart [Thu, 4 Dec 2025 17:12:19 +0000 (10:12 -0700)]
metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623)
* feat(wip): Port initial TRI impl from previous work
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove argument for constant val override
This was added in the original draft, but later removed. With this, the
kernel now passes tests.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Move the ttype conditional to templating to avoid conditional in kernel
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Type fixes
Signed-off-by: Gabe Goodhart <redacted> Co-authored-by: Georgi Gerganov <redacted>
* feat: Add softplus for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Add EXPM1 for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Add FILL for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove unused arguments
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* refactor: Use select instead of branch for softplus non-vec
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
---------
Signed-off-by: Gabe Goodhart <redacted> Co-authored-by: Georgi Gerganov <redacted>
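The two branchless refactors above (the TRI mask and the softplus select) follow the same pattern: compute the candidate values and combine them with the comparison result instead of branching per element. Below is a minimal CPU-side C++ sketch of that idea only; the real code lives in the Metal kernels, uses different names, and the softplus cutoff here is an assumption, not taken from the commit.
```c++
#include <cmath>

// Branchless lower-triangular mask: the row/column comparison is used as a
// 0/1 factor instead of an if/else per element.
static float tri_keep_lower(float v, int row, int col) {
    return v * (float)(col <= row);   // 1.0f on/below the diagonal, 0.0f above
}

// Softplus in a select-friendly form: both candidates are computed and the
// comparison picks one (this maps to a select() in a GPU kernel).
static float softplus_select(float x) {
    const float threshold = 20.0f;                           // assumed cutoff
    const float stable    = std::log1p(std::exp(std::fmin(x, threshold)));
    return (x > threshold) ? x : stable;
}
```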
Jeff Bolz [Tue, 2 Dec 2025 18:22:04 +0000 (12:22 -0600)]
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update the shader addressing math to use the output row size.
- Pass the output row size as "ncols_output"; the parameter that used to be "ncols_output" is now "k".
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
Radoslav Gerganov [Fri, 28 Nov 2025 08:33:51 +0000 (10:33 +0200)]
rpc : cache and reuse compute graphs (llama/15405)
Store the last computed graph and reuse it when possible.
Also, do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
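A rough sketch of the caching idea, seen from the server side; the real rpc-server code uses ggml types and a different layout, and the names below are placeholders invented for illustration.
```c++
#include <cstdint>
#include <vector>

// Placeholder for the graph structure the server actually builds.
struct compute_graph_stub {};

struct rpc_server_sketch {
    std::vector<uint8_t> last_blob;      // serialized graph from the previous request
    compute_graph_stub   last_graph;     // graph built from that blob

    void graph_compute(const std::vector<uint8_t> & blob) {
        if (blob != last_blob) {
            // graph changed: rebuild it and remember the new serialized form
            last_blob  = blob;
            last_graph = compute_graph_stub{};   // stands in for deserializing the graph
        }
        // compute last_graph here; no response is sent back. If computation
        // fails, the server just closes the connection, which is what saves
        // the extra network round trip mentioned above.
    }
};
```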
Jeff Bolz [Wed, 26 Nov 2025 15:45:43 +0000 (09:45 -0600)]
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. This is repeated until only K elements are left. There is also
a fast path for K==1 that just finds the max value rather than sorting.
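A host-side C++ sketch of this multi-pass reduction, values only for brevity and with a made-up block size; the real implementation sorts (value, index) pairs in Vulkan shader passes.
```c++
#include <algorithm>
#include <functional>
#include <vector>

// One pass: sort each block descending and keep only that block's top k.
static std::vector<float> top_k_pass(const std::vector<float> & in, size_t k, size_t block) {
    std::vector<float> out;
    for (size_t i = 0; i < in.size(); i += block) {
        std::vector<float> b(in.begin() + i, in.begin() + std::min(in.size(), i + block));
        std::sort(b.begin(), b.end(), std::greater<float>());
        b.resize(std::min(k, b.size()));
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}

static std::vector<float> top_k(std::vector<float> vals, size_t k, size_t block = 256) {
    if (k == 1) {
        // fast path: just find the max, no sorting needed
        return { *std::max_element(vals.begin(), vals.end()) };
    }
    block = std::max(block, k + 1);          // each pass must shrink the candidate set
    while (vals.size() > k) {
        vals = top_k_pass(vals, k, block);   // repeat until only k candidates remain
    }
    return vals;
}
```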
Adrien Gallouët [Wed, 26 Nov 2025 13:14:41 +0000 (14:14 +0100)]
ggml : fix ARM feature verification (llama/17519)
On arm64 with `cmake` version 3.31.6, the final feature verification fails:
-- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SME
-- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Failed
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Failed
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Failed
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
We need to explicitly replace `;` with spaces in the list to make
`CMAKE_REQUIRED_FLAGS` work correctly.
hipudding [Wed, 26 Nov 2025 08:44:19 +0000 (16:44 +0800)]
CANN: Add MROPE and IMROPE support (llama/17401)
* CANN: ROPE supports both MROPE and IMROPE.
1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.
Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.
* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.
**Proposed solution**
* Obtain `matched_graph` only if graph mode is enabled.
* Place the end of graph capture and the graph replay inside the same `if` block.
* Unify graph-related comments.
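A rough C++ sketch of the proposed control flow; the CANN backend uses ACL graph APIs, and the functions below are placeholders rather than real CANN calls.
```c++
// Opaque stand-in for the backend's graph handle.
struct acl_graph;

static acl_graph * lookup_matched_graph() { return nullptr; }   // placeholder
static void        begin_graph_capture()  {}                    // placeholder
static acl_graph * end_graph_capture()    { return nullptr; }   // placeholder
static void        replay_graph(acl_graph *) {}                 // placeholder
static void        run_ops_eagerly()      {}                    // placeholder

static void compute(bool graph_mode_enabled) {
    if (graph_mode_enabled) {
        // obtain matched_graph only when graph mode is enabled ...
        acl_graph * matched_graph = lookup_matched_graph();
        if (matched_graph == nullptr) {
            begin_graph_capture();
            run_ops_eagerly();                 // ops are recorded into the graph
            matched_graph = end_graph_capture();
        }
        // ... and keep the end of capture and the replay in the same `if` block
        replay_graph(matched_graph);
    } else {
        run_ops_eagerly();                     // graph mode disabled: plain execution
    }
}
```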
Jeff Bolz [Wed, 19 Nov 2025 16:25:50 +0000 (10:25 -0600)]
vulkan: support larger argsort (llama/17313)
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory. When more than 1024 threads are needed,
it runs multiple workgroups and synchronizes through a pipeline barrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
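A small C++ illustration of the data-layout point above: each sort record carries a copy of the float next to its index, so a compare-and-swap touches one contiguous record instead of chasing the index back to the value array. The struct and function names are invented for this sketch.
```c++
#include <cstdint>

struct sort_entry {
    float    value;   // copy of the value being sorted on
    uint32_t index;   // original position, which is the actual argsort output
};

// One compare-and-swap step of a bitonic network on two such records.
static void compare_and_swap(sort_entry & a, sort_entry & b, bool ascending) {
    const bool do_swap = (a.value > b.value) == ascending;
    if (do_swap) {
        const sort_entry tmp = a;
        a = b;
        b = tmp;
    }
}
```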
Chenguang Li [Tue, 18 Nov 2025 08:41:52 +0000 (16:41 +0800)]
CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (llama/17347)
* cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation
Fix compilation errors in the ASCEND_310P-specific ROPE operation code
by adding .get() calls when passing acl_tensor_ptr smart pointers to
functions expecting raw aclTensor* pointers.
This fixes code that was missed in the previous refactoring commit
(8981848), which changed the return type of ggml_cann_create_tensor()
from aclTensor* to acl_tensor_ptr.
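A minimal illustration of the fix: when an API expects a raw pointer but the calling code now holds a smart pointer, pass `ptr.get()`. The types below are stand-ins for aclTensor / acl_tensor_ptr, not the real CANN declarations.
```c++
#include <memory>

struct aclTensorStub {};                                  // stand-in for aclTensor
using  acl_tensor_ptr = std::unique_ptr<aclTensorStub>;   // stand-in smart pointer type

static void op_taking_raw_pointer(aclTensorStub *) {}     // stand-in ACL call

static void example() {
    acl_tensor_ptr t = std::make_unique<aclTensorStub>();
    // op_taking_raw_pointer(t);        // would not compile: smart pointer != raw pointer
    op_taking_raw_pointer(t.get());     // the missed .get() calls added by this commit
}
```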
Daniel Bevenius [Mon, 24 Nov 2025 11:51:50 +0000 (12:51 +0100)]
ggml : remove dirty flag from version string (ggml/1391)
This commit removes the "-dirty" suffix from the GGML version string.
The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.
Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
a modified/dirty state even though the version might match a "real"
version.
For example, the commit can be printed as follows:
```c++
printf("commit: %s\n", ggml_commit());
```
Which would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```
Joseph Sellers [Sat, 6 Dec 2025 11:28:32 +0000 (11:28 +0000)]
vad : fix buffer overflow in sample reduction loop (#3558)
The buffer size calculation loop (line ~6661) uses `n_samples - 1` as
the upper bound for segment_end_samples, but the copy loop (line 6696)
uses `n_samples`. This inconsistency allows the copy loop to compute
segment_length values up to 1 sample larger per segment than what was
allocated, causing heap corruption.
Symptom: `malloc(): corrupted top size` or `malloc(): invalid size
(unsorted)` crashes after VAD completes sample reduction.
Fix: Use consistent bounds (`n_samples - 1`) in both loops.
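A condensed C++ illustration of the bug pattern and the fix; this is not the actual whisper.cpp code, and the segment representation is simplified (it assumes 0 <= seg.first <= seg.second <= n_samples).
```c++
#include <algorithm>
#include <utility>
#include <vector>

static std::vector<float> reduce_samples(const std::vector<float> & samples,
                                         const std::vector<std::pair<size_t, size_t>> & segments) {
    const size_t n_samples = samples.size();

    // pass 1: compute how many samples will be kept, clamping ends to n_samples - 1
    size_t total = 0;
    for (const auto & seg : segments) {
        total += std::min(seg.second, n_samples - 1) - seg.first;
    }

    std::vector<float> out(total);

    // pass 2: copy. Clamping to n_samples here instead of n_samples - 1 could write
    // one extra sample per segment past `total` -- the heap corruption this commit
    // fixes. The fix is to reuse the same n_samples - 1 bound as the sizing loop.
    size_t dst = 0;
    for (const auto & seg : segments) {
        const size_t seg_end = std::min(seg.second, n_samples - 1);
        for (size_t i = seg.first; i < seg_end; i++) {
            out[dst++] = samples[i];
        }
    }
    return out;
}
```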
Daniel Bevenius [Sat, 6 Dec 2025 09:58:58 +0000 (10:58 +0100)]
tests : update VAD tests to use Silero V6.2.0 (#3534)
* tests : update VAD tests to use Silero V6.2.0
This commit updates the VAD tests to use Silero V6.2.0 instead of
V5.1.2. I was not sure if we needed to keep testing both versions,
but opted to just update to the latest version for simplicity.
* wasm : use C++17 for emscripten builds
This commit updates the CMakeLists.txt file to explicitly set the C++
standard to C++17 when building with Emscripten.
The motivation for this change is that building with Emscripten
will currently fail locally and on CI with the following error:
```console
[ 75%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
In file included from /home/danbev/work/ai/whisper.cpp/examples/stream.wasm/emscripten.cpp:5:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
In file included from /home/danbev/work/ai/whisper.cpp/examples/whisper.wasm/emscripten.cpp:4:
/home/danbev/work/utils/emsdk/upstream/emscripten/cache/sysroot/include/emscripten/bind.h:11:2: error:
"embind requires -std=c++17 or newer"
11 | #error "embind requires -std=c++17 or newer"
| ^
```