Congcong Cai [Fri, 12 Dec 2025 14:37:38 +0000 (22:37 +0800)]
cmake : set `CMAKE_RUNTIME_OUTPUT_DIRECTORY` for non standalone build (#1394)
Some backend depends on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary file like metal backened.
Missing CMAKE_RUNTIME_OUTPUT_DIRECTORY will cause some cmake error like permission denied (try to copy file to root).
This PR wants to setup a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it does not exist.
Chenguang Li [Tue, 9 Dec 2025 09:53:23 +0000 (17:53 +0800)]
CANN: add support for partial RoPE and Vision mode (llama/17543)
* cann: add support for partial RoPE and Vision mode
Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.
1. Support for partial RoPE (rope_dims < ne0):
- Split tensor into head (first rope_dims dimensions) and tail portions
- Apply rotation only to head portion using RotaryPositionEmbedding operator
- Copy unrotated tail portion directly from source to destination
- Handle both contiguous and non-contiguous tensor layouts
2. Support for Vision mode (GGML_ROPE_TYPE_VISION):
- Set rope_dims = ne0 for Vision mode to rotate entire tensor
- Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
- No tail handling needed since entire tensor is rotated
Implementation details:
- Use has_tail flag to determine execution path: head/tail splitting when
rope_dims < ne0, or full tensor rotation when rope_dims == ne0
- Support both F32 and F16 data types with intermediate F32 conversion
- Copy non-contiguous tensors to contiguous buffers before calling
RotaryPositionEmbedding operator for compatibility
- Improve cache invalidation logic to include rope_dims and indep_sects
parameters
These enhancements enable CANN backend to handle various RoPE configurations
used in modern vision-language models and models with partial rotation.
wsbagnsv1 [Mon, 8 Dec 2025 09:41:08 +0000 (10:41 +0100)]
cuda: optimize SOLVE_TRI using registers and FMAF (llama/17703)
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
* Small cleanup
* Remove comments in solve_tri.cu
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <redacted>
* Use const for variables in solve_tri.cu
Jeff Bolz [Sat, 6 Dec 2025 17:46:46 +0000 (11:46 -0600)]
vulkan: perf_logger improvements (llama/17672)
* vulkan: perf_logger improvements
- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.
Jeff Bolz [Sat, 6 Dec 2025 10:12:26 +0000 (04:12 -0600)]
vulkan: Use one row per workgroup for f32 mmv (llama/17711)
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
Jeff Bolz [Fri, 5 Dec 2025 21:03:19 +0000 (15:03 -0600)]
vulkan: fix top_k bug when there are ties in the input (llama/17659)
* vulkan: Reduce temporary memory usage for TOP_K
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
* vulkan: fix top_k bug when there are ties in the input
I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).
I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
* Update tests/test-backend-ops.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
The XIELU operation was incorrectly using static_cast to convert
float parameters to uint32_t, which converted numeric values instead
of preserving IEEE 754 bit patterns. This caused incorrect values
to be interpreted by the GPU shader.
* Use reinterpret_cast to preserve float bit patterns when passing
through uint32_t params buffer
* Update WGSL shader parameter types from u32 to f32
* Re-enable XIELU support (was disabled due to numerical issues)
Fixes NMSE test failures for XIELU operation on WebGPU backend.
neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though
Co-authored-by: James Contini <redacted> Co-authored-by: Neha Abbas <redacted> Co-authored-by: Abhijit Ramesh <redacted>
* Remove extra code and format
Gabe Goodhart [Thu, 4 Dec 2025 17:12:19 +0000 (10:12 -0700)]
metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623)
* feat(wip): Port initial TRI impl from pervious work
The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove argument for constant val override
This was added in the original draft, but later removed. With this, the
kernel now passes tests.
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Move the ttype conditional to templating to avoid conditional in kernel
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Type fixes
Signed-off-by: Gabe Goodhart <redacted> Co-authored-by: Georgi Gerganov <redacted> Co-authored-by: Georgi Gerganov <redacted>
* feat: Add softplus for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Add EXPM1 for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* feat: Add FILL for metal
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove unused arguments
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
* refactor: Use select instead of branch for softplus non-vec
Branch: ggml-cumsum-tri
Signed-off-by: Gabe Goodhart <redacted>
---------
Signed-off-by: Gabe Goodhart <redacted> Co-authored-by: Georgi Gerganov <redacted>
Jeff Bolz [Tue, 2 Dec 2025 18:22:04 +0000 (12:22 -0600)]
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
Radoslav Gerganov [Fri, 28 Nov 2025 08:33:51 +0000 (10:33 +0200)]
rpc : cache and reuse compute graphs (llama/15405)
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.
Jeff Bolz [Wed, 26 Nov 2025 15:45:43 +0000 (09:45 -0600)]
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.
Adrien Gallouët [Wed, 26 Nov 2025 13:14:41 +0000 (14:14 +0100)]
ggml : fix ARM feature verification (llama/17519)
On arm64 with `cmake` version 3.31.6, the final feature verification fails:
-- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SME
-- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Failed
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Failed
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Failed
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
We need to explicitly replace `;` with spaces from the list to make
`CMAKE_REQUIRED_FLAGS` work correctly...
hipudding [Wed, 26 Nov 2025 08:44:19 +0000 (16:44 +0800)]
CANN: Add MROPE and IMROPE support (llama/17401)
* CANN: ROPE supports both MROPE and IMROPE.
1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.
Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.