tamarPal [Mon, 27 Oct 2025 01:20:24 +0000 (03:20 +0200)]
sycl: add ROLL operation support (llama/16665)
* sycl: add ROLL operation support
- Implement ggml_sycl_roll function for F32 tensors
- Add multi-axis roll operation with SYCL kernel
- Support all 4 tensor dimensions with proper shift normalization
- Add roll.cpp and roll.hpp to SYCL backend
- Update backend dispatch and supports_op for GGML_OP_ROLL
- Tests: 17662/17662 pass with identical CPU reference results
* fix: remove trailing whitespace from roll.cpp
- Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
- Remove trailing spaces from lines 6, 11, 28, 47, 58, 60
* ci: retrigger
* sycl: remove wait() calls from ROLL operation
* fix: editorconfig — LF endings + final newline for roll.hpp
Matthew Michel [Thu, 23 Oct 2025 01:05:15 +0000 (20:05 -0500)]
sycl: use async memory allocation to fix crashes during graph recording (llama/16644)
* sycl: use async memory allocation to fix graph recording failures
GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
- Host waits are currently unsupported in graph recording mode.
- SYCL malloc / free calls are unsupported in graph recording mode.
The following changes are made to fix SYCL graph functionality:
- When graphs are enabled, use the SYCL async memory extension for temp
buffers which is supported with SYCL graphs.
- For compiler versions that do not support this extension, skip
graphs with the affected op.
- Switch from USM shared to device memory as the async extension
currently just supports device allocations.
* Address reviewer feedback
* Use global async variable to decide path in sycl_ext_[malloc_device|free]
Max Krasnyansky [Wed, 22 Oct 2025 20:47:09 +0000 (13:47 -0700)]
Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547)
* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.
sirus20x6 [Wed, 22 Oct 2025 10:14:14 +0000 (05:14 -0500)]
ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (llama/16522)
* Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds.
Radoslav Gerganov [Fri, 17 Oct 2025 15:02:52 +0000 (18:02 +0300)]
rpc : report actual free memory (llama/16616)
* rpc : report actual free memory
Start reporting the free memory on every device instead of using
fixed values. Now llama-cli users can get a nice memory breakdown
when using RPC devices.
muggle-stack [Fri, 17 Oct 2025 10:01:23 +0000 (18:01 +0800)]
ggml : fix SpaceMit IME array out-of-bounds in task assignment (llama/16629)
Fix incorrect task-to-batch index calculation in the quantization phase.
The bug caused out-of-bounds access to qnbitgemm_args array when
compute_idx exceeded per_gemm_block_count_m, leading to invalid
pointer dereferences and SIGBUS errors.
Correctly map tasks to batches by dividing compute_idx by
per_gemm_block_count_m instead of block_size_m.
Chenguang Li [Thu, 16 Oct 2025 08:41:11 +0000 (16:41 +0800)]
CANN: format code using .clang-format (llama/15863)
This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.
takuya kodama [Thu, 16 Oct 2025 05:10:32 +0000 (13:10 +0800)]
ggml-cpu: replace putenv with setenv for const-correctness (llama/16573)
## Why it failed
When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
3572 | putenv("KMP_BLOCKTIME=200"); // 200ms
| ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
786 | extern int putenv (char *__string) __THROW __nonnull ((1));
| ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```
The issue is that putenv() expects a non-const char * but receives a string literal (const char *).
## How to fix
This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).
Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
safranowith [Wed, 15 Oct 2025 19:24:51 +0000 (22:24 +0300)]
cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (llama/16083)
* CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators
- Added the operators to unary op enum
- Implemented API functions
- Implemented forward and unary-op logic in CPU backend
- Updated ggml_get_n_tasks
- Updated operators names array and static_assert
- Updated docs and enabled automatic tests
* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h
* chore: remove trailing whitespace from ggml.h
* Remove unresolved merge markers
* Apply review suggestions: cleanup formatting, enum order and leftover artifacts
Chenguang Li [Mon, 13 Oct 2025 09:01:24 +0000 (17:01 +0800)]
CANN: fix CPU memory leak in CANN backend (llama/16549)
This commit fixes a CPU-side memory leak issue in the CANN backend,
which occurred when intermediate aclTensorList objects were not properly
released after operator execution. The leak happened during repeated
invocations of CANN ops (e.g., FlashAttention), leading to increasing
host memory usage over time.
Proper resource cleanup (aclDestroyTensorList and related release logic)
has been added to ensure that all temporary tensors are correctly freed.
hipudding [Mon, 13 Oct 2025 00:52:22 +0000 (08:52 +0800)]
CANN: Update several operators to support FP16 data format (llama/16251)
Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation, and then cast back to FP32 after computation, which
introduces unnecessary cast operations. Moreover, FP16 computation
requires significantly less workload compared to FP32, leading to
noticeable efficiency improvements.
In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.
sirus20x6 [Sun, 12 Oct 2025 05:15:00 +0000 (00:15 -0500)]
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (llama/16518)
The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.
Chenguang Li [Thu, 9 Oct 2025 07:50:25 +0000 (15:50 +0800)]
CANN: Improve ACL graph matching (llama/16166)
* CANN: improve ACL graph matching
Record `ne` and `nb` information for src tensors and include them in the
graph matching check. This enhances the robustness of ACL graph matching
by preventing incorrect matches when src tensors share the same data
address but differ in shape or stride.
Daniel Bevenius [Mon, 6 Oct 2025 12:17:12 +0000 (14:17 +0200)]
ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (llama/16443)
This commit updates the leftover handling in ggml_vec_scale_f32.
The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration
, there can be up to (2*ggml_f32_epr - 1) leftover elements.
The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.
Example scenario with 256-bit SVE:
```
ggml_f32_epr = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n = 25
np = 16
leftovers = 9 elements (16-24)
Original : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```
Acly [Sat, 11 Oct 2025 14:59:36 +0000 (17:59 +0300)]
vulkan : incremental shader builds (llama/16341)
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times
* support dep-files so shaders are recompiled if their included files change
* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled
* vulkan : only write embedded shader .hpp/.cpp when they change
* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier
* fix hang in vulkan-shaders-gen when there are compilation errors
* early out did not decrement compile_count
* clean up
* fix glslc integer dot product test
* unconditionally write the embedded shader cpp output
* replace output filepath in generated dep-files to match output in CMakeLists
Jeff Bolz [Fri, 3 Oct 2025 10:50:46 +0000 (05:50 -0500)]
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (llama/16354)
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull