Akarshan Biswas [Tue, 26 Aug 2025 18:57:49 +0000 (00:27 +0530)]
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (llama/15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused an GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.
This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
This patch improves GEMM for FP32 Data Type on PowerPC
Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
Verified functionality and correctness uing llama-cli and stand alone
test case (performs matmul and compares final mattrix C result with base).
Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces
Performance Testing:
Observed 50% ~ 70% improvement in Prompt Processing Speed mesured using
llama-bench with Meta-Llama3-8B FP32 Model. Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.
model Size Params Backend Threads Test Patch Base
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp512 98.58 60.3
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp1024 95.88 57.36
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp2048 85.46 53.26
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp4096 68.66 45.78
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp6144 57.35 40.44
25 ~ 30% improvement in llama-batched-bench with Metla-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096)tokens with various batch
sizes ( 1, 2, 4, 8, 16)
Jeff Bolz [Sun, 24 Aug 2025 09:24:25 +0000 (04:24 -0500)]
vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
Jeff Bolz [Sat, 23 Aug 2025 18:16:17 +0000 (13:16 -0500)]
vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (llama/15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
Jeff Bolz [Sat, 23 Aug 2025 07:33:36 +0000 (02:33 -0500)]
vulkan: Rewrite synchronization to allow some overlap between nodes (llama/15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
Jeff Bolz [Sat, 23 Aug 2025 06:31:54 +0000 (01:31 -0500)]
vulkan: optimize mul_mat_id loading row ids into shared memory (llama/15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
Daniel Bevenius [Thu, 28 Aug 2025 07:10:48 +0000 (09:10 +0200)]
ci : add github release job (#1334)
* ci : add github release job
This commit adds a GitHub Actions workflow to automate the release
process. Currently this will only create an archive of the sources for
ggml when a tag is pushed.
The motivation for this is that when we start releasing versions of ggml
using semantic versioning it can be nice to have the sources needed for
ggml to be deployed as a github release. This enables CMake users that
use `FetchContent` efficiently specify the the zip file instead of
cloning.
Example usage with `FetchContent`:
```cmake
cmake_minimum_required(VERSION 3.14)
project(ggml_example)
add_executable(ggml_example main.cpp)
target_link_libraries(ggml_example ggml)
```
And with the following `main.cpp` file:
```c++
#include <iostream>
#include <ggml.h>
int main() {
std::cout << "GGML Version: " << ggml_version() << std::endl;
return 0;
}
```
This could then be built using:
```console
$ cmake -S . -B build
$ cmake --build build
$ ./build/ggml_example
GGML Version: 0.0.2472
```
Jeff Bolz [Sun, 17 Aug 2025 16:08:57 +0000 (11:08 -0500)]
vulkan: Use larger workgroups for mul_mat_vec when M is small (llama/15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small
Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.
Jeff Bolz [Sun, 17 Aug 2025 08:41:45 +0000 (03:41 -0500)]
vulkan: Optimize argsort (llama/15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
Jeff Bolz [Sat, 16 Aug 2025 16:48:22 +0000 (11:48 -0500)]
vulkan: fuse adds (llama/15252)
* vulkan: fuse adds
Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to likely driver bug
Jeff Bolz [Sat, 16 Aug 2025 09:18:31 +0000 (04:18 -0500)]
vulkan: Support mul_mat_id with f32 accumulators (llama/15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id
* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up
- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
uvos [Thu, 14 Aug 2025 14:23:56 +0000 (16:23 +0200)]
HIP: Cleanup hipification header (llama/15285)
add expicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews
Jeff Bolz [Thu, 14 Aug 2025 13:38:10 +0000 (08:38 -0500)]
vulkan: perf_logger improvements (llama/15246)
* vulkan: perf_logger improvements
- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <redacted> Co-authored-by: Johannes Gäßler <redacted>
Oliver Simons [Wed, 13 Aug 2025 08:04:46 +0000 (10:04 +0200)]
CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)
* Factor out `reduce_rows_f32` from common.cuh
This increases iteration cycle speed by not having to recompile
every kernel all the time
* Hide memory-latency by loop unrolling in reduce_rows_f32
* Further optimizations to `reduce_rows_f32`
1. Increase threadblock size to better hide latency of memory requests.
As a consequence of bigger threadblocks, do 2-step summation, using
shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflext bigger threadblock
4. Improve default block_dims, increase support for more block_dims
* Add perf tests for `reduce_rows_f32` kernel
* Add heuristic to toggle 128/512 threads based on sm count
Break even point was the minimum of the following multiples.
| GPU Model | Nrow SM Count Multiple |
| ----------- | ----------- |
| RTX 4000 SFF ADA | 2.0x |
| RTX 6000 ADA | 2.5x |
| RTX PRO 6000 Blackwell Max-Q | 3.04x |
| RTX PRO 4500 Blackwell | 3.15x |
* Ensure perf gains also for small ncols and large nrows
Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily
* Modify perf and unit-tests
* Apply auto-formatting by clang
* Fix CI build failure
See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.
* Remove sm_count property from `ggml_backend_cuda_context`
Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect
* Add CUB-based implementation for GGML_OP_MEAN
Currently this branch is only executed for nrows==1
* Add heuristics to execute CUB branch only when it brings perf
Heuristics were determined on the following HW:
* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell
* Add unit-test for CUB-based mean
Tests should run with CUDA Graphs enabled per default on NVGPUs
* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
Suggested by @JohannesGaessler
* Unindent Preprocessor directives
See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506