This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.
Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.
* remove try/catch
This was a left over from a previous iteration where I was explicitly
quering for a specific validation layer first, which would throw.
Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in FA, it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.
Chenguang Li [Tue, 2 Sep 2025 06:07:48 +0000 (14:07 +0800)]
CANN: Support eager execution mode under ACL graph compilation (llama/15712)
* [CANN] Support eager execution mode under ACL graph compilation
Add support for running operators in eager mode while ACL graph
compilation is enabled. This allows bypassing graph execution
and directly submitting ops, which is useful for debugging and
reducing graph build overhead in certain scenarios.
Signed-off-by: noemotiovon <redacted>
* fix typo
Signed-off-by: noemotiovon <redacted>
* rename to acl_graph_mode
CUDA: fix build error from ambiguous __half conversions in conv2d (llama/15690)
* CUDA: fix build error from ambiguous __half conversions in conv2d
Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.
Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.
Fixes some build errors with half-precision kernels on CUDA.
ggml-ci
* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half‑>float conversion
* CUDA: Add missing convert.cuh header
* CUDA: remove unnecessary extension in ggml_cuda_cast
* CUDA: Address review comment, remove second type template argument
CANN: fix RoPE cache issue on multi-device (llama/15629)
* CANN: fix RoPE cache issue on multi-device
RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.
This commit records the first layer of each device to avoid
the above issues.
* CANN: Optimize first-layer detection method
* CANN: Remove trailing whitespace
* CANN: Only cache the data that can be determined as unchanged through the parameters.
Diego Devesa [Sun, 31 Aug 2025 13:49:03 +0000 (06:49 -0700)]
llama : separate compute buffer reserve from fattn check (llama/15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
Chenguang Li [Wed, 27 Aug 2025 09:21:41 +0000 (17:21 +0800)]
CANN: refactor mask handling and improve performance in FA (llama/15561)
* CANN(flash-attn): refactor mask handling and improve performance
1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.
Akarshan Biswas [Tue, 26 Aug 2025 18:57:49 +0000 (00:27 +0530)]
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (llama/15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused an GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.
This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
This patch improves GEMM for FP32 Data Type on PowerPC
Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
Verified functionality and correctness uing llama-cli and stand alone
test case (performs matmul and compares final mattrix C result with base).
Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces
Performance Testing:
Observed 50% ~ 70% improvement in Prompt Processing Speed mesured using
llama-bench with Meta-Llama3-8B FP32 Model. Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.
model Size Params Backend Threads Test Patch Base
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp512 98.58 60.3
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp1024 95.88 57.36
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp2048 85.46 53.26
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp4096 68.66 45.78
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp6144 57.35 40.44
25 ~ 30% improvement in llama-batched-bench with Metla-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096)tokens with various batch
sizes ( 1, 2, 4, 8, 16)
Jeff Bolz [Sun, 24 Aug 2025 09:24:25 +0000 (04:24 -0500)]
vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
Jeff Bolz [Sat, 23 Aug 2025 18:16:17 +0000 (13:16 -0500)]
vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (llama/15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
Jeff Bolz [Sat, 23 Aug 2025 07:33:36 +0000 (02:33 -0500)]
vulkan: Rewrite synchronization to allow some overlap between nodes (llama/15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
Jeff Bolz [Sat, 23 Aug 2025 06:31:54 +0000 (01:31 -0500)]
vulkan: optimize mul_mat_id loading row ids into shared memory (llama/15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.