git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

Hexagon: add support for f16/f32 flash attention, scale, set-rows, and improve f16/f32 matmul (#18611)

* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
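
As a rough scalar illustration of what a set-rows fp32 -> fp16 operation computes (not the HVX code from this commit), the sketch below scatters fp32 source rows into fp16 destination rows using either i32 or i64 row indices. The names `f32_to_f16_bits` and `set_rows_f32_f16` are hypothetical, and the conversion flushes fp16 subnormals to zero for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert fp32 to IEEE fp16 bits with round-to-nearest-even.
 * Simplification for this sketch: results below the smallest
 * normal fp16 value are flushed to (signed) zero. */
static uint16_t f32_to_f16_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag  = x & 0x7fffffffu;

    if (mag > 0x7f800000u)  return (uint16_t)(sign | 0x7e00u); /* NaN */
    if (mag >= 0x47800000u) return (uint16_t)(sign | 0x7c00u); /* Inf or overflow */
    if (mag < 0x38800000u)  return sign;                       /* flush to zero */

    uint32_t h = mag - 0x38000000u;   /* rebias exponent from 127 to 15 */
    h += 0x0fffu + ((h >> 13) & 1u);  /* round to nearest, ties to even */
    return (uint16_t)(sign | (h >> 13));
}

/* dst[idx[i]] = convert(src[i]) for each of n_rows source rows.
 * idx points at int32_t values, or int64_t values when idx_is_i64 != 0. */
static void set_rows_f32_f16(uint16_t *dst, size_t dst_rows,
                             const float *src, size_t n_rows, size_t n_cols,
                             const void *idx, int idx_is_i64) {
    for (size_t i = 0; i < n_rows; i++) {
        int64_t r = idx_is_i64 ? ((const int64_t *) idx)[i]
                               : (int64_t) ((const int32_t *) idx)[i];
        assert(r >= 0 && (size_t) r < dst_rows);
        for (size_t c = 0; c < n_cols; c++) {
            dst[(size_t) r * n_cols + c] = f32_to_f16_bits(src[i * n_cols + c]);
        }
    }
}
```

Passing the index width as a flag keeps one loop for both index types; a real kernel would more likely specialize per type and vectorize the conversion.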

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
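
The double-buffering pattern described in the bullets above can be sketched in portable C. This is only an illustration under stated assumptions: `dma_rows` is a hypothetical synchronous stand-in for an asynchronous DMA descriptor, the static `vtcm` array stands in for real VTCM, and `MAX_ROW_STRIDE` is an arbitrary cap chosen for the demo. Only the 128-byte padding and the block size of 128 come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HVX_ALIGN             128 /* bytes; row padding from the commit message */
#define FLASH_ATTN_BLOCK_SIZE 128 /* rows per chunk; value from the commit message */
#define MAX_ROW_STRIDE        512 /* hypothetical cap so the demo can use static buffers */

/* Round n up to a multiple of a power-of-two alignment. */
static size_t pad_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Stand-in for an async 2D DMA: dense source rows -> padded destination rows. */
static void dma_rows(uint8_t *dst, size_t dst_stride,
                     const uint8_t *src, size_t row_bytes, size_t n_rows) {
    for (size_t r = 0; r < n_rows; r++) {
        memcpy(dst + r * dst_stride, src + r * row_bytes, row_bytes);
    }
}

static size_t rows_in_block(size_t b, size_t n_rows) {
    size_t left = n_rows - b * FLASH_ATTN_BLOCK_SIZE;
    return left < FLASH_ATTN_BLOCK_SIZE ? left : FLASH_ATTN_BLOCK_SIZE;
}

/* Stream rows through two ping-pong buffers; returns the sum of all payload bytes. */
static uint64_t stream_rows(const uint8_t *src, size_t row_bytes, size_t n_rows) {
    static uint8_t vtcm[2][FLASH_ATTN_BLOCK_SIZE * MAX_ROW_STRIDE];
    const size_t stride   = pad_up(row_bytes, HVX_ALIGN);
    const size_t n_blocks = (n_rows + FLASH_ATTN_BLOCK_SIZE - 1) / FLASH_ATTN_BLOCK_SIZE;
    uint64_t sum = 0;

    assert(stride <= MAX_ROW_STRIDE);
    if (n_blocks == 0) return 0;

    /* Prime the pipeline with block 0. */
    dma_rows(vtcm[0], stride, src, row_bytes, rows_in_block(0, n_rows));

    for (size_t b = 0; b < n_blocks; b++) {
        const size_t cur = b & 1;
        /* Queue block b+1 into the other buffer before computing on block b. */
        if (b + 1 < n_blocks) {
            const uint8_t *next = src + (b + 1) * FLASH_ATTN_BLOCK_SIZE * row_bytes;
            dma_rows(vtcm[cur ^ 1], stride, next, row_bytes, rows_in_block(b + 1, n_rows));
        }
        /* With a real DMA engine, wait here until the transfer for block b is done. */
        for (size_t r = 0; r < rows_in_block(b, n_rows); r++) {
            for (size_t c = 0; c < row_bytes; c++) sum += vtcm[cur][r * stride + c];
        }
    }
    return sum;
}
```

Because each row starts at a multiple of 128 bytes in the staging buffer, a vector kernel could use aligned loads on every row regardless of the original row length.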

* hexagon: use aligned mad_f16

* hexagon: flash_attn: use more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
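
To make the scratchpad sizing concrete, here is a minimal sketch of how a per-thread budget might be computed once Q is also staged in VTCM. The layout (one padded Q row plus double-buffered K, V, and mask blocks) and the names `fa_spad_layout` and `fa_spad_size` are assumptions for illustration, not the structure used in `op_flash_attn_ext`.

```c
#include <assert.h>
#include <stddef.h>

#define VTCM_ALIGN            128 /* bytes; padding granularity from the commit message */
#define FLASH_ATTN_BLOCK_SIZE 128 /* rows per K/V chunk; value from the commit message */

typedef struct {
    size_t q_bytes;     /* one padded Q row */
    size_t k_bytes;     /* two buffers of padded K rows */
    size_t v_bytes;     /* two buffers of padded V rows */
    size_t mask_bytes;  /* two buffers of one mask row segment */
    size_t total_bytes;
} fa_spad_layout;

/* Round n up to a multiple of a power-of-two alignment. */
static size_t pad_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical per-thread scratchpad sizing for fp16 Q/K/V/mask data. */
static fa_spad_layout fa_spad_size(size_t head_dim_qk, size_t head_dim_v, size_t elem_bytes) {
    fa_spad_layout l;
    const size_t qk_row = pad_up(head_dim_qk * elem_bytes, VTCM_ALIGN);
    const size_t v_row  = pad_up(head_dim_v  * elem_bytes, VTCM_ALIGN);
    const size_t m_row  = pad_up(FLASH_ATTN_BLOCK_SIZE * elem_bytes, VTCM_ALIGN);

    l.q_bytes     = qk_row;
    l.k_bytes     = 2 * FLASH_ATTN_BLOCK_SIZE * qk_row;
    l.v_bytes     = 2 * FLASH_ATTN_BLOCK_SIZE * v_row;
    l.mask_bytes  = 2 * m_row;
    l.total_bytes = l.q_bytes + l.k_bytes + l.v_bytes + l.mask_bytes;
    return l;
}

/* Returns 1 when the layout fits in the VTCM bytes available to one thread. */
static int fa_spad_fits(const fa_spad_layout *l, size_t vtcm_bytes_per_thread) {
    return l->total_bytes <= vtcm_bytes_per_thread;
}
```

For example, with a head dimension of 80 and 2-byte elements, each row is 160 bytes and pads to 256, so the Q slot alone adds 256 bytes per thread on top of the K, V, and mask buffers.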

* hexagon: fix handling of NaNs in HVX dot products

* hexagon: clean up spad allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
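
The path selection described in the last two bullets can be pictured as a small predicate. The sketch below is hypothetical: the real criteria in `op_matmul` are not spelled out in the commit message, so the broadcasting rule (batch dimensions must match exactly) and the VTCM estimate (all fp16 rows of the second operand plus two staging rows) are assumptions, as are the names `mm_shape`, `mm_path`, and `choose_matmul_path`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MM_PATH_OPTIMIZED, MM_PATH_FALLBACK } mm_path;

/* ggml-style shape: ne[0] = row length, ne[1] = rows, ne[2] and ne[3] = batch dims. */
typedef struct { uint32_t ne[4]; } mm_shape;

/* Round n up to a multiple of a power-of-two alignment. */
static size_t pad_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

static mm_path choose_matmul_path(const mm_shape *src0, const mm_shape *src1,
                                  size_t vtcm_bytes) {
    assert(src0->ne[0] == src1->ne[0]); /* shared inner dimension */

    /* Assumption: "compatible" broadcasting means no batch broadcasting at all. */
    const int same_batch = src0->ne[2] == src1->ne[2] && src0->ne[3] == src1->ne[3];

    /* Assumption: the fast path keeps every fp16 row of src1 in VTCM,
     * plus two padded rows of src0 for double buffering. */
    const size_t row  = pad_up((size_t) src0->ne[0] * 2u, 128);
    const size_t need = row * src1->ne[1] + 2 * row;

    return (same_batch && need <= vtcm_bytes) ? MM_PATH_OPTIMIZED : MM_PATH_FALLBACK;
}
```

Keeping the decision in one function makes the fallback explicit: any case the predicate rejects simply goes through the original implementation.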

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use the aligned `vec_dot_f16` for 2D matmuls and the unaligned version for 4D.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading
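
One common way to handle a small row count is to cap the number of active threads at the number of rows and give each active thread a contiguous range. The sketch below shows that idea only; the helper names `active_threads` and `thread_row_range` are hypothetical, and the actual partitioning in `set-rows-ops.c` may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Never use more threads than there are rows to process. */
static uint32_t active_threads(uint32_t n_rows, uint32_t n_threads) {
    assert(n_threads > 0);
    return n_rows < n_threads ? n_rows : n_threads;
}

/* Compute the half-open row range [*first, *last) for thread ith.
 * Threads beyond the active count get an empty range. */
static void thread_row_range(uint32_t n_rows, uint32_t n_threads, uint32_t ith,
                             uint32_t *first, uint32_t *last) {
    const uint32_t n_active = active_threads(n_rows, n_threads);
    if (ith >= n_active) {
        *first = *last = n_rows;
        return;
    }
    /* Ceiling division so every row is covered. */
    const uint32_t per_thread = (n_rows + n_active - 1) / n_active;
    uint32_t lo = ith * per_thread;
    if (lo > n_rows) lo = n_rows;
    uint32_t hi = lo + per_thread;
    if (hi > n_rows) hi = n_rows;
    *first = lo;
    *last  = hi;
}
```

With 3 rows and 8 threads, threads 0 through 2 each take one row and the remaining threads return immediately with an empty range.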

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn: use aligned vectors for dot products

author	Max Krasnyansky <redacted>	Wed, 7 Jan 2026 01:38:29 +0000 (17:38 -0800)
committer	GitHub <redacted>	Wed, 7 Jan 2026 01:38:29 +0000 (17:38 -0800)
commit	95ea9e0861b28adca740dbc09494f72105c9b92b
tree	c2cb44d484b67c2c6199b543a46b01bcafef1681
parent	ccbc84a5374bab7a01f68b129411772ddd8e7c79

ggml/src/ggml-hexagon/ggml-hexagon.cpp
ggml/src/ggml-hexagon/htp/CMakeLists.txt
ggml/src/ggml-hexagon/htp/flash-attn-ops.c	[new file with mode: 0644]
ggml/src/ggml-hexagon/htp/get-rows-ops.c	[new file with mode: 0644]
ggml/src/ggml-hexagon/htp/htp-ctx.h
ggml/src/ggml-hexagon/htp/htp-msg.h
ggml/src/ggml-hexagon/htp/htp-ops.h
ggml/src/ggml-hexagon/htp/hvx-utils.c
ggml/src/ggml-hexagon/htp/hvx-utils.h
ggml/src/ggml-hexagon/htp/main.c
ggml/src/ggml-hexagon/htp/matmul-ops.c
ggml/src/ggml-hexagon/htp/set-rows-ops.c	[new file with mode: 0644]
ggml/src/ggml-hexagon/htp/softmax-ops.c
ggml/src/ggml-hexagon/htp/unary-ops.c
scripts/snapdragon/adb/run-bench.sh