git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Jeff Bolz <redacted>
	Sat, 22 Mar 2025 08:40:11 +0000 (03:40 -0500)
committer	GitHub <redacted>
	Sat, 22 Mar 2025 08:40:11 +0000 (09:40 +0100)
commit	eddfb438502bd5d1014d63a812e9b6d03d326f8c
tree	0900c1e2908b45dbf8eaaf9a6bfe4e12c406011b
parent	4375415b4abf94fb36a5fd15f233ac0ee23c0bd1

vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)

* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with a large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with a large 'm' dimension and a small 'k' dimension), take
advantage of grouped query attention to reuse loads from the A matrix for the
whole group, and reduce the number of workgroups, since tiny dispatches incur
too much overhead.
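
The load reuse described for the p021 shader can be illustrated with a small
CPU sketch. This is not the shader code from this commit: the function name,
the memory layouts, and the mapping of query heads to KV heads (query head q
uses KV head q / gqa_ratio) are assumptions chosen for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a grouped-query mat-vec product.
// Assumed layouts (not taken from the commit):
//   a:   [kv_heads][m][k]            row-major, one matrix per KV head
//   b:   [kv_heads * gqa_ratio][k]   one vector per query head
//   dst: [kv_heads * gqa_ratio][m]   one result vector per query head
std::vector<float> mul_mat_vec_gqa(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t kv_heads,
                                   std::size_t gqa_ratio,
                                   std::size_t m,
                                   std::size_t k) {
    std::vector<float> dst(kv_heads * gqa_ratio * m, 0.0f);
    for (std::size_t kv = 0; kv < kv_heads; ++kv) {
        for (std::size_t row = 0; row < m; ++row) {
            const float* a_row = a.data() + (kv * m + row) * k;
            for (std::size_t col = 0; col < k; ++col) {
                // Each element of A is loaded once ...
                const float a_val = a_row[col];
                // ... and reused for every query head in the group.
                for (std::size_t g = 0; g < gqa_ratio; ++g) {
                    const std::size_t q = kv * gqa_ratio + g;
                    dst[q * m + row] += a_val * b[q * k + col];
                }
            }
        }
    }
    return dst;
}
```

Without the inner loop over the group, each query head would walk the same
rows of A again, so the number of A loads would grow with gqa_ratio.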

Using subgroupAdd in the p021 shader also helps, so use it conditionally.

ggml/src/ggml-vulkan/ggml-vulkan.cpp
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_nc.comp
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_p021.comp
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
tests/test-backend-ops.cpp