git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit

author	Jeff Bolz <redacted>
	Wed, 2 Apr 2025 17:40:32 +0000 (12:40 -0500)
committer	Georgi Gerganov <redacted>
	Thu, 24 Apr 2025 17:39:16 +0000 (20:39 +0300)
commit	2105b110d3fd3180f2b4788f6cf1574d0ae41e46
tree	9ddd0389cc1f5d8dcd30e64a826931f272304d81
parent	f82622180f1dec362b30b87702d28ba6c7a62461

vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559)

When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each; now we
run 8 workgroups computing 4 results each.
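
The arithmetic behind that example can be sketched as follows. This is only an
illustration of the grouping ratio, not the code from ggml-vulkan.cpp: the
struct, the function name, and the parameter names are hypothetical, and it
assumes the number of Q heads is an exact multiple of the number of K/V heads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only; not part of ggml-vulkan.
struct FaDispatch {
    uint32_t workgroups;            // workgroups launched along the head dimension
    uint32_t results_per_workgroup; // Q heads computed by each workgroup
};

// n_q_heads:  number of Q heads    (32 in the example, q(128,1,32,1))
// n_kv_heads: number of K/V heads  (8 in the example, k(128,16640,8,1))
inline FaDispatch plan_fa_dispatch(uint32_t n_q_heads, uint32_t n_kv_heads) {
    // Assumption: Q heads divide evenly among the K/V heads.
    assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0);

    // Number of adjacent Q heads that read the same K/V head.
    const uint32_t gqa_ratio = n_q_heads / n_kv_heads;

    // One workgroup per K/V head, each producing gqa_ratio results.
    return { n_kv_heads, gqa_ratio };
}
```

With the shapes above, plan_fa_dispatch(32, 8) gives 8 workgroups with 4
results each. When Q and K/V have the same number of heads the ratio is 1 and
the dispatch is the same as before the change.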

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.

ggml/src/ggml-vulkan/ggml-vulkan.cpp
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp