vulkan: scalar flash attention implementation (llama/13324)
author    Jeff Bolz <redacted>
          Sat, 10 May 2025 06:07:07 +0000 (23:07 -0700)
committer Georgi Gerganov <redacted>
          Tue, 13 May 2025 10:59:21 +0000 (13:59 +0300)
commit    a04b329ad172fa3c24d10cb106aa9b7fffb7e511
tree      6146e5427d0f04a8608d77a2f374bf3f9a7af1c2
parent    45d8b2352e7decfd814d36fb94d7947181ff9ca5
vulkan: scalar flash attention implementation (llama/13324)

* vulkan: scalar flash attention implementation (see the C++ sketch after this list)

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value only once; optimize the O reduction; more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
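
For context, the scalar path computes the same online-softmax recurrence as coopmat-based flash attention shaders, just with plain fp32 arithmetic instead of matrix-core instructions. Below is a minimal C++ sketch of that recurrence for a single query row, illustrating why folding P*V directly into the output accumulator (the "remove PV matrix" step above) saves registers. All names (scalar_flash_attn_row, head_dim, kv_len, etc.) are illustrative; this is not the shader's actual code.

    // Minimal fp32 online-softmax flash attention for one query row.
    // K and V are stored row-major as [kv_len][head_dim].
    #include <cmath>
    #include <cstdio>
    #include <limits>
    #include <vector>

    std::vector<float> scalar_flash_attn_row(const std::vector<float> &q,
                                             const std::vector<float> &K,
                                             const std::vector<float> &V,
                                             int kv_len, int head_dim,
                                             float scale) {
        float m = -std::numeric_limits<float>::infinity(); // running max
        float l = 0.0f;                                    // running sum of exp
        std::vector<float> o(head_dim, 0.0f);              // unnormalized output

        for (int j = 0; j < kv_len; ++j) {
            // s = scale * dot(q, K[j])
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) {
                s += q[d] * K[j*head_dim + d];
            }
            s *= scale;

            // Online softmax: rescale earlier terms when the running max grows.
            const float m_new = std::max(m, s);
            const float corr  = std::exp(m - m_new); // correction for old terms
            const float p     = std::exp(s - m_new); // weight of the new key

            l = l*corr + p;
            for (int d = 0; d < head_dim; ++d) {
                // Accumulate p*V[j] straight into O instead of materializing a
                // separate PV matrix, mirroring the register-usage optimization.
                o[d] = o[d]*corr + p * V[j*head_dim + d];
            }
            m = m_new;
        }

        for (int d = 0; d < head_dim; ++d) {
            o[d] /= l; // final normalization by the softmax denominator
        }
        return o;
    }

    int main() {
        const int head_dim = 4, kv_len = 3;
        std::vector<float> q = {1, 0, 0, 1};
        std::vector<float> K(kv_len*head_dim, 0.5f);
        std::vector<float> V(kv_len*head_dim, 1.0f);
        auto o = scalar_flash_attn_row(q, K, V, kv_len, head_dim,
                                       1.0f/std::sqrt((float)head_dim));
        for (float x : o) std::printf("%f\n", x);
        return 0;
    }

In the real shader this loop is tiled across a workgroup with vector loads, and a q4_0/q8_0 KV cache would be dequantized to fp32 on load; the single-row scalar form above only shows the arithmetic the commits refer to.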
ggml/src/ggml-vulkan/ggml-vulkan.cpp
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp [new file with mode: 0644]
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp