git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
[CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)
author Gaurav Garg <redacted>
Mon, 6 Apr 2026 18:34:29 +0000 (00:04 +0530)
committer GitHub <redacted>
Mon, 6 Apr 2026 18:34:29 +0000 (20:34 +0200)
commit 15f786e6581598638840276948a7e6183fc96a83
tree 1ccfe1479613fd24d45959c76dd9263eaa5314e2
parent 94ca829b6001019622c0f67fcd48e9ec6bd7dce8
[CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized, more optimized kernel for the case where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.
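For context, the stream-k fixup stage merges partial attention outputs that different blocks computed over slices of the KV sequence. Below is a minimal numerical sketch of that merge in plain host-side C++ (not the actual CUDA kernel; the struct and names are illustrative), using the standard log-sum-exp rescaling that such a fixup pass performs:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: one partial flash-attention result for an output tile.
// acc is the unnormalized weighted-V accumulator, m the running maximum of
// the attention logits, s the softmax denominator sum(exp(logit - m)).
struct Partial {
    std::vector<float> acc;
    float m;
    float s;
};

// Merge two partials for the same tile by rescaling both to a common max.
// The final normalized output is merged.acc[i] / merged.s.
static Partial merge_partials(const Partial & a, const Partial & b) {
    const float m  = std::fmax(a.m, b.m);
    const float ca = std::exp(a.m - m); // rescale factor for a
    const float cb = std::exp(b.m - m); // rescale factor for b
    Partial out;
    out.m = m;
    out.s = a.s*ca + b.s*cb;
    out.acc.resize(a.acc.size());
    for (size_t i = 0; i < a.acc.size(); ++i) {
        out.acc[i] = a.acc[i]*ca + b.acc[i]*cb;
    }
    return out;
}
```

Merging partials this way yields bitwise-equivalent math to computing the softmax over the full KV range in one pass, which is what lets stream-k split the work freely across blocks.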

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure there is enough concurrency on the GPU
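The block-count adjustment described above can be sketched as a small host-side helper (hypothetical C++, not the actual code; the commit message mentions both a 2x rounding threshold and a 4x gate for the specialized kernel, and the sketch uses the 4x gate):

```cpp
#include <cstdint>

// Hypothetical sketch: choose the stream-k block count so the optimized
// fixup kernel can assume nblocks_stream_k is a multiple of ntiles_dst.
// nblocks_stream_k_raw: block count implied by the hardware (e.g. number of SMs).
// ntiles_dst:           number of output tiles to cover.
static int64_t choose_nblocks_stream_k(int64_t nblocks_stream_k_raw, int64_t ntiles_dst) {
    // Only take the specialized path when there are enough blocks to keep
    // the GPU busy (threshold 4 * ntiles_dst, per the commit message).
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        // Round down to a multiple of ntiles_dst so every tile is split
        // into the same number of stream-k partial blocks.
        return nblocks_stream_k_raw - nblocks_stream_k_raw % ntiles_dst;
    }
    return nblocks_stream_k_raw; // generic fixup path
}
```

With a uniform split, each tile's fixup touches the same number of partial results, which is what makes the specialized kernel simpler and faster than the generic one.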

* Address review comments

* Address review comments

* Revert variable names to original
ggml/src/ggml-cuda/fattn-common.cuh