CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
author    Gaurav Garg <redacted>
          Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)
committer Georgi Gerganov <redacted>
          Thu, 27 Mar 2025 09:06:03 +0000 (11:06 +0200)
commit    ae6a9bb9a58a0f02e4bb60393a65bde407c9e346
tree      f2917a76fcabf4ff58308e0afd2ca144ac603183
parent    24faba9e9bb819cfc74fd76dbdf0b2c24f1f12df

- Determine the number of active blocks per SM with the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use it to choose the optimal parallel_blocks value (see the first sketch below).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1 (see the second sketch below).
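
A minimal sketch of the occupancy-driven choice, assuming a flash_decode_kernel entry point, a pick_parallel_blocks helper, and the block-size and shared-memory figures shown; none of these names or numbers are taken from this commit:

    #include <cuda_runtime.h>

    // Hedged sketch, not the commit's actual code: flash_decode_kernel,
    // pick_parallel_blocks, and the constants below are illustrative.
    __global__ void flash_decode_kernel(float * dst) {
        // real kernel body elided
        if (dst) dst[threadIdx.x] = 0.0f;
    }

    // Split the KV dimension into enough blocks that, at BS=1, one full
    // wave of resident blocks covers every SM.
    static int pick_parallel_blocks(int nblocks_base) {
        const int    block_size = 128;        // threads per block (assumed)
        const size_t smem_bytes = 16 * 1024;  // dynamic shared memory per block (assumed)

        int nsm = 0;
        cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, /*device=*/0);

        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, flash_decode_kernel, block_size, smem_bytes);

        // Aim for blocks_per_sm resident blocks on each of the nsm SMs.
        const int target          = blocks_per_sm * nsm;
        const int parallel_blocks = target / (nblocks_base > 0 ? nblocks_base : 1);
        return parallel_blocks > 0 ? parallel_blocks : 1;
    }

At BS=1 the base grid is roughly one block per attention head, so nblocks_base would be on the order of the head count; each head's KV range is then split across parallel_blocks blocks whose partial results are reduced in a follow-up pass.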

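And a sketch of the second change, the BS=1 kernel preference; the function and the batch-size test are illustrative, not the commit's exact dispatch logic:

    // Hedged sketch: with a single sequence the MMA kernel has too few
    // tiles in flight to fill the GPU, so the vector kernel is preferred.
    enum fattn_kernel { FATTN_VEC, FATTN_MMA };

    static fattn_kernel choose_fattn_kernel(int batch_size, bool mma_supported) {
        if (batch_size == 1 || !mma_supported) {
            return FATTN_VEC;   // better occupancy for single-token decode
        }
        return FATTN_MMA;       // tensor-core path wins at larger batches
    }
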
Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/fattn-common.cuh
ggml/src/ggml-cuda/fattn-mma-f16.cuh
ggml/src/ggml-cuda/fattn-tile-f16.cu
ggml/src/ggml-cuda/fattn-tile-f32.cu
ggml/src/ggml-cuda/fattn-vec-f16.cuh
ggml/src/ggml-cuda/fattn-vec-f32.cuh
ggml/src/ggml-cuda/fattn-wmma-f16.cu
ggml/src/ggml-cuda/fattn.cu
ggml/src/ggml-cuda/ggml-cuda.cu
ggml/src/ggml-cuda/vendors/hip.h
ggml/src/ggml-cuda/vendors/musa.h