CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738)
author    Gaurav Garg <redacted>
          Thu, 3 Apr 2025 16:20:29 +0000 (21:50 +0530)
committer Georgi Gerganov <redacted>
          Thu, 24 Apr 2025 17:39:16 +0000 (20:39 +0300)
commit    2f0612cb1c168dbecd2d94b9665b11d2f023ffe9
tree      ec74e61b1f35b9f61d97ae53fcbb7aec5a430dd5
parent    e944065d5bb025f0334f2293de9b50b2d42da616

* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being selected for models with head dimension 256; Gemma models fall into this category. Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
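The change lands in ggml/src/ggml-cuda/fattn.cu. As a minimal sketch of the idea (the function name, signature, and the exact threshold below are illustrative, not the verbatim diff), the dispatch previously capped the head dimension on the vector path, which excluded head dim 256:

```cpp
// Hypothetical sketch of the kernel-selection logic in ggml-cuda/fattn.cu.
// Names and conditions are illustrative, not the actual diff.
#include <cstdint>
#include <cstdio>

// Chooses whether the vector flash-decoding kernel (tuned for batch size 1)
// handles this case. Before this change, a head-size cap excluded
// head_dim == 256 (used by Gemma) from the vector path.
static bool use_vector_flash_decoding(int64_t batch_size, int64_t head_dim, int warp_size) {
    const bool head_dim_supported = head_dim % (2 * warp_size) == 0;
    // Old (illustrative): an additional head_dim <= 128 cap skipped Gemma's 256:
    //   return batch_size == 1 && head_dim_supported && head_dim <= 128;
    // New: the cap is dropped, so head_dim == 256 takes the vector path too.
    return batch_size == 1 && head_dim_supported;
}

int main() {
    // Gemma-style decode step: batch size 1, head dimension 256.
    printf("head_dim 256 -> vector kernel: %s\n",
           use_vector_flash_decoding(/*batch_size=*/1, /*head_dim=*/256, /*warp_size=*/32)
               ? "yes" : "no");
    return 0;
}
```

With the cap removed, single-token (generation-phase) decoding for 256-dim heads reaches the vector kernel instead of falling back to a slower path, which is where the reported up-to-12% throughput gain comes from.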

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/fattn.cu