CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738)
author    Gaurav Garg <redacted>
          Thu, 3 Apr 2025 16:20:29 +0000 (21:50 +0530)
committer Georgi Gerganov <redacted>
          Tue, 8 Apr 2025 08:47:46 +0000 (11:47 +0300)
commit    770370c6603d29a464ac33d8031fb4506059e5ac
tree      525ce23131e5da9660ae733630a9426d623e1e27
parent    406ee7592472aa3598f38829de7234f27cbb83b4
CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738)

* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being selected for models with head dimension 256; Gemma models fall into this category.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
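Below is a minimal, hypothetical sketch of the kind of dispatch change this describes. The helper name prefer_vector_decode and its signature are illustrative assumptions and do not match the actual code in ggml/src/ggml-cuda/fattn.cu:

    // Hypothetical sketch of the kernel-selection change described above.
    // Names and signature are illustrative, not the real fattn.cu code.
    #include <cassert>
    #include <cstdint>

    // The decode (generation) phase processes one query token at a time.
    static bool prefer_vector_decode(int64_t head_dim, int64_t n_q_tokens) {
        const bool is_decode = (n_q_tokens == 1);
        // Old behavior (sketch): an upper bound on head_dim excluded 256,
        // routing Gemma-style heads to a slower kernel, e.g.:
        //     return is_decode && head_dim <= 128;
        // New behavior: the cap is dropped, so head_dim == 256 qualifies.
        (void) head_dim;
        return is_decode;
    }

    int main() {
        assert(prefer_vector_decode(256, 1));   // gen phase, Gemma head size
        assert(!prefer_vector_decode(256, 32)); // prompt phase: many queries
        return 0;
    }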

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
src/ggml-cuda/fattn.cu