CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
author Gaurav Garg <redacted>
Thu, 3 Apr 2025 16:20:29 +0000 (21:50 +0530)
committer GitHub <redacted>
Thu, 3 Apr 2025 16:20:29 +0000 (18:20 +0200)
commit c262beddf29f3f3be5bbbf167b56029a19876956
tree e5f522b1e2816277a62434aed4bc74cbd158b5ca
parent 5dd5d1ab00d074e3b7c02ca3ae12f6bf3e86336a
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)

* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being selected for models with head dimension 256, a category that includes the Gemma models.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
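
For illustration, a minimal sketch of the kind of dispatch heuristic this change touches (the function and helper names below are hypothetical stand-ins, not the actual fattn.cu code): during the generation phase the batch contains a single query token, and the head-dimension cap that previously kept 256-dim heads off the vector decode path is dropped for that small-batch case.

    #include <cstdio>

    // Hypothetical stand-ins for the real CUDA kernel launch helpers.
    static void launch_vec_kernel(int head_dim)  { std::printf("vec kernel, head_dim=%d\n", head_dim); }
    static void launch_tile_kernel(int head_dim) { std::printf("tile kernel, head_dim=%d\n", head_dim); }

    // Sketch of a flash-attention dispatch heuristic.
    // n_tokens: query tokens in the batch (1 during the generation phase).
    // head_dim: per-head dimension (256 for Gemma-family models).
    static void flash_attn_dispatch(int n_tokens, int head_dim) {
        const bool small_batch = n_tokens <= 8;

        // Before: the vector path was additionally gated on head_dim < 256,
        // so Gemma-style decode fell through to a slower kernel.
        // After: the head-dimension restriction is removed for small batches.
        if (small_batch) {
            launch_vec_kernel(head_dim);   // vector flash decoding kernel
        } else {
            launch_tile_kernel(head_dim);  // tile kernel for larger batches
        }
    }

    int main() {
        flash_attn_dispatch(/*n_tokens=*/1, /*head_dim=*/256);  // one decode step, Gemma-sized head
    }

This is only a sketch under the assumptions stated above; the actual selection logic lives in ggml/src/ggml-cuda/fattn.cu and also accounts for hardware capabilities and precision.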

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/fattn.cu