CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
author Gaurav Garg <redacted>
Thu, 3 Apr 2025 16:20:29 +0000 (21:50 +0530)
committer GitHub <redacted>
Thu, 3 Apr 2025 16:20:29 +0000 (18:20 +0200)
commit c262beddf29f3f3be5bbbf167b56029a19876956
tree e5f522b1e2816277a62434aed4bc74cbd158b5ca
parent 5dd5d1ab00d074e3b7c02ca3ae12f6bf3e86336a
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)

* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being selected for models with head dimension 256, a category that includes the Gemma models.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
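
For illustration, a minimal sketch of the kind of dispatch heuristic this change touches (the function and helper names below are hypothetical stand-ins, not the actual fattn.cu code): during the generation phase the batch contains a single query token, and the head-dimension cap that previously kept 256-dim heads off the vector decode path is dropped for that small-batch case.

    #include <cstdio>

    // Hypothetical stand-ins for the real CUDA kernel launch helpers.
    static void launch_vec_kernel(int head_dim)  { std::printf("vec kernel, head_dim=%d\n", head_dim); }
    static void launch_tile_kernel(int head_dim) { std::printf("tile kernel, head_dim=%d\n", head_dim); }

    // Sketch of a flash-attention dispatch heuristic.
    // n_tokens: query tokens in the batch (1 during the generation phase).
    // head_dim: per-head dimension (256 for Gemma-family models).
    static void flash_attn_dispatch(int n_tokens, int head_dim) {
        const bool small_batch = n_tokens <= 8;

        // Before: the vector path was additionally gated on head_dim < 256,
        // so Gemma-style decode fell through to a slower kernel.
        // After: the head-dimension restriction is removed for small batches.
        if (small_batch) {
            launch_vec_kernel(head_dim);   // vector flash decoding kernel
        } else {
            launch_tile_kernel(head_dim);  // tile kernel for larger batches
        }
    }

    int main() {
        flash_attn_dispatch(/*n_tokens=*/1, /*head_dim=*/256);  // one decode step, Gemma-sized head
    }

This is only a sketch under the assumptions stated above; the actual selection logic lives in ggml/src/ggml-cuda/fattn.cu and also accounts for hardware capabilities and precision.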

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/fattn.cu