author	Kawrakow <redacted>
	Mon, 19 Jun 2023 15:14:09 +0000 (18:14 +0300)
committer	GitHub <redacted>
	Mon, 19 Jun 2023 15:14:09 +0000 (18:14 +0300)
commit	ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8
tree	8a0fab78e1cb85d11e4c2c61f4be3e124a72ae5f
parent	b97ca431db35ec96a339a721acb1219c1dd78bed

cuda : faster k-quants on older GPUs (#1930)

* k_quants: hopefully much faster Q4_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction time drops from 65.5 ms/tok
to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction time drops from 60.3 ms/tok
to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I reported in the faster
k-quants PR.

* k_quants: faster Q5_K on older GPUs

68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.

It is also more difficult to measure because for Q5_K_S
we can only offload 32 layers to the GTX-1660, so the output layer,
token embeddings, and KV cache are handled on the CPU.
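
For comparison, the ms/tok figures quoted above can be converted to
tokens per second and a speedup factor. The following is only a small
sketch of that arithmetic: the helper names are hypothetical and not
part of llama.cpp, and the Q2_K "before" value is the number from the
earlier k-quants PR rather than a measurement taken right before this
commit.

```python
# GTX-1660 timings in ms per token, copied from this commit message.
# Each entry is (before, after). For Q2_K the "before" value is the one
# reported in the earlier k-quants PR, so it is not a strict A/B pair.
timings_ms_per_tok = {
    "Q2_K": (47.2, 36.3),
    "Q3_K": (60.3, 41.0),
    "Q4_K": (65.5, 41.5),
    "Q5_K": (68.5, 62.0),
}


def tokens_per_second(ms_per_tok: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_tok


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old to new latency; values above 1 mean faster."""
    return before_ms / after_ms


for name, (before, after) in timings_ms_per_tok.items():
    print(
        f"{name}: {tokens_per_second(before):.1f} -> "
        f"{tokens_per_second(after):.1f} tok/s "
        f"({speedup(before, after):.2f}x)"
    )
```

Viewed this way, the Q5_K gain is clearly the smallest of the four,
which matches the note above that the access pattern did not help
there.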

---------

Co-authored-by: Iwan Kawrakow <redacted>