git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Fri, 16 Jun 2023 17:08:44 +0000 (20:08 +0300)
committer	GitHub <redacted>
	Fri, 16 Jun 2023 17:08:44 +0000 (20:08 +0300)
commit	3d0112261042b356621e93db3fa4c6798a5d098f
tree	3634baa70ed23142f86c5a44701bbf4b0971c2fd
parent	602c748863e15270d80d74aa2c3bf86ab8139e07

CUDA : faster k-quant dot kernels (#1862)

* cuda : faster k-quant dot kernels

* Improve Q2_K dot kernel on older GPUs

We now have a K_QUANTS_PER_ITERATION macro, which should be
set to 1 on older and to 2 on newer GPUs.
With this, we preserve the performance of the original
PR on RTX-4080, and are faster compared to master on
GTX-1660.

* Improve Q6_K dot kernel on older GPUs

Using the same K_QUANTS_PER_ITERATION macro as last commit,
we preserve performance on RTX-4080 and speed up
Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

Allowed values are 1 or 2. 2 gives the best performance on
modern GPUs and is set as default. On older GPUs 1 may work
better.

* PR comments

---------

Co-authored-by: Iwan Kawrakow <redacted>
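
The compile-time switch described in the commit message can be illustrated with a small host-side sketch. This is not the code from ggml-cuda.cu: it assumes only that the build option LLAMA_CUDA_KQUANTS_ITER reaches the compiler as a K_QUANTS_PER_ITERATION definition (for example -DK_QUANTS_PER_ITERATION=1), that the default is 2, and that only 1 and 2 are accepted, as the message states. The functions lane_partial_dot and dot are hypothetical names that use a plain CPU loop to mimic how each GPU thread might process one or two values per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed default, following the commit text: 2 unless the build overrides it.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#endif

// Reject unsupported values at compile time (the message allows only 1 or 2).
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");

// Hypothetical CPU stand-in for one thread ("lane") of a dot kernel:
// each lane handles K_QUANTS_PER_ITERATION consecutive elements per iteration,
// then jumps ahead by the amount of work all lanes cover together.
float lane_partial_dot(const std::vector<float>& x, const std::vector<float>& y,
                       std::size_t lane, std::size_t n_lanes) {
    assert(x.size() == y.size());
    constexpr std::size_t per_iter = K_QUANTS_PER_ITERATION;
    const std::size_t step = n_lanes * per_iter;
    float sum = 0.0f;
    for (std::size_t base = lane * per_iter; base < x.size(); base += step) {
        for (std::size_t k = 0; k < per_iter && base + k < x.size(); ++k) {
            sum += x[base + k] * y[base + k];
        }
    }
    return sum;
}

// Adds up the partial sums of all lanes; a real kernel would use a warp
// reduction here instead of a serial loop.
float dot(const std::vector<float>& x, const std::vector<float>& y,
          std::size_t n_lanes) {
    float total = 0.0f;
    for (std::size_t lane = 0; lane < n_lanes; ++lane) {
        total += lane_partial_dot(x, y, lane, n_lanes);
    }
    return total;
}
```

The result is the same for either setting; only the amount of work per thread per iteration changes, which is the trade-off the commit tunes between older and newer GPUs.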

CMakeLists.txt
Makefile
ggml-cuda.cu