git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Thu, 8 Jun 2023 07:08:23 +0000 (10:08 +0300)
committer	GitHub <redacted>
	Thu, 8 Jun 2023 07:08:23 +0000 (10:08 +0300)
commit	4161bdc04debb70bf5f275492b4d89fd9330087c
tree	9b0c6325e720b101d67ec2415bc0d69e4fd89379
parent	0035858273ebe0694926bf4414d279f3e1cd109d

metal : add Q4_K implementation (#1733)

* Metal implementation for Q4_K

Very slow for now: 42 ms/token, whereas Q4_0 runs
in 28 ms/token on my 30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the Metal kernel is being JIT-compiled.
So I am using n = 128 to measure time.

At this point Q4_K takes 29.5 ms/token
compared to 27.2 ms/token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
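
The effect of measuring over a longer run can be illustrated with a small sketch. It assumes a simple model in which only the first token pays a fixed one-time cost on top of the steady per-token time. The function name and the 128 ms warm-up figure are hypothetical, chosen for illustration, and are not measurements from this commit.

```python
def amortized_ms_per_token(steady_ms, first_token_extra_ms, n):
    """Average ms/token over n tokens when only the first token pays an extra one-time cost."""
    if n <= 0:
        raise ValueError("n must be positive")
    total_ms = first_token_extra_ms + n * steady_ms
    return total_ms / n


# Hypothetical inputs: 29.5 ms steady state, 128 ms one-time warm-up cost.
for n in (1, 16, 128):
    print(n, amortized_ms_per_token(29.5, 128.0, n))
```

Under these assumed inputs the one-time cost adds 128 ms to the average at n = 1 but only 1 ms at n = 128, which is why a longer run gives a fairer per-token figure.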

* Optimizing Q4_K Metal dot product some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for Q4_0.
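
To put the three reported timing pairs on a common scale, the following sketch computes how much slower Q4_K is than Q4_0 at each stage. The timings are the ones quoted in this commit message; the helper name `relative_overhead` and the stage labels are illustrative only.

```python
def relative_overhead(t_ms, baseline_ms):
    """Percentage by which t_ms exceeds baseline_ms."""
    return (t_ms / baseline_ms - 1.0) * 100.0


# (stage label, Q4_K ms/token, Q4_0 ms/token), as reported in the commit message.
measurements = [
    ("initial", 42.0, 28.0),
    ("n = 128", 29.5, 27.2),
    ("n = 256", 28.1, 27.0),
]

for label, q4_k_ms, q4_0_ms in measurements:
    pct = relative_overhead(q4_k_ms, q4_0_ms)
    print(f"{label}: Q4_K is {pct:.1f}% slower than Q4_0")
```

This works out to roughly 50%, 8.5%, and 4.1% respectively. The three runs used different measurement lengths, so the figures are indicative rather than strictly comparable.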

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <redacted>

.clang-tidy	[deleted file]
ggml-metal.m
ggml-metal.metal