git.djapps.eu Git - pkg/ggml/sources/ggml/commit

author	Kawrakow <redacted>
	Mon, 11 Mar 2024 15:53:15 +0000 (16:53 +0100)
committer	Georgi Gerganov <redacted>
	Thu, 14 Mar 2024 16:46:58 +0000 (18:46 +0200)
commit	38ff2b2101087feeff1545ae569667bc6702e48f
tree	bae6385c67474e49c2dbfb9e194d914cba90500c
parent	125e405ac93a6485d1c6e1ffcb4712930792815e

1.5 bit: we can do even better (llama/5999)

* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <redacted>
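
To make the "-1 + delta, delta, 1 + delta" description concrete, here is a minimal C sketch of how one group of ternary quants could be dequantized under this scheme. It is an illustration only, not the code from this commit: the names `dequant_group`, `group_scale`, `group_delta`, and `DELTA` are hypothetical, and the bit layout (low 3 bits for the scale, bit 3 for the sign of the shift, odd-multiple scale mapping) is an assumption chosen to match the "one of the 4 scale bits" wording.

```c
#include <assert.h>
#include <stdint.h>

#define DELTA 0.125f  /* shift magnitude quoted in the commit message */

/* Assumed layout, for illustration only: bit 3 of the 4-bit field holds the
 * sign of the shift, the low 3 bits hold the group scale. */
static float group_delta(uint8_t field) {
    return (field & 0x8) ? -DELTA : DELTA;
}

/* Assumed scale mapping: odd multiples of the block scale d. */
static float group_scale(float d, uint8_t field) {
    return d * (float)(2 * (field & 0x7) + 1);
}

/* Dequantize n ternary values q[i] in {-1, 0, +1} into y[i].
 * Each output is scale * (q[i] + delta), i.e. one of
 * -1 + delta, delta, 1 + delta before scaling. */
static void dequant_group(const int8_t *q, int n, float d, uint8_t field, float *y) {
    const float dl    = group_scale(d, field);
    const float delta = group_delta(field);
    for (int i = 0; i < n; ++i) {
        assert(q[i] >= -1 && q[i] <= 1);  /* inputs must be ternary */
        y[i] = dl * ((float)q[i] + delta);
    }
}
```

With a positive shift the three levels before scaling are -0.875, 0.125, and 1.125; with a negative shift they are -1.125, -0.125, and 0.875. The shift sign is stored once per group, so it costs one bit of scale resolution rather than any extra storage.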

ggml-common.h
src/ggml-cuda.cu
src/ggml-metal.metal
src/ggml-quants.c