git.djapps.eu Git - pkg/ggml/sources/ggml/commit

author	Kawrakow <redacted>
	Mon, 11 Mar 2024 06:51:49 +0000 (07:51 +0100)
committer	Georgi Gerganov <redacted>
	Thu, 14 Mar 2024 16:46:58 +0000 (18:46 +0200)
commit	4c3f0ac0b4d834704260d5dfea6a483d7a2409fd
tree	636dfe8bcb71f3c7e8ee22641e8c29bd363f2a56
parent	2e305f5609fcf05a7adaa330afc77c98536ffcd0

Better 1.5 bit quantization (llama/5971)

* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speed up Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h
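
The bits-per-weight (bpw) trade-off described in the "going to blocks of 32" note can be checked with a little arithmetic. The sketch below is only an illustration under assumed parameters that are not stated in the commit message: each lattice index is taken to encode a group of 8 weights, each block to carry 4 bits of metadata (scale and similar), and each superblock of 256 weights to carry one 16-bit scale. The function name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(lattice_points, block_size,
                    group_size=8,              # assumption: one lattice index per 8 weights
                    block_meta_bits=4,         # assumption: per-block scale/metadata bits
                    superblock_size=256,       # superblocks of 256 weights (from the message)
                    superblock_scale_bits=16): # assumption: one fp16 scale per superblock
    """Return bits per weight for a hypothetical lattice-quantized superblock.

    block_size=None means there are no blocks, only the superblock.
    """
    # Bits needed to address one lattice point (exact for powers of two).
    index_bits = (lattice_points - 1).bit_length()
    total_bits = (superblock_size // group_size) * index_bits
    if block_size is not None:
        total_bits += (superblock_size // block_size) * block_meta_bits
    total_bits += superblock_scale_bits
    return total_bits / superblock_size


if __name__ == "__main__":
    print(bits_per_weight(2048, 32))    # blocks of 32, 2048 lattice points
    print(bits_per_weight(4096, 64))    # blocks of 64, 4096 lattice points
    print(bits_per_weight(4096, None))  # 4096 lattice points, superblock only
```

Under these assumptions, moving to 4096 lattice points costs one extra index bit per group, so keeping blocks of 64 would raise the bpw, while dropping the per-block metadata entirely brings it back to the same value as blocks of 32 with 2048 points.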

---------

Co-authored-by: Iwan Kawrakow <redacted>
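
The idea of packing the codebook into uint32_t's, mentioned for Metal and CUDA above, can be sketched as follows. This is a minimal illustration and not the layout used in the diff: it assumes each codebook entry holds 8 values drawn from {-1, 0, 1} and stores each value, offset by 1, in its own 4-bit nibble so that one entry fits a single 32-bit word. The helpers `pack_entry` and `unpack_entry` are hypothetical names.

```python
def pack_entry(values):
    """Pack 8 values from {-1, 0, 1} into one 32-bit word, 4 bits per value."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if v not in (-1, 0, 1):
            raise ValueError("values must be -1, 0 or 1")
        # Store v + 1 (0, 1 or 2) in nibble i, lowest nibble first.
        word |= (v + 1) << (4 * i)
    return word


def unpack_entry(word):
    """Recover the 8 values from a word produced by pack_entry."""
    return [((word >> (4 * i)) & 0xF) - 1 for i in range(8)]


if __name__ == "__main__":
    entry = [-1, 0, 1, 1, 0, -1, -1, 1]
    word = pack_entry(entry)
    print(hex(word))
    print(unpack_entry(word) == entry)
```

A packed entry of this kind is half the size of one stored as eight separate bytes, which is one plausible reason a uint32_t codebook could be cheaper to load on a GPU.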

ggml-common.h
src/ggml-cuda.cu
src/ggml-metal.metal
src/ggml-quants.c
src/ggml-quants.h