git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Thu, 11 Jan 2024 19:39:39 +0000 (20:39 +0100)
committer	GitHub <redacted>
	Thu, 11 Jan 2024 19:39:39 +0000 (21:39 +0200)
commit	49662cbed3e95f5976c070b85b9fd53fd577038d
tree	b70cd0956715bc11696f6e47d26788e24c5112c4
parent	3ba5b8ca8e6181a5c712c5b77595a29f1d3e2b97

ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)

* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dot product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x the speed of the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <redacted>

ggml-cuda.cu
ggml-metal.m
ggml-metal.metal
ggml-quants.c
ggml-quants.h
ggml.c
ggml.h
llama.cpp
llama.h
tests/test-quantize-fns.cpp