git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Sun, 18 Feb 2024 16:16:55 +0000 (18:16 +0200)
committer	GitHub <redacted>
	Sun, 18 Feb 2024 16:16:55 +0000 (18:16 +0200)
commit	bd2d4e393b2b7d2a1b2e201058e26017c9728ead
tree	5c51109459cf1a25fc92fdb11d420895e16785ac
parent	c8e0d7efeb7634ecc2e9832e879ab9fca4510e71

1.5 bit quantization (#5453)

* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
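
The "0.04 extra bpw" quoted for the attn_output change can be sanity-checked with a weighted average of bits per weight. The sketch below is illustrative only and is not code from this commit. The helper names `mixed_bpw` and `extra_bpw` are hypothetical, and the inputs suggested in the usage note (1.5625 bpw for IQ1_S, 2.0625 bpw for IQ2_XXS, about 8% of the weights in attn_output tensors) are assumptions that the commit message does not state.

```cpp
#include <cassert>

// Average bits per weight when a fraction `frac_alt` of all weights is stored
// at `alt_bpw` and the remainder at `base_bpw`.
// Hypothetical helper for illustration; not part of llama.cpp.
double mixed_bpw(double frac_alt, double base_bpw, double alt_bpw) {
    assert(frac_alt >= 0.0 && frac_alt <= 1.0);
    return (1.0 - frac_alt) * base_bpw + frac_alt * alt_bpw;
}

// Extra bits per weight paid relative to storing everything at `base_bpw`.
// Algebraically this equals frac_alt * (alt_bpw - base_bpw).
double extra_bpw(double frac_alt, double base_bpw, double alt_bpw) {
    return mixed_bpw(frac_alt, base_bpw, alt_bpw) - base_bpw;
}
```

Under those assumed inputs, `extra_bpw(0.08, 1.5625, 2.0625)` evaluates to 0.08 * 0.5, which is in line with the figure in the commit message. The actual fraction depends on the model architecture.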

examples/quantize/quantize.cpp
ggml-backend.c
ggml-cuda.cu
ggml-metal.m
ggml-metal.metal
ggml-quants.c
ggml-quants.h
ggml.c
ggml.h
llama.cpp
llama.h
tests/test-backend-ops.cpp