author	Kawrakow <redacted>
	Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)
committer	Georgi Gerganov <redacted>
	Tue, 30 Jan 2024 19:21:10 +0000 (21:21 +0200)
commit	f7b408495c144a682fe07cce75d10a394811aece
tree	683a2e80647ed23a0bc4fc1e663b385c0e2d32b6
parent	ede7a714a4b45d8d84271732d919c6459f6daa4b

SOTA 3-bit quants (llama/5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high, at about halfway between q2_K and
q3_K, so this needs more checking.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent; ARM_NEON is pathetic.

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy test did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <redacted>

include/ggml/ggml.h
src/ggml-cuda.cu
src/ggml-metal.m
src/ggml-metal.metal
src/ggml-quants.c
src/ggml-quants.h
src/ggml.c
tests/test-backend-ops.cpp
tests/test-quantize-fns.cpp
tests/test-quantize-perf.cpp