git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Shouzheng Liu <redacted>
	Wed, 16 Aug 2023 20:07:04 +0000 (16:07 -0400)
committer	GitHub <redacted>
	Wed, 16 Aug 2023 20:07:04 +0000 (23:07 +0300)
commit	bf83bff6742c0f1795b4c18695a13a34ac7adf62
tree	1f1d4e77bf04c459686961540d3e359e8aceb519
parent	b5ffb2849d23afe73647f68eec7b68187af09be6

metal : matrix-matrix multiplication kernel (#2615)

* metal: matrix-matrix multiplication kernel

This commit removes MPS (Metal Performance Shaders) and uses custom
matrix-matrix multiplication kernels for all quantization types. It
also adds grouped-query attention (GQA) to support llama2 70B.
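
As a rough illustration of what grouped-query attention means for indexing, the sketch below maps a query head to the key/value head it shares. The helper name `kv_head_for` and the parameters `n_head` and `n_head_kv` are hypothetical and are not taken from this commit. The 64 query heads and 8 KV heads in the usage note are the published llama2 70B configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the commit): with grouped-query attention,
 * several query heads share one key/value head. */
static uint32_t kv_head_for(uint32_t q_head, uint32_t n_head, uint32_t n_head_kv) {
    /* Assumption: the query head count is an exact multiple of the KV head count. */
    assert(n_head_kv > 0 && n_head % n_head_kv == 0);
    assert(q_head < n_head);
    uint32_t gqa = n_head / n_head_kv; /* query heads per KV head */
    return q_head / gqa;               /* index of the shared KV head */
}
```

With `n_head = 64` and `n_head_kv = 8`, query heads 0 through 7 read KV head 0, heads 8 through 15 read KV head 1, and so on. When `n_head == n_head_kv` the mapping is the identity, which is ordinary multi-head attention.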

* metal: fix performance degradation from gqa

Integer arithmetic is slow on the GPU, and 64-bit divides are extremely
slow. For GQA, we introduced a 64-bit divide that the compiler cannot
optimize out, which reduces inference performance by ~8%. This commit
fixes the issue by calculating part of the offset with a 32-bit divide.
Naturally, this limits the size of a single matrix to ~4GB. However,
this limit should be acceptable for the near future.
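
The trade-off can be sketched in plain C. This is one plausible reading of the message, not the kernel code itself: the function names are hypothetical, `im`, `gqa` and `nb02` are used here as illustrative parameter names, and the assumption is that the partial offset is kept entirely in 32 bits, which is where a ~4GB ceiling would come from.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit path: every operand is 64-bit, so the divide is a 64-bit divide. */
static uint64_t src0_offset_u64(uint64_t im, uint64_t gqa, uint64_t nb02) {
    assert(gqa > 0);
    return im / gqa * nb02;
}

/* 32-bit path (assumed shape of the fix): divide and multiply stay in
 * 32 bits, so the result wraps once the byte offset reaches 2^32. */
static uint32_t src0_offset_u32(uint32_t im, uint32_t gqa, uint32_t nb02) {
    assert(gqa > 0);
    return im / gqa * nb02;
}
```

The two functions agree whenever the byte offset fits in 32 bits. Past 2^32 bytes the 32-bit version wraps around, so under this reading a single matrix has to stay below roughly 4GB.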

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.

CMakeLists.txt
Makefile
flake.nix
ggml-metal.m
ggml-metal.metal
llama.cpp