git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Mon, 11 Sep 2023 07:30:11 +0000 (09:30 +0200)
committer	GitHub <redacted>
	Mon, 11 Sep 2023 07:30:11 +0000 (10:30 +0300)
commit	f31b6f4e2d6def3c0bd7c75f75c0c1e8698e0589
tree	15c450ae8af732c4a0ce48452dc66fc2bfcd3fae
parent	6eeb4d90839bac1e6085e5544654ab5c319ad09a

metal : PP speedup (#3084)

* Minor speed gains for all quantization types

* metal: faster kernel_scale via float4

* Various other speedups for "small" kernels

* metal: faster soft_max via float4

* metal: faster diagonal infinity

Although, to me it looks like one should simply
fuse scale + diagonal infinity + soft_max on the
KQ tensor.

* Another faster f16 x f32 matrix multiply kernel

* Reverting the diag infinity change

It does work for PP, but somehow it fails for TG.
Need to look more into it.

* metal: add back faster diagonal infinity

This time more carefully

* metal : minor (readability)
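
The fusion suggested above (scale + diagonal infinity + soft_max on the KQ tensor) can be illustrated with a small CPU reference. This is only a sketch of the idea, not the Metal kernels changed in this commit: the function name `fused_scale_mask_softmax` is hypothetical, and the masking convention (query row `i` sees keys `0..n_past+i`) is an assumption about how the diagonal-infinity op behaves.

```python
import math

def fused_scale_mask_softmax(kq, scale, n_past):
    """Scale, causal-mask and softmax each row of kq in a single pass.

    kq     : list of rows; row i holds the scores of query i against all keys
    scale  : multiplier applied to every score (e.g. 1/sqrt(head_dim))
    n_past : number of previously cached tokens (assumed convention:
             query i may attend to keys 0 .. n_past + i)
    Rows are assumed to be non-empty and n_past >= 0.
    """
    out = []
    for i, row in enumerate(kq):
        # Keys at index > n_past + i would be set to -inf by the mask step.
        n_visible = min(len(row), n_past + i + 1)
        scaled = [x * scale for x in row[:n_visible]]
        # Subtract the row maximum for numerical stability.
        row_max = max(scaled)
        exps = [math.exp(x - row_max) for x in scaled]
        total = sum(exps)
        # exp(-inf) == 0, so masked positions are written as zeros directly
        # instead of materializing -inf in an intermediate tensor.
        out.append([e / total for e in exps] + [0.0] * (len(row) - n_visible))
    return out
```

Compared with running the three ops separately, the fused form reads and writes each row once and never touches the masked entries, which is the kind of saving the note above appears to have in mind.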

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>

ggml-metal.m
ggml-metal.metal