Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)
author    agray3 <redacted>
          Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)
committer Georgi Gerganov <redacted>
          Wed, 16 Oct 2024 08:28:39 +0000 (11:28 +0300)
commit    431305e1e4933a9d60c94a3d508cf53e9838fd86
tree      47e54a46bcabc77bbffe6051ce731825943b8dc9
parent    e4baefd3e06ed7d7730ead44bb8f663690abe793
Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar load instructions with vector load instructions, which
substantially improves performance on NVIDIA HBM GPUs: e.g. it gives a 1.27X
overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
an H100 SXM 80GB HBM3. On GDDR GPUs there is only a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
src/ggml-cuda/dmmv.cu