Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
author     agray3 <redacted>
           Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)
committer  GitHub <redacted>
           Mon, 14 Oct 2024 00:49:08 +0000 (02:49 +0200)
commit     13dca2a54a394757d56fdd652b9f0df08f44ea22
tree       af4fc70a398266bfec9a9b144616483919500d9e
parent     d4c19c0f5cdb1e512573e8c86c79e8d0238c73c4
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar load instructions with vector load instructions, which
substantially improves performance on NVIDIA HBM GPUs, e.g. a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 batch-size-1 (BS1) inference
evaluation on an H100 SXM 80GB HBM3. On GDDR GPUs the speedup is slight (1.01X).
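
A minimal sketch of the idea (not the actual dmmv.cu change, whose loop
structure and vector width may differ): a per-thread f16 dot-product loop
where the vectorized variant reinterprets the half pointers as half2, so
each iteration issues one 32-bit load per operand instead of two 16-bit
loads. All names here (dot_f16_scalar, dot_f16_vec) are illustrative.

    #include <cuda_fp16.h>

    // Scalar variant: one 16-bit load per operand per iteration.
    __device__ float dot_f16_scalar(const half * x, const half * y, const int n) {
        float sum = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            sum += __half2float(x[i]) * __half2float(y[i]);
        }
        return sum;
    }

    // Vectorized variant: half2 loads halve the number of load instructions,
    // which helps most on high-bandwidth (HBM) GPUs. Assumes n is even and
    // the pointers are 4-byte aligned.
    __device__ float dot_f16_vec(const half * x, const half * y, const int n) {
        const half2 * x2 = (const half2 *) x;
        const half2 * y2 = (const half2 *) y;
        float sum = 0.0f;
        for (int i = threadIdx.x; i < n/2; i += blockDim.x) {
            const float2 xv = __half22float2(x2[i]);
            const float2 yv = __half22float2(y2[i]);
            sum += xv.x*yv.x + xv.y*yv.y;
        }
        return sum;
    }

A real kernel would follow the loop with a per-block reduction over sum;
that part is omitted here to keep the contrast between the two load
patterns visible.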

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/dmmv.cu