git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)
author    agray3 <redacted>
          Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)
committer Georgi Gerganov <redacted>
          Fri, 1 Nov 2024 08:19:05 +0000 (10:19 +0200)
commit    042e95d92f9075cfb2617017ea4a06836761d0ca
tree      325a38ed6f62e5e75b34159b9c78526c0790d52c
parent    81110c0174e11f49358252ca25a04b4168aaec40
Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar load instructions with vector loads, which substantially
improves performance on NVIDIA HBM GPUs: e.g. it gives a 1.27x overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on an
H100 SXM 80GB HBM3. On GDDR GPUs, the speedup is slight (1.01x).
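The idea behind the change can be sketched host-side (this is an illustrative sketch, not the actual `dmmv.cu` kernel code, and `float2_t`, `dot_scalar`, and `dot_vectorized` are hypothetical names): instead of issuing one load per element, adjacent elements are fetched with a single wider load, analogous to using CUDA's `half2`/`float2` vector types in the kernel.

```cpp
#include <cstddef>

// Stand-in for CUDA's built-in float2 vector type (illustration only).
struct float2_t { float x, y; };

// Scalar version: one load per element from each array.
float dot_scalar(const float* a, const float* b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Vectorized version: one two-wide load covers two elements, halving
// the number of load instructions. Assumes n is even and the pointers
// are suitably aligned (8 bytes here), as the real kernel must ensure.
float dot_vectorized(const float* a, const float* b, size_t n) {
    const float2_t* a2 = reinterpret_cast<const float2_t*>(a);
    const float2_t* b2 = reinterpret_cast<const float2_t*>(b);
    float acc = 0.0f;
    for (size_t i = 0; i < n / 2; ++i) {
        float2_t va = a2[i];  // single wide load, two elements
        float2_t vb = b2[i];
        acc += va.x * vb.x + va.y * vb.y;
    }
    return acc;
}
```

On HBM parts the wider transactions make noticeably better use of memory bandwidth, which is why the gain is larger on the H100 than on GDDR GPUs.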

* Addressed review comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/dmmv.cu