Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)
author    agray3 <redacted>
          Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)
committer Georgi Gerganov <redacted>
          Wed, 16 Oct 2024 08:28:39 +0000 (11:28 +0300)
commit    431305e1e4933a9d60c94a3d508cf53e9838fd86
tree      47e54a46bcabc77bbffe6051ce731825943b8dc9
parent    e4baefd3e06ed7d7730ead44bb8f663690abe793
Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar load instructions with vector load instructions, which
substantially improves performance on NVIDIA HBM GPUs: e.g. it gives a 1.27X
overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
an H100 SXM 80GB HBM3. On GDDR GPUs there is only a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
src/ggml-cuda/dmmv.cu