git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
Hide latency of bias and gate-loading (llama/16847)
author Oliver Simons <redacted>
Thu, 30 Oct 2025 03:34:15 +0000 (04:34 +0100)
committer Georgi Gerganov <redacted>
Sun, 9 Nov 2025 21:38:03 +0000 (23:38 +0200)
commit 41f4daca57a6c4e5a10479c3813f7d126625bbdb
tree bb32e471865358b8f6e67332c34a3dc671e69059
parent efe80992687e55340b03d3fb7381432b2bb1203d

This is achieved by loading the bias and gate values into registers
before the dot product is computed, so that their loads are effectively
batched with those of the dot product. Because many threads are alive
at this point, the warp scheduler has enough threads available to hide
the cost of loading those two additional floats.
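
A minimal sketch of the load-early pattern described above. It is not the
code in mmvq.cu: the kernel name, the one-warp-per-row layout, the helper
run_sketch, and the way bias and gate are combined with the result
((sum + bias) * gate) are all assumptions made for illustration only.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one warp (32 threads) computes one output row.
__global__ void matvec_bias_gate_sketch(const float * W, const float * x,
                                        const float * bias, const float * gate,
                                        float * y, int ncols) {
    const int row  = blockIdx.x;
    const int lane = threadIdx.x;

    // Issue the two extra loads first so they are in flight while the
    // dot product below is being computed.
    float b = 0.0f;
    float g = 0.0f;
    if (lane == 0) {
        b = bias[row];
        g = gate[row];
    }

    // Each lane accumulates a strided partial dot product.
    float sum = 0.0f;
    for (int c = lane; c < ncols; c += 32) {
        sum += W[row * ncols + c] * x[c];
    }

    // Warp-level reduction; lane 0 ends up with the full sum.
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    // b and g are already in registers here. The combination below is an
    // illustrative assumption, not the formula used in mmvq.cu.
    if (lane == 0) {
        y[row] = (sum + b) * g;
    }
}

// Hypothetical host helper: copies inputs to the device, launches one
// warp per row, and returns the result vector.
std::vector<float> run_sketch(const std::vector<float> & W, const std::vector<float> & x,
                              const std::vector<float> & bias, const std::vector<float> & gate,
                              int nrows, int ncols) {
    float *dW, *dx, *db, *dg, *dy;
    cudaMalloc(&dW, W.size()    * sizeof(float));
    cudaMalloc(&dx, x.size()    * sizeof(float));
    cudaMalloc(&db, bias.size() * sizeof(float));
    cudaMalloc(&dg, gate.size() * sizeof(float));
    cudaMalloc(&dy, nrows       * sizeof(float));
    cudaMemcpy(dW, W.data(),    W.size()    * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(),    x.size()    * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, bias.data(), bias.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dg, gate.data(), gate.size() * sizeof(float), cudaMemcpyHostToDevice);

    matvec_bias_gate_sketch<<<nrows, 32>>>(dW, dx, db, dg, dy, ncols);

    std::vector<float> y(nrows);
    cudaMemcpy(y.data(), dy, nrows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dW); cudaFree(dx); cudaFree(db); cudaFree(dg); cudaFree(dy);
    return y;
}
```

The compiler may already reorder independent loads on its own, so placing
them first in the source mainly makes the intent explicit. Whether it
helps in practice depends on occupancy and should be measured.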
ggml/src/ggml-cuda/mmvq.cu