Hide latency of bias and gate-loading (#16847)
author    Oliver Simons <redacted>
          Thu, 30 Oct 2025 03:34:15 +0000 (04:34 +0100)
committer GitHub <redacted>
          Thu, 30 Oct 2025 03:34:15 +0000 (11:34 +0800)
commit    8b11deea4663f29d3e042ce1056ba643264cd5f1
tree      d53ace1542e3f0ca58f57db0e62ae2bfa5ce2a85
parent    b9ce94017729465895402cbcfffb51fa926c15e3
Hide latency of bias and gate-loading (#16847)

This is realised by loading the bias and gate values into registers before
the dot product is computed, effectively batching their loads with the
dot product. Because many threads are resident at this point, the warp
scheduler has enough eligible warps to hide the cost of the two additional
float loads.
ggml/src/ggml-cuda/mmvq.cu
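
The sketch below illustrates the load-placement idea described in the commit
message. It is not the actual mmvq.cu kernel: the kernel name, the plain
float dot product, and the bias/gate pointers are hypothetical stand-ins
(the real kernel works on quantized blocks). The only point being shown is
that the two scalar loads are issued before the dot-product loop, so their
latency overlaps with the arithmetic instead of stalling after it.

    // Illustrative sketch only; names and signature are assumptions,
    // not the real ggml-cuda mmvq kernel.
    #include <cuda_runtime.h>

    __global__ void mmv_fused_bias_gate(const float * __restrict__ x,
                                        const float * __restrict__ y,
                                        const float * __restrict__ bias,
                                        const float * __restrict__ gate,
                                        float * __restrict__ dst,
                                        int ncols) {
        const int row  = blockIdx.x;
        const int lane = threadIdx.x;   // one warp per row in this sketch

        // Issue the two extra loads *before* the dot product. The values
        // land in registers while the loop below keeps this warp (and its
        // siblings) busy, so the load latency is hidden rather than paid
        // after the reduction.
        const float b = bias ? bias[row] : 0.0f;
        const float g = gate ? gate[row] : 1.0f;

        float sum = 0.0f;
        for (int col = lane; col < ncols; col += warpSize) {
            sum += x[row * ncols + col] * y[col];
        }

        // warp-level reduction of the partial sums
        for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }

        if (lane == 0) {
            dst[row] = g * (sum + b);   // apply the pre-loaded bias and gate
        }
    }

With one warp per row, each warp has independent work between the scalar
loads and their first use, and other resident warps can be scheduled in the
meantime, which is why moving the loads earlier hides their latency.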