git.djapps.eu Git - pkg/ggml/sources/ggml/commit
Hide latency of bias and gate-loading (llama/16847)
author Oliver Simons <redacted>
Thu, 30 Oct 2025 03:34:15 +0000 (04:34 +0100)
committer Georgi Gerganov <redacted>
Sat, 1 Nov 2025 07:41:35 +0000 (09:41 +0200)
commit a66c5912d3ac6c6b463522fddd8d3a48c17dd8e4
tree 320de1d1bf6febbe1df8e68392591fea841919c6
parent 0f0fd00536b9d5d953ad132b0c6c6a0e014e7cee

The bias and gate values are now loaded into registers before the
dot product is computed, so their loads are effectively batched with
the loads the dot product already issues. Because many threads are
resident at this point, the warp scheduler has enough other warps to
switch to, which hides the cost of loading those two extra floats.
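
The idea can be illustrated with a minimal CUDA sketch. This is not the
code from src/ggml-cuda/mmvq.cu: the kernel name matvec_bias_gate, the
host helper run_matvec, the one-warp-per-row layout and the
(sum + b) * g epilogue are all assumptions made for illustration. The
only point it shows is the ordering: the two scalar loads are issued
before the dot-product loop, so their latency can overlap with the
loop's own memory traffic instead of being paid after the reduction.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): one warp computes one output row.
__global__ void matvec_bias_gate(const float * W, const float * x,
                                 const float * bias, const float * gate,
                                 float * y, int ncols) {
    const int row  = blockIdx.x;
    const int lane = threadIdx.x;

    // Issue the two scalar loads first. Their results are not needed until
    // after the reduction, so the latency can overlap with the loop below.
    float b = 0.0f;
    float g = 1.0f;
    if (lane == 0) {
        b = bias[row];
        g = gate[row];
    }

    // Dot product of row `row` of W with x, strided across the 32 lanes.
    float sum = 0.0f;
    for (int c = lane; c < ncols; c += 32) {
        sum += W[row * ncols + c] * x[c];
    }

    // Warp-level tree reduction; lane 0 ends up with the full sum.
    for (int off = 16; off > 0; off >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, off);
    }

    // Assumed epilogue: the real kernel may combine these values differently.
    if (lane == 0) {
        y[row] = (sum + b) * g;
    }
}

// Hypothetical host helper: W and x are all ones, bias[r] = r, gate[r] = 2.
std::vector<float> run_matvec(int nrows, int ncols) {
    std::vector<float> W(nrows * ncols, 1.0f), x(ncols, 1.0f);
    std::vector<float> bias(nrows), gate(nrows, 2.0f), y(nrows, 0.0f);
    for (int r = 0; r < nrows; ++r) {
        bias[r] = (float) r;
    }

    float *dW, *dx, *db, *dg, *dy;
    cudaMalloc(&dW, W.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&db, bias.size() * sizeof(float));
    cudaMalloc(&dg, gate.size() * sizeof(float));
    cudaMalloc(&dy, y.size() * sizeof(float));

    cudaMemcpy(dW, W.data(),    W.size() * sizeof(float),    cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(),    x.size() * sizeof(float),    cudaMemcpyHostToDevice);
    cudaMemcpy(db, bias.data(), bias.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dg, gate.data(), gate.size() * sizeof(float), cudaMemcpyHostToDevice);

    // One block of 32 threads (a single warp) per output row.
    matvec_bias_gate<<<nrows, 32>>>(dW, dx, db, dg, dy, ncols);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), dy, y.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dW); cudaFree(dx); cudaFree(db); cudaFree(dg); cudaFree(dy);
    return y;
}
```

Calling run_matvec(4, 64) from a host main exercises the kernel on a
small all-ones input. Whether hoisting the loads pays off in practice
depends on occupancy and on how the compiler schedules them, so any
gain should be confirmed with a profiler.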
src/ggml-cuda/mmvq.cu