git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
cuda: optimize SOLVE_TRI using registers and FMAF (llama/17703)
author wsbagnsv1 <redacted>
Mon, 8 Dec 2025 09:41:08 +0000 (10:41 +0100)
committer Georgi Gerganov <redacted>
Fri, 12 Dec 2025 15:53:21 +0000 (17:53 +0200)
commit e1562e85fccbcc0992cba29b2310c64bb50fd818
tree c9c790bce659e748c3230fde107ab249c767911e
parent c8d0ee2f9f8622e7d0fe2864808f9f0fa5e5e648

* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
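
The byte-stride addressing and the `fmaf` reduction described in the points above can be illustrated with a minimal host-side C++ sketch. It is not the code from `solve_tri.cu`: it assumes a lower-triangular matrix solved by forward substitution, handles a single right-hand-side column, and does not model the warp-level register layout (`x_low`, `x_high`). The names `row_ptr`, `solve_lower`, and `nb1` are hypothetical and used only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from solve_tri.cu): return row `i` of a float
// matrix whose row stride `nb1` is given in BYTES. The base pointer is cast
// to `const char *` before the offset is added, so the stride is not scaled
// by sizeof(float).
static const float * row_ptr(const float * base, std::size_t i, std::size_t nb1) {
    return reinterpret_cast<const float *>(
        reinterpret_cast<const char *>(base) + i * nb1);
}

// Hypothetical sketch: solve A * x = b by forward substitution, where A is an
// n x n lower-triangular matrix with a non-zero diagonal and rows that are
// `nb1` bytes apart. Assumes a single right-hand-side column.
static std::vector<float> solve_lower(const float * A, std::size_t nb1,
                                      const std::vector<float> & b) {
    const std::size_t n = b.size();
    std::vector<float> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float * a_row = row_ptr(A, i, nb1);
        float sum = 0.0f;
        // Reduction over the already solved entries, one fused
        // multiply-add per term: sum = a_row[j] * x[j] + sum.
        for (std::size_t j = 0; j < i; ++j) {
            sum = std::fmaf(a_row[j], x[j], sum);
        }
        x[i] = (b[i] - sum) / a_row[i];
    }
    return x;
}
```

Passing the stride in bytes lets the same function handle padded rows: for a 3x3 matrix stored with four floats per row, the caller passes `4 * sizeof(float)` as `nb1` and the padding column is never read.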

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>
* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf
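
The last two changes swap explicit `fmaf` calls for plain arithmetic. A minimal host-side C++ sketch of the two spellings of the same reduction follows; the function names `dot_fmaf` and `dot_plain` are hypothetical and do not appear in `solve_tri.cu`. In device code, nvcc contracts a multiply followed by an add into a fused multiply-add by default (`--fmad=true`), so the plain form would normally compile to the same kind of instruction. The commit message itself gives only readability as the reason for the change.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical example: accumulate a dot product with explicit fused
// multiply-add calls. Each step rounds once.
static float dot_fmaf(const float * a, const float * x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
        sum = std::fmaf(a[j], x[j], sum);
    }
    return sum;
}

// Hypothetical example: the same reduction written as a plain expression.
// Whether this is fused into one instruction is left to the compiler, so on
// a host compiler the result may differ from dot_fmaf in the last bit.
static float dot_plain(const float * a, const float * x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
        sum += a[j] * x[j];
    }
    return sum;
}
```

For inputs whose products and partial sums are exactly representable in `float`, such as small integers, both functions return the same value regardless of contraction.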

---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/solve_tri.cu