git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
author wsbagnsv1 <redacted>
Mon, 8 Dec 2025 09:41:08 +0000 (10:41 +0100)
committer GitHub <redacted>
Mon, 8 Dec 2025 09:41:08 +0000 (10:41 +0100)
commit 5814b4dce18f9c5cbebef175e381a7b0ff147d72
tree c5f9cf02310b6cf1f08caac283a199215188f317
parent 79d61896d35f37b79f432ae935698c5459ba8a41

* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Keep the RHS/solution matrix in registers (`x_low`, `x_high`) instead of shared memory, reducing shared memory pressure and bank conflicts.
- Use explicit `fmaf` calls in the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
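
For orientation, the byte-stride addressing and the `fmaf`-based reduction described above can be sketched on the host in plain C++. This is a CPU reference under stated assumptions, not the kernel from this commit: it assumes a lower-triangular, left-sided solve with a non-unit diagonal, and the names `row_at`, `row_at_mut`, and `solve_tri_lower_ref` are hypothetical. The local `acc` only stands in for the register accumulator of the real kernel.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helpers: locate row `row` of a float matrix whose row stride
// is given in BYTES, by casting to char * before the pointer addition.
static inline const float * row_at(const float * base, std::size_t row, std::size_t nb) {
    return reinterpret_cast<const float *>(reinterpret_cast<const char *>(base) + row * nb);
}

static inline float * row_at_mut(float * base, std::size_t row, std::size_t nb) {
    return reinterpret_cast<float *>(reinterpret_cast<char *>(base) + row * nb);
}

// Reference forward substitution: solves A * X = B, where A is n x n lower
// triangular (assumption) and B, X are n x k. All strides are in bytes.
static void solve_tri_lower_ref(const float * A, std::size_t nb_a,
                                const float * B, std::size_t nb_b,
                                float       * X, std::size_t nb_x,
                                int n, int k) {
    for (int col = 0; col < k; ++col) {
        for (int row = 0; row < n; ++row) {
            const float * a_row = row_at(A, row, nb_a);
            // Running value kept in a local variable (a register in a kernel).
            float acc = row_at(B, row, nb_b)[col];
            for (int j = 0; j < row; ++j) {
                // acc = acc - a_row[j] * X[j][col], as one fused multiply-add.
                acc = std::fmaf(-a_row[j], row_at(X, j, nb_x)[col], acc);
            }
            // Divide by the diagonal entry (non-unit diagonal assumed).
            row_at_mut(X, row, nb_x)[col] = acc / a_row[row];
        }
    }
}
```

Passing strides in bytes lets the same routine handle padded rows: a 2 x 2 matrix stored with three floats per row is addressed with `nb_a = 3 * sizeof(float)`, with no change to the loop body.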

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <redacted>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* Remove last fmaf

---------

Co-authored-by: Johannes Gäßler <redacted>
ggml/src/ggml-cuda/solve_tri.cu