improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
author bssrdf <redacted>
Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)
committer GitHub <redacted>
Wed, 5 Nov 2025 20:55:04 +0000 (21:55 +0100)
commit 230d1169e5bfe04a013b2e20f4662ee56c2454b0
tree d9e0011472ef6e166f9210ffe4f86024ab8243ee
parent a44d77126c911d105f7f800c17da21b2a5b112d1
improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)

* WIP

* added a cpy kernel specific to transposed tensors which uses shared memory (smem) to avoid uncoalesced access; test cases also added showing the improved memory bandwidth (see the sketch after this list)

* added BF16 support

* stricter check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy for more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for the truly transposed case; updated with review suggestions

* make CI happy

* remove unneeded headers

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank-conflict free

* use padding instead of swizzling (illustrated in the sketch below)
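
The shared-memory approach referenced above is the standard tiled transpose-copy technique. The sketch below is not the actual kernel from ggml/src/ggml-cuda/cpy.cu; the kernel name cpy_transpose_f16, the TILE size, and the test harness are assumptions made purely for illustration. It shows the two ideas from the commit messages: stage a TILE x TILE block in shared memory so that both the global-memory read and the global-memory write are coalesced, and pad each shared-memory row by one element so that column-wise reads of the tile are free of bank conflicts.

```cuda
// Minimal, self-contained sketch of a shared-memory tiled transpose copy.
// NOTE: this is NOT the ggml-cuda kernel from cpy.cu; names (cpy_transpose_f16,
// TILE) and the harness are assumptions made for illustration only.
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdio>

#define TILE 32

__global__ void cpy_transpose_f16(const half * __restrict__ src,
                                  half * __restrict__ dst,
                                  int rows, int cols) {
    // Pad each tile row by one element so column-wise reads of the tile hit
    // different shared-memory banks (no bank conflicts).
    __shared__ half tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x; // column in src
    int y = blockIdx.y * TILE + threadIdx.y; // row    in src

    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = src[y * cols + x]; // coalesced read
    }
    __syncthreads();

    // Swap the block coordinates so the write into dst (cols x rows) also
    // runs along contiguous addresses, i.e. stays coalesced.
    x = blockIdx.y * TILE + threadIdx.x; // column in dst
    y = blockIdx.x * TILE + threadIdx.y; // row    in dst

    if (x < rows && y < cols) {
        dst[y * rows + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
    }
}

int main() {
    const int rows = 64, cols = 96;

    half * src = nullptr;
    half * dst = nullptr;
    cudaMallocManaged(&src, rows * cols * sizeof(half));
    cudaMallocManaged(&dst, rows * cols * sizeof(half));

    for (int i = 0; i < rows * cols; ++i) {
        src[i] = __float2half((float) (i % 1024));
    }

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    cpy_transpose_f16<<<grid, block>>>(src, dst, rows, cols);
    cudaDeviceSynchronize();

    // dst is the transpose of src: dst[c][r] must equal src[r][c]
    bool ok = true;
    for (int r = 0; r < rows && ok; ++r) {
        for (int c = 0; c < cols && ok; ++c) {
            ok = __half2float(dst[c * rows + r]) == __half2float(src[r * cols + c]);
        }
    }
    printf("transpose copy %s\n", ok ? "OK" : "FAILED");

    cudaFree(src);
    cudaFree(dst);
    return ok ? 0 : 1;
}
```

The one-element padding costs a single extra column of shared memory per tile but removes the bank conflicts that a square power-of-two tile would otherwise incur on column-wise access, which mirrors the "use padding instead of swizzling" choice above.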

---------

Co-authored-by: bssrdf <redacted>
ggml/src/ggml-cuda/cpy.cu
tests/test-backend-ops.cpp