git.djapps.eu Git - pkg/ggml/sources/ggml/commit
improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)
author bssrdf <redacted>
Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)
committer Georgi Gerganov <redacted>
Sun, 9 Nov 2025 16:30:22 +0000 (18:30 +0200)
commit 52c9eb3c2155bda92cea77626a29cbf385089887
tree 61fec3ba4a8847997cb969ad3b5adcaa56afdb57
parent 92618f2ad58999f4e910c4b9e62016301c4348ef
improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)

* WIP

* added a cpy kernel specific to transposed tensors, which uses smem (shared memory) to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* stricter check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for the true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank-conflict free

* use padding instead of swizzling
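
The two ideas running through the list above (staging a tile through smem so both the global read and the global write are contiguous, and padding the tile so column reads do not collide on one bank) can be sketched on the host. This is a minimal C++ model, not the code in src/ggml-cuda/cpy.cu: the names `kTile`, `kBanks`, `worst_bank_conflict` and `transpose_tiled` are hypothetical, and it assumes 32 banks of 4-byte words and a 32x32 tile of 4-byte elements. The fp16/bf16 handling mentioned in the commit is not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical constants for this sketch; not taken from cpy.cu.
constexpr int kTile  = 32;  // tile edge, one warp wide
constexpr int kBanks = 32;  // assumed number of 4-byte shared-memory banks

// Worst-case number of lanes in a warp that hit the same bank when the warp
// reads one column of a tile stored with `pitch` 4-byte words per row.
int worst_bank_conflict(int pitch) {
    int worst = 0;
    for (int col = 0; col < kTile; ++col) {
        std::vector<int> hits(kBanks, 0);
        for (int lane = 0; lane < kTile; ++lane) {
            int word = lane * pitch + col;  // lane reads row `lane`, column `col`
            worst = std::max(worst, ++hits[word % kBanks]);
        }
    }
    return worst;
}

// CPU model of a tiled transpose: src is rows x cols (row-major), the result
// is cols x rows. Each tile is staged through a buffer with one padding
// column, standing in for a __shared__ tile[kTile][kTile + 1] in a kernel.
std::vector<float> transpose_tiled(const std::vector<float>& src, int rows, int cols) {
    std::vector<float> dst(src.size());
    float tile[kTile][kTile + 1];
    for (int r0 = 0; r0 < rows; r0 += kTile) {
        for (int c0 = 0; c0 < cols; c0 += kTile) {
            const int th = std::min(kTile, rows - r0);  // tile height (edge tiles may be partial)
            const int tw = std::min(kTile, cols - c0);  // tile width
            // Load: consecutive x touch consecutive source addresses.
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x)
                    tile[y][x] = src[static_cast<std::size_t>(r0 + y) * cols + (c0 + x)];
            // Store: consecutive x touch consecutive destination addresses,
            // while the tile itself is read column-wise.
            for (int y = 0; y < tw; ++y)
                for (int x = 0; x < th; ++x)
                    dst[static_cast<std::size_t>(c0 + y) * rows + (r0 + x)] = tile[x][y];
        }
    }
    return dst;
}
```

Under these assumptions, `worst_bank_conflict(32)` returns 32 (every lane of a column read lands on the same bank) and `worst_bank_conflict(33)` returns 1, which is the effect the one-element padding is meant to have. Padding costs one extra column of tile storage but keeps the indexing a plain `tile[y][x]`, whereas swizzling avoids the extra storage at the price of a remapped index.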

---------

Co-authored-by: bssrdf <redacted>
src/ggml-cuda/cpy.cu
tests/test-backend-ops.cpp