git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)
author bssrdf <redacted>
Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)
committer Georgi Gerganov <redacted>
Sun, 9 Nov 2025 21:38:03 +0000 (23:38 +0200)
commit 13cd9065016857d5b57c44afb87e9cd1bfc7e99c
tree cfd1db9bfc7d7e1252c47db097fc9e8871367f10
parent 558a04c9c72cbd1c84341301a81e070e207f67b8
* WIP

* added a cpy kernel specific to transposed tensors which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank-conflict free

* use padding instead of swizzling

---------

Co-authored-by: bssrdf <redacted>
ggml/src/ggml-cuda/cpy.cu