git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit

author	bssrdf <redacted>
	Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)
committer	Georgi Gerganov <redacted>
	Sun, 9 Nov 2025 21:38:03 +0000 (23:38 +0200)
commit	13cd9065016857d5b57c44afb87e9cd1bfc7e99c
tree	cfd1db9bfc7d7e1252c47db097fc9e8871367f10
parent	558a04c9c72cbd1c84341301a81e070e207f67b8

improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)

* WIP

* added a cpy kernel specific to transposed tensors which uses shared memory (smem) to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* stricter check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy for more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for the true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank-conflict free

* use padding instead of swizzling
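
For readers unfamiliar with the technique named in the items above, here is a rough CPU-side sketch of a tiled transpose with a padded tile. It is not the kernel from this commit. The names `max_bank_conflict` and `transpose_tiled` are hypothetical, and the figures of 32 shared memory banks of 4-byte words and a 32-thread warp are assumptions about typical NVIDIA hardware rather than anything stated in the commit message.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed hardware parameters (not taken from the commit).
constexpr int kBanks = 32;  // shared memory banks, one 4-byte word each
constexpr int kWarp  = 32;  // threads per warp
constexpr int kTile  = 32;  // tile edge used by the emulation below

// Worst-case number of lanes in one warp that hit the same bank when lane t
// reads element [t][0] of a row-major tile with `row_stride` words per row.
int max_bank_conflict(int row_stride) {
    assert(row_stride > 0);
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < kWarp; ++lane) {
        ++hits[(lane * row_stride) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// CPU emulation of a tiled transpose. `src` is rows x cols, row-major;
// the result is cols x rows, row-major. The local `tile` array stands in
// for shared memory and carries one padding column per row.
std::vector<float> transpose_tiled(const std::vector<float>& src, int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(src.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    std::vector<float> dst(src.size());
    float tile[kTile][kTile + 1];  // +1 column of padding
    for (int r0 = 0; r0 < rows; r0 += kTile) {
        for (int c0 = 0; c0 < cols; c0 += kTile) {
            const int h = std::min(kTile, rows - r0);
            const int w = std::min(kTile, cols - c0);
            // Load phase: walk src row by row (contiguous reads).
            for (int r = 0; r < h; ++r) {
                for (int c = 0; c < w; ++c) {
                    tile[r][c] = src[static_cast<std::size_t>(r0 + r) * cols + (c0 + c)];
                }
            }
            // Store phase: walk dst row by row (contiguous writes),
            // which reads the tile column by column.
            for (int c = 0; c < w; ++c) {
                for (int r = 0; r < h; ++r) {
                    dst[static_cast<std::size_t>(c0 + c) * rows + (r0 + r)] = tile[r][c];
                }
            }
        }
    }
    return dst;
}
```

Under these assumptions, a column read from an unpadded 32-word-wide tile puts every lane on the same bank, while a row stride of 33 words spreads the lanes over distinct banks. That is the usual argument for padding the tile by one column; the commit message does not say which padding the actual kernel uses.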

---------

Co-authored-by: bssrdf <redacted>