git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Gaurav Garg <redacted>
	Sun, 22 Mar 2026 08:49:35 +0000 (14:19 +0530)
committer	GitHub <redacted>
	Sun, 22 Mar 2026 08:49:35 +0000 (16:49 +0800)
commit	ccb87fa3ee1961ec915f77cb447706f471dca6a5
tree	ba2573930b3c5618cb9ff1adbeb65b9386141795
parent	3306dbaef7553da03971c617e48cd27d00328bb4

[CUDA] Increase number of output elements per thread block if the K-dimension is small (#20635)

* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split across devices, which makes it quite small, especially for MoE models. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps regardless of the K-dimension size, which leaves some threads idle when K is small and results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* Convert tabs to spaces
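
The idle-thread argument in the first item can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the heuristic from this commit: `ELEMS_PER_THREAD`, `utilization`, and `pick_rows_per_block` are hypothetical names and values chosen for illustration, and only the "4 warps" figure comes from the message above. The model counts how many thread-steps do useful work when one block of 4 warps is shared among a given number of output elements.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical constants for illustration only (not taken from llama.cpp).
constexpr int WARP_SIZE        = 32;
constexpr int WARPS_PER_BLOCK  = 4;  // the "group of 4 warps" from the commit message
constexpr int ELEMS_PER_THREAD = 8;  // assumed K elements consumed by one thread per step

// Fraction of thread-steps doing useful work when one block cooperates on
// `rows` output elements, each a dot product of length k.
double utilization(int k, int rows) {
    const int threads = WARP_SIZE * WARPS_PER_BLOCK;
    assert(rows > 0 && threads % rows == 0);
    const int threads_per_row = threads / rows;
    // Chunks of ELEMS_PER_THREAD elements along K, per output element.
    const int work_items = (k + ELEMS_PER_THREAD - 1) / ELEMS_PER_THREAD;
    // Steps needed for the threads assigned to one row to cover its chunks.
    const int steps = (work_items + threads_per_row - 1) / threads_per_row;
    return double(work_items) / double(steps * threads_per_row);
}

// Hypothetical selection rule: the smallest candidate row count that gives
// the highest modeled utilization.
int pick_rows_per_block(int k) {
    int    best_rows = 1;
    double best      = utilization(k, 1);
    for (int rows : {2, 4}) {
        const double u = utilization(k, rows);
        if (u > best) {
            best      = u;
            best_rows = rows;
        }
    }
    return best_rows;
}
```

Under these assumed constants, K = 768 with one output element per block keeps 96 of 128 threads busy, while a large K such as 4096 fills the block completely, so the model only asks for more output elements per block when K is small. The real kernel's per-thread work depends on the quantization type and other parameters that this sketch does not model.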