git.djapps.eu Git - pkg/ggml/sources/ggml/commit

author	Gaurav Garg <redacted>
	Sun, 22 Mar 2026 08:49:35 +0000 (14:19 +0530)
committer	Georgi Gerganov <redacted>
	Sat, 28 Mar 2026 11:39:09 +0000 (13:39 +0200)
commit	81419ebe55fed50f3cf55601d57549320325e248
tree	9c8f6ccb2382f9e708372fe43d1423c91ed9e483
parent	cd192193be6966bc167d31fa575b4c6ed845b226

Increase number of output elements per thread block if the K-dimension is small (llama/20635)

* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split across devices, which makes it quite small, especially for MoE models. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps regardless of the K-dimension, so for small K some of the threads sit idle, which results in poor performance for these matrices.

This change increases the number of output elements computed by each thread block in such cases, so that the otherwise idle threads get work.

* Limit this change to ncols_dst = 1

* Convert tabs to spaces