git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Aman Gupta <redacted>
	Thu, 25 Sep 2025 14:35:05 +0000 (22:35 +0800)
committer	GitHub <redacted>
	Thu, 25 Sep 2025 14:35:05 +0000 (16:35 +0200)
commit	077c94d0caf87fbd3cf3288dbb5c0fd9670294cf
tree	4d8d02b8f6dc106bfe4ccc807c571ef2839ff770
parent	aa3ee0eb0b80efca126cedf9bcb4fb5864b46ce3

CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.
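
The three steps above can be sketched as a single-token CPU reference. This is a minimal illustration, not the kernel added in this commit: the names `TopKResult` and `topk_moe_reference` are hypothetical, the `normalize` flag is an assumption about what an optional norm could look like (dividing the selected weights by their sum), and breaking ties toward the lowest expert index is an assumed convention. The CUDA kernel runs the same stages in parallel, while this sketch is plain serial C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the per-token output (not a type from the commit).
struct TopKResult {
    std::vector<float> weights; // softmax probability of each selected expert
    std::vector<int>   ids;     // index of each selected expert
};

// Serial reference for one token: softmax -> repeated argmax -> collect weights + ids.
TopKResult topk_moe_reference(const std::vector<float> & logits, int n_expert_used, bool normalize) {
    const int n_experts = (int) logits.size();
    assert(n_experts > 0 && n_expert_used > 0 && n_expert_used <= n_experts);

    // Step 1: numerically stable softmax over all expert logits.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_experts);
    float sum = 0.0f;
    for (int i = 0; i < n_experts; ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }

    // Step 2: pick the top-k experts by running argmax k times.
    // The strict '>' keeps the lowest index on ties (assumed convention).
    TopKResult out;
    float selected_sum = 0.0f;
    for (int k = 0; k < n_expert_used; ++k) {
        int best = -1;
        for (int i = 0; i < n_experts; ++i) {
            if (probs[i] < 0.0f) {
                continue; // expert was already selected in an earlier round
            }
            if (best < 0 || probs[i] > probs[best]) {
                best = i;
            }
        }
        // Step 3: record the weight and id of the selected expert.
        out.weights.push_back(probs[best]);
        out.ids.push_back(best);
        selected_sum += probs[best];
        probs[best] = -1.0f; // mask out; valid probabilities are never negative
    }

    // Optional (assumed) normalization: make the selected weights sum to 1.
    if (normalize) {
        for (float & w : out.weights) {
            w /= selected_sum;
        }
    }
    return out;
}
```

For example, calling `topk_moe_reference({1.0f, 3.0f, 2.0f, 0.0f}, 2, false)` selects experts 1 and 2, in that order. A reference of this kind can also serve as a baseline when checking the output of a fused kernel.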

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback

ggml/src/ggml-cuda/ggml-cuda.cu
ggml/src/ggml-cuda/topk-moe.cu	[new file with mode: 0644]
ggml/src/ggml-cuda/topk-moe.cuh	[new file with mode: 0644]
src/llama-graph.cpp
tests/test-backend-ops.cpp