git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Gaurav Garg <redacted>
	Sun, 29 Mar 2026 16:35:18 +0000 (22:05 +0530)
committer	GitHub <redacted>
	Sun, 29 Mar 2026 16:35:18 +0000 (18:35 +0200)
commit	ec16a072f06c9c44d33513405a83068b15ae1b2c
tree	20655e82b49956a981d9346b41306384142bb75b
parent	f5d1c4179fedf726bec744d3125a55df8d02496a

Optimize MOE GEMV kernel for BS > 1. (#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MoE kernel for BS > 1 launched too many thread blocks, using a grid of (nrows_x, nchannels_dst, ncols_dst), so each block had very little work: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case. It launches with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently using warp-level reduction only (no shared memory sync).

This change does not increase compilation time, since only a single template instance is needed per type. It also simplifies the original GEMV kernel and removes the `is_multi_token_id` specialization.
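
To illustrate the difference in launch geometry described above, here is a minimal host-side sketch in C++ that only counts thread blocks for the two layouts. It is not the code from mmvq.cu: the `LaunchDims` struct and the `old_moe_launch`, `new_moe_launch`, and `num_blocks` helpers are hypothetical names, and `rpb` is assumed to mean "rows per block".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for launch dimensions (not a llama.cpp type).
struct LaunchDims {
    uint64_t grid_x, grid_y, grid_z;
    uint64_t block_x, block_y;
};

// Total number of thread blocks in the grid.
static uint64_t num_blocks(const LaunchDims & d) {
    return d.grid_x * d.grid_y * d.grid_z;
}

// Previous layout as described in the commit message:
// grid (nrows_x, nchannels_dst, ncols_dst), block (32, 4).
static LaunchDims old_moe_launch(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t ncols_dst) {
    return { nrows_x, nchannels_dst, ncols_dst, 32, 4 };
}

// New layout as described in the commit message:
// grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst).
// rpb is assumed to be the number of rows handled per block.
static LaunchDims new_moe_launch(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t ncols_dst,
                                 uint64_t rpb, uint64_t warp_size) {
    assert(rpb > 0);
    const uint64_t row_blocks = (nrows_x + rpb - 1) / rpb; // integer ceil division
    return { row_blocks, nchannels_dst, 1, warp_size, ncols_dst };
}
```

For example, with made-up sizes nrows_x = 4096, nchannels_dst = 8, ncols_dst = 4, and assuming rpb = 2 and warp_size = 32, `num_blocks(old_moe_launch(4096, 8, 4))` is 131072 while `num_blocks(new_moe_launch(4096, 8, 4, 2, 32))` is 16384. The batch dimension moves from the grid into the block, so each block does more work.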

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <redacted>

ggml/src/ggml-cuda/ggml-cuda.cu
ggml/src/ggml-cuda/mmvq.cu
ggml/src/ggml-cuda/mmvq.cuh