git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Jayant Lohia <redacted>
	Fri, 27 Feb 2026 18:37:26 +0000 (00:07 +0530)
committer	GitHub <redacted>
	Fri, 27 Feb 2026 18:37:26 +0000 (19:37 +0100)
commit	ecbcb7ea9d3303097519723b264a8b5f1e977028
tree	1ed47fec847d5fa7e050b07d0e63279bad9a7a2d	tree
parent	3e6ab244ad32a2fa9c36372dc741079fa4c0c8e8	commit \| diff

CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)

* CUDA: add CDNA3 MFMA support for flash attention MMA kernel

Add MI300X (gfx942) MFMA tensor core flash attention using
v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).

- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads

Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X):
  pp512  +7%,  pp1024 +13%,  pp2048 +23%,  pp4096 +39%
  tg128  -10% (FA overhead, VEC used for both)

All 2480 flash attention tests pass.

Ref: https://github.com/ggml-org/llama.cpp/issues/17917

* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch

- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
  ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
  crossover on MI300X @ d32768 with power-of-2 GQA models:
    hsk=64  (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
    hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
  Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel

* Update ggml/src/ggml-cuda/fattn.cu

* use ggml_cuda_info().devices warp_size instead of hardcoded check

---------

Co-authored-by: Johannes Gäßler <redacted>

ggml/src/ggml-cuda/fattn-mma-f16.cuh		diff \| blob \| history
ggml/src/ggml-cuda/fattn.cu		diff \| blob \| history
ggml/src/ggml-cuda/mma.cuh		diff \| blob \| history