pkg/ggml/sources/llama.cpp commit

author	Chenguang Li <redacted>
	Thu, 19 Mar 2026 03:02:42 +0000 (11:02 +0800)
committer	GitHub <redacted>
	Thu, 19 Mar 2026 03:02:42 +0000 (11:02 +0800)
commit	07ba6d275b0f5c138c72f75d7f3df2661f17c27a
tree	caefafeb8b27b0e307a3939373430a561473729b
parent	6729d4920c7509f0d110f114a9652793b5fe668a

CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (#20031)

- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
  padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
  then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix the second-part offset in aclnn_get_slope: compute it with
  ggml_type_size(dtype) instead of sizeof(float), so that ALiBi slopes are
  correct when dtype is F16 (e.g. GQA with 48 heads). This fixes a buffer
  overflow and large numerical errors in those cases.

ggml/src/ggml-cann/aclnn_ops.cpp
ggml/src/ggml-cann/ggml-cann.cpp