CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
author    Gaurav Garg <redacted>
Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)
committer GitHub <redacted>
Wed, 19 Mar 2025 19:52:06 +0000 (20:52 +0100)
commit 517b5ddbf002b91fd6d6daf5d8db8c88a0173039
tree   d6801861ad6dfd959dfd4333d6e01cea41175533
parent a9b59288e222f39fc0311dc66944ed5a86c815fa
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)

- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to choose the optimal parallel_blocks value (see the sketch after this list).
- Prefer the vector flash attention kernels over the MMA kernel for the BS=1 case.
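As a rough illustration of the first point, below is a minimal sketch of occupancy-based selection of parallel_blocks. The kernel name, launch parameters, and helper function are hypothetical and do not correspond to the exact symbols in fattn-common.cuh; they only show how the occupancy API can bound how far a batch-size-1 launch is split.

```cpp
// Hypothetical sketch only: flash_attn_vec_kernel and choose_parallel_blocks
// are illustrative names, not the repo's actual symbols.
#include <cuda_runtime.h>

__global__ void flash_attn_vec_kernel() {
    // placeholder for the real flash attention vector kernel
}

// Pick how many blocks each attention column is split over so that a
// batch-size-1 launch still fills the GPU with resident blocks.
static int choose_parallel_blocks(int nblocks_base, int block_size, size_t smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);

    int nsm = 0;
    cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, device);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_attn_vec_kernel, block_size, smem_bytes);

    // Total number of blocks the device can keep resident at once.
    const int max_resident = blocks_per_sm * nsm;

    // Increase parallel_blocks while the whole wave still fits on the GPU.
    int parallel_blocks = 1;
    while (nblocks_base * (parallel_blocks + 1) <= max_resident) {
        ++parallel_blocks;
    }
    return parallel_blocks;
}
```

The idea is that for BS=1 the base grid (roughly one block per head) can be much smaller than what the GPU can keep resident, so the occupancy-derived limit is used to decide how far to split the work instead of a fixed heuristic.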

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <redacted>
12 files changed:
ggml/src/ggml-cuda/fattn-common.cuh
ggml/src/ggml-cuda/fattn-mma-f16.cuh
ggml/src/ggml-cuda/fattn-tile-f16.cu
ggml/src/ggml-cuda/fattn-tile-f32.cu
ggml/src/ggml-cuda/fattn-vec-f16.cuh
ggml/src/ggml-cuda/fattn-vec-f32.cuh
ggml/src/ggml-cuda/fattn-wmma-f16.cu
ggml/src/ggml-cuda/fattn.cu
ggml/src/ggml-cuda/ggml-cuda.cu
ggml/src/ggml-cuda/vendors/hip.h
ggml/src/ggml-cuda/vendors/musa.h
tests/test-backend-ops.cpp