git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Georgi Gerganov <redacted>
	Tue, 24 Oct 2023 13:48:37 +0000 (16:48 +0300)
committer	GitHub <redacted>
	Tue, 24 Oct 2023 13:48:37 +0000 (16:48 +0300)
commit	2b4ea35e56792064598e922e46d081e02bc96b94
tree	dea0a7b3e47c7d876cbce5d30b31c4c78d7bb030
parent	daab3d7f45832e10773c99f3484b0d5b14d86c0c

cuda : add batched cuBLAS GEMM for faster attention (#3749)

* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull request changed. Also, this is my first time creating a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <redacted>
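
The commit message distinguishes a pointer-array batched GEMM (`cublasGemmBatchedEx`) from a strided one (`cublasGemmStridedBatchedEx`) used "for non-broadcasted cases". The following is a minimal CPU-side sketch of that distinction, not the code from `ggml-cuda.cu`. All function names here (`gemm_ref`, `gemm_strided_batched_ref`, `gemm_batched_ref`, `broadcast_ptrs`) are hypothetical, the matrices are row-major `float` for simplicity, and the broadcast mapping `b / r` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C[m x n] = A[m x k] * B[k x n], all row-major.
static void gemm_ref(const float* A, const float* B, float* C,
                     int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Strided form: matrix b of each operand lives at base + b * stride.
// No per-batch pointer table is needed, but every operand must hold
// exactly `batch` matrices at a fixed spacing (no broadcasting).
void gemm_strided_batched_ref(const float* A, std::size_t strideA,
                              const float* B, std::size_t strideB,
                              float* C, std::size_t strideC,
                              int m, int n, int k, int batch) {
    for (int b = 0; b < batch; ++b) {
        gemm_ref(A + b * strideA, B + b * strideB, C + b * strideC, m, n, k);
    }
}

// Pointer-array form: the caller supplies one pointer per matrix, so
// several batches may point at the same A matrix (broadcasting).
void gemm_batched_ref(const std::vector<const float*>& A,
                      const std::vector<const float*>& B,
                      const std::vector<float*>& C,
                      int m, int n, int k) {
    assert(A.size() == B.size() && B.size() == C.size());
    for (std::size_t b = 0; b < A.size(); ++b) {
        gemm_ref(A[b], B[b], C[b], m, n, k);
    }
}

// Build a pointer table that shares n_src source matrices across n_dst
// batches. Assumed mapping: batch b reads source matrix b / r, where
// r = n_dst / n_src.
std::vector<const float*> broadcast_ptrs(const float* base, std::size_t stride,
                                         int n_src, int n_dst) {
    assert(n_src > 0 && n_dst % n_src == 0);
    const int r = n_dst / n_src;
    std::vector<const float*> ptrs(n_dst);
    for (int b = 0; b < n_dst; ++b) {
        ptrs[b] = base + static_cast<std::size_t>(b / r) * stride;
    }
    return ptrs;
}
```

The trade-off this sketch is meant to show: the strided form needs only base pointers and strides, while the pointer-array form needs a table of per-batch pointers but can express broadcasting. On a GPU that table has to be built and stored somewhere, which may be what the "reduce mallocs" item refers to.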

CMakeLists.txt
examples/batched/batched.cpp
ggml-cuda.cu
ggml.c