Improve CUDA graph capture (llama/19754)
author    Gaurav Garg <redacted>
          Sat, 21 Feb 2026 09:39:36 +0000 (15:09 +0530)
committer Georgi Gerganov <redacted>
          Wed, 25 Feb 2026 10:32:13 +0000 (12:32 +0200)
commit    476f2db227f2ccf8e234be61f3aa3e5d40fac474
tree      45eee86e4e08611c069a353fb86d19f75b6941f6
parent    fead65cefe76a46878847e733d61795c5cac76b7
Improve CUDA graph capture (llama/19754)

* Improve CUDA graph capture

Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:

- The first call always incurs CUDA graph capture overhead, even when the graph is unstable
- Once permanently disabled, CUDA graphs are never re-enabled, even after the graph stabilizes (e.g., when switching from prompt processing to decode)

The new approach delays CUDA graph activation until a warmup phase completes: the same cgraph must be submitted at least twice in a row with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs, and graphs that later stabilize become eligible for capture again instead of staying disabled forever.
This also fixes issues such as https://github.com/ggml-org/llama.cpp/discussions/19708

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <redacted>
* Remove EM dashes

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: Aman Gupta <redacted>
src/ggml-cuda/common.cuh
src/ggml-cuda/ggml-cuda.cu