[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042)
author    Gaurav Garg <redacted>
          Tue, 27 Jan 2026 06:52:44 +0000 (06:52 +0000)
committer GitHub <redacted>
          Tue, 27 Jan 2026 06:52:44 +0000 (08:52 +0200)
commit    a83c73a18aaffba253ffd01e7cd3af41feaf8179
tree      f76b0927da97231d6b16e52265542bdf6dca9730
parent    fc3cdf32ce5ea3017299d2afb947d3ba9844445a

* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism enabled, the CPU-side CUDA command buffer fills up during prompt processing, stalling the CPU. As a result, not enough work is submitted to the GPU, leaving bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
docs/build.md
ggml/src/ggml-cuda/ggml-cuda.cu