git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042)
author Gaurav Garg <redacted>
Tue, 27 Jan 2026 06:52:44 +0000 (06:52 +0000)
committer Georgi Gerganov <redacted>
Fri, 30 Jan 2026 13:56:40 +0000 (15:56 +0200)
commit 5fcbbdc0ddda79214fe40d828fc338a4f28c29ff
tree 8c9d1adf71c8981fef1cbffacb2b31c9a47ed59c
parent b2e2032856d189c158aafacf853b5fb353461923

* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, the CPU-side CUDA command buffer fills up during prompt processing and stalls the CPU. As a result, too little work is submitted to the GPU, which leaves bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
ggml/src/ggml-cuda/ggml-cuda.cu