[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042)

author Gaurav Garg <redacted>

Tue, 27 Jan 2026 06:52:44 +0000 (06:52 +0000)

committer GitHub <redacted>

Tue, 27 Jan 2026 06:52:44 +0000 (08:52 +0200)
diff --git a/docs/build.md b/docs/build.md
index fce9361b2d6841686633954b3de337049b214062..4983cfcfeae03ef59f2d1aaae0563e74f8b0eefc 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -248,6 +248,14 @@ You may set the [cuda environmental variables](https://docs.nvidia.com/cuda/cuda
  CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
  ```
  
+#### CUDA_SCALE_LAUNCH_QUEUES
+
+The environment variable [`CUDA_SCALE_LAUNCH_QUEUES`](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/environment-variables.html#cuda-scale-launch-queues) controls the size of CUDA's command buffer, which determines how many GPU operations can be queued before the CPU must wait for the GPU to catch up. A larger buffer reduces CPU-side stalls and allows more work to be queued on a GPU.
+
+**Default behavior:** llama.cpp automatically sets `CUDA_SCALE_LAUNCH_QUEUES=4x`, which increases the CUDA command buffer to 4 times its default size. This optimization is particularly beneficial for **Multi-GPU setups with pipeline parallelism**, where it significantly improves prompt processing throughput by allowing more operations to be enqueued across GPUs.
+
+See PR [#19042](https://github.com/ggml-org/llama.cpp/pull/19042) for performance benchmarks and technical details.
+
  ### Unified Memory
  
  The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 99f0919a514786608c6fddff25f489678bf6245d..e9df0ea4a7c4398b5173fd12b4a0ece838270589 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -4876,6 +4876,16 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
          static std::mutex mutex;
          std::lock_guard<std::mutex> lock(mutex);
          if (!initialized) {
+            // Set CUDA_SCALE_LAUNCH_QUEUES before any CUDA API call to improve multi-GPU pipeline parallelism performance
+            // PR: https://github.com/ggml-org/llama.cpp/pull/19042
+            if (getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
+#ifdef _WIN32
+                _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
+#else
+                setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
+#endif // _WIN32
+            }
+
              ggml_backend_cuda_reg_context * ctx = new ggml_backend_cuda_reg_context;
              const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;