Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115)

author Isaac McFadyen <redacted>

Wed, 26 Jun 2024 06:29:28 +0000 (02:29 -0400)

committer GitHub <redacted>

Wed, 26 Jun 2024 06:29:28 +0000 (08:29 +0200)
diff --git a/README.md b/README.md
index a54ee3951d41dc121a4ba118ad2af0dce2c25742..95d970d8382b37ed7ce495f0cdfce575702f537e 100644
--- a/README.md
+++ b/README.md
@@ -511,7 +511,7 @@ Building the program with BLAS support may lead to some performance improvements
    | LLAMA_CUDA_FORCE_DMMV          | Boolean                | false   | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
    | LLAMA_CUDA_DMMV_X              | Positive integer >= 32 | 32      | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants.                                         |
    | LLAMA_CUDA_MMV_Y               | Positive integer       | 1       | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended.                                                                                                                                         |
-  | LLAMA_CUDA_FORCE_MMQ           | Boolean                | false   | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3). Speed for large batch sizes will be worse but VRAM consumption will be lower.                    |
+  | LLAMA_CUDA_FORCE_MMQ           | Boolean                | false   | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower.                       |
    | LLAMA_CUDA_FORCE_CUBLAS        | Boolean                | false   | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models                                                                                                                                                                                       |
    | LLAMA_CUDA_F16                 | Boolean                | false   | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs.                                                           |
    | LLAMA_CUDA_KQUANTS_ITER        | 1 or 2                 | 2       | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs.                                                                                                                     |