llama : do not cap thread count when MoE on CPU (#5419)

author Paul Tsochantaris <redacted>

Fri, 9 Feb 2024 10:48:06 +0000 (10:48 +0000)

committer GitHub <redacted>

Fri, 9 Feb 2024 10:48:06 +0000 (12:48 +0200)
diff --git a/llama.cpp b/llama.cpp

index db7d1c1cd18ee9cddb8b07d934e8112e61a83073..0566b087b2e12e5775f2a999f19a301dd6287133 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -7285,7 +7285,9 @@ static int llama_decode_internal(
      // TODO: this is mostly important for Apple Silicon where CBLAS is still performing very well
      //       we still need some threads to process all non-mul_mat ops, but not too much to avoid interfering
      //       with the BLAS calls. need a better solution
-    if (n_tokens >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
+    // MoE Special Case: This logic applies when hparams.n_expert == 0, i.e. the model is NOT an MoE model. When an MoE is
+    //                   being processed then Accelerate/BLAS will not be involved, so capping would limit performance.
+    if (n_tokens >= 32 && hparams.n_expert == 0 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
          n_threads = std::min(4, n_threads);
      }
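
The condition changed in this hunk can be read in isolation as a small pure function. The following is only a sketch for illustration: `pick_n_threads` and its parameters are hypothetical names that do not exist in llama.cpp, and the two boolean arguments stand in for the results of `ggml_cpu_has_blas()` and `ggml_cpu_has_gpublas()`. It assumes, as the comment in the patch states, that BLAS is not involved when an MoE model is processed.

```cpp
// Hypothetical helper (not part of llama.cpp) mirroring the patched condition.
// n_expert stands in for hparams.n_expert; has_blas and has_gpublas stand in
// for ggml_cpu_has_blas() and ggml_cpu_has_gpublas().
constexpr int pick_n_threads(int n_tokens, unsigned n_expert,
                             bool has_blas, bool has_gpublas, int n_threads) {
    // Cap at 4 threads only for large batches on a CPU BLAS build with a
    // non-MoE model; in every other case keep the requested thread count.
    return (n_tokens >= 32 && n_expert == 0 && has_blas && !has_gpublas)
        ? (n_threads < 4 ? n_threads : 4)
        : n_threads;
}

// Dense model (n_expert == 0), CPU BLAS, batch of 64 tokens: capped at 4.
static_assert(pick_n_threads(64, 0, true, false, 8) == 4, "dense model is capped");
// MoE model with 8 experts, same setup: all 8 requested threads are kept.
static_assert(pick_n_threads(64, 8, true, false, 8) == 8, "MoE model is not capped");
```

Before this commit the `n_expert == 0` term was absent, so the second case would also have been reduced to 4 threads.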