From: Paul Tsochantaris
Date: Fri, 9 Feb 2024 10:48:06 +0000 (+0000)
Subject: llama : do not cap thread count when MoE on CPU (#5419)
X-Git-Tag: upstream/0.0.4488~2378
X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=e5ca3937c685d6e012ac4db40555d6ec100ff03c;p=pkg%2Fggml%2Fsources%2Fllama.cpp

llama : do not cap thread count when MoE on CPU (#5419)

* Not capping thread count when MoE inference is running on CPU

* Whitespace
---

diff --git a/llama.cpp b/llama.cpp
index db7d1c1c..0566b087 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -7285,7 +7285,9 @@ static int llama_decode_internal(
     // TODO: this is mostly important for Apple Silicon where CBLAS is still performing very well
     // we still need some threads to process all non-mul_mat ops, but not too much to avoid interfering
     // with the BLAS calls. need a better solution
-    if (n_tokens >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
+    // MoE Special Case: This logic applies when hparams.n_expert == 0, i.e. the model is NOT an MoE model. When an MoE is
+    // being processed then Accelerate/BLAS will not be involved, so capping would limit performance.
+    if (n_tokens >= 32 && hparams.n_expert == 0 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
         n_threads = std::min(4, n_threads);
     }
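
For readers skimming the patch, the gate it changes can be read in isolation as the sketch below: a minimal, self-contained rendering of the new logic, not llama.cpp code. The names effective_n_threads, has_blas, has_gpublas, and n_expert are hypothetical stand-ins for the real ggml_cpu_has_blas()/ggml_cpu_has_gpublas() calls and the hparams.n_expert field.

#include <algorithm>
#include <cstdio>

// Hypothetical free function mirroring the patched gate in llama_decode_internal.
static int effective_n_threads(int n_threads, int n_tokens, int n_expert,
                               bool has_blas, bool has_gpublas) {
    // For large batches on a CPU BLAS backend (e.g. Accelerate on Apple Silicon),
    // the heavy mul_mat work is handed to BLAS, so a few threads suffice for the
    // remaining ops and more would only contend with the BLAS thread pool.
    // MoE models (n_expert > 0) do not route their expert mul_mats through BLAS,
    // so the cap is skipped for them.
    if (n_tokens >= 32 && n_expert == 0 && has_blas && !has_gpublas) {
        return std::min(4, n_threads);
    }
    return n_threads;
}

int main() {
    // Dense model, large batch, CPU BLAS: capped to 4 threads.
    std::printf("dense: %d\n", effective_n_threads(8, 512, 0, true, false));
    // MoE model (e.g. 8 experts), same batch: cap skipped, keeps all 8 threads.
    std::printf("moe:   %d\n", effective_n_threads(8, 512, 8, true, false));
    return 0;
}

Under these assumptions, the only behavioral change from the patch is the n_expert == 0 term: dense models keep the existing 4-thread cap for BLAS-backed batches, while MoE models retain the full thread count.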