llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)
author    shalinib-ibm <redacted>
          Thu, 19 Feb 2026 06:28:53 +0000 (11:58 +0530)
committer Georgi Gerganov <redacted>
          Fri, 27 Feb 2026 18:57:58 +0000 (20:57 +0200)
commit    cc9e5cf89d5e324e4974c62ef8ccd770ffc42252
tree      3be58e09f9605c4d80015b32865299ebc01a8b6a
parent    8b3a52ba871d092e619835b0bf844d9f11e3a6c8
llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)

Avoid the xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using the FP16×FP16→FP32 MMA path instead. This removes
the bias-correction post-processing overhead and improves prompt-processing
throughput.
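
The idea can be sketched as follows (a minimal illustration, not the actual
sgemm.cpp code): once a quantized block is dequantized to floating point, the
MMA accumulate operates on true signed values, so no bias term has to be
subtracted afterwards. The `BlockQ8` struct and function name below are
hypothetical simplifications; the real ggml `block_q8_0` stores its scale as
FP16, and the kernel converts to vector-FP16 operands for the MMA builtins.

```cpp
#include <cstdint>

// Hypothetical simplified Q8_0-style block: one scale + 32 signed quants.
// (ggml's block_q8_0 stores the scale as FP16; a plain float keeps this
// sketch self-contained and portable.)
struct BlockQ8 {
    float  d;        // per-block scale
    int8_t qs[32];   // signed 8-bit quantized values
};

// Dequantize one block before the matmul. On the PowerPC path the resulting
// values would be packed as FP16 and fed to FP16xFP16->FP32 MMA instructions
// instead of xvi8ger4pp, so the signed->unsigned bias correction disappears.
// float is used here in place of FP16 for portability of the sketch.
static void dequantize_q8_block(const BlockQ8 &b, float out[32]) {
    for (int i = 0; i < 32; ++i) {
        out[i] = b.d * (float) b.qs[i];  // value = scale * signed quant
    }
}
```

The trade-off is an extra dequantization pass over the inputs, which the
numbers below show is more than paid back by avoiding the per-tile
bias-correction step.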

Performance impact:
1.5x to 2x improvement in PP_Speed (prompt-processing throughput) for Q4 and
Q8 models, measured with llama-bench and llama-batched-bench.
Q8 model: granite-4.0-h-micro-Q8_0.gguf (from Hugging Face)
Q4 model: Meta-Llama3-8b Q4 model (generated with llama-quantize from the
F32 model)

llama-bench Q8 Model Results:
 model                                  size       params   backend      threads              test            Base t/s             Patch t/s
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10               pp8           64.48 ± 4.72           73.99 ± 0.27
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp16           80.11 ± 0.32          112.53 ± 0.40
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp32           89.10 ± 0.27          152.95 ± 0.68
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp64           93.65 ± 0.25          187.83 ± 0.83
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp128           99.93 ± 0.02          201.32 ± 0.11
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp256          102.32 ± 0.40          208.32 ± 0.41
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp512          103.42 ± 0.40          209.98 ± 0.14
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             tg128           20.35 ± 0.01           19.57 ± 0.01

llama-bench Q4 Model Results:
 model                                  size       params   backend      threads              test            Base t/s             Patch t/s
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10               pp8           34.77 ± 0.10           41.23 ± 0.08
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp16           40.81 ± 0.04           64.55 ± 0.15
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp32           44.65 ± 0.05           90.84 ± 0.22
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp64           47.49 ± 0.03          114.39 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp128           49.29 ± 0.24          120.13 ± 0.19
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp256           49.77 ± 0.23          121.51 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp512           49.89 ± 0.23          117.52 ± 0.10
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             tg128           13.40 ± 0.01           13.37 ± 0.00

llama-perplexity results:

Model                       Base final PPL estimate    Patch final PPL estimate
granite-4.0-h-micro-Q8_0    1.3862 +/- 0.04424         1.3868 +/- 0.04432
Meta-Llama3-8b Q4           1.3801 +/- 0.04116         1.3803 +/- 0.04116

Signed-off-by: Shalini.Salomi.Bodapati <redacted>
ggml/src/ggml-cpu/llamafile/sgemm-ppc.h [deleted file]
ggml/src/ggml-cpu/llamafile/sgemm.cpp