llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)
author    shalinib-ibm <redacted>
          Thu, 19 Feb 2026 06:28:53 +0000 (11:58 +0530)
committer Georgi Gerganov <redacted>
          Fri, 27 Feb 2026 18:57:58 +0000 (20:57 +0200)
commit    cc9e5cf89d5e324e4974c62ef8ccd770ffc42252
tree      3be58e09f9605c4d80015b32865299ebc01a8b6a
parent    8b3a52ba871d092e619835b0bf844d9f11e3a6c8
llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)

Avoid the xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using the FP16×FP16→FP32 MMA path instead. This removes
the bias-correction post-processing overhead and improves prompt-processing
throughput.
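
The idea can be sketched as follows (a minimal illustration, not the actual
sgemm.cpp code): once a quantized block is dequantized to floating point, the
MMA accumulate operates on true signed values, so no bias term has to be
subtracted afterwards. The `BlockQ8` struct and function name below are
hypothetical simplifications; the real ggml `block_q8_0` stores its scale as
FP16, and the kernel converts to vector-FP16 operands for the MMA builtins.

```cpp
#include <cstdint>

// Hypothetical simplified Q8_0-style block: one scale + 32 signed quants.
// (ggml's block_q8_0 stores the scale as FP16; a plain float keeps this
// sketch self-contained and portable.)
struct BlockQ8 {
    float  d;        // per-block scale
    int8_t qs[32];   // signed 8-bit quantized values
};

// Dequantize one block before the matmul. On the PowerPC path the resulting
// values would be packed as FP16 and fed to FP16xFP16->FP32 MMA instructions
// instead of xvi8ger4pp, so the signed->unsigned bias correction disappears.
// float is used here in place of FP16 for portability of the sketch.
static void dequantize_q8_block(const BlockQ8 &b, float out[32]) {
    for (int i = 0; i < 32; ++i) {
        out[i] = b.d * (float) b.qs[i];  // value = scale * signed quant
    }
}
```

The trade-off is an extra dequantization pass over the inputs, which the
numbers below show is more than paid back by avoiding the per-tile
bias-correction step.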

Performance impact:
1.5x to 2x improvement in PP_Speed (prompt-processing throughput) for Q4 and
Q8 models, measured with llama-bench and llama-batched-bench.
Q8 model: granite-4.0-h-micro-Q8_0.gguf (from Hugging Face)
Q4 model: Meta-Llama3-8b Q4 model (generated with llama-quantize from the
F32 model)

llama-bench Q8 Model Results:
 model                                  size       params   backend      threads              test            Base t/s             Patch t/s
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10               pp8           64.48 ± 4.72           73.99 ± 0.27
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp16           80.11 ± 0.32          112.53 ± 0.40
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp32           89.10 ± 0.27          152.95 ± 0.68
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10              pp64           93.65 ± 0.25          187.83 ± 0.83
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp128           99.93 ± 0.02          201.32 ± 0.11
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp256          102.32 ± 0.40          208.32 ± 0.41
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             pp512          103.42 ± 0.40          209.98 ± 0.14
 granitehybrid 3B Q8_0              3.16 GiB       3.19 B   CPU               10             tg128           20.35 ± 0.01           19.57 ± 0.01

llama-bench Q4 Model Results:
 model                                  size       params   backend      threads              test            Base t/s             Patch t/s
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10               pp8           34.77 ± 0.10           41.23 ± 0.08
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp16           40.81 ± 0.04           64.55 ± 0.15
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp32           44.65 ± 0.05           90.84 ± 0.22
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10              pp64           47.49 ± 0.03          114.39 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp128           49.29 ± 0.24          120.13 ± 0.19
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp256           49.77 ± 0.23          121.51 ± 0.11
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             pp512           49.89 ± 0.23          117.52 ± 0.10
 llama 8B Q4_0                      4.33 GiB       8.03 B   CPU               10             tg128           13.40 ± 0.01           13.37 ± 0.00

llama-perplexity results:

Model                       Base final PPL estimate    Patch final PPL estimate
granite-4.0-h-micro-Q8_0    1.3862 +/- 0.04424         1.3868 +/- 0.04432
Meta-Llama3-8b Q4           1.3801 +/- 0.04116         1.3803 +/- 0.04116

Signed-off-by: Shalini.Salomi.Bodapati <redacted>
ggml/src/ggml-cpu/llamafile/sgemm-ppc.h [deleted file]
ggml/src/ggml-cpu/llamafile/sgemm.cpp