]> git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
authorYibo Cai <redacted>
Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)
committerGeorgi Gerganov <redacted>
Mon, 19 May 2025 11:58:39 +0000 (14:58 +0300)
commit8e9bf548f4c30869902c038975ac040daf55244b
treea6d42b8b68d216318bf9447429112b1e51468a42
parent0dda27bc0bbd761fb2dd216d1a32bab660a42cda
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)

This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
ggml/src/ggml-cpu/ggml-cpu-quants.c
ggml/src/ggml-cpu/ggml-cpu.c