arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
author    Yibo Cai <redacted>
          Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)
committer Georgi Gerganov <redacted>
          Mon, 19 May 2025 10:37:56 +0000 (13:37 +0300)
commit    5383eeca8b6af295d42296e8dc8b4d60c31f495a
tree      80521944fac1e26cbeae85af757b5e08c14b518b
parent    9f87acbcffb22c26fe9359d40afb07fc0eb10901

This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on Neoverse-N2 with a Llama 3 8B Q6_K quantized model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch sizes 4 and above

Perplexity is unchanged by this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
src/ggml-cpu/ggml-cpu-quants.c
src/ggml-cpu/ggml-cpu.c