Arm AArch64: Documentation updates (#9321)

author Dan Johansson <redacted>

Mon, 9 Sep 2024 07:02:45 +0000 (09:02 +0200)

committer GitHub <redacted>

Mon, 9 Sep 2024 07:02:45 +0000 (10:02 +0300)
diff --git a/docs/build.md b/docs/build.md
index 152d46d6f31af345f1afaeaf6e599c05e60c67e7..faa0ecfa49768a3bf2116d2f87c8078cc8819c15 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -380,3 +380,9 @@ For detailed info, such as model/device supports, CANN install, please refer to
  ### Android
  
  To read documentation for how to build on Android, [click here](./android.md)
+
+### Arm CPU optimized mulmat kernels
+
+Llama.cpp includes a set of optimized mulmat kernels for the Arm architecture, leveraging Arm® Neon™, int8mm and SVE instructions. These kernels are enabled at build time through the appropriate compiler cpu-type flags, such as `-DCMAKE_C_FLAGS=-march=armv8.2-a+i8mm+sve`. Note that these optimized kernels require the model to be quantized into one of the formats: `Q4_0_4_4` (Arm Neon), `Q4_0_4_8` (int8mm) or `Q4_0_8_8` (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the `Q4_0_4_8` (int8mm) or `Q4_0_4_4` (Arm Neon) formats for better performance. Refer to [examples/quantize/README.md](../examples/quantize/README.md) for more information on the quantization formats.
+
+To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
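
The format guidance in this hunk can be read as a small decision rule. The sketch below is only an illustration of that rule: `CpuFeatures`, `pick_quant_format` and `needs_llamafile_off` are hypothetical names, not part of llama.cpp, and preferring int8mm over plain Neon when both are present is an assumption (the text lists both as alternatives without ranking them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuFeatures:
    """Hypothetical description of the target Arm CPU (not a llama.cpp type)."""
    neon: bool = True
    i8mm: bool = False
    sve: bool = False
    sve_bits: int = 0  # SVE vector width in bits; 0 when SVE is absent


def pick_quant_format(cpu: CpuFeatures) -> str:
    """Pick a Q4_0 variant following the guidance in the added docs text."""
    # The SVE kernel is documented as requiring a 256-bit vector width.
    if cpu.sve and cpu.sve_bits == 256:
        return "Q4_0_8_8"
    # Assumption: prefer the int8mm format over plain Neon when both exist.
    if cpu.i8mm:
        return "Q4_0_4_8"
    if cpu.neon:
        return "Q4_0_4_4"
    # Assumption: fall back to the plain format when no Arm kernel applies.
    return "Q4_0"


def needs_llamafile_off(fmt: str) -> bool:
    """Q4_0_4_4 is documented as needing a build with llamafile disabled."""
    return fmt == "Q4_0_4_4"


if __name__ == "__main__":
    cpu = CpuFeatures(neon=True, i8mm=True, sve=True, sve_bits=128)
    fmt = pick_quant_format(cpu)
    print(fmt, "llamafile must be off:", needs_llamafile_off(fmt))
```

A device with 128-bit SVE falls through to the int8mm or Neon formats here, matching the recommendation for vector widths other than 256 bits.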
diff --git a/examples/quantize/README.md b/examples/quantize/README.md
index 5d1e11c67b13fbea219bb0bc009c7a0ab3f66809..704f0d56bea72686a8948027340bf9341770b237 100644
--- a/examples/quantize/README.md
+++ b/examples/quantize/README.md
@@ -54,6 +54,8 @@ As the models are currently fully loaded into memory, you will need adequate dis
  
  Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
  
+The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block interleaved variants of the `Q4_0` format, providing a data layout that is better suited for specific implementations of optimized mulmat kernels. Since these formats differ only in data layout, they have the same quantized size as the `Q4_0` format.
+
  *(outdated)*
  
  | Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
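
To make the "differ only in data layout" point from the README hunk concrete, here is a toy sketch of block interleaving. It is not the ggml implementation: `interleave_rows` is a hypothetical helper, each row stands in for the 16 quant bytes of one `Q4_0` block, and the storage of per-block scales and any value transformations in the real formats are not modeled.

```python
def interleave_rows(rows: list[bytes], chunk: int) -> bytes:
    """Interleave equal-length byte rows in fixed-size chunks (toy layout)."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all rows must have the same length")
    if length % chunk != 0:
        raise ValueError("row length must be a multiple of chunk")
    out = bytearray()
    # Take `chunk` bytes from row 0, then row 1, ... then advance the offset.
    for offset in range(0, length, chunk):
        for row in rows:
            out += row[offset:offset + chunk]
    return bytes(out)


# Four rows of 16 bytes each; row r is filled with the value r.
rows = [bytes([r]) * 16 for r in range(4)]
packed = interleave_rows(rows, chunk=4)

if __name__ == "__main__":
    print(len(packed), list(packed[:16]))
```

The interleaved buffer contains exactly the same bytes as the original rows in a different order, which is why an interleaved format occupies the same space as `Q4_0`.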