llama-bench : clarify benchmarked parts of the computation (#16823)
author     Georgi Gerganov <redacted>  Tue, 28 Oct 2025 17:41:43 +0000 (19:41 +0200)
committer  GitHub <redacted>           Tue, 28 Oct 2025 17:41:43 +0000 (19:41 +0200)
tools/llama-bench/README.md

index ead4da45e2957427507a794e484d1a2b896dba6f..87d9c0a219bd82878989e95c41ec10fa823ce842 100644 (file)
@@ -82,6 +82,9 @@ Using the `-d <n>` option, each test can be run at a specified context depth, pr
 
 For a description of the other options, see the [main example](../main/README.md).
 
+> [!NOTE]
+> The measurements reported by `llama-bench` do not include the time spent on tokenization and sampling.
+
 ## Examples
 
 ### Text generation with different models
@@ -131,7 +134,7 @@ $ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
 | llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |         16 | pp 64      |     33.52 ± 0.03 |
 | llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |         16 | tg 16      |     15.32 ± 0.05 |
 | llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |         32 | pp 64      |     59.00 ± 1.11 |
-| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |         32 | tg 16      |     16.41 ± 0.79 ||
+| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CPU        |         32 | tg 16      |     16.41 ± 0.79 |
 
 ### Different numbers of layers offloaded to the GPU
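
As a worked reading of the note introduced above (an editor's illustration, not part of the commit): since the reported rates exclude tokenization and sampling, the decode time implied by a table row is simply the token count divided by the rate. Taking the `tg 16` row at 32 threads from the example table:

$$
t_{\text{decode}} \approx \frac{n_{\text{tokens}}}{\text{rate}} = \frac{16\ \text{tokens}}{16.41\ \text{t/s}} \approx 0.98\ \text{s}
$$

An end-to-end wall-clock measurement of the same run would be somewhat longer, because it additionally includes the tokenization and sampling time that `llama-bench` leaves out.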