main : update README documentation for batch size (#11353)

author Diego Devesa <redacted>

Wed, 22 Jan 2025 18:22:20 +0000 (19:22 +0100)

committer GitHub <redacted>

Wed, 22 Jan 2025 18:22:20 +0000 (19:22 +0100)
diff --git a/examples/main/README.md b/examples/main/README.md
index 17d80a622a8bbd3db6dbeb85247c861886147d1c..46f92eb7ae9d5ebac480e03406720ce8fade7a6d 100644
--- a/examples/main/README.md
+++ b/examples/main/README.md
@@ -310,9 +310,9 @@ These options help improve the performance and memory usage of the LLaMA models.
  
  ### Batch Size
  
--   `-b N, --batch-size N`: Set the batch size for prompt processing (default: `2048`). This large batch size benefits users who have BLAS installed and enabled it during the build. If you don't have BLAS enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it's evaluated in some situations.
+- `-ub N`, `--ubatch-size N`: Physical batch size. This is the maximum number of tokens that may be processed at a time. Increasing this value may improve performance during prompt processing, at the expense of higher memory usage. Default: `512`.
  
-- `-ub N`, `--ubatch-size N`: physical maximum batch size. This is for pipeline parallelization. Default: `512`.
+- `-b N`, `--batch-size N`: Logical batch size. Increasing this value above the value of the physical batch size may improve prompt processing performance when using multiple GPUs with pipeline parallelism. Default: `2048`.
  
  ### Prompt Caching
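
Note on the change: the relationship between the two options described in the new text can be illustrated with a small sketch. This is not code from the repository. It assumes that a prompt is first cut into logical batches of at most `--batch-size` tokens, that each logical batch is then processed in physical chunks of at most `--ubatch-size` tokens, and that the physical size is capped at the logical size. The function name `split_into_batches` and its parameters are hypothetical and used only for illustration.

```python
def split_into_batches(n_tokens, n_batch=2048, n_ubatch=512):
    """Sketch of how a prompt might be divided into logical and physical batches.

    Returns a list of logical batches; each logical batch is a list of
    (start, end) token ranges, one per physical micro-batch.
    The defaults mirror the documented defaults of -b and -ub.
    """
    if n_tokens < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("sizes must be positive and n_tokens non-negative")

    # Assumption: the physical batch size never exceeds the logical one.
    n_ubatch = min(n_ubatch, n_batch)

    schedule = []
    for b_start in range(0, n_tokens, n_batch):
        b_end = min(b_start + n_batch, n_tokens)
        # Each logical batch is processed in chunks of at most n_ubatch tokens.
        ubatches = [
            (u_start, min(u_start + n_ubatch, b_end))
            for u_start in range(b_start, b_end, n_ubatch)
        ]
        schedule.append(ubatches)
    return schedule


if __name__ == "__main__":
    # A 5000-token prompt with the default sizes.
    for i, ubatches in enumerate(split_into_batches(5000)):
        print(f"logical batch {i}: {len(ubatches)} physical batches {ubatches}")
```

Under these assumptions, raising `n_ubatch` reduces the number of physical chunks per logical batch (at the cost of more memory per chunk), while raising `n_batch` above `n_ubatch` yields several physical chunks per logical batch, which is the case the new text associates with pipeline parallelism across multiple GPUs.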