doc : update documentation for --tensor-split (#15980)

author Radoslav Gerganov <redacted>

Sun, 14 Sep 2025 09:10:07 +0000 (12:10 +0300)

committer GitHub <redacted>

Sun, 14 Sep 2025 09:10:07 +0000 (12:10 +0300)
diff --git a/tools/main/README.md b/tools/main/README.md
index 4f16ad6b2b10ecd70f7ae14a10fde0cc0625c3d5..54e582de07db55f7d95ff301cba1b2604def534a 100644
--- a/tools/main/README.md
+++ b/tools/main/README.md
@@ -384,5 +384,5 @@ These options provide extra functionality and customization when running the LLa
  -   `--verbose-prompt`: Print the prompt before generating text.
  -   `--no-display-prompt`: Don't print prompt at generation.
  -   `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
--   `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
+-   `-ts SPLIT, --tensor-split SPLIT`: When using multiple devices this option controls how tensors should be split across devices. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each device should get in order. For example, "3,2" will assign 60% of the data to device 0 and 40% to device 1. By default, the data is split in proportion to VRAM, but this may not be optimal for performance. The list of the devices which are being used is printed on startup and can be different from the device list given by `--list-devices` or e.g. `nvidia-smi`.
  -   `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable  or in an OS-specific local cache.
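
The proportional split described for `--tensor-split` can be illustrated with a short sketch. This is not llama.cpp's parser; the helper name `split_fractions` is hypothetical, and the code only assumes what the updated text states: `SPLIT` is a comma-separated list of non-negative values, and each device receives its value divided by the sum of all values.

```python
def split_fractions(split: str) -> list[float]:
    """Normalize a comma-separated SPLIT string into per-device fractions.

    Illustrative only: mirrors the documented behaviour ("3,2" -> 60% / 40%),
    not the actual llama.cpp implementation.
    """
    values = [float(v) for v in split.split(",")]
    if any(v < 0 for v in values):
        raise ValueError("SPLIT values must be non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one SPLIT value must be positive")
    # Position i in the list corresponds to device i, in the order
    # printed on startup.
    return [v / total for v in values]


if __name__ == "__main__":
    for i, frac in enumerate(split_fractions("3,2")):
        print(f"device {i}: {frac:.0%}")
```

Under this reading, only the ratio matters: "3,2" and "6,4" describe the same split. The device indices refer to the list printed on startup, which, as the updated text notes, may differ from the order reported by `--list-devices` or `nvidia-smi`.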