rpc : update README for cache usage (#12620)

author Radoslav Gerganov <redacted>

Fri, 28 Mar 2025 07:44:13 +0000 (09:44 +0200)

committer GitHub <redacted>

Fri, 28 Mar 2025 07:44:13 +0000 (09:44 +0200)
diff --git a/examples/rpc/README.md b/examples/rpc/README.md
index 312bb634dc9200e28b434cf466010e4cab37c7c3..561f19fda6b06fb48db149984483d7ace6a5fa0b 100644
--- a/examples/rpc/README.md
+++ b/examples/rpc/README.md
@@ -72,3 +72,14 @@ $ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name
  
  This way you can offload model layers to both local and remote devices.
  
+### Local cache
+
+The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
+This can speed up model loading significantly, especially when using large models.
+To enable the cache, use the `-c` option:
+
+```bash
+$ bin/rpc-server -c
+```
+
+By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory; the location can be changed via the `LLAMA_CACHE` environment variable.
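
As a rough illustration of the `LLAMA_CACHE` override described in the added README text, the sketch below prepares a custom cache location before starting the server. The directory name `llama-rpc-cache-demo` and the `cache_root` variable are hypothetical, chosen only for this example. How the server lays out files beneath that path is not stated in the README, so treat the final location as an assumption to verify on your system.

```bash
# Hypothetical cache location for illustration; any writable directory should do.
cache_root="${TMPDIR:-/tmp}/llama-rpc-cache-demo"

# Create the directory up front so a permissions problem shows up here,
# not when the server first tries to write to it.
mkdir -p "$cache_root"

# Export the variable so a server started from this shell inherits it.
export LLAMA_CACHE="$cache_root"
echo "LLAMA_CACHE=$LLAMA_CACHE"

# Then start the server with the cache enabled, as in the README:
#   bin/rpc-server -c
```

The server command is left as a comment so the snippet can be run on a machine without a build of `rpc-server`; uncomment it once the binary is available.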