- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
**HTTP server**

[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
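For example, a local GGUF model can be served with a single command; this is a minimal sketch, and the model path is a placeholder rather than a bundled default:

```sh
# launch the server on a local model (placeholder path; use your own GGUF file)
./server -m models/7B/ggml-model-q4_0.gguf -c 2048 --host 127.0.0.1 --port 8080
```

Existing OpenAI clients can then be pointed at `http://127.0.0.1:8080/v1` in place of the OpenAI endpoint.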
# LLaMA.cpp HTTP Server

Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.

Set of LLM REST APIs and a simple web front end to interact with llama.cpp.

**Features:**
 * LLM inference of F16 and quantized models on GPU and CPU
 * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes (see the example after this list)
 * Parallel decoding with multi-user support
 * Continuous batching
 * Multimodal support (work in progress)
 * Monitoring endpoints

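As a sketch of the OpenAI-compatible route, the request below assumes a server is already listening on `localhost:8080`; the message contents are illustrative:

```sh
# chat completion request in the OpenAI format
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about llamas."}
        ]
    }'
```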
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).

**Command line options:**
- `--threads N`, `-t N`: Set the number of threads to use during generation.
- `-tb N`, `--threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, this defaults to the number of threads used for generation.
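For instance, generation and batch threads can be tuned independently; the thread counts below are illustrative, not recommendations:

```sh
# 8 threads for generation, 16 for prompt/batch processing (placeholder model path)
./server -m models/7B/ggml-model-q4_0.gguf -t 8 -tb 16
```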