server: docs - refresh and tease a little bit more the http server (#5718)
author Pierrick Hymbert <redacted>
Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)
committer GitHub <redacted>
Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)
* server: docs - refresh and tease a little bit more the http server

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update README.md

---------

Co-authored-by: Georgi Gerganov <redacted>
README.md
examples/server/README.md

index d61f9171b1b62ec23be013450c0f31343d8b0e0d..d0af5d0b9b077c7ec430827ef9288e7afb76c129 100644 (file)
--- a/README.md
+++ b/README.md
@@ -114,6 +114,9 @@ Typically finetunes of the base models below are supported as well.
 - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
 - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
 
+**HTTP server**
+
+[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
 
 **Bindings:**
 
index cb3fd6054095b949745a91dac004464648f4c2c6..0e9bd7fd404bab192a6fac8bf888514e226b096e 100644 (file)
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -1,8 +1,20 @@
-# llama.cpp/example/server
+# LLaMA.cpp HTTP Server
 
-This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
+Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
 
-Command line options:
+Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
+
+**Features:**
+ * LLM inference of F16 and quantized models on GPU and CPU
+ * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+ * Parallel decoding with multi-user support
+ * Continuous batching
+ * Multimodal (wip)
+ * Monitoring endpoints
+
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
+
+**Command line options:**
 
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.
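
The feature list added above also mentions monitoring endpoints. The sketch below shows how a client might poll the server's health route before sending work; it assumes a `/health` route returning a JSON status on the default `http://localhost:8080`, so treat the route name and response shape as assumptions to check against the endpoint documentation in this README.

```python
# Hypothetical readiness check against a llama.cpp server's monitoring endpoint.
# Assumptions: server on http://localhost:8080 and a /health route that reports
# whether the model has finished loading.
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:8080", retries: int = 30) -> bool:
    """Poll the health endpoint until the server reports it is ready."""
    for _ in range(retries):
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return True
        except requests.ConnectionError:
            pass  # server not up yet
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("server ready" if wait_until_ready() else "server not reachable")
```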