From: Georgi Gerganov
Date: Tue, 1 Oct 2024 15:33:35 +0000 (+0300)
Subject: readme : refresh
X-Git-Tag: upstream/0.0.1642~320
X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=6ebf0cf75db1739b6c8b26ccca3f5029ab35fe4a;p=pkg%2Fggml%2Fsources%2Fggml

readme : refresh
---

diff --git a/README.md b/README.md
index f15f3808..67ee210c 100644
--- a/README.md
+++ b/README.md
@@ -9,51 +9,26 @@ Some of the development is currently happening in the [llama.cpp](https://github
 
 ## Features
 
-- Written in C
-- 16-bit float support
-- Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
+- Low-level cross-platform implementation
+- Integer quantization support
+- Broad hardware support
 - Automatic differentiation
 - ADAM and L-BFGS optimizers
-- Optimized for Apple Silicon
-- On x86 architectures utilizes AVX / AVX2 intrinsics
-- On ppc64 architectures utilizes VSX intrinsics
 - No third-party dependencies
 - Zero memory allocations during runtime
 
-## Updates
-
-- [X] Example of GPT-2 inference [examples/gpt-2](https://github.com/ggerganov/ggml/tree/master/examples/gpt-2)
-- [X] Example of GPT-J inference [examples/gpt-j](https://github.com/ggerganov/ggml/tree/master/examples/gpt-j)
-- [X] Example of Whisper inference [ggerganov/whisper.cpp](https://github.com/ggerganov/whisper.cpp)
-- [X] Example of LLaMA inference [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
-- [X] Example of LLaMA training [ggerganov/llama.cpp/examples/baby-llama](https://github.com/ggerganov/llama.cpp/tree/master/examples/baby-llama)
-- [X] Example of Falcon inference [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp)
-- [X] Example of BLOOM inference [NouamaneTazi/bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp)
-- [X] Example of RWKV inference [saharNooby/rwkv.cpp](https://github.com/saharNooby/rwkv.cpp)
-- [X] Example of SAM inference [examples/sam](https://github.com/ggerganov/ggml/tree/master/examples/sam)
-- [X] Example of BERT inference [skeskinen/bert.cpp](https://github.com/skeskinen/bert.cpp)
-- [X] Example of BioGPT inference [PABannier/biogpt.cpp](https://github.com/PABannier/biogpt.cpp)
-- [X] Example of Encodec inference [PABannier/encodec.cpp](https://github.com/PABannier/encodec.cpp)
-- [X] Example of CLIP inference [monatis/clip.cpp](https://github.com/monatis/clip.cpp)
-- [X] Example of MiniGPT4 inference [Maknee/minigpt4.cpp](https://github.com/Maknee/minigpt4.cpp)
-- [X] Example of ChatGLM inference [li-plus/chatglm.cpp](https://github.com/li-plus/chatglm.cpp)
-- [X] Example of Stable Diffusion inference [leejet/stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)
-- [X] Example of Qwen inference [QwenLM/qwen.cpp](https://github.com/QwenLM/qwen.cpp)
-- [X] Example of YOLO inference [examples/yolo](https://github.com/ggerganov/ggml/tree/master/examples/yolo)
-- [X] Example of ViT inference [staghado/vit.cpp](https://github.com/staghado/vit.cpp)
-- [X] Example of multiple LLMs inference [foldl/chatllm.cpp](https://github.com/foldl/chatllm.cpp)
-- [X] SeamlessM4T inference *(in development)* https://github.com/facebookresearch/seamless_communication/tree/main/ggml
-
-## Python environment setup and building the examples
+## Build
 
 ```bash
 git clone https://github.com/ggerganov/ggml
 cd ggml
-# Install python dependencies in a virtual environment
-python3.10 -m venv ggml_env
-source ./ggml_env/bin/activate
+
+# install python dependencies in a virtual environment
+python3.10 -m venv .venv
+source .venv/bin/activate
 pip install -r requirements.txt
-# Build the examples
+
+# build the examples
 mkdir build && cd build
 cmake ..
 cmake --build . --config Release -j 8
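For readers skimming the refreshed feature list above: "automatic differentiation" and "zero memory allocations during runtime" refer to ggml's graph-based API, where every tensor lives in a context buffer that is reserved once at startup. Below is a minimal sketch of that usage pattern, not part of this commit, and written against the classic ggml API; names such as `ggml_build_forward_expand` and `ggml_graph_compute_with_ctx`, as well as the header location, have shifted between versions, so check the `ggml.h` of your checkout for the current signatures.

```c
#include "ggml.h" // header path varies across versions (include/ggml.h or include/ggml/ggml.h)

#include <stdio.h>

int main(void) {
    // all memory is reserved up front; evaluating the graph allocates nothing
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,   // let ggml allocate the buffer once
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // f(x) = a*x^2 + b
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    ggml_set_param(ctx, x); // mark x as differentiable, so a backward graph could be built

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, ggml_mul(ctx, x, x)), b);

    // build the forward graph, set the inputs, compute
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, f);

    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    printf("f = %.1f\n", ggml_get_f32_1d(f, 0)); // 3*2^2 + 4 = 16.0

    ggml_free(ctx);
    return 0;
}
```

Because the context buffer is fixed at `ggml_init` time, graph evaluation itself performs no heap allocations, which is what the last bullet of the feature list refers to.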
@@ -61,52 +36,15 @@ cmake --build . --config Release -j 8
 
 ## GPT inference (example)
 
-With ggml you can efficiently run [GPT-2](examples/gpt-2) and [GPT-J](examples/gpt-j) inference on the CPU.
-
-Here is how to run the example programs:
-
 ```bash
-# Run the GPT-2 small 117M model
+# run the GPT-2 small 117M model
 ../examples/gpt-2/download-ggml-model.sh 117M
 ./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
-
-# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
-../examples/gpt-j/download-ggml-model.sh 6B
-./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"
-
-# Run the Cerebras-GPT 111M model
-# Download from: https://huggingface.co/cerebras
-python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
-./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
 ```
 
-The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:
-
-| Model | Size | Time / Token |
-| --- | --- | --- |
-| GPT-2 | 117M | 5 ms |
-| GPT-2 | 345M | 12 ms |
-| GPT-2 | 774M | 23 ms |
-| GPT-2 | 1558M | 42 ms |
-| --- | --- | --- |
-| GPT-J | 6B | 125 ms |
-
 For more information, checkout the corresponding programs in the [examples](examples) folder.
 
-## Using Metal (only with GPT-2)
-
-For GPT-2 models, offloading to GPU is possible. Note that it will not improve inference performances but will reduce power consumption and free up the CPU for other tasks.
-
-To enable GPU offloading on MacOS:
-
-```bash
-cmake -DGGML_METAL=ON -DBUILD_SHARED_LIBS=Off ..
-
-# add -ngl 1
-./bin/gpt-2 -t 4 -ngl 100 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
-```
-
-## Using cuBLAS
+## Using CUDA
 
 ```bash
 # fix the path to point to your CUDA compiler
@@ -145,24 +83,20 @@ cmake .. \
 
 ```
 
 ```bash
-# Create directories
+# create directories
 adb shell 'mkdir /data/local/tmp/bin'
 adb shell 'mkdir /data/local/tmp/models'
 
-# Push the compiled binaries to the folder
+# push the compiled binaries to the folder
 adb push bin/* /data/local/tmp/bin/
 
-# Push the ggml library
+# push the ggml library
 adb push src/libggml.so /data/local/tmp/
 
-# Push model files
+# push model files
 adb push models/gpt-2-117M/ggml-model.bin /data/local/tmp/models/
-
-# Now lets do some inference ...
 adb shell
-
-# Now we are in shell
 cd /data/local/tmp
 export LD_LIBRARY_PATH=/data/local/tmp
 ./bin/gpt-2-backend -m models/ggml-model.bin -p "this is an example"
@@ -170,7 +104,5 @@ export LD_LIBRARY_PATH=/data/local/tmp
 
 ## Resources
 
-- [GGML - Large Language Models for Everyone](https://github.com/rustformers/llm/blob/main/crates/ggml/README.md): a description of the GGML format provided by the maintainers of the `llm` Rust crate, which provides Rust bindings for GGML
-- [marella/ctransformers](https://github.com/marella/ctransformers): Python bindings for GGML models.
-- [go-skynet/go-ggml-transformers.cpp](https://github.com/go-skynet/go-ggml-transformers.cpp): Golang bindings for GGML models
-- [smspillaz/ggml-gobject](https://github.com/smspillaz/ggml-gobject): GObject-introspectable wrapper for use of GGML on the GNOME platform.
+- [Introduction to ggml](https://huggingface.co/blog/introduction-to-ggml)
+- [The GGUF file format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
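The GGUF document linked above describes the on-disk container that ggml-based projects use for models. As a quick orientation to its layout, here is a sketch, not part of this commit, that reads only the fixed-size GGUF header: the `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key/value counts (assuming the GGUF v2+ layout and a little-endian host). Real applications should use the `gguf_*` API in ggml, e.g. `gguf_init_from_file`, rather than parsing by hand.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char ** argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    FILE * f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char     magic[4];
    uint32_t version   = 0;
    uint64_t n_tensors = 0;
    uint64_t n_kv      = 0;

    // header layout per docs/gguf.md: magic, version, tensor count, metadata KV count
    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }

    int ok = fread(&version,   sizeof(version),   1, f) == 1 &&
             fread(&n_tensors, sizeof(n_tensors), 1, f) == 1 &&
             fread(&n_kv,      sizeof(n_kv),      1, f) == 1;
    fclose(f);

    if (!ok) {
        fprintf(stderr, "truncated header\n");
        return 1;
    }

    printf("GGUF v%u: %llu tensors, %llu metadata key/value pairs\n",
           (unsigned) version, (unsigned long long) n_tensors, (unsigned long long) n_kv);

    return 0;
}
```

Run against any `.gguf` model file, this prints something like `GGUF v3: 201 tensors, 24 metadata key/value pairs` (numbers depend on the model).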