"llama-gritlm",
"llama-imatrix",
"llama-infill",
- "llama-llava-cli",
+ "llama-mtmd-cli",
"llama-llava-clip-quantize-cli",
"llama-lookahead",
"llama-lookup",
"llama-lookup-create",
"llama-lookup-merge",
"llama-lookup-stats",
- "llama-minicpmv-cli",
"llama-parallel",
"llama-passkey",
"llama-perplexity",
--- /dev/null
+# MobileVLM
+
+Currently this implementation supports [MobileVLM-1.7B](https://huggingface.co/mtgv/MobileVLM-1.7B) / [MobileVLM_V2-1.7B](https://huggingface.co/mtgv/MobileVLM_V2-1.7B) variants.
+
+For more information, see [Meituan-AutoML/MobileVLM](https://github.com/Meituan-AutoML/MobileVLM).
+
+The implementation is based on llava and is compatible with both llava and MobileVLM. The usage is basically the same as for llava.
+
+Notice: The overall inference process is the same for both **MobileVLM** and **MobileVLM_V2**, but model conversion differs slightly. The conversion step that differs is shown below, using **MobileVLM-1.7B** as an example.
+
+## Usage
+
+Build the `llama-mtmd-cli` binary.
+
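+A minimal CMake build sketch (see the build documentation for platform-specific options; the target name is assumed to match the binary):
+
+```sh
+cmake -B build
+cmake --build build --config Release --target llama-mtmd-cli
+```
+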
+After building, run: `./llama-mtmd-cli` to see the usage. For example:
+
+```sh
+./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
+ --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
+ --chat-template deepseek
+```
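+
+To ask a question about a specific image in a single turn, you can also pass `--image` and `-p` (the image path and prompt below are only illustrative):
+
+```sh
+./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
+    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
+    --chat-template deepseek \
+    --image path/to/an/image.jpg \
+    -p "What is in the image?"
+```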
+
+## Model conversion
+
+1. Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally:
+
+```sh
+git clone https://huggingface.co/mtgv/MobileVLM-1.7B
+
+git clone https://huggingface.co/openai/clip-vit-large-patch14-336
+```
+
+2. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:
+
+```sh
+python ./examples/llava/llava_surgery.py -m path/to/MobileVLM-1.7B
+```
+
+3. Use `convert_image_encoder_to_gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:
+
+```sh
+python ./examples/llava/convert_image_encoder_to_gguf.py \
+ -m path/to/clip-vit-large-patch14-336 \
+ --llava-projector path/to/MobileVLM-1.7B/llava.projector \
+ --output-dir path/to/MobileVLM-1.7B \
+ --projector-type ldp
+```
+
+```sh
+python ./examples/llava/convert_image_encoder_to_gguf.py \
+ -m path/to/clip-vit-large-patch14-336 \
+ --llava-projector path/to/MobileVLM-1.7B_V2/llava.projector \
+ --output-dir path/to/MobileVLM-1.7B_V2 \
+ --projector-type ldpv2
+```
+
+4. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
+
+```sh
+python ./examples/convert_legacy_llama.py path/to/MobileVLM-1.7B --skip-unknown
+```
+
+5. Use `llama-quantize` to convert the LLaMA part's data type from `fp32` to `q4_k`:
+```sh
+./llama-quantize path/to/MobileVLM-1.7B/ggml-model-F32.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s
+```
+
+Now both the LLaMA part and the image encoder are in the `MobileVLM-1.7B` directory.
+
+## Android compile and run
+### compile
+Refer to `examples/llava/android/build_64.sh`:
+```sh
+mkdir examples/llava/android/build_64
+cd examples/llava/android/build_64
+../build_64.sh
+```
+### run on Android
+Refer to `android/adb_run.sh` and modify the resource `name` and `path` values to match your setup; a sketch of the steps is shown below.
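+
+A minimal sketch of what the script does (device paths, thread count, and resource names below are placeholders; adjust them to your setup):
+
+```sh
+# push the binary and resources to the device
+adb push path/to/llama-mtmd-cli /data/local/tmp/
+adb push path/to/ggml-model-q4_k.gguf /data/local/tmp/
+adb push path/to/mmproj-model-f16.gguf /data/local/tmp/
+adb push path/to/demo.jpg /data/local/tmp/
+adb shell chmod +x /data/local/tmp/llama-mtmd-cli
+
+# run it on the device
+adb shell "/data/local/tmp/llama-mtmd-cli -m /data/local/tmp/ggml-model-q4_k.gguf --mmproj /data/local/tmp/mmproj-model-f16.gguf -t 4 --image /data/local/tmp/demo.jpg -p 'What is in the image?'"
+```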
+
+## Some results on Android with a `Snapdragon 888` chip
+### case 1
+**input**
+```sh
+/data/local/tmp/llama-mtmd-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ -t 4 \
+ --image /data/local/tmp/demo.jpg \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 21148.71 ms by CLIP ( 146.87 ms per image patch)
+ Susan Wise Bauer
+llama_print_timings: load time = 23574.72 ms
+llama_print_timings: sample time = 1.24 ms / 6 runs ( 0.21 ms per token, 4850.44 tokens per second)
+llama_print_timings: prompt eval time = 12460.15 ms / 246 tokens ( 50.65 ms per token, 19.74 tokens per second)
+llama_print_timings: eval time = 424.86 ms / 6 runs ( 70.81 ms per token, 14.12 tokens per second)
+llama_print_timings: total time = 34731.93 ms
+```
+### case 2
+**input**
+```sh
+/data/local/tmp/llama-mtmd-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ -t 4 \
+ --image /data/local/tmp/cat.jpeg \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 21149.51 ms by CLIP ( 146.87 ms per image patch)
+ The image depicts a cat sitting in the grass near some tall green plants.
+llama_print_timings: load time = 23257.32 ms
+llama_print_timings: sample time = 5.25 ms / 18 runs ( 0.29 ms per token, 3430.53 tokens per second)
+llama_print_timings: prompt eval time = 11900.73 ms / 232 tokens ( 51.30 ms per token, 19.49 tokens per second)
+llama_print_timings: eval time = 1279.03 ms / 18 runs ( 71.06 ms per token, 14.07 tokens per second)
+llama_print_timings: total time = 34570.79 ms
+```
+
+
+## Some results on Android with a `Snapdragon 778G` chip
+### MobileVLM-1.7B case
+#### mtmd-cli release-b2005
+**input**
+```sh
+/data/local/tmp/llama-mtmd-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ -t 4 \
+ --image /data/local/tmp/many_llamas.jpeg \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 18728.52 ms by CLIP ( 130.06 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ A group of llamas are standing in a green pasture.
+
+llama_print_timings: load time = 20357.33 ms
+llama_print_timings: sample time = 2.96 ms / 14 runs ( 0.21 ms per token, 4734.53 tokens per second)
+llama_print_timings: prompt eval time = 8119.49 ms / 191 tokens ( 42.51 ms per token, 23.52 tokens per second)
+llama_print_timings: eval time = 1005.75 ms / 14 runs ( 71.84 ms per token, 13.92 tokens per second)
+llama_print_timings: total time = 28038.34 ms / 205 tokens
+```
+#### mtmd-cli latest-version
+**input**
+
+Just the same as above.
+
+**output** (seems to be much slower)
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ It is a group of sheep standing together in a grass field.
+
+llama_print_timings: load time = 818120.91 ms
+llama_print_timings: sample time = 3.44 ms / 14 runs ( 0.25 ms per token, 4067.40 tokens per second)
+llama_print_timings: prompt eval time = 529274.69 ms / 191 tokens ( 2771.07 ms per token, 0.36 tokens per second)
+llama_print_timings: eval time = 43894.02 ms / 13 runs ( 3376.46 ms per token, 0.30 tokens per second)
+llama_print_timings: total time = 865441.76 ms / 204 tokens
+```
+### MobileVLM_V2-1.7B case
+#### mtmd-cli release-b2005
+**input**
+
+Just the same as above.
+
+**output**
+```sh
+encode_image_with_clip: image encoded in 20609.61 ms by CLIP ( 143.12 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting.
+
+The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama
+
+llama_print_timings: load time = 22406.77 ms
+llama_print_timings: sample time = 49.26 ms / 186 runs ( 0.26 ms per token, 3776.27 tokens per second)
+llama_print_timings: prompt eval time = 9044.54 ms / 191 tokens ( 47.35 ms per token, 21.12 tokens per second)
+llama_print_timings: eval time = 14497.49 ms / 186 runs ( 77.94 ms per token, 12.83 tokens per second)
+llama_print_timings: total time = 44411.01 ms / 377 tokens
+```
+
+## Orin compile and run
+### compile
+```sh
+make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32
+```
+### run on Orin
+### case 1
+**input**
+```sh
+./llama-mtmd-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ --image /data/local/tmp/demo.jpeg \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
+ --n-gpu-layers 999
+```
+**output**
+```sh
+
+encode_image_with_clip: image encoded in 296.62 ms by CLIP ( 2.06 ms per image patch)
+
+ Susan Wise Bauer
+
+llama_print_timings: load time = 1067.64 ms
+llama_print_timings: sample time = 1.53 ms / 6 runs ( 0.25 ms per token, 3934.43 tokens per second)
+llama_print_timings: prompt eval time = 306.84 ms / 246 tokens ( 1.25 ms per token, 801.72 tokens per second)
+llama_print_timings: eval time = 91.50 ms / 6 runs ( 15.25 ms per token, 65.58 tokens per second)
+llama_print_timings: total time = 1352.63 ms / 252 tokens
+```
+
+### case 2
+**input**
+```sh
+./llama-mtmd-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
+ --n-gpu-layers 999
+
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 302.15 ms by CLIP ( 2.10 ms per image patch)
+
+ The image features a cat lying in the grass.
+
+llama_print_timings: load time = 1057.07 ms
+llama_print_timings: sample time = 3.27 ms / 11 runs ( 0.30 ms per token, 3360.83 tokens per second)
+llama_print_timings: prompt eval time = 213.60 ms / 232 tokens ( 0.92 ms per token, 1086.14 tokens per second)
+llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 ms per token, 66.01 tokens per second)
+llama_print_timings: total time = 1365.47 ms / 243 tokens
+```
+
+## Running on Intel(R) Core(TM) i7-10750H
+### Operating system
+Ubuntu 22.04
+### compile
+```sh
+make -j32
+```
+### MobileVLM-1.7B case
+**input**
+```sh
+-m /path/to/ggml-model-q4_k.gguf \
+ --mmproj /path/to/mmproj-model-f16.gguf \
+ --image /path/to/many_llamas.jpeg
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
+```
+**output**
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in 2730.94 ms by CLIP ( 18.96 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that?ASSISTANT:
+
+ A group of llamas are walking together in a field.
+
+llama_print_timings: load time = 5506.60 ms
+llama_print_timings: sample time = 0.44 ms / 13 runs ( 0.03 ms per token, 29545.45 tokens per second)
+llama_print_timings: prompt eval time = 2031.58 ms / 190 tokens ( 10.69 ms per token, 93.52 tokens per second)
+llama_print_timings: eval time = 438.92 ms / 12 runs ( 36.58 ms per token, 27.34 tokens per second)
+llama_print_timings: total time = 5990.25 ms / 202 tokens
+```
+
+### MobileVLM_V2-1.7B case
+**input**
+
+Just the same as above.
+
+**output**
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in 3223.89 ms by CLIP ( 22.39 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that?ASSISTANT:
+
+ The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered. The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order.
+
+The park itself is lush and green, with trees dotting the landscape in the background. A sign reading "Llamas Tico Ana" is also visible in the image, possibly indicating the location or the breed of the llamas. The image seems to be taken from a distance, providing a wide view of the scene and the surrounding environment.
+
+The llamas' positions relative to each other, the sign, and the trees create a harmonious composition. The image does not contain any discernible text. The overall scene is one of peace and natural beauty, with the llamas in their natural habitat, surrounded by the vibrant colors and lush greenery of the park.
+
+llama_print_timings: load time = 6642.61 ms
+llama_print_timings: sample time = 8.15 ms / 223 runs ( 0.04 ms per token, 27358.61 tokens per second)
+llama_print_timings: prompt eval time = 2475.07 ms / 190 tokens ( 13.03 ms per token, 76.77 tokens per second)
+llama_print_timings: eval time = 8760.60 ms / 222 runs ( 39.46 ms per token, 25.34 tokens per second)
+llama_print_timings: total time = 15513.95 ms / 412 tokens
+```
+
+## Run on Intel(R) Core(TM) Ultra7 115H
+### Operating system
+Windows 11
+### compile
+```sh
+make -j32
+```
+### MobileVLM-1.7B case
+**input**
+```sh
+-m /path/to/ggml-model-q4_k.gguf \
+ --mmproj /path/to/tmp/mmproj-model-f16.gguf \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 4902.81 ms by CLIP ( 34.05 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ The image features a group of brown and white llamas standing in a grassy field.
+
+llama_print_timings: load time = 7441.06 ms
+llama_print_timings: sample time = 0.72 ms / 19 runs ( 0.04 ms per token, 26279.39 tokens per second)
+llama_print_timings: prompt eval time = 2090.71 ms / 191 tokens ( 10.95 ms per token, 91.36 tokens per second)
+llama_print_timings: eval time = 512.35 ms / 18 runs ( 28.46 ms per token, 35.13 tokens per second)
+llama_print_timings: total time = 7987.23 ms / 209 tokens
+```
+
+### MobileVLM_V2-1.7B case
+**input**
+
+Just the same as above.
+
+**output**
+```sh
+encode_image_with_clip: image encoded in 4682.44 ms by CLIP ( 32.52 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ This image captures a lively scene of a group of 14 llamas in a grassy field. The llamas, with their distinctive black and white coats, are standing and walking in a line, seemingly engaged in a social activity. One
+ of them, possibly the first in the line, has its back turned, perhaps observing something in the distance.
+
+The llama in the front of the line stands out due to its black and white coloring, which is quite unusual for llama patterns. The llama in the front also seems to be more aware of its surroundings, as it faces the camera, giving a sense of engagement with the viewer.
+
+The image is taken from the side of the llama, providing a clear view of the llama in the front and its companions. The lameness in the llama in
+ front is not visible, indicating that it might not be the main focus of the photo.
+
+The background of the image features a grassy field, with a fence and a tree visible in the distance. The tree appears to be bare, suggesting that it might be during a time of year when most trees are dormant or have shed their leaves.
+
+
+llama_print_timings: load time = 7015.35 ms
+llama_print_timings: sample time = 10.61 ms / 256 runs ( 0.04 ms per token, 24119.09 tokens per second)
+llama_print_timings: prompt eval time = 2052.45 ms / 191 tokens ( 10.75 ms per token, 93.06 tokens per second)
+llama_print_timings: eval time = 7259.43 ms / 255 runs ( 28.47 ms per token, 35.13 tokens per second)
+llama_print_timings: total time = 14371.19 ms / 446 tokens
+```
+
+## TODO
+
+- [x] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
+- [ ] Optimize LDP projector performance
+
+ - Optimize the structure definition to avoid unnecessary memory rearrangements, to reduce the use of `ggml_permute_cpy`;
+ - Optimize operator implementation (ARM CPU/NVIDIA GPU): such as depthwise conv, hardswish, hardsigmoid, etc.
+- [x] run MobileVLM on `Jetson Orin`
+- [ ] Support more model variants, such as `MobileVLM-3B`.
+
+
+## Contributors
+```sh
+zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77
+```
--- /dev/null
+# Gemma 3 vision
+
+> [!IMPORTANT]
+>
+> This is very experimental, only used for demo purposes.
+
+## Quick start
+
+You can use a pre-quantized model from [ggml-org](https://huggingface.co/ggml-org)'s Hugging Face account:
+
+```bash
+# build
+cmake -B build
+cmake --build build --target llama-gemma3-cli
+
+# alternatively, install from brew (MacOS)
+brew install llama.cpp
+
+# run it
+llama-gemma3-cli -hf ggml-org/gemma-3-4b-it-GGUF
+llama-gemma3-cli -hf ggml-org/gemma-3-12b-it-GGUF
+llama-gemma3-cli -hf ggml-org/gemma-3-27b-it-GGUF
+
+# note: 1B model does not support vision
+```
+
+## How to get mmproj.gguf?
+
+Simply add `--mmproj` when converting the model via `convert_hf_to_gguf.py`:
+
+```bash
+cd gemma-3-4b-it
+python ../llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 --mmproj .
+# output file: mmproj-model.gguf
+```
+
+## How to run it?
+
+What you need:
+- The text model GGUF; it can be converted using `convert_hf_to_gguf.py`
+- The mmproj file from step above
+- An image file
+
+```bash
+# build
+cmake -B build
+cmake --build build --target llama-gemma3-cli
+
+# run it
+./build/bin/llama-gemma3-cli -m {text_model}.gguf --mmproj mmproj.gguf --image your_image.jpg
+```
--- /dev/null
+# GLMV-EDGE
+
+Currently this implementation supports [glm-edge-v-2b](https://huggingface.co/THUDM/glm-edge-v-2b) and [glm-edge-v-5b](https://huggingface.co/THUDM/glm-edge-v-5b).
+
+## Usage
+Build the `llama-mtmd-cli` binary.
+
+After building, run: `./llama-mtmd-cli` to see the usage. For example:
+
+```sh
+./llama-mtmd-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf
+```
+
+**note**: A lower temperature like 0.1 is recommended for better quality; add `--temp 0.1` to the command to do so.
+**note**: For GPU offloading, make sure to use the `-ngl` flag as usual.
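+
+For example, combining both notes (the GGUF paths are the same placeholders as above, and `-ngl 99` simply offloads all layers):
+
+```sh
+./llama-mtmd-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --temp 0.1 -ngl 99
+```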
+
+## GGUF conversion
+
+1. Clone a GLMV-EDGE model ([2B](https://huggingface.co/THUDM/glm-edge-v-2b) or [5B](https://huggingface.co/THUDM/glm-edge-v-5b)). For example:
+
+```sh
+git clone https://huggingface.co/THUDM/glm-edge-v-5b
+# or
+git clone https://huggingface.co/THUDM/glm-edge-v-2b
+```
+
+2. Use `glmedge-surgery.py` to split the GLMV-EDGE model into its LLM and multimodal projector constituents:
+
+```sh
+python ./examples/llava/glmedge-surgery.py -m ../model_path
+```
+
+3. Use `glmedge-convert-image-encoder-to-gguf.py` to convert the GLMV-EDGE image encoder to GGUF:
+
+```sh
+python ./examples/llava/glmedge-convert-image-encoder-to-gguf.py -m ../model_path --llava-projector ../model_path/glm.projector --output-dir ../model_path
+```
+
+4. Use `convert_hf_to_gguf.py` to convert the LLM part of GLMV-EDGE to GGUF:
+
+```sh
+python convert_hf_to_gguf.py ../model_path
+```
+
+Now both the LLM part and the image encoder are in the `model_path` directory.
--- /dev/null
+# Granite Vision
+
+Download the model and point your `GRANITE_MODEL` environment variable to the path.
+
+```bash
+$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
+$ export GRANITE_MODEL=./granite-vision-3.2-2b
+```
+
+
+### 1. Running llava surgery v2.
+First, we need to run the llava surgery script as shown below:
+
+`python llava_surgery_v2.py -C -m $GRANITE_MODEL`
+
+You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.
+
+```bash
+$ ls $GRANITE_MODEL | grep -i llava
+llava.clip
+llava.projector
+```
+
+We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty:
+```python
+import os
+import torch
+
+MODEL_PATH = os.getenv("GRANITE_MODEL")
+if not MODEL_PATH:
+ raise ValueError("env var GRANITE_MODEL is unset!")
+
+encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
+projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))
+
+assert len(encoder_tensors) > 0
+assert len(projector_tensors) > 0
+```
+
+If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
+
+
+### 2. Creating the Visual Component GGUF
+Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.
+
+```bash
+$ ENCODER_PATH=$PWD/visual_encoder
+$ mkdir $ENCODER_PATH
+
+$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
+$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
+```
+
+Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.
+
+```json
+{
+ "_name_or_path": "siglip-model",
+ "architectures": [
+ "SiglipVisionModel"
+ ],
+ "image_grid_pinpoints": [
+ [384,384],
+ [384,768],
+ [384,1152],
+ [384,1536],
+ [384,1920],
+ [384,2304],
+ [384,2688],
+ [384,3072],
+ [384,3456],
+ [384,3840],
+ [768,384],
+ [768,768],
+ [768,1152],
+ [768,1536],
+ [768,1920],
+ [1152,384],
+ [1152,768],
+ [1152,1152],
+ [1536,384],
+ [1536,768],
+ [1920,384],
+ [1920,768],
+ [2304,384],
+ [2688,384],
+ [3072,384],
+ [3456,384],
+ [3840,384]
+ ],
+ "mm_patch_merge_type": "spatial_unpad",
+ "hidden_size": 1152,
+ "image_size": 384,
+ "intermediate_size": 4304,
+ "model_type": "siglip_vision_model",
+ "num_attention_heads": 16,
+ "num_hidden_layers": 27,
+ "patch_size": 14,
+ "layer_norm_eps": 1e-6,
+ "hidden_act": "gelu_pytorch_tanh",
+ "projection_dim": 0,
+ "vision_feature_layer": [-24, -20, -12, -1]
+}
+```
+
+At this point you should have something like this:
+```bash
+$ ls $ENCODER_PATH
+config.json llava.projector pytorch_model.bin
+```
+
+Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in the `preprocessor_config.json`.
+```bash
+$ python convert_image_encoder_to_gguf.py \
+ -m $ENCODER_PATH \
+ --llava-projector $ENCODER_PATH/llava.projector \
+ --output-dir $ENCODER_PATH \
+ --clip-model-is-vision \
+ --clip-model-is-siglip \
+ --image-mean 0.5 0.5 0.5 \
+ --image-std 0.5 0.5 0.5
+```
+
+This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
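+
+For example, you can export it so the later steps can pick it up:
+
+```bash
+$ export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
+```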
+
+
+### 3. Creating the LLM GGUF.
+The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for the LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path.
+
+First, set `LLM_EXPORT_PATH` to the directory that the `transformers` LLM should be exported to.
+```bash
+$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
+```
+
+```python
+import os
+import transformers
+
+MODEL_PATH = os.getenv("GRANITE_MODEL")
+if not MODEL_PATH:
+ raise ValueError("env var GRANITE_MODEL is unset!")
+
+LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
+if not LLM_EXPORT_PATH:
+ raise ValueError("env var LLM_EXPORT_PATH is unset!")
+
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
+
+# NOTE: granite vision support was added to transformers very recently (4.49);
+# if you get size mismatches, your version is too old.
+# If you are running with an older version, set `ignore_mismatched_sizes=True`
+# as shown below; it won't be loaded correctly, but the LLM part of the model that
+# we are exporting will be loaded correctly.
+model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)
+
+tokenizer.save_pretrained(LLM_EXPORT_PATH)
+model.language_model.save_pretrained(LLM_EXPORT_PATH)
+```
+
+Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
+```bash
+$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
+...
+$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
+```
+
+
+### 4. Quantization
+If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
+```bash
+$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
+$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
+```
+
+Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.
+
+
+### 5. Running the Model in llama.cpp
+Build llama.cpp normally; you should have a target binary named `llama-mtmd-cli`, to which you can pass the two GGUF files built above. As an example, we use the llama.cpp banner.
+
+```bash
+$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
+ --mmproj $VISUAL_GGUF_PATH \
+ -c 16384 \
+ --temp 0
+```
--- /dev/null
+# LLaVA
+
+Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
+as well as [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.
+
+The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
+and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
+models are available.
+For llava-1.6, a variety of prepared gguf models are available as well: [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf).
+
+After the API is confirmed, more models will be supported / uploaded.
+
+## Usage
+Build the `llama-mtmd-cli` binary.
+
+After building, run: `./llama-mtmd-cli` to see the usage. For example:
+
+```sh
+./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
+ --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
+ --chat-template vicuna
+```
+
+**note**: A lower temperature like 0.1 is recommended for better quality; add `--temp 0.1` to the command to do so.
+**note**: For GPU offloading, make sure to use the `-ngl` flag as usual.
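+
+For example, combining both notes with the command above (`-ngl 99` simply offloads all layers):
+
+```sh
+./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
+    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
+    --chat-template vicuna \
+    --temp 0.1 -ngl 99
+```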
+
+## LLaVA 1.5
+
+1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:
+
+```sh
+git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
+
+git clone https://huggingface.co/openai/clip-vit-large-patch14-336
+```
+
+2. Install the required Python packages:
+
+```sh
+pip install -r examples/llava/requirements.txt
+```
+
+3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:
+
+```sh
+python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
+```
+
+4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:
+
+```sh
+python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
+```
+
+5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
+
+```sh
+python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
+```
+
+Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
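+
+Optionally, you can quantize the LLaMA part with `llama-quantize` (a sketch; the exact input filename depends on the output type of the conversion step above):
+
+```sh
+./llama-quantize ../llava-v1.5-7b/ggml-model-f16.gguf ../llava-v1.5-7b/ggml-model-q4_k.gguf q4_k_s
+```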
+
+## LLaVA 1.6 gguf conversion
+1) First clone a LLaVA 1.6 model:
+```console
+git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
+```
+
+2) Install the required Python packages:
+
+```sh
+pip install -r examples/llava/requirements.txt
+```
+
+3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, in both pytorch and safetensor formats:
+```console
+python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
+```
+- You will find a `llava.projector` and a `llava.clip` file in your model directory.
+
+4) Copy the `llava.clip` file into a subdirectory (like `vit`), rename it to `pytorch_model.bin`, and add a fitting vit configuration to the directory:
+```console
+mkdir vit
+cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
+cp ../llava-v1.6-vicuna-7b/llava.projector vit/
+curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
+```
+
+5) Create the visual gguf model:
+```console
+python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
+```
+- This is similar to llava-1.5; the difference is that we tell the conversion script that we are working with the pure vision model part of CLIP
+
+6) Then convert the model to gguf format:
+```console
+python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
+```
+
+7) And finally we can run `llama-mtmd-cli` using the 1.6 model version:
+```console
+./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
+```
+
+**note** llava-1.6 needs more context than llava-1.5; at least 3000 tokens are needed (just run it at `-c 4096`)
+
+**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)
+
+**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava-next model.
+
+```python
+import os
+import transformers
+
+model_path = ...
+llm_export_path = ...
+
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
+model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)
+
+tokenizer.save_pretrained(llm_export_path)
+model.language_model.save_pretrained(llm_export_path)
+```
+
+Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
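+
+A sketch of that conversion (the export path and output filename are placeholders matching the snippet above):
+
+```sh
+python ./convert_hf_to_gguf.py path/to/llm_export --outfile path/to/llm_export/llava-llm-f16.gguf --outtype f16
+```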
+
+## Chat template
+
+For llava-1.5 and llava-1.6, you need to use the `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.
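+
+For example, with the llava-1.6 model converted above (the `-c 4096` follows the context note earlier):
+
+```console
+./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf --chat-template vicuna -c 4096
+```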
+
+
+## How to know if you are running in llava-1.5 or llava-1.6 mode
+
+When running the CLI you will see visual information printed right before the prompt is processed:
+
+**Llava-1.5:**
+`encode_image_with_clip: image embedding created: 576 tokens`
+
+**Llava-1.6 (anything above 576):**
+`encode_image_with_clip: image embedding created: 2880 tokens`
+
+
+Alternatively, just note how many "tokens" have been used for your prompt; llava-1.6 will also show 1000+ tokens.
--- /dev/null
+## MiniCPM-o 2.6
+Currently, this readme only covers minicpm-omni's image capabilities; we will add full-mode support as soon as possible.
+
+### Prepare models and code
+
+Download the [MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) PyTorch model from Hugging Face into the "MiniCPM-o-2_6" folder.
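+
+For example, using git (make sure `git-lfs` is installed so the weight files are actually fetched):
+
+```bash
+git clone https://huggingface.co/openbmb/MiniCPM-o-2_6
+```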
+
+
+### Build llama.cpp
+Readme modification time: 20250206
+
+If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
+
+Clone llama.cpp:
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+```
+
+Build llama.cpp using `CMake`:
+```bash
+cmake -B build
+cmake --build build --config Release
+```
+
+
+### Usage of MiniCPM-o 2.6
+
+Convert the PyTorch model to gguf files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) files provided by us):
+
+```bash
+python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-o-2_6
+python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-o-2_6 --minicpmv-projector ../MiniCPM-o-2_6/minicpmv.projector --output-dir ../MiniCPM-o-2_6/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 4
+python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model
+
+# quantize int4 version
+./build/bin/llama-quantize ../MiniCPM-o-2_6/model/ggml-model-f16.gguf ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf Q4_K_M
+```
+
+
+Inference on Linux or Mac
+```bash
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf
+```
--- /dev/null
+## MiniCPM-Llama3-V 2.5
+
+### Prepare models and code
+
+Download the [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) PyTorch model from Hugging Face into the "MiniCPM-Llama3-V-2_5" folder.
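+
+For example, using git (make sure `git-lfs` is installed so the weight files are actually fetched):
+
+```bash
+git clone https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5
+```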
+
+
+### Build llama.cpp
+Readme modification time: 20250206
+
+If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
+
+Clone llama.cpp:
+```bash
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+```
+
+Build llama.cpp using `CMake`:
+```bash
+cmake -B build
+cmake --build build --config Release
+```
+
+
+### Usage of MiniCPM-Llama3-V 2.5
+
+Convert the PyTorch model to gguf files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) files provided by us):
+
+```bash
+python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-Llama3-V-2_5
+python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-Llama3-V-2_5 --minicpmv-projector ../MiniCPM-Llama3-V-2_5/minicpmv.projector --output-dir ../MiniCPM-Llama3-V-2_5/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 2
+python ./convert_hf_to_gguf.py ../MiniCPM-Llama3-V-2_5/model
+
+# quantize int4 version
+./build/bin/llama-quantize ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf Q4_K_M
+```
+
+
+Inference on Linux or Mac
+```bash
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf
+```
--- /dev/null
+## MiniCPM-V 2.6
+
+### Prepare models and code
+
+Download the [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) PyTorch model from Hugging Face into the "MiniCPM-V-2_6" folder.
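+
+For example, using git (make sure `git-lfs` is installed so the weight files are actually fetched):
+
+```bash
+git clone https://huggingface.co/openbmb/MiniCPM-V-2_6
+```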
+
+
+### Build llama.cpp
+Readme modification time: 20250206
+
+If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
+
+Clone llama.cpp:
+```bash
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+```
+
+Build llama.cpp using `CMake`:
+```bash
+cmake -B build
+cmake --build build --config Release
+```
+
+
+### Usage of MiniCPM-V 2.6
+
+Convert the PyTorch model to gguf files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) files provided by us):
+
+```bash
+python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-V-2_6
+python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2_6 --minicpmv-projector ../MiniCPM-V-2_6/minicpmv.projector --output-dir ../MiniCPM-V-2_6/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 3
+python ./convert_hf_to_gguf.py ../MiniCPM-V-2_6/model
+
+# quantize int4 version
+./build/bin/llama-quantize ../MiniCPM-V-2_6/model/ggml-model-f16.gguf ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf Q4_K_M
+```
+
+
+Inference on Linux or Mac
+```bash
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf
+```
+++ /dev/null
-# MobileVLM
-
-Currently this implementation supports [MobileVLM-1.7B](https://huggingface.co/mtgv/MobileVLM-1.7B) / [MobileVLM_V2-1.7B](https://huggingface.co/mtgv/MobileVLM_V2-1.7B) variants.
-
-for more information, please go to [Meituan-AutoML/MobileVLM](https://github.com/Meituan-AutoML/MobileVLM)
-
-The implementation is based on llava, and is compatible with llava and mobileVLM. The usage is basically same as llava.
-
-Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown.
-
-## Usage
-Build with cmake or run `make llama-llava-cli` to build it.
-
-After building, run: `./llama-llava-cli` to see the usage. For example:
-
-```sh
-./llama-llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
- --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
- --image path/to/an/image.jpg \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"
-```
-
-## Model conversion
-
-1. Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally:
-
-```sh
-git clone https://huggingface.co/mtgv/MobileVLM-1.7B
-
-git clone https://huggingface.co/openai/clip-vit-large-patch14-336
-```
-
-2. Use `llava_surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:
-
-```sh
-python ./examples/llava/llava_surgery.py -m path/to/MobileVLM-1.7B
-```
-
-3. Use `convert_image_encoder_to_gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:
-
-```sh
-python ./examples/llava/convert_image_encoder_to_gguf.py \
- -m path/to/clip-vit-large-patch14-336 \
- --llava-projector path/to/MobileVLM-1.7B/llava.projector \
- --output-dir path/to/MobileVLM-1.7B \
- --projector-type ldp
-```
-
-```sh
-python ./examples/llava/convert_image_encoder_to_gguf.py \
- -m path/to/clip-vit-large-patch14-336 \
- --llava-projector path/to/MobileVLM-1.7B_V2/llava.projector \
- --output-dir path/to/MobileVLM-1.7B_V2 \
- --projector-type ldpv2
-```
-
-4. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
-
-```sh
-python ./examples/convert_legacy_llama.py path/to/MobileVLM-1.7B --skip-unknown
-```
-
-5. Use `quantize` to convert LLaMA part's DataType from `fp32` to `q4_k`
-```sh
-./llama-quantize path/to/MobileVLM-1.7B/ggml-model-F32.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s
-```
-
-Now both the LLaMA part and the image encoder is in the `MobileVLM-1.7B` directory.
-
-## Android compile and run
-### compile
-refer to `examples/llava/android/build_64.sh`
-```sh
-mkdir examples/llava/android/build_64
-cd examples/llava/android/build_64
-../build_64.sh
-```
-### run on Android
-refer to `android/adb_run.sh`, modify resources' `name` and `path`
-
-## Some result on Android with `Snapdragon 888` chip
-### case 1
-**input**
-```sh
-/data/local/tmp/llama-llava-cli \
- -m /data/local/tmp/ggml-model-q4_k.gguf \
- --mmproj /data/local/tmp/mmproj-model-f16.gguf \
- -t 4 \
- --image /data/local/tmp/demo.jpg \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"
-```
-**output**
-```sh
-encode_image_with_clip: image encoded in 21148.71 ms by CLIP ( 146.87 ms per image patch)
- Susan Wise Bauer
-llama_print_timings: load time = 23574.72 ms
-llama_print_timings: sample time = 1.24 ms / 6 runs ( 0.21 ms per token, 4850.44 tokens per second)
-llama_print_timings: prompt eval time = 12460.15 ms / 246 tokens ( 50.65 ms per token, 19.74 tokens per second)
-llama_print_timings: eval time = 424.86 ms / 6 runs ( 70.81 ms per token, 14.12 tokens per second)
-llama_print_timings: total time = 34731.93 ms
-```
-### case 2
-**input**
-```sh
-/data/local/tmp/llama-llava-cli \
- -m /data/local/tmp/ggml-model-q4_k.gguf \
- --mmproj /data/local/tmp/mmproj-model-f16.gguf \
- -t 4 \
- --image /data/local/tmp/cat.jpeg \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
-```
-**output**
-```sh
-encode_image_with_clip: image encoded in 21149.51 ms by CLIP ( 146.87 ms per image patch)
- The image depicts a cat sitting in the grass near some tall green plants.
-llama_print_timings: load time = 23257.32 ms
-llama_print_timings: sample time = 5.25 ms / 18 runs ( 0.29 ms per token, 3430.53 tokens per second)
-llama_print_timings: prompt eval time = 11900.73 ms / 232 tokens ( 51.30 ms per token, 19.49 tokens per second)
-llama_print_timings: eval time = 1279.03 ms / 18 runs ( 71.06 ms per token, 14.07 tokens per second)
-llama_print_timings: total time = 34570.79 ms
-```
-
-
-## Some result on Android with `Snapdragon 778G` chip
-### MobileVLM-1.7B case
-#### llava-cli release-b2005
-**input**
-```sh
-/data/local/tmp/llama-llava-cli \
- -m /data/local/tmp/ggml-model-q4_k.gguf \
- --mmproj /data/local/tmp/mmproj-model-f16.gguf \
- -t 4 \
- --image /data/local/tmp/many_llamas.jpeg \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
-```
-**output**
-```sh
-encode_image_with_clip: image encoded in 18728.52 ms by CLIP ( 130.06 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that? ASSISTANT:
-
- A group of llamas are standing in a green pasture.
-
-llama_print_timings: load time = 20357.33 ms
-llama_print_timings: sample time = 2.96 ms / 14 runs ( 0.21 ms per token, 4734.53 tokens per second)
-llama_print_timings: prompt eval time = 8119.49 ms / 191 tokens ( 42.51 ms per token, 23.52 tokens per second)
-llama_print_timings: eval time = 1005.75 ms / 14 runs ( 71.84 ms per token, 13.92 tokens per second)
-llama_print_timings: total time = 28038.34 ms / 205 tokens
-```
-#### llava-cli latest-version
-**input**
-
-Just the same as above.
-
-**output**(seems to be much slower)
-```sh
-encode_image_with_clip: image embedding created: 144 tokens
-
-encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that? ASSISTANT:
-
- It is a group of sheep standing together in a grass field.
-
-llama_print_timings: load time = 818120.91 ms
-llama_print_timings: sample time = 3.44 ms / 14 runs ( 0.25 ms per token, 4067.40 tokens per second)
-llama_print_timings: prompt eval time = 529274.69 ms / 191 tokens ( 2771.07 ms per token, 0.36 tokens per second)
-llama_print_timings: eval time = 43894.02 ms / 13 runs ( 3376.46 ms per token, 0.30 tokens per second)
-llama_print_timings: total time = 865441.76 ms / 204 tokens
-```
-### MobileVLM_V2-1.7B case
-#### llava-cli release-2005b
-**input**
-
-Just the same as above.
-
-**output**
-```sh
-encode_image_with_clip: image encoded in 20609.61 ms by CLIP ( 143.12 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that? ASSISTANT:
-
- This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting.
-
-The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama
-
-llama_print_timings: load time = 22406.77 ms
-llama_print_timings: sample time = 49.26 ms / 186 runs ( 0.26 ms per token, 3776.27 tokens per second)
-llama_print_timings: prompt eval time = 9044.54 ms / 191 tokens ( 47.35 ms per token, 21.12 tokens per second)
-llama_print_timings: eval time = 14497.49 ms / 186 runs ( 77.94 ms per token, 12.83 tokens per second)
-llama_print_timings: total time = 44411.01 ms / 377 tokens
-```
-
-## Orin compile and run
-### compile
-```sh
-make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32
-```
-### run on Orin
-### case 1
-**input**
-```sh
-./llama-llava-cli \
- -m /data/local/tmp/ggml-model-q4_k.gguf \
- --mmproj /data/local/tmp/mmproj-model-f16.gguf \
- --image /data/local/tmp/demo.jpeg \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
- --n-gpu-layers 999
-```
-**output**
-```sh
-
-encode_image_with_clip: image encoded in 296.62 ms by CLIP ( 2.06 ms per image patch)
-
- Susan Wise Bauer
-
-llama_print_timings: load time = 1067.64 ms
-llama_print_timings: sample time = 1.53 ms / 6 runs ( 0.25 ms per token, 3934.43 tokens per second)
-llama_print_timings: prompt eval time = 306.84 ms / 246 tokens ( 1.25 ms per token, 801.72 tokens per second)
-llama_print_timings: eval time = 91.50 ms / 6 runs ( 15.25 ms per token, 65.58 tokens per second)
-llama_print_timings: total time = 1352.63 ms / 252 tokens
-```
-
-### case 2
-**input**
-```sh
-./llama-llava-cli \
- -m /data/local/tmp/ggml-model-q4_k.gguf \
- --mmproj /data/local/tmp/mmproj-model-f16.gguf \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
- --n-gpu-layers 999
-
-```
-**output**
-```sh
-encode_image_with_clip: image encoded in 302.15 ms by CLIP ( 2.10 ms per image patch)
-
- The image features a cat lying in the grass.
-
-llama_print_timings: load time = 1057.07 ms
-llama_print_timings: sample time = 3.27 ms / 11 runs ( 0.30 ms per token, 3360.83 tokens per second)
-llama_print_timings: prompt eval time = 213.60 ms / 232 tokens ( 0.92 ms per token, 1086.14 tokens per second)
-llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 ms per token, 66.01 tokens per second)
-llama_print_timings: total time = 1365.47 ms / 243 tokens
-```
-
-## Running on Intel(R) Core(TM) i7-10750H
-### Operating system
-Ubuntu22.04
-### compile
-```sh
-make -j32
-```
-### MobileVLM-1.7B case
-**input**
-```sh
--m /path/to/ggml-model-q4_k.gguf \
- --mmproj /path/to/mmproj-model-f16.gguf \
- --image /path/to/many_llamas.jpeg
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
-```
-**output**
-```sh
-encode_image_with_clip: image embedding created: 144 tokens
-
-encode_image_with_clip: image encoded in 2730.94 ms by CLIP ( 18.96 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that?ASSISTANT:
-
- A group of llamas are walking together in a field.
-
-llama_print_timings: load time = 5506.60 ms
-llama_print_timings: sample time = 0.44 ms / 13 runs ( 0.03 ms per token, 29545.45 tokens per second)
-llama_print_timings: prompt eval time = 2031.58 ms / 190 tokens ( 10.69 ms per token, 93.52 tokens per second)
-llama_print_timings: eval time = 438.92 ms / 12 runs ( 36.58 ms per token, 27.34 tokens per second)
-llama_print_timings: total time = 5990.25 ms / 202 tokens
-```
-
-### MobileVLM_V2-1.7B case
-**input**
-
-Just the same as above.
-
-**ouput**
-```sh
-encode_image_with_clip: image embedding created: 144 tokens
-
-encode_image_with_clip: image encoded in 3223.89 ms by CLIP ( 22.39 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that?ASSISTANT:
-
- The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered. The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order.
-
-The park itself is lush and green, with trees dotting the landscape in the background. A sign reading "Llamas Tico Ana" is also visible in the image, possibly indicating the location or the breed of the llamas. The image seems to be taken from a distance, providing a wide view of the scene and the surrounding environment.
-
-The llamas' positions relative to each other, the sign, and the trees create a harmonious composition. The image does not contain any discernible text. The overall scene is one of peace and natural beauty, with the llamas in their natural habitat, surrounded by the vibrant colors and lush greenery of the park.
-
-llama_print_timings: load time = 6642.61 ms
-llama_print_timings: sample time = 8.15 ms / 223 runs ( 0.04 ms per token, 27358.61 tokens per second)
-llama_print_timings: prompt eval time = 2475.07 ms / 190 tokens ( 13.03 ms per token, 76.77 tokens per second)
-llama_print_timings: eval time = 8760.60 ms / 222 runs ( 39.46 ms per token, 25.34 tokens per second)
-llama_print_timings: total time = 15513.95 ms / 412 tokens
-```
-
-## Run on Intel(R) Core(TM) Ultra7 115H
-### operation system
-Windows11
-### comiple
-```sh
-make -j32
-```
-### MobileVLM-1.7B case
-**input**
-```sh
--m /path/to/ggml-model-q4_k.gguf \
- --mmproj /path/to/tmp/mmproj-model-f16.gguf \
- -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
-```
-**output**
-```sh
-encode_image_with_clip: image encoded in 4902.81 ms by CLIP ( 34.05 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that? ASSISTANT:
-
- The image features a group of brown and white llamas standing in a grassy field.
-
-llama_print_timings: load time = 7441.06 ms
-llama_print_timings: sample time = 0.72 ms / 19 runs ( 0.04 ms per token, 26279.39 tokens per second)
-llama_print_timings: prompt eval time = 2090.71 ms / 191 tokens ( 10.95 ms per token, 91.36 tokens per second)
-llama_print_timings: eval time = 512.35 ms / 18 runs ( 28.46 ms per token, 35.13 tokens per second)
-llama_print_timings: total time = 7987.23 ms / 209 tokens
-```
-
-### MobileVLM_V2-1.7B case
-**input**
-
-Just the same as above.
-
-**output**
-```sh
-encode_image_with_clip: image encoded in 4682.44 ms by CLIP ( 32.52 ms per image patch)
-system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
-user_prompt: \nWhat's that? ASSISTANT:
-
- This image captures a lively scene of a group of 14 llamas in a grassy field. The llamas, with their distinctive black and white coats, are standing and walking in a line, seemingly engaged in a social activity. One
- of them, possibly the first in the line, has its back turned, perhaps observing something in the distance.
-
-The llama in the front of the line stands out due to its black and white coloring, which is quite unusual for llama patterns. The llama in the front also seems to be more aware of its surroundings, as it faces the camera, giving a sense of engagement with the viewer.
-
-The image is taken from the side of the llama, providing a clear view of the llama in the front and its companions. The lameness in the llama in
- front is not visible, indicating that it might not be the main focus of the photo.
-
-The background of the image features a grassy field, with a fence and a tree visible in the distance. The tree appears to be bare, suggesting that it might be during a time of year when most trees are dormant or have shed their leaves.
-
-
-llama_print_timings: load time = 7015.35 ms
-llama_print_timings: sample time = 10.61 ms / 256 runs ( 0.04 ms per token, 24119.09 tokens per second)
-llama_print_timings: prompt eval time = 2052.45 ms / 191 tokens ( 10.75 ms per token, 93.06 tokens per second)
-llama_print_timings: eval time = 7259.43 ms / 255 runs ( 28.47 ms per token, 35.13 tokens per second)
-llama_print_timings: total time = 14371.19 ms / 446 tokens
-```
-
-## TODO
-
-- [x] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
-- [ ] Optimize LDP projector performance
-
- - Optimize the structure definition to avoid unnecessary memory rearrangements, to reduce the use of `ggml_permute_cpy`;
- - Optimize operator implementation (ARM CPU/NVIDIA GPU): such as depthwise conv, hardswish, hardsigmoid, etc.
-- [x] run MobileVLM on `Jetson Orin`
-- [ ] Support more model variants, such as `MobileVLM-3B`.
-
-
-## contributor
-```sh
-zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77
-```
+++ /dev/null
-# Gemma 3 vision
-
-> [!IMPORTANT]
->
-> This is very experimental, only used for demo purpose.
-
-## Quick started
-
-You can use pre-quantized model from [ggml-org](https://huggingface.co/ggml-org)'s Hugging Face account
-
-```bash
-# build
-cmake -B build
-cmake --build build --target llama-gemma3-cli
-
-# alternatively, install from brew (MacOS)
-brew install llama.cpp
-
-# run it
-llama-gemma3-cli -hf ggml-org/gemma-3-4b-it-GGUF
-llama-gemma3-cli -hf ggml-org/gemma-3-12b-it-GGUF
-llama-gemma3-cli -hf ggml-org/gemma-3-27b-it-GGUF
-
-# note: 1B model does not support vision
-```
-
-## How to get mmproj.gguf?
-
-```bash
-cd gemma-3-4b-it
-python ../llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py .
-
-# output file is mmproj.gguf
-```
-
-## How to run it?
-
-What you need:
-- The text model GGUF, can be converted using `convert_hf_to_gguf.py`
-- The mmproj file from step above
-- An image file
-
-```bash
-# build
-cmake -B build
-cmake --build build --target llama-gemma3-cli
-
-# run it
-./build/bin/llama-gemma3-cli -m {text_model}.gguf --mmproj mmproj.gguf --image your_image.jpg
-```
+++ /dev/null
-# GLMV-EDGE
-
-Currently this implementation supports [glm-edge-v-2b](https://huggingface.co/THUDM/glm-edge-v-2b) and [glm-edge-v-5b](https://huggingface.co/THUDM/glm-edge-v-5b).
-
-## Usage
-Build with cmake or run `make llama-llava-cli` to build it.
-
-After building, run: `./llama-llava-cli` to see the usage. For example:
-
-```sh
-./llama-llava-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --image img_path/image.jpg -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"
-```
-
-**note**: A lower temperature like 0.1 is recommended for better quality; add `--temp 0.1` to the command to do so.
-**note**: For GPU offloading, make sure to use the `-ngl` flag as usual.
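-
-For example, a sketch that applies both notes to the command above (the paths and the `-ngl 99` value are placeholders; adjust them for your setup):
-
-```sh
-./llama-llava-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --image img_path/image.jpg --temp 0.1 -ngl 99 -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"
-```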
-
-## GGUF conversion
-
-1. Clone a GLMV-EDGE model ([2B](https://huggingface.co/THUDM/glm-edge-v-2b) or [5B](https://huggingface.co/THUDM/glm-edge-v-5b)). For example:
-
-```sh
-git clone https://huggingface.co/THUDM/glm-edge-v-5b
-# or
-git clone https://huggingface.co/THUDM/glm-edge-v-2b
-```
-
-2. Use `glmedge-surgery.py` to split the GLMV-EDGE model into its LLM and multimodal projector constituents:
-
-```sh
-python ./examples/llava/glmedge-surgery.py -m ../model_path
-```
-
-3. Use `glmedge-convert-image-encoder-to-gguf.py` to convert the GLMV-EDGE image encoder to GGUF:
-
-```sh
-python ./examples/llava/glmedge-convert-image-encoder-to-gguf.py -m ../model_path --llava-projector ../model_path/glm.projector --output-dir ../model_path
-```
-
-4. Use `examples/convert_hf_to_gguf.py` to convert the LLM part of GLMV-EDGE to GGUF:
-
-```sh
-python convert_hf_to_gguf.py ../model_path
-```
-
-Now both the LLM part and the image encoder are in the `model_path` directory.
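-
-Optionally, you can quantize the LLM part with `llama-quantize`, as you would for any other model (a sketch; the exact F16 filename produced by the converter may differ on your setup):
-
-```sh
-./build/bin/llama-quantize ../model_path/ggml-model-f16.gguf ../model_path/ggml-model-Q4_K_M.gguf Q4_K_M
-```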
+++ /dev/null
-# Granite Vision
-
-Download the model and point your `GRANITE_MODEL` environment variable to the path.
-
-```bash
-$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
-$ export GRANITE_MODEL=./granite-vision-3.2-2b
-```
-
-
-### 1. Running llava surgery v2.
-First, we need to run the llava surgery script as shown below:
-
-`python llava_surgery_v2.py -C -m $GRANITE_MODEL`
-
-You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.
-
-```bash
-$ ls $GRANITE_MODEL | grep -i llava
-llava.clip
-llava.projector
-```
-
-The projector and visual encoder should now be split out into the llava files. A quick check to make sure they aren't empty:
-```python
-import os
-import torch
-
-MODEL_PATH = os.getenv("GRANITE_MODEL")
-if not MODEL_PATH:
- raise ValueError("env var GRANITE_MODEL is unset!")
-
-encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
-projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))
-
-assert len(encoder_tensors) > 0
-assert len(projector_tensors) > 0
-```
-
-If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
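-
-For example, a minimal way to list the projector keys (a sketch, assuming `GRANITE_MODEL` is set as above):
-```python
-import os
-import torch
-
-MODEL_PATH = os.getenv("GRANITE_MODEL")
-projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))
-# should print the 5 projector tensor names listed above
-print(sorted(projector_tensors.keys()))
-```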
-
-
-### 2. Creating the Visual Component GGUF
-Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.
-
-```bash
-$ ENCODER_PATH=$PWD/visual_encoder
-$ mkdir $ENCODER_PATH
-
-$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
-$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
-```
-
-Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.
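-
-For example, one way to print them (a sketch, assuming `GRANITE_MODEL` is set as above):
-```bash
-$ python -c "import json, os; cfg = json.load(open(os.path.join(os.environ['GRANITE_MODEL'], 'config.json'))); print(cfg['image_grid_pinpoints'])"
-```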
-
-```json
-{
- "_name_or_path": "siglip-model",
- "architectures": [
- "SiglipVisionModel"
- ],
- "image_grid_pinpoints": [
- [384,384],
- [384,768],
- [384,1152],
- [384,1536],
- [384,1920],
- [384,2304],
- [384,2688],
- [384,3072],
- [384,3456],
- [384,3840],
- [768,384],
- [768,768],
- [768,1152],
- [768,1536],
- [768,1920],
- [1152,384],
- [1152,768],
- [1152,1152],
- [1536,384],
- [1536,768],
- [1920,384],
- [1920,768],
- [2304,384],
- [2688,384],
- [3072,384],
- [3456,384],
- [3840,384]
- ],
- "mm_patch_merge_type": "spatial_unpad",
- "hidden_size": 1152,
- "image_size": 384,
- "intermediate_size": 4304,
- "model_type": "siglip_vision_model",
- "num_attention_heads": 16,
- "num_hidden_layers": 27,
- "patch_size": 14,
- "layer_norm_eps": 1e-6,
- "hidden_act": "gelu_pytorch_tanh",
- "projection_dim": 0,
- "vision_feature_layer": [-24, -20, -12, -1]
-}
-```
-
-At this point you should have something like this:
-```bash
-$ ls $ENCODER_PATH
-config.json llava.projector pytorch_model.bin
-```
-
-Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in `preprocessor_config.json`.
-```bash
-$ python convert_image_encoder_to_gguf.py \
- -m $ENCODER_PATH \
- --llava-projector $ENCODER_PATH/llava.projector \
- --output-dir $ENCODER_PATH \
- --clip-model-is-vision \
- --clip-model-is-siglip \
- --image-mean 0.5 0.5 0.5 \
- --image-std 0.5 0.5 0.5
-```
-
-This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
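-
-For example (a small sketch using the variables above):
-```bash
-$ export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
-```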
-
-
-### 3. Creating the LLM GGUF.
-The granite vision model contains a granite LLM as its language model. For now, the easiest way to get a GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted directly with the normal conversion path.
-
-First, set `LLM_EXPORT_PATH` to the directory you want to export the `transformers` LLM to.
-```bash
-$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
-```
-
-```python
-import os
-import transformers
-
-MODEL_PATH = os.getenv("GRANITE_MODEL")
-if not MODEL_PATH:
- raise ValueError("env var GRANITE_MODEL is unset!")
-
-LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
-if not LLM_EXPORT_PATH:
- raise ValueError("env var LLM_EXPORT_PATH is unset!")
-
-tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
-
-# NOTE: granite vision support was added to transformers very recently (4.49);
-# if you get size mismatches, your version is too old.
-# If you are running with an older version, set `ignore_mismatched_sizes=True`
-# as shown below; it won't be loaded correctly, but the LLM part of the model that
-# we are exporting will be loaded correctly.
-model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)
-
-tokenizer.save_pretrained(LLM_EXPORT_PATH)
-model.language_model.save_pretrained(LLM_EXPORT_PATH)
-```
-
-Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
-```bash
-$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
-...
-$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
-```
-
-
-### 4. Quantization
-If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
-```bash
-$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
-$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
-```
-
-Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.
-
-
-### 5. Running the Model in llama.cpp
-Build llama.cpp normally; you should have a target binary named `llama-llava-cli`, to which you can pass the two GGUF files built above. As an example, we pass the llama.cpp banner image.
-
-```bash
-$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
- --mmproj $VISUAL_GGUF_PATH \
- --image ./media/llama0-banner.png \
- -c 16384 \
- -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
- --temp 0
-```
-
-Sample output: `The text in the image reads "LLAMA C++ Can it run DOOM Llama?"`
+++ /dev/null
-## MiniCPM-o 2.6
-Currently, this readme only covers MiniCPM-o's image capabilities; we will add the full omni-mode documentation as soon as possible.
-
-### Prepare models and code
-
-Download the [MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) PyTorch model from Hugging Face into the "MiniCPM-o-2_6" folder.
-
-
-### Build llama.cpp
-Readme modification time: 20250206
-
-If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
-
-Clone llama.cpp:
-```bash
-git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
-```
-
-Build llama.cpp using `CMake`:
-```bash
-cmake -B build
-cmake --build build --config Release
-```
-
-
-### Usage of MiniCPM-o 2.6
-
-Convert the PyTorch model to GGUF files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) files we provide):
-
-```bash
-python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-o-2_6
-python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-o-2_6 --minicpmv-projector ../MiniCPM-o-2_6/minicpmv.projector --output-dir ../MiniCPM-o-2_6/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 4
-python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model
-
-# quantize int4 version
-./build/bin/llama-quantize ../MiniCPM-o-2_6/model/ggml-model-f16.gguf ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf Q4_K_M
-```
-
-
-Inference on Linux or Mac
-```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-
-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-```
+++ /dev/null
-## MiniCPM-Llama3-V 2.5
-
-### Prepare models and code
-
-Download the [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5) PyTorch model from Hugging Face into the "MiniCPM-Llama3-V-2_5" folder.
-
-
-### Build llama.cpp
-Readme modification time: 20250206
-
-If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
-
-Clone llama.cpp:
-```bash
-git clone https://github.com/ggml-org/llama.cpp
-cd llama.cpp
-```
-
-Build llama.cpp using `CMake`:
-```bash
-cmake -B build
-cmake --build build --config Release
-```
-
-
-### Usage of MiniCPM-Llama3-V 2.5
-
-Convert the PyTorch model to GGUF files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) files we provide):
-
-```bash
-python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-Llama3-V-2_5
-python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-Llama3-V-2_5 --minicpmv-projector ../MiniCPM-Llama3-V-2_5/minicpmv.projector --output-dir ../MiniCPM-Llama3-V-2_5/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 2
-python ./convert_hf_to_gguf.py ../MiniCPM-Llama3-V-2_5/model
-
-# quantize int4 version
-./build/bin/llama-quantize ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf Q4_K_M
-```
-
-
-Inference on Linux or Mac
-```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-
-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-```
+++ /dev/null
-## MiniCPM-V 2.6
-
-### Prepare models and code
-
-Download the [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) PyTorch model from Hugging Face into the "MiniCPM-V-2_6" folder.
-
-
-### Build llama.cpp
-Readme modification time: 20250206
-
-If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
-
-Clone llama.cpp:
-```bash
-git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
-```
-
-Build llama.cpp using `CMake`:
-```bash
-cmake -B build
-cmake --build build --config Release
-```
-
-
-### Usage of MiniCPM-V 2.6
-
-Convert the PyTorch model to GGUF files (you can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) files we provide):
-
-```bash
-python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-V-2_6
-python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-V-2_6 --minicpmv-projector ../MiniCPM-V-2_6/minicpmv.projector --output-dir ../MiniCPM-V-2_6/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 3
-python ./convert_hf_to_gguf.py ../MiniCPM-V-2_6/model
-
-# quantize int4 version
-./build/bin/llama-quantize ../MiniCPM-V-2_6/model/ggml-model-f16.gguf ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf Q4_K_M
-```
-
-
-Inference on Linux or Mac
-```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-
-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
-```
-# LLaVA
+# Multimodal Support in llama.cpp
-Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
-as well as llava-1.6 [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.
+This directory provides multimodal capabilities for `llama.cpp`. Initially intended as a showcase for running LLaVA models, it has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.
-The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
-and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
-models are available.
-For llava-1.6 a variety of prepared gguf models are available as well [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)
+> [!IMPORTANT]
+>
+> Multimodal support can be viewed as a sub-project within `llama.cpp`. It is under **very heavy development**, and **breaking changes are expected**.
-After API is confirmed, more models will be supported / uploaded.
+The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:
-## Usage
-Build with cmake or run `make llama-llava-cli` to build it.
+- [#3436](https://github.com/ggml-org/llama.cpp/pull/3436): Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
+- [#4954](https://github.com/ggml-org/llama.cpp/pull/4954): Support for MobileVLM was added, becoming the second vision model supported. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
+- **Expansion & Fragmentation:** Many new models were subsequently added (e.g., [#7599](https://github.com/ggml-org/llama.cpp/pull/7599), [#10361](https://github.com/ggml-org/llama.cpp/pull/10361), [#12344](https://github.com/ggml-org/llama.cpp/pull/12344), and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this proliferation of command-line tools became confusing for users.
+- [#12849](https://github.com/ggml-org/llama.cpp/pull/12849): `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
+- [#13012](https://github.com/ggml-org/llama.cpp/pull/13012): `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.
-After building, run: `./llama-llava-cli` to see the usage. For example:
+## How it works and what is `mmproj`?
-```sh
-./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf --image path/to/an/image.jpg
-```
+Multimodal support in `llama.cpp` works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.
-**note**: A lower temperature like 0.1 is recommended for better quality; add `--temp 0.1` to the command to do so.
-**note**: For GPU offloading, make sure to use the `-ngl` flag as usual.
+This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.
-## LLaVA 1.5
+Consequently, running a multimodal model typically requires two GGUF files:
+1. The standard language model file.
+2. A corresponding **multimodal projector (`mmproj`)** file, which handles the image encoding and projection.
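+
+For example, a typical invocation passes both files to `llama-mtmd-cli` (a sketch; the file names here are placeholders):
+
+```sh
+llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf --image some-image.jpg -p "Describe this image."
+```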
-1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:
+## What is `libmtmd`?
-```sh
-git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
+As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.
-git clone https://huggingface.co/openai/clip-vit-large-patch14-336
-```
+Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:
+- **Unified Interface:** Aims to consolidate interaction for various multimodal models.
+- **Improved UX/DX:** Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
+- **Flexibility:** Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.
-2. Install the required Python packages:
+## How to obtain `mmproj`
-```sh
-pip install -r examples/llava/requirements.txt
-```
+Multimodal projector (`mmproj`) files are specific to each model architecture. Please refer to the relevant guide for instructions on how to obtain or create them:
-3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:
-
-```sh
-python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
-```
-
-4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:
-
-```sh
-python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
-```
-
-5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
-
-```sh
-python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
-```
-
-Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
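-
-Optionally, you can quantize the LLaMA part with `llama-quantize` (a sketch; the converter's output filename and the `q4_k_s` type shown here are examples, so adjust them to what you actually produced):
-
-```sh
-./llama-quantize ../llava-v1.5-7b/ggml-model-f16.gguf ../llava-v1.5-7b/ggml-model-q4_k.gguf q4_k_s
-```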
-
-## LLaVA 1.6 gguf conversion
-1) First clone a LLaVA 1.6 model:
-```console
-git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
-```
-
-2) Install the required Python packages:
-
-```sh
-pip install -r examples/llava/requirements.txt
-```
-
-3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, in both pytorch and safetensor formats:
-```console
-python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
-```
-- you will find a llava.projector and a llava.clip file in your model directory
-
-4) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
-```console
-mkdir vit
-cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
-cp ../llava-v1.6-vicuna-7b/llava.projector vit/
-curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
-```
-
-5) Create the visual gguf model:
-```console
-python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
-```
-- This is similar to llava-1.5; the difference is that we tell the encoder we are working with the pure vision model part of CLIP.
-
-6) Then convert the model to gguf format:
-```console
-python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
-```
-
-7) And finally we can run the llava cli using the 1.6 model version:
-```console
-./llama-llava-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf --image some-image.jpg -c 4096
-```
-
-**note** llava-1.6 needs more context than llava-1.5; at least 3000 tokens are needed (just run it at `-c 4096`).
-
-**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)
-
-**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava-next model.
-
-```python
-import os
-import transformers
-
-model_path = ...
-llm_export_path = ...
-
-tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
-model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)
-
-tokenizer.save_pretrained(llm_export_path)
-model.language_model.save_pretrained(llm_export_path)
-```
-
-Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
-
-## llava-cli templating and llava-1.6 prompting
-
-llava-1.5 models all use the same vicuna prompt; here you can just add your image question, e.g. `-p "Provide a full description."`
-For llava-1.5 models that are not vicuna-based (Mistral and Yi), you need to adapt the system prompt as well as the user prompt; for this purpose, llava-cli has a basic templating system:
-
-**For Mistral and using llava-cli binary:**
-Add this: `-p "<image>\nUSER:\nProvide a full description.\nASSISTANT:\n"`
-The Mistral template for llava-1.6 seems to have no system prompt and a USER/ASSISTANT role.
-
-**For the 34B this should work:**
-Add this: `-e -p <|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nProvide a full description.<|im_end|><|im_start|>assistant\n`
-
-
-## How to know if you are running in llava-1.5 or llava-1.6 mode
-
-When running llava-cli, you will see visual information right before the prompt is processed:
-
-**Llava-1.5:**
-`encode_image_with_clip: image embedding created: 576 tokens`
-
-**Llava-1.6 (anything above 576):**
-`encode_image_with_clip: image embedding created: 2880 tokens`
-
-
-Alternatively, just pay attention to how many "tokens" have been used for your prompt; llava-1.6 will also show 1000+ tokens.
-
-
-
-
-## TODO
-
-- [x] Support non-CPU backend for the image encoding part.
-- [ ] Support different sampling methods.
-- [ ] Support more model variants.
+- [LLaVA](../../docs/multimodal/llava.md)
+- [MobileVLM](../../docs/multimodal/MobileVLM.md)
+- [GLM-Edge](../../docs/multimodal/glmedge.md)
+- [MiniCPM-V 2.5](../../docs/multimodal/minicpmv2.5.md)
+- [MiniCPM-V 2.6](../../docs/multimodal/minicpmv2.6.md)
+- [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
+- [IBM Granite Vision](../../docs/multimodal/granitevision.md)
+- [Google Gemma 3](../../docs/multimodal/gemma3.md)
# prompt="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
program_dir="build_64/bin"
-binName="llama-llava-cli"
+binName="llama-mtmd-cli"
n_threads=4
+++ /dev/null
-import gguf
-import argparse
-import logging
-import sys
-import torch
-import json
-import os
-import numpy as np
-from typing import cast, ContextManager, Any, Iterator
-from pathlib import Path
-from torch import Tensor
-
-logger = logging.getLogger("gemma3-mmproj")
-
-
-# (copied from convert_hf_to_gguf.py)
-# tree of lazy tensors
-class LazyTorchTensor(gguf.LazyBase):
- _tensor_type = torch.Tensor
- # to keep the type-checker happy
- dtype: torch.dtype
- shape: torch.Size
-
- # only used when converting a torch.Tensor to a np.ndarray
- _dtype_map: dict[torch.dtype, type] = {
- torch.float16: np.float16,
- torch.float32: np.float32,
- }
-
- # used for safetensors slices
- # ref: https://github.com/huggingface/safetensors/blob/079781fd0dc455ba0fe851e2b4507c33d0c0d407/bindings/python/src/lib.rs#L1046
- # TODO: uncomment U64, U32, and U16, ref: https://github.com/pytorch/pytorch/issues/58734
- _dtype_str_map: dict[str, torch.dtype] = {
- "F64": torch.float64,
- "F32": torch.float32,
- "BF16": torch.bfloat16,
- "F16": torch.float16,
- # "U64": torch.uint64,
- "I64": torch.int64,
- # "U32": torch.uint32,
- "I32": torch.int32,
- # "U16": torch.uint16,
- "I16": torch.int16,
- "U8": torch.uint8,
- "I8": torch.int8,
- "BOOL": torch.bool,
- "F8_E4M3": torch.float8_e4m3fn,
- "F8_E5M2": torch.float8_e5m2,
- }
-
- def numpy(self) -> gguf.LazyNumpyTensor:
- dtype = self._dtype_map[self.dtype]
- return gguf.LazyNumpyTensor(
- meta=gguf.LazyNumpyTensor.meta_with_dtype_and_shape(dtype, self.shape),
- args=(self,),
- func=(lambda s: s.numpy())
- )
-
- @classmethod
- def meta_with_dtype_and_shape(cls, dtype: torch.dtype, shape: tuple[int, ...]) -> Tensor:
- return torch.empty(size=shape, dtype=dtype, device="meta")
-
- @classmethod
- def from_safetensors_slice(cls, st_slice: Any) -> Tensor:
- dtype = cls._dtype_str_map[st_slice.get_dtype()]
- shape: tuple[int, ...] = tuple(st_slice.get_shape())
- lazy = cls(meta=cls.meta_with_dtype_and_shape(dtype, shape), args=(st_slice,), func=lambda s: s[:])
- return cast(torch.Tensor, lazy)
-
- @classmethod
- def __torch_function__(cls, func, types, args=(), kwargs=None):
- del types # unused
-
- if kwargs is None:
- kwargs = {}
-
- if func is torch.Tensor.numpy:
- return args[0].numpy()
-
- return cls._wrap_fn(func)(*args, **kwargs)
-
-
-class Gemma3VisionTower:
- hparams: dict
- gguf_writer: gguf.GGUFWriter
- fname_out: Path
- ftype: gguf.LlamaFileType
-
- @staticmethod
- def load_hparams(dir_model: Path):
- with open(dir_model / "config.json", "r", encoding="utf-8") as f:
- return json.load(f)
-
- @staticmethod
- def get_model_part_names(dir_model: Path, prefix: str, suffix: str) -> list[str]:
- part_names: list[str] = []
- for filename in os.listdir(dir_model):
- if filename.startswith(prefix) and filename.endswith(suffix):
- part_names.append(filename)
- part_names.sort()
- return part_names
-
- def __init__(self,
- dir_model: Path,
- fname_out: Path,
- ftype: gguf.LlamaFileType,
- is_big_endian: bool,):
- hparams = Gemma3VisionTower.load_hparams(dir_model)
- self.hparams = hparams
- self.fname_out = fname_out
- self.ftype = ftype
- endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE
- self.gguf_writer = gguf.GGUFWriter(path=None, arch="clip", endianess=endianess)
-
- text_config = hparams["text_config"]
- vision_config = hparams["vision_config"]
-
- assert hparams["architectures"][0] == "Gemma3ForConditionalGeneration"
- assert text_config is not None
- assert vision_config is not None
-
- self.gguf_writer.add_string ("clip.projector_type", "gemma3")
- self.gguf_writer.add_bool ("clip.has_text_encoder", False)
- self.gguf_writer.add_bool ("clip.has_vision_encoder", True)
- self.gguf_writer.add_bool ("clip.has_llava_projector", False) # legacy
- self.gguf_writer.add_uint32 ("clip.vision.image_size", vision_config["image_size"])
- self.gguf_writer.add_uint32 ("clip.vision.patch_size", vision_config["patch_size"])
- self.gguf_writer.add_uint32 ("clip.vision.embedding_length", vision_config["hidden_size"])
- self.gguf_writer.add_uint32 ("clip.vision.feed_forward_length", vision_config["intermediate_size"])
- self.gguf_writer.add_uint32 ("clip.vision.projection_dim", text_config["hidden_size"])
- self.gguf_writer.add_uint32 ("clip.vision.block_count", vision_config["num_hidden_layers"])
- self.gguf_writer.add_uint32 ("clip.vision.attention.head_count", vision_config["num_attention_heads"])
- self.gguf_writer.add_float32("clip.vision.attention.layer_norm_epsilon", vision_config.get("layer_norm_eps", 1e-6))
- # default values taken from HF tranformers code
- self.gguf_writer.add_array ("clip.vision.image_mean", [0.5, 0.5, 0.5])
- self.gguf_writer.add_array ("clip.vision.image_std", [0.5, 0.5, 0.5])
- self.gguf_writer.add_bool ("clip.use_gelu", True)
-
- # load tensors
- for name, data_torch in self.get_tensors(dir_model):
- # convert any unsupported data types to float32
- if data_torch.dtype not in (torch.float16, torch.float32):
- data_torch = data_torch.to(torch.float32)
- self.add_tensor(name, data_torch)
-
- def get_tensors(self, dir_model: Path) -> Iterator[tuple[str, Tensor]]:
- part_names = Gemma3VisionTower.get_model_part_names(dir_model, "model", ".safetensors")
- tensor_names_from_parts: set[str] = set()
- for part_name in part_names:
- logger.info(f"gguf: loading model part '{part_name}'")
- from safetensors import safe_open
- ctx = cast(ContextManager[Any], safe_open(dir_model / part_name, framework="pt", device="cpu"))
- with ctx as model_part:
- tensor_names_from_parts.update(model_part.keys())
-
- for name in model_part.keys():
- data = model_part.get_slice(name)
- data = LazyTorchTensor.from_safetensors_slice(data)
- yield name, data
-
- def add_tensor(self, name: str, data_torch: Tensor):
- is_1d = len(data_torch.shape) == 1
- is_embd = ".embeddings." in name
- old_dtype = data_torch.dtype
- can_quantize = not is_1d and not is_embd
- data_qtype = gguf.GGMLQuantizationType.F32
-
- # this is to support old checkpoint
- # TODO: remove this when we have the final model
- name = name.replace("vision_model.vision_model.", "vision_tower.vision_model.")
- name = name.replace("multimodal_projector.", "multi_modal_projector.")
-
- # filter only vision tensors
- if not name.startswith("vision_tower.vision_model.") and not name.startswith("multi_modal_projector."):
- return
- # prefix
- name = name.replace("vision_tower.vision_model.encoder.layers.", "v.blk.")
- name = name.replace("vision_tower.vision_model.", "v.")
- # projector and input embd
- name = name.replace(".embeddings.patch_embedding.", ".patch_embd.")
- name = name.replace(".embeddings.position_embedding.", ".position_embd.")
- name = name.replace(
- "multi_modal_projector.mm_input_projection_weight",
- "mm.input_projection.weight"
- )
- name = name.replace(
- "multi_modal_projector.mm_soft_emb_norm.weight",
- "mm.soft_emb_norm.weight"
- )
- name = name.replace("post_layernorm.", "post_ln.")
- # each block
- name = name.replace(".self_attn.k_proj.", ".attn_k.")
- name = name.replace(".self_attn.v_proj.", ".attn_v.")
- name = name.replace(".self_attn.q_proj.", ".attn_q.")
- name = name.replace(".self_attn.out_proj.", ".attn_out.")
- name = name.replace(".layer_norm1.", ".ln1.")
- name = name.replace(".layer_norm2.", ".ln2.")
- name = name.replace(".mlp.fc1.", ".ffn_down.")
- name = name.replace(".mlp.fc2.", ".ffn_up.")
-
- if can_quantize:
- if self.ftype == gguf.LlamaFileType.ALL_F32:
- data_qtype = gguf.GGMLQuantizationType.F32
- elif self.ftype == gguf.LlamaFileType.MOSTLY_F16:
- data_qtype = gguf.GGMLQuantizationType.F16
- elif self.ftype == gguf.LlamaFileType.MOSTLY_BF16:
- data_qtype = gguf.GGMLQuantizationType.BF16
- elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0:
- data_qtype = gguf.GGMLQuantizationType.Q8_0
- else:
- raise ValueError(f"Unsupported file type: {self.ftype}")
-
- # corrent norm value ; only this "soft_emb_norm" need to be corrected as it's part of Gemma projector
- # the other norm values are part of SigLIP model, and they are already correct
- # ref code: Gemma3RMSNorm
- if "soft_emb_norm.weight" in name:
- logger.info(f"Correcting norm value for '{name}'")
- data_torch = data_torch + 1
-
- data = data_torch.numpy()
-
- try:
- data = gguf.quants.quantize(data, data_qtype)
- except Exception as e:
- logger.error(f"Error quantizing tensor '{name}': {e}, fallback to F16")
- data_qtype = gguf.GGMLQuantizationType.F16
- data = gguf.quants.quantize(data, data_qtype)
-
- # reverse shape to make it similar to the internal ggml dimension order
- shape_str = f"{{{', '.join(str(n) for n in reversed(data_torch.shape))}}}"
- logger.info(f"{f'%-32s' % f'{name},'} {old_dtype} --> {data_qtype.name}, shape = {shape_str}")
-
- self.gguf_writer.add_tensor(name, data, raw_dtype=data_qtype)
-
- def write(self):
- self.gguf_writer.write_header_to_file(path=self.fname_out)
- self.gguf_writer.write_kv_data_to_file()
- self.gguf_writer.write_tensors_to_file(progress=True)
- self.gguf_writer.close()
-
-def parse_args() -> argparse.Namespace:
- parser = argparse.ArgumentParser(
- description="Convert Gemma 3 vision tower safetensors to GGUF format",)
- parser.add_argument(
- "--outfile", type=Path, default="mmproj.gguf",
- help="path to write to",
- )
- parser.add_argument(
- "--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0"], default="f16",
- help="output format",
- )
- parser.add_argument(
- "--bigendian", action="store_true",
- help="model is executed on big endian machine",
- )
- parser.add_argument(
- "model", type=Path,
- help="directory containing model file",
- nargs="?",
- )
- parser.add_argument(
- "--verbose", action="store_true",
- help="increase output verbosity",
- )
-
- args = parser.parse_args()
- if args.model is None:
- parser.error("the following arguments are required: model")
- return args
-
-
-def main() -> None:
- args = parse_args()
-
- if args.verbose:
- logging.basicConfig(level=logging.DEBUG)
- else:
- logging.basicConfig(level=logging.INFO)
-
- dir_model = args.model
-
- if not dir_model.is_dir():
- logger.error(f'Error: {args.model} is not a directory')
- sys.exit(1)
-
- ftype_map: dict[str, gguf.LlamaFileType] = {
- "f32": gguf.LlamaFileType.ALL_F32,
- "f16": gguf.LlamaFileType.MOSTLY_F16,
- "bf16": gguf.LlamaFileType.MOSTLY_BF16,
- "q8_0": gguf.LlamaFileType.MOSTLY_Q8_0,
- }
-
- logger.info(f"Loading model: {dir_model.name}")
-
- with torch.inference_mode():
- gemma3_vision_tower = Gemma3VisionTower(
- dir_model=dir_model,
- fname_out=args.outfile,
- ftype=ftype_map[args.outtype],
- is_big_endian=args.bigendian,
- )
- gemma3_vision_tower.write()
-
-
-if __name__ == '__main__':
- main()
-