git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
AlpinDale [Fri, 23 Feb 2024 19:31:54 +0000 (19:31 +0000)]
server : add KV cache quantization options (#5684)
Jared Van Bortel [Fri, 23 Feb 2024 18:39:14 +0000 (13:39 -0500)]
convert : fix missing ftype for gemma (#5690)
Jared Van Bortel [Thu, 22 Feb 2024 22:05:23 +0000 (17:05 -0500)]
mpt : do not duplicate token_embd.weight on disk (#5670)
Georgi Gerganov [Thu, 22 Feb 2024 21:23:46 +0000 (23:23 +0200)]
gemma : use more bits for the token_embd.weight tensor (#5650)
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
Georgi Gerganov [Thu, 22 Feb 2024 21:22:48 +0000 (23:22 +0200)]
py : add Gemma conversion from HF models (#5647)
* py : add gemma conversion from HF models
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Aarni Koskela <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Georgi Gerganov [Thu, 22 Feb 2024 21:21:39 +0000 (23:21 +0200)]
ggml : always define ggml_fp16_t as uint16_t (#5666)
* ggml : always define ggml_fp16_t as uint16_t
ggml-ci
* ggml : cont
ggml-ci
* ggml : cont
* ggml : cont
ggml-ci
* ggml : cont
ggml-ci
* cuda : no longer ggml headers last
ggml-ci
* ggml : fix q6_K FP16 -> FP32 conversion
ggml-ci
* ggml : more FP16 -> FP32 conversion fixes
ggml-ci
Georgi Gerganov [Thu, 22 Feb 2024 21:21:05 +0000 (23:21 +0200)]
sync : ggml
Georgi Gerganov [Thu, 22 Feb 2024 16:31:40 +0000 (18:31 +0200)]
ggml : 32-bit arm compat (whisper/1891)
* ggml : 32-bit arm compat
* ggml : add ggml_vqtbl1q_s8 impl
* ggml : cont
Someone [Thu, 22 Feb 2024 19:44:10 +0000 (19:44 +0000)]
nix: init singularity and docker images (#5056)
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
Georgi Gerganov [Thu, 22 Feb 2024 18:13:25 +0000 (20:13 +0200)]
py : minor fixes (#5668)
Xuan Son Nguyen [Thu, 22 Feb 2024 18:10:21 +0000 (19:10 +0100)]
Add Gemma chat template (#5665)
* add gemma chat template
* gemma: only apply system_prompt on non-model message
Someone [Thu, 22 Feb 2024 16:32:09 +0000 (16:32 +0000)]
workflows: nix: hardcode cachix ids, build unconditionally (#5663)
GitHub does not expose environment and repository variables to PRs coming from forks, which means we have effectively been disabling the Nix CI actions for most PRs.
The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
Georgi Gerganov [Thu, 22 Feb 2024 11:54:03 +0000 (13:54 +0200)]
minor : fix trailing whitespace (#5638)
Georgi Gerganov [Thu, 22 Feb 2024 08:35:54 +0000 (10:35 +0200)]
readme : update hot topics
Xuan Son Nguyen [Thu, 22 Feb 2024 08:33:24 +0000 (09:33 +0100)]
server : fallback to chatml, add AlphaMonarch chat template (#5628)
* server: fallback to chatml
* add new chat template
* server: add AlphaMonarch to test chat template
* server: only check model template if there is no custom tmpl
* remove TODO
Alexey Parfenov [Thu, 22 Feb 2024 08:27:32 +0000 (08:27 +0000)]
server : clarify some params in the docs (#5640)
Dat Quoc Nguyen [Thu, 22 Feb 2024 08:15:13 +0000 (18:15 +1000)]
mpt : add optional bias tensors (#5638)
Update MPT to support optional bias parameters, so it works with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
slaren [Wed, 21 Feb 2024 23:42:09 +0000 (00:42 +0100)]
llama : fix loading models with shared tok_embd and output (#5651)
ggml-ci
Xuan Son Nguyen [Wed, 21 Feb 2024 23:31:00 +0000 (00:31 +0100)]
Add docs for llama_chat_apply_template (#5645)
* add docs for llama_chat_apply_template
* fix typo
slaren [Wed, 21 Feb 2024 21:52:39 +0000 (22:52 +0100)]
llama : fix session save/load with quantized KV (#5649)
slaren [Wed, 21 Feb 2024 21:18:23 +0000 (22:18 +0100)]
gemma : allow offloading the output tensor (#5646)
Jared Van Bortel [Wed, 21 Feb 2024 15:33:54 +0000 (10:33 -0500)]
examples : do not assume BOS when shifting context (#5622)
Georgi Gerganov [Wed, 21 Feb 2024 14:52:39 +0000 (16:52 +0200)]
sync : ggml
Pierrick Hymbert [Wed, 21 Feb 2024 14:47:48 +0000 (15:47 +0100)]
server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning: task.target_id not initialized.
Ettore Di Giacinto [Wed, 21 Feb 2024 14:39:10 +0000 (15:39 +0100)]
readme : add LocalAI to the available UIs (#5629)
Georgi Gerganov [Wed, 21 Feb 2024 14:17:10 +0000 (16:17 +0200)]
sync : ggml (#5633)
* ggml : fix conv_2d batch mode (ggml/737)
Co-authored-by: bssrdf <redacted>
* ggml : compute forward no longer pass src tensors (ggml/729)
* sync : ggml
ggml-ci
---------
Co-authored-by: bssrdf <redacted>
Co-authored-by: bssrdf <redacted>
Georgi Gerganov [Wed, 21 Feb 2024 13:39:54 +0000 (15:39 +0200)]
readme : update hot topics
Daniel Bevenius [Wed, 21 Feb 2024 13:36:57 +0000 (14:36 +0100)]
llava : add --skip-unknown to 1.6 convert.py (#5632)
This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints to avoid updating
possibly checked out files.
The motivation for this change is that this was done for 1.5 in commit
fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava : update surgery script
to not remove tensors") and makes the examples more consistent.
Signed-off-by: Daniel Bevenius <redacted>
postmasters [Wed, 21 Feb 2024 13:08:22 +0000 (05:08 -0800)]
llama : add `gemma` model (#5631)
There are a couple of things in this architecture:
1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.
More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
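A minimal sketch of the weight sharing described in point 1, assuming an illustrative loader where a missing output tensor falls back to the token embedding (names here are hypothetical, not the actual llama.cpp loader code):
```cpp
// Sketch: tie input and output embeddings when the checkpoint ships a single
// embedding matrix that serves as both (as described for Gemma above).
struct toy_model {
    const float * tok_embd;  // [n_vocab * n_embd] input embedding
    const float * output;    // output projection, may alias tok_embd
};

toy_model load_tied(const float * tok_embd, const float * output_opt) {
    toy_model m;
    m.tok_embd = tok_embd;
    // No separate output tensor on disk: reuse the token embedding.
    m.output   = output_opt ? output_opt : tok_embd;
    return m;
}
```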
Meng, Hengyu [Wed, 21 Feb 2024 09:52:06 +0000 (17:52 +0800)]
[SYCL] context add name (#5624)
* [SYCL] context add name
* name should start with SYCL*
Kawrakow [Wed, 21 Feb 2024 09:39:52 +0000 (11:39 +0200)]
IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets
---------
Co-authored-by: Iwan Kawrakow <redacted>
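A minimal sketch of the IQ4_NL idea described above, assuming one scale plus 32 packed 4-bit indices into a fixed non-linear codebook per block of 32 weights (the codebook values and nibble layout are illustrative, not the actual kernels):
```cpp
#include <cstdint>

// Illustrative 16-entry non-linear codebook; the real kvalues_iq4nl may differ.
static const int8_t kvalues_nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Dequantize one block of 32 weights: qs holds 16 bytes (32 nibbles),
// d is the per-block scale decoded from the stored half-precision value.
void dequantize_block_nl(const uint8_t * qs, float d, float * y) {
    for (int i = 0; i < 16; ++i) {
        y[i]      = d * kvalues_nl[qs[i] & 0x0F];  // low nibble
        y[i + 16] = d * kvalues_nl[qs[i] >> 4];    // high nibble
    }
}
```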
CJ Pais [Tue, 20 Feb 2024 19:07:22 +0000 (11:07 -0800)]
server : support llava 1.6 (#5553)
* server: init working 1.6
* move clip_image to header
* remove commented code
* remove c++ style from header
* remove todo
* expose llava_image_embed_make_with_clip_img
* fix zig build
slaren [Tue, 20 Feb 2024 19:06:17 +0000 (20:06 +0100)]
make : fix debug build with CUDA (#5616)
Daniel Bevenius [Tue, 20 Feb 2024 17:30:27 +0000 (18:30 +0100)]
llava : add explicit instructions for llava-1.6 (#5611)
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.
The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <redacted>
Xuan Son Nguyen [Tue, 20 Feb 2024 14:58:27 +0000 (15:58 +0100)]
Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template
* server: remove trailing space
* server: fix format_chat
* server: fix help message
Co-authored-by: Georgi Gerganov <redacted>
* server: fix formatted_chat
---------
Co-authored-by: Georgi Gerganov <redacted>
Dane Madsen [Tue, 20 Feb 2024 10:00:23 +0000 (21:00 +1100)]
readme : update UI list (#5605)
* Add maid to ui list
* Specify licence
Haoxiang Fei [Tue, 20 Feb 2024 09:58:36 +0000 (22:58 -1100)]
metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library
* Update Makefile
---------
Co-authored-by: Haoxiang Fei <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Pierrick Hymbert [Tue, 20 Feb 2024 07:48:19 +0000 (08:48 +0100)]
server : health endpoint configurable failure on no slot (#5594)
AidanBeltonS [Tue, 20 Feb 2024 07:01:25 +0000 (07:01 +0000)]
Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q
* Apply suggestions from code review
Co-authored-by: Abhilash Majumder <redacted>
* revert suggestion on macro
* fix bug
* Add quant type GGML_TYPE_IQ1_S to unsupported
* fix format
---------
Co-authored-by: Abhilash Majumder <redacted>
Mathijs de Bruin [Tue, 13 Feb 2024 20:28:02 +0000 (20:28 +0000)]
nix: now that we can do so, allow MacOS to build Vulkan binaries
Author: Philip Taron <redacted>
Date: Tue Feb 13 20:28:02 2024 +0000
0cc4m [Sat, 10 Feb 2024 21:18:33 +0000 (22:18 +0100)]
Enable Vulkan MacOS CI
0cc4m [Wed, 14 Feb 2024 19:57:17 +0000 (20:57 +0100)]
Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init()
0cc4m [Sat, 10 Feb 2024 21:14:52 +0000 (22:14 +0100)]
Add check for VK_KHR_portability_enumeration for MoltenVK support
Mathijs de Bruin [Tue, 6 Feb 2024 14:39:22 +0000 (14:39 +0000)]
Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
Mathijs de Bruin [Sat, 3 Feb 2024 18:00:11 +0000 (18:00 +0000)]
Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
Mathijs de Bruin [Sat, 3 Feb 2024 17:56:46 +0000 (17:56 +0000)]
Allow for Vulkan build with Accelerate.
Closes #5304
slaren [Mon, 19 Feb 2024 22:40:26 +0000 (23:40 +0100)]
cuda : ignore peer access already enabled errors (#5597)
* cuda : ignore peer access already enabled errors
* fix hip
Jared Van Bortel [Mon, 19 Feb 2024 20:54:12 +0000 (15:54 -0500)]
make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598)
nopperl [Mon, 19 Feb 2024 14:14:07 +0000 (14:14 +0000)]
examples : support minItems/maxItems in JSON grammar converter (#5039)
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Mon, 19 Feb 2024 13:23:17 +0000 (15:23 +0200)]
llava : remove extra cont (#5587)
slaren [Mon, 19 Feb 2024 13:02:36 +0000 (14:02 +0100)]
llava : replace ggml_cpy with ggml_cont
Georgi Gerganov [Mon, 19 Feb 2024 12:54:21 +0000 (14:54 +0200)]
sync : ggml
ggml-ci
Georgi Gerganov [Mon, 19 Feb 2024 12:53:48 +0000 (14:53 +0200)]
ggml-alloc : apply ggml/731
Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]
metal : option to embed MSL source into compiled binary (whisper/1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate Metal library embedding assembly on the fly during the build process
Georgi Gerganov [Mon, 19 Feb 2024 12:45:41 +0000 (14:45 +0200)]
ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler
ggml-ci
* make, cmake : enable CUDA errors on warnings
ggml-ci
Georgi Gerganov [Mon, 19 Feb 2024 11:41:51 +0000 (13:41 +0200)]
make : fix CUDA build (#5580)
valiray [Mon, 19 Feb 2024 10:37:10 +0000 (02:37 -0800)]
readme : fix typo in README-sycl.md (#5353)
Abhilash Majumder [Mon, 19 Feb 2024 09:15:18 +0000 (14:45 +0530)]
cmake : remove obsolete sycl compile flags (#5581)
* rm unwanted sycl compile options
* fix bug
* fix bug
* format fix
Georgi Gerganov [Mon, 19 Feb 2024 08:34:10 +0000 (10:34 +0200)]
minor : fix trailing whitespace (#5538)
Daniel Bevenius [Mon, 19 Feb 2024 08:31:59 +0000 (09:31 +0100)]
llava : avoid changing the original BakLLaVA model (#5577)
This is a follow-up to commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors"), but this time
the change is to the BakLLaVA-specific part of the surgery script.
I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <redacted>
NawafAlansari [Mon, 19 Feb 2024 08:25:38 +0000 (03:25 -0500)]
baby-llama : allocate graphs in ggml_context (#5573)
* Fixed the baby-llama issue (see issue #4830)
* minor : fix whitespaces
---------
Co-authored-by: Georgi Gerganov <redacted>
Xuan Son Nguyen [Mon, 19 Feb 2024 08:23:37 +0000 (09:23 +0100)]
llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template
* test-chat-template: remove redundant vector
* chat_template: do not use std::string for buffer
* add clarification for llama_chat_apply_template
* llama_chat_apply_template: add zephyr template
* llama_chat_apply_template: correct docs
* llama_chat_apply_template: use term "chat" everywhere
* llama_chat_apply_template: change variable name to "tmpl"
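A usage sketch for the new API, written from the commit description above; the exact signature and error conventions should be checked against llama.h:
```cpp
#include <string>
#include <vector>
#include "llama.h"

// Format a short conversation with the model's built-in chat template
// (pass a template string instead of nullptr to override it).
std::string format_chat(const llama_model * model) {
    std::vector<llama_chat_message> chat = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };
    std::vector<char> buf(4096);
    int32_t n = llama_chat_apply_template(model, /*tmpl=*/nullptr,
                                          chat.data(), chat.size(),
                                          /*add_ass=*/true,
                                          buf.data(), (int32_t) buf.size());
    if (n < 0) {
        return "";  // template not recognized (per the commit description)
    }
    if (n > (int32_t) buf.size()) {
        buf.resize(n);  // result did not fit, retry with a larger buffer
        n = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                      true, buf.data(), (int32_t) buf.size());
    }
    return std::string(buf.data(), n);
}
```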
slaren [Mon, 19 Feb 2024 08:04:45 +0000 (09:04 +0100)]
cuda, metal : fix nans in soft_max (#5574)
* cuda : fix nans in soft_max
* metal : fix nans in soft_max
---------
Co-authored-by: Georgi Gerganov <redacted>
Mirko185 [Mon, 19 Feb 2024 07:39:31 +0000 (08:39 +0100)]
readme : update (#5572)
Added 1.5-bit quantization to README.md
bmwl [Mon, 19 Feb 2024 07:38:32 +0000 (23:38 -0800)]
ggml : android and old glibc NUMA incompatibility bugfixes (#5557)
* #ifdef out some code NUMA blocks for Android due to lack of support
* added some __ANDROID__ #ifdef gates around NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper
* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc
* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways
---------
Co-authored-by: root <redacted>
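A sketch of the glibc getcpu fallback mentioned in the second bullet, under the assumption that the wrapper is only available from glibc 2.29 onward (illustrative helper, not the exact patch):
```cpp
#if defined(__gnu_linux__)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

// Return the CPU/node the calling thread is running on; use the raw syscall
// on glibc older than 2.29, where the getcpu() wrapper does not exist.
static int getcpu_compat(unsigned int * cpu, unsigned int * node) {
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 29))
    return getcpu(cpu, node);                              // wrapper available
#else
    return (int) syscall(SYS_getcpu, cpu, node, nullptr);  // raw syscall fallback
#endif
}
#endif // __gnu_linux__
```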
Jared Van Bortel [Sun, 18 Feb 2024 21:21:52 +0000 (16:21 -0500)]
build : pass all warning flags to nvcc via -Xcompiler (#5570)
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
Georgi Gerganov [Sun, 18 Feb 2024 20:58:57 +0000 (22:58 +0200)]
ggml : restore vec dot stride arg names (#5453)
Georgi Gerganov [Sun, 18 Feb 2024 20:39:30 +0000 (22:39 +0200)]
ci : fix wikitext url + compile warnings (#5569)
ggml-ci
Georgi Gerganov [Sun, 18 Feb 2024 19:39:58 +0000 (21:39 +0200)]
metal : fix unused warnings (#0)
Robey Holderith [Sun, 18 Feb 2024 19:11:16 +0000 (11:11 -0800)]
common, server : surface min_keep as its own parameter (#5567)
* Feature - surface min_keep as its own parameter
* Updated README with min_keep param
Pierrick Hymbert [Sun, 18 Feb 2024 17:39:57 +0000 (18:39 +0100)]
server : slots monitoring endpoint (#5550)
Georgi Gerganov [Sun, 18 Feb 2024 17:38:06 +0000 (19:38 +0200)]
sampling : do not set min_keep to n_probs (#5564)
Georgi Gerganov [Sun, 18 Feb 2024 17:17:00 +0000 (19:17 +0200)]
cmake : fix GGML_USE_SYCL typo (#5555)
Pierrick Hymbert [Sun, 18 Feb 2024 16:31:28 +0000 (17:31 +0100)]
server : enhanced health endpoint (#5548)
* server: enrich health endpoint with available slots, return 503 if no slots are available
* server: document the new "no slot available" status in the README.md
Pierrick Hymbert [Sun, 18 Feb 2024 16:30:09 +0000 (17:30 +0100)]
server : --n-predict option document and cap to max value (#5549)
* server: document --n-predict
* server: ensure client request cannot override n_predict if set
* server: fix print usage LF in new --n-predict option
Daniel Hiltgen [Sun, 18 Feb 2024 16:23:16 +0000 (08:23 -0800)]
server : graceful server shutdown (#5244)
This updates the server queue to support graceful shutdown of the server on signals.
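A minimal sketch of signal-triggered graceful shutdown, with hypothetical queue calls standing in for the server's actual task queue:
```cpp
#include <csignal>

static volatile std::sig_atomic_t shutdown_requested = 0;

static void on_signal(int) { shutdown_requested = 1; }

int main() {
    std::signal(SIGINT,  on_signal);
    std::signal(SIGTERM, on_signal);
    while (!shutdown_requested) {
        // queue.process_next_task();   // hypothetical: handle queued work
    }
    // queue.finish_pending_tasks();    // hypothetical: drain in-flight tasks
    return 0;                           // exit cleanly instead of being killed
}
```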
Georgi Gerganov [Sun, 18 Feb 2024 16:21:52 +0000 (18:21 +0200)]
common : fix ub (#5530)
Herman Semenov [Sun, 18 Feb 2024 16:20:12 +0000 (16:20 +0000)]
ggml, common, examples, tests : fixed type arguments in printf (#5528)
Daniel Bevenius [Sun, 18 Feb 2024 16:19:23 +0000 (17:19 +0100)]
llava : update surgery script to not remove tensors (#5536)
This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.
The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and have to be checked out again to reset this effect. If this can be
avoided I think it would be preferable.
I did not perform this change for BakLLaVA models as I am not sure
how that part works.
Kawrakow [Sun, 18 Feb 2024 16:16:55 +0000 (18:16 +0200)]
1.5 bit quantization (#5453)
* iq1_s: WIP basics
* iq1_s: CUDA is working
* iq1_s: scalar CPU dot product
* iq1_s: WIP AVX2 dot product - something is not right
* Fix tests
* Fix shadow warnings
* Fix after merge with latest master
* iq1_s: AVX2 finally works
* iq1_s: ARM_NEON dot product. Works, but not very fast
* iq1_s: better grid
* iq1_s: use IQ2_XXS for attn_output
At a cost of 0.04 extra bpw this gives a big improvement in PPL.
* iq1_s: Metal basics
Dequantize works, but not dot product
* iq1_s: Metal works, but quite slow
As usual, Apple Silicon does not like the code I write.
* iq1_s: Tests
* iq1_s: slightly faster dot product
---------
Co-authored-by: Iwan Kawrakow <redacted>
github-actions[bot] [Sun, 18 Feb 2024 00:17:07 +0000 (00:17 +0000)]
flake.lock: Update
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
→ 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
Georgi Gerganov [Sat, 17 Feb 2024 21:04:16 +0000 (23:04 +0200)]
ggml : add ALiBi support for ggml_soft_max_ext (#5488)
* ggml : avoid recomputing alibi slopes (CPU)
* llama : reuse hparams.f_max_alibi_bias in all cases
ggml-ci
* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)
ggml-ci
* ggml : handle all SRCs (do not break on first null)
ggml-ci
* tests : do not use slope for large soft_max
accumulates too much error
ggml-ci
* ggml : alternative ALiBi without extra tensor
We compute the slopes in the kernel
ggml-ci
* cuda : add ALiBi support in ggml_soft_max_ext
ggml-ci
* ggml : deprecate ggml_alibi
* ggml : support multi-sequence ALiBi (Metal)
ggml-ci
* cuda : add multi-seq ALiBi + remove F16 soft_max
ggml-ci
* ggml : update deprecation message
* ggml : fix pos ptr when no ALiBi
ggml-ci
* cuda : fix performance (pow -> powf)
* cuda : precompute ALiBi constants
* metal : pre-compute ALiBi slopes
ggml-ci
* llama : init kq_pos only if needed
ggml-ci
* test-backend-ops : add null pos test to soft_max
test-backend-ops : replace soft_max tests
ggml-ci
---------
Co-authored-by: slaren <redacted>
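A sketch of the per-head ALiBi slope computation referenced in these commits, assuming the standard formulation where the first power-of-two block of heads uses slopes m0^(h+1) and the remaining heads use interpolated slopes m1^(2k+1); the actual kernels compute this in place:
```cpp
#include <cmath>
#include <vector>

// Precompute one ALiBi slope per attention head for a given maximum bias.
std::vector<float> alibi_slopes(int n_head, float max_bias = 8.0f) {
    const int n_head_log2 = 1 << (int) std::floor(std::log2((float) n_head));
    const float m0 = std::pow(2.0f, -(max_bias)        / n_head_log2);
    const float m1 = std::pow(2.0f, -(max_bias / 2.0f) / n_head_log2);

    std::vector<float> slopes(n_head);
    for (int h = 0; h < n_head; ++h) {
        slopes[h] = h < n_head_log2
            ? std::pow(m0, (float) (h + 1))
            : std::pow(m1, (float) (2*(h - n_head_log2) + 1));
    }
    return slopes;
}
```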
Ananta Bastola [Sat, 17 Feb 2024 21:03:14 +0000 (16:03 -0500)]
ci : add an option to fail on compile warning (#3952)
* feat(ci): add an option to fail on compile warning
* Update CMakeLists.txt
* minor : fix compile warnings
ggml-ci
* ggml : fix unreachable code warnings
ggml-ci
* ci : disable fatal warnings for windows, ios and tvos
* ggml : fix strncpy warning
* ci : disable fatal warnings for MPI build
* ci : add fatal warnings to ggml-ci
ggml-ci
---------
Co-authored-by: Georgi Gerganov <redacted>
clibdev [Sat, 17 Feb 2024 16:28:37 +0000 (18:28 +0200)]
gitignore : update for CLion IDE (#5544)
Georgi Gerganov [Fri, 16 Feb 2024 17:05:56 +0000 (19:05 +0200)]
cmake : fix VULKAN and ROCm builds (#5525)
* cmake : fix VULKAN and ROCm builds
* cmake : fix (cont)
* vulkan : fix compile warnings
ggml-ci
* cmake : fix
ggml-ci
* cmake : minor
ggml-ci
Georgi Gerganov [Fri, 16 Feb 2024 13:14:40 +0000 (15:14 +0200)]
scripts : add helpers script for bench comparing commits (#5521)
* scripts : add helpers script for bench comparing commits
* scripts : detect CUDA
* set flags after checking the command line
* fix make flags
---------
Co-authored-by: slaren <redacted>
Herman Semenov [Fri, 16 Feb 2024 12:43:23 +0000 (12:43 +0000)]
llava : removed excess free(NULL) operation (#5531)
Herman Semenov [Fri, 16 Feb 2024 11:45:48 +0000 (11:45 +0000)]
llama : minor fixed return int value (#5529)
Alexey Parfenov [Fri, 16 Feb 2024 11:33:25 +0000 (11:33 +0000)]
server : add "samplers" param to control the samplers order (#5494)
Rőczey Barnabás [Fri, 16 Feb 2024 10:00:56 +0000 (11:00 +0100)]
server : fix system prompt cli (#5516)
bmwl [Fri, 16 Feb 2024 09:31:07 +0000 (01:31 -0800)]
ggml : add numa options (#5377)
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode not being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remove enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
Co-authored-by: Jared Van Bortel <redacted>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c
align parameters
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <redacted>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <redacted>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <redacted>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
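A sketch of the resulting init sequence, written from the bullets above (enum and function names as described there; check llama.h for the exact spelling):
```cpp
#include "llama.h"

int main() {
    llama_backend_init();                            // NUMA setup no longer happens here
    llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);  // e.g. the former "interleave" mode
    // ... load a model and run inference ...
    llama_backend_free();
    return 0;
}
```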
Daniel Bevenius [Fri, 16 Feb 2024 09:24:39 +0000 (10:24 +0100)]
llava : fix clip-model-is-vision flag in README.md (#5509)
* llava: fix clip-model-is-vision flag in README.md
This commit fixes the flag `--clip_model_is_vision` in README.md, which
does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
--clip-model-is-vision
The clip model is a pure vision model
(ShareGPT4V vision extract for example)
```
Signed-off-by: Daniel Bevenius <redacted>
* llava: update link to vit config in README.md
Signed-off-by: Daniel Bevenius <redacted>
---------
Signed-off-by: Daniel Bevenius <redacted>
Georgi Gerganov [Fri, 16 Feb 2024 07:57:55 +0000 (09:57 +0200)]
ci : fix BERT model download and convert
Douglas Hanley [Thu, 15 Feb 2024 17:21:49 +0000 (11:21 -0600)]
Use correct type of pooling for embedding models (#5500)
Use correct type of pooling for embedding models
Georgi Gerganov [Thu, 15 Feb 2024 16:49:08 +0000 (18:49 +0200)]
clip : fix wrong loop condition
slaren [Thu, 15 Feb 2024 15:49:01 +0000 (16:49 +0100)]
cuda : print message when initialization fails (#5512)
* cuda : print message when initialization fails
* use CUDA_NAME both times
Georgi Gerganov [Thu, 15 Feb 2024 13:41:15 +0000 (15:41 +0200)]
scripts : add hf.sh helper script (#5501)
* scripts : add hf.sh helper scripts
* hf : add error logs
* hf : add support for --repo and --file
Michaël de Vries [Thu, 15 Feb 2024 13:14:37 +0000 (14:14 +0100)]
fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487)
* fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false
* fix(gguf-py): added missing cls and mask token ids to the gguf metadata
Elbios [Thu, 15 Feb 2024 08:01:57 +0000 (09:01 +0100)]
llava : fix memory management bug (#5491)
* Fix memory management in llava and server code
Fixes this error:
llama_new_context_with_model: graph splits (measure): 3
Available slots:
-> Slot 0 - max context: 6000
{"timestamp":
1707926446 ,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 - loaded image
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 1]
munmap_chunk(): invalid pointer
Aborted
* Make it cleaner by checking size in batch free wrapper
John [Thu, 15 Feb 2024 07:59:18 +0000 (08:59 +0100)]
llava : hotfix for llava-1.6 image number (#5495)
Co-authored-by: John <redacted>