git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

]> git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

overview / pkg / ggml / sources / llama.cpp / log

nopperl [Mon, 19 Feb 2024 14:14:07 +0000 (14:14 +0000)]

examples : support minItems/maxItems in JSON grammar converter (#5039)

* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 13:23:17 +0000 (15:23 +0200)]

llava : remove extra cont (#5587)

commit | commitdiff | tree

slaren [Mon, 19 Feb 2024 13:02:36 +0000 (14:02 +0100)]

llava : replace ggml_cpy with ggml_cont

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:54:21 +0000 (14:54 +0200)]

sync : ggml

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:53:48 +0000 (14:53 +0200)]

ggml-alloc : apply ggml/731

commit | commitdiff | tree

Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]

metal : option to embed MSL source into compiled binary (whisper/1842)

* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on-fly during build process

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:45:41 +0000 (14:45 +0200)]

ci : enable -Werror for CUDA builds (#5579)

* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 11:41:51 +0000 (13:41 +0200)]

make : fix CUDA build (#5580)

commit | commitdiff | tree

valiray [Mon, 19 Feb 2024 10:37:10 +0000 (02:37 -0800)]

readme : fix typo in README-sycl.md (#5353)

commit | commitdiff | tree

Abhilash Majumder [Mon, 19 Feb 2024 09:15:18 +0000 (14:45 +0530)]

cmake : remove obsolete sycl compile flags (#5581)

* rm unwanted sycl compile options

* fix bug

* fix bug

* format fix

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 08:34:10 +0000 (10:34 +0200)]

minor : fix trailing whitespace (#5538)

commit | commitdiff | tree

Daniel Bevenius [Mon, 19 Feb 2024 08:31:59 +0000 (09:31 +0100)]

llava : avoid changing the original BakLLaVA model (#5577)

This is a follup of Commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA specific part of the surgery script.

I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

NawafAlansari [Mon, 19 Feb 2024 08:25:38 +0000 (03:25 -0500)]

baby-llama : allocate graphs in ggml_context (#5573)

* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Mon, 19 Feb 2024 08:23:37 +0000 (09:23 +0100)]

llama : add llama_chat_apply_template() (#5538)

* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"

commit | commitdiff | tree

slaren [Mon, 19 Feb 2024 08:04:45 +0000 (09:04 +0100)]

cuda, metal : fix nans in soft_max (#5574)

* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Mirko185 [Mon, 19 Feb 2024 07:39:31 +0000 (08:39 +0100)]

readme : update (#5572)

Added 1.5-bit on README.md

commit | commitdiff | tree

bmwl [Mon, 19 Feb 2024 07:38:32 +0000 (23:38 -0800)]

ggml : android and old glibc NUMA incompatibility bugfixes (#5557)

* #ifdef out some code NUMA blocks for Android due to lack of support

* added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper

* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc

* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways

---------

Co-authored-by: root <redacted>

commit | commitdiff | tree

Jared Van Bortel [Sun, 18 Feb 2024 21:21:52 +0000 (16:21 -0500)]

build : pass all warning flags to nvcc via -Xcompiler (#5570)

* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 20:58:57 +0000 (22:58 +0200)]

ggml : restore vec dot stride arg names (#5453)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 20:39:30 +0000 (22:39 +0200)]

ci : fix wikitext url + compile warnings (#5569)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 19:39:58 +0000 (21:39 +0200)]

metal : fix unused warnings (#0)

commit | commitdiff | tree

Robey Holderith [Sun, 18 Feb 2024 19:11:16 +0000 (11:11 -0800)]

common, server : surface min_keep as its own parameter (#5567)

* Feature - surface min_keep as its own parameter

* Updated README with min_keep param

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 17:39:57 +0000 (18:39 +0100)]

server : slots monitoring endpoint (#5550)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 17:38:06 +0000 (19:38 +0200)]

sampling : do not set min_keep to n_probs (#5564)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 17:17:00 +0000 (19:17 +0200)]

cmake : fix GGML_USE_SYCL typo (#5555)

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 16:31:28 +0000 (17:31 +0100)]

server : enhanced health endpoint (#5548)

* server: enrich health endpoint with available slots, return 503 if not slots are available

* server: document new status no slot available in the README.md

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 16:30:09 +0000 (17:30 +0100)]

server : --n-predict option document and cap to max value (#5549)

* server: document --n-predict

* server: ensure client request cannot override n_predict if set

* server: fix print usage LF in new --n-predict option

commit | commitdiff | tree

Daniel Hiltgen [Sun, 18 Feb 2024 16:23:16 +0000 (08:23 -0800)]

server : graceful server shutdown (#5244)

This updates the server queue to support graceful shutdown of the server on signals.

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 16:21:52 +0000 (18:21 +0200)]

common : fix ub (#5530)

commit | commitdiff | tree

Herman Semenov [Sun, 18 Feb 2024 16:20:12 +0000 (16:20 +0000)]

ggml, common, examples, tests : fixed type arguments in printf (#5528)

commit | commitdiff | tree

Daniel Bevenius [Sun, 18 Feb 2024 16:19:23 +0000 (17:19 +0100)]

llava : update surgery script to not remove tensors (#5536)

This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.

The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and have to be checked out again to reset this effect. If this can be
avoided I think it would be preferable.

I did not perform this change for BakLLaVA models as I am not sure
how that part works.

commit | commitdiff | tree

Kawrakow [Sun, 18 Feb 2024 16:16:55 +0000 (18:16 +0200)]

1.5 bit quantization (#5453)

* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

github-actions[bot] [Sun, 18 Feb 2024 00:17:07 +0000 (00:17 +0000)]

flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
→ 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)

commit | commitdiff | tree

Georgi Gerganov [Sat, 17 Feb 2024 21:04:16 +0000 (23:04 +0200)]

ggml : add ALiBi support for ggml_soft_max_ext (#5488)

* ggml : avoid recomputing alibi slopes (CPU)

* llama : reuse hparams.f_max_alibi_bias in all cases

ggml-ci

* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)

ggml-ci

* ggml : handle all SRCs (do not break on first null)

ggml-ci

* tests : do not use slope for large soft_max

accumulates too much error

ggml-ci

* ggml : alternative ALiBi without extra tensor

We compute the slopes in the kernel

ggml-ci

* cuda : add ALiBi support in ggml_soft_max_ext

ggml-ci

* ggml : deprecate ggml_alibi

* ggml : support multi-sequence ALiBi (Metal)

ggml-ci

* cuda : add multi-seq ALiBi + remote F16 soft_max

ggml-ci

* ggml : update deprecation message

* ggml : fix pos ptr when no ALiBi

ggml-ci

* cuda : fix performance (pow -> powf)

* cuda : precompute ALiBi constants

* metal : pre-compute ALiBi slopes

ggml-ci

* llama : init kq_pos only if needed

ggml-ci

* test-backend-ops : add null pos test to soft_max

test-backend-ops : replace soft_max tests

ggml-ci

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Ananta Bastola [Sat, 17 Feb 2024 21:03:14 +0000 (16:03 -0500)]

ci : add an option to fail on compile warning (#3952)

* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

clibdev [Sat, 17 Feb 2024 16:28:37 +0000 (18:28 +0200)]

gitignore : update for CLion IDE (#5544)

commit | commitdiff | tree

Georgi Gerganov [Fri, 16 Feb 2024 17:05:56 +0000 (19:05 +0200)]

cmake : fix VULKAN and ROCm builds (#5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Fri, 16 Feb 2024 13:14:40 +0000 (15:14 +0200)]

scripts : add helpers script for bench comparing commits (#5521)

* scripts : add helpers script for bench comparing commits

* scripts : detect CUDA

* set flags after checking the command line

* fix make flags

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Herman Semenov [Fri, 16 Feb 2024 12:43:23 +0000 (12:43 +0000)]

llava : removed excess free(NULL) operation (#5531)

commit | commitdiff | tree

Herman Semenov [Fri, 16 Feb 2024 11:45:48 +0000 (11:45 +0000)]

llama : minor fixed return int value (#5529)

commit | commitdiff | tree

Alexey Parfenov [Fri, 16 Feb 2024 11:33:25 +0000 (11:33 +0000)]

server : add "samplers" param to control the samplers order (#5494)

commit | commitdiff | tree

Rőczey Barnabás [Fri, 16 Feb 2024 10:00:56 +0000 (11:00 +0100)]

server : fix system prompt cli (#5516)

commit | commitdiff | tree

bmwl [Fri, 16 Feb 2024 09:31:07 +0000 (01:31 -0800)]

ggml : add numa options (#5377)

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remote enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <redacted>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c

align paremeters

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <redacted>
* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <redacted>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <redacted>
* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>

commit | commitdiff | tree

Daniel Bevenius [Fri, 16 Feb 2024 09:24:39 +0000 (10:24 +0100)]

llava : fix clip-model-is-vision flag in README.md (#5509)

* llava: fix clip-model-is-vision flag in README.md

This commit fixes the flag `--clip_model_is_vision` in README.md which
is does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
  --clip-model-is-vision
                        The clip model is a pure vision model
                        (ShareGPT4V vision extract for example)
```

Signed-off-by: Daniel Bevenius <redacted>
* llava: update link to vit config in README.md

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Georgi Gerganov [Fri, 16 Feb 2024 07:57:55 +0000 (09:57 +0200)]

ci : fix BERT model download and convert

commit | commitdiff | tree

Douglas Hanley [Thu, 15 Feb 2024 17:21:49 +0000 (11:21 -0600)]

Use correct type of pooling for embedding models (#5500)

Use correct type of pooling for embedding models

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Feb 2024 16:49:08 +0000 (18:49 +0200)]

clip : fix wrong loop condition

commit | commitdiff | tree

slaren [Thu, 15 Feb 2024 15:49:01 +0000 (16:49 +0100)]

cuda : print message when initialization fails (#5512)

* cuda : print message when initialization fails

* use CUDA_NAME both times

commit | commitdiff | tree

Georgi Gerganov [Thu, 15 Feb 2024 13:41:15 +0000 (15:41 +0200)]

scripts : add hf.sh helper script (#5501)

* scripts : add hf.sh helper scripts

* hf : add error logs

* hf : add support for --repo and --file

commit | commitdiff | tree

Michaël de Vries [Thu, 15 Feb 2024 13:14:37 +0000 (14:14 +0100)]

fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false (#5487)

* fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false

* fix(gguf-py): added missing cls and mask token ids to the gguf metadata

commit | commitdiff | tree

Elbios [Thu, 15 Feb 2024 08:01:57 +0000 (09:01 +0100)]

llava : fix memory management bug (#5491)

* Fix memory management in llava and server code

Fixes this error:

llama_new_context_with_model: graph splits (measure): 3
Available slots:
-> Slot 0 - max context: 6000
{"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 - loaded image
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 1]
munmap_chunk(): invalid pointer
Aborted

* Make it cleaner by checking size in batch free wrapper

commit | commitdiff | tree

John [Thu, 15 Feb 2024 07:59:18 +0000 (08:59 +0100)]

llaba : hotfix for llava-1.6 image number (#5495)

Co-authored-by: John <redacted>

commit | commitdiff | tree

Neuman Vong [Thu, 15 Feb 2024 06:11:15 +0000 (17:11 +1100)]

vulkan: Find optimal memory type but with fallback (#5381)

* @0cc4m feedback

* More feedback @0cc4m

commit | commitdiff | tree

Rune [Wed, 14 Feb 2024 15:15:49 +0000 (16:15 +0100)]

readme : fix typo (#5490)

executabhle -> executable

commit | commitdiff | tree

John [Wed, 14 Feb 2024 14:49:42 +0000 (15:49 +0100)]

llava : update README.md (#5489)

* Update README.md

* Update README.md

* Update examples/llava/README.md

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Michael Podvitskiy [Wed, 14 Feb 2024 08:49:01 +0000 (11:49 +0300)]

cmake : ARM intrinsics detection for MSVC (#5401)

commit | commitdiff | tree

John [Wed, 14 Feb 2024 07:38:35 +0000 (08:38 +0100)]

llava : support v1.6 (#5267)

* Create llava-survery-v2.py

* Update convert-image-encoder-to-gguf.py

* Update convert-image-encoder-to-gguf.py

* Rename llava-survery-v2.py to llava-surgery-v2.py

* Update convert-image-encoder-to-gguf.py

will now search for projector

* Update convert-image-encoder-to-gguf.py

whoops

* Update llava-surgery-v2.py

* Clip: Bugfix for normalization (it did not loat the 3 std and mean values)
Clip: bicubic resize function
Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
convert-image-encoder: fixed image-grid flattening

* whitespace corrections

* ws

* Tensors are now properly permuted.
Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference.

* ws

* added verbose_prompt support into cli
added stopwords for llava-1.6 into cli

* moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed

* ws

* convert : skip unknown tensors (need for LLaVA)

* llava : update readme

* llava : fix compile warnings

* llava : style

* convert : add --skip-unknown CLI arg

* server : remove clip structs

* bugfix for non llava-1.6

It should now work with llava-1.5 as well

* clip : minor code rearrange

* llava : update readme a bit

---------

Co-authored-by: John <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

AT [Tue, 13 Feb 2024 21:44:25 +0000 (15:44 -0600)]

Early return for zero size calls to get_tensor. (#5482)

* Early return for zero size calls to get_tensor.

Signed-off-by: Adam Treat <redacted>
* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-kompute.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Add an early return to the get/set tensor when the size is null.

Signed-off-by: Adam Treat <redacted>
* Early return after the assertions.

Signed-off-by: Adam Treat <redacted>
* Since we do the early return in the generic backend now no reason to do so here as well.

Signed-off-by: Adam Treat <redacted>
---------

Signed-off-by: Adam Treat <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

John [Tue, 13 Feb 2024 17:56:38 +0000 (18:56 +0100)]

gguf : add python reader example (#5216)

* Update CMakeLists.txt

* Create reader.py

* Update reader.py

* Update reader.py

another whitespace :|

* Update reader.py

* lintlintlint

commit | commitdiff | tree

Jared Van Bortel [Tue, 13 Feb 2024 17:03:53 +0000 (12:03 -0500)]

llama : add support for Nomic Embed (#5468)

commit | commitdiff | tree

Aarni Koskela [Tue, 13 Feb 2024 16:18:16 +0000 (18:18 +0200)]

llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs

commit | commitdiff | tree

Aarni Koskela [Tue, 13 Feb 2024 13:24:50 +0000 (15:24 +0200)]

llama : make load error reporting more granular (#5477)

Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.

commit | commitdiff | tree

Daniel Bevenius [Tue, 13 Feb 2024 13:15:42 +0000 (14:15 +0100)]

finetune : rename feed-forward tensors (w1/w2/w3) (#4839)

* finetune: rename feed-forward tensors (w1/w2/w3)

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp.

Signed-off-by: Daniel Bevenius <redacted>
* train-text-from-scratch: rename ff tensors

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be inline with the names
used in the llama_layer struct in llama.cpp

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Feb 2024 13:14:22 +0000 (15:14 +0200)]

tests : multi-thread the tokenizer tests (#5474)

* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci

commit | commitdiff | tree

Douglas Hanley [Tue, 13 Feb 2024 12:06:58 +0000 (06:06 -0600)]

llama : support batched embeddings (#5466)

* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Johannes Gäßler [Tue, 13 Feb 2024 11:38:37 +0000 (12:38 +0100)]

make: add error message for bad CUDA version (#5444)

* make: add error message for bad CUDA version

* Update Makefile

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Feb 2024 11:01:29 +0000 (13:01 +0200)]

bert : add tests + fix quantization (#5475)

* llama : do not quantize pos embd and token type tensors

* ci : add BERT tests

ggml-ci

* ci : do not do BERT tests on low-perf nodes

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 13 Feb 2024 09:20:24 +0000 (11:20 +0200)]

tests : disable moe test (#5473)

commit | commitdiff | tree

Kawrakow [Tue, 13 Feb 2024 07:07:57 +0000 (09:07 +0200)]

ggml-quants : fix compiler warnings (shadow variable) (#5472)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 18:14:39 +0000 (20:14 +0200)]

llama : fix quantization when tensors are missing (#5423)

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 17:54:29 +0000 (19:54 +0200)]

swift : package no longer use ggml dependency (#5465)

* Revert "swift : update Package.swift to use ggml as dependency (#4691)"

This reverts commit ece9a45e8ffb73ad461c792720c2fec28b0137bc.

* spm : add ggml headers

commit | commitdiff | tree

Lee [Mon, 12 Feb 2024 17:29:57 +0000 (01:29 +0800)]

py : fix persimmon `n_rot` conversion (#5460)

* convert : fix persimmon offical weight conversion to write correct n_rot.

* Update convert-persimmon-to-gguf.py

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Abhilash Majumder [Mon, 12 Feb 2024 14:52:05 +0000 (20:22 +0530)]

ggml-sycl: Replace 3d ops with macro (#5458)

* use macro

* use macro

* fix format

commit | commitdiff | tree

Daniel Bevenius [Mon, 12 Feb 2024 08:38:44 +0000 (09:38 +0100)]

llava : remove prog parameter from ArgumentParser (#5457)

* llava: remove prog parameter from ArgumentParser

This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value which is the name of the script.

The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```

Signed-off-by: Daniel Bevenius <redacted>
* ci: add W503 to flake8 ignore list

This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 12 Feb 2024 07:16:06 +0000 (09:16 +0200)]

sync : ggml (#5452)

* ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

* sync : ggml

* update llama.cpp, clip.cpp, export-lora.cpp

* update finetune.cpp, train-text-from-scratch.cpp

ggml-ci

* ggml-backend : reduce alignment to 32 to match gguf and fix mmap

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Johannes Gäßler [Sun, 11 Feb 2024 18:08:39 +0000 (19:08 +0100)]

CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Douglas Hanley [Sun, 11 Feb 2024 16:21:38 +0000 (10:21 -0600)]

Add support for BERT embedding models (#5423)

* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

github-actions[bot] [Sun, 11 Feb 2024 00:17:31 +0000 (00:17 +0000)]

flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
→ 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

commit | commitdiff | tree

Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]

vulkan: only use M-sized matmul on Apple GPUs (#5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <redacted>
* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <redacted>
---------

Signed-off-by: Sergio Lopez <redacted>

commit | commitdiff | tree

Alexey Parfenov [Sun, 11 Feb 2024 13:43:31 +0000 (13:43 +0000)]

common : use enums for sampler types (#5418)

* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* minor : spaces

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Alexey Parfenov [Sun, 11 Feb 2024 13:38:14 +0000 (13:38 +0000)]

server : allow to specify tokens as strings in logit_bias (#5003)

* server: allow to specify tokens as strings in logit_bias

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 11 Feb 2024 13:35:50 +0000 (15:35 +0200)]

main : ctrl+C print timing in non-interactive mode (#3873)

commit | commitdiff | tree

Georgi Gerganov [Sun, 11 Feb 2024 13:33:43 +0000 (15:33 +0200)]

common : fix compile warning

commit | commitdiff | tree

Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]

ggml : fix compile warnings (unused vars) (#4966)

commit | commitdiff | tree

snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]

ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info

commit | commitdiff | tree

Johannes Gäßler [Sun, 11 Feb 2024 11:44:51 +0000 (12:44 +0100)]

lookup: add print for drafting performance (#5450)

commit | commitdiff | tree

Xuan Son Nguyen [Sun, 11 Feb 2024 10:16:22 +0000 (11:16 +0100)]

server : add llama2 chat template (#5425)

* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>

commit | commitdiff | tree

Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]

metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using the
`MLTCommandBuffer`. It is suggested to use `@autoreleasepool` in
[1,2]

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 07:53:05 +0000 (09:53 +0200)]

scripts : update sync scripts with new backends

commit | commitdiff | tree

Georgi Gerganov [Sat, 10 Feb 2024 07:30:36 +0000 (09:30 +0200)]

sync : ggml

commit | commitdiff | tree

Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]

ggml : add abort_callback for cpu backend (ggml/725)

* a way to use abort_callback with the cpu backend

* whisper update

commit | commitdiff | tree

Neuman Vong [Fri, 9 Feb 2024 18:30:19 +0000 (05:30 +1100)]

vulkan: Set limit for task concurrency (#5427)

A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with Too many open files.

    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files

This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.

commit | commitdiff | tree

Daniel Bevenius [Fri, 9 Feb 2024 13:00:59 +0000 (14:00 +0100)]

llava : add requirements.txt and update README.md (#5428)

* llava: add requirements.txt and update README.md

This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.

The motivation of this to make it easier for users to run the scripts in
`examples/llava`. This will avoid users from having to possibly run into
missing package issues if the packages are not installed on their system.

Signed-off-by: Daniel Bevenius <redacted>
* llava: fix typo in llava-surgery.py output

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Riley Stewart [Fri, 9 Feb 2024 10:49:49 +0000 (02:49 -0800)]

server : fix prompt caching for repeated prompts (#5420)

commit | commitdiff | tree

Paul Tsochantaris [Fri, 9 Feb 2024 10:48:06 +0000 (10:48 +0000)]

llama : do not cap thread count when MoE on CPU (#5419)

* Not capping thread count when MoE inference is running on CPU

* Whitespace

commit | commitdiff | tree

Marko Tasic [Fri, 9 Feb 2024 10:17:00 +0000 (11:17 +0100)]

readme : add JavaScript/Wasm repo (#5415)

commit | commitdiff | tree

Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]

ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)

commit | commitdiff | tree

0cc4m [Fri, 9 Feb 2024 05:52:33 +0000 (06:52 +0100)]

Fix Vulkan crash on APUs with very little device memory (#5424)

* Fix Vulkan crash on APUs with very little device memory

* Fix debug output function names

commit | commitdiff | tree

Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]

CUDA: more warps for mmvq on NVIDIA (#5394)

commit | commitdiff | tree

slaren [Thu, 8 Feb 2024 20:33:03 +0000 (21:33 +0100)]

llama : do not print "offloading layers" message in CPU-only builds (#5416)

Packaging of ggml-org/llama.cpp

RSS Atom