git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log


commit | commitdiff | tree

dm4 [Wed, 15 May 2024 12:01:12 +0000 (20:01 +0800)]

embedding : free the batch after execution (#7297)

commit | commitdiff | tree

Georgi Gerganov [Wed, 15 May 2024 10:23:41 +0000 (13:23 +0300)]

sync : ggml

commit | commitdiff | tree

John Balis [Wed, 15 May 2024 08:52:33 +0000 (03:52 -0500)]

ggml : add `ggml_upscale_ext` (ggml/814)

* initial commit with CPU implementation of upscale to shape and test, cuda implementation next

* experimental commit to see if dst shape is correct

* test version

* test

* removed unnecessary params

* refactor

* fixed tests

* ggml : metal impl + cleanup + sycl dev warnings

* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior

* metal : fix upscale op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Johannes Gäßler [Wed, 15 May 2024 06:44:16 +0000 (08:44 +0200)]

server bench: fix bench not waiting for model load (#7284)

commit | commitdiff | tree

Georgi Gerganov [Tue, 14 May 2024 16:14:38 +0000 (19:14 +0300)]

script : sync ggml-rpc

commit | commitdiff | tree

Georgi Gerganov [Tue, 14 May 2024 16:09:30 +0000 (19:09 +0300)]

metal : support FA without mask + add asserts (#7278)

* ggml : fa without mask + add asserts

ggml-ci

* metal : support non-contiguous KV

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 14 May 2024 12:33:16 +0000 (15:33 +0300)]

sync : ggml

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Mon, 13 May 2024 08:01:07 +0000 (11:01 +0300)]

metal : tune soft_max number of threads (whisper/0)

commit | commitdiff | tree

Georgi Gerganov [Sun, 12 May 2024 17:36:31 +0000 (20:36 +0300)]

ggml : try fix ppc64 (whisper/0)

commit | commitdiff | tree

Przemysław Pawełczyk [Wed, 8 May 2024 15:33:43 +0000 (17:33 +0200)]

ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (whisper/2128)

commit | commitdiff | tree

Hong Bo PENG [Sun, 12 May 2024 09:17:18 +0000 (17:17 +0800)]

ggml : optimize for ppc64le using VSX intrinsics (ggml/784)

* optimize for ppc64le using VSX intrinsics

* 1. code clean up by removing comments about overflow concern.

2. fix typo in suffix of scaling.

* Continue to fix typo in suffix of scaling for QK_K <> 256

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Steve Grubb [Tue, 14 May 2024 14:11:24 +0000 (10:11 -0400)]

server: free sampling contexts on exit (#7264)

* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace

commit | commitdiff | tree

Brian [Tue, 14 May 2024 13:10:39 +0000 (23:10 +1000)]

Revert "move ndk code to a new library (#6951)" (#7282)

This reverts commit efc8f767c8c8c749a245dd96ad4e2f37c164b54c.

commit | commitdiff | tree

Radoslav Gerganov [Tue, 14 May 2024 11:27:19 +0000 (14:27 +0300)]

ggml : add RPC backend (#6829)

* ggml : add RPC backend

The RPC backend proxies all operations to a remote server which runs a
regular backend (CPU, CUDA, Metal, etc).

* set TCP_NODELAY

* add CI workflows

* Address review comments

* fix warning

* implement llama_max_devices() for RPC

* Address review comments

* Address review comments

* wrap sockfd into a struct

* implement get_alignment and get_max_size

* add get_device_memory

* fix warning

* win32 support

* add README

* readme : trim trailing whitespace

* Address review comments

* win32 fix

* Address review comments

* fix compile warnings on macos
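
The `set TCP_NODELAY` step above matters because a proxying backend exchanges many small request/response messages. The following is a minimal Python sketch of the socket option only, not the RPC backend's actual code; the helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Send small messages immediately instead of coalescing them, which helps
    # when every proxied operation is a separate network round trip.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print("TCP_NODELAY enabled:", enabled)
    s.close()
```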

commit | commitdiff | tree

slaren [Tue, 14 May 2024 07:33:42 +0000 (09:33 +0200)]

llama : disable pipeline parallelism with nkvo (#7265)

commit | commitdiff | tree

Elton Kola [Tue, 14 May 2024 07:30:30 +0000 (03:30 -0400)]

move ndk code to a new library (#6951)

commit | commitdiff | tree

Haggai Nuchi [Tue, 14 May 2024 05:25:56 +0000 (22:25 -0700)]

Add left recursion check: quit early instead of going into an infinite loop (#7083)

* Add left recursion check: quit early instead of going into an infinite loop

* Remove custom enum, rename left recursion check and move to "grammar internal" section, add handling for edge case where a leftmost nonterminal may be empty

* Remove unnecessary declaration
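
The left recursion check described above, including the edge case where a leftmost nonterminal may be empty, could be sketched as follows. This is an illustrative Python sketch and not the llama.cpp implementation: the grammar representation (a dict mapping each nonterminal to a list of alternatives, each a list of symbols) and both function names are assumptions.

```python
def nullable_nonterminals(grammar):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            if name in nullable:
                continue
            # An alternative is nullable when all of its symbols are nullable
            # nonterminals; an empty alternative is trivially nullable.
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(name)
                changed = True
    return nullable


def is_left_recursive(grammar):
    """Return True if some nonterminal can reach itself in leftmost position."""
    nullable = nullable_nonterminals(grammar)

    def leftmost(name):
        # Nonterminals that may appear first when expanding `name`.
        found = set()
        for alt in grammar[name]:
            for sym in alt:
                if sym not in grammar:
                    break  # a terminal consumes input, so stop here
                found.add(sym)
                if sym not in nullable:
                    break  # a non-empty nonterminal blocks what follows
        return found

    for start in grammar:
        seen = set()
        stack = list(leftmost(start))
        while stack:
            sym = stack.pop()
            if sym == start:
                return True
            if sym not in seen:
                seen.add(sym)
                stack.extend(leftmost(sym))
    return False


if __name__ == "__main__":
    bad = {"expr": [["expr", "+", "term"], ["term"]], "term": [["x"]]}
    good = {"expr": [["term", "+", "expr"], ["term"]], "term": [["x"]]}
    print(is_left_recursive(bad), is_left_recursive(good))
```

Rejecting such a grammar up front lets a parser quit early instead of expanding the same rule forever.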

commit | commitdiff | tree

Ryuei [Tue, 14 May 2024 05:20:47 +0000 (14:20 +0900)]

docs: Fix typo and update description for --embeddings flag (#7026)

- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size

commit | commitdiff | tree

compilade [Mon, 13 May 2024 18:10:51 +0000 (14:10 -0400)]

convert-hf : support direct Q8_0 conversion (#7234)

* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
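
For readers unfamiliar with Q8_0, here is a rough pure-Python sketch of block-wise 8-bit quantization in that style: each block of 32 values shares one scale, and each value becomes a signed 8-bit integer. It is not the code from this commit. The block size, the function names, and the use of Python's `round` (which rounds ties to even) are assumptions, and the real format is understood to store the scale as a 16-bit float, which this sketch skips.

```python
QK8_0 = 32  # assumed block size for this sketch


def quantize_q8_0_block(values):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    if len(values) != QK8_0:
        raise ValueError(f"expected {QK8_0} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    d = amax / 127.0                      # per-block scale
    inv = 1.0 / d if d != 0.0 else 0.0    # avoid dividing by zero for all-zero blocks
    qs = [round(v * inv) for v in values]
    return d, qs


def dequantize_q8_0_block(d, qs):
    """Reconstruct approximate floats from a quantized block."""
    return [d * q for q in qs]


if __name__ == "__main__":
    block = [float(i - 16) for i in range(QK8_0)]
    d, qs = quantize_q8_0_block(block)
    restored = dequantize_q8_0_block(d, qs)
    print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```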

commit | commitdiff | tree

Georgi Gerganov [Mon, 13 May 2024 14:15:15 +0000 (17:15 +0300)]

llama : less KV padding when FA is off (#7257)

ggml-ci

commit | commitdiff | tree

k.h.lai [Mon, 13 May 2024 14:02:36 +0000 (22:02 +0800)]

llava-cli: fix base64 prompt (#7248)

commit | commitdiff | tree

Johannes Gäßler [Mon, 13 May 2024 11:03:27 +0000 (13:03 +0200)]

perplexity: add BF16 vs. FP16 results (#7150)

commit | commitdiff | tree

Neo Zhang [Mon, 13 May 2024 10:11:26 +0000 (18:11 +0800)]

[SYCL] rm wait() (#7233)

commit | commitdiff | tree

Joan Fontanals [Mon, 13 May 2024 08:35:14 +0000 (10:35 +0200)]

llama : rename jina tokenizers to v2 (#7249)

* refactor: rename jina tokenizers to v2

* refactor: keep refactoring non-breaking

commit | commitdiff | tree

Brian [Mon, 13 May 2024 02:56:47 +0000 (12:56 +1000)]

convert.py: Outfile default name change and additional metadata support (#4858)

* convert.py: Outfile default name change and additional metadata support

* convert.py: don't stringify Metadata load method output

* convert.py: typo fix

* convert.py: fix metadata format to sync with LLM_KV_NAMES in llama.cpp

commit | commitdiff | tree

Benjamin Findley [Mon, 13 May 2024 02:40:08 +0000 (19:40 -0700)]

change default temperature of OAI compat API from 0 to 1 (#7226)

* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API

commit | commitdiff | tree

Neo Zhang [Mon, 13 May 2024 00:04:29 +0000 (08:04 +0800)]

[SYCL] Add oneapi runtime dll files to win release package (#7241)

* add oneapi running time dlls to release package

* fix path

* fix path

* fix path

* fix path

* fix path

---------

Co-authored-by: Zhang <redacted>

commit | commitdiff | tree

Neo Zhang [Mon, 13 May 2024 00:02:55 +0000 (08:02 +0800)]

[SYCL] update CI with oneapi 2024.1 (#7235)

Co-authored-by: Zhang <redacted>

commit | commitdiff | tree

Johannes Gäßler [Sun, 12 May 2024 17:40:45 +0000 (19:40 +0200)]

CUDA: add FP32 FlashAttention vector kernel (#7188)

* CUDA: add FP32 FlashAttention vector kernel

* fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

* fixup! fixup! fixup! CUDA: add FP32 FlashAttention vector kernel

commit | commitdiff | tree

Georgi Gerganov [Sun, 12 May 2024 15:30:23 +0000 (18:30 +0300)]

cmake : fix version cmp (#7227)

commit | commitdiff | tree

slaren [Sun, 12 May 2024 00:29:33 +0000 (02:29 +0200)]

remove convert-lora-to-ggml.py (#7204)

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 18:36:20 +0000 (21:36 +0300)]

metal : fix warnings (skipme) (#0)

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 18:35:05 +0000 (21:35 +0300)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 13:57:53 +0000 (16:57 +0300)]

metal : fix indent (ggml/0)

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 13:25:50 +0000 (16:25 +0300)]

ggml : resolve merge (ggml/0)

ggml-ci

commit | commitdiff | tree

Josh Ramer [Sat, 11 May 2024 17:26:35 +0000 (12:26 -0500)]

Scripting & documenting debugging one test without anything else in the loop. (#7096)

* A little documentation that shares my quick tips for working in the repository.

* Update startup-testing-debugging.md

* script that shows a menu of tests to pick from & run the debugger on

* debug-test.sh: Refactor CLI help message

* debug-test.sh: documentation update

* debug-test.sh: CLI Help output corrections

* debug-test.sh: minor doc fix

---------

authored-by: Josh Ramer <redacted>
Assisted-by: brian khuu <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Sat, 11 May 2024 15:28:10 +0000 (17:28 +0200)]

fix system prompt handling (#7153)

commit | commitdiff | tree

compilade [Sat, 11 May 2024 15:06:26 +0000 (11:06 -0400)]

convert-hf : support bfloat16 conversion (#7158)

* convert-hf : support bfloat16 conversion

* gguf-py : flake8 fixes

* convert-hf : add missing space after comma

* convert-hf : get bit-exact same output as ./quantize

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32

* convert-hf : add --outtype auto-f16

A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.

* convert-hf : remove a semicolon because flake8 doesn't like it

It's a reflex from when programming in C/C++, I guess.

* convert-hf : support outtype templating in outfile name

* convert-hf : rename --outtype auto-f16 to --outtype auto
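
The "don't round bf16 NANs" step above can be illustrated with a small sketch of float32 to bfloat16 conversion. It uses the common round-to-nearest-even bit trick and truncates NaNs (forcing the quiet bit) instead of rounding them. This is a scalar Python illustration under those assumptions, not the NumPy code from the conversion script; the function name is hypothetical.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float (interpreted as float32) to a 16-bit bfloat16 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: rounding could carry into the exponent and turn it into
        # infinity, so truncate and force the quiet bit instead.
        return (bits >> 16) | 0x0040
    # Round to nearest, ties to even, then keep the upper 16 bits.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return (rounded >> 16) & 0xFFFF


if __name__ == "__main__":
    for value in (1.0, 1.00390625, 1.01171875, float("inf"), float("nan")):
        print(value, hex(fp32_to_bf16_bits(value)))
```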

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 09:02:39 +0000 (12:02 +0300)]

sync : ggml

ggml-ci

commit | commitdiff | tree

Justina Cho [Wed, 1 May 2024 21:44:26 +0000 (14:44 -0700)]

feat: implemented sigmoid function (ggml/806)

* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
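
As a reference for what the new unary op computes, here is a scalar Python sketch of the logistic sigmoid, 1 / (1 + exp(-x)). The two-branch form is a common way to avoid overflow for large negative inputs; it is an assumption of this sketch and is not taken from the ggml kernels.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small and cannot overflow.
    e = math.exp(x)
    return e / (1.0 + e)


if __name__ == "__main__":
    for v in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
        print(v, sigmoid(v))
```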

commit | commitdiff | tree

Borislav Stanimirov [Thu, 25 Apr 2024 14:24:07 +0000 (17:24 +0300)]

build: fix and ignore msvc warnings (ggml/805)

commit | commitdiff | tree

CrispStrobe [Sat, 11 May 2024 08:18:35 +0000 (10:18 +0200)]

convert : skip inaccessible HF repos (#7210)

commit | commitdiff | tree

Steve Grubb [Sat, 11 May 2024 08:13:02 +0000 (04:13 -0400)]

server : free llama_batch on exit (#7212)

* [server] Cleanup a memory leak on exit

There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.

* make tab into spaces

commit | commitdiff | tree

Haoxiang Fei [Sat, 11 May 2024 08:12:06 +0000 (16:12 +0800)]

llama : lookup word in vocab before doing BPE merges (#7193)

* fix: llama-3 ignore_merges

* test: add test for llama-3 bpe ignore_merges

* fix: set ignore_merges only for llama-3

* fix: test-tokenizer-1-bpe --ignore-merges detection

* fix: copy to fix fallthrough

* fix: change ignore_merges to bool

* fix: add ignore merges tests to cmake

* llama : alternative merge ignore logic

---------

Co-authored-by: Haoxiang Fei <redacted>
Co-authored-by: Georgi Gerganov <redacted>
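
The idea in the commit title, looking a word up in the vocabulary before running BPE merges, can be sketched as follows. This is a simplified Python illustration: the greedy merge loop, the `ranks` table, and both function names are assumptions, and the tokenizer in llama.cpp differs in detail.

```python
def bpe_merge(word, ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = list(word)
    while len(parts) > 1:
        best = None  # (rank, index) of the best pair found so far
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no applicable merge left
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def tokenize_word(word, vocab, ranks, ignore_merges):
    """Return the word as one token if allowed and present, else apply BPE."""
    if ignore_merges and word in vocab:
        return [word]
    return bpe_merge(word, ranks)


if __name__ == "__main__":
    vocab = {"h", "e", "l", "o", "he", "ll", "hello"}
    ranks = {("h", "e"): 0, ("l", "l"): 1}
    print(tokenize_word("hello", vocab, ranks, ignore_merges=True))
    print(tokenize_word("hello", vocab, ranks, ignore_merges=False))
```

With `ignore_merges` set, a word that exists as a whole token is emitted directly even when the merge table alone could not reassemble it.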

commit | commitdiff | tree

Johannes Gäßler [Sat, 11 May 2024 08:11:28 +0000 (10:11 +0200)]

server: fix reported top tokens for temperature 0 (#7203)

commit | commitdiff | tree

Joan Fontanals [Sat, 11 May 2024 07:46:09 +0000 (09:46 +0200)]

llama : add Jina Embeddings architecture (#6826)

* feat: first things to do

* feat: create tensors for Jina architecture

* fix: use other tensors

* feat: embedding gets results

* fix: fix usage of ALIBI

* fix: clean prints

* fix: do some cleanup unused vars

* fix: revert changes to Makefile and CMakeLists

* fix: revert some changes

* fix: fix small detail

* fix: fix convert formatting

* fix: fix linting and editor

* feat: set proper vocab settings

* fix: JinaBertForMaskedLM registration

* feat: support q_normalization and k_normalization in Jina arch

* feat: handle gpt2 tokenizer with Jina architecture

* feat: example comments in embedding

* feat: rename Jina Bert to Jina Bert V2

* fix: add some changes as per review

* feat: proper KQ_pos for Jina embeddings

* feat: add capacity to load models ES and DE for Spanish

* llama : fix pre-tokenizers

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* minor : clean-up

* embedding : add warning about missing SEP

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 11 May 2024 07:32:41 +0000 (10:32 +0300)]

ggml : full ALiBi support (#7192)

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* ggml : fix assert message

* vulkan : add dev notes

* ggml : require mask when using ALiBi

ggml-ci

* convert : fix convert for refact models
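
For context on what ALiBi adds to attention, the sketch below follows the formulation from the ALiBi paper for a power-of-two number of heads: each head gets a slope from a geometric sequence, and the bias added to an attention score grows linearly with the distance between query and key. It is a plain Python illustration, not the ggml kernels; the function names and the causal `-inf` masking are assumptions of this sketch.

```python
def alibi_slopes(n_head: int):
    """Per-head slopes for a power-of-two head count: 2^(-8/n), 2^(-16/n), ..."""
    if n_head <= 0 or n_head & (n_head - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    base = 2.0 ** (-8.0 / n_head)
    return [base ** (h + 1) for h in range(n_head)]


def alibi_bias(n_tokens: int, slope: float):
    """Causal bias matrix: -slope * distance for visible keys, -inf otherwise."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    print("slopes:", slopes)
    for row in alibi_bias(4, slopes[0]):
        print(row)
```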

commit | commitdiff | tree

slaren [Fri, 10 May 2024 16:03:54 +0000 (18:03 +0200)]

llama-bench : add pp+tg test type (#7199)

commit | commitdiff | tree

Georgi Gerganov [Fri, 10 May 2024 15:20:10 +0000 (18:20 +0300)]

metal : fix flash attention kernel requirements (#7169)

* metal : fix flash attention kernel requirements

ggml-ci

* metal : fix ggml_metal_supports_op

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Fri, 10 May 2024 14:53:04 +0000 (17:53 +0300)]

convert : print "ignore_merges" field

commit | commitdiff | tree

slaren [Fri, 10 May 2024 12:28:01 +0000 (14:28 +0200)]

llama : use n_vocab to differentiate between mistral 7B and llama3 8B (#7200)

commit | commitdiff | tree

Justine Tunney [Fri, 10 May 2024 11:01:08 +0000 (07:01 -0400)]

Fix memory bug in grammar parser (#7194)

The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug

./llamafile -m foo.gguf -p bar --grammar 'root::="'

Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <redacted>.

commit | commitdiff | tree

HanishKVC [Fri, 10 May 2024 10:21:58 +0000 (15:51 +0530)]

Main+: optionally allow special tokens from user in interactive mode (#7097)

@hanishkvc added a new `--interactive-specials` flag which would allow for inserting special tokens from user side into the embedding stream.

commit | commitdiff | tree

Andrei [Fri, 10 May 2024 06:41:10 +0000 (02:41 -0400)]

llava : fix moondream support (#7163)

* Revert "Revert "llava : add support for moondream vision language model (#6899)""

This reverts commit 9da243b36ac0b9d609adfaaa4c8f1cc8c592f737.

* Fix num_positions and embeddings initialization

commit | commitdiff | tree

Ouadie EL FAROUKI [Fri, 10 May 2024 00:32:15 +0000 (01:32 +0100)]

Minor arithmetic improvement to mmvq wrapper kernel (#7172)

commit | commitdiff | tree

slaren [Thu, 9 May 2024 23:04:12 +0000 (01:04 +0200)]

eval-callback : fix conversion to float (#7184)

commit | commitdiff | tree

0cc4m [Thu, 9 May 2024 18:39:54 +0000 (20:39 +0200)]

Vulkan Bugfixes and Improvements (#7084)

* Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation

* Further work towards MoE, disabled for now

* Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code

* Add softmax with f16 mask and pos buffer support

* Disable mul_mat_id shaders for now

* Fix flake8

* Fix validation errors caused by empty buffers on larger batch sizes

commit | commitdiff | tree

Georgi Gerganov [Thu, 9 May 2024 13:40:42 +0000 (16:40 +0300)]

readme : add scheduled server workflow status badge

commit | commitdiff | tree

l3utterfly [Thu, 9 May 2024 13:32:40 +0000 (22:32 +0900)]

readme : add app (#6371)

* added Layla to supported UIs

* Update README.md

commit | commitdiff | tree

jaime-m-p [Thu, 9 May 2024 13:30:44 +0000 (15:30 +0200)]

llama3 custom regex split (#6965)

* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* llama3 custom regex split

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* Using char32_t for codepoints

* lint : fix

* already exists unicode_tolower()

* Typing

* Restore BOM

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* Fix merge

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

* Move unused variable value

* GPT2 custom regex split

* Add alternative regex for custom split llama3

Co-authored-by: Georgi Gerganov <redacted>
* Style

* Add bruteforce random tests for token encoding

* wip: fixing unicode codepoint ranges

* Fix merge

* Unicode tables: separator, lowercase, uppercase and whitespace

* llama3 custom regex split: fix \s

* Restore BOM

* Style

* wip: generate NDF table

* Ignore special tokens for testing

* Clean gen-unicode-data.py

* Refactor random tokenizer test

* lint : fix

* tests : add fail test for llama-bpe

---------

Co-authored-by: Jaggzh <redacted>
Co-authored-by: Kazim Abrar Mahi <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: jaime-m-p <>

commit | commitdiff | tree

Johannes Gäßler [Thu, 9 May 2024 12:32:02 +0000 (14:32 +0200)]

CUDA: generalize FP16 fattn vec kernel (#7061)

* CUDA: generalize FP16 fattn vec kernel

* disable unsupported head sizes for AMD in test

* try AMD fix

* fix batch size 2-8

* partially revert changes

commit | commitdiff | tree

Galunid [Thu, 9 May 2024 12:13:05 +0000 (14:13 +0200)]

Add warning if token is invalid (#7173)

commit | commitdiff | tree

Daniel Bevenius [Thu, 9 May 2024 11:03:29 +0000 (13:03 +0200)]

llama : update llama_timings.n_p_eval setting (#7160)

This commit changes the value assigned to llama_timings.n_p_eval when
ctx->n_p_eval is 0 to be 0 instead of 1 which is the current value.

The motivation for this change is that if session caching is enabled,
for example using the `--prompt-cache main-session.txt` command line
argument for the main example, and if the same prompt is used then on
subsequent runs, the prompt tokens will not actually be passed to
llama_decode, and n_p_eval will not be updated by llama_synchronize.

But the value of n_p_eval will be set to 1 by llama_get_timings because
ctx->n_p_eval will be 0. This could be interpreted as 1 token was
evaluated for the prompt which could be misleading for applications
using this value.

Signed-off-by: Daniel Bevenius <redacted>
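
The effect described in this message can be shown with a two-line comparison. The sketch assumes the reported value is produced by clamping the internal counter with a lower bound, which matches the behaviour the message describes; both function names are hypothetical and this is not the llama.cpp source.

```python
def reported_n_p_eval_before(ctx_n_p_eval: int) -> int:
    # Old behaviour as described: a counter of 0 is reported as 1.
    return max(1, ctx_n_p_eval)


def reported_n_p_eval_after(ctx_n_p_eval: int) -> int:
    # New behaviour: a fully cached prompt is reported as 0 evaluated tokens.
    return max(0, ctx_n_p_eval)


if __name__ == "__main__":
    for n in (0, 1, 42):
        print(n, reported_n_p_eval_before(n), reported_n_p_eval_after(n))
```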

commit | commitdiff | tree

Sigbjørn Skjæret [Thu, 9 May 2024 10:56:00 +0000 (12:56 +0200)]

gguf-py : add special token modification capability (#7166)

* Add special token modification capability

To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<｜fim▁begin｜>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<｜fim▁begin｜>" --special-token middle "<｜fim▁hole｜>" --special-token suffix "<｜fim▁end｜>"
```

* improve help text

* flake--

* fix multiple tokens warning

* make script executable

* switch to namedtuple, no need to dataclass

* typing++

* add progress bar

* Add special token modification capability

To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<｜fim▁begin｜>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<｜fim▁begin｜>" --special-token middle "<｜fim▁end｜>" --special-token suffix "<｜fim▁hole｜>"
```
(yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled))
or
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
```
etc...

NB: The tokens have to exist already, trying to add non-existent token name/IDs will be ignored (with a warning), while non-existent values will fail (with an error).

* fail on invalid token id

commit | commitdiff | tree

Albert Jin [Thu, 9 May 2024 09:34:37 +0000 (17:34 +0800)]

opencl : alignment size converted from bits to bytes (#7090)

* opencl alignment size should be converted from bits to bytes

Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN

> Alignment requirement (in bits) for sub-buffer offsets.

* Update ggml-opencl.cpp for readability using division instead of shift

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
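
Since the quoted specification gives the alignment in bits, the conversion is a division by 8. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and rejecting non-multiples of 8 is an assumption of this sketch.

```python
def alignment_bytes(alignment_bits: int) -> int:
    """Convert an alignment expressed in bits to bytes."""
    if alignment_bits <= 0 or alignment_bits % 8 != 0:
        raise ValueError("expected a positive multiple of 8 bits")
    # Division states the intent more directly than a right shift by 3.
    return alignment_bits // 8


if __name__ == "__main__":
    print(alignment_bytes(1024))
```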

commit | commitdiff | tree

Ahmet Zeer [Thu, 9 May 2024 08:16:45 +0000 (11:16 +0300)]

TypoFix (#7162)

commit | commitdiff | tree

Jared Van Bortel [Wed, 8 May 2024 23:55:32 +0000 (19:55 -0400)]

cmake : fix typo (#7151)

commit | commitdiff | tree

compilade [Wed, 8 May 2024 22:16:38 +0000 (18:16 -0400)]

convert-hf : save memory with lazy evaluation (#7075)

* convert-hf : begin refactoring write_tensor

* convert : upgrade to sentencepiece v0.2.0

* convert-hf : remove unused n_dims in extra_*_tensors

* convert-hf : simplify MoE weights stacking

* convert-hf : flake8 linter doesn't like semicolons

* convert-hf : allow unusual model part names

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.

* convert : use a string for the SentencePiece tokenizer path

* convert-hf : display tensor shape

* convert-hf : convert norms to f32 by default

* convert-hf : sort model part names

`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".

* convert-hf : use an ABC for Model again

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

* convert-hf : use a plain class for Model, and forbid direct instantiation

There are no abstract methods used anyway,
so using ABC isn't really necessary.

* convert-hf : more consistent formatting of cmdline args

* convert-hf : align the message logged for converted tensors

* convert-hf : fix Refact conversion

* convert-hf : save memory with lazy evaluation

* convert-hf : flake8 doesn't like lowercase L as a variable name

* convert-hf : remove einops requirement for InternLM2

* convert-hf : faster model parts loading

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.

* convert-hf : minor changes for consistency

* gguf-py : add tqdm as a dependency

It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
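
The "sort model part names" step above relies on zero-padded part numbers sorting correctly as plain strings. A small Python sketch of that idea follows; the helper name `list_model_parts` and the `.safetensors` suffix filter are assumptions for illustration, not the script's actual code.

```python
import os
import tempfile


def list_model_parts(dir_path: str, suffix: str = ".safetensors"):
    """Return model part file names in a deterministic, sorted order."""
    # os.listdir makes no ordering guarantee, so sort explicitly.
    return sorted(name for name in os.listdir(dir_path) if name.endswith(suffix))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for part in ("00010", "00009", "00001"):
            path = os.path.join(tmp, f"model-{part}-of-00042.safetensors")
            open(path, "w").close()
        for name in list_model_parts(tmp):
            print(name)
```

This only works because the part numbers are zero-padded; unpadded numbers would need a numeric sort key.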

commit | commitdiff | tree

agray3 [Wed, 8 May 2024 20:55:49 +0000 (21:55 +0100)]

Introduction of CUDA Graphs to LLama.cpp (#6766)

* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* Fix issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string
  if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Johannes Gäßler [Wed, 8 May 2024 19:53:08 +0000 (21:53 +0200)]

JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143)

commit | commitdiff | tree

Georgi Gerganov [Wed, 8 May 2024 19:14:39 +0000 (22:14 +0300)]

Revert "llava : add support for moondream vision language model (#6899)"

This reverts commit 46e12c4692a37bdd31a0432fc5153d7d22bc7f72.

commit | commitdiff | tree

JohnnyB [Wed, 8 May 2024 19:12:06 +0000 (20:12 +0100)]

server : add themes + favicon (#6848)

* Added themes support with two sample themes and a favicon.

* Newline

* Newline

* Newline

* Trailing whitespace

* Increased opacity for contrast

* Increase opacity.

Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY

* Opacity action trigger.

Trying to re-trigger the cancelled action.

* One more opacity adjustment

This Actions pipeline is failing for random issues.

* Delete examples/server/themes/buttons_top/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Replaced underscore.

commit | commitdiff | tree

Gilad S [Wed, 8 May 2024 19:08:10 +0000 (22:08 +0300)]

metal : use `vm_allocate` instead of `posix_memalign` on macOS (#7078)

* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron processes

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS

commit | commitdiff | tree

Dawid Potocki [Wed, 8 May 2024 14:32:32 +0000 (02:32 +1200)]

main : add --conversation / -cnv flag (#7108)

commit | commitdiff | tree

Eve [Wed, 8 May 2024 14:29:23 +0000 (14:29 +0000)]

sgemm : AVX Q4_0 and Q8_0 (#6891)

* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops

commit | commitdiff | tree

Johan [Wed, 8 May 2024 12:27:58 +0000 (14:27 +0200)]

server : add_special option for tokenize endpoint (#7059)

commit | commitdiff | tree

20kdc [Wed, 8 May 2024 12:22:32 +0000 (13:22 +0100)]

convert.py : --vocab-only generates false but valid params (#7027)

An example of how this might be used in the style of baby-llama will be attached with this PR.

commit | commitdiff | tree

Ren Xuancheng [Wed, 8 May 2024 12:06:43 +0000 (20:06 +0800)]

llama : add BPE pre-tokenization for Qwen2 (#7114)

* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Wed, 8 May 2024 11:24:14 +0000 (13:24 +0200)]

clean up json_value & server_log (#7142)

commit | commitdiff | tree

DAN™ [Wed, 8 May 2024 10:43:23 +0000 (06:43 -0400)]

convert : add BPE pre-tokenization for DBRX (#7132)

* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.

commit | commitdiff | tree

Georgi Gerganov [Wed, 8 May 2024 09:47:07 +0000 (12:47 +0300)]

py : also print the normalizers

commit | commitdiff | tree

Brian [Wed, 8 May 2024 08:54:39 +0000 (18:54 +1000)]

compare-llama-bench.py: add missing basicConfig (#7138)

* compare-llama-bench.py: add missing basicConfig

* compare-llama-bench.py: Add line break between error message and print_help()

* Add regular print() markdown table

commit | commitdiff | tree

Justine Tunney [Wed, 8 May 2024 06:30:09 +0000 (02:30 -0400)]

ggml : introduce bfloat16 support (#6412)

* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
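
The shift described above can be sketched in a few lines of Python. This is only an illustration of the bit layout, not the GGML implementation: the function names are hypothetical, and the narrowing direction simply truncates the low 16 bits, whereas a real converter may round and treat NaN specially.

```python
import struct

def bf16_bits_to_f32(bits: int) -> float:
    """Widen a 16-bit brain16 pattern to binary32 by shifting it 16 bits left."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def f32_to_bf16_bits_truncate(x: float) -> int:
    """Narrow binary32 to brain16 by keeping only the high 16 bits (assumption: no rounding)."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    return u >> 16

print(hex(f32_to_bf16_bits_truncate(1.0)))  # 0x3f80
print(bf16_bits_to_f32(0x3F80))             # 1.0
print(bf16_bits_to_f32(0xC000))             # -2.0
```

Because the sign and exponent fields line up, widening never loses information; only the narrowing direction discards mantissa bits.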

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16.
In practice that covers 99.71% of Mistral 7b v0.2's weights, but there
is currently no way other than fp32 to preserve the others.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16.
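
The "13%" figure can be checked by brute force, since there are only 65536 bf16 bit patterns. The sketch below is an independent illustration, not code from this change: the helper names are hypothetical, and it relies on Python's `struct` half-precision format (`"e"`) to round-trip each value through fp16.

```python
import math
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Decode a brain16 bit pattern by placing it in the high half of a binary32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def survives_fp16(x: float) -> bool:
    """True if x is unchanged after a round trip through IEEE binary16."""
    try:
        back = struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values that are too large for binary16.
        return False
    return back == x

def count_exact_in_fp16():
    """Count finite bf16 values and how many of them fp16 represents exactly."""
    finite = exact = 0
    for bits in range(0x10000):
        x = bf16_bits_to_float(bits)
        if not math.isfinite(x):
            continue  # skip infinities and NaNs
        finite += 1
        if survives_fp16(x):
            exact += 1
    return exact, finite

exact, finite = count_exact_in_fp16()
print(f"{exact} of {finite} finite bf16 values are exact in fp16 "
      f"({exact / 0x10000:.1%} of all bit patterns)")
```

The values that fail are those whose exponent falls outside fp16's narrower 5-bit range: large magnitudes overflow and very small ones underflow or lose low mantissa bits.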

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function

commit | commitdiff | tree

Georgi Gerganov [Wed, 8 May 2024 06:14:50 +0000 (09:14 +0300)]

metal : fix unused warning

commit | commitdiff | tree

Jeximo [Wed, 8 May 2024 00:26:43 +0000 (21:26 -0300)]

Further tidy on Android instructions README.md (#7077)

* Further tidy on Android instructions README.md

Fixed some logic when following readme direction

* Clean up redundant information

A new user arriving will see simple directions on llama.cpp homepage

* corrected punctuation

Period after cmake, colon after termux

* re-word for clarity

method seems to be more correct, instead of alternative in this context

* Organized required packages per build type

building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux.

* README.md

corrected title

* fix trailing whitespace

commit | commitdiff | tree

jukofyork [Wed, 8 May 2024 00:24:16 +0000 (01:24 +0100)]

Fixed save_imatrix to match old behaviour for MoE (#7099)

* Fixed save_imatrix to match old behaviour for MoE

This fix is simple and clear, but unnecessarily doubles the memory overhead.

* Fixed missing idx variable

* Unconditionally increment ncall

Co-authored-by: slaren <redacted>

* Fixed 2 bugs in save_imatrix()

- Fixed segfault bug because the counts vector needed to be created.
- Fixed a pre-existing bug that didn't actually add to the counts for the "--combine" option.

* ncall needs summing too

* Trailing whitespace

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Johannes Gäßler [Tue, 7 May 2024 21:07:58 +0000 (23:07 +0200)]

server: fix incorrectly reported token probabilities (#7125)

* server: normalize token probabilities

* fix temperature == 0.0f

commit | commitdiff | tree

nopperl [Tue, 7 May 2024 19:39:43 +0000 (19:39 +0000)]

Fix OLMo HF to GGUF conversion (#6910)

commit | commitdiff | tree

Kyle Mistele [Tue, 7 May 2024 18:44:29 +0000 (13:44 -0500)]

server : update readme with undocumented options (#7013)

commit | commitdiff | tree

Georgi Gerganov [Tue, 7 May 2024 18:43:13 +0000 (21:43 +0300)]

readme : update hot topics

commit | commitdiff | tree

RhinoDevel [Tue, 7 May 2024 17:51:31 +0000 (19:51 +0200)]

main : update log text (EOS to EOG) (#7104)

* Update log text (EOS to EOG)

The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT.

* Improve log msg. further by using "an" instead of "some".

As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).

commit | commitdiff | tree

omahs [Tue, 7 May 2024 15:20:33 +0000 (17:20 +0200)]

docs: fix typos (#7124)

* fix typo

* fix typos

* fix typo

* fix typos

* fix typo

* fix typos

commit | commitdiff | tree

Georgi Gerganov [Tue, 7 May 2024 08:08:49 +0000 (11:08 +0300)]

ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)

* ci : add GG_BUILD_EXTRA_TESTS_0 env

ggml-ci

* Update run.sh

ggml-ci

commit | commitdiff | tree

William Tambellini [Mon, 6 May 2024 18:12:14 +0000 (11:12 -0700)]

Add an option to build without CUDA VMM (#7067)

Add an option to build ggml cuda without CUDA VMM
resolves
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4

commit | commitdiff | tree

Georgi Gerganov [Mon, 6 May 2024 15:36:06 +0000 (18:36 +0300)]

flake.lock: Update (#7079)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01)
  → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29)
  → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25)
  → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 6 May 2024 06:31:30 +0000 (09:31 +0300)]

minor : fix trailing whitespace

commit | commitdiff | tree

kunnis [Sun, 5 May 2024 12:17:47 +0000 (07:17 -0500)]

Adding support for the --numa argument for llama-bench. (#7080)

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 5 May 2024 11:38:55 +0000 (13:38 +0200)]

Disable benchmark on forked repo (#7034)

* Disable benchmark on forked repo

* only check owner on schedule event

* check owner on push also

* more readable as multi-line

* ternary won't work

* style++

* test++

* enable actions debug

* test--

* remove debug

* test++

* do debug where we can get logs

* test--

* this is driving me crazy

* correct github.event usage

* remove test condition

* correct github.event usage

* test++

* test--

* event_name is pull_request_target

* test++

* test--

* update ref checks

commit | commitdiff | tree

Lyle Dean [Sun, 5 May 2024 05:21:46 +0000 (06:21 +0100)]

readme : add note that LLaMA 3 is not supported with convert.py (#7065)

commit | commitdiff | tree

DAN™ [Sun, 5 May 2024 05:19:30 +0000 (01:19 -0400)]

command-r : add BPE pre-tokenization (#7063)

* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <redacted>

Packaging of ggml-org/llama.cpp
