git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log


Georgi Gerganov [Wed, 26 Jun 2024 15:33:02 +0000 (18:33 +0300)]

llama : reorganize source code + improve CMake (#8006)

* scripts : update sync [no ci]

* files : relocate [no ci]

* ci : disable kompute build [no ci]

* cmake : fixes [no ci]

* server : fix mingw build

ggml-ci

* cmake : minor [no ci]

* cmake : link math library [no ci]

* cmake : build normal ggml library (not object library) [no ci]

* cmake : fix kompute build

ggml-ci

* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE

ggml-ci

* move public backend headers to the public include directory (#8122)

* move public backend headers to the public include directory

* nix test

* spm : fix metal header

---------

Co-authored-by: Georgi Gerganov <redacted>
* scripts : fix sync paths [no ci]

* scripts : sync ggml-blas.h [no ci]

---------

Co-authored-by: slaren <redacted>


Isaac McFadyen [Wed, 26 Jun 2024 06:29:28 +0000 (02:29 -0400)]

Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115)

* Add message about int8 support

* Add suggestions from review

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>


Johannes Gäßler [Wed, 26 Jun 2024 06:28:02 +0000 (08:28 +0200)]

CUDA: fix misaligned shared memory read (#8123)


Eddie-Wang [Wed, 26 Jun 2024 06:27:46 +0000 (14:27 +0800)]

llama : extend llm_build_ffn() to support _scale tensors (#8103)


Olivier Chafik [Wed, 26 Jun 2024 00:46:35 +0000 (01:46 +0100)]

`json`: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863)

* json: better support for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`)

* json: add test for type: [array, null] fix

* update tests
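
To make the schema shape quoted in this commit concrete, here is a minimal sketch of which values a nullable array of strings accepts. It is a tiny hand-written checker, not llama.cpp's schema-to-grammar converter; `matches` is a hypothetical helper and handles only the `type` and `items` keywords.

```python
# Schema shape quoted in the commit message: a nullable array of strings.
SCHEMA = {"type": ["array", "null"], "items": {"type": "string"}}

# JSON Schema type names mapped to Python checks (bool is not an integer/number here).
_TYPE_CHECKS = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def matches(value, schema):
    """Hypothetical helper: check `value` against the `type` and `items` keywords only."""
    types = schema.get("type")
    if types is not None:
        # A "type" union is a list of names; a single name acts as a one-element union.
        names = types if isinstance(types, list) else [types]
        if not any(_TYPE_CHECKS[name](value) for name in names):
            return False
    # "items" constrains elements only when the value actually is an array.
    if isinstance(value, list) and "items" in schema:
        return all(matches(item, schema["items"]) for item in value)
    return True
```

With this sketch, `None`, `[]`, and `["a", "b"]` all match `SCHEMA`, while `"a"` and `["a", 1]` do not.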


Olivier Chafik [Wed, 26 Jun 2024 00:45:58 +0000 (01:45 +0100)]

`json`: fix additionalProperties, allow space after enum/const (#7840)

* json: default additionalProperty to true

* json: don't force additional props after normal properties!

* json: allow space after enum/const

* json: update pydantic example to set additionalProperties: false

* json: prevent additional props to redefine a typed prop

* port not_strings to python, add trailing space

* fix not_strings & port to js+py

* Update json-schema-to-grammar.cpp

* fix _not_strings for substring overlaps

* json: fix additionalProperties default, uncomment tests

* json: add integ. test case for additionalProperties

* json: nit: simplify condition

* reformat grammar integ tests w/ R"""()""" strings where there's escapes

* update # tokens in server test: consts can now have trailing space
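
The default named in this commit can be sketched as follows. This illustrates the JSON Schema rule that a missing `additionalProperties` counts as true; it is not the project's grammar code. Both function names are hypothetical, and a schema-valued `additionalProperties` is treated as "allowed" without checking the values.

```python
def extra_keys_allowed(schema):
    """A missing `additionalProperties` behaves like true; only an explicit false forbids extras."""
    return schema.get("additionalProperties", True) is not False


def rejected_keys(obj, schema):
    """Hypothetical helper: return the keys of `obj` that the schema does not allow."""
    if extra_keys_allowed(schema):
        return []
    declared = set(schema.get("properties", {}))
    return sorted(key for key in obj if key not in declared)
```

This is why the pydantic example in the commit sets `additionalProperties: false` explicitly: without it, undeclared keys are accepted.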


jukofyork [Tue, 25 Jun 2024 20:47:40 +0000 (21:47 +0100)]

fixes #7999 (adds control vectors to all `build_XXX()` functions in `llama.cpp` [needs testing]) (#8060)

* fixes #7999

The `build_command_r` forgot to add the control vector.

* Fixes qwen2 too

* Fixed all models' control vectors

* Removed double calls to `cb(cur, "l_out", il)`

* Moved control vector logic to llama_control_vector::apply_to()
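
The idea behind centralizing the logic can be sketched as below: every `build_XXX()` function hands its per-layer output to one helper, so no model can forget the step. This is a Python model under assumptions, not the C++ code. The `ControlVector` class, its `layers` field, and the list-of-floats representation are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlVector:
    """Hypothetical model: maps a layer index to that layer's direction vector."""
    layers: dict = field(default_factory=dict)

    def apply_to(self, cur, il):
        """Return the layer output `cur` with layer `il`'s direction added, if one exists."""
        direction = self.layers.get(il)
        if direction is None:
            return cur  # no control vector for this layer: output is left untouched
        return [c + d for c, d in zip(cur, direction)]
```

A model builder would then call `cur = cvec.apply_to(cur, il)` once per layer, which is the kind of call `build_command_r` was missing according to the commit.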


fairydreaming [Tue, 25 Jun 2024 19:14:35 +0000 (21:14 +0200)]

llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763)

* llama : add T5 model architecture, tensors and model header parameters

* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap

---------

Co-authored-by: Stanisław Szymczyk <redacted>


Daniel Bevenius [Tue, 25 Jun 2024 19:07:28 +0000 (21:07 +0200)]

llama : return nullptr from llama_grammar_init (#8093)

* llama : return nullptr from llama_grammar_init

This commit updates llama_grammar_init to return nullptr instead of
throwing an exception.

The motivation is that this function is declared inside an extern "C"
block and may be called from C code, which cannot handle a thrown
exception; letting one propagate results in undefined behavior.

On Windows and using MSVC the following warning is currently generated:
```console
C:\llama.cpp\llama.cpp(13998,1): warning C4297: 'llama_grammar_init':
function assumed not to throw an exception but does
C:\llama.cpp\llama.cpp(13998,1): message :
__declspec(nothrow), throw(), noexcept(true), or noexcept was specified
on the function
```

Signed-off-by: Daniel Bevenius <redacted>
* squash! llama : return nullptr from llama_grammar_init

Add checks for nullptr when calling llama_grammar_init.

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
Co-authored-by: Clint Herron <redacted>


Olivier Chafik [Tue, 25 Jun 2024 19:06:20 +0000 (20:06 +0100)]

`json`: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797)

* json: support minimum for positive integer values

* json: fix min 0

* json: min + max integer constraints

* json: handle negative min / max integer bounds

* json: fix missing paren min/max bug

* json: proper paren fix

* json: integration test for schemas

* json: fix bounds tests

* Update json-schema-to-grammar.cpp

* json: fix negative max

* json: fix negative min (w/ more than 1 digit)

* Update test-grammar-integration.cpp

* json: nit: move string rules together

* json: port min/max integer support to Python & JS

* nit: move + rename _build_min_max_int

* fix min in [1, 9]

* Update test-grammar-integration.cpp

* add C++11-compatible replacement for std::string_view

* add min/max constrained int field to pydantic json schema example

* fix merge

* json: add integration tests for min/max bounds

* reshuffle/merge min/max integ test cases

* nits / cleanups

* defensive code against string out of bounds (apparently different behaviour of libstdc++ vs. clang's libc++, can't read final NULL char w/ former)
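
For reference, the four keywords in this commit's title have the semantics sketched below. This is a plain value check, assuming standard JSON Schema meaning. The commit itself encodes the same bounds as grammar rules, which is harder and is not shown here. `int_in_bounds` is a hypothetical name.

```python
def int_in_bounds(value, schema):
    """Check an integer against JSON Schema's four numeric bound keywords."""
    if "minimum" in schema and value < schema["minimum"]:
        return False  # minimum is inclusive
    if "maximum" in schema and value > schema["maximum"]:
        return False  # maximum is inclusive
    if "exclusiveMinimum" in schema and value <= schema["exclusiveMinimum"]:
        return False  # the bound itself is rejected
    if "exclusiveMaximum" in schema and value >= schema["exclusiveMaximum"]:
        return False  # the bound itself is rejected
    return True
```

The edge cases called out in the bullet list (a minimum of 0, negative bounds, the range [1, 9]) are the boundaries a test suite for such a check would exercise.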


slaren [Tue, 25 Jun 2024 17:20:06 +0000 (19:20 +0200)]

disable docker CI on pull requests (#8110)


joecryptotoo [Tue, 25 Jun 2024 15:13:27 +0000 (08:13 -0700)]

Add healthchecks to llama-server containers (#8081)

* added healthcheck

* added healthcheck

* added healthcheck

* added healthcheck

* added healthcheck

* moved curl to base

* moved curl to base


Brian [Tue, 25 Jun 2024 12:03:25 +0000 (22:03 +1000)]

Gguf dump start data offset via --data-offset and some extra refactor (#8054)

* gguf-dump: add --data-offset

* gguf-dump: add tensor data offset table

* gguf-dump: refactor GGUFReader for clarity

* gguf-dump: add --data-alignment

* gguf-dump.py: Rename variables and adjust comments

start_data_offset --> data_offset

_build_tensors_info_fields --> _build_tensor_info
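
The log does not say exactly what `--data-offset` and `--data-alignment` print. As background, here is a minimal sketch of the padding arithmetic, assuming the GGUF convention that tensor data starts at an offset rounded up to a multiple of the alignment. The default of 32 bytes and the name `align_offset` are assumptions for illustration.

```python
def align_offset(offset, alignment=32):
    """Round `offset` up to the next multiple of `alignment` (unchanged if already aligned)."""
    # Assumed default: 32 bytes; a file may declare a different alignment.
    return offset + (alignment - offset % alignment) % alignment
```

For example, metadata ending at byte 33 with a 64-byte alignment would put the start of tensor data at byte 64.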


Xuan Son Nguyen [Tue, 25 Jun 2024 11:59:54 +0000 (13:59 +0200)]

cvector: better prompt handling, add "mean vector" method (#8069)

* remove completions file

* fix inverted vector

* add mean method

* code style

* remove inverted pca hotfix
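
The commit does not spell out how the "mean vector" method works. A plausible reading, sketched here as an assumption, is that it averages the per-prompt differences between positive and negative hidden states instead of running PCA over them. `mean_direction` and the list-of-lists layout are hypothetical.

```python
def mean_direction(positive, negative):
    """Average the per-prompt differences between positive and negative hidden states."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need equally many positive and negative vectors, at least one")
    dim = len(positive[0])
    total = [0.0] * dim
    for pos, neg in zip(positive, negative):
        for i in range(dim):
            total[i] += pos[i] - neg[i]  # accumulate the difference for this prompt pair
    return [t / len(positive) for t in total]
```

Compared with PCA, an average like this needs no iteration and has no sign ambiguity, which may relate to the "inverted vector" fixes listed above.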


Xuan Son Nguyen [Tue, 25 Jun 2024 11:56:49 +0000 (13:56 +0200)]

Add chat template support for llama-cli (#8068)

* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>


HanishKVC [Tue, 25 Jun 2024 11:27:35 +0000 (16:57 +0530)]

SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950)

* SimpleChat: Allow for chat req bool options to be user controlled

* SimpleChat: Allow user to control cache_prompt flag in request

* SimpleChat: Add sample GUI images to readme file

Show the chat screen and the settings screen

* SimpleChat:Readme: Add quickstart block, title to image, cleanup

* SimpleChat: RePosition contents of the Info and Settings UI

Make it more logically structured and flow through.

* SimpleChat: Rename to apiRequestOptions from chatRequestOptions

So that it is not wrongly assumed that these request options are
used only for chat/completions endpoint. Rather these are used
for both the end points, so rename to match semantic better.

* SimpleChat: Update image included with readme wrt settings ui

* SimpleChat:ReadMe: Switch to webp screen image to reduce size


HatsuneMikuUwU33 [Tue, 25 Jun 2024 08:44:48 +0000 (10:44 +0200)]

Update control vector help (#8104)


Meng, Hengyu [Tue, 25 Jun 2024 02:19:20 +0000 (10:19 +0800)]

[SYCL] Re-enabled mul_mat_batched_sycl (#8095)


Johannes Gäßler [Mon, 24 Jun 2024 23:22:33 +0000 (01:22 +0200)]

CUDA: fix matrix multiplication algorithm choice (#8102)


Johannes Gäßler [Mon, 24 Jun 2024 20:15:33 +0000 (22:15 +0200)]

CUDA: fix MMQ writeback for int8 tensor cores (#8100)


Johannes Gäßler [Mon, 24 Jun 2024 15:43:42 +0000 (17:43 +0200)]

CUDA: use MMQ instead of cuBLAS by default (#8075)


fairydreaming [Mon, 24 Jun 2024 12:13:39 +0000 (14:13 +0200)]

gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090)

Co-authored-by: Stanisław Szymczyk <redacted>
Co-authored-by: Brian <redacted>


Johannes Gäßler [Mon, 24 Jun 2024 10:41:23 +0000 (12:41 +0200)]

CUDA: optimize MMQ int8 tensor core performance (#8062)

* CUDA: optimize MMQ int8 tensor core performance

* only a single get_mma_tile_x_k function

* simplify code, make functions constexpr


Christian Zhou-Zheng [Mon, 24 Jun 2024 09:42:03 +0000 (05:42 -0400)]

Option to split during conversion (#6942)

* support splits in convert.py

* Support split by size and dry run to write estimated shards/filesizes

* Move split functionality to new GGUFManager class

* fix improper function signature

* tentative push of convert-hf-to-gguf support

* resolve merge + SplitArguments for easier parsing

* Fix eager tensor memory leak and remove convert.py changes

Removed a memory leak caused by unexpected reference retention to eager tensors.

Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.

* refactor SplitStrategy to be a deque

Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.

* fix Q8 quantization

* remove unnecessary imports in gguf_manager

* fix final? merge issue

* fix gguf_writer placement and remove comments

* oops, actually fix gguf_writer placement

* reduce duplicated code from gguf_writer

* further simplify GGUFManager

* simplify even further and standardize with GGUFWriter

* reduce diffs with master

* form shards while adding tensors, SHA256 sums agree with master

* re-add type hint

Co-authored-by: compilade <redacted>
* GGUFWriter compatibility fix

Co-authored-by: compilade <redacted>
* Shard dataclass and un-negative dont_add_architecture

* type consistency in format_n_bytes_to_str

* move kv keys to constants.py

* make pathlib explicit

* base-1024 bytes to base-1000

* rename GGUFManager to GGUFWriterSplit

* Update gguf-py/gguf/constants.py

Co-authored-by: compilade <redacted>
* fix convert-hf-to-gguf.py permissions

* fix line endings

* Update gguf-py/gguf/gguf_writer_split.py

Co-authored-by: compilade <redacted>
* convert-hf : restore executable file permission

* examples/convert-legacy-llama.py: restore executable file permission

* reinstate original gguf package import and fix type annotation

* attempt to appease the linter

* attempt 2 to appease the linter

* attempt 3 to appease the linter

* comma consistency

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <redacted>
* edit cmd line args

* use simplification from #7827

* kv/ti data are still wrong

* try to refactor kv data (still fails)

* fix ti data messiness

* tidy up

* fix linting

* actually make the linter happy

* cleanup round 1

* remove SplitStrategy, SplitArguments

* appease linter

* fix typing and clean up

* fix linting

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* progress bar, fix split logic

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* catch oversights

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* swap bar orders

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* compatibility fix

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* Update convert-hf-to-gguf.py

Co-authored-by: compilade <redacted>
---------

Co-authored-by: Brian <redacted>
Co-authored-by: compilade <redacted>
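
Two ideas from this commit's bullet list, splitting by size while tensors are added and reporting sizes in base-1000 units, can be sketched as follows. This is not the code in `gguf_writer.py`. `format_size` and `plan_shards` are hypothetical names, and the greedy policy is an assumption.

```python
def format_size(num_bytes):
    """Format a byte count with base-1000 units (1 K = 1000 bytes, not 1024)."""
    value = float(num_bytes)
    for unit in ("", "K", "M", "G"):
        if value < 1000:
            return f"{value:.1f}{unit}"
        value /= 1000
    return f"{value:.1f}T"


def plan_shards(tensor_sizes, max_bytes):
    """Greedily group (name, size) pairs into shards of at most `max_bytes` each.

    A tensor larger than `max_bytes` still gets a shard of its own.
    """
    shards, current, used = [], [], 0
    for name, size in tensor_sizes:
        # Start a new shard when adding this tensor would exceed the limit.
        if current and used + size > max_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        shards.append(current)
    return shards
```

Forming shards in the same pass that adds tensors keeps the tensor order unchanged. A dry run, as mentioned in the bullets, would only need the planned groups and their formatted sizes.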


slaren [Mon, 24 Jun 2024 05:36:11 +0000 (07:36 +0200)]

disable publishing the full-rocm docker image (#8083)


Yann Follet [Mon, 24 Jun 2024 05:30:24 +0000 (13:30 +0800)]

embedding : more cli arguments (#7458)

* add parameters for embeddings
--embd-normalize
--embd-output-format
--embd-separator
description in the README.md

* Update README.md

fix typo

* Trailing whitespace

* fix json generation, use " not '

* fix merge master

* fix code formatting
group of parameters // embedding
print usage for embedding parameters

---------

Co-authored-by: Brian <redacted>
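
The commit lists the new flag names but not their semantics. Assuming `--embd-normalize` selects how each embedding is scaled before output, one common scaling is p-norm normalization, sketched below. The function name and the choice of modes are assumptions; the README mentioned in the commit is the authority on what the flag accepts.

```python
def normalize(vec, p=2):
    """Scale `vec` so its p-norm is 1; a zero vector is returned unchanged."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    if norm == 0.0:
        return list(vec)  # avoid dividing by zero
    return [x / norm for x in vec]
```

With `p=2` this is Euclidean normalization, the usual choice before comparing embeddings by cosine similarity. With `p=1` the absolute values sum to 1.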


fairydreaming [Mon, 24 Jun 2024 05:06:05 +0000 (07:06 +0200)]

gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763)

* gguf-py : add T5 model architecture

* gguf-py : add separate tensors for encoder and decoder

* gguf-py : add new model header parameters: decoder_start_token_id, attention.relative_buckets_count, tokenizer.ggml.remove_extra_whitespaces, tokenizer.ggml.precompiled_charsmap

* convert-hf : add model conversion support for T5ForConditionalGeneration and T5WithLMHeadModel

---------

Co-authored-by: Stanisław Szymczyk <redacted>


slaren [Mon, 24 Jun 2024 01:07:59 +0000 (03:07 +0200)]

ggml : remove ggml_task_type and GGML_PERF (#8017)

* ggml : remove ggml_task_type and GGML_PERF

* check abort_callback on main thread only

* vulkan : remove usage of ggml_compute_params

* remove LLAMA_PERF


Eddie-Wang [Sun, 23 Jun 2024 18:27:57 +0000 (02:27 +0800)]

llama : add support for BitnetForCausalLM (#7931)

* hf bitnet v1

* hf bitnet e2e v2

* finish bitnet e2e

* finish f16 hf bitnet e2e

* remove unused

* finish bitnet i2 e2e

* move i2s to quantize v1

* move i2 to quantize

* clean code

* clean code 2

* fix codestyle

* fix code

* fix

* fix code

* fix merge

* remove unused

* change table name

* fix whitespace

* delete redundant

* i2_s to absmax

* finish i2_s/i8_s vec_dot x86 simd

* i2s->q22

* fix code

* remove block scale

* add dequantize

* fix seq

* update avx2

* remove q2_2

* remove q22_grid

* fix whitespace

* reuse llm_build_kv

* fix bo

---------

Co-authored-by: root <redacted>


Aarni Koskela [Sun, 23 Jun 2024 15:03:08 +0000 (18:03 +0300)]

server : fix JSON-Scheme typo (#7975)


Daniel Bevenius [Sun, 23 Jun 2024 13:39:45 +0000 (15:39 +0200)]

Fix typo in llama_set_embeddings comment (#8077)


slaren [Sun, 23 Jun 2024 11:14:45 +0000 (13:14 +0200)]

fix CI failures (#8066)

* test-backend-ops : increase cpy max nmse

* server ci : disable thread sanitizer


0cc4m [Sun, 23 Jun 2024 08:21:25 +0000 (10:21 +0200)]

Refactor Vulkan backend to allow multiple contexts (#7961)

* Refactor Vulkan backend to allow multiple contexts

* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs

* Fix Vulkan debug build error


Clint Herron [Sat, 22 Jun 2024 18:28:18 +0000 (14:28 -0400)]

Removing extra blank lines that were breaking Lint. (#8067)


Xuan Son Nguyen [Sat, 22 Jun 2024 16:11:30 +0000 (18:11 +0200)]

cvector: fix CI + correct help message (#8064)

* cvector: fix CI + correct help message

* also correct --pca-iter


HatsuneMikuUwU33 [Sat, 22 Jun 2024 15:19:37 +0000 (17:19 +0200)]

cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052)

* Update negative.txt

* Update positive.txt

* Update cvector-generator.cpp

* Update cvector-generator.cpp


0xspringtime [Sat, 22 Jun 2024 13:37:41 +0000 (09:37 -0400)]

convert-hf : change assert to exception (#8015)


ddh0 [Sat, 22 Jun 2024 13:16:10 +0000 (07:16 -0600)]

Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058)

Uses the values computed by @JohannesGaessler in PR #7413


Clint Herron [Sat, 22 Jun 2024 03:18:36 +0000 (23:18 -0400)]

JSON Schema to GBNF integration tests (#7790)

* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789. Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.


k.h.lai [Fri, 21 Jun 2024 08:28:20 +0000 (16:28 +0800)]

vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)

* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query


Eve [Fri, 21 Jun 2024 05:57:36 +0000 (05:57 +0000)]

ggml : AVX IQ quants (#7845)

* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply


Georgi Gerganov [Fri, 21 Jun 2024 05:51:28 +0000 (08:51 +0300)]

llama : optimize long word tokenization with WPM (#8034)

ggml-ci


Douglas Hanley [Fri, 21 Jun 2024 05:38:22 +0000 (00:38 -0500)]

llama : allow pooled embeddings on any model (#7477)

* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling


Shuichi Tsutsumi [Fri, 21 Jun 2024 05:30:58 +0000 (14:30 +0900)]

swiftui : enable stream updating (#7754)


Hamdoud Hakem [Thu, 20 Jun 2024 20:01:15 +0000 (21:01 +0100)]

requirements : Bump torch and numpy for python3.12 (#8041)


Hamdoud Hakem [Thu, 20 Jun 2024 19:59:59 +0000 (20:59 +0100)]

convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040)


Johannes Gäßler [Thu, 20 Jun 2024 14:40:13 +0000 (16:40 +0200)]

common: fix warning (#8036)

* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>


luoyu-intel [Thu, 20 Jun 2024 13:19:05 +0000 (13:19 +0000)]

[SYCL] Fix windows build and inference (#8003)

* add sycl preset

* fix debug link error. fix windows crash

* update README


Johannes Gäßler [Thu, 20 Jun 2024 12:39:21 +0000 (14:39 +0200)]

CUDA: stream-k decomposition for MMQ (#8018)

* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices


Michael de Gans [Thu, 20 Jun 2024 05:32:01 +0000 (22:32 -0700)]

metal : fix `ggml_metal_supports_op` for BF16 (#8021)

Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether any of the first few source types is BF16 and returns false if so.
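
The guard described above can be modeled in a few lines. This is a Python sketch of the logic only, not the Metal backend's C code. The `Tensor` class, the `"bf16"` type string, and the choice of checking four sources are assumptions, since the commit says only "the first few".

```python
from dataclasses import dataclass, field


@dataclass
class Tensor:
    """Hypothetical stand-in for a graph node: its type name and its source tensors."""
    type: str
    src: list = field(default_factory=list)


def supports_op(op, max_checked=4):
    """Report an op as unsupported when any of its first `max_checked` sources is BF16."""
    for source in op.src[:max_checked]:
        if source is not None and source.type == "bf16":
            return False
    return True  # a real backend would go on to check the op kind; omitted here
```

Returning false here lets a scheduler route the op to a backend that does support BF16 instead of crashing.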


sasha0552 [Wed, 19 Jun 2024 23:57:10 +0000 (23:57 +0000)]

server : fix smart slot selection (#8020)


Michael de Gans [Wed, 19 Jun 2024 20:10:42 +0000 (13:10 -0700)]

un-ignore `build-info.cmake` and `build-info.sh` (#7996)

* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent, even if they're committed, for the purpose of publishing. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`


slaren [Wed, 19 Jun 2024 13:04:15 +0000 (15:04 +0200)]

ggml : synchronize threads using barriers (#7993)


Georgi Gerganov [Wed, 19 Jun 2024 10:04:36 +0000 (13:04 +0300)]

codecov : remove (#8004)


Meng, Hengyu [Wed, 19 Jun 2024 01:11:51 +0000 (09:11 +0800)]

[SYCL] refactor (#6408)

* separate lower precision GEMM from the main files

* fix workgroup size hardcode


jaime-m-p [Tue, 18 Jun 2024 16:40:52 +0000 (18:40 +0200)]

tokenizer : BPE fixes (#7530)

* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t


Sigbjørn Skjæret [Tue, 18 Jun 2024 12:19:45 +0000 (14:19 +0200)]

Only use FIM middle token if it exists (#7648)

* Only use FIM middle if it exists

* Only use FIM middle if it exists


jojorne [Tue, 18 Jun 2024 12:18:32 +0000 (09:18 -0300)]

Fix no gcc pragma on Windows (#7751)


Ulrich Drepper [Tue, 18 Jun 2024 12:00:14 +0000 (14:00 +0200)]

Allow compiling with CUDA without CUDA runtime installed (#7989)

On hosts which are not prepared/dedicated to execute code using CUDA
it is still possible to compile llama.cpp with CUDA support by just
installing the development packages.  Missing are the runtime
libraries like /usr/lib64/libcuda.so* and currently the link step
will fail.

The development environment is prepared for such situations.  There
are stub libraries for all the CUDA libraries available in the
$(CUDA_PATH)/lib64/stubs directory.  Adding this directory to the end
of the search path will not change anything for environments which
currently work fine but will enable compiling llama.cpp also in case
the runtime code is not available.


Frank Mai [Tue, 18 Jun 2024 07:11:40 +0000 (15:11 +0800)]

chore: clean useless beam search param (#7985)

Signed-off-by: thxCode <redacted>


Abheek Gulati [Tue, 18 Jun 2024 06:57:41 +0000 (23:57 -0700)]

readme : update UI list (#7943)


Georgi Gerganov [Tue, 18 Jun 2024 06:50:45 +0000 (09:50 +0300)]

ggml : sync


Georgi Gerganov [Tue, 18 Jun 2024 06:37:20 +0000 (09:37 +0300)]

whisper : use ggml_backend_sched (whisper/2239)

* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <redacted>


Ștefan-Gabriel Muscalu [Mon, 17 Jun 2024 19:08:46 +0000 (22:08 +0300)]

update: support Qwen2-57B-A14B (#7835)

* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shared_exp are now properly calculated

* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shexp are now properly calculated


Srihari-mcw [Mon, 17 Jun 2024 18:23:17 +0000 (23:53 +0530)]

Make updates to type cast based on compiler instead of OS (#7851)


Georgi Gerganov [Mon, 17 Jun 2024 16:40:01 +0000 (19:40 +0300)]

llama : disable FA if KV head sizes do not match (#7982)


Bryan Honof [Mon, 17 Jun 2024 15:37:55 +0000 (17:37 +0200)]

Add Nix and Flox install instructions (#7899)


slaren [Mon, 17 Jun 2024 14:51:42 +0000 (16:51 +0200)]

sched : offload_op also requires supports_op (#7977)


Frank Mai [Mon, 17 Jun 2024 14:11:08 +0000 (22:11 +0800)]

fix: divide 0 exception in mamba (#7932)

Signed-off-by: thxCode <redacted>


Markus Tavenrath [Mon, 17 Jun 2024 14:10:15 +0000 (16:10 +0200)]

Implement non-mapped async IO for CUDA on Windows. (#7896)

* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive.

* Free resources except for backend.

* Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA.

* Apply suggestions from code review

Co-authored-by: slaren <redacted>
* Fix editorconfig and unused variable

* Fix issues with Windows build

---------

Co-authored-by: slaren <redacted>


Georgi Gerganov [Mon, 17 Jun 2024 08:09:20 +0000 (11:09 +0300)]

rpc : fix load/store misaligned addresses (#7948)


Brian [Mon, 17 Jun 2024 05:25:20 +0000 (15:25 +1000)]

gguf-dump.py: add --markdown dump output (#7853)

* gguf-dump.py: add --markdown dump output

* gguf-dump.py: Add toc

* gguf-dump.py: use standard tensor name lookup. Also add tensor ID field

* gguf-dump.py: Add tensor overview count

* gguf-dump.py: fix array preview

* gguf-dump.py: markdownTableWithAlignmentSupport() added

* Add type hints and spacing

Co-authored-by: compilade <redacted>
* gguf-dump.py: prettify dimension

* gguf-dump: right align element count

* gguf-dump.py: element count autosizing

* Apply suggestions from code review

Co-authored-by: compilade <redacted>
---------

Co-authored-by: compilade <redacted>


Neo Zhang [Mon, 17 Jun 2024 03:17:07 +0000 (11:17 +0800)]

[SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (#7946)

* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md


Calvin Laurenson [Sun, 16 Jun 2024 22:23:04 +0000 (15:23 -0700)]

Add support for sqrt on CUDA (#7953)

* cuda sqrt support

* enable cuda in pca

* fix comments in pca

* add test

* add sqrt to ggml_backend_cuda_supports_op

* fix test

* new line

* Use F32 sqrtf instead of F64 sqrt

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>


Georgi Gerganov [Tue, 11 Jun 2024 14:39:01 +0000 (17:39 +0300)]

cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)

* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>


Hong Bo PENG [Sun, 16 Jun 2024 08:53:11 +0000 (16:53 +0800)]

ggml : fix and optimize ppc64le (ggml/849)

* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le


Daniel Bevenius [Sun, 16 Jun 2024 08:51:18 +0000 (10:51 +0200)]

ggml : remove duplicate include of ggml-common.h (ggml/853)

Signed-off-by: Daniel Bevenius <redacted>


Georgi Gerganov [Sun, 16 Jun 2024 16:16:21 +0000 (19:16 +0300)]

flake.lock: Update (#7951)


Georgi Gerganov [Sun, 16 Jun 2024 11:51:40 +0000 (14:51 +0300)]

unicode : avoid char32_t (#7957)

ggml-ci


hopkins385 [Sun, 16 Jun 2024 11:51:18 +0000 (13:51 +0200)]

readme : update UI list [no ci] (#7958)


Georgi Gerganov [Sun, 16 Jun 2024 11:50:12 +0000 (14:50 +0300)]

ggml : fix handling of zero blocks in IQ quants (#7955)

ggml-ci


Georgi Gerganov [Sun, 16 Jun 2024 07:46:51 +0000 (10:46 +0300)]

github : update pr template


0cc4m [Sun, 16 Jun 2024 05:17:31 +0000 (07:17 +0200)]

Vulkan Shader Refactor, Memory Debugging Option (#7947)

* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory

* Improve debug log code

* Add memory debug output option

* Fix flake8

* Fix unnecessarily high llama-3 VRAM use


Xuan Son Nguyen [Sat, 15 Jun 2024 16:53:40 +0000 (18:53 +0200)]

Add `cvector-generator` example (#7514)

* add control-vector-generator

* calc diff

* add comments

* proof-of-concept stdlib implementation

Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.

* param parsing, refactor, comments

Added basic command-line parameters for the outfile and for one positive and one negative prompt.

Refactored some messy code in PCA computation and GGUF exporting.

Left a bunch of comments regarding further work needed.

* example template completions

Implements an example template set built from the positive/negative prompts like the control vector Python implementation.

* add multi prompts, multi-thread for PCA

* fix mem error

* add debugs

* fix matrix transpose multiplication

you have got to be kidding me

* preliminary template/multiprompt support

model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish

* fix zero output & param parsing, functional templating

fixed a bug where the output file had no tensor data/was all zero

fixed a bug where single hyphen flags were not being correctly parsed

implements creation of templated prompts from input (still need to adapt based on model)

* fix square_diff matmul index range and CRLF->LF line endings

fixed a logic error where square_diff would not multiply all rows

fixed a formatting error where the provided completions.txt had CRLF line endings

* add command-line args for num threads, num completions file lines, always reload model

refactored a few things and did what the commit message says on the tin

* code aestheticization

* fix compiler warnings

* in-series multithreading for prompt embedding?

added commented-out code to attempt to start implementing multithreading for embedding in main

* remove unnecessary multithreading

* interim fix memory leak

* translated everything but PCA (I think)

* tentatively translate the rest

* fix ggml errors and make new ones

at least it compiles and runs

* fix cb_eval

* temporary commit while I move dev environments

it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent

* update debug statements

* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped

* update comments

* (wip) refactor

* clean up PCA ggml implementation

* fix shape of v_diff_original

* add n_batch for pca

* working version

* remember to copy back the last_eigenvector

* fix n_completions

* bring back n_completions

* default n_pca_batch to 20

* fix macos build

* add to makefile all targets

* use ggml_format_name

* add readme

* fix .editorconfig

* use ggml_backend_tensor_copy

* attempt to fix compile problem on mac

* fix compile warn

* reuse allocr

* move param parser to common

* better error handling

* clean up a bit

* add print_usage

* shorten help msg

* beautify help msg

* escape prompt by default

* change compile target to llama-cvector-generator

* typo

* disable GPU for PCA

* code style

---------

Co-authored-by: Christian Zhou-Zheng <redacted>
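
The PCA step referred to in the bullets above can be illustrated with a minimal power-iteration sketch. This is not the code from this commit: the function name `top_component`, the toy data, and the use of plain Python lists instead of ggml tensors are all assumptions made for illustration, and no mean-centering is applied. The idea is to take rows of differences between hidden states for positive and negative prompts and find the dominant direction of X^T X by repeated multiplication and normalization.

```python
import math

def top_component(rows, n_iter=100, tol=1e-9):
    """Approximate the dominant eigenvector of X^T X by power iteration.

    rows: list of equal-length lists of floats (one difference vector per row).
    Returns a unit-length list of floats. Hypothetical helper, not a llama.cpp API.
    """
    n_embd = len(rows[0])
    # Deterministic start vector; a real implementation would likely randomize it.
    v = [1.0 / math.sqrt(n_embd)] * n_embd
    for _ in range(n_iter):
        # w = X^T (X v), computed without forming the n_embd x n_embd matrix.
        proj = [sum(r[j] * v[j] for j in range(n_embd)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows)))
             for j in range(n_embd)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v  # all-zero input: there is no direction to extract
        w = [x / norm for x in w]
        # Stop once the direction no longer changes between iterations.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))
        v = w
        if delta < tol:
            break
    return v

# Toy data (assumption): "positive minus negative" hidden-state differences.
diffs = [[2.0, 0.1], [1.9, -0.1], [2.1, 0.0]]
direction = top_component(diffs)
```

With this toy input nearly all of the variation lies along the first axis, so the returned unit vector points almost entirely in that direction.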

commit | commitdiff | tree

Meng, Hengyu [Sat, 15 Jun 2024 06:05:10 +0000 (14:05 +0800)]

[SYCL] remove global variables (#7710)

* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related

commit | commitdiff | tree

olexiyb [Fri, 14 Jun 2024 17:28:34 +0000 (20:28 +0300)]

ci : fix macos x86 build (#7940)

To keep using the old `macos-latest` image, we should use `macos-12`.

May fix: https://github.com/ggerganov/llama.cpp/issues/6975

commit | commitdiff | tree

Johannes Gäßler [Fri, 14 Jun 2024 16:41:49 +0000 (18:41 +0200)]

CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores

* try CI fix

* try CI fix

* try CI fix

* fix data race

* revert q2_K precision related changes

commit | commitdiff | tree

Georgi Gerganov [Fri, 14 Jun 2024 14:14:09 +0000 (17:14 +0300)]

metal : utilize max shared memory for mul_mat_id (#7935)

commit | commitdiff | tree

Radoslav Gerganov [Fri, 14 Jun 2024 13:47:41 +0000 (16:47 +0300)]

llama-bench : fix RPC indication (#7936)

Show "<backend_name>+RPC" when RPC offloading is used

commit | commitdiff | tree

Sigbjørn Skjæret [Fri, 14 Jun 2024 10:20:04 +0000 (12:20 +0200)]

llama : more checks before assuming FIM tokens (#7644)

* More checks before assuming FIM tokens for Llama arch

* extensive token check

commit | commitdiff | tree

Elaine [Fri, 14 Jun 2024 10:16:49 +0000 (13:16 +0300)]

convert : add Poro-34B-chat tokenizer support (#7713)

* support for Poro chat pre-tokenizer

* add support for Poro pre-tokenizer

* Update convert-hf-to-gguf-update.py

Co-authored-by: Georgi Gerganov <redacted>

* Change Poro-34B-chat to poro-chat

* Change Poro-34B-chat to poro-chat

* Update convert-hf-to-gguf-update.py

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Radoslav Gerganov [Thu, 13 Jun 2024 12:18:44 +0000 (15:18 +0300)]

rpc : fix ggml_backend_rpc_supports_buft() (#7918)

commit | commitdiff | tree

Galunid [Thu, 13 Jun 2024 07:42:41 +0000 (09:42 +0200)]

readme : Remove outdated instructions from README.md (#7914) [no ci]

commit | commitdiff | tree

slaren [Thu, 13 Jun 2024 01:11:35 +0000 (03:11 +0200)]

move BLAS to a separate backend (#6210)

* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type is used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <redacted>
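
The scheduler note above (a weight stored on a buffer whose backend cannot run the op gets copied to a backend that can) can be pictured with a toy model. This is only a sketch under stated assumptions: `pick_backend`, the backend names `accel` and `cpu`, and the op sets are hypothetical and do not reflect the actual ggml scheduler code.

```python
def pick_backend(op, weight_backend, supported_ops, priority):
    """Return (backend, needs_copy) for an op whose weight lives on weight_backend.

    supported_ops: dict mapping backend name -> set of op names it can run.
    priority: backend names in the order they should be tried.
    Toy model only; names and structure are assumptions.
    """
    # Cheap path: the backend that holds the weight can run the op itself.
    if op in supported_ops[weight_backend]:
        return weight_backend, False
    # Costly path: run on another backend, which requires copying the weight.
    for name in priority:
        if op in supported_ops[name]:
            return name, True
    raise ValueError(f"no backend supports op {op!r}")

# Hypothetical setup: an accelerator that only handles mul_mat, plus a CPU fallback.
supported = {"accel": {"mul_mat"}, "cpu": {"mul_mat", "add", "soft_max"}}
order = ["accel", "cpu"]

fast = pick_backend("mul_mat", "accel", supported, order)
slow = pick_backend("soft_max", "accel", supported, order)
```

In the second call the weight sits on a backend that cannot run the op, so the toy scheduler falls back to `cpu` and flags a copy, which is the costly case the commit message warns about.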

commit | commitdiff | tree

Olivier Chafik [Wed, 12 Jun 2024 23:41:52 +0000 (00:41 +0100)]

`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)

* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <redacted>
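
For scripts that still call the old binary names, the renames listed above can be captured in a small lookup. This is a sketch, not part of the commit: the helper `new_binary_name` is hypothetical, the table only covers names mentioned in this commit message, and `rpc-server` is left unchanged because its rename was reverted. Any name not listed is returned as-is rather than guessed.

```python
# Renames stated explicitly in the commit title.
EXPLICIT_RENAMES = {
    "main": "llama-cli",
    "server": "llama-server",
    "llava-cli": "llama-llava-cli",
}

# Binaries mentioned in the commit bullets that simply gain the "llama-" prefix.
PREFIXED = {
    "baby-llama",
    "gguf-split",
    "quantize",
    "perplexity",
    "lookup-merge",
    "export-lora",
    "eval-callback",
}

def new_binary_name(old):
    """Map an old binary name to its new name; unknown names pass through."""
    if old in EXPLICIT_RENAMES:
        return EXPLICIT_RENAMES[old]
    if old in PREFIXED:
        return "llama-" + old
    # Not covered by this table (e.g. rpc-server, whose rename was reverted).
    return old

updated = [new_binary_name(n) for n in ["main", "server", "quantize", "rpc-server"]]
```

Passing unknown names through unchanged is a deliberate choice here: it avoids inventing a new name for binaries the commit message does not mention.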

commit | commitdiff | tree

Johannes Gäßler [Wed, 12 Jun 2024 15:41:51 +0000 (17:41 +0200)]

CUDA: fix broken oob check for FA vec f32 kernel (#7904)

commit | commitdiff | tree

Georgi Gerganov [Wed, 12 Jun 2024 13:00:22 +0000 (16:00 +0300)]

tests : add non-cont unary tests (#7857)

* tests : add non-cont unary tests

* ggml : update unary asserts and "supports_op"

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 12 Jun 2024 12:24:20 +0000 (15:24 +0300)]

ggml : improve ggml_is_contiguous logic (#7856)

* ggml : improve ggml_is_contiguous logic

ggml-ci

* ggml : support more contiguous cases

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 12 Jun 2024 11:42:29 +0000 (14:42 +0300)]

server : restore numeric prompts (#7883)

commit | commitdiff | tree

Meng, Hengyu [Wed, 12 Jun 2024 09:05:35 +0000 (17:05 +0800)]

update intel docker oneapi-basekit to 2024.1.1-devel-ubuntu22.04 (#7894)

In addition, this reverts a workaround we needed for the upstream issue with expired Intel GPG package keys in 2024.0.1-devel-ubuntu22.04

Packaging of ggml-org/llama.cpp