]>
git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Johannes Gäßler [Mon, 1 Jul 2024 18:39:06 +0000 (20:39 +0200)]
CUDA: refactor and optimize IQ MMVQ (#8215)
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
Mateusz Charytoniuk [Mon, 1 Jul 2024 17:13:22 +0000 (19:13 +0200)]
readme: add Paddler to the list of projects (#8239)
Xuan Son Nguyen [Mon, 1 Jul 2024 16:48:34 +0000 (18:48 +0200)]
gemma2: add sliding window mask (#8227)
* gemma2: add sliding window mask
* fix data_swa uninitialized
* better naming
* add co-author
Co-authored-by: Arlo Phoenix <redacted>
* replace list with single tensor
* update
* llama : minor styling
* convert : add sanity check for query_pre_attn_scalar
* fix small typo in README
---------
Co-authored-by: Arlo Phoenix <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Roni [Mon, 1 Jul 2024 12:48:16 +0000 (14:48 +0200)]
readme : update tool list (#8209)
* Added gppm to Tool list in README
* Update README.md
---------
Co-authored-by: Georgi Gerganov <redacted>
Michael Francis [Mon, 1 Jul 2024 11:47:04 +0000 (07:47 -0400)]
nix : enable curl (#8043)
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Mon, 1 Jul 2024 11:46:18 +0000 (14:46 +0300)]
nix : remove OpenCL remnants (#8235)
* nix : remove OpenCL remnants
* minor : remove parentheses
iacore [Mon, 1 Jul 2024 11:40:58 +0000 (11:40 +0000)]
Document BERT support. (#8205)
* Update README.md
document BERT support
* Update README.md
zhentaoyu [Mon, 1 Jul 2024 11:39:06 +0000 (19:39 +0800)]
[SYCL] Update SYCL-Rope op and Refactor (#8157)
* align with rope.cu and move sycl-op to a single file
Georgi Gerganov [Sun, 30 Jun 2024 23:09:34 +0000 (02:09 +0300)]
flake.lock: Update (#8218)
Xuan Son Nguyen [Sun, 30 Jun 2024 18:27:13 +0000 (20:27 +0200)]
Fix new line issue with chat template, disable template when in-prefix/suffix is set (#8203)
* preserve new line llama_chat_format_single
* disable chat template if in-prefix/suffix is set
* remove redundant change
Andrei [Sun, 30 Jun 2024 03:44:08 +0000 (20:44 -0700)]
llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 (#8197)
* Add attention and final logit softcapping.
* fix
* Add custom add_ functions
* Disable flash attention for Gemma2
* Update src/llama.cpp
Co-authored-by: slaren <redacted>
* Add default value for attention and final logit softcap value
* Add custom kq scaling from Gemma2Attention
* Remove custom pre attention scaling and use computed value instead.
---------
Co-authored-by: slaren <redacted>
Xuan Son Nguyen [Fri, 28 Jun 2024 22:14:20 +0000 (00:14 +0200)]
fix code typo in llama-cli (#8198)
Olivier Chafik [Fri, 28 Jun 2024 17:02:05 +0000 (18:02 +0100)]
json: attempt to skip slow tests when running under emulator (#8189)
Xuan Son Nguyen [Fri, 28 Jun 2024 13:11:44 +0000 (15:11 +0200)]
Add MiniCPM, Deepseek V2 chat template + clean up `llama_chat_apply_template_internal` (#8172)
* tmp_contains
* minicpm chat template
* add DeepSeek Lite template
* change deepseek-lite to deepseek2
* correct code comment
* correct code from master branch
Sigbjørn Skjæret [Fri, 28 Jun 2024 10:53:43 +0000 (12:53 +0200)]
Add SPM infill support (#8016)
* add --spm-infill option
* support --spm-infill
* support --spm-infill
slaren [Fri, 28 Jun 2024 10:37:45 +0000 (12:37 +0200)]
cmake : allow user to override default options (#8178)
Olivier Chafik [Fri, 28 Jun 2024 08:26:45 +0000 (09:26 +0100)]
`json`: restore default additionalProperties to false, fix some pattern escapes (#8180)
* json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset
* json: revert default of additionalProperties to false
* Update README.md
pculliton [Fri, 28 Jun 2024 04:00:43 +0000 (00:00 -0400)]
llama: Add support for Gemma2ForCausalLM (#8156)
* Inference support for Gemma 2 model family
* Update convert-hf-to-gguf.py, constants, and tensor mappings
* cleanup
* format fix
* Fix special token vocab bug
* Don't add space prefix
* fix deleted lines
* Update src/llama.cpp
Co-authored-by: slaren <redacted>
* Add model type names
* Add control vector
* Fix model type identification
---------
Co-authored-by: Andrei Betlen <redacted>
Co-authored-by: slaren <redacted>
Xuan Son Nguyen [Fri, 28 Jun 2024 00:19:11 +0000 (02:19 +0200)]
Add missing items in makefile (#8177)
Olivier Chafik [Thu, 27 Jun 2024 21:08:42 +0000 (22:08 +0100)]
`json`: update grammars/README w/ examples & note about additionalProperties (#8132)
* json: update grammars/README
* mention broken prefixItems
* add mention to llama-gbnf-validator
* json: explicit type: object for nested items object in cli example
loonerin [Thu, 27 Jun 2024 19:01:23 +0000 (15:01 -0400)]
CI: fix release build (Ubuntu+Mac) (#8170)
* CI: fix release build (Ubuntu)
PR #8006 changes defaults to build shared libs. However, CI for releases
expects static builds.
* CI: fix release build (Mac)
---------
Co-authored-by: loonerin <redacted>
slaren [Thu, 27 Jun 2024 18:04:39 +0000 (20:04 +0200)]
cmake : fix deprecated option names not working (#8171)
* cmake : fix deprecated option names not working
* remove LlAMA_OPENMP
Xuan Son Nguyen [Thu, 27 Jun 2024 16:14:19 +0000 (18:14 +0200)]
Add chatml fallback for cpp `llama_chat_apply_template` (#8160)
* add chatml fallback for cpp `llama_chat_apply_template`
* remove redundant code
Georgi Gerganov [Thu, 27 Jun 2024 15:37:29 +0000 (18:37 +0300)]
flake.lock: Update (#8071)
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/
e9ee548d90ff586a6471b4ae80ae9cfcbceb3420 ?narHash=sha256-4Zu0RYRcAY/VWuu6awwq4opuiD//ahpc2aFHg2CWqFY%3D' (2024-06-13)
→ 'github:NixOS/nixpkgs/
d603719ec6e294f034936c0d0dc06f689d91b6c3 ?narHash=sha256-k3JqJrkdoYwE3fHE6xGDY676AYmyh4U2Zw%2B0Bwe5DLU%3D' (2024-06-20)
Co-authored-by: github-actions[bot] <redacted>
Co-authored-by: Philip Taron <redacted>
jukofyork [Thu, 27 Jun 2024 14:48:07 +0000 (15:48 +0100)]
Control vector loading fixes (#8137)
* Fixed leak in llama_control_vector_load_one() and allow llama_control_vector_load() to grow
* refactored `llama_control_vector_load_one()`
* allow multiple directions for same layer in same file
* llama_control_vector_load_one() and llama_control_vector_load() now break on error
* removed unnecessary ggml_free() call
Raj Hammeer Singh Hada [Thu, 27 Jun 2024 14:39:29 +0000 (20:09 +0530)]
Delete examples/llama.android/llama/CMakeLists.txt (#8165)
* Delete examples/llama.android/llama/CMakeLists.txt
https://github.com/ggerganov/llama.cpp/pull/8145#issuecomment-
2194534244
This file is not being used for building on Android. `llama.cpp/examples/llama.android/llama/src/main/cpp/CMakeLists.txt` is being used instead.
* Update CMakeLists.txt
Pick local llama.cpp files instead of fetching content from git
Sigbjørn Skjæret [Thu, 27 Jun 2024 14:27:41 +0000 (16:27 +0200)]
Add Qwen2MoE 57B-A14B model identifier (#8158)
* Add Qwen2MoE 57B-A14B
* Add Qwen2MoE 57B-A14B
Johannes Gäßler [Thu, 27 Jun 2024 14:26:05 +0000 (16:26 +0200)]
CUDA: fix MMQ stream-k for --split-mode row (#8167)
kustaaya [Thu, 27 Jun 2024 08:58:54 +0000 (11:58 +0300)]
Added support for Viking pre-tokenizer (#8135)
Co-authored-by: kustaaya <redacted>
Sigbjørn Skjæret [Thu, 27 Jun 2024 07:46:41 +0000 (09:46 +0200)]
llama : fix CodeLlama FIM token checks (#8144)
* account for space prefix character
* use find instead
Raj Hammeer Singh Hada [Thu, 27 Jun 2024 01:57:57 +0000 (07:27 +0530)]
Fix llama-android.cpp for error - "common/common.h not found" (#8145)
- Path seems to be wrong for the common.h header file in llama-android.cpp file. Fixing the path so the Android Build doesn't fail with the error "There is no file common/common.h"
Daniel Bevenius [Wed, 26 Jun 2024 23:50:09 +0000 (01:50 +0200)]
clip : suppress unused variable warnings (#8105)
* clip : suppress unused variable warnings
This commit suppresses unused variable warnings for the variables e in
the catch blocks.
The motivation for this change is to suppress the warnings that are
generated on Windows when using the MSVC compiler. The warnings are
not displayed when using GCC because GCC will mark all catch parameters
as used.
Signed-off-by: Daniel Bevenius <redacted>
* squash! clip : suppress unused variable warnings
Remove e (/*e*/) instead instead of using GGML_UNUSED.
---------
Signed-off-by: Daniel Bevenius <redacted>
Georgi Gerganov [Wed, 26 Jun 2024 20:25:22 +0000 (23:25 +0300)]
scripts : fix filename sync
slaren [Wed, 26 Jun 2024 19:59:28 +0000 (21:59 +0200)]
ci : publish new docker images only when the files change (#8142)
slaren [Wed, 26 Jun 2024 19:34:14 +0000 (21:34 +0200)]
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (#8140)
slaren [Wed, 26 Jun 2024 18:20:22 +0000 (20:20 +0200)]
make : fix missing -O3 (#8143)
Georgi Gerganov [Wed, 26 Jun 2024 16:39:19 +0000 (19:39 +0300)]
sync : ggml
Georgi Gerganov [Wed, 26 Jun 2024 16:36:44 +0000 (19:36 +0300)]
authors : regen
Georgi Gerganov [Wed, 26 Jun 2024 16:32:07 +0000 (19:32 +0300)]
devops : remove clblast + LLAMA_CUDA -> GGML_CUDA (#8139)
ggml-ci
Georgi Gerganov [Wed, 26 Jun 2024 16:26:13 +0000 (19:26 +0300)]
readme : update API notes
Georgi Gerganov [Wed, 26 Jun 2024 15:33:02 +0000 (18:33 +0300)]
llama : reorganize source code + improve CMake (#8006)
* scripts : update sync [no ci]
* files : relocate [no ci]
* ci : disable kompute build [no ci]
* cmake : fixes [no ci]
* server : fix mingw build
ggml-ci
* cmake : minor [no ci]
* cmake : link math library [no ci]
* cmake : build normal ggml library (not object library) [no ci]
* cmake : fix kompute build
ggml-ci
* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE
ggml-ci
* move public backend headers to the public include directory (#8122)
* move public backend headers to the public include directory
* nix test
* spm : fix metal header
---------
Co-authored-by: Georgi Gerganov <redacted>
* scripts : fix sync paths [no ci]
* scripts : sync ggml-blas.h [no ci]
---------
Co-authored-by: slaren <redacted>
Isaac McFadyen [Wed, 26 Jun 2024 06:29:28 +0000 (02:29 -0400)]
Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115)
* Add message about int8 support
* Add suggestions from review
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Johannes Gäßler [Wed, 26 Jun 2024 06:28:02 +0000 (08:28 +0200)]
CUDA: fix misaligned shared memory read (#8123)
Eddie-Wang [Wed, 26 Jun 2024 06:27:46 +0000 (14:27 +0800)]
llama : extend llm_build_ffn() to support _scale tensors (#8103)
Olivier Chafik [Wed, 26 Jun 2024 00:46:35 +0000 (01:46 +0100)]
`json`: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863)
* json: better suport for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`)
* json: add test for type: [array, null] fix
* update tests
Olivier Chafik [Wed, 26 Jun 2024 00:45:58 +0000 (01:45 +0100)]
`json`: fix additionalProperties, allow space after enum/const (#7840)
* json: default additionalProperty to true
* json: don't force additional props after normal properties!
* json: allow space after enum/const
* json: update pydantic example to set additionalProperties: false
* json: prevent additional props to redefine a typed prop
* port not_strings to python, add trailing space
* fix not_strings & port to js+py
* Update json-schema-to-grammar.cpp
* fix _not_strings for substring overlaps
* json: fix additionalProperties default, uncomment tests
* json: add integ. test case for additionalProperties
* json: nit: simplify condition
* reformat grammar integ tests w/ R"""()""" strings where there's escapes
* update # tokens in server test: consts can now have trailing space
jukofyork [Tue, 25 Jun 2024 20:47:40 +0000 (21:47 +0100)]
fixes #7999 (adds control vectors to all `build_XXX()` functions in `llama.cpp` [needs testing] (#8060)
* fixes #7999
The `build_command_r` forgot to add the control vector.
* Fixes qwen2 too
* Fixed all models' control vectors
* Removed double calls to `cb(cur, "l_out", il)`
* Moved control vector logic to llama_control_vector:apply_to()
fairydreaming [Tue, 25 Jun 2024 19:14:35 +0000 (21:14 +0200)]
llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763)
* llama : add T5 model architecture, tensors and model header parameters
* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap
---------
Co-authored-by: Stanisław Szymczyk <redacted>
Daniel Bevenius [Tue, 25 Jun 2024 19:07:28 +0000 (21:07 +0200)]
llama : return nullptr from llama_grammar_init (#8093)
* llama : return nullptr from llama_grammar_init
This commit updates llama_grammar_init to return nullptr instead of
throwing an exception.
The motivation for this is that this function is declared inside an
extern "C" block and is intended/may be used from C code which will not
be able to handle exceptions thrown, and results in undefined behavior.
On Windows and using MSVC the following warning is currently generated:
```console
C:\llama.cpp\llama.cpp(13998,1): warning C4297: 'llama_grammar_init':
function assumed not to throw an exception but does
C:\llama.cpp\llama.cpp(13998,1): message :
__declspec(nothrow), throw(), noexcept(true), or noexcept was specified
on the function
```
Signed-off-by: Daniel Bevenius <redacted>
* squash! llama : return nullptr from llama_grammar_init
Add checks for nullptr when calling llama_grammar_init.
Signed-off-by: Daniel Bevenius <redacted>
---------
Signed-off-by: Daniel Bevenius <redacted>
Co-authored-by: Clint Herron <redacted>
Olivier Chafik [Tue, 25 Jun 2024 19:06:20 +0000 (20:06 +0100)]
`json`: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797)
* json: support minimum for positive integer values
* json: fix min 0
* json: min + max integer constraints
* json: handle negative min / max integer bounds
* json: fix missing paren min/max bug
* json: proper paren fix
* json: integration test for schemas
* json: fix bounds tests
* Update json-schema-to-grammar.cpp
* json: fix negative max
* json: fix negative min (w/ more than 1 digit)
* Update test-grammar-integration.cpp
* json: nit: move string rules together
* json: port min/max integer support to Python & JS
* nit: move + rename _build_min_max_int
* fix min in [1, 9]
* Update test-grammar-integration.cpp
* add C++11-compatible replacement for std::string_view
* add min/max constrained int field to pydantic json schema example
* fix merge
* json: add integration tests for min/max bounds
* reshuffle/merge min/max integ test cases
* nits / cleanups
* defensive code against string out of bounds (apparently different behaviour of libstdc++ vs. clang's libc++, can't read final NULL char w/ former)
slaren [Tue, 25 Jun 2024 17:20:06 +0000 (19:20 +0200)]
disable docker CI on pull requests (#8110)
joecryptotoo [Tue, 25 Jun 2024 15:13:27 +0000 (08:13 -0700)]
Add healthchecks to llama-server containers (#8081)
* added healthcheck
* added healthcheck
* added healthcheck
* added healthcheck
* added healthcheck
* moved curl to base
* moved curl to base
Brian [Tue, 25 Jun 2024 12:03:25 +0000 (22:03 +1000)]
Gguf dump start data offset via --data-offset and some extra refactor (#8054)
* gguf-dump: add --data-offset
* gguf-dump: add tensor data offset table
* gguf-dump: refactor GGUFReader for clarity
* gguf-dump: add --data-alignment
* gguf-dump.py: Rename variables and adjust comments
start_data_offset --> data_offset
_build_tensors_info_fields --> _build_tensor_info
Xuan Son Nguyen [Tue, 25 Jun 2024 11:59:54 +0000 (13:59 +0200)]
cvector: better prompt handling, add "mean vector" method (#8069)
* remove completions file
* fix inverted vector
* add mean method
* code style
* remove inverted pca hotfix
Xuan Son Nguyen [Tue, 25 Jun 2024 11:56:49 +0000 (13:56 +0200)]
Add chat template support for llama-cli (#8068)
* add chat template support for llama-cli
* add help message
* server: simplify format_chat
* more consistent naming
* improve
* add llama_chat_format_example
* fix server
* code style
* code style
* Update examples/main/main.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
HanishKVC [Tue, 25 Jun 2024 11:27:35 +0000 (16:57 +0530)]
SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950)
* SimpleChat: Allow for chat req bool options to be user controlled
* SimpleChat: Allow user to control cache_prompt flag in request
* SimpleChat: Add sample GUI images to readme file
Show the chat screen and the settings screen
* SimpleChat:Readme: Add quickstart block, title to image, cleanup
* SimpleChat: RePosition contents of the Info and Settings UI
Make it more logically structured and flow through.
* SimpleChat: Rename to apiRequestOptions from chatRequestOptions
So that it is not wrongly assumed that these request options are
used only for chat/completions endpoint. Rather these are used
for both the end points, so rename to match semantic better.
* SimpleChat: Update image included with readme wrt settings ui
* SimpleChat:ReadMe: Switch to webp screen image to reduce size
HatsuneMikuUwU33 [Tue, 25 Jun 2024 08:44:48 +0000 (10:44 +0200)]
Update control vector help (#8104)
Meng, Hengyu [Tue, 25 Jun 2024 02:19:20 +0000 (10:19 +0800)]
[SYCL] Re-enabled mul_mat_batched_sycl (#8095)
Johannes Gäßler [Mon, 24 Jun 2024 23:22:33 +0000 (01:22 +0200)]
CUDA: fix matrix multiplication algorithm choice (#8102)
Johannes Gäßler [Mon, 24 Jun 2024 20:15:33 +0000 (22:15 +0200)]
CUDA: fix MMQ writeback for int8 tensor cores (#8100)
Johannes Gäßler [Mon, 24 Jun 2024 15:43:42 +0000 (17:43 +0200)]
CUDA: use MMQ instead of cuBLAS by default (#8075)
fairydreaming [Mon, 24 Jun 2024 12:13:39 +0000 (14:13 +0200)]
gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090)
Co-authored-by: Stanisław Szymczyk <redacted>
Co-authored-by: Brian <redacted>
Johannes Gäßler [Mon, 24 Jun 2024 10:41:23 +0000 (12:41 +0200)]
CUDA: optimize MMQ int8 tensor core performance (#8062)
* CUDA: optimize MMQ int8 tensor core performance
* only a single get_mma_tile_x_k function
* simplify code, make functions constexpr
Christian Zhou-Zheng [Mon, 24 Jun 2024 09:42:03 +0000 (05:42 -0400)]
Option to split during conversion (#6942)
* support splits in convert.py
* Support split by size and dry run to write estimated shards/filesizes
* Move split functionality to new GGUFManager class
* fix improper function signature
* tentative push of convert-hf-to-gguf support
* resolve merge + SplitArguments for easier parsing
* Fix eager tensor memory leak and remove convert.py changes
Removed a memory leak caused by unexpected reference retention to eager tensors.
Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.
* refactor SplitStrategy to be a deque
Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.
* fix Q8 quantization
* remove unnecessary imports in gguf_manager
* fix final? merge issue
* fix gguf_writer placement and remove comments
* oops, actually fix gguf_writer placement
* reduce duplicated code from gguf_writer
* further simplify GGUFManager
* simplify even further and standardize with GGUFWriter
* reduce diffs with master
* form shards while adding tensors, SHA256 sums agree with master
* re-add type hint
Co-authored-by: compilade <redacted>
* GGUFWriter compatibility fix
Co-authored-by: compilade <redacted>
* Shard dataclass and un-negative dont_add_architecture
* type consistency in format_n_bytes_to_str
* move kv keys to constants.py
* make pathlib explicit
* base-1024 bytes to base-1000
* rename GGUFManager to GGUFWriterSplit
* Update gguf-py/gguf/constants.py
Co-authored-by: compilade <redacted>
* fix convert-hf-to-gguf.py permissions
* fix line endings
* Update gguf-py/gguf/gguf_writer_split.py
Co-authored-by: compilade <redacted>
* convert-hf : restore executable file permission
* examples/convert-legacy-llama.py: restore executable file permission
* reinstate original gguf package import and fix type annotation
* attempt to appease the linter
* attempt 2 to appease the linter
* attempt 3 to appease the linter
* comma consistency
* Update convert-hf-to-gguf.py
Co-authored-by: compilade <redacted>
* edit cmd line args
* use simplification from #7827
* kv/ti data are still wrong
* try to refactor kv data (still fails)
* fix ti data messiness
* tidy up
* fix linting
* actually make the linter happy
* cleanup round 1
* remove SplitStrategy, SplitArguments
* appease linter
* fix typing and clean up
* fix linting
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* progress bar, fix split logic
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* catch oversights
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* swap bar orders
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* compatibility fix
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <redacted>
* Update convert-hf-to-gguf.py
Co-authored-by: compilade <redacted>
---------
Co-authored-by: Brian <redacted>
Co-authored-by: compilade <redacted>
slaren [Mon, 24 Jun 2024 05:36:11 +0000 (07:36 +0200)]
disable publishing the full-rocm docker image (#8083)
Yann Follet [Mon, 24 Jun 2024 05:30:24 +0000 (13:30 +0800)]
embedding : more cli arguments (#7458)
* add parameters for embeddings
--embd-normalize
--embd-output-format
--embd-separator
description in the README.md
* Update README.md
fix tipo
* Trailing whitespace
* fix json generation, use " not '
* fix merge master
* fix code formating
group of parameters // embedding
print usage for embedding parameters
---------
Co-authored-by: Brian <redacted>
fairydreaming [Mon, 24 Jun 2024 05:06:05 +0000 (07:06 +0200)]
gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763)
* gguf-py : add T5 model architecture
* gguf-py : add separate tensors for encoder and decoder
* gguf-py : add new model header parameters: decoder_start_token_id, attention.relative_buckets_count, tokenizer.ggml.remove_extra_whitespaces, tokenizer.ggml.precompiled_charsmap
* convert-hf : add model conversion support for T5ForConditionalGeneration and T5WithLMHeadModel
---------
Co-authored-by: Stanisław Szymczyk <redacted>
slaren [Mon, 24 Jun 2024 01:07:59 +0000 (03:07 +0200)]
ggml : remove ggml_task_type and GGML_PERF (#8017)
* ggml : remove ggml_task_type and GGML_PERF
* check abort_callback on main thread only
* vulkan : remove usage of ggml_compute_params
* remove LLAMA_PERF
Eddie-Wang [Sun, 23 Jun 2024 18:27:57 +0000 (02:27 +0800)]
llama : add support for BitnetForCausalLM (#7931)
* hf bitnet v1
* hf bitnet e2e v2
* finish bitnet e2e
* finish f16 hf bitnet e2e
* remove unsed
* finish bitnet i2 e2e
* move i2s to quantize v1
* move i2 to quantize
* clean code
* clean code 2
* fix codestyle
* fix code
* fix
* fix code
* fix merge
* remove unused
* change table name
* fix whitespace
* delete redundant
* i2_s to absmax
* finish i2_s/i8_s vec_dot x86 simd
* i2s->q22
* fix code
* remove block scale
* add dequantize
* fix seq
* update avx2
* remove q2_2
* remove q22_grid
* fix whitespace
* reuse llm_build_kv
* fix bo
---------
Co-authored-by: root <redacted>
Aarni Koskela [Sun, 23 Jun 2024 15:03:08 +0000 (18:03 +0300)]
server : fix JSON-Scheme typo (#7975)
Daniel Bevenius [Sun, 23 Jun 2024 13:39:45 +0000 (15:39 +0200)]
Fix typo in llama_set_embeddings comment (#8077)
slaren [Sun, 23 Jun 2024 11:14:45 +0000 (13:14 +0200)]
fix CI failures (#8066)
* test-backend-ops : increase cpy max nmse
* server ci : disable thread sanitizer
0cc4m [Sun, 23 Jun 2024 08:21:25 +0000 (10:21 +0200)]
Refactor Vulkan backend to allow multiple contexts (#7961)
* Refactor Vulkan backend to allow multiple contexts
* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs
* Fix Vulkan debug build error
Clint Herron [Sat, 22 Jun 2024 18:28:18 +0000 (14:28 -0400)]
Removing extra blank lines that were breaking Lint. (#8067)
Xuan Son Nguyen [Sat, 22 Jun 2024 16:11:30 +0000 (18:11 +0200)]
cvector: fix CI + correct help message (#8064)
* cvector: fix CI + correct help message
* also correct --pca-iter
HatsuneMikuUwU33 [Sat, 22 Jun 2024 15:19:37 +0000 (17:19 +0200)]
cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052)
* Update negative.txt
* Update positive.txt
* Update cvector-generator.cpp
* Update cvector-generator.cpp
0xspringtime [Sat, 22 Jun 2024 13:37:41 +0000 (09:37 -0400)]
convert-hf : change assert to exception (#8015)
ddh0 [Sat, 22 Jun 2024 13:16:10 +0000 (07:16 -0600)]
Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058)
Uses the values computed by @JohannesGaessler in PR #7413
Clint Herron [Sat, 22 Jun 2024 03:18:36 +0000 (23:18 -0400)]
JSON Schema to GBNF integration tests (#7790)
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.
* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.
* Merging improved schema test methods added by @ochafik in #7797
* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.
* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.
* Fixing grammar indentation to be consistent throughout file.
k.h.lai [Fri, 21 Jun 2024 08:28:20 +0000 (16:28 +0800)]
vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID
* vulkan: remove unneeded variables
* vulkan: fix id query
Eve [Fri, 21 Jun 2024 05:57:36 +0000 (05:57 +0000)]
ggml : AVX IQ quants (#7845)
* initial iq4_xs
* fix ci
* iq4_nl
* iq1_m
* iq1_s
* iq2_xxs
* iq3_xxs
* iq2_s
* iq2_xs
* iq3_s before sllv
* iq3_s
* iq3_s small fix
* iq3_s sllv can be safely replaced with sse multiply
Georgi Gerganov [Fri, 21 Jun 2024 05:51:28 +0000 (08:51 +0300)]
llama : optimize long word tokenization with WPM (#8034)
ggml-ci
Douglas Hanley [Fri, 21 Jun 2024 05:38:22 +0000 (00:38 -0500)]
llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples
* find result_norm/result_embd tensors properly; update output allocation logic
* only use embd output for pooling_type NONE
* get rid of old causal_attn accessor
* take out attention_type; add in llama_set_embeddings
* bypass logits when doing non-NONE pooling
Shuichi Tsutsumi [Fri, 21 Jun 2024 05:30:58 +0000 (14:30 +0900)]
swiftui : enable stream updating (#7754)
Hamdoud Hakem [Thu, 20 Jun 2024 20:01:15 +0000 (21:01 +0100)]
requirements : Bump torch and numpy for python3.12 (#8041)
Hamdoud Hakem [Thu, 20 Jun 2024 19:59:59 +0000 (20:59 +0100)]
convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040)
Johannes Gäßler [Thu, 20 Jun 2024 14:40:13 +0000 (16:40 +0200)]
common: fix warning (#8036)
* common: fix warning
* Update common/common.cpp
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
luoyu-intel [Thu, 20 Jun 2024 13:19:05 +0000 (13:19 +0000)]
[SYCL] Fix windows build and inference (#8003)
* add sycl preset
* fix debug link error. fix windows crash
* update README
Johannes Gäßler [Thu, 20 Jun 2024 12:39:21 +0000 (14:39 +0200)]
CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ
* fix undefined memory reads for small matrices
Michael de Gans [Thu, 20 Jun 2024 05:32:01 +0000 (22:32 -0700)]
metal : fix `ggml_metal_supports_op` for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks if the first few sources types are BF16 and returns false if that's the case.
sasha0552 [Wed, 19 Jun 2024 23:57:10 +0000 (23:57 +0000)]
server : fix smart slot selection (#8020)
Michael de Gans [Wed, 19 Jun 2024 20:10:42 +0000 (13:10 -0700)]
un-ignore `build-info.cmake` and `build-info.sh` (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`
I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files inexistent, even if they're comitted, for the purpose of publishing. This leads to the build failing in such cases.
* un-ignore `build-info.cpp.in`
For the same reason as the previous two files.
* Reorganize `.gitignore`
* Add exceptions for files mentioned by @slaren
I did leave .clang-tidy since it was explicitly ignored before.
* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be comitted
* Remove `.clang-tidy` from `.gitignore`
Per comment by @ggerganov
* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
slaren [Wed, 19 Jun 2024 13:04:15 +0000 (15:04 +0200)]
ggml : synchronize threads using barriers (#7993)
Georgi Gerganov [Wed, 19 Jun 2024 10:04:36 +0000 (13:04 +0300)]
codecov : remove (#8004)
Meng, Hengyu [Wed, 19 Jun 2024 01:11:51 +0000 (09:11 +0800)]
[SYCL] refactor (#6408)
* seperate lower precision GEMM from the main files
* fix workgroup size hardcode
jaime-m-p [Tue, 18 Jun 2024 16:40:52 +0000 (18:40 +0200)]
tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
Sigbjørn Skjæret [Tue, 18 Jun 2024 12:19:45 +0000 (14:19 +0200)]
Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists
* Only use FIM middle if it exists
jojorne [Tue, 18 Jun 2024 12:18:32 +0000 (09:18 -0300)]
Fix no gcc pragma on Windows (#7751)
Ulrich Drepper [Tue, 18 Jun 2024 12:00:14 +0000 (14:00 +0200)]
Allow compiling with CUDA without CUDA runtime installed (#7989)
On hosts which are not prepared/dedicated to execute code using CUDA
it is still possible to compile llama.cpp with CUDA support by just
installing the development packages. Missing are the runtime
libraries like /usr/lib64/libcuda.so* and currently the link step
will fail.
The development environment is prepared for such situations. There
are stub libraries for all the CUDA libraries available in the
$(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end
of the search path will not change anything for environments which
currently work fine but will enable compiling llama.cpp also in case
the runtime code is not available.
Frank Mai [Tue, 18 Jun 2024 07:11:40 +0000 (15:11 +0800)]
chore: clean useless beam search param (#7985)
Signed-off-by: thxCode <redacted>