git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
11 months ago ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
Dibakar Gope [Wed, 10 Jul 2024 12:14:51 +0000 (07:14 -0500)]
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)

* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for Arm CPUs with an SVE vector length (VL) not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile-time flags for building the Q4_0_4_4 quant type

11 months ago gguf-py rel pipeline (#8410)
M. Yusuf Sarıgöz [Wed, 10 Jul 2024 12:12:35 +0000 (15:12 +0300)]
gguf-py rel pipeline (#8410)

* Upd gguf-py/readme

* Bump patch version for release

11 months ago llama : C++20 compatibility for u8 strings (#8408)
Borislav Stanimirov [Wed, 10 Jul 2024 11:45:44 +0000 (14:45 +0300)]
llama : C++20 compatibility for u8 strings (#8408)

11 months ago msvc : silence codecvt c++17 deprecation warnings (#8395)
Borislav Stanimirov [Wed, 10 Jul 2024 11:40:53 +0000 (14:40 +0300)]
msvc : silence codecvt c++17 deprecation warnings (#8395)

11 months ago llama : add assert about missing llama_encode() call (#8400)
fairydreaming [Wed, 10 Jul 2024 11:38:58 +0000 (13:38 +0200)]
llama : add assert about missing llama_encode() call (#8400)

Co-authored-by: Stanisław Szymczyk <redacted>
11 months ago py : fix converter for internlm2 (#8321)
RunningLeon [Wed, 10 Jul 2024 11:26:40 +0000 (19:26 +0800)]
py : fix converter for internlm2 (#8321)

* update internlm2

* remove unused file

* fix lint

11 months ago py : fix extra space in convert_hf_to_gguf.py (#8407)
laik [Wed, 10 Jul 2024 11:19:10 +0000 (19:19 +0800)]
py : fix extra space in convert_hf_to_gguf.py (#8407)

11 months ago Server: Enable setting default sampling parameters via command-line (#8402)
Clint Herron [Tue, 9 Jul 2024 22:26:40 +0000 (18:26 -0400)]
Server: Enable setting default sampling parameters via command-line (#8402)

* Load server sampling parameters from the server context by default.

* Wordsmithing comment

11 months ago Update README.md to fix broken link to docs (#8399)
Andy Salerno [Tue, 9 Jul 2024 18:58:44 +0000 (11:58 -0700)]
Update README.md to fix broken link to docs (#8399)

Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'

11 months ago Deprecation warning to assist with migration to new binary names (#8283)
Clint Herron [Tue, 9 Jul 2024 15:54:43 +0000 (11:54 -0400)]
Deprecation warning to assist with migration to new binary names (#8283)

* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.

11 months ago make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392)
Johannes Gäßler [Tue, 9 Jul 2024 15:11:07 +0000 (17:11 +0200)]
make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392)

11 months ago sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372)
Alberto Cabrera Pérez [Tue, 9 Jul 2024 14:03:15 +0000 (15:03 +0100)]
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372)

* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment

11 months ago cmake : allow external ggml (#8370)
Borislav Stanimirov [Tue, 9 Jul 2024 08:38:00 +0000 (11:38 +0300)]
cmake : allow external ggml (#8370)

11 months ago readme : fix typo [no ci] (#8389)
daghanerdonmez [Tue, 9 Jul 2024 06:16:00 +0000 (09:16 +0300)]
readme : fix typo [no ci] (#8389)

Bakus-Naur --> Backus-Naur

11 months ago gguf-py : do not use internal numpy types (#7472)
compilade [Tue, 9 Jul 2024 05:04:49 +0000 (01:04 -0400)]
gguf-py : do not use internal numpy types (#7472)

11 months ago flake.lock: Update (#8342)
Georgi Gerganov [Mon, 8 Jul 2024 22:36:38 +0000 (01:36 +0300)]
flake.lock: Update (#8342)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <redacted>
11 months ago labeler : updated sycl to match docs and code refactor (#8373)
Alberto Cabrera Pérez [Mon, 8 Jul 2024 20:35:17 +0000 (21:35 +0100)]
labeler : updated sycl to match docs and code refactor (#8373)

11 months ago readme : fix web link error [no ci] (#8347)
b4b4o [Mon, 8 Jul 2024 14:19:24 +0000 (22:19 +0800)]
readme : fix web link error [no ci] (#8347)

11 months ago sycl : fix powf call in device code (#8368)
Alberto Cabrera Pérez [Mon, 8 Jul 2024 13:22:41 +0000 (14:22 +0100)]
sycl : fix powf call in device code (#8368)

11 months ago scripts : fix sync for sycl
Georgi Gerganov [Mon, 8 Jul 2024 10:51:31 +0000 (13:51 +0300)]
scripts : fix sync for sycl

11 months ago sync : ggml
Georgi Gerganov [Mon, 8 Jul 2024 07:39:50 +0000 (10:39 +0300)]
sync : ggml

ggml-ci

11 months ago tests : fix whitespace (#0)
Georgi Gerganov [Mon, 8 Jul 2024 07:39:36 +0000 (10:39 +0300)]
tests : fix whitespace (#0)

11 months ago feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)
John Balis [Tue, 2 Jul 2024 16:09:52 +0000 (11:09 -0500)]
feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)

* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed hardcoded use of cuda

* restored test-conv-transpose.c

* removed unused arguments, and fixed a bug where a test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <redacted>
11 months ago common : preallocate sampling token data vector (#8363)
Kevin Wang [Mon, 8 Jul 2024 07:26:53 +0000 (03:26 -0400)]
common : preallocate sampling token data vector (#8363)

Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this improves the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, with a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
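
A minimal sketch of the pattern, assuming the `llama_token_data` aggregate from llama.h and a `logits`/`n_vocab` pair obtained from the context:

```cpp
#include <vector>
#include "llama.h"

// Size the candidate buffer once instead of growing it token by token.
static std::vector<llama_token_data> build_candidates(const float * logits, int n_vocab) {
    std::vector<llama_token_data> candidates(n_vocab); // single upfront allocation
    for (llama_token id = 0; id < n_vocab; id++) {
        // plain stores, no per-element capacity checks as with emplace_back
        candidates[id] = llama_token_data{ id, logits[id], 0.0f };
    }
    return candidates;
}
```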

11 months ago infill : assert prefix/suffix tokens + remove old space logic (#8351)
Georgi Gerganov [Mon, 8 Jul 2024 06:34:35 +0000 (09:34 +0300)]
infill : assert prefix/suffix tokens + remove old space logic (#8351)

11 months ago common : avoid unnecessary logits fetch (#8358)
Kevin Wang [Mon, 8 Jul 2024 06:31:55 +0000 (02:31 -0400)]
common : avoid unnecessary logits fetch (#8358)

11 months ago readme : add supported glm models (#8360)
toyer [Mon, 8 Jul 2024 05:57:19 +0000 (13:57 +0800)]
readme : add supported glm models (#8360)

11 months ago py : type-check all Python scripts with Pyright (#8341)
compilade [Sun, 7 Jul 2024 19:04:39 +0000 (15:04 -0400)]
py : type-check all Python scripts with Pyright (#8341)

* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.

11 months ago Update llama-cli documentation (#8315)
Denis Spasyuk [Sun, 7 Jul 2024 15:08:28 +0000 (09:08 -0600)]
Update llama-cli documentation (#8315)

* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main templates on some commands, added chat template sections, and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md

11 months ago ci : add checks for cmake, make and ctest in ci/run.sh (#8200)
Alex Tuddenham [Sun, 7 Jul 2024 14:59:14 +0000 (15:59 +0100)]
ci : add checks for cmake, make and ctest in ci/run.sh (#8200)

* Added checks for cmake, make and ctest

* Removed erroneous whitespace

11 months ago readme : update bindings list (#8222)
Andy Tai [Sun, 7 Jul 2024 13:21:37 +0000 (06:21 -0700)]
readme : update bindings list (#8222)

* adding guile_llama_cpp to binding list

* fix formatting

* fix formatting

11 months ago gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
Brian [Sun, 7 Jul 2024 12:58:43 +0000 (22:58 +1000)]
gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)

CLI to hash GGUF files to detect differences on a per-model and per-tensor level

The hash types we support are:

- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash-checking programs like sha256sum, these
are designed to check entire files. This is not ideal for our purpose if we want
to check the consistency of the tensor data even when the metadata content of the
gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer'
basis in addition to an 'entire tensor model' hash. The intent is that the entire
tensor model can be checked first, and if any inconsistency is detected, the per
tensor hashes can be used to narrow down the specific tensor layer that has
inconsistencies.
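
A hypothetical invocation, for illustration (the binary name follows the project's llama- prefix convention and is an assumption here):

```console
$ llama-gguf-hash --xxh64 model.gguf    # fast whole-model and per-tensor digests
$ llama-gguf-hash --sha256 model.gguf   # stronger but slower digests
```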

Co-authored-by: Georgi Gerganov <redacted>
11 months ago llama : support glm3 and glm4 (#8031)
toyer [Sun, 7 Jul 2024 12:52:10 +0000 (20:52 +0800)]
llama : support glm3 and glm4 (#8031)

* add chatglm3-6b model support (huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b)

Signed-off-by: XingXing Qiao <redacted>
* remove .rotary_pos_emb.inv_freq and unused code for the chatglm3 model

Signed-off-by: XingXing Qiao <redacted>
* fix lint error

Signed-off-by: XingXing Qiao <redacted>
* optimize convert-hf-to-gguf.py for chatglm model

Signed-off-by: XingXing Qiao <redacted>
* support glm-4-9b-chat

Signed-off-by: XingXing Qiao <redacted>
* fix eos tokens for glm4

* remove unused log

* add preprocess to chatglm3 and chatglm4

* add eos_id_list to llama.cpp

* fix code style

* fix code style

* fix conflicts

* fix conflicts

* Revert "add eos_id_list to llama.cpp"

This reverts commit 3a4d5790bfdc205c5b658204239f168fc21cc1a8.

* set <|endoftext|> as eos and <|user|> as eot

* fix chat template bug

* add comment to glm prefix and suffix

* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration

* fix chat template bug

* fix codestyle

* fix conflicts

* modified the general name of glm model

* fix conflicts

* remove prefix and suffix

* use normal glm4 chat template & use LLM_FFN_SWIGLU in phi3

* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`

- Fix E302 by adding two blank lines before top-level function definitions
- Replace print statements to fix NP100
- Fix E303 by ensuring only one blank line between lines of code

* fix rope ratio to solve incorrect answers

* fix by comments

---------

Signed-off-by: XingXing Qiao <redacted>
Co-authored-by: XingXing Qiao <redacted>
Co-authored-by: Umpire2018 <redacted>
11 months ago llama : fix n_rot default (#8348)
Georgi Gerganov [Sun, 7 Jul 2024 11:59:02 +0000 (14:59 +0300)]
llama : fix n_rot default (#8348)

ggml-ci

11 months ago py : use cpu-only torch in requirements.txt (#8335)
compilade [Sun, 7 Jul 2024 11:23:38 +0000 (07:23 -0400)]
py : use cpu-only torch in requirements.txt (#8335)

11 months ago finetune: Rename command name in README.md (#8343)
standby24x7 [Sun, 7 Jul 2024 10:38:02 +0000 (19:38 +0900)]
finetune: Rename command name in README.md (#8343)

Rename an old command name "finetune" to "llama-finetune"
in README.md

Signed-off-by: Masanari Iida <redacted>
11 months ago finetune: Rename an old command name in finetune.sh (#8344)
standby24x7 [Sun, 7 Jul 2024 10:37:47 +0000 (19:37 +0900)]
finetune: Rename an old command name in finetune.sh (#8344)

This patch replaces the old command name "main" with "llama-cli"
in finetune.sh.
The part I fixed is a comment, so it doesn't change the script's
behavior.

Signed-off-by: Masanari Iida <redacted>
11 months ago server: Retrieve prompt template in /props (#8337)
Bjarke Viksøe [Sun, 7 Jul 2024 09:10:38 +0000 (11:10 +0200)]
server: Retrieve prompt template in /props (#8337)

* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change log-level from Error to Warning for warning about template mismatch.

The front-end stands a better chance of executing the Jinja template format correctly than the server, which is currently just guessing it.

Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment
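
A hypothetical check of the new field (the endpoint path comes from the title; the exact JSON shape is an assumption):

```console
$ curl http://localhost:8080/props
{ ..., "chat_template": "<the model's Jinja2 template>", ... }
```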

11 months ago added support for Authorization Bearer tokens when downloading model (#8307)
Derrick T. Woolworth [Sat, 6 Jul 2024 20:32:04 +0000 (15:32 -0500)]
added support for Authorization Bearer tokens when downloading model (#8307)

* added support for Authorization Bearer tokens

* removed auth_token, removed set_ function, other small fixes

* Update common/common.cpp

---------

Co-authored-by: Xuan Son Nguyen <redacted>
11 months ago update main readme (#8333)
Xuan Son Nguyen [Sat, 6 Jul 2024 17:01:23 +0000 (19:01 +0200)]
update main readme (#8333)

11 months ago llama : add early return for empty range (#8327)
Daniel Bevenius [Sat, 6 Jul 2024 07:22:16 +0000 (09:22 +0200)]
llama : add early return for empty range (#8327)

* llama : add early return for empty range

This commit adds an early return to the llama_kv_cache_seq_add and
llama_kv_cache_seq_div functions.

The motivation for adding this is to avoid looping over the cache
when the range is empty. I ran into this when using the self-extend
feature in main.cpp.

Signed-off-by: Daniel Bevenius <redacted>
* llama : add static_cast to fix CI warning/error

This commit attempts to fix the following warning/error:

```console
src/llama.cpp:7271:31: error:
comparison of integer expressions of different signedness:
‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
 7271 |                         if (i < hparams.n_layer_dense_lead) {
      |                             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This can be reproduced locally by setting -Wsign-compare in the
Makefile.

Signed-off-by: Daniel Bevenius <redacted>
* squash! llama : add early return for empty range

Remove the setting of cache.head to 0 when the range is empty.

Signed-off-by: Daniel Bevenius <redacted>
* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <redacted>
Co-authored-by: Georgi Gerganov <redacted>
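
A minimal, self-contained sketch of the early-return guard from the first bullet (the function here is illustrative, not the exact llama_kv_cache_seq_add patch):

```cpp
#include <cstdint>

using llama_pos = int32_t;

// Shift positions that fall inside [p0, p1); return early when the range is empty.
static void seq_shift_sketch(llama_pos * cells, int n_cells, llama_pos p0, llama_pos p1, llama_pos delta) {
    if (p0 == p1) {
        return; // empty range: skip looping over the cache entirely
    }
    for (int i = 0; i < n_cells; i++) {
        if (cells[i] >= p0 && cells[i] < p1) {
            cells[i] += delta;
        }
    }
}
```
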
11 months ago Detokenizer fixes (#8039)
jaime-m-p [Fri, 5 Jul 2024 17:01:35 +0000 (19:01 +0200)]
Detokenizer fixes (#8039)

* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If you don't know the vocab type in advance
    - Differentiate other loading errors
  - Skip unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() was throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocab files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined unicode codepoints
* 'viking' detokenizer clean spaces

11 months ago Reorganize documentation pages (#8325)
Xuan Son Nguyen [Fri, 5 Jul 2024 16:08:32 +0000 (18:08 +0200)]
Reorganize documentation pages (#8325)

* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections

11 months ago llama : fix compile warning (#8304)
Georgi Gerganov [Fri, 5 Jul 2024 14:32:09 +0000 (17:32 +0300)]
llama : fix compile warning (#8304)

11 months ago cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281)
Natsu [Fri, 5 Jul 2024 14:29:35 +0000 (22:29 +0800)]
cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281)

11 months ago Enabled more data types for oneMKL gemm_batch (#8236)
Ouadie EL FAROUKI [Fri, 5 Jul 2024 12:23:25 +0000 (13:23 +0100)]
Enabled more data types for oneMKL gemm_batch (#8236)

11 months ago convert : remove AWQ remnants (#8320)
Georgi Gerganov [Fri, 5 Jul 2024 07:15:36 +0000 (10:15 +0300)]
convert : remove AWQ remnants (#8320)

11 months ago llama : minor indentation during tensor loading (#8304)
Georgi Gerganov [Fri, 5 Jul 2024 07:15:24 +0000 (10:15 +0300)]
llama : minor indentation during tensor loading (#8304)

* llama : minor indentation during tensor loading

ggml-ci

* llama : use int for layer iterators [no ci]

11 months ago CUDA: MMQ support for iq4_nl, iq4_xs (#8278)
Johannes Gäßler [Fri, 5 Jul 2024 07:06:31 +0000 (09:06 +0200)]
CUDA: MMQ support for iq4_nl, iq4_xs (#8278)

11 months ago CUDA: revert part of the RDNA1 optimizations (#8309)
Daniele [Fri, 5 Jul 2024 07:06:09 +0000 (07:06 +0000)]
CUDA: revert part of the RDNA1 optimizations (#8309)

The change on the launch_bounds was causing a small performance drop of about 25 t/s when computing perplexity

11 months ago llama : streamline embeddings from "non-embedding" models (#8087)
Douglas Hanley [Fri, 5 Jul 2024 07:05:56 +0000 (02:05 -0500)]
llama : streamline embeddings from "non-embedding" models (#8087)

11 months ago CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311)
Johannes Gäßler [Fri, 5 Jul 2024 07:05:34 +0000 (09:05 +0200)]
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311)

11 months ago readme : fix minor typos [no ci] (#8314)
Pieter Ouwerkerk [Fri, 5 Jul 2024 06:58:41 +0000 (02:58 -0400)]
readme : fix minor typos [no ci] (#8314)

11 months ago passkey : add short intro to README.md [no-ci] (#8317)
Daniel Bevenius [Fri, 5 Jul 2024 06:14:24 +0000 (08:14 +0200)]
passkey : add short intro to README.md [no-ci] (#8317)

* passkey : add short intro to README.md [no-ci]

This commit adds a short introduction to the README.md file in the
examples/passkey directory.

Signed-off-by: Daniel Bevenius <redacted>
* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <redacted>
Co-authored-by: Georgi Gerganov <redacted>
11 months ago llama : prefer n_ over num_ prefix (#8308)
Georgi Gerganov [Fri, 5 Jul 2024 06:10:03 +0000 (09:10 +0300)]
llama : prefer n_ over num_ prefix (#8308)

11 months ago contributing : update guidelines (#8316)
Georgi Gerganov [Fri, 5 Jul 2024 06:09:47 +0000 (09:09 +0300)]
contributing : update guidelines (#8316)

11 months ago [SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
luoyu-intel [Fri, 5 Jul 2024 05:06:13 +0000 (05:06 +0000)]
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)

* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp

11 months ago py : switch to snake_case (#8305)
Georgi Gerganov [Fri, 5 Jul 2024 04:53:33 +0000 (07:53 +0300)]
py : switch to snake_case (#8305)

* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <redacted>
11 months ago rm get_work_group_size() by local cache for performance (#8286)
Neo Zhang Jianyu [Fri, 5 Jul 2024 02:32:29 +0000 (10:32 +0800)]
rm get_work_group_size() by local cache for performance (#8286)

Co-authored-by: arthw <redacted>
11 months ago cli: add EOT when user hit Ctrl+C (#8296)
Xuan Son Nguyen [Thu, 4 Jul 2024 18:55:03 +0000 (20:55 +0200)]
cli: add EOT when user hit Ctrl+C (#8296)

* main: add need_insert_eot

* do not format system prompt if it is empty

11 months ago llama : add OpenELM support (#7359)
Icecream95 [Thu, 4 Jul 2024 17:14:21 +0000 (05:14 +1200)]
llama : add OpenELM support (#7359)

* Initial OpenELM support (270M only so far)

* Fill out missing entries in llama_model_type_name

* fixup! Initial OpenELM support (270M only so far)

Fix formatting

* llama : support all OpenELM models

* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.

* llama : minor spacing changes

Co-authored-by: Georgi Gerganov <redacted>
* llama : use std::array for per-layer hparams

* llama : fix save/load state

* llama : do not print hparams for vocab-only models

* llama : handle n_head == 0

* llama : use const ref for print_f and fix division by zero

* llama : fix t5 uses of n_head and n_ff

* llama : minor comment

---------

Co-authored-by: Francis Couture-Harpin <redacted>
Co-authored-by: Georgi Gerganov <redacted>
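
A rough sketch of the per-layer hyperparameter idea from the variable-GQA bullet above (field names and the fixed capacity are assumptions, following the std::array approach mentioned in the entry):

```cpp
#include <array>
#include <cstdint>

// Scalar metadata keys may also be arrays, so each layer can carry its own
// head count and FFN size (as needed for models like OpenELM).
struct hparams_sketch {
    uint32_t n_layer = 0;
    std::array<uint32_t, 512> n_head_arr = {};
    std::array<uint32_t, 512> n_ff_arr   = {};

    uint32_t n_head(uint32_t il) const { return n_head_arr[il]; } // per-layer lookup
    uint32_t n_ff  (uint32_t il) const { return n_ff_arr[il];   }
};
```
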
11 months ago tokenize : add --show-count (token) option (#8299)
Daniel Bevenius [Thu, 4 Jul 2024 16:38:58 +0000 (18:38 +0200)]
tokenize : add --show-count (token) option (#8299)

This commit adds a new option to the tokenize example, --show-count.
When this is set, the total number of tokens is printed to stdout.

This was added as an option because I was concerned that there might be
scripts that use the output from this program, and it might be better
not to print this information by default.

The motivation for this is that it can be useful to find out how many
tokens a file contains, for example when trying to determine prompt
input file sizes for testing.

Signed-off-by: Daniel Bevenius <redacted>
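
A hypothetical invocation (the binary name follows the llama- prefix convention and is an assumption here):

```console
$ llama-tokenize -m model.gguf -f prompt.txt --show-count
```
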
11 months ago build: Export hf-to-gguf as snakecase
ditsuke [Thu, 4 Jul 2024 15:24:35 +0000 (20:54 +0530)]
build: Export hf-to-gguf as snakecase

11 months ago doc: Add context for why we add an explicit pytorch source
ditsuke [Tue, 2 Jul 2024 19:32:56 +0000 (01:02 +0530)]
doc: Add context for why we add an explicit pytorch source

11 months ago chore: Remove rebase artifacts
ditsuke [Tue, 2 Jul 2024 10:18:13 +0000 (15:48 +0530)]
chore: Remove rebase artifacts

11 months ago chore: Fixup requirements and build
ditsuke [Tue, 2 Jul 2024 10:05:43 +0000 (15:35 +0530)]
chore: Fixup requirements and build

11 months ago chore: ignore all __pycache__
ditsuke [Tue, 2 Jul 2024 09:48:13 +0000 (15:18 +0530)]
chore: ignore all __pycache__

11 months ago fix: Update script paths in CI scripts
ditsuke [Sun, 10 Mar 2024 17:51:46 +0000 (23:21 +0530)]
fix: Update script paths in CI scripts

11 months ago fix: Actually include scripts in build
ditsuke [Wed, 28 Feb 2024 20:17:15 +0000 (01:47 +0530)]
fix: Actually include scripts in build

Not namespaced though :(

11 months ago build(python): Package scripts with PEP 517 compliance
ditsuke [Tue, 27 Feb 2024 06:31:02 +0000 (12:01 +0530)]
build(python): Package scripts with PEP 517 compliance

11 months ago Inference support for T5 and FLAN-T5 model families (#5763)
fairydreaming [Thu, 4 Jul 2024 13:46:11 +0000 (15:46 +0200)]
Inference support for T5 and FLAN-T5 model families (#5763)

* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embedding tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <redacted>
Co-authored-by: Georgi Gerganov <redacted>
11 months ago tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
Daniel Bevenius [Thu, 4 Jul 2024 10:53:42 +0000 (12:53 +0200)]
tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)

This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake subproject.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject
and this change moves it into the root cmake project so that it is applied
to all cmake subprojects.

11 months ago llama : suppress unref var in Windows MSVC (#8150)
Daniel Bevenius [Thu, 4 Jul 2024 10:50:57 +0000 (12:50 +0200)]
llama : suppress unref var in Windows MSVC (#8150)

* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building on Windows MSVC

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
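
One common way to silence C4101 is to drop the unused name from the catch clause; a sketch of the idea (the actual patch may differ):

```cpp
#include <stdexcept>

void handle() {
    try {
        // ... code that may throw ...
    } catch (const std::exception &) { // unnamed: nothing unreferenced to warn about
        // handle the error without inspecting the exception object
    }
}
```
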
11 months ago convert : fix gemma v1 tokenizer convert (#8248)
Georgi Gerganov [Thu, 4 Jul 2024 07:41:03 +0000 (10:41 +0300)]
convert : fix gemma v1 tokenizer convert (#8248)

ggml-ci

11 months ago [SYCL] Remove unneeded semicolons (#8280)
AidanBeltonS [Thu, 4 Jul 2024 01:07:19 +0000 (02:07 +0100)]
[SYCL] Remove unneeded semicolons (#8280)

11 months ago Define and optimize RDNA1 (#8085)
Daniele [Wed, 3 Jul 2024 23:02:58 +0000 (23:02 +0000)]
Define and optimize RDNA1 (#8085)

11 months ago ppl : fix n_seq_max for perplexity (#8277)
slaren [Wed, 3 Jul 2024 17:33:31 +0000 (19:33 +0200)]
ppl : fix n_seq_max for perplexity (#8277)

* ppl : fix n_seq_max for perplexity

* use 1 seq for kl_divergence

11 months ago fix phi 3 conversion (#8262)
Xuan Son Nguyen [Wed, 3 Jul 2024 14:01:54 +0000 (16:01 +0200)]
fix phi 3 conversion (#8262)

11 months ago fix typo (#8267)
Judd [Wed, 3 Jul 2024 12:40:16 +0000 (20:40 +0800)]
fix typo (#8267)

Co-authored-by: Judd <redacted>
11 months ago Dequant improvements rebase (#8255)
AidanBeltonS [Wed, 3 Jul 2024 01:55:34 +0000 (02:55 +0100)]
Dequant improvements rebase (#8255)

* Single load for half2

* Store scales in local mem

* Vec load quantized values

11 months ago fix: add missing short command line argument -mli for multiline-input (#8261)
MistApproach [Tue, 2 Jul 2024 20:56:46 +0000 (22:56 +0200)]
fix: add missing short command line argument -mli for multiline-input (#8261)

11 months ago Adding step to `clean` target to remove legacy binary names to reduce upgrade / migration confusion arising from #7809. (#8257)
Clint Herron [Tue, 2 Jul 2024 17:19:56 +0000 (13:19 -0400)]
Adding step to `clean` target to remove legacy binary names to reduce upgrade / migration confusion arising from #7809. (#8257)

11 months ago Removes multiple newlines at the end of files that are breaking the editorconfig step of CI. (#8258)
Clint Herron [Tue, 2 Jul 2024 16:18:10 +0000 (12:18 -0400)]
Removes multiple newlines at the end of files that are breaking the editorconfig step of CI. (#8258)

11 months ago Add `JAIS` model(s) (#8118)
Faisal Zaghloul [Tue, 2 Jul 2024 14:36:00 +0000 (10:36 -0400)]
Add `JAIS` model(s) (#8118)

* Add `JAIS` model(s)

* cleanup

* address review comments

* remove hack

* un-hardcode max-alibi-bias

* minor tweaks

---------

Co-authored-by: fmz <redacted>
11 months ago convert-hf : print output file name when completed (#8181)
Daniel Bevenius [Tue, 2 Jul 2024 06:40:49 +0000 (08:40 +0200)]
convert-hf : print output file name when completed (#8181)

* convert-hf : print output file name when completed

This commit adds the output file name to the log message when the
conversion is completed.

The motivation for this change is that when the `--outfile` option is not
specified it might not be obvious where the output file is written.

With this change the output of running the script will be something like
the following:
```console
INFO:hf-to-gguf:Model successfully exported to models/gemma-2-9b-it.gguf.
```

Signed-off-by: Daniel Bevenius <redacted>
* squash! convert-hf : print output file name when completed

Updates the output to support printing the directory if the output is
split into multiple files. Also, the output file name is now retrieved
from the model_instance object.

Signed-off-by: Daniel Bevenius <redacted>
* squash! convert-hf : print output file name when completed

Use parent attribute of Path object and string interpolation.

Signed-off-by: Daniel Bevenius <redacted>
* squash! convert-hf : print output file name when completed

Use os.sep instead of hardcoding the path separator.

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
11 months ago cuda : update supports_op for matrix multiplication (#8245)
slaren [Tue, 2 Jul 2024 06:39:38 +0000 (08:39 +0200)]
cuda : update supports_op for matrix multiplication (#8245)

11 months ago [SYCL] Fix win build conflict of math library (#8230)
luoyu-intel [Tue, 2 Jul 2024 04:50:07 +0000 (04:50 +0000)]
[SYCL] Fix win build conflict of math library (#8230)

* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16

11 months ago [SYCL] Fix the sub group size of Intel (#8106)
luoyu-intel [Tue, 2 Jul 2024 02:16:00 +0000 (02:16 +0000)]
[SYCL] Fix the sub group size of Intel (#8106)

* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size

11 months ago Fix gemma2 tokenizer convert (#8244)
Xuan Son Nguyen [Mon, 1 Jul 2024 23:07:23 +0000 (01:07 +0200)]
Fix gemma2 tokenizer convert (#8244)

* fix gemma2 tokenizer convert

* remove scores

* improve code, fix new line issue

11 months ago CUDA: refactor and optimize IQ MMVQ (#8215)
Johannes Gäßler [Mon, 1 Jul 2024 18:39:06 +0000 (20:39 +0200)]
CUDA: refactor and optimize IQ MMVQ (#8215)

* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix

11 months ago readme: add Paddler to the list of projects (#8239)
Mateusz Charytoniuk [Mon, 1 Jul 2024 17:13:22 +0000 (19:13 +0200)]
readme: add Paddler to the list of projects (#8239)

11 months ago gemma2: add sliding window mask (#8227)
Xuan Son Nguyen [Mon, 1 Jul 2024 16:48:34 +0000 (18:48 +0200)]
gemma2: add sliding window mask (#8227)

* gemma2: add sliding window mask

* fix data_swa uninitialized

* better naming

* add co-author

Co-authored-by: Arlo Phoenix <redacted>
* replace list with single tensor

* update

* llama : minor styling

* convert : add sanity check for query_pre_attn_scalar

* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <redacted>
Co-authored-by: Georgi Gerganov <redacted>
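
A minimal sketch of a sliding-window attention mask, assuming a window size n_swa (the 4096 commonly cited for Gemma2 is not taken from this patch):

```cpp
#include <cmath>
#include <vector>

// Token i may attend to token j only if j is causal (j <= i) and
// within the last n_swa positions.
std::vector<float> build_swa_mask(int n_tokens, int n_swa) {
    std::vector<float> mask((size_t) n_tokens * n_tokens, -INFINITY);
    for (int i = 0; i < n_tokens; i++) {
        for (int j = 0; j <= i; j++) {
            if (i - j < n_swa) {
                mask[(size_t) i * n_tokens + j] = 0.0f; // inside the window: visible
            }
        }
    }
    return mask;
}
```
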
11 months ago readme : update tool list (#8209)
Roni [Mon, 1 Jul 2024 12:48:16 +0000 (14:48 +0200)]
readme : update tool list (#8209)

* Added gppm to Tool list in README

* Update README.md

---------

Co-authored-by: Georgi Gerganov <redacted>
11 months ago nix : enable curl (#8043)
Michael Francis [Mon, 1 Jul 2024 11:47:04 +0000 (07:47 -0400)]
nix : enable curl (#8043)

Co-authored-by: Georgi Gerganov <redacted>
11 months ago nix : remove OpenCL remnants (#8235)
Georgi Gerganov [Mon, 1 Jul 2024 11:46:18 +0000 (14:46 +0300)]
nix : remove OpenCL remnants (#8235)

* nix : remove OpenCL remnants

* minor : remove parentheses

11 months ago Document BERT support. (#8205)
iacore [Mon, 1 Jul 2024 11:40:58 +0000 (11:40 +0000)]
Document BERT support. (#8205)

* Update README.md

document BERT support

* Update README.md

11 months ago [SYCL] Update SYCL-Rope op and Refactor (#8157)
zhentaoyu [Mon, 1 Jul 2024 11:39:06 +0000 (19:39 +0800)]
[SYCL] Update SYCL-Rope op and Refactor (#8157)

* align with rope.cu and move sycl-op to a single file

12 months ago flake.lock: Update (#8218)
Georgi Gerganov [Sun, 30 Jun 2024 23:09:34 +0000 (02:09 +0300)]
flake.lock: Update (#8218)

12 months ago Fix new line issue with chat template, disable template when in-prefix/suffix is set (#8203)
Xuan Son Nguyen [Sun, 30 Jun 2024 18:27:13 +0000 (20:27 +0200)]
Fix new line issue with chat template, disable template when in-prefix/suffix is set (#8203)

* preserve new line llama_chat_format_single

* disable chat template if in-prefix/suffix is set

* remove redundant change

12 months ago llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 (#8197)
Andrei [Sun, 30 Jun 2024 03:44:08 +0000 (20:44 -0700)]
llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 (#8197)

* Add attention and final logit softcapping.

* fix

* Add custom add_ functions

* Disable flash attention for Gemma2

* Update src/llama.cpp

Co-authored-by: slaren <redacted>
* Add default value for attention and final logit softcap value

* Add custom kq scaling from Gemma2Attention

* Remove custom pre attention scaling and use computed value instead.

---------

Co-authored-by: slaren <redacted>
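
A small sketch of the soft-capping operation itself (the cap values are the ones commonly reported for Gemma2 and are an assumption here, not taken from the patch):

```cpp
#include <cmath>

// Soft-capping squashes x smoothly into (-cap, cap) while staying roughly linear near 0.
static inline float softcap(float x, float cap) {
    return cap * tanhf(x / cap);
}

// Per the entry above: applied to the attention scores before softmax
// (cap ~ 50.0f) and to the final output logits (cap ~ 30.0f).
```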