git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
16 months ago  llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478)
Aarni Koskela [Tue, 13 Feb 2024 16:18:16 +0000 (18:18 +0200)]
llama : allow raw byte in SPM vocabs; don't crash on nl 404 (#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs

16 months ago  llama : make load error reporting more granular (#5477)
Aarni Koskela [Tue, 13 Feb 2024 13:24:50 +0000 (15:24 +0200)]
llama : make load error reporting more granular (#5477)

Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.

16 months ago  finetune : rename feed-forward tensors (w1/w2/w3) (#4839)
Daniel Bevenius [Tue, 13 Feb 2024 13:15:42 +0000 (14:15 +0100)]
finetune : rename feed-forward tensors (w1/w2/w3) (#4839)

* finetune: rename feed-forward tensors (w1/w2/w3)

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be in line with the names
used in the llama_layer struct in llama.cpp.

Signed-off-by: Daniel Bevenius <redacted>
* train-text-from-scratch: rename ff tensors

This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
ffn_down and ffn_up respectively.

The motivation for this change is to make it easier to understand the
purpose of the tensors. This also seems to be in line with the names
used in the llama_layer struct in llama.cpp.

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
16 months ago  tests : multi-thread the tokenizer tests (#5474)
Georgi Gerganov [Tue, 13 Feb 2024 13:14:22 +0000 (15:14 +0200)]
tests : multi-thread the tokenizer tests (#5474)

* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci

16 months ago  llama : support batched embeddings (#5466)
Douglas Hanley [Tue, 13 Feb 2024 12:06:58 +0000 (06:06 -0600)]
llama : support batched embeddings (#5466)

* batched embedding: pool outputs by sequence id. updated embedding example

* bring back non-causal attention

* embd : minor improvements

* llama : minor

---------

Co-authored-by: Georgi Gerganov <redacted>
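
The "pool outputs by sequence id" step above reduces the per-token outputs of a batch into one embedding per sequence. A minimal sketch of mean pooling by sequence id, with hypothetical names (not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Average all token embeddings that share a sequence id (mean pooling).
// Purely illustrative; names and data layout are hypothetical.
std::map<int32_t, std::vector<float>> pool_by_seq_id(
        const std::vector<std::vector<float>> & token_embd, // one row per token
        const std::vector<int32_t>            & seq_ids) {  // sequence id per token
    std::map<int32_t, std::vector<float>> sum;
    std::map<int32_t, int>                count;
    for (size_t i = 0; i < token_embd.size(); ++i) {
        std::vector<float> & acc = sum[seq_ids[i]];
        if (acc.empty()) {
            acc.assign(token_embd[i].size(), 0.0f);
        }
        for (size_t d = 0; d < token_embd[i].size(); ++d) {
            acc[d] += token_embd[i][d];
        }
        count[seq_ids[i]]++;
    }
    for (auto & kv : sum) {
        for (float & v : kv.second) {
            v /= count[kv.first]; // divide accumulated sum by the token count
        }
    }
    return sum;
}
```
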
16 months ago  make: add error message for bad CUDA version (#5444)
Johannes Gäßler [Tue, 13 Feb 2024 11:38:37 +0000 (12:38 +0100)]
make: add error message for bad CUDA version (#5444)

* make: add error message for bad CUDA version

* Update Makefile

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
16 months ago  bert : add tests + fix quantization (#5475)
Georgi Gerganov [Tue, 13 Feb 2024 11:01:29 +0000 (13:01 +0200)]
bert : add tests + fix quantization (#5475)

* llama : do not quantize pos embd and token type tensors

* ci : add BERT tests

ggml-ci

* ci : do not do BERT tests on low-perf nodes

ggml-ci

16 months ago  tests : disable moe test (#5473)
Georgi Gerganov [Tue, 13 Feb 2024 09:20:24 +0000 (11:20 +0200)]
tests : disable moe test (#5473)

16 months ago  ggml-quants : fix compiler warnings (shadow variable) (#5472)
Kawrakow [Tue, 13 Feb 2024 07:07:57 +0000 (09:07 +0200)]
ggml-quants : fix compiler warnings (shadow variable) (#5472)

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago  llama : fix quantization when tensors are missing (#5423)
Georgi Gerganov [Mon, 12 Feb 2024 18:14:39 +0000 (20:14 +0200)]
llama : fix quantization when tensors are missing (#5423)

16 months ago  swift : package no longer use ggml dependency (#5465)
Georgi Gerganov [Mon, 12 Feb 2024 17:54:29 +0000 (19:54 +0200)]
swift : package no longer use ggml dependency (#5465)

* Revert "swift : update Package.swift to use ggml as dependency (#4691)"

This reverts commit ece9a45e8ffb73ad461c792720c2fec28b0137bc.

* spm : add ggml headers

16 months ago  py : fix persimmon `n_rot` conversion (#5460)
Lee [Mon, 12 Feb 2024 17:29:57 +0000 (01:29 +0800)]
py : fix persimmon `n_rot` conversion (#5460)

* convert : fix persimmon official weight conversion to write correct n_rot.

* Update convert-persimmon-to-gguf.py

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  ggml-sycl: Replace 3d ops with macro (#5458)
Abhilash Majumder [Mon, 12 Feb 2024 14:52:05 +0000 (20:22 +0530)]
ggml-sycl: Replace 3d ops with macro  (#5458)

* use macro

* use macro

* fix format

16 months ago  llava : remove prog parameter from ArgumentParser (#5457)
Daniel Bevenius [Mon, 12 Feb 2024 08:38:44 +0000 (09:38 +0100)]
llava : remove prog parameter from ArgumentParser (#5457)

* llava: remove prog parameter from ArgumentParser

This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value which is the name of the script.

The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```

Signed-off-by: Daniel Bevenius <redacted>
* ci: add W503 to flake8 ignore list

This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
16 months ago  sync : ggml (#5452)
Georgi Gerganov [Mon, 12 Feb 2024 07:16:06 +0000 (09:16 +0200)]
sync : ggml (#5452)

* ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

* sync : ggml

* update llama.cpp, clip.cpp, export-lora.cpp

* update finetune.cpp, train-text-from-scratch.cpp

ggml-ci

* ggml-backend : reduce alignment to 32 to match gguf and fix mmap

---------

Co-authored-by: slaren <redacted>
16 months ago  CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)
Johannes Gäßler [Sun, 11 Feb 2024 18:08:39 +0000 (19:08 +0100)]
CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
16 months ago  Add support for BERT embedding models (#5423)
Douglas Hanley [Sun, 11 Feb 2024 16:21:38 +0000 (10:21 -0600)]
Add support for BERT embedding models (#5423)

* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
16 months ago  flake.lock: Update
github-actions[bot] [Sun, 11 Feb 2024 00:17:31 +0000 (00:17 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

16 months ago  vulkan: only use M-sized matmul on Apple GPUs (#5412)
Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]
vulkan: only use M-sized matmul on Apple GPUs (#5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <redacted>
* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, so force using
M-size with this vendor.

Signed-off-by: Sergio Lopez <redacted>
---------

Signed-off-by: Sergio Lopez <redacted>
16 months ago  common : use enums for sampler types (#5418)
Alexey Parfenov [Sun, 11 Feb 2024 13:43:31 +0000 (13:43 +0000)]
common : use enums for sampler types (#5418)

* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* minor : spaces

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  server : allow to specify tokens as strings in logit_bias (#5003)
Alexey Parfenov [Sun, 11 Feb 2024 13:38:14 +0000 (13:38 +0000)]
server : allow to specify tokens as strings in logit_bias (#5003)

* server: allow to specify tokens as strings in logit_bias

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  main : ctrl+C print timing in non-interactive mode (#3873)
Georgi Gerganov [Sun, 11 Feb 2024 13:35:50 +0000 (15:35 +0200)]
main : ctrl+C print timing in non-interactive mode (#3873)

16 months ago  common : fix compile warning
Georgi Gerganov [Sun, 11 Feb 2024 13:33:43 +0000 (15:33 +0200)]
common : fix compile warning

16 months ago  ggml : fix compile warnings (unused vars) (#4966)
Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]
ggml : fix compile warnings (unused vars) (#4966)

16 months ago  ggml : add mmla kernels for quantized GEMM (#4966)
snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]
ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q8_0_q8_0 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_0_q8_0 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

Armv8.2-A and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an MMLA kernel for
q4_1_q8_1 GEMM. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
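
As a rough sketch of the compile-time gating described in the bullets above (names are made up; this is not the llama.cpp kernel): when `__ARM_FEATURE_MATMUL_INT8` is defined, the i8mm intrinsic `vmmlaq_s32` is available and can replace the dot-product accumulation.

```cpp
// Illustrative only: feature-gating an int8 accumulation step on aarch64.
#include <arm_neon.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
// vmmlaq_s32 multiply-accumulates two int8 2x8 operands into a 2x2 int32
// accumulator in a single instruction (higher throughput than dot product).
static inline int32x4_t accum_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
#elif defined(__ARM_FEATURE_DOTPROD)
// Fallback: pairwise int8 dot products.
static inline int32x4_t accum_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}
#endif
```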

16 months ago  lookup: add print for drafting performance (#5450)
Johannes Gäßler [Sun, 11 Feb 2024 11:44:51 +0000 (12:44 +0100)]
lookup: add print for drafting performance (#5450)

16 months ago  server : add llama2 chat template (#5425)
Xuan Son Nguyen [Sun, 11 Feb 2024 10:16:22 +0000 (11:16 +0100)]
server : add llama2 chat template (#5425)

* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
16 months ago  metal : use autoreleasepool to avoid memory leaks (#5437)
Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]
metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using the
`MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in
[1,2]

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps the `ggml_metal_graph_compute` in a
`@autoreleasepool`.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

16 months ago  scripts : update sync scripts with new backends
Georgi Gerganov [Sat, 10 Feb 2024 07:53:05 +0000 (09:53 +0200)]
scripts : update sync scripts with new backends

16 months ago  sync : ggml
Georgi Gerganov [Sat, 10 Feb 2024 07:30:36 +0000 (09:30 +0200)]
sync : ggml

16 months ago  ggml : add abort_callback for cpu backend (ggml/725)
Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]
ggml : add abort_callback for cpu backend (ggml/725)

* a way to use abort_callback with the cpu backend

* whisper update

16 months ago  vulkan: Set limit for task concurrency (#5427)
Neuman Vong [Fri, 9 Feb 2024 18:30:19 +0000 (05:30 +1100)]
vulkan: Set limit for task concurrency (#5427)

A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with Too many open files.

    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files

This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.

16 months ago  llava : add requirements.txt and update README.md (#5428)
Daniel Bevenius [Fri, 9 Feb 2024 13:00:59 +0000 (14:00 +0100)]
llava : add requirements.txt and update README.md (#5428)

* llava: add requirements.txt and update README.md

This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.

The motivation for this is to make it easier for users to run the scripts in
`examples/llava`. This avoids users running into missing package issues
if the packages are not installed on their system.

Signed-off-by: Daniel Bevenius <redacted>
* llava: fix typo in llava-surgery.py output

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
16 months ago  server : fix prompt caching for repeated prompts (#5420)
Riley Stewart [Fri, 9 Feb 2024 10:49:49 +0000 (02:49 -0800)]
server : fix prompt caching for repeated prompts (#5420)

16 months ago  llama : do not cap thread count when MoE on CPU (#5419)
Paul Tsochantaris [Fri, 9 Feb 2024 10:48:06 +0000 (10:48 +0000)]
llama : do not cap thread count when MoE on CPU (#5419)

* Not capping thread count when MoE inference is running on CPU

* Whitespace

16 months ago  readme : add JavaScript/Wasm repo (#5415)
Marko Tasic [Fri, 9 Feb 2024 10:17:00 +0000 (11:17 +0100)]
readme : add JavaScript/Wasm repo (#5415)

16 months ago  ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)
Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]
ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)

16 months ago  Fix Vulkan crash on APUs with very little device memory (#5424)
0cc4m [Fri, 9 Feb 2024 05:52:33 +0000 (06:52 +0100)]
Fix Vulkan crash on APUs with very little device memory (#5424)

* Fix Vulkan crash on APUs with very little device memory

* Fix debug output function names

16 months ago  CUDA: more warps for mmvq on NVIDIA (#5394)
Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]
CUDA: more warps for mmvq on NVIDIA (#5394)

16 months ago  llama : do not print "offloading layers" message in CPU-only builds (#5416)
slaren [Thu, 8 Feb 2024 20:33:03 +0000 (21:33 +0100)]
llama : do not print "offloading layers" message in CPU-only builds (#5416)

16 months ago  Fix f16_sycl cpy call from Arc (#5411)
Abhilash Majumder [Thu, 8 Feb 2024 17:09:10 +0000 (22:39 +0530)]
Fix f16_sycl cpy call from Arc (#5411)

* fix f16_sycl cpy call

* rm old logic

* add fp16 build CI

* use macro

* format fix

16 months ago  llava : add missing .py, and fix paths in README.md (#5414)
Daniel Bevenius [Thu, 8 Feb 2024 14:20:03 +0000 (15:20 +0100)]
llava : add missing .py, and fix paths in README.md (#5414)

This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.

Signed-off-by: Daniel Bevenius <redacted>
16 months ago  fix trailing whitespace (#5407)
Johannes Gäßler [Thu, 8 Feb 2024 10:36:54 +0000 (11:36 +0100)]
fix trailing whitespace (#5407)

16 months ago  llama : fix MiniCPM (#5392)
runfuture [Thu, 8 Feb 2024 10:36:19 +0000 (18:36 +0800)]
llama : fix MiniCPM (#5392)

* fix bug for norm_rms_eps missing

* to align with the same order as convert.py for model write

* fix: undo HF models permute tensor

* update for flake8 lint

16 months ago  llava: fix typo/formatting in README.md (#5405)
Daniel Bevenius [Thu, 8 Feb 2024 08:58:19 +0000 (09:58 +0100)]
llava: fix typo/formatting in README.md (#5405)

This commit fixes a typo in the README.md file for the llava example
which is causing the formatting to look a little off:

Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally

Signed-off-by: Daniel Bevenius <redacted>
16 months ago  sampling: fix top_k <= 0 (#5388)
Johannes Gäßler [Thu, 8 Feb 2024 08:46:30 +0000 (09:46 +0100)]
sampling: fix top_k <= 0 (#5388)

* sampling: fix top_k <= 0

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  tests : .gitignore obj files
Georgi Gerganov [Thu, 8 Feb 2024 07:46:47 +0000 (09:46 +0200)]
tests : .gitignore obj files

16 months ago  CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)
Michael Podvitskiy [Wed, 7 Feb 2024 21:39:23 +0000 (22:39 +0100)]
CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)

Co-authored-by: Jared Van Bortel <redacted>
16 months ago  fix typo in readme (#5399)
Ebey Abraham [Wed, 7 Feb 2024 21:11:30 +0000 (21:11 +0000)]
fix typo in readme (#5399)

Co-authored-by: Ebey Abraham <redacted>
16 months ago  Add Ava in the list of llama.cpp UIs (#4362)
Kamil Tomšík [Wed, 7 Feb 2024 18:44:52 +0000 (19:44 +0100)]
Add Ava in the list of llama.cpp UIs (#4362)

16 months ago  CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)
Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)

16 months ago  [SYCL] update install make by w64devkit (#5297)
Neo Zhang Jianyu [Wed, 7 Feb 2024 10:16:55 +0000 (18:16 +0800)]
[SYCL] update install make by w64devkit (#5297)

16 months ago  llava-cli : always tokenize special tokens (#5382)
Xiao-Yong Jin [Wed, 7 Feb 2024 08:17:25 +0000 (02:17 -0600)]
llava-cli : always tokenize special tokens (#5382)

* llava-cli: tokenize special tokens in prompt

* llava-cli: use the escape CLI argument, remove incomplete separate escaping process

16 months ago  Basic Vulkan Multi-GPU implementation (#5321)
0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]
Basic Vulkan Multi-GPU implementation (#5321)

* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <redacted>
16 months ago  readme : modernize (#5379)
Eve [Wed, 7 Feb 2024 06:21:30 +0000 (06:21 +0000)]
readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md

16 months ago  readme : update ui list (#5354)
Ben Williams [Wed, 7 Feb 2024 06:16:48 +0000 (22:16 -0800)]
readme : update ui list (#5354)

16 months ago  llama : add MiniCPM support (#5346)
runfuture [Wed, 7 Feb 2024 06:15:56 +0000 (14:15 +0800)]
llama : add MiniCPM support (#5346)

* support minicpm arch.

* fix tab/space typo.

* convert minicpm model via convert-hf-gguf.py

* try to make tokenizer work

* fix bug for quantize minicpm

* fix for flake8 lint

* remove convert-minicpm.py

* fix for editorconfig

* correct minicpm model type (size)

* constants expanded for minicpm

* Minor change of the constant names for minicpm

16 months ago  server : update `/props` with "total_slots" value (#5373)
Justin Parker [Wed, 7 Feb 2024 06:15:19 +0000 (01:15 -0500)]
server : update `/props` with "total_slots" value (#5373)

* include total "num_slots" in default_generation_settings_for_props

* cleanup total_slots return value in /props endpoint

* update /props endpoint docs with total_slots

* remove num_slots from default_generation_settings_for_props

* update /props endpoint section

16 months ago  convert : fix TypeError on GPT-2 vocab.json (#5288)
Sang-Kil Park [Wed, 7 Feb 2024 04:28:00 +0000 (13:28 +0900)]
convert : fix TypeError on GPT-2 vocab.json (#5288)

16 months ago  server : remove model.json endpoint (#5371)
Alexey Parfenov [Tue, 6 Feb 2024 18:08:38 +0000 (18:08 +0000)]
server : remove model.json endpoint (#5371)

16 months ago  CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)
Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)

16 months ago  Update README.md (#5366)
Kawrakow [Tue, 6 Feb 2024 17:00:16 +0000 (19:00 +0200)]
Update README.md (#5366)

Add some links to quantization related PRs

16 months ago  Slight quantization improvement for Q4_K and Q5_K (#5361)
Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]
Slight quantization improvement for Q4_K and Q5_K (#5361)

* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago  readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)
BarfingLemurs [Tue, 6 Feb 2024 14:06:48 +0000 (09:06 -0500)]
readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)

16 months ago  CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)
Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]
CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)

16 months ago  server : include total "num_slots" in props endpoint (#5349)
Justin Parker [Tue, 6 Feb 2024 09:20:59 +0000 (04:20 -0500)]
server : include total "num_slots" in props endpoint (#5349)

16 months ago  server : add `dynatemp_range` and `dynatemp_exponent` (#5352)
Michael Coppola [Tue, 6 Feb 2024 09:20:00 +0000 (04:20 -0500)]
server : add `dynatemp_range` and `dynatemp_exponent` (#5352)

* server: added `dynatemp_range` and `dynatemp_exponent`

* Update README.md

---------

Co-authored-by: Michael Coppola <redacted>
16 months ago  server : various fixes for the prompt field in /completion (#5300)
Niall Coates [Tue, 6 Feb 2024 08:16:23 +0000 (08:16 +0000)]
server : various fixes for the prompt field in /completion (#5300)

server : fix deadlock when prompt array contains strings and numbers

server : removed an unnecessary generation when generating multi-prompts

server : removed an unnecessary assert

16 months ago  py : handle byte tokens in `get_token_type` (#5341)
Georgi Gerganov [Tue, 6 Feb 2024 05:47:22 +0000 (07:47 +0200)]
py : handle byte tokens in `get_token_type` (#5341)

* py : handle byte tokens in `get_token_type`

* py : fix empty bytes arg

16 months ago  make: Use ccache for faster compilation (#5318)
Johannes Gäßler [Mon, 5 Feb 2024 18:33:00 +0000 (19:33 +0100)]
make: Use ccache for faster compilation (#5318)

* make: Use ccache for faster compilation

16 months ago  README: updated introduction (#5343)
Johannes Gäßler [Mon, 5 Feb 2024 14:55:10 +0000 (15:55 +0100)]
README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  ggml : make use of ggml-quants.h possible in C++ code (#5338)
Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]
ggml : make use of ggml-quants.h possible in C++ code (#5338)

* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <redacted>
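
The second bullet refers to `static_assert` being a keyword in C++, so a C-compatibility fallback must only be defined when compiling as C. A hedged sketch of that kind of guard (assumed form, not necessarily the exact ggml header):

```cpp
// Only provide a static_assert fallback for pre-C11 C compilers; in C++ the
// name is a keyword and must not be redefined. Illustrative sketch only.
#if !defined(__cplusplus) && !defined(static_assert) && \
    (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 201112L)
#define static_assert(cond, msg) struct global_scope_noop_trick
#endif
```
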
16 months ago  ggml : avoid duplicating function calls using MIN/MAX macros (#5325)
Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]
ggml : avoid duplicating function calls using MIN/MAX macros (#5325)

* Avoid duplicating function calls when using MIN/MAX macros.

Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression once we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <redacted>
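
A minimal sketch of the double evaluation described above (the `expensive` function and variables are invented for illustration):

```cpp
#include <cstdio>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Stand-in for a non-trivial function call inside a MIN/MAX macro argument.
static int expensive(int x) { std::printf("called\n"); return x * x; }

int main() {
    int x = 3;
    // Expands to ((expensive(x)) > (0) ? (expensive(x)) : (0)), so the
    // function runs twice ("called" is printed twice here).
    int bad = MAX(expensive(x), 0);

    // Evaluating the expression once before applying the macro avoids the
    // duplicate call.
    int tmp  = expensive(x);
    int good = MAX(tmp, 0);
    return bad - good; // 0
}
```
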
16 months ago  iq3_xxs: guards for the no-imatrix situation (#5334)
Kawrakow [Mon, 5 Feb 2024 10:32:27 +0000 (12:32 +0200)]
iq3_xxs: guards for the no-imatrix situation (#5334)

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago  py : fix internlm2-hf convert to gguf (#5305)
Guoteng [Mon, 5 Feb 2024 09:04:06 +0000 (17:04 +0800)]
py : fix internlm2-hf convert to gguf (#5305)

* py : fix internlm2-hf convert to gguf

* ggml-ci

16 months ago  iq2_xxs: tune quantization (#5320)
Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]
iq2_xxs: tune quantization (#5320)

We get slightly better PPL, and we cut quantization time nearly in half.

The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale that we
got that way.

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago  server : allow to get default generation settings for completion (#5307)
Alexey Parfenov [Mon, 5 Feb 2024 08:10:22 +0000 (08:10 +0000)]
server : allow to get default generation settings for completion (#5307)

16 months ago  common : add dynamic temperature parameters to main example cli (#5295)
l3utterfly [Mon, 5 Feb 2024 08:00:47 +0000 (17:00 +0900)]
common : add dynamic temperature parameters to main example cli (#5295)

* added dynamic temp params in main

* added help text

16 months ago  scripts : fix typos, cleanup (#5303)
Georgi Gerganov [Mon, 5 Feb 2024 07:48:03 +0000 (09:48 +0200)]
scripts : fix typos, cleanup (#5303)

16 months ago  scripts : add non-interactive server-llm.sh (#5303)
Нияз Гарифзянов [Mon, 5 Feb 2024 07:43:57 +0000 (10:43 +0300)]
scripts : add non-interactive server-llm.sh (#5303)

* Update server-llm.sh

Add flag --non-interactive that allows running the script without asking for permission

* Update scripts/server-llm.sh

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago  readme : add CodeShell models to the supported models list (#5330)
chiranko [Mon, 5 Feb 2024 07:41:38 +0000 (15:41 +0800)]
readme : add CodeShell models to the supported models list (#5330)

16 months ago  [SYCL] Fix cpy with dims of 3 (#5289)
AidanBeltonS [Mon, 5 Feb 2024 07:08:24 +0000 (07:08 +0000)]
[SYCL] Fix cpy with dims of 3 (#5289)

* Fix cpy with dims of 3

* rm asserts

---------

Co-authored-by: Abhilash Majumder <redacted>
16 months ago  flake.lock: Update
github-actions[bot] [Sun, 4 Feb 2024 00:17:24 +0000 (00:17 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)

16 months ago  Adding some imatrix tools (#5302)
Kawrakow [Sun, 4 Feb 2024 08:39:58 +0000 (10:39 +0200)]
Adding some imatrix tools (#5302)

* imatrix: adding --combine and --continue-from

* imatrix: be able to start from a specific chunk

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago  cmake : use set() for LLAMA_WIN_VER (#5298)
Welby Seely [Sun, 4 Feb 2024 04:18:51 +0000 (23:18 -0500)]
cmake : use set() for LLAMA_WIN_VER (#5298)

option() is specifically for booleans.

Fixes #5158

17 months ago  make: add nvcc info print (#5310)
Johannes Gäßler [Sat, 3 Feb 2024 19:15:13 +0000 (20:15 +0100)]
make: add nvcc info print (#5310)

17 months ago  make: fix nvcc optimization flags for host code (#5309)
Johannes Gäßler [Sat, 3 Feb 2024 19:14:59 +0000 (20:14 +0100)]
make: fix nvcc optimization flags for host code (#5309)

17 months ago  add Vulkan support to Nix flake
Martin Schwaighofer [Sun, 28 Jan 2024 11:59:43 +0000 (12:59 +0100)]
add Vulkan support to Nix flake

17 months ago  Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)
0cc4m [Sat, 3 Feb 2024 17:15:00 +0000 (18:15 +0100)]
Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)

* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt

17 months ago  refactor : switch to emplace_back to avoid extra object (#5291)
Michael Klimenko [Sat, 3 Feb 2024 11:23:37 +0000 (12:23 +0100)]
refactor : switch to emplace_back to avoid extra object (#5291)

17 months ago  YaRN : store rope scaling type as int32_t in memory (#5285)
Jared Van Bortel [Sat, 3 Feb 2024 11:22:06 +0000 (06:22 -0500)]
YaRN : store rope scaling type as int32_t in memory (#5285)

* YaRN : store rope scaling type as int32_t in memory

* llama : store mapped names as const char *

17 months ago  readme : add tenere in the ui tools list (#5284)
BADR [Sat, 3 Feb 2024 11:20:26 +0000 (12:20 +0100)]
readme : add tenere in the ui tools list (#5284)

17 months ago  Fix im2col with 32fp (#5286)
AidanBeltonS [Sat, 3 Feb 2024 08:11:37 +0000 (08:11 +0000)]
Fix im2col with 32fp (#5286)

17 months ago  perplexity : fix KL divergence calculations on Windows (#5273)
kalomaze [Fri, 2 Feb 2024 14:15:30 +0000 (08:15 -0600)]
perplexity : fix KL divergence calculations on Windows (#5273)

17 months ago  scripts : parse wtype in server-llm.sh (#5167)
Georgi Gerganov [Fri, 2 Feb 2024 12:23:40 +0000 (14:23 +0200)]
scripts : parse wtype in server-llm.sh (#5167)

* scripts : parse wtype in server-llm.sh

* scripts : fix check for wfile

17 months ago  py : add check for '.attn.masked_bias' layers to GPT2model (#5281)
Mirror Azure [Fri, 2 Feb 2024 11:39:09 +0000 (14:39 +0300)]
py : add check for '.attn.masked_bias' layers to GPT2model (#5281)

17 months ago  Tidy ggml-sycl (#5261)
AidanBeltonS [Fri, 2 Feb 2024 08:39:48 +0000 (08:39 +0000)]
Tidy ggml-sycl (#5261)

* Tidy some code in ggml-sycl

* Remove blank space

* Remove std::printf comments

---------

Co-authored-by: Abhilash Majumder <redacted>
17 months ago  docker : add build for SYCL, Vulkan + update readme (#5228)
Xuan Son Nguyen [Fri, 2 Feb 2024 07:56:31 +0000 (08:56 +0100)]
docker : add build for SYCL, Vulkan + update readme (#5228)

* add vulkan dockerfile

* intel dockerfile: compile sycl by default

* fix vulkan dockerfile

* add docs for vulkan

* docs: sycl build in docker

* docs: remove trailing spaces

* docs: sycl: add docker section

* docs: clarify install vulkan SDK outside docker

* sycl: use intel/oneapi-basekit docker image

* docs: correct TOC

* docs: correct docker image for Intel oneMKL

17 months ago  [SYCL] get MAX_MEM_ALLOC from device property (#5270)
Meng, Hengyu [Fri, 2 Feb 2024 07:54:14 +0000 (15:54 +0800)]
[SYCL] get MAX_MEM_ALLOC from device property (#5270)

* get max alloc size from device prop

* fix macro typo

17 months ago  [SYCL] update guide of SYCL backend (#5254)
Neo Zhang Jianyu [Fri, 2 Feb 2024 07:53:27 +0000 (15:53 +0800)]
[SYCL] update guide of SYCL backend (#5254)

* update guide for make installation, memory, gguf model link,  rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues