git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
16 months ago llama : fix quantization when tensors are missing (#5423)
Georgi Gerganov [Mon, 12 Feb 2024 18:14:39 +0000 (20:14 +0200)]
llama : fix quantization when tensors are missing (#5423)

16 months ago swift : package no longer use ggml dependency (#5465)
Georgi Gerganov [Mon, 12 Feb 2024 17:54:29 +0000 (19:54 +0200)]
swift : package no longer use ggml dependency (#5465)

* Revert "swift : update Package.swift to use ggml as dependency (#4691)"

This reverts commit ece9a45e8ffb73ad461c792720c2fec28b0137bc.

* spm : add ggml headers

16 months ago py : fix persimmon `n_rot` conversion (#5460)
Lee [Mon, 12 Feb 2024 17:29:57 +0000 (01:29 +0800)]
py : fix persimmon `n_rot` conversion (#5460)

* convert : fix persimmon official weight conversion to write correct n_rot.

* Update convert-persimmon-to-gguf.py

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago ggml-sycl: Replace 3d ops with macro (#5458)
Abhilash Majumder [Mon, 12 Feb 2024 14:52:05 +0000 (20:22 +0530)]
ggml-sycl: Replace 3d ops with macro  (#5458)

* use macro

* use macro

* fix format

16 months ago llava : remove prog parameter from ArgumentParser (#5457)
Daniel Bevenius [Mon, 12 Feb 2024 08:38:44 +0000 (09:38 +0100)]
llava : remove prog parameter from ArgumentParser (#5457)

* llava: remove prog parameter from ArgumentParser

This commit removes the `prog` parameter from `ArgumentParser`
so that it uses the default value, which is the name of the script.

The motivation for this change is that currently the usage output looks
like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert_hf_to_gguf.py [-h] ...
```
And with this change it will look like this:
```console
$ python examples/llava/convert-image-encoder-to-gguf.py --help
usage: convert-image-encoder-to-gguf.py [-h] ...
```

Signed-off-by: Daniel Bevenius <redacted>
* ci: add W503 to flake8 ignore list

This commit adds W503 to the ignore list for flake8. This is done to
avoid the following error:
W503 line break before binary operator

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
16 months ago sync : ggml (#5452)
Georgi Gerganov [Mon, 12 Feb 2024 07:16:06 +0000 (09:16 +0200)]
sync : ggml (#5452)

* ggml-alloc : v3 (ggml/727)

* ggml-alloc v3

ggml-ci

* fix ci

ggml-ci

* whisper : check for backend buffer allocation failures

* whisper : avoid leaks when initialization fails

* cleanup

ggml-ci

* style fixes

ggml-ci

* sync : ggml

* update llama.cpp, clip.cpp, export-lora.cpp

* update finetune.cpp, train-text-from-scratch.cpp

ggml-ci

* ggml-backend : reduce alignment to 32 to match gguf and fix mmap

---------

Co-authored-by: slaren <redacted>
16 months ago CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)
Johannes Gäßler [Sun, 11 Feb 2024 18:08:39 +0000 (19:08 +0100)]
CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434)

* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
16 months ago Add support for BERT embedding models (#5423)
Douglas Hanley [Sun, 11 Feb 2024 16:21:38 +0000 (10:21 -0600)]
Add support for BERT embedding models (#5423)

* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models (illustrated in the sketch after this entry)
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
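
One of the bullets above adds a flag for non-causal attention. As a rough orientation, here is a minimal, self-contained sketch of what that flag changes (illustrative only; this is not llama.cpp's actual graph code and the function name is made up): a causal decoder masks future positions, while a BERT-style encoder attends over the whole sequence.

```cpp
#include <cstdio>
#include <vector>

// Build an additive attention mask: 0.0f where attention is allowed and a large
// negative value where it is not (the mask is added to the logits before softmax).
static std::vector<float> build_attn_mask(int n_tokens, bool causal) {
    std::vector<float> mask(n_tokens * n_tokens, 0.0f);
    for (int q = 0; q < n_tokens; ++q) {
        for (int k = 0; k < n_tokens; ++k) {
            if (causal && k > q) {
                mask[q * n_tokens + k] = -1e9f; // future position: masked out
            }
        }
    }
    return mask;
}

int main() {
    const auto causal_mask = build_attn_mask(4, /*causal=*/true);  // decoder-style
    const auto full_mask   = build_attn_mask(4, /*causal=*/false); // BERT-style encoder
    std::printf("q=0,k=3: causal=%g non-causal=%g\n", causal_mask[3], full_mask[3]);
    return 0;
}
```
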
16 months ago flake.lock: Update
github-actions[bot] [Sun, 11 Feb 2024 00:17:31 +0000 (00:17 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
  → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)

16 months ago vulkan: only use M-sized matmul on Apple GPUs (#5412)
Sergio López [Sun, 11 Feb 2024 14:12:00 +0000 (15:12 +0100)]
vulkan: only use M-sized matmul on Apple GPUs (#5412)

* vulkan: refactor guess_matmul_pipeline for vendor

Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
conditionals.

Signed-off-by: Sergio Lopez <redacted>
* vulkan: only use M-sized matmul on Apple GPUs

L-sized and S-sized matmuls are broken on Apple GPUs, so force the
M-sized variant for this vendor (see the sketch after this entry).

Signed-off-by: Sergio Lopez <redacted>
---------

Signed-off-by: Sergio Lopez <redacted>
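
The idea described above, as a hedged sketch (the function name, enum, size thresholds, and vendor constant are illustrative stand-ins, not the actual ggml_vk_guess_matmul_pipeline code): pick a matmul shader variant from the matrix shape, but force the M-sized variant when the device reports Apple's vendor ID.

```cpp
#include <cstdint>
#include <cstdio>

enum class matmul_size { S = 0, M = 1, L = 2 };

static matmul_size guess_matmul_pipeline(uint32_t vendor_id, int m, int n) {
    const uint32_t VENDOR_ID_APPLE = 0x106B; // Apple's PCI vendor ID (assumption)
    if (vendor_id == VENDOR_ID_APPLE) {
        return matmul_size::M; // S/L variants are reported broken on this vendor
    }
    // Illustrative thresholds only; the real heuristic differs.
    if (m <= 32 || n <= 32) return matmul_size::S;
    if (m <= 64 || n <= 64) return matmul_size::M;
    return matmul_size::L;
}

int main() {
    std::printf("%d\n", (int) guess_matmul_pipeline(0x106B, 512, 512)); // prints 1 (M)
    return 0;
}
```
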
16 months ago common : use enums for sampler types (#5418)
Alexey Parfenov [Sun, 11 Feb 2024 13:43:31 +0000 (13:43 +0000)]
common : use enums for sampler types (#5418)

* common: use enums for sampler types

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* minor : spaces

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago server : allow to specify tokens as strings in logit_bias (#5003)
Alexey Parfenov [Sun, 11 Feb 2024 13:38:14 +0000 (13:38 +0000)]
server : allow to specify tokens as strings in logit_bias (#5003)

* server: allow to specify tokens as strings in logit_bias

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago main : ctrl+C print timing in non-interactive mode (#3873)
Georgi Gerganov [Sun, 11 Feb 2024 13:35:50 +0000 (15:35 +0200)]
main : ctrl+C print timing in non-interactive mode (#3873)

16 months ago common : fix compile warning
Georgi Gerganov [Sun, 11 Feb 2024 13:33:43 +0000 (15:33 +0200)]
common : fix compile warning

16 months ago ggml : fix compile warnings (unused vars) (#4966)
Georgi Gerganov [Sun, 11 Feb 2024 13:33:01 +0000 (15:33 +0200)]
ggml : fix compile warnings (unused vars) (#4966)

16 months ago ggml : add mmla kernels for quantized GEMM (#4966)
snadampal [Sun, 11 Feb 2024 13:22:33 +0000 (07:22 -0600)]
ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above support MMLA instructions, which have higher
throughput than DOT. This commit adds an mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8".

On AWS Graviton3 processors this kernel resulted in up to a 1.5x
improvement in prompt evaluation throughput compared to the
default sdot kernel (see the sketch after this entry).

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
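
For orientation, here is a minimal, hedged sketch of the building block the commit message refers to (not the actual ggml kernel; dummy data, requires an aarch64 toolchain): `vmmlaq_s32` issues one SMMLA, accumulating a 2x2 tile of int32 results, each the dot product of eight int8 pairs, which is where the throughput gain over the per-lane dot-product path comes from.

```cpp
#include <arm_neon.h>
#include <cstdio>

int main() {
    const int8x16_t a   = vdupq_n_s8(1); // stand-in for packed int8 rows of A
    const int8x16_t b   = vdupq_n_s8(2); // stand-in for packed int8 rows of B
    int32x4_t       acc = vdupq_n_s32(0);
#if defined(__ARM_FEATURE_MATMUL_INT8)
    // One SMMLA: accumulates a 2x2 tile of 8-element int8 dot products.
    acc = vmmlaq_s32(acc, a, b);
#elif defined(__ARM_FEATURE_DOTPROD)
    // Fallback: per-lane 4-element dot products (the default sdot path).
    acc = vdotq_s32(acc, a, b);
#endif
    std::printf("acc[0] = %d\n", vgetq_lane_s32(acc, 0));
    return 0;
}
```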

16 months ago lookup: add print for drafting performance (#5450)
Johannes Gäßler [Sun, 11 Feb 2024 11:44:51 +0000 (12:44 +0100)]
lookup: add print for drafting performance (#5450)

16 months ago server : add llama2 chat template (#5425)
Xuan Son Nguyen [Sun, 11 Feb 2024 10:16:22 +0000 (11:16 +0100)]
server : add llama2 chat template (#5425)

* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
16 months ago metal : use autoreleasepool to avoid memory leaks (#5437)
Ian Bull [Sat, 10 Feb 2024 10:53:28 +0000 (02:53 -0800)]
metal : use autoreleasepool to avoid memory leaks (#5437)

There appears to be a known memory leak when using
`MTLCommandBuffer`. It is suggested in [1,2] to use `@autoreleasepool`.

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931

This change-set wraps `ggml_metal_graph_compute` in an
`@autoreleasepool` block.

This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436

16 months ago scripts : update sync scripts with new backends
Georgi Gerganov [Sat, 10 Feb 2024 07:53:05 +0000 (09:53 +0200)]
scripts : update sync scripts with new backends

16 months ago sync : ggml
Georgi Gerganov [Sat, 10 Feb 2024 07:30:36 +0000 (09:30 +0200)]
sync : ggml

16 months ago ggml : add abort_callback for cpu backend (ggml/725)
Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]
ggml : add abort_callback for cpu backend (ggml/725)

* a way to use abort_callback with the cpu backend

* whisper update

16 months ago vulkan: Set limit for task concurrency (#5427)
Neuman Vong [Fri, 9 Feb 2024 18:30:19 +0000 (05:30 +1100)]
vulkan: Set limit for task concurrency (#5427)

A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with "Too many open files".

    $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
    ggml_vulkan: Generating and compiling shaders to SPIR-V
    Traceback (most recent call last):
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
        asyncio.run(main())
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
        await asyncio.gather(*tasks)
    [...snip...]
    OSError: [Errno 24] Too many open files

This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.

16 months ago llava : add requirements.txt and update README.md (#5428)
Daniel Bevenius [Fri, 9 Feb 2024 13:00:59 +0000 (14:00 +0100)]
llava : add requirements.txt and update README.md (#5428)

* llava: add requirements.txt and update README.md

This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.

The motivation for this is to make it easier for users to run the scripts in
`examples/llava`, and to avoid users running into missing-package issues when
the required packages are not installed on their system.

Signed-off-by: Daniel Bevenius <redacted>
* llava: fix typo in llava-surgery.py output

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
16 months ago server : fix prompt caching for repeated prompts (#5420)
Riley Stewart [Fri, 9 Feb 2024 10:49:49 +0000 (02:49 -0800)]
server : fix prompt caching for repeated prompts (#5420)

16 months ago llama : do not cap thread count when MoE on CPU (#5419)
Paul Tsochantaris [Fri, 9 Feb 2024 10:48:06 +0000 (10:48 +0000)]
llama : do not cap thread count when MoE on CPU (#5419)

* Not capping thread count when MoE inference is running on CPU

* Whitespace

16 months ago readme : add JavaScript/Wasm repo (#5415)
Marko Tasic [Fri, 9 Feb 2024 10:17:00 +0000 (11:17 +0100)]
readme : add JavaScript/Wasm repo (#5415)

16 months ago ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)
Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]
ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)

16 months ago Fix Vulkan crash on APUs with very little device memory (#5424)
0cc4m [Fri, 9 Feb 2024 05:52:33 +0000 (06:52 +0100)]
Fix Vulkan crash on APUs with very little device memory (#5424)

* Fix Vulkan crash on APUs with very little device memory

* Fix debug output function names

16 months ago CUDA: more warps for mmvq on NVIDIA (#5394)
Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]
CUDA: more warps for mmvq on NVIDIA (#5394)

16 months ago llama : do not print "offloading layers" message in CPU-only builds (#5416)
slaren [Thu, 8 Feb 2024 20:33:03 +0000 (21:33 +0100)]
llama : do not print "offloading layers" message in CPU-only builds (#5416)

16 months ago Fix f16_sycl cpy call from Arc (#5411)
Abhilash Majumder [Thu, 8 Feb 2024 17:09:10 +0000 (22:39 +0530)]
Fix f16_sycl cpy call from Arc (#5411)

* fix f16_sycl cpy call

* rm old logic

* add fp16 build CI

* use macro

* format fix

16 months ago llava : add missing .py, and fix paths in README.md (#5414)
Daniel Bevenius [Thu, 8 Feb 2024 14:20:03 +0000 (15:20 +0100)]
llava : add missing .py, and fix paths in README.md (#5414)

This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.

Signed-off-by: Daniel Bevenius <redacted>
16 months ago fix trailing whitespace (#5407)
Johannes Gäßler [Thu, 8 Feb 2024 10:36:54 +0000 (11:36 +0100)]
fix trailing whitespace (#5407)

16 months ago llama : fix MiniCPM (#5392)
runfuture [Thu, 8 Feb 2024 10:36:19 +0000 (18:36 +0800)]
llama : fix MiniCPM (#5392)

* fix bug for norm_rms_eps missing

* to align with the same order as convert.py for model write

* fix: undo HF models permute tensor

* update for flake8 lint

16 months ago llava: fix typo/formatting in README.md (#5405)
Daniel Bevenius [Thu, 8 Feb 2024 08:58:19 +0000 (09:58 +0100)]
llava: fix typo/formatting in README.md (#5405)

This commit fixes a typo in the README.md file for the llava example
which is causing the formatting to look a little off:

Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally

Signed-off-by: Daniel Bevenius <redacted>
16 months ago sampling: fix top_k <= 0 (#5388)
Johannes Gäßler [Thu, 8 Feb 2024 08:46:30 +0000 (09:46 +0100)]
sampling: fix top_k <= 0 (#5388)

* sampling: fix top_k <= 0

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago tests : .gitignore obj files
Georgi Gerganov [Thu, 8 Feb 2024 07:46:47 +0000 (09:46 +0200)]
tests : .gitignore obj files

16 months ago CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)
Michael Podvitskiy [Wed, 7 Feb 2024 21:39:23 +0000 (22:39 +0100)]
CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)

Co-authored-by: Jared Van Bortel <redacted>
16 months ago fix typo in readme (#5399)
Ebey Abraham [Wed, 7 Feb 2024 21:11:30 +0000 (21:11 +0000)]
fix typo in readme (#5399)

Co-authored-by: Ebey Abraham <redacted>
16 months ago Add Ava in the list of llama.cpp UIs (#4362)
Kamil Tomšík [Wed, 7 Feb 2024 18:44:52 +0000 (19:44 +0100)]
Add Ava in the list of llama.cpp UIs (#4362)

16 months ago CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)
Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)

16 months ago [SYCL] update install make by w64devkit (#5297)
Neo Zhang Jianyu [Wed, 7 Feb 2024 10:16:55 +0000 (18:16 +0800)]
[SYCL] update install make by w64devkit (#5297)

16 months ago llava-cli : always tokenize special tokens (#5382)
Xiao-Yong Jin [Wed, 7 Feb 2024 08:17:25 +0000 (02:17 -0600)]
llava-cli : always tokenize special tokens (#5382)

* llava-cli: tokenize special tokens in prompt

* llava-cli: use the escape CLI argument, remove incomplete separate escaping process

16 months ago Basic Vulkan Multi-GPU implementation (#5321)
0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]
Basic Vulkan Multi-GPU implementation (#5321)

* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <redacted>
16 months ago readme : modernize (#5379)
Eve [Wed, 7 Feb 2024 06:21:30 +0000 (06:21 +0000)]
readme : modernize (#5379)

* first cleanup, update everything to Llama 2 and remove outdated content

* Delete SHA256SUMS

* make build instructions generic

* recommend Q4_K_M quantization method

* Update README.md

16 months ago readme : update ui list (#5354)
Ben Williams [Wed, 7 Feb 2024 06:16:48 +0000 (22:16 -0800)]
readme : update ui list (#5354)

16 months ago llama : add MiniCPM support (#5346)
runfuture [Wed, 7 Feb 2024 06:15:56 +0000 (14:15 +0800)]
llama : add MiniCPM support (#5346)

* support minicpm arch.

* fix tab/space typo.

* convert minicpm model via convert-hf-gguf.py

* try to make tokenizer work

* fix bug for quantize minicpm

* fix for flake8 lint

* remove convert-minicpm.py

* fix for editorconfig

* correct minicpm model type (size)

* constants expanded for minicpm

* Minor change of the constant names for minicpm

16 months ago server : update `/props` with "total_slots" value (#5373)
Justin Parker [Wed, 7 Feb 2024 06:15:19 +0000 (01:15 -0500)]
server : update `/props` with "total_slots" value (#5373)

* include total "num_slots" in default_generation_settings_for_props

* cleanup total_slots return value in /props endpoint

* update /props endpoint docs with total_slots

* remove num_slots from default_generation_settings_for_props

* update /props endpoint section

16 months ago convert : fix TypeError on GPT-2 vocab.json (#5288)
Sang-Kil Park [Wed, 7 Feb 2024 04:28:00 +0000 (13:28 +0900)]
convert : fix TypeError on GPT-2 vocab.json (#5288)

16 months ago server : remove model.json endpoint (#5371)
Alexey Parfenov [Tue, 6 Feb 2024 18:08:38 +0000 (18:08 +0000)]
server : remove model.json endpoint (#5371)

16 months ago CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)
Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)

16 months ago Update README.md (#5366)
Kawrakow [Tue, 6 Feb 2024 17:00:16 +0000 (19:00 +0200)]
Update README.md (#5366)

Add some links to quantization related PRs

16 months ago Slight quantization improvement for Q4_K and Q5_K (#5361)
Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]
Slight quantization improvement for Q4_K and Q5_K (#5361)

* Q4_K: slightly better quantization

* Q5_K: slightly better quantization

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)
BarfingLemurs [Tue, 6 Feb 2024 14:06:48 +0000 (09:06 -0500)]
readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)

16 months ago CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)
Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]
CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)

16 months ago server : include total "num_slots" in props endpoint (#5349)
Justin Parker [Tue, 6 Feb 2024 09:20:59 +0000 (04:20 -0500)]
server : include total "num_slots" in props endpoint (#5349)

16 months ago server : add `dynatemp_range` and `dynatemp_exponent` (#5352)
Michael Coppola [Tue, 6 Feb 2024 09:20:00 +0000 (04:20 -0500)]
server : add `dynatemp_range` and `dynatemp_exponent` (#5352)

* server: added `dynatemp_range` and `dynatemp_exponent`

* Update README.md

---------

Co-authored-by: Michael Coppola <redacted>
16 months ago server : various fixes for the prompt field in /completion (#5300)
Niall Coates [Tue, 6 Feb 2024 08:16:23 +0000 (08:16 +0000)]
server : various fixes for the prompt field in /completion (#5300)

server : fix deadlock when prompt array contains strings and numbers

server : removed an unnecessary generation when generating multi-prompts

server : removed an unnecessary assert

16 months ago py : handle byte tokens in `get_token_type` (#5341)
Georgi Gerganov [Tue, 6 Feb 2024 05:47:22 +0000 (07:47 +0200)]
py : handle byte tokens in `get_token_type` (#5341)

* py : handle byte tokens in `get_token_type`

* py : fix empty bytes arg

16 months ago make: Use ccache for faster compilation (#5318)
Johannes Gäßler [Mon, 5 Feb 2024 18:33:00 +0000 (19:33 +0100)]
make: Use ccache for faster compilation (#5318)

* make: Use ccache for faster compilation

16 months ago README: updated introduction (#5343)
Johannes Gäßler [Mon, 5 Feb 2024 14:55:10 +0000 (15:55 +0100)]
README: updated introduction (#5343)

* README: updated introduction

* readme : update

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago ggml : make use of ggml-quants.h possible in C++ code (#5338)
Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]
ggml : make use of ggml-quants.h possible in C++ code (#5338)

* Make use of ggml-quants.h possible in C++ code

* One cannot possibly be defining static_assert in a C++ compilation

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago ggml : avoid duplicating function calls using MIN/MAX macros (#5325)
Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]
ggml : avoid duplicating function calls using MIN/MAX macros (#5325)

* Avoid duplicating function calls when using MIN/MAX macros.

Since these macros copy "a" and "b", they ask the compiler to evaluate one of them twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls simply happen twice.
By explicitly evaluating the expression once beforehand we get smaller and faster code without duplicate calls (see the sketch after this entry). See ggml_rope_yarn_corr_dims in Compiler Explorer:

https://godbolt.org/z/Ee4KMrvKh

Code behaves exactly the same.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <redacted>
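
To make the pitfall concrete, here is a small self-contained sketch (not ggml's code; `expensive()` is a stand-in) showing the double evaluation and the fix of evaluating the expression into a local first:

```cpp
#include <cstdio>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int expensive(void) {
    std::puts("expensive() called");
    return 3;
}

int main() {
    // After expansion this becomes ((expensive()) < (5) ? (expensive()) : (5)),
    // so expensive() runs twice whenever it wins the comparison.
    const int bad = MIN(expensive(), 5);

    // Evaluate once into a local, then take the minimum: same result, one call.
    const int tmp  = expensive();
    const int good = MIN(tmp, 5);

    std::printf("bad = %d, good = %d\n", bad, good);
    return 0;
}
```
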
16 months ago iq3_xxs: guards for the no-imatrix situation (#5334)
Kawrakow [Mon, 5 Feb 2024 10:32:27 +0000 (12:32 +0200)]
iq3_xxs: guards for the no-imatrix situation (#5334)

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago py : fix internlm2-hf convert to gguf (#5305)
Guoteng [Mon, 5 Feb 2024 09:04:06 +0000 (17:04 +0800)]
py : fix internlm2-hf convert to gguf (#5305)

* py : fix internlm2-hf convert to gguf

* ggml-ci

16 months ago iq2_xxs: tune quantization (#5320)
Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]
iq2_xxs: tune quantization (#5320)

We get slightly better PPL, and we cut quantization time nearly in half.

The trick is to first quantize without forcing points onto the E8 lattice.
We can then use a narrower search range around the block scale obtained
that way.

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago server : allow to get default generation settings for completion (#5307)
Alexey Parfenov [Mon, 5 Feb 2024 08:10:22 +0000 (08:10 +0000)]
server : allow to get default generation settings for completion (#5307)

16 months ago common : add dynamic temperature parameters to main example cli (#5295)
l3utterfly [Mon, 5 Feb 2024 08:00:47 +0000 (17:00 +0900)]
common : add dynamic temperature parameters to main example cli (#5295)

* added dynamic temp params in main

* added help text

16 months ago scripts : fix typos, cleanup (#5303)
Georgi Gerganov [Mon, 5 Feb 2024 07:48:03 +0000 (09:48 +0200)]
scripts : fix typos, cleanup (#5303)

16 months ago scripts : add non-interactive server-llm.sh (#5303)
Нияз Гарифзянов [Mon, 5 Feb 2024 07:43:57 +0000 (10:43 +0300)]
scripts : add non-interactive server-llm.sh (#5303)

* Update server-llm.sh

Add flag --non-interactive that allows running the script without asking for permission

* Update scripts/server-llm.sh

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months ago readme : add CodeShell models to the supported models list (#5330)
chiranko [Mon, 5 Feb 2024 07:41:38 +0000 (15:41 +0800)]
readme : add CodeShell models to the supported models list (#5330)

16 months ago [SYCL] Fix cpy with dims of 3 (#5289)
AidanBeltonS [Mon, 5 Feb 2024 07:08:24 +0000 (07:08 +0000)]
[SYCL] Fix cpy with dims of 3 (#5289)

* Fix cpy with dims of 3

* rm asserts

---------

Co-authored-by: Abhilash Majumder <redacted>
16 months ago flake.lock: Update
github-actions[bot] [Sun, 4 Feb 2024 00:17:24 +0000 (00:17 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)

16 months ago Adding some imatrix tools (#5302)
Kawrakow [Sun, 4 Feb 2024 08:39:58 +0000 (10:39 +0200)]
Adding some imatrix tools (#5302)

* imatrix: adding --combine and --continue-from

* imatrix: be able to start from a specific chunk

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months ago cmake : use set() for LLAMA_WIN_VER (#5298)
Welby Seely [Sun, 4 Feb 2024 04:18:51 +0000 (23:18 -0500)]
cmake : use set() for LLAMA_WIN_VER (#5298)

option() is specifically for booleans.

Fixes #5158

17 months ago make: add nvcc info print (#5310)
Johannes Gäßler [Sat, 3 Feb 2024 19:15:13 +0000 (20:15 +0100)]
make: add nvcc info print (#5310)

17 months ago make: fix nvcc optimization flags for host code (#5309)
Johannes Gäßler [Sat, 3 Feb 2024 19:14:59 +0000 (20:14 +0100)]
make: fix nvcc optimization flags for host code (#5309)

17 months ago add Vulkan support to Nix flake
Martin Schwaighofer [Sun, 28 Jan 2024 11:59:43 +0000 (12:59 +0100)]
add Vulkan support to Nix flake

17 months ago Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)
0cc4m [Sat, 3 Feb 2024 17:15:00 +0000 (18:15 +0100)]
Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)

* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt

17 months ago refactor : switch to emplace_back to avoid extra object (#5291)
Michael Klimenko [Sat, 3 Feb 2024 11:23:37 +0000 (12:23 +0100)]
refactor : switch to emplace_back to avoid extra object (#5291)

17 months ago YaRN : store rope scaling type as int32_t in memory (#5285)
Jared Van Bortel [Sat, 3 Feb 2024 11:22:06 +0000 (06:22 -0500)]
YaRN : store rope scaling type as int32_t in memory (#5285)

* YaRN : store rope scaling type as int32_t in memory

* llama : store mapped names as const char *

17 months ago readme : add tenere in the ui tools list (#5284)
BADR [Sat, 3 Feb 2024 11:20:26 +0000 (12:20 +0100)]
readme : add tenere in the ui tools list (#5284)

17 months ago Fix im2col with 32fp (#5286)
AidanBeltonS [Sat, 3 Feb 2024 08:11:37 +0000 (08:11 +0000)]
Fix im2col with 32fp (#5286)

17 months ago perplexity : fix KL divergence calculations on Windows (#5273)
kalomaze [Fri, 2 Feb 2024 14:15:30 +0000 (08:15 -0600)]
perplexity : fix KL divergence calculations on Windows (#5273)

17 months ago scripts : parse wtype in server-llm.sh (#5167)
Georgi Gerganov [Fri, 2 Feb 2024 12:23:40 +0000 (14:23 +0200)]
scripts : parse wtype in server-llm.sh (#5167)

* scripts : parse wtype in server-llm.sh

* scripts : fix check for wfile

17 months ago py : add check for '.attn.masked_bias' layers to GPT2model (#5281)
Mirror Azure [Fri, 2 Feb 2024 11:39:09 +0000 (14:39 +0300)]
py : add check for '.attn.masked_bias' layers to GPT2model (#5281)

17 months ago Tidy ggml-sycl (#5261)
AidanBeltonS [Fri, 2 Feb 2024 08:39:48 +0000 (08:39 +0000)]
Tidy ggml-sycl (#5261)

* Tidy some code in ggml-sycl

* Remove blank space

* Remove std::printf comments

---------

Co-authored-by: Abhilash Majumder <redacted>
17 months ago docker : add build for SYCL, Vulkan + update readme (#5228)
Xuan Son Nguyen [Fri, 2 Feb 2024 07:56:31 +0000 (08:56 +0100)]
docker : add build for SYCL, Vulkan + update readme (#5228)

* add vulkan dockerfile

* intel dockerfile: compile sycl by default

* fix vulkan dockerfile

* add docs for vulkan

* docs: sycl build in docker

* docs: remove trailing spaces

* docs: sycl: add docker section

* docs: clarify install vulkan SDK outside docker

* sycl: use intel/oneapi-basekit docker image

* docs: correct TOC

* docs: correct docker image for Intel oneMKL

17 months ago [SYCL] get MAX_MEM_ALLOC from device property (#5270)
Meng, Hengyu [Fri, 2 Feb 2024 07:54:14 +0000 (15:54 +0800)]
[SYCL] get MAX_MEM_ALLOC from device property (#5270)

* get max alloc size from device prop

* fix macro typo

17 months ago [SYCL] update guide of SYCL backend (#5254)
Neo Zhang Jianyu [Fri, 2 Feb 2024 07:53:27 +0000 (15:53 +0800)]
[SYCL] update guide of SYCL backend (#5254)

* update guide for make installation, memory, gguf model link,  rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues

17 months ago llama : fix memory leak in llama_batch_free (#5252)
Ian Bull [Fri, 2 Feb 2024 07:20:13 +0000 (23:20 -0800)]
llama : fix memory leak in llama_batch_free (#5252)

llama_batch_init allocates memory for a fixed number of tokens.
However, llama_batch_free only frees memory for the number of
tokens that were added to the batch.

This change-set uses a null-terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also renames the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
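
A minimal sketch of the freeing pattern described above (the struct and function names here are illustrative stand-ins, not the actual llama.cpp definitions): the per-token seq_id array is allocated with one extra slot left as a nullptr sentinel, so the free routine can walk it without knowing how many tokens were actually used.

```cpp
#include <cstdint>
#include <cstdlib>

struct toy_batch {
    int32_t ** seq_id; // n_tokens_alloc + 1 entries; the last one stays nullptr
};

static toy_batch toy_batch_init(int n_tokens_alloc) {
    toy_batch batch{};
    // calloc zero-initializes, so the extra slot is already the nullptr sentinel.
    batch.seq_id = (int32_t **) calloc(n_tokens_alloc + 1, sizeof(int32_t *));
    for (int i = 0; i < n_tokens_alloc; ++i) {
        batch.seq_id[i] = (int32_t *) calloc(1, sizeof(int32_t));
    }
    return batch;
}

static void toy_batch_free(toy_batch & batch) {
    if (batch.seq_id != nullptr) {
        // Walk until the sentinel instead of trusting a caller-supplied count.
        for (int i = 0; batch.seq_id[i] != nullptr; ++i) {
            free(batch.seq_id[i]);
        }
        free(batch.seq_id);
        batch.seq_id = nullptr;
    }
}

int main() {
    toy_batch batch = toy_batch_init(8);
    toy_batch_free(batch); // frees all 8 per-token entries plus the array itself
    return 0;
}
```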

17 months ago add --no-mmap in llama-bench (#5257)
Neo Zhang Jianyu [Thu, 1 Feb 2024 19:48:53 +0000 (03:48 +0800)]
add --no-mmap in llama-bench (#5257)

* add --no-mmap, show sycl backend

* fix conflict

* fix code format, change print for --no-mmap

* ren no_mmap to mmap, show mmap when not default value in printer

* update guide for mmap

* mv position to reduce model reload

17 months ago Vulkan Phi Fix for AMD Proprietary Drivers (#5260)
0cc4m [Thu, 1 Feb 2024 18:25:24 +0000 (19:25 +0100)]
Vulkan Phi Fix for AMD Proprietary Drivers (#5260)

* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver

* Fix another Vulkan CPY buffer size bug

17 months ago cuda : fix LLAMA_CUDA_F16 (#5262)
slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]
cuda : fix LLAMA_CUDA_F16 (#5262)

17 months ago make : generate .a library for static linking (#5205)
Ali Nehzat [Thu, 1 Feb 2024 15:18:53 +0000 (02:18 +1100)]
make : generate .a library for static linking (#5205)

17 months ago llama : support InternLM2 (#5184)
Guoteng [Thu, 1 Feb 2024 09:19:51 +0000 (17:19 +0800)]
llama : support InternLM2 (#5184)

* support InternLM2 inference
  * add add_space_prefix KV pair

17 months ago Fix broken Vulkan Cmake (properly) (#5230)
Eve [Wed, 31 Jan 2024 19:21:55 +0000 (19:21 +0000)]
Fix broken Vulkan Cmake (properly) (#5230)

* build vulkan as object

* vulkan ci

17 months ago llama : reorder build_orion() at correct place (#5118)
Georgi Gerganov [Wed, 31 Jan 2024 16:47:10 +0000 (18:47 +0200)]
llama : reorder build_orion() at correct place (#5118)

17 months ago llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)
Georgi Gerganov [Wed, 31 Jan 2024 15:30:17 +0000 (17:30 +0200)]
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)

* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <redacted>
* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <redacted>