git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
8 months ago  Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
agray3 [Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)]
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
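
To make the optimization concrete, here is a minimal CUDA sketch of scalar
versus vectorized f16 loads. The kernel names are invented for illustration;
the real change lives in ggml/src/ggml-cuda/dmmv.cu:

    #include <cuda_fp16.h>

    // Scalar version: two 16-bit loads per loop step.
    __global__ void dot_f16_scalar(const half * x, const half * y, float * out, int n) {
        float sum = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            sum += __half2float(x[i]) * __half2float(y[i]);
        }
        atomicAdd(out, sum);
    }

    // Vectorized version: reinterpret the pointers as half2 so each load
    // fetches 32 bits at once (assumes n is even and the pointers are aligned).
    __global__ void dot_f16_vec(const half * x, const half * y, float * out, int n) {
        const half2 * x2 = (const half2 *) x;
        const half2 * y2 = (const half2 *) y;
        float sum = 0.0f;
        for (int i = threadIdx.x; i < n/2; i += blockDim.x) {
            const half2 a = x2[i];
            const half2 b = y2[i];
            sum += __half2float(a.x) * __half2float(b.x)
                 + __half2float(a.y) * __half2float(b.y);
        }
        atomicAdd(out, sum);
    }

Fewer, wider loads mean fewer memory transactions, which is why the win is
largest on high-bandwidth HBM parts.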

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
8 months ago  server : accept extra_context for the infill endpoint (#9874)
Georgi Gerganov [Sun, 13 Oct 2024 18:31:35 +0000 (21:31 +0300)]
server : accept extra_context for the infill endpoint (#9874)

* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci

8 months ago  server : reuse cached context chunks (#9866)
Georgi Gerganov [Sun, 13 Oct 2024 15:52:48 +0000 (18:52 +0300)]
server : reuse cached context chunks (#9866)

ggml-ci

8 months ago  flake.lock: Update (#9870)
Georgi Gerganov [Sun, 13 Oct 2024 03:11:26 +0000 (06:11 +0300)]
flake.lock: Update (#9870)

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6++BNjsb149fGZd1T4+KBg=' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg+XZeHgxW5hQA9fIKHsKCdOIUycTryeVw=' (2024-10-09)

Co-authored-by: github-actions[bot] <redacted>
8 months ago  server : add option to time limit the generation phase (#9865)
Georgi Gerganov [Sat, 12 Oct 2024 13:14:27 +0000 (16:14 +0300)]
server : add option to time limit the generation phase (#9865)

ggml-ci

8 months ago  server : remove self-extend features (#9860)
Georgi Gerganov [Sat, 12 Oct 2024 13:06:31 +0000 (16:06 +0300)]
server : remove self-extend features (#9860)

* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci

8 months ago  server : remove legacy system_prompt feature (#9857)
Georgi Gerganov [Sat, 12 Oct 2024 11:51:54 +0000 (14:51 +0300)]
server : remove legacy system_prompt feature (#9857)

* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props

8 months ago  llama : improve infill support and special token detection (#9798)
Georgi Gerganov [Sat, 12 Oct 2024 05:21:51 +0000 (08:21 +0300)]
llama : improve infill support and special token detection (#9798)

* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
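
The FIM special tokens exposed by this work follow the usual
prefix-suffix-middle layout. A hedged C++ sketch of how a client might
assemble an infill prompt with the llama_token_fim_* accessors (check
llama.h in your checkout for the exact signatures):

    #include <vector>
    #include "llama.h"

    // Build <fim_prefix> prefix <fim_suffix> suffix <fim_middle>; the model
    // then generates the "middle" that connects the two.
    static std::vector<llama_token> build_fim_prompt(
            const llama_model * model,
            const std::vector<llama_token> & prefix,
            const std::vector<llama_token> & suffix) {
        std::vector<llama_token> out;
        out.push_back(llama_token_fim_pre(model));
        out.insert(out.end(), prefix.begin(), prefix.end());
        out.push_back(llama_token_fim_suf(model));
        out.insert(out.end(), suffix.begin(), suffix.end());
        out.push_back(llama_token_fim_mid(model));
        return out;
    }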

8 months ago  musa : update doc (#9856)
R0CKSTAR [Sat, 12 Oct 2024 05:09:53 +0000 (13:09 +0800)]
musa : update doc (#9856)

Signed-off-by: Xiaodong Ye <redacted>
8 months ago  ggml : move more prints to the ggml log system (#9839)
Diego Devesa [Fri, 11 Oct 2024 13:34:45 +0000 (15:34 +0200)]
ggml : move more prints to the ggml log system (#9839)

* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print

8 months ago  common : use common_ prefix for common library functions (#9805)
Diego Devesa [Thu, 10 Oct 2024 20:57:42 +0000 (22:57 +0200)]
common : use common_ prefix for common library functions (#9805)

* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <redacted>
8 months ago  rpc : add backend registry / device interfaces (#9812)
Diego Devesa [Thu, 10 Oct 2024 18:14:55 +0000 (20:14 +0200)]
rpc : add backend registry / device interfaces (#9812)

* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API (see the usage sketch after this list)

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
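
A hedged sketch of the llama_supports_rpc addition mentioned above; it is a
plain capability probe, mirroring llama_supports_mmap and friends:

    #include <cstdio>
    #include "llama.h"

    int main() {
        // true only when libllama was built with the RPC backend compiled in
        if (llama_supports_rpc()) {
            printf("RPC backend available\n");
        } else {
            printf("built without RPC support\n");
        }
        return 0;
    }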

8 months ago  musa: add docker image support (#9685)
R0CKSTAR [Thu, 10 Oct 2024 18:10:37 +0000 (02:10 +0800)]
musa: add docker image support (#9685)

* mtgpu: add docker image support

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: enable docker workflow

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
8 months ago  examples : do not use common library in simple example (#9803)
Diego Devesa [Thu, 10 Oct 2024 17:50:49 +0000 (19:50 +0200)]
examples : do not use common library in simple example (#9803)

* examples : do not use common library in simple example

* add command line parser, simplify code

8 months ago  cmake : do not build common library by default when standalone (#9804)
Diego Devesa [Wed, 9 Oct 2024 16:49:52 +0000 (18:49 +0200)]
cmake : do not build common library by default when standalone (#9804)

8 months ago  perplexity : fix integer overflow (#9783)
Georgi Gerganov [Wed, 9 Oct 2024 14:00:18 +0000 (17:00 +0300)]
perplexity : fix integer overflow (#9783)

* perplexity : fix integer overflow

ggml-ci

* perplexity : keep n_vocab as int and make appropriate casts

ggml-ci
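
The overflow pattern being fixed is the classic one where an int-by-int
index product wraps before it is widened. A small illustrative C++ example
(not the actual perplexity.cpp diff):

    #include <cstddef>

    float logit_at(const float * logits, int i_token, int n_vocab, int j) {
        // i_token * n_vocab + j evaluated in 32-bit int can overflow for
        // long runs; casting one operand first keeps the math in size_t
        const size_t idx = (size_t) i_token * n_vocab + j;
        return logits[idx];
    }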

8 months ago  examples : remove llama.vim
Georgi Gerganov [Wed, 9 Oct 2024 07:55:42 +0000 (10:55 +0300)]
examples : remove llama.vim

An updated version will be added in #9787

8 months ago  ggml : fix BLAS with unsupported types (#9775)
Diego Devesa [Tue, 8 Oct 2024 12:21:43 +0000 (14:21 +0200)]
ggml : fix BLAS with unsupported types (#9775)

* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
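
A hedged sketch of the gating this enables: the renamed accessor returns a
pointer to the traits table instead of copying it, and a backend can reject
types that lack a to_float conversion (names per ggml.h around this change;
verify against your checkout):

    #include <stddef.h>
    #include "ggml.h"

    static bool blas_supports_type(enum ggml_type type) {
        const struct ggml_type_traits * traits = ggml_get_type_traits(type);
        // BLAS operates on floats, so a dequantize-to-float path must exist
        return traits->to_float != NULL;
    }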

8 months ago  server : better security control for public deployments (#9776)
Xuan Son Nguyen [Tue, 8 Oct 2024 11:27:04 +0000 (13:27 +0200)]
server : better security control for public deployments (#9776)

* server : more explicit endpoint access settings

* protect /props endpoint

* fix tests

* update server docs

* fix typo

* fix tests

8 months ago  scripts : fix spelling typo in messages and comments (#9782)
standby24x7 [Tue, 8 Oct 2024 06:19:53 +0000 (15:19 +0900)]
scripts : fix spelling typo in messages and comments (#9782)

Signed-off-by: Masanari Iida <redacted>
8 months ago  ggml : add backend registry / device interfaces to BLAS backend (#9752)
Diego Devesa [Mon, 7 Oct 2024 19:55:08 +0000 (21:55 +0200)]
ggml : add backend registry / device interfaces to BLAS backend (#9752)

* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers

8 months ago  Update building for Android (#9672)
Andrew Minh Nguyen [Mon, 7 Oct 2024 16:37:31 +0000 (09:37 -0700)]
Update building for Android (#9672)

* docs : clarify building Android on Termux

* docs : update building Android on Termux

* docs : add cross-compiling for Android

* cmake : link dl explicitly for Android

8 months ago  flake.lock: Update (#9753)
Georgi Gerganov [Mon, 7 Oct 2024 16:35:42 +0000 (19:35 +0300)]
flake.lock: Update (#9753)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/bcef6817a8b2aa20a5a6dbb19b43e63c5bf8619a?narHash=sha256-HO4zgY0ekfwO5bX0QH/3kJ/h4KvUDFZg8YpkNwIbg1U=' (2024-09-12)
  → 'github:hercules-ci/flake-parts/3d04084d54bedc3d6b8b736c70ef449225c361b1?narHash=sha256-K5ZLCyfO/Zj9mPFldf3iwS6oZStJcU4tSpiXTMYaaL0=' (2024-10-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm+h/xjl4G0c0XlP6a74=' (2024-09-01)
  → 'https://github.com/NixOS/nixpkgs/archive/fb192fec7cc7a4c26d51779e9bab07ce6fa5597a.tar.gz?narHash=sha256-0xHYkMkeLVQAMa7gvkddbPqpxph+hDzdu1XdGPJR+Os=' (2024-10-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J+PeFKSDV+pHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI=' (2024-09-26)
  → 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6++BNjsb149fGZd1T4+KBg=' (2024-10-04)

Co-authored-by: github-actions[bot] <redacted>
8 months ago  ggml : add metal backend registry / device (#9713)
Georgi Gerganov [Mon, 7 Oct 2024 15:27:51 +0000 (18:27 +0300)]
ggml : add metal backend registry / device (#9713)

* ggml : add metal backend registry / device

ggml-ci

* metal : fix names [no ci]

* metal : global registry and device instances

ggml-ci

* cont : alternative initialization of global objects

ggml-ci

* llama : adapt to backend changes

ggml-ci

* fixes

* metal : fix indent

* metal : fix build when MTLGPUFamilyApple3 is not available

ggml-ci

* fix merge

* metal : avoid unnecessary singleton accesses

ggml-ci

* metal : minor fix [no ci]

* metal : g_state -> g_ggml_ctx_dev_main [no ci]

* metal : avoid reference of device context in the backend context

ggml-ci

* metal : minor [no ci]

* metal : fix maxTransferRate check

* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <redacted>
8 months ago  metal : single allocation of encode_async block (#9747)
Paul Tsochantaris [Mon, 7 Oct 2024 12:26:31 +0000 (13:26 +0100)]
metal : single allocation of encode_async block (#9747)

* Single allocation of encode_async block with non-ARC capture in ggml-metal.m

* Moving Block_release to the deallocation code

* Release encode block when re-setting encoding buffer count if needed

* Update ggml/src/ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <redacted>
8 months ago  contrib : simplify + minor edits [no ci]
Georgi Gerganov [Sun, 6 Oct 2024 11:15:27 +0000 (14:15 +0300)]
contrib : simplify + minor edits [no ci]

8 months ago  readme : fix typo [no ci]
Georgi Gerganov [Sun, 6 Oct 2024 10:49:41 +0000 (13:49 +0300)]
readme : fix typo [no ci]

8 months ago  sync : llama.cpp
Georgi Gerganov [Sun, 6 Oct 2024 09:53:28 +0000 (12:53 +0300)]
sync : llama.cpp

8 months ago  vulkan : retry allocation with fallback flags (whisper/2451)
SRHMorris [Sun, 6 Oct 2024 07:34:20 +0000 (08:34 +0100)]
vulkan : retry allocation with fallback flags (whisper/2451)

Co-authored-by: Samuel Morris <redacted>
8 months ago  rerank : use [SEP] token instead of [BOS] (#9737)
Georgi Gerganov [Sat, 5 Oct 2024 12:55:04 +0000 (15:55 +0300)]
rerank : use [SEP] token instead of [BOS] (#9737)

* rerank : use [SEP] token instead of [BOS]

ggml-ci

* common : sanity check for non-NULL tokens

ggml-ci

* ci : adjust rank score interval

ggml-ci

* ci : add shebang to run.sh

ggml-ci

8 months ago  sync : ggml
Georgi Gerganov [Sat, 5 Oct 2024 12:53:49 +0000 (15:53 +0300)]
sync : ggml

8 months ago  metal : zero-init buffer contexts (whisper/0)
Georgi Gerganov [Sat, 5 Oct 2024 11:33:54 +0000 (14:33 +0300)]
metal : zero-init buffer contexts (whisper/0)

8 months ago  Add Llama Assistant (#9744)
Viet-Anh NGUYEN (Andrew) [Fri, 4 Oct 2024 18:29:35 +0000 (01:29 +0700)]
Add Llama Assistant (#9744)

8 months ago  sync : ggml
Georgi Gerganov [Fri, 4 Oct 2024 15:50:25 +0000 (18:50 +0300)]
sync : ggml

8 months ago  ggml : fix typo in example usage ggml_gallocr_new (ggml/984)
Daniel Bevenius [Fri, 4 Oct 2024 13:46:18 +0000 (15:46 +0200)]
ggml : fix typo in example usage ggml_gallocr_new (ggml/984)

8 months ago  ggml : fixes after sync (ggml/983)
Diego Devesa [Fri, 4 Oct 2024 06:41:40 +0000 (08:41 +0200)]
ggml : fixes after sync (ggml/983)

ggml : remove test-backend-buffer

ggml : fix CUDA build warnings

8 months ago  ci : fine-grained permissions (#9710)
Xuan Son Nguyen [Fri, 4 Oct 2024 09:47:19 +0000 (11:47 +0200)]
ci : fine-grained permissions (#9710)

8 months ago  Fixed RNG seed docs (#9723)
Daniel Kleine [Fri, 4 Oct 2024 08:54:44 +0000 (10:54 +0200)]
Fixed RNG seed docs (#9723)

* Update README.md

fixed RNG seed info

* changed print format to unsigned

8 months ago  metal : remove abort (skip) (ggml/0)
Georgi Gerganov [Thu, 3 Oct 2024 18:18:19 +0000 (21:18 +0300)]
metal : remove abort (skip) (ggml/0)

8 months ago  sync : ggml
Georgi Gerganov [Thu, 3 Oct 2024 18:17:49 +0000 (21:17 +0300)]
sync : ggml

8 months ago  ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
Johannes Gäßler [Thu, 3 Oct 2024 15:29:59 +0000 (17:29 +0200)]
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)

8 months ago  ggml: refactor cross entropy loss CPU impl. (ggml/976)
Johannes Gäßler [Wed, 2 Oct 2024 13:32:39 +0000 (15:32 +0200)]
ggml: refactor cross entropy loss CPU impl. (ggml/976)

8 months ago  metal : fix compute pass descriptor autorelease crash (#9718)
Jack Mousseau [Thu, 3 Oct 2024 18:01:46 +0000 (11:01 -0700)]
metal : fix compute pass descriptor autorelease crash (#9718)

8 months ago  ggml-backend : add device description to CPU backend (#9720)
Diego Devesa [Thu, 3 Oct 2024 15:39:18 +0000 (17:39 +0200)]
ggml-backend : add device description to CPU backend (#9720)

8 months ago  ggml: unify backend logging mechanism (#9709)
bandoti [Thu, 3 Oct 2024 15:39:03 +0000 (12:39 -0300)]
ggml: unify backend logging mechanism (#9709)

* Add scaffolding for ggml logging macros

* Metal backend now uses GGML logging

* Cuda backend now uses GGML logging

* Cann backend now uses GGML logging

* Add enum tag to parameters

* Use C memory allocation funcs

* Fix compile error

* Use GGML_LOG instead of GGML_PRINT

* Rename llama_state to llama_logger_state

* Prevent null format string

* Fix whitespace

* Remove log callbacks from ggml backends

* Remove cuda log statement
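
The shape of the unified mechanism, as a self-contained hedged sketch: every
backend formats through one internal logger, so a single user-installed
callback can observe all backend output. Names here are illustrative, not
the exact ggml-impl.h macros:

    #include <stdarg.h>
    #include <stdio.h>

    enum ex_log_level { EX_LOG_DEBUG, EX_LOG_INFO, EX_LOG_WARN, EX_LOG_ERROR };

    // single choke point that a callback registered via a ggml_log_set-style
    // API could replace
    static void ex_log_internal(enum ex_log_level level, const char * fmt, ...) {
        va_list args;
        va_start(args, fmt);
        vfprintf(level >= EX_LOG_WARN ? stderr : stdout, fmt, args);
        va_end(args);
    }

    #define EX_LOG_INFO(...)  ex_log_internal(EX_LOG_INFO,  __VA_ARGS__)
    #define EX_LOG_ERROR(...) ex_log_internal(EX_LOG_ERROR, __VA_ARGS__)

    int main(void) {
        EX_LOG_INFO("backend says: %s\n", "hello");
        return 0;
    }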

8 months ago  convert : handle tokenizer merges format from transformers 4.45 (#9696)
compilade [Thu, 3 Oct 2024 14:22:15 +0000 (10:22 -0400)]
convert : handle tokenizer merges format from transformers 4.45 (#9696)

8 months ago  rpc : enable vulkan (#9714)
Radoslav Gerganov [Thu, 3 Oct 2024 10:00:52 +0000 (13:00 +0300)]
rpc : enable vulkan (#9714)

closes #8536

8 months ago  Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)
Ouadie EL FAROUKI [Thu, 3 Oct 2024 06:50:44 +0000 (07:50 +0100)]
Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)

8 months ago  ggml-backend : add device and backend reg interfaces (#9707)
Diego Devesa [Wed, 2 Oct 2024 23:49:47 +0000 (01:49 +0200)]
ggml-backend : add device and backend reg interfaces (#9707)

Co-authored-by: Johannes Gäßler <redacted>
8 months ago  llama : reduce compile time and binary size (#9712)
Xuan Son Nguyen [Wed, 2 Oct 2024 13:49:55 +0000 (15:49 +0200)]
llama : reduce compile time and binary size (#9712)

* llama : speed up compile time

* fix build

* fix build (2)

8 months ago  [SYCL] Initial cmake support of SYCL for AMD GPUs (#9658)
Alberto Cabrera Pérez [Wed, 2 Oct 2024 12:57:18 +0000 (13:57 +0100)]
[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658)

sycl: initial cmake support of SYCL for AMD GPUs

8 months ago  vulkan : do not use tensor->extra (#9407)
Radoslav Gerganov [Wed, 2 Oct 2024 10:49:16 +0000 (13:49 +0300)]
vulkan : do not use tensor->extra (#9407)

* vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.

Ref: #8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)

---------

Co-authored-by: 0cc4m <redacted>
8 months ago  gguf-split : improve --split and --merge logic (#9619)
Zhenwei Jin [Wed, 2 Oct 2024 07:21:57 +0000 (15:21 +0800)]
gguf-split : improve --split and --merge logic (#9619)

* make sure params --split and --merge are not specified at the same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: slaren <redacted>
---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
8 months ago  examples : remove benchmark (#9704)
Georgi Gerganov [Wed, 2 Oct 2024 07:14:44 +0000 (10:14 +0300)]
examples : remove benchmark (#9704)

ggml-ci

8 months ago  Update README.md (#9591)
Paweł Wodnicki [Tue, 1 Oct 2024 17:18:46 +0000 (12:18 -0500)]
Update README.md (#9591)

Add Bielik model.

8 months ago  sync : ggml
Georgi Gerganov [Tue, 1 Oct 2024 13:09:42 +0000 (16:09 +0300)]
sync : ggml

8 months ago  test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)
Johannes Gäßler [Mon, 30 Sep 2024 07:55:23 +0000 (09:55 +0200)]
test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)

8 months ago  vulkan : mul_mat: fix UB with small warps (ggml/952)
Salvatore Mesoraca [Mon, 30 Sep 2024 07:14:09 +0000 (09:14 +0200)]
vulkan : mul_mat: fix UB with small warps (ggml/952)

When the device's warp size is less than 16, it is possible for
loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be
set to 0, because they are calculated as the workgroup size multiplied
by LOAD_VEC_* (which can be 1) and divided by 16, and the workgroup
size is set to be the same as the warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <redacted>
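
A worked instance of the arithmetic above, as a C++ stand-in for the GLSL
(in mul_mm.comp, LOAD_VEC_A and the workgroup size are specialization
constants):

    #include <cstdio>
    #include <initializer_list>

    int main() {
        const int LOAD_VEC_A = 1;
        for (int warp_size : {8, 16, 32}) {
            const int workgroup_size = warp_size;  // old behaviour
            const int loadstride_a   = workgroup_size * LOAD_VEC_A / 16;
            // warp size 8 yields loadstride 0: the fill loop never advances
            printf("warp %2d -> loadstride %d\n", warp_size, loadstride_a);
        }
        return 0;
    }

With the fix, the workgroup size is clamped to at least 16, so the stride
can never reach 0.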
8 months ago  ggml : fix ggml_cast (ggml/973)
Borislav Stanimirov [Mon, 30 Sep 2024 07:11:41 +0000 (10:11 +0300)]
ggml : fix ggml_cast (ggml/973)

8 months ago  ggml: fix gradient allocation logic (ggml/966)
Johannes Gäßler [Sun, 29 Sep 2024 21:18:02 +0000 (23:18 +0200)]
ggml: fix gradient allocation logic (ggml/966)

* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg

8 months ago  metal : reduce command encoding overhead (#9698)
Georgi Gerganov [Tue, 1 Oct 2024 13:00:25 +0000 (16:00 +0300)]
metal : reduce command encoding overhead (#9698)

* metal : reduce command encoding overhead

ggml-ci

* metal : add comments

8 months ago  llama : print correct model type for Llama 3.2 1B and 3B
Georgi Gerganov [Tue, 1 Oct 2024 08:42:01 +0000 (11:42 +0300)]
llama : print correct model type for Llama 3.2 1B and 3B

8 months ago  convert : refactor rope_freqs generation (#9396)
compilade [Tue, 1 Oct 2024 06:31:36 +0000 (02:31 -0400)]
convert : refactor rope_freqs generation (#9396)

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

8 months ago  Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641)
serhii-nakon [Mon, 30 Sep 2024 18:57:12 +0000 (21:57 +0300)]
Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641)

* Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS

* Set ROCM_DOCKER_ARCH as a string, since otherwise the build was incorrect and caused an OOM exit code

8 months ago  ci : reduce severity of unused Pyright ignore comments (#9697)
compilade [Mon, 30 Sep 2024 18:13:16 +0000 (14:13 -0400)]
ci : reduce severity of unused Pyright ignore comments (#9697)

8 months ago  py : update transformers version (#9694)
vb [Mon, 30 Sep 2024 15:03:47 +0000 (17:03 +0200)]
py : update transformers version (#9694)

* update transformers version.

* update hfh version.

8 months ago  flake.lock: Update (#9680)
Georgi Gerganov [Mon, 30 Sep 2024 14:48:49 +0000 (17:48 +0300)]
flake.lock: Update (#9680)

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK+Wk=' (2024-09-19)
  → 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J+PeFKSDV+pHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI=' (2024-09-26)

Co-authored-by: github-actions[bot] <redacted>
8 months ago  console : utf-8 fix for windows stdin (#9690)
Ruchira Hasaranga [Mon, 30 Sep 2024 08:23:42 +0000 (13:53 +0530)]
console : utf-8 fix for windows stdin (#9690)

* utf-8 fix for windows stdin

* Update common/console.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
8 months ago  ggml : define missing HWCAP flags (#9684)
Georgi Gerganov [Sun, 29 Sep 2024 18:18:23 +0000 (21:18 +0300)]
ggml : define missing HWCAP flags (#9684)

ggml-ci

Co-authored-by: Willy Tarreau <redacted>
8 months ago  sync : ggml
Georgi Gerganov [Sun, 29 Sep 2024 18:16:07 +0000 (21:16 +0300)]
sync : ggml

8 months ago  CUDA: remove bad assert (ggml/972)
Johannes Gäßler [Sun, 29 Sep 2024 17:56:17 +0000 (19:56 +0200)]
CUDA: remove bad assert (ggml/972)

8 months ago  vulkan : multithread pipeline creation (ggml/963)
Jeff Bolz [Sun, 29 Sep 2024 16:50:17 +0000 (11:50 -0500)]
vulkan : multithread pipeline creation (ggml/963)

8 months ago  vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)
Jeff Bolz [Fri, 27 Sep 2024 07:58:01 +0000 (02:58 -0500)]
vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)

8 months ago  vulkan : argsort barriers must be under uniform control flow (ggml/951)
Salvatore Mesoraca [Thu, 26 Sep 2024 06:59:42 +0000 (08:59 +0200)]
vulkan : argsort barriers must be under uniform control flow (ggml/951)

A return before a barrier that happens only in some threads of a
workgroup leads to UB. While the old code actually works on some
devices, it fails on others (e.g. "smaller" GPUs).

BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <redacted>
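
The same uniformity rule exists in CUDA for __syncthreads(), which makes
for a compact illustration of the hazard and the usual fix (hedged: the
actual change is in a Vulkan GLSL shader, not CUDA):

    __global__ void bad(const int * in, int * out, int n) {
        __shared__ int tmp[256];
        if (threadIdx.x >= (unsigned) n) return; // some threads exit early...
        tmp[threadIdx.x] = in[threadIdx.x];
        __syncthreads();                         // ...making this barrier non-uniform: UB
        out[threadIdx.x] = tmp[threadIdx.x ^ 1];
    }

    __global__ void good(const int * in, int * out, int n) {
        __shared__ int tmp[256];
        const bool active = threadIdx.x < (unsigned) n;
        if (active) tmp[threadIdx.x] = in[threadIdx.x];
        __syncthreads();                         // reached by every thread: well-defined
        if (active) out[threadIdx.x] = tmp[threadIdx.x ^ 1];
    }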
8 months ago  ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)
Georgi Gerganov [Tue, 24 Sep 2024 10:23:59 +0000 (13:23 +0300)]
ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)

8 months ago  common : ensure llama_batch size does not exceed max size (#9668)
matiaslin [Sun, 29 Sep 2024 12:25:00 +0000 (05:25 -0700)]
common : ensure llama_batch size does not exceed max size (#9668)

A crash was observed when the number of tokens added to a batch exceeds
the llama_batch size. An assertion in llama_batch_add was added to
protect against llama_batch size overflow.
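
A hedged sketch of the guard (the real check lives in common's
llama_batch_add; field names from llama.h):

    #include <cassert>
    #include "llama.h"

    void batch_add_checked(llama_batch & batch, llama_token id, llama_pos pos,
                           int32_t n_tokens_max) {
        // fail loudly instead of writing past the arrays allocated by
        // llama_batch_init(n_tokens_max, ...)
        assert(batch.n_tokens < n_tokens_max && "llama_batch capacity exceeded");
        batch.token   [batch.n_tokens] = id;
        batch.pos     [batch.n_tokens] = pos;
        batch.n_seq_id[batch.n_tokens] = 1;
        batch.seq_id  [batch.n_tokens][0] = 0;   // sequence 0
        batch.logits  [batch.n_tokens] = false;
        batch.n_tokens++;
    }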

8 months ago  py : add model class for Chameleon conversion (#9683)
nopperl [Sun, 29 Sep 2024 12:02:06 +0000 (12:02 +0000)]
py : add model class for Chameleon conversion (#9683)

8 months ago  contrib : add Resources section (#9675)
Georgi Gerganov [Sun, 29 Sep 2024 11:38:18 +0000 (14:38 +0300)]
contrib : add Resources section (#9675)

9 months ago  llama : add reranking support (#9510)
Georgi Gerganov [Sat, 28 Sep 2024 14:42:03 +0000 (17:42 +0300)]
llama : add reranking support (#9510)

* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classification head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : avoid ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  test-backend-ops : use flops for some performance tests (#9657)
slaren [Sat, 28 Sep 2024 12:32:46 +0000 (14:32 +0200)]
test-backend-ops : use flops for some performance tests (#9657)

* test-backend-ops : use flops for some performance tests

- parallelize tensor quantization

- use a different set of cases for performance and correctness tests

- run each test for at least one second

9 months ago  llama : add comment about thread-safety [no ci] (#9449)
Georgi Gerganov [Sat, 28 Sep 2024 12:13:21 +0000 (15:13 +0300)]
llama : add comment about thread-safety [no ci] (#9449)

9 months ago  vocab : refactor tokenizer to reduce init overhead (#9449)
Zhenwei Jin [Sat, 28 Sep 2024 12:10:58 +0000 (20:10 +0800)]
vocab : refactor tokenizer to reduce init overhead (#9449)

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fields to avoid unused-field build errors

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  llama : add support for Chameleon (#8543)
nopperl [Sat, 28 Sep 2024 12:08:43 +0000 (12:08 +0000)]
llama : add support for Chameleon (#8543)

* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <redacted>
* fix lint

* trigger ci

---------

Co-authored-by: compilade <redacted>
9 months ago  readme : add tool (#9655)
Aarni Koskela [Sat, 28 Sep 2024 12:07:14 +0000 (15:07 +0300)]
readme : add tool (#9655)

9 months ago  ggml : add run-time detection of neon, i8mm and sve (#9331)
Dan Johansson [Sat, 28 Sep 2024 12:06:16 +0000 (14:06 +0200)]
ggml : add run-time detection of neon, i8mm and sve (#9331)

* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instruction set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non-aarch64 Arm architectures

* ggml: Move definition of ggml_arm_arch_features to the global data section
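
On the Linux side, detection boils down to reading the kernel's hwcap
bits; a hedged aarch64-Linux-only sketch (the commit also covers Apple
targets via sysctl):

    #include <stdio.h>
    #include <sys/auxv.h>
    #include <asm/hwcap.h>

    int main(void) {
        const unsigned long hwcap  = getauxval(AT_HWCAP);
        const unsigned long hwcap2 = getauxval(AT_HWCAP2);
        printf("neon: %d\n", !!(hwcap  & HWCAP_ASIMD)); // NEON is "ASIMD" in hwcaps
        printf("sve : %d\n", !!(hwcap  & HWCAP_SVE));
        printf("i8mm: %d\n", !!(hwcap2 & HWCAP2_I8MM));
        return 0;
    }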

9 months ago  Enable use of the ReBAR feature to upload buffers to the device (#9251)
Markus Tavenrath [Sat, 28 Sep 2024 10:05:05 +0000 (12:05 +0200)]
Enable use of the ReBAR feature to upload buffers to the device (#9251)

9 months ago  readme : update hot topics
Georgi Gerganov [Fri, 27 Sep 2024 17:57:51 +0000 (20:57 +0300)]
readme : update hot topics

9 months ago  cmake : add option for common library (#9661)
Borislav Stanimirov [Fri, 27 Sep 2024 07:42:06 +0000 (10:42 +0300)]
cmake : add option for common library (#9661)

9 months ago  [SYCL] add missing DLL file in package (#9577)
Neo Zhang Jianyu [Thu, 26 Sep 2024 09:38:31 +0000 (17:38 +0800)]
[SYCL] add missing DLL file in package (#9577)

* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <redacted>
9 months ago  mtgpu: enable VMM (#9597)
R0CKSTAR [Thu, 26 Sep 2024 01:27:40 +0000 (09:27 +0800)]
mtgpu: enable VMM (#9597)

Signed-off-by: Xiaodong Ye <redacted>
9 months ago  ci : fix docker build number and tag name (#9638)
Xuan Son Nguyen [Wed, 25 Sep 2024 15:26:01 +0000 (17:26 +0200)]
ci : fix docker build number and tag name (#9638)

* ci : fix docker build number and tag name

* fine-grained permissions

9 months ago  ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)
Charles Xu [Wed, 25 Sep 2024 13:12:20 +0000 (15:12 +0200)]
ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added a fallback mechanism used when the offline re-quantized model is
not optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream

9 months ago  server : add more env vars, improve gen-docs (#9635)
Xuan Son Nguyen [Wed, 25 Sep 2024 12:05:13 +0000 (14:05 +0200)]
server : add more env vars, improve gen-docs (#9635)

* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT

9 months ago  llama : add IBM Granite MoE architecture (#9438)
Gabe Goodhart [Wed, 25 Sep 2024 07:06:52 +0000 (01:06 -0600)]
llama : add IBM Granite MoE architecture (#9438)

* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com
Signed-off-by: Gabe Goodhart <redacted>
* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com
Co-authored-by: Georgi Gerganov <redacted>
Signed-off-by: Gabe Goodhart <redacted>
* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
Co-authored-by: Georgi Gerganov <redacted>
9 months ago  cann: fix crash when llama-bench is running on multiple cann devices (#9627)
Dou Xinpeng [Wed, 25 Sep 2024 03:30:38 +0000 (11:30 +0800)]
cann: fix crash when llama-bench is running on multiple cann devices (#9627)

9 months ago  ggml : add AVX512DQ requirement for AVX512 builds (#9622)
Eric Zhang [Tue, 24 Sep 2024 08:03:21 +0000 (16:03 +0800)]
ggml : add AVX512DQ requirement for AVX512 builds (#9622)

9 months ago  sync : ggml
Georgi Gerganov [Tue, 24 Sep 2024 08:01:18 +0000 (11:01 +0300)]
sync : ggml

9 months ago  examples : adapt to ggml.h changes (ggml/0)
Georgi Gerganov [Fri, 20 Sep 2024 18:50:16 +0000 (21:50 +0300)]
examples : adapt to ggml.h changes (ggml/0)

ggml-ci

9 months ago  llama : keep track of all EOG tokens in the vocab (#9609)
Georgi Gerganov [Tue, 24 Sep 2024 07:16:06 +0000 (10:16 +0300)]
llama : keep track of all EOG tokens in the vocab (#9609)

ggml-ci

9 months ago  log : add CONT level for continuing previous log entry (#9610)
Georgi Gerganov [Tue, 24 Sep 2024 07:15:35 +0000 (10:15 +0300)]
log : add CONT level for continuing previous log entry (#9610)