git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Georgi Gerganov [Sat, 2 Nov 2024 13:18:56 +0000 (15:18 +0200)]

llama : adjust default context size + print warnings (#10136)

* llama : adjust default context size + print warnings

ggml-ci

* ggml-ci : add missing gpu-layers + adjust context sizes

commit | commitdiff | tree

Diego Devesa [Sat, 2 Nov 2024 12:08:53 +0000 (13:08 +0100)]

simple-chat : only add bos on first prompt (#10129)

commit | commitdiff | tree

Xuan Son Nguyen [Sat, 2 Nov 2024 11:53:17 +0000 (12:53 +0100)]

convert-lora : make `--base` optional (#10110)

* convert-lora : make `--base` optional

* lint

* handle case where base_model_name_or_path is invalid

* do not include metadata from base model

* clarify unspecified --base

* add small comment [no ci]

* trigger ci

commit | commitdiff | tree

Diego Devesa [Fri, 1 Nov 2024 22:50:59 +0000 (23:50 +0100)]

llama : add simple-chat example (#10124)

* llama : add simple-chat example

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

Diego Devesa [Fri, 1 Nov 2024 22:48:26 +0000 (23:48 +0100)]

llama : use smart pointers for ggml resources (#10117)

commit | commitdiff | tree

Shupei Fan [Fri, 1 Nov 2024 18:33:14 +0000 (02:33 +0800)]

vulkan : improve ggml_vk_create_buffer error handling (#9898)

commit | commitdiff | tree

Georgi Gerganov [Fri, 1 Nov 2024 15:31:51 +0000 (17:31 +0200)]

readme : update hot topics

commit | commitdiff | tree

sasha0552 [Fri, 1 Nov 2024 13:33:14 +0000 (13:33 +0000)]

server : fix smart selection of available slot (#10120)

* Fix smart selection of available slot

* minor fix

* replace vectors of tokens with shorthands

commit | commitdiff | tree

Georgi Gerganov [Fri, 1 Nov 2024 10:58:45 +0000 (12:58 +0200)]

ggml : remove ggml_scratch (#10121)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Fri, 1 Nov 2024 08:28:24 +0000 (10:28 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Fri, 1 Nov 2024 08:23:05 +0000 (10:23 +0200)]

ggml : alloc ggml_contexts on the heap (whisper/2525)

commit | commitdiff | tree

Zhenwei Jin [Fri, 1 Nov 2024 03:09:59 +0000 (11:09 +0800)]

build: fix build error in Windows env with OneAPI setup (#10107)

commit | commitdiff | tree

Diego Devesa [Thu, 31 Oct 2024 23:49:53 +0000 (00:49 +0100)]

llama : improve output buffer type selection (#10098)

commit | commitdiff | tree

Diego Devesa [Thu, 31 Oct 2024 23:45:34 +0000 (00:45 +0100)]

quantize : fix --keep-split (#10114)

commit | commitdiff | tree

Diego Devesa [Thu, 31 Oct 2024 21:54:23 +0000 (22:54 +0100)]

llama : fix buffer checks for mamba and rwkv (#10111)

* llama : fix buffer checks for mamba and rwkv

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE

commit | commitdiff | tree

Zhenwei Jin [Thu, 31 Oct 2024 18:50:39 +0000 (02:50 +0800)]

loader: refactor tensor weights storage (#9935)

* loader: refactor tensor weights storage

* use sorted map, sort weights by layer

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Kevin Gibbons [Thu, 31 Oct 2024 13:02:35 +0000 (06:02 -0700)]

server : include scheme when printing URL (#10106)

commit | commitdiff | tree

Diego Devesa [Thu, 31 Oct 2024 10:40:59 +0000 (11:40 +0100)]

ggml : check tensor name lengths in gguf files (#10100)

commit | commitdiff | tree

Sergio López [Thu, 31 Oct 2024 09:09:52 +0000 (10:09 +0100)]

kompute: add mul_mat_q4_k shader (#10097)

This is a more or less direct translation from the Metal implementation
to GLSL.

Signed-off-by: Sergio Lopez <redacted>

commit | commitdiff | tree

Sergio López [Wed, 30 Oct 2024 16:01:52 +0000 (17:01 +0100)]

kompute: add backend registry / device interfaces (#10045)

Get in line with the other backends by supporting the newer
backend/device registry interfaces.

Signed-off-by: Sergio Lopez <redacted>

commit | commitdiff | tree

Diego Devesa [Wed, 30 Oct 2024 13:51:21 +0000 (14:51 +0100)]

ggml : fix memory leaks when loading invalid gguf files (#10094)

* ggml : fix gguf string leak when reading kv pairs fails

* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type

* ggml : avoid crashing on failed memory allocations when loading a gguf file

commit | commitdiff | tree

Rich Dougherty [Wed, 30 Oct 2024 12:22:39 +0000 (01:22 +1300)]

readme : more lora detail in main example readme (#10064)

commit | commitdiff | tree

Rich Dougherty [Wed, 30 Oct 2024 12:22:21 +0000 (01:22 +1300)]

convert : more detailed convert lora usage docs (#10065)

commit | commitdiff | tree

xctan [Wed, 30 Oct 2024 07:00:40 +0000 (15:00 +0800)]

ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029)

* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM

commit | commitdiff | tree

Diego Devesa [Wed, 30 Oct 2024 01:01:23 +0000 (02:01 +0100)]

llama : refactor model loader with backend registry (#10026)

commit | commitdiff | tree

Changyeon Kim [Tue, 29 Oct 2024 08:52:56 +0000 (17:52 +0900)]

ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)

* ggml: Add POOL2D OP for GPU ACC to the Vulkan.

- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.

Signed-off-by: Changyeon Kim <redacted>

* [fix] Correct the incorrect order of the parameters.

fix casting to int.

Signed-off-by: Changyeon Kim <redacted>

---------

Signed-off-by: Changyeon Kim <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 29 Oct 2024 08:42:05 +0000 (10:42 +0200)]

llama : remove Tail-Free sampling (#10071)

ggml-ci

commit | commitdiff | tree

arch-btw [Mon, 28 Oct 2024 17:45:33 +0000 (10:45 -0700)]

llama : Add IBM granite template (#10013)

* Add granite template to llama.cpp

* Add granite template to test-chat-template.cpp

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <redacted>

* Update tests/test-chat-template.cpp

Co-authored-by: Xuan Son Nguyen <redacted>

* Added proper template and expected output

* Small change to \n

Small change to \n

* Add code space &

Co-authored-by: Xuan Son Nguyen <redacted>

* Fix spacing

* Apply suggestions from code review

* Update src/llama.cpp

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 28 Oct 2024 15:41:24 +0000 (17:41 +0200)]

flake.lock: Update (#10063)

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
→ 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

R0CKSTAR [Mon, 28 Oct 2024 09:02:48 +0000 (17:02 +0800)]

musa: workaround for Guilty Lockup in cleaning src0 (#10042)

Signed-off-by: Xiaodong Ye <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 28 Oct 2024 06:49:32 +0000 (08:49 +0200)]

server : don't overfill the batch during infill (#10018)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sun, 27 Oct 2024 18:59:58 +0000 (20:59 +0200)]

llama : switch KQ multiplication to F32 precision by default (#10015)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sat, 26 Oct 2024 07:34:08 +0000 (10:34 +0300)]

sync : ggml

commit | commitdiff | tree

bssrdf [Wed, 23 Oct 2024 18:34:00 +0000 (14:34 -0400)]

increase cuda_cpy block size (ggml/996)

Co-authored-by: bssrdf <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 26 Oct 2024 07:33:31 +0000 (10:33 +0300)]

scripts : fix amx sync [no ci]

commit | commitdiff | tree

Georgi Gerganov [Fri, 25 Oct 2024 19:26:15 +0000 (22:26 +0300)]

metal : support permuted matrix multiplications (#10033)

* metal : support permuted matrix multiplications

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor

commit | commitdiff | tree

wwoodsTM [Fri, 25 Oct 2024 16:07:34 +0000 (10:07 -0600)]

llama : add DRY sampler (#9702)

* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <redacted>
Co-authored-by: pi6am <redacted>

commit | commitdiff | tree

Michael Podvitskiy [Fri, 25 Oct 2024 15:57:54 +0000 (17:57 +0200)]

llama: string_split fix (#10022)

* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces

* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string

commit | commitdiff | tree

Srihari-mcw [Fri, 25 Oct 2024 07:27:41 +0000 (12:57 +0530)]

llamafile : extend sgemm.cpp support for Q5_0 models (#10010)

commit | commitdiff | tree

Georgi Gerganov [Fri, 25 Oct 2024 07:13:46 +0000 (10:13 +0300)]

server : check that the prompt fits in the slot's context (#10030)

ggml-ci

commit | commitdiff | tree

Xuan Son Nguyen [Thu, 24 Oct 2024 19:51:22 +0000 (21:51 +0200)]

server : refactor slot input data, move tokenizer to HTTP thread (#10023)

* server : refactor slot input data, move tokenizer to HTTP thread

* move prompt_tokens.empty() check

* fix incorrect if branch

* fix infinite generation loop

* bring back infill validation

* add infill test

* try fixing format_infill

* fix test

* remove redundant code

* rename completion to inference

* update docs

* use llama_tokens everywhere

commit | commitdiff | tree

Georgi Gerganov [Thu, 24 Oct 2024 18:23:33 +0000 (21:23 +0300)]

ci : fix cmake flags for SYCL

commit | commitdiff | tree

Johannes Gäßler [Thu, 24 Oct 2024 12:40:23 +0000 (14:40 +0200)]

CUDA: fix insufficient buffer clearing for MMQ (#10032)

commit | commitdiff | tree

Johannes Gäßler [Thu, 24 Oct 2024 09:09:36 +0000 (11:09 +0200)]

CUDA: fix MMQ for non-contiguous src0, add tests (#10021)

* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code

commit | commitdiff | tree

wwoodsTM [Wed, 23 Oct 2024 19:27:51 +0000 (13:27 -0600)]

server : samplers accept the prompt correctly (#10019)

commit | commitdiff | tree

Georgi Gerganov [Wed, 23 Oct 2024 14:23:55 +0000 (17:23 +0300)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Wed, 23 Oct 2024 14:16:56 +0000 (17:16 +0300)]

llama.vim : bump generation time limit to 3s [no ci]

commit | commitdiff | tree

Johannes Gäßler [Fri, 18 Oct 2024 07:24:44 +0000 (09:24 +0200)]

CUDA: fix 1D im2col, add tests (ggml/993)

commit | commitdiff | tree

Daniel Bevenius [Wed, 16 Oct 2024 18:10:01 +0000 (20:10 +0200)]

ggml : remove redundant set of contexts used field (ggml/978)

This commit removes the setting of the `used` field of the contexts in
the global state (g_state) in `ggml_init`.

The motivation for this change is that I believe that this additional
initialization might not be required after the changes in Commit
45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from
whisper.cpp"), which changed the initialization of the contexts field
from `{ 0 }` to `{ { 0 } }`:

```diff
             g_state = (struct ggml_state) {
-                /*.contexts =*/ { 0 },
+                /*.contexts =*/ { { 0 } },
             };
```
My understanding is that the `{0}` initialization might not have
zero-initialized all the nested fields in every array element because of
compiler differences, and might have been the reason for having the
explicit setting of the `used` fields to false.

commit | commitdiff | tree

Michael Coppola [Wed, 23 Oct 2024 11:09:26 +0000 (07:09 -0400)]

llama.vim : add classic vim support (#9995)

* added classic vim support

* fixed ring update, removed blank line

* minor

* minor

* minor doc update

* removed unneeded var

* minor

* minor

* fixed job_start creating new scratch buffers

* fixed job_start creating new scratch buffers

* fixed ghost text indenting when expandtab is on

* removed unused code

* minor

* unified fim_on_exit

* minor

* vim ghost text rendering now uses pos_x and pos_y parameters

* renamed *_hlgroup to hlgroup_*

* renamed *_ghost_text to ghost_text_*, moved nvim/vim detection to llama#init()

* minor

---------

Co-authored-by: Michael Coppola <redacted>

commit | commitdiff | tree

Jun Hee Yoo [Wed, 23 Oct 2024 10:33:45 +0000 (19:33 +0900)]

metal : add POOL2D and fix IM2COL (#9943)

* add pool_2d

Signed-off-by: Junhee Yoo <redacted>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <redacted>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <redacted>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <redacted>

* apply suggestions

Signed-off-by: Junhee Yoo <redacted>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <redacted>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <redacted>

* apply review

Signed-off-by: Junhee Yoo <redacted>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <redacted>

---------

Signed-off-by: Junhee Yoo <redacted>

commit | commitdiff | tree

github-actions[bot] [Sun, 20 Oct 2024 00:22:59 +0000 (00:22 +0000)]

flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)
→ 'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)

commit | commitdiff | tree

Xuan Son Nguyen [Tue, 22 Oct 2024 14:59:02 +0000 (16:59 +0200)]

llama : fix empty batch causing llama_batch_allocr to crash (#9966)

* llama : fix empty batch causing llama_batch_allocr to crash

* move batch_allocr inside decode/encode_internal

* fix build

* add GGML_ASSERT

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Daniel Bevenius [Tue, 22 Oct 2024 13:31:06 +0000 (15:31 +0200)]

llama : rename batch to ubatch (#9950)

This commit renames the member field batch in llm_build_context to
ubatch, and also the parameter batch in llama_build_graph, and
llama_set_inputs to ubatch.

The motivation for this change is to make the code more readable
(considering there are the structs llama_batch and llama_sbatch), and
consistent with other parts of the code base where parameters/fields of
type llama_ubatch are named ubatch.

commit | commitdiff | tree

Molly Sophia [Tue, 22 Oct 2024 13:22:26 +0000 (21:22 +0800)]

Rwkv chat template fix (#10001)

* llama: remove useless template matching for rwkv-world

Signed-off-by: Molly Sophia <redacted>

* converter: Add comment about the hack for rwkv models

Signed-off-by: Molly Sophia <redacted>

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <redacted>

---------

Signed-off-by: Molly Sophia <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Tue, 22 Oct 2024 11:08:41 +0000 (13:08 +0200)]

lora : warn user if new token is added in the adapter (#9948)

commit | commitdiff | tree

Molly Sophia [Tue, 22 Oct 2024 10:33:37 +0000 (18:33 +0800)]

llama : add chat template for RWKV-World + fix EOT (#9968)

* Add chat template for RWKV-World

Signed-off-by: Molly Sophia <redacted>

* RWKV: Fix the chat template not being used

Signed-off-by: Molly Sophia <redacted>

* RWKV v6: Set EOT token to ``\n\n``

Signed-off-by: Molly Sophia <redacted>

* readme: add rwkv into supported model list

Signed-off-by: Molly Sophia <redacted>

---------

Signed-off-by: Molly Sophia <redacted>

commit | commitdiff | tree

leo-pony [Tue, 22 Oct 2024 08:16:01 +0000 (16:16 +0800)]

[CANN] Adapt to dynamically loadable backends mechanism (#9970)

* [CANN] Adapt to dynamically loadable backends mechanism

* Fix the bug: inference output is garbled in debug running mode for LM models whose type is of the Q4_0 class

* Handle the review comments of this pull request

commit | commitdiff | tree

Daniel Bevenius [Tue, 22 Oct 2024 07:40:02 +0000 (09:40 +0200)]

arg : fix typo in embeddings argument help [no ci] (#9994)

This commit fixes two typos in the help text for the `--embd-normalize`
and `--embd-separator` arguments. It also updates common.h, which contains
the same typo in two comments.

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 21:35:25 +0000 (00:35 +0300)]

llama.vim : fix info text display [no ci] (#9787)

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 19:52:22 +0000 (22:52 +0300)]

llama.vim : move info to the right of screen [no ci] (#9787)

'eol' messes up the rendering with nvim v0.10.2 for some reason

commit | commitdiff | tree

Asghar Ghorbani [Mon, 21 Oct 2024 18:20:59 +0000 (20:20 +0200)]

readme : update UI list (#9972)

add PocketPal AI app

commit | commitdiff | tree

Daniel Bevenius [Mon, 21 Oct 2024 18:12:52 +0000 (20:12 +0200)]

arg : fix attention non-causal arg value hint (#9985)

This commit updates the argument value hint for the `--attention`
argument to `non-causal`.

The motivation for this change is that the only values for this argument
are `causal` and `non-causal`.

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 17:25:02 +0000 (20:25 +0300)]

llama.vim : plugin for Neovim (#9787)

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 13:20:46 +0000 (16:20 +0300)]

ggml : add asserts for type conversion in fattn kernels (#9971)

ggml-ci

commit | commitdiff | tree

Radoslav Gerganov [Mon, 21 Oct 2024 10:35:40 +0000 (13:35 +0300)]

rpc : pack only RPC structs (#9959)

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 06:46:40 +0000 (09:46 +0300)]

llama : default sampling changes + greedy update (#9897)

* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
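
The temperature rule described in this commit ("for t <= 0.0f, keep the max logit intact and set the rest to -inf") can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation (which is C++); the function names `apply_temperature` and `softmax` are hypothetical, and keeping the first maximum on ties is an assumption made here for determinism.

```python
import math

def apply_temperature(logits, temp):
    """Scale logits by 1/temp; for temp <= 0 keep only the max logit."""
    if temp <= 0.0:
        # Keep the (first) maximum logit intact and set the rest to -inf.
        # Tie-breaking on the first maximum is an assumption of this sketch.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [l if i == best else -math.inf for i, l in enumerate(logits)]
    return [l / temp for l in logits]

def softmax(logits):
    """Numerically stable softmax; -inf logits get probability 0."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With this rule a non-positive temperature leaves exactly one token with non-zero probability after softmax, which is why the commit notes that no special "greedy" logic is needed and that top-k == 1 gives the same result.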

commit | commitdiff | tree

Georgi Gerganov [Mon, 21 Oct 2024 06:37:12 +0000 (09:37 +0300)]

speculative : fix handling of some input params (#9963)

* speculative : fix batch sizes at initialization

ggml-ci

* speculative : handle params.n_predict == -1

* speculative : limit batch size to llama_n_batch

commit | commitdiff | tree

Neo Zhang Jianyu [Mon, 21 Oct 2024 06:26:09 +0000 (14:26 +0800)]

fix mul_mat_vec_q and *_vec_q error (#9939)

Co-authored-by: arthw <redacted>

commit | commitdiff | tree

Loïc Carrère [Sun, 20 Oct 2024 16:25:41 +0000 (18:25 +0200)]

readme : update bindings list (#9951)

Update the binding list by adding LM-Kit.NET (C# & VB.NET)

commit | commitdiff | tree

icppWorld [Sun, 20 Oct 2024 16:01:34 +0000 (12:01 -0400)]

readme : update infra list (#9942)

llama_cpp_canister allows you to run llama.cpp as a Smart Contract on the Internet Computer. The smart contract runs as WebAssembly in a so-called 'canister'.

commit | commitdiff | tree

Xuan Son Nguyen [Fri, 18 Oct 2024 21:18:01 +0000 (23:18 +0200)]

llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)

* refactor llama_batch_get_one

* adapt all examples

* fix simple.cpp

* fix llama_bench

* fix

* fix context shifting

* free batch before return

* use common_batch_add, reuse llama_batch in loop

* null terminated seq_id list

* fix save-load-state example

* fix perplexity

* correct token pos in llama_batch_allocr

commit | commitdiff | tree

Radoslav Gerganov [Fri, 18 Oct 2024 11:33:58 +0000 (14:33 +0300)]

rpc : backend refactoring (#9912)

* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server

commit | commitdiff | tree

Ouadie EL FAROUKI [Fri, 18 Oct 2024 05:46:16 +0000 (06:46 +0100)]

[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)

* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp

commit | commitdiff | tree

Ma Mingfei [Fri, 18 Oct 2024 05:34:36 +0000 (13:34 +0800)]

add amx kernel for gemm (#8998)

add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as a ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend

commit | commitdiff | tree

Georgi Gerganov [Fri, 18 Oct 2024 04:32:19 +0000 (07:32 +0300)]

server : add n_indent parameter for line indentation requirement (#9929)

ggml-ci

commit | commitdiff | tree

Daniel Bevenius [Thu, 17 Oct 2024 23:41:51 +0000 (01:41 +0200)]

llama : rename batch_all to batch (#8881)

This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.

commit | commitdiff | tree

Georgi Gerganov [Thu, 17 Oct 2024 20:43:05 +0000 (23:43 +0300)]

readme : remove --memory-f32 references (#9925)

commit | commitdiff | tree

Georgi Gerganov [Thu, 17 Oct 2024 20:26:32 +0000 (23:26 +0300)]

llama : change warning to debug log

commit | commitdiff | tree

Georgi Gerganov [Thu, 17 Oct 2024 19:32:47 +0000 (22:32 +0300)]

llama : infill sampling handle very long tokens (#9924)

* llama : infill sampling handle very long tokens

ggml-ci

* cont : better indices

ggml-ci

commit | commitdiff | tree

Tim Wang [Thu, 17 Oct 2024 06:57:14 +0000 (17:57 +1100)]

readme : update bindings list (#9918)

Co-authored-by: Tim Wang <redacted>

commit | commitdiff | tree

Diego Devesa [Thu, 17 Oct 2024 00:46:58 +0000 (02:46 +0200)]

vulkan : add backend registry / device interfaces (#9721)

* vulkan : add backend registry / device interfaces

* llama : print devices used on model load

commit | commitdiff | tree

Gilad S. [Wed, 16 Oct 2024 23:34:22 +0000 (02:34 +0300)]

fix: allocating CPU buffer with size `0` (#9917)

commit | commitdiff | tree

Gilad S. [Wed, 16 Oct 2024 22:36:51 +0000 (01:36 +0300)]

fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages working

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment

commit | commitdiff | tree

Daniel Bevenius [Wed, 16 Oct 2024 17:34:28 +0000 (19:34 +0200)]

llama : suppress conversion from 'size_t' to 'int' (#9046)

* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.

commit | commitdiff | tree

Daniel Bevenius [Wed, 16 Oct 2024 17:24:05 +0000 (19:24 +0200)]

llava : fix typo in error message [no ci] (#9884)

commit | commitdiff | tree

Joe Eli McIlvain [Wed, 16 Oct 2024 16:03:24 +0000 (09:03 -0700)]

grammar : fix JSON Schema for string regex with top-level alt. (#9903)

Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
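
The precedence issue above can be reproduced with a small Python sketch that builds the rule text both ways. The helper `string_pattern_rule` and its `parenthesize` flag are hypothetical names for illustration, not the converter's actual code, and the sketch does not escape special characters inside the alternatives.

```python
def string_pattern_rule(name, alternatives, parenthesize=True):
    """Build a GBNF-style rule for a JSON string matching one of `alternatives`.

    Hypothetical helper: alternatives are assumed to be plain literals
    that need no escaping.
    """
    body = " | ".join('"' + alt + '"' for alt in alternatives)
    if parenthesize:
        # Group the alternation so the surrounding quotes apply to every branch.
        body = "(" + body + ")"
    return name + ' ::= "\\"" ' + body + ' "\\"" space'

# Without grouping, `|` binds looser than concatenation, so the opening quote
# only belongs to the "A" branch and the closing quote only to the "D" branch.
print(string_pattern_rule("thing", ["A", "B", "C", "D"], parenthesize=False))
print(string_pattern_rule("thing", ["A", "B", "C", "D"]))
```

Always adding the parentheses, even for patterns without alternation, keeps the generator simple at the cost of a slightly more verbose grammar.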

commit | commitdiff | tree

Molly Sophia [Wed, 16 Oct 2024 10:10:21 +0000 (18:10 +0800)]

llama : add tensor name for "result_norm" (#9907)

Signed-off-by: Molly Sophia <redacted>

commit | commitdiff | tree

Alexey Parfenov [Wed, 16 Oct 2024 08:35:53 +0000 (08:35 +0000)]

server : fix the disappearance of the end of the text (#9867)

* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks

commit | commitdiff | tree

Georgi Gerganov [Wed, 16 Oct 2024 08:28:14 +0000 (11:28 +0300)]

sync : ggml

commit | commitdiff | tree

Daniel Bevenius [Wed, 9 Oct 2024 14:40:35 +0000 (16:40 +0200)]

ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)

This commit removes the buffer_id field from the leaf_alloc struct.

The motivation for this is that this field is only written to and never
read/used as far as I can tell. Each tensor_alloc has a buffer_id field
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.

commit | commitdiff | tree

leo-pony [Wed, 16 Oct 2024 00:51:46 +0000 (08:51 +0800)]

[CANN] Fix cann compilation error (#9891)

Fix the CANN compilation error introduced after merging support for dynamically loadable backends into llama.cpp.

commit | commitdiff | tree

Georgi Gerganov [Tue, 15 Oct 2024 13:35:33 +0000 (16:35 +0300)]

llama : add infill sampler (#9896)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 15 Oct 2024 13:28:55 +0000 (16:28 +0300)]

server : improve infill context reuse (#9894)

ggml-ci

commit | commitdiff | tree

MaggotHATE [Tue, 15 Oct 2024 10:54:55 +0000 (15:54 +0500)]

sampling : add XTC sampler (#9742)

* Initial XTC commit

Adds the XTC sampler. It is not activated by default, but the recommended settings are used as defaults.

* Cleanup

* Simplified chances calculation

To be more in line with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproducible

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep`

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable ones
4. Penalize them and put them at the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
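
As a rough illustration of the idea behind the sampler, here is a minimal Python sketch based only on what the bullets above state: a single chance roll at the beginning, a threshold above 0.5 switching XTC off, and the least probable token above the threshold being kept. It is not the llama.cpp code; the function name `xtc_filter`, the `(token, probability)` pair representation, and the injectable `rng` are assumptions, and `min_keep` handling and renormalization are omitted.

```python
import random

def xtc_filter(probs, probability, threshold, rng=random.random):
    """Drop the most probable tokens above `threshold`, keeping the least
    probable of them. `probs` is a list of (token, p) pairs (assumed shape)."""
    cands = sorted(probs, key=lambda tp: tp[1], reverse=True)
    # Off by default (probability 0); a threshold above 0.5 also disables it.
    if probability <= 0.0 or threshold > 0.5 or len(cands) < 2:
        return cands
    # The chance is rolled once, at the beginning.
    if rng() >= probability:
        return cands
    # Count the tokens at or above the threshold (candidates are sorted).
    above = sum(1 for _, p in cands if p >= threshold)
    if above < 2:
        return cands  # nothing to exclude: at most one token qualifies
    # Remove all qualifying tokens except the last (least probable) one.
    return cands[above - 1:]
```

For example, with candidates of probability 0.5, 0.3, 0.15 and 0.05 and a threshold of 0.2, a successful roll removes only the 0.5 token, since 0.3 is the least probable token above the threshold.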

commit | commitdiff | tree

Georgi Gerganov [Tue, 15 Oct 2024 09:48:44 +0000 (12:48 +0300)]

server : update preact (#9895)

commit | commitdiff | tree

Michał Tuszyński [Tue, 15 Oct 2024 08:20:34 +0000 (10:20 +0200)]

readme : update bindings list (#9889)

commit | commitdiff | tree

VoidIsVoid [Mon, 14 Oct 2024 07:04:36 +0000 (15:04 +0800)]

server : handle "logprobs" field with false value (#9871)

Co-authored-by: Gimling <redacted>

commit | commitdiff | tree

agray3 [Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)]

Vectorize load instructions in dmmv f16 CUDA kernel (#9816)

* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <redacted>

---------

Co-authored-by: Johannes Gäßler <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 13 Oct 2024 18:31:35 +0000 (21:31 +0300)]

server : accept extra_context for the infill endpoint (#9874)

* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci

Packaging of ggml-org/llama.cpp
