git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
9 months ago  log : add CONT level for continuing previous log entry (#9610)
Georgi Gerganov [Tue, 24 Sep 2024 07:15:35 +0000 (10:15 +0300)]
log : add CONT level for continuing previous log entry (#9610)

9 months ago  server : add newline after chat example (#9616)
StrangeBytesDev [Tue, 24 Sep 2024 06:04:39 +0000 (23:04 -0700)]
server : add newline after chat example (#9616)

9 months ago  sampling : avoid expensive softmax during greedy sampling (#9605)
Georgi Gerganov [Tue, 24 Sep 2024 06:03:17 +0000 (09:03 +0300)]
sampling : avoid expensive softmax during greedy sampling (#9605)

* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <redacted>
* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <redacted>
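
For illustration, greedy decoding only needs the index of the maximum logit, so the full softmax pass over the vocabulary can be skipped; a minimal sketch (names assumed, not the actual llama.cpp code):

    // Greedy sampling reduces to an argmax over the raw logits; no
    // softmax (exp + normalization over n_vocab entries) is required.
    static int greedy_argmax(const float * logits, int n_vocab) {
        int best = 0;
        for (int i = 1; i < n_vocab; ++i) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return best; // the greedily sampled token id
    }
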
9 months ago  threads: fix msvc build without openmp (#9615)
Max Krasnyansky [Tue, 24 Sep 2024 04:18:48 +0000 (21:18 -0700)]
threads: fix msvc build without openmp (#9615)

We're missing atomic_thread_fence() in MSVC builds when OpenMP is disabled.
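
As a rough sketch of the kind of shim this implies (illustrative only, assuming the Win32 MemoryBarrier() macro; the actual ggml fix may differ):

    // When OpenMP is disabled and C11 <stdatomic.h> is unavailable under
    // MSVC, a full memory fence has to be provided some other way.
    #if defined(_MSC_VER) && !defined(_OPENMP)
    #include <windows.h>
    static void atomic_thread_fence_shim(void) {
        MemoryBarrier(); // full compiler + hardware memory fence
    }
    #endif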

9 months ago  cuda: add q8_0->f32 cpy operation (#9571)
Ivan [Tue, 24 Sep 2024 00:14:24 +0000 (03:14 +0300)]
cuda: add q8_0->f32 cpy operation (#9571)

llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.

9 months ago  server : add --no-context-shift option (#9607)
Xuan Son Nguyen [Mon, 23 Sep 2024 20:23:54 +0000 (22:23 +0200)]
server : add --no-context-shift option (#9607)

* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <redacted>
* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  threads: improve ggml_barrier scaling with large number of threads (#9598)
Max Krasnyansky [Mon, 23 Sep 2024 18:42:43 +0000 (11:42 -0700)]
threads: improve ggml_barrier scaling with large number of threads (#9598)

Make sure n_barrier and n_barrier_passed do not share a cache line, to avoid cache-line bouncing.
This optimization shows performance improvements even for n_threads <= 8.

Resurrect the TSAN (Thread Sanitizer) check so that we can avoid doing an expensive read-modify-write
in the normal case and just use a thread fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.
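
A minimal sketch of the alignment idea (illustrative names and an assumed 64-byte cache line, not the exact ggml threadpool layout):

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t k_cache_line = 64; // assumed cache-line size

    // Keeping the two counters on separate cache lines prevents writes to
    // one from invalidating the line that the other's readers are polling.
    struct barrier_state {
        alignas(k_cache_line) std::atomic<int> n_barrier{0};        // incremented by each thread on entry
        alignas(k_cache_line) std::atomic<int> n_barrier_passed{0}; // bumped once by the last thread to release the rest
    };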

Co-authored-by: Willy Tarreau <redacted>
9 months ago  readme : add programmable prompt engine language CLI (#9599)
Riceball LEE [Mon, 23 Sep 2024 15:58:17 +0000 (23:58 +0800)]
readme : add programmable prompt engine language CLI (#9599)

9 months ago  flake.lock: Update (#9586)
Georgi Gerganov [Mon, 23 Sep 2024 15:43:40 +0000 (18:43 +0300)]
flake.lock: Update (#9586)

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <redacted>
9 months ago  ggml : AVX512 gemm for Q4_0_8_8 (#9532)
Srihari-mcw [Mon, 23 Sep 2024 14:06:38 +0000 (19:36 +0530)]
ggml : AVX512 gemm for Q4_0_8_8 (#9532)

* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  perplexity : remove extra new lines after chunks (#9596)
Georgi Gerganov [Mon, 23 Sep 2024 08:28:02 +0000 (11:28 +0300)]
perplexity : remove extra new lines after chunks (#9596)

9 months ago  metal : use F32 prec for K*Q in vec FA (#9595)
Georgi Gerganov [Mon, 23 Sep 2024 08:27:47 +0000 (11:27 +0300)]
metal : use F32 prec for K*Q in vec FA (#9595)

ggml-ci

9 months ago  Revert "[SYCL] fallback mmvq (#9088)" (#9579)
Akarshan Biswas [Mon, 23 Sep 2024 03:28:06 +0000 (08:58 +0530)]
Revert "[SYCL] fallback mmvq (#9088)" (#9579)

This reverts commit 50addec9a532a6518146ab837a85504850627316.

9 months ago  musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)
R0CKSTAR [Sun, 22 Sep 2024 14:55:49 +0000 (22:55 +0800)]
musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
9 months ago  Fix merge error in #9454 (#9589)
Molly Sophia [Sun, 22 Sep 2024 13:26:50 +0000 (21:26 +0800)]
Fix merge error in #9454 (#9589)

Signed-off-by: Molly Sophia <redacted>
9 months ago  CUDA: enable Gemma FA for HIP/Pascal (#9581)
Johannes Gäßler [Sun, 22 Sep 2024 07:34:52 +0000 (09:34 +0200)]
CUDA: enable Gemma FA for HIP/Pascal (#9581)

9 months ago  llama: remove redundant loop when constructing ubatch (#9574)
Shankar [Sun, 22 Sep 2024 02:30:34 +0000 (19:30 -0700)]
llama: remove redundant loop when constructing ubatch (#9574)

9 months ago  RWKV v6: RWKV_WKV op CUDA implementation (#9454)
Molly Sophia [Sun, 22 Sep 2024 02:29:12 +0000 (10:29 +0800)]
RWKV v6: RWKV_WKV op CUDA implementation (#9454)

* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <redacted>
* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
9 months ago  ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573)
slaren [Sat, 21 Sep 2024 12:24:23 +0000 (14:24 +0200)]
ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573)

9 months ago  Update CUDA graph on scale change plus clear nodes/params (#9550)
agray3 [Sat, 21 Sep 2024 00:41:07 +0000 (01:41 +0100)]
Update CUDA graph on scale change plus clear nodes/params (#9550)

* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize
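
Conceptually (a sketch with assumed field names, not the actual ggml-cuda logic), the saved graph is marked stale when the scale changes so the next evaluation re-captures it:

    // A captured CUDA graph bakes in kernel parameters such as the KQ
    // scale; replaying it after the scale changes would be wrong.
    struct cuda_graph_state {
        bool  instantiated  = false; // a captured graph exists
        float last_kq_scale = 0.0f;  // scale baked into that capture
    };

    static void maybe_invalidate(cuda_graph_state & g, float kq_scale) {
        if (g.instantiated && g.last_kq_scale != kq_scale) {
            g.instantiated  = false;    // force re-capture with the new scale
            g.last_kq_scale = kq_scale;
        }
    }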

9 months ago  CI: Provide prebuilt windows binary for hip (#9467)
Huang Qi [Sat, 21 Sep 2024 00:39:41 +0000 (08:39 +0800)]
CI: Provide prebuilt windows binary for hip (#9467)

9 months ago  quantize : improve type name parsing (#9570)
slaren [Fri, 20 Sep 2024 18:55:36 +0000 (20:55 +0200)]
quantize : improve type name parsing (#9570)

quantize : do not ignore invalid types in arg parsing

quantize : ignore case of type and ftype arguments
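
A small sketch of what case-insensitive matching amounts to (assumed helper, not the actual llama-quantize code):

    #include <algorithm>
    #include <cctype>
    #include <string>

    // Compare a user-supplied type name against a known name, ignoring case.
    static bool equals_icase(const std::string & a, const std::string & b) {
        return a.size() == b.size() &&
               std::equal(a.begin(), a.end(), b.begin(),
                          [](unsigned char x, unsigned char y) {
                              return std::tolower(x) == std::tolower(y);
                          });
    }
    // equals_icase(arg, "Q4_K_M") now accepts "q4_k_m", "Q4_k_M", etc.,
    // while an unmatched name is reported as an error instead of ignored.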

9 months ago  ggml : fix builds (#0)
Georgi Gerganov [Fri, 20 Sep 2024 17:12:52 +0000 (20:12 +0300)]
ggml : fix builds (#0)

ggml-ci

9 months ago  ggml : fix trailing whitespace (#0)
Georgi Gerganov [Fri, 20 Sep 2024 16:13:02 +0000 (19:13 +0300)]
ggml : fix trailing whitespace (#0)

ggml-ci

9 months ago  sync : ggml
Georgi Gerganov [Fri, 20 Sep 2024 16:06:59 +0000 (19:06 +0300)]
sync : ggml

ggml-ci

9 months ago  ggml/examples: add backend support for numerical optimization (ggml/949)
Johannes Gäßler [Fri, 20 Sep 2024 16:04:44 +0000 (19:04 +0300)]
ggml/examples: add backend support for numerical optimization (ggml/949)

* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <redacted>
* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <redacted>
* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <redacted>
* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <redacted>
* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <redacted>
* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
9 months ago  examples : add null threadpool args where needed (ggml/0)
Georgi Gerganov [Sun, 8 Sep 2024 08:10:43 +0000 (11:10 +0300)]
examples : add null threadpool args where needed (ggml/0)

ggml-ci

9 months ago  CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562)
Johannes Gäßler [Fri, 20 Sep 2024 16:35:35 +0000 (18:35 +0200)]
CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562)

9 months ago  examples : flush log upon ctrl+c (#9559)
Georgi Gerganov [Fri, 20 Sep 2024 08:46:56 +0000 (11:46 +0300)]
examples : flush log upon ctrl+c (#9559)

9 months ago  perplexity : do not escape input data by default (#9548)
Sigbjørn Skjæret [Fri, 20 Sep 2024 06:38:10 +0000 (08:38 +0200)]
perplexity : do not escape input data by default (#9548)

9 months ago  server : clean-up completed tasks from waiting list (#9531)
Georgi Gerganov [Thu, 19 Sep 2024 09:44:53 +0000 (12:44 +0300)]
server : clean-up completed tasks from waiting list (#9531)

ggml-ci

9 months ago  imatrix : disable prompt escape by default (#9543)
Sigbjørn Skjæret [Thu, 19 Sep 2024 07:58:14 +0000 (09:58 +0200)]
imatrix : disable prompt escape by default (#9543)

9 months ago  ggml : fix n_threads_cur initialization with one thread (#9538)
slaren [Wed, 18 Sep 2024 17:13:08 +0000 (19:13 +0200)]
ggml : fix n_threads_cur initialization with one thread (#9538)

* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <redacted>
9 months ago  scripts : verify py deps at the start of compare (#9520)
Georgi Gerganov [Wed, 18 Sep 2024 15:34:32 +0000 (18:34 +0300)]
scripts : verify py deps at the start of compare (#9520)

9 months ago  llama : use reserve/emplace_back in sampler_sample (#9534)
Daniel Bevenius [Wed, 18 Sep 2024 11:42:36 +0000 (13:42 +0200)]
llama : use reserve/emplace_back in sampler_sample (#9534)

This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.

The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
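
Illustratively (simplified, with the llama_token_data layout assumed as id/logit/p):

    #include <cstdint>
    #include <vector>

    struct llama_token_data { int32_t id; float logit; float p; };

    // Reserve once and construct each element in place, instead of
    // value-initializing n_vocab structs and immediately overwriting them.
    static std::vector<llama_token_data> collect_candidates(const float * logits, int32_t n_vocab) {
        std::vector<llama_token_data> cur;
        cur.reserve(n_vocab); // single allocation up front
        for (int32_t id = 0; id < n_vocab; ++id) {
            cur.emplace_back(llama_token_data{ id, logits[id], 0.0f });
        }
        return cur;
    }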

9 months ago  server : match OAI structured output response (#9527)
Vinesh Janarthanan [Wed, 18 Sep 2024 06:50:34 +0000 (01:50 -0500)]
server : match OAI structured output response (#9527)

9 months ago  server : fix OpenSSL build (remove obsolete `LOG_INFO`) (#9529)
Eric Zhang [Wed, 18 Sep 2024 06:28:20 +0000 (14:28 +0800)]
server : fix OpenSSL build (remove obsolete `LOG_INFO`) (#9529)

9 months ago  [SYCL]set context default value to avoid memory issue, update guide (#9476)
Neo Zhang Jianyu [Wed, 18 Sep 2024 00:30:31 +0000 (08:30 +0800)]
[SYCL]set context default value to avoid memory issue, update guide (#9476)

* set context default to avoid memory issue, update guide

* Update docs/backend/SYCL.md

Co-authored-by: Meng, Hengyu <redacted>
---------

Co-authored-by: arthw <redacted>
Co-authored-by: Meng, Hengyu <redacted>
9 months ago  llama-bench: correct argument parsing error message (#9524)
Michael Podvitskiy [Tue, 17 Sep 2024 20:41:38 +0000 (22:41 +0200)]
llama-bench: correct argument parsing error message (#9524)

9 months ago  arg : add env variable for parallel (#9513)
Bert Wagner [Tue, 17 Sep 2024 13:35:38 +0000 (09:35 -0400)]
arg : add env variable for parallel (#9513)

* add env variable for parallel

* Update README.md with env:  LLAMA_ARG_N_PARALLEL

9 months ago  llama : fix n_vocab init for 'no_vocab' case (#9511)
Michael Podvitskiy [Tue, 17 Sep 2024 10:18:22 +0000 (12:18 +0200)]
llama : fix n_vocab init for 'no_vocab' case (#9511)

* llama: fixed n_vocab for `no_vocab` models

* llama: updated error output for `llama_decode_internal` and `llama_encode_internal`

* llama: log warning if there's no vocab_size in metadata

* llama: correct vocab size for logging

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  threadpool : skip polling for unused threads (#9461)
Max Krasnyansky [Tue, 17 Sep 2024 08:19:46 +0000 (01:19 -0700)]
threadpool : skip polling for unused threads (#9461)

* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written
from one thread and read from other threads (not a race condition).

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through
a full memory barrier (memory fence) on ggml_barrier entrance and exit.

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.

Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.

While at it add an explicit atomic_load for n_threads_cur for consistency.

* test-barrier: release threadpool before releasing the context

Fixes a use-after-free detected by the GCC thread sanitizer on x86-64;
for some reason the LLVM sanitizer does not detect this issue.
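
A rough sketch of the "relaxed polling, full fence at the edges" pattern described above (illustrative, not the exact ggml code):

    #include <atomic>

    // Poll with relaxed loads to keep spinning cheap; issue one full
    // (sequentially consistent) fence only at the synchronization edge.
    static void wait_for_pass(std::atomic<int> & n_passed, int last_seen) {
        while (n_passed.load(std::memory_order_relaxed) == last_seen) {
            // spin: the relaxed load adds no extra ordering traffic
        }
        std::atomic_thread_fence(std::memory_order_seq_cst); // full fence on exit
    }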

9 months ago  unicode : add <algorithm> (#9508)
Yuri Khrustalev [Tue, 17 Sep 2024 06:51:15 +0000 (02:51 -0400)]
unicode : add <algorithm> (#9508)

9 months ago  llama : support IBM Granite architecture (#9412)
Gabe Goodhart [Tue, 17 Sep 2024 06:44:58 +0000 (00:44 -0600)]
llama : support IBM Granite architecture (#9412)

* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
9 months ago  llama : add llama_n_head() (#9512)
Michael Podvitskiy [Tue, 17 Sep 2024 06:23:30 +0000 (08:23 +0200)]
llama : add llama_n_head() (#9512)

9 months ago  ggml : move common CPU backend impl to new header (#9509)
slaren [Mon, 16 Sep 2024 14:22:07 +0000 (16:22 +0200)]
ggml : move common CPU backend impl to new header (#9509)

9 months ago  llama : rename n_embed to n_embd in rwkv6_time_mix (#9504)
Daniel Bevenius [Mon, 16 Sep 2024 11:07:13 +0000 (13:07 +0200)]
llama : rename n_embed to n_embd in rwkv6_time_mix (#9504)

This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix.

The motivation for this change is consistency with the other rwkv6
functions like build_rwkv6 (and other parts of the code base).

9 months ago  ggml : link MATH_LIBRARY not by its full path (#9339)
Michael Podvitskiy [Mon, 16 Sep 2024 11:06:50 +0000 (13:06 +0200)]
ggml : link MATH_LIBRARY not by its full path (#9339)

9 months ago  convert : identify missing model files (#9397)
compilade [Mon, 16 Sep 2024 07:30:22 +0000 (03:30 -0400)]
convert : identify missing model files (#9397)

9 months ago  cmake : do not hide GGML options + rename option (#9465)
Georgi Gerganov [Mon, 16 Sep 2024 07:27:50 +0000 (10:27 +0300)]
cmake : do not hide GGML options + rename option (#9465)

* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci

9 months ago  ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)
Eve [Mon, 16 Sep 2024 06:48:24 +0000 (06:48 +0000)]
ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)

* squashed

Re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049.

Have ggml_vec_dot_q4_0 do two blocks per loop for AVX.

Tried out an F16C ggml_vec_dot_iq4_nl, but it's not really faster; as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue.

* shuffle

* remove f16c iq4_nl as I can't make it faster than before

9 months ago  llama : support OLMoE (#9462)
Shane A [Mon, 16 Sep 2024 06:47:37 +0000 (23:47 -0700)]
llama : support OLMoE (#9462)

9 months ago  llama : support MiniCPM3 (#9322)
CarryFun [Mon, 16 Sep 2024 06:45:20 +0000 (14:45 +0800)]
llama : support MiniCPM3 (#9322)

Co-authored-by: 范睿凯 <redacted>
9 months ago  main : option to disable context shift (#9484)
Vinesh Janarthanan [Mon, 16 Sep 2024 06:20:01 +0000 (01:20 -0500)]
main : option to disable context shift (#9484)

* added cli arg to disable context shift

* reverted precommit

* updated README.md for main

* white space

* allow disabling context shift in the server

* Update common/arg.cpp

no-context-shift only works for main example

Co-authored-by: Georgi Gerganov <redacted>
* added server example to --no-context-shift args

* removed server changes

* white space

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  metal : handle zero-sized allocs (#9466)
Georgi Gerganov [Mon, 16 Sep 2024 06:05:56 +0000 (09:05 +0300)]
metal : handle zero-sized allocs (#9466)

9 months ago  flake.lock: Update (#9488)
Georgi Gerganov [Mon, 16 Sep 2024 02:14:23 +0000 (05:14 +0300)]
flake.lock: Update (#9488)

9 months ago  common : reimplement logging (#9418)
Georgi Gerganov [Sun, 15 Sep 2024 17:46:12 +0000 (20:46 +0300)]
common : reimplement logging (#9418)

https://github.com/ggerganov/llama.cpp/pull/9418

9 months ago  gguf-split : add basic checks (#9499)
slaren [Sun, 15 Sep 2024 17:02:27 +0000 (19:02 +0200)]
gguf-split : add basic checks (#9499)

* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed
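
For instance, the overwrite guard amounts to an existence check before writing (assumed shape, not the actual gguf-split code):

    #include <cstdio>
    #include <fstream>
    #include <string>

    // Refuse to clobber an existing output file when merging splits.
    static bool file_exists(const std::string & path) {
        return std::ifstream(path).good();
    }

    // before writing the merged output:
    //   if (file_exists(dst)) {
    //       fprintf(stderr, "error: %s already exists\n", dst.c_str());
    //       return 1;
    //   }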

9 months ago  cmake : correct order of sycl flags (#9497)
Michael Podvitskiy [Sun, 15 Sep 2024 16:55:52 +0000 (18:55 +0200)]
cmake : correct order of sycl flags (#9497)

9 months ago  py : add "LLaMAForCausalLM" conversion support (#9485)
Csaba Kecskemeti [Sun, 15 Sep 2024 07:48:25 +0000 (00:48 -0700)]
py : add "LLaMAForCausalLM" conversion support (#9485)

Co-authored-by: Csaba Kecskemeti <redacted>
9 months ago  readme : update tools list (#9475)
OSecret [Sun, 15 Sep 2024 07:36:53 +0000 (10:36 +0300)]
readme : update tools list (#9475)

* Added link to proprietary wrapper for Unity3d into README.md

The wrapper has a prebuilt library and was tested on iOS, Android, WebGL, PC, and Mac platforms; it has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).

* Update README.md

Fixes upon review

9 months ago  cmake : try to fix sycl+intel build (#9487)
Michael Podvitskiy [Sun, 15 Sep 2024 07:06:38 +0000 (09:06 +0200)]
cmake : try to fix sycl+intel build (#9487)

9 months ago  ggml : ggml_type_name return "NONE" for invalid values (#9458)
Yuri Khrustalev [Sat, 14 Sep 2024 09:54:37 +0000 (05:54 -0400)]
ggml : ggml_type_name return "NONE" for invalid values (#9458)

When running on Windows, the quantization utility attempts to print the types that are not set, which leads to a crash.
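
The fix amounts to a bounds check before the table lookup; a sketch with assumed names (the real ggml table differs):

    enum example_type { TYPE_F32, TYPE_F16, TYPE_COUNT };

    static const char * type_names[TYPE_COUNT] = { "f32", "f16" };

    static const char * example_type_name(int type) {
        // Printing an unset/invalid type must yield a placeholder instead
        // of reading past the end of the name table.
        if (type < 0 || type >= TYPE_COUNT) {
            return "NONE";
        }
        return type_names[type];
    }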

9 months ago  server: add data: [DONE] to /chat/completions stream response (#9459)
VoidIsVoid [Sat, 14 Sep 2024 09:36:44 +0000 (17:36 +0800)]
server: add data: [DONE] to /chat/completions stream response (#9459)

9 months ago  cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)
Georgi Gerganov [Sat, 14 Sep 2024 07:55:05 +0000 (10:55 +0300)]
cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)

* cmake : use list(APPEND ...) instead of set() + dedup linker

ggml-ci

* cmake : try fix sycl

* cmake : try to fix sycl 2

* cmake : fix sycl build (#9469)

* try fix sycl build

* use CMAKE_CXX_FLAGS as a string variable

---------

Co-authored-by: Georgi Gerganov <redacted>
* one more CMAKE_CXX_FLAGS fix (#9471)

---------

Co-authored-by: Michael Podvitskiy <redacted>
9 months ago  llama : make cell_id const in inp_s_mask block (#9470)
Daniel Bevenius [Sat, 14 Sep 2024 07:50:12 +0000 (09:50 +0200)]
llama : make cell_id const in inp_s_mask block (#9470)

This commit makes the cell_id variable const in the inp_s_mask block.

The motivation for this change is consistency with the code in the
inp_s_copy block.

9 months ago  server : add loading html page while model is loading (#9468)
Xuan Son Nguyen [Fri, 13 Sep 2024 12:23:11 +0000 (14:23 +0200)]
server : add loading html page while model is loading (#9468)

* Adding loading page for '/' server requests

* set content when model is loading

* removed loading html file

* updated cmakelist

* updated makefile

* cleaned up whitespace

* cleanup for PR removed error

* updated server test to handle 503 HTML

* updated server test to handle 503 HTML

* catch 503 before parsing json

* revert test

* account for both api and web browser requests

* precommit corrections

* eol fix

* revert changes to pre-commit

* removed print statement

* made loading message more descriptive

* also support .html files

---------

Co-authored-by: VJHack <redacted>
Co-authored-by: Vinesh Janarthanan <redacted>
9 months ago  llama : llama_perf + option to disable timings during decode (#9355)
Georgi Gerganov [Fri, 13 Sep 2024 06:53:38 +0000 (09:53 +0300)]
llama : llama_perf + option to disable timings during decode (#9355)

* llama : llama_perf + option to disable timings during decode

ggml-ci

* common : add llama_arg

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* perf : separate functions in the API

ggml-ci

* perf : safer pointer handling + naming update

ggml-ci

* minor : better local var name

* perf : abort on invalid sampler pointer

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  feat: remove a sampler from a chain (#9445)
Gilad S. [Fri, 13 Sep 2024 01:54:49 +0000 (04:54 +0300)]
feat: remove a sampler from a chain (#9445)

* feat: remove a sampler from a chain

* fix: return removed sampler

* fix: safer casting

9 months ago  server : Add option to return token pieces in /tokenize endpoint (#9108)
Mathijs Henquet [Thu, 12 Sep 2024 20:30:11 +0000 (22:30 +0200)]
server : Add option to return token pieces in /tokenize endpoint (#9108)

* server : added with_pieces functionality to /tokenize endpoint

* server : Add tokenize with pieces tests to server.feature

* Handle case if tokenizer splits along utf8 continuation bytes

* Add example of token splitting

* Remove trailing ws

* Fix trailing ws

* Maybe fix ci

* maybe this fixes windows ci?

---------

Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  cann: Add host buffer type for Ascend NPU (#9406)
Dou Xinpeng [Thu, 12 Sep 2024 11:46:43 +0000 (19:46 +0800)]
cann: Add host buffer type for Ascend NPU (#9406)

* feat: Add host buffer type for Ascend NPU(CANN backend)

* fix some checking errors

* Add a few comments

9 months ago  llava : fix the script error in MobileVLM README (#9054)
fengerhu1 [Thu, 12 Sep 2024 11:34:22 +0000 (19:34 +0800)]
llava : fix the script error in MobileVLM README (#9054)

Signed-off-by: Erhu Feng <redacted>
9 months ago  lora : raise error if lm_head is ignored (#9103)
Xuan Son Nguyen [Thu, 12 Sep 2024 11:33:57 +0000 (13:33 +0200)]
lora : raise error if lm_head is ignored (#9103)

* lora : raise error if lm_head is ignored

* fix style

* clarify comment

9 months ago  cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338)
Michael Podvitskiy [Thu, 12 Sep 2024 11:30:01 +0000 (13:30 +0200)]
cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338)

* `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC`

* Update CMakeLists.txt, spaces fix

9 months ago  ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329)
Huang Qi [Thu, 12 Sep 2024 11:28:43 +0000 (19:28 +0800)]
ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329)

9 months ago  py : add Phi-1.5/Phi-2 tokenizer (#9361)
daminho [Thu, 12 Sep 2024 11:28:20 +0000 (20:28 +0900)]
py : add Phi-1.5/Phi-2 tokenizer (#9361)

* add phi2 tokenizer

* add phi name to convert_hf_to_gguf_update.py

* make tokenizer_pre consistent; llama.cpp work

9 months ago  ci : bump actions/checkout to v4 (#9377)
Trivikram Kamat [Thu, 12 Sep 2024 11:27:45 +0000 (04:27 -0700)]
ci : bump actions/checkout to v4 (#9377)

9 months ago  cmake : fixed the order of linking libraries for llama-quantize (#9450)
Michael Podvitskiy [Thu, 12 Sep 2024 11:27:14 +0000 (13:27 +0200)]
cmake : fixed the order of linking libraries for llama-quantize (#9450)

9 months ago  py : add special tokens in hf_converter for RWKV v6 (#9428)
Molly Sophia [Thu, 12 Sep 2024 11:25:16 +0000 (19:25 +0800)]
py : add special tokens in hf_converter for RWKV v6 (#9428)

Signed-off-by: Molly Sophia <redacted>
9 months ago  riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)
Ahmad Tameem [Thu, 12 Sep 2024 11:24:31 +0000 (16:24 +0500)]
riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)

- Added ggml_cpu_has_riscv_v() in GGML to print system info in the log
- Modified the Makefile to only use the flag when cross-compiling for RISC-V
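
A sketch of what such a compile-time capability probe typically looks like (assumed macro and shape; the real implementation may differ):

    // Report RISC-V Vector support based on a compiler-provided macro;
    // the result is only used to print system info in the log.
    int ggml_cpu_has_riscv_v_example(void) {
    #if defined(__riscv_v_intrinsic)
        return 1;
    #else
        return 0;
    #endif
    }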

9 months ago  ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)
Georgi Gerganov [Thu, 12 Sep 2024 11:23:49 +0000 (14:23 +0300)]
ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)

* ggml : hide ggml_object, ggml_cgraph, ggml_hash_set

ggml-ci

* ggml : add ggml-impl.h to backends

* ggml : fix compiler warnings

ggml-ci

* ggml : add assert upon adding nodes

9 months ago  enhance run script to be easy to change the parameters (#9448)
Neo Zhang Jianyu [Thu, 12 Sep 2024 09:44:17 +0000 (17:44 +0800)]
enhance run script to be easy to change the parameters (#9448)

Co-authored-by: arthw <redacted>
9 months ago  cann: Fix error when running a non-exist op (#9424)
Xinpeng Dou [Thu, 12 Sep 2024 01:02:35 +0000 (09:02 +0800)]
cann: Fix error when running a non-exist op (#9424)

9 months ago  Add Jais to list of supported models (#9439)
Faisal Zaghloul [Thu, 12 Sep 2024 00:29:53 +0000 (20:29 -0400)]
Add Jais to list of supported models (#9439)

Co-authored-by: fmz <redacted>
9 months ago  llama : skip token bounds check when evaluating embeddings (#9437)
slaren [Wed, 11 Sep 2024 15:52:13 +0000 (17:52 +0200)]
llama : skip token bounds check when evaluating embeddings (#9437)

9 months ago  py : support converting local models (#7547)
Pavel Zloi [Wed, 11 Sep 2024 12:29:51 +0000 (15:29 +0300)]
py : support converting local models (#7547)

* Support of converting local models added to convert-hf-to-gguf-update.py

* Description fixed

* shutil added to imports

9 months ago  llava : correct args for minicpmv-cli (#9429)
Xuan Son Nguyen [Wed, 11 Sep 2024 10:59:13 +0000 (12:59 +0200)]
llava : correct args for minicpmv-cli (#9429)

9 months ago  files : remove accidentally added `lora_test` submodule (#9430)
Xuan Son Nguyen [Wed, 11 Sep 2024 10:02:09 +0000 (12:02 +0200)]
files : remove accidentally added `lora_test` submodule (#9430)

9 months ago  feat: Implements retrying logic for downloading models using --model-url flag (#9255)
Farbod Bijary [Wed, 11 Sep 2024 09:22:37 +0000 (12:52 +0330)]
feat: Implements retrying logic for downloading models using --model-url flag (#9255)

* feat: Implements retrying logic for downloading models using --model-url flag

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* apply comments

* implements a retry function to avoid duplication

* fix editorconfig

* change function name

---------

Co-authored-by: farbod <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
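
A condensed sketch of such a retry helper (illustrative names; the real common.cpp helper may differ):

    #include <chrono>
    #include <functional>
    #include <thread>

    // Retry a download attempt a fixed number of times, sleeping between
    // attempts, so each call site does not duplicate the loop.
    static bool with_retries(const std::function<bool()> & attempt,
                             int max_attempts, int delay_ms) {
        for (int i = 1; i <= max_attempts; ++i) {
            if (attempt()) {
                return true;
            }
            if (i < max_attempts) {
                std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
            }
        }
        return false;
    }
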
9 months ago  CUDA: fix --split-mode row race condition (#9413)
Johannes Gäßler [Wed, 11 Sep 2024 08:22:40 +0000 (10:22 +0200)]
CUDA: fix --split-mode row race condition (#9413)

9 months ago  batched-bench : remove unused code (#9305)
Georgi Gerganov [Wed, 11 Sep 2024 07:03:54 +0000 (10:03 +0300)]
batched-bench : remove unused code (#9305)

9 months ago  musa: remove Clang builtins mapping (#9421)
R0CKSTAR [Wed, 11 Sep 2024 01:46:55 +0000 (09:46 +0800)]
musa: remove Clang builtins mapping (#9421)

Signed-off-by: Xiaodong Ye <redacted>
9 months ago  sycl : update support conditions (#9394)
Alberto Cabrera Pérez [Wed, 11 Sep 2024 00:53:42 +0000 (01:53 +0100)]
sycl : update support conditions (#9394)

* sycl : update support condition to im2col

Signed-off-by: Alberto Cabrera <redacted>
* Added TODO as a reminder to support FP32 im2col

---------

Signed-off-by: Alberto Cabrera <redacted>
9 months ago  flake.lock: Update (#9360)
Georgi Gerganov [Tue, 10 Sep 2024 22:46:59 +0000 (01:46 +0300)]
flake.lock: Update (#9360)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
  → 'github:hercules-ci/flake-parts/567b938d64d4b4112ee253b9274472dc3a346eb6?narHash=sha256-%2Bebgonl3NbiKD2UD0x4BszCZQ6sTfL4xioaM49o5B3Y%3D' (2024-09-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/a5d394176e64ab29c852d03346c1fc9b0b7d33eb.tar.gz?narHash=sha256-uFf2QeW7eAHlYXuDktm9c25OxOyCoUOQmh5SZ9amE5Q%3D' (2024-08-01)
  → 'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)
  → 'github:NixOS/nixpkgs/574d1eac1c200690e27b8eb4e24887f8df7ac27c?narHash=sha256-v3rIhsJBOMLR8e/RNWxr828tB%2BWywYIoajrZKFM%2B0Gg%3D' (2024-09-06)

Co-authored-by: github-actions[bot] <redacted>
9 months ago  arg : bring back missing ifdef (#9411)
Xuan Son Nguyen [Tue, 10 Sep 2024 20:41:29 +0000 (22:41 +0200)]
arg : bring back missing ifdef (#9411)

* arg : bring back missing ifdef

* replace with llama_supports_gpu_offload

9 months ago  enable --special arg for llama-server (#9419)
matteo [Tue, 10 Sep 2024 20:40:59 +0000 (22:40 +0200)]
enable --special arg for llama-server (#9419)

Co-authored-by: matteo serva <redacted>
9 months ago  llama : move random seed generation to the samplers (#9398)
slaren [Tue, 10 Sep 2024 16:04:25 +0000 (18:04 +0200)]
llama : move random seed generation to the samplers (#9398)

* llama_sampler_penalties : clamp penalty_last_n to zero
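
The clamp itself is a one-liner; illustratively (assumed types):

    #include <algorithm>
    #include <cstdint>

    // A negative penalty_last_n would be a nonsensical window size, so it
    // is clamped to zero before use.
    static int32_t clamp_last_n(int32_t penalty_last_n) {
        return std::max<int32_t>(0, penalty_last_n);
    }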

9 months ago  metal : fix compile warning with GGML_METAL_NDEBUG (#0)
Georgi Gerganov [Tue, 10 Sep 2024 07:17:03 +0000 (10:17 +0300)]
metal : fix compile warning with GGML_METAL_NDEBUG (#0)

9 months ago  llama : update llm_build_copy_mask_state comment [no ci] (#9385)
Daniel Bevenius [Tue, 10 Sep 2024 07:03:21 +0000 (09:03 +0200)]
llama : update llm_build_copy_mask_state comment [no ci] (#9385)

This commit updates a comment in the copy_mask_state function that seems
to contain a typo or be outdated, changing the variable name n_rs to n_kv.

I believe this change is correct: what the comment wants to convey is to
copy the states that are not going to be used in the upcoming processing,
which are the token states from n_seqs up to the number of possible token
states n_kv.

9 months ago  RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)
Molly Sophia [Tue, 10 Sep 2024 07:02:30 +0000 (15:02 +0800)]
RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)

Signed-off-by: Molly Sophia <redacted>