git.djapps.eu Git - pkg/ggml/sources/ggml/log
9 months ago  llama : support RWKV v6 models (llama/8980)
Molly Sophia [Sun, 1 Sep 2024 14:38:17 +0000 (22:38 +0800)]
llama : support RWKV v6 models (llama/8980)

* convert_hf_to_gguf: Add support for RWKV v6

Signed-off-by: Molly Sophia <redacted>
* Add RWKV tokenization

* Fix build

Signed-off-by: Molly Sophia <redacted>
* Do not use special tokens when matching in RWKV tokenizer

* Fix model loading

* Add (broken) placeholder graph builder for RWKV

* Add workaround for kv cache

* Add logits conversion to rwkv5

* Add rwkv5 layer norms

* Add time mix KVRG & correct merge mistake

* Add remaining time mix parameters

* Add time mix output loading

* Add placeholder llm_build_time_mix

* Fix build

Signed-off-by: Molly Sophia <redacted>
* Load more tensors for rwkv v6

Signed-off-by: Molly Sophia <redacted>
* Fix rwkv tokenizer

Signed-off-by: Molly Sophia <redacted>
* ggml: Add unary operator Exp (a usage sketch appears at the end of this entry)

Signed-off-by: Molly Sophia <redacted>
* RWKV v6 graph building

Signed-off-by: Molly Sophia <redacted>
* Add ``rescale_every_n_layers`` parameter

Signed-off-by: Molly Sophia <redacted>
* Add ``wkv.head_size`` key for RWKV

so it doesn't reuse Mamba ssm parameters

Signed-off-by: Molly Sophia <redacted>
* Fix offloading layers to CUDA

Signed-off-by: Molly Sophia <redacted>
* Fix parallel inferencing for RWKV

Signed-off-by: Molly Sophia <redacted>
* Remove trailing whitespaces

Signed-off-by: Molly Sophia <redacted>
* build_rwkv: Avoid using inplace operations

Signed-off-by: Molly Sophia <redacted>
* convert_hf_to_gguf: rwkv: Avoid using ``eval``

Signed-off-by: Molly Sophia <redacted>
* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually

Signed-off-by: Molly Sophia <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: compilade <redacted>
* ggml: Add backward computation for unary op ``exp``

Signed-off-by: Molly Sophia <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: compilade <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: compilade <redacted>
* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV

Signed-off-by: Molly Sophia <redacted>
* build_rwkv6: Simplify graph

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Detect model.type

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Fix tensor loading for 7B/14B models

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Fix group_norm assertion failure with Metal

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Clean up

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Add quantization tensor exclusion

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Use the new advanced batch splits

Signed-off-by: Molly Sophia <redacted>
* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``

Co-authored-by: compilade <redacted>
* llama: rwkv6: Apply code style and misc changes

Signed-off-by: Molly Sophia <redacted>
* converter: Use class name ``Rwkv6Model``

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Make use of key ``feed_forward_length``

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``

Signed-off-by: Molly Sophia <redacted>
* converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Keep ``time_mix_w1/w2`` as F32

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Remove unused nodes

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Apply code format changes

Signed-off-by: Molly Sophia <redacted>
* llama: rwkv6: Add lora for some supported tensors

Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight

Signed-off-by: Molly Sophia <redacted>
* rwkv : speed-up tokenization using trie

* minor : style + indentation

* llama: rwkv6: Avoid division by zero

Co-authored-by: compilade <redacted>
* ggml: rwkv_wkv: Avoid copying the state

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
Co-authored-by: Layl Bongers <redacted>
Co-authored-by: compilade <redacted>
Co-authored-by: Georgi Gerganov <redacted>
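
A minimal sketch of how the new unary Exp operator mentioned above ("ggml: Add unary operator Exp") might be exercised through the public ggml API. The graph-building helpers are the usual ggml calls; the operator name `ggml_exp` is taken from the entry and should be checked against ggml.h.

```cpp
// Sketch: build a tiny graph that applies the ggml_exp unary operator.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_set_f32(x, 1.0f);                      // fill the input with ones

    struct ggml_tensor * y = ggml_exp(ctx, x);  // element-wise exp, the new unary op

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    printf("exp(1) ~= %f\n", ggml_get_f32_1d(y, 0)); // expect ~2.718
    ggml_free(ctx);
    return 0;
}
```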
9 months ago  Threadpool: take 2 (llama/8672)
Faisal Zaghloul [Thu, 29 Aug 2024 23:20:53 +0000 (19:20 -0400)]
Threadpool: take 2 (llama/8672)

* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android build issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start the threadpool in a paused state only if we have two.
The resume is now implicit (i.e. new work), which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify the "polling level", i.e. how aggressively we poll before waiting on the cond. var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (i.e. 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve. (A standalone sketch of this polling scheme appears at the end of this entry.)

* threadpool: reduce the number of barriers required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

A full memory barrier is overkill for this since each thread works on a different chunk.

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask related functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for the explicit mask_specified param.
An all-zero cpumask means use the default (usually inherited) cpu affinity mask.

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool off between tests --delay

This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directly.

* threadpool: move all pause/resume logic into ggml

* threadpool: further api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setpriority error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <redacted>
* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <redacted>
Co-authored-by: fmz <redacted>
Co-authored-by: Max Krasnyansky <redacted>
Co-authored-by: slaren <redacted>
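
As a companion to the hybrid-polling item above, here is a rough standalone sketch of the "polling level" idea: spin for poll x 128K rounds with cheap relaxed loads, then fall back to sleeping on a condition variable. The names and structure are illustrative, not the actual ggml threadpool code.

```cpp
// Sketch of "polling level" semantics: spin for poll*128K rounds, then sleep on a cond var.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct work_queue {
    std::atomic<uint64_t>   n_graphs{0};  // incremented by the main thread for each new graph
    std::mutex              mtx;
    std::condition_variable cv;
};

// Wait until a graph newer than `seen` is published. poll=0 disables spinning entirely.
uint64_t wait_for_work(work_queue & q, uint64_t seen, int poll) {
    const uint64_t spin_rounds = (uint64_t) poll * 128 * 1024;      // poll=50 -> 50x128K rounds
    for (uint64_t i = 0; i < spin_rounds; ++i) {
        uint64_t cur = q.n_graphs.load(std::memory_order_relaxed);  // cheap relaxed poll
        if (cur != seen) return cur;
    }
    std::unique_lock<std::mutex> lock(q.mtx);                       // fall back to sleeping
    q.cv.wait(lock, [&] { return q.n_graphs.load() != seen; });
    return q.n_graphs.load();
}
```

The producer side is assumed to increment n_graphs and call notify_all() while holding the mutex, so a sleeping worker cannot miss a wakeup.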
9 months ago  vulkan: fix compilation with GGML_VULKAN_DEBUG=ON (#948)
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:33 +0000 (14:34 +0200)]
vulkan: fix compilation with GGML_VULKAN_DEBUG=ON (#948)

the old code was trying to print a non-existent field (size)
and the struct as a whole (which doesn't have an operator<<
override defined).
A typo probably slipped in during refactoring.

Signed-off-by: Salvatore Mesoraca <redacted>
9 months ago  vulkan: add dryrun support to sin and cos ops (#947)
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:25 +0000 (14:34 +0200)]
vulkan: add dryrun support to sin and cos ops (#947)

sin and cos failed test-backend-ops because they
tried to dereference a context pointer that is null
on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca <redacted>
9 months ago  vulkan: correctly report support for OP_CONT (#946)
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:07 +0000 (14:34 +0200)]
vulkan: correctly report support for OP_CONT (#946)

test-backend-ops fails because ggml_cont aborts
when invoked with an unsupported type.

This commit makes the ggml_cont tests pass.

Signed-off-by: Salvatore Mesoraca <redacted>
9 months ago  tests: add gradient tests for all backends (#932)
Johannes Gäßler [Tue, 3 Sep 2024 15:21:46 +0000 (17:21 +0200)]
tests: add gradient tests for all backends (#932)

* tests: add gradient checking to test-backend-ops

* remove old comment

* reorder includes

* adjust SIN/COS parameters

* add documentation, use supports_op if possible

9 months ago  ggml: fix ggml_graph_cpy undefined behavior (#943)
Johannes Gäßler [Sat, 31 Aug 2024 12:35:42 +0000 (14:35 +0200)]
ggml: fix ggml_graph_cpy undefined behavior (#943)

10 months ago  cann : fix doxy (#0)
Georgi Gerganov [Wed, 28 Aug 2024 15:45:01 +0000 (18:45 +0300)]
cann : fix doxy (#0)

10 months ago  sync : whisper.cpp
Georgi Gerganov [Wed, 28 Aug 2024 15:35:53 +0000 (18:35 +0300)]
sync : whisper.cpp

10 months ago  examples : use colorblind friendly TTY color scheme (whisper/2360)
Justine Tunney [Tue, 20 Aug 2024 07:49:10 +0000 (00:49 -0700)]
examples : use colorblind friendly TTY color scheme (whisper/2360)

This change updates the -pc flag, so that a new xterm256 color scheme is
used. This color scheme is believed to be better for three reasons:

1. It should be friendlier to the colorblind. The scheme was designed by
   Paul Tol (see: https://personal.sron.nl/~pault/). TensorBoard has used it
   since 2017, so it's already popular in the machine learning community.

2. It should appear to be the same colors as before to people who aren't
   colorblind, i.e. it's still a red-green spectrum like before but lightly modified.

3. It is readable in both white and black background terminals. The neon
   colors before were probably a bit too intense for white backgrounds.

10 months ago  cann : add Ascend NPU support (whisper/2336)
Mengqing Cao [Fri, 9 Aug 2024 12:21:56 +0000 (20:21 +0800)]
cann : add Ascend NPU support (whisper/2336)

* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp

10 months ago  sync : llama.cpp
Georgi Gerganov [Wed, 28 Aug 2024 15:32:09 +0000 (18:32 +0300)]
sync : llama.cpp

10 months ago  scripts : sync vulkan shaders generator
Georgi Gerganov [Wed, 28 Aug 2024 15:31:53 +0000 (18:31 +0300)]
scripts : sync vulkan shaders generator

10 months ago  vulkan : fix build (llama/0)
Georgi Gerganov [Tue, 27 Aug 2024 19:10:58 +0000 (22:10 +0300)]
vulkan : fix build (llama/0)

ggml-ci

10 months ago  tests : disable BF16 cont test
Georgi Gerganov [Wed, 28 Aug 2024 14:22:56 +0000 (17:22 +0300)]
tests : disable BF16 cont test

10 months ago  cuda : mark BF16 CONT as unsupported
Georgi Gerganov [Wed, 28 Aug 2024 14:08:03 +0000 (17:08 +0300)]
cuda : mark BF16 CONT as unsupported

10 months ago  ggml : fix cont with transposed tensors when one dimension is 1 (#934)
Salvatore Mesoraca [Wed, 28 Aug 2024 08:23:02 +0000 (10:23 +0200)]
ggml : fix cont with transposed tensors when one dimension is 1 (#934)

* ggml_cont: fix issue with transposed tensors when one dimension is 1

When using multiple threads, it is not enough
to check that the tensors are contiguous for
ggml_compute_forward_dup_same_cont to work correctly.
The tensors' strides also need to match (see the sketch after this entry).

Signed-off-by: Salvatore Mesoraca <redacted>
* Add ggml_cont tests

Signed-off-by: Salvatore Mesoraca <redacted>
* Remove dead code

it isn't possible to reach this code because
all these functions are invoked by ggml_compute_forward_dup
if and only if src0->type != dst->type

Signed-off-by: Salvatore Mesoraca <redacted>
* Make ggml_compute_forward_dup_same_cont work with contiguous tensors

Co-authored-by: Georgi Gerganov <redacted>
Signed-off-by: Salvatore Mesoraca <redacted>
---------

Signed-off-by: Salvatore Mesoraca <redacted>
Co-authored-by: Georgi Gerganov <redacted>
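
A hedged sketch of the layout check behind the first item in this entry: the same-layout fast path in dup/cont is only safe when the shapes and the byte strides of src and dst both match. The helper name is illustrative; ne/nb and GGML_MAX_DIMS are the public ggml_tensor fields.

```cpp
// Sketch: the fast "same layout" copy is only safe when shapes AND strides match.
#include "ggml.h"

static bool same_shape_and_strides(const struct ggml_tensor * src, const struct ggml_tensor * dst) {
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        if (src->ne[i] != dst->ne[i]) return false; // same number of elements per dim
        if (src->nb[i] != dst->nb[i]) return false; // same byte stride per dim
    }
    return true;
}

// illustrative use in a dup/cont fast path:
// if (ggml_is_contiguous(src) && ggml_is_contiguous(dst) && same_shape_and_strides(src, dst)) { /* memcpy path */ }
```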
10 months ago  ci : fix yolo check
Georgi Gerganov [Wed, 28 Aug 2024 07:51:38 +0000 (10:51 +0300)]
ci : fix yolo check

10 months ago  sync : llama.cpp
Georgi Gerganov [Tue, 27 Aug 2024 19:41:59 +0000 (22:41 +0300)]
sync : llama.cpp

10 months ago  ci : disable mnist test
Georgi Gerganov [Tue, 27 Aug 2024 18:53:12 +0000 (21:53 +0300)]
ci : disable mnist test

ggml-ci

10 months ago  sync : vulkan (skip) (llama/0)
Georgi Gerganov [Tue, 27 Aug 2024 18:48:22 +0000 (21:48 +0300)]
sync : vulkan (skip) (llama/0)

10 months ago  sync : llama.cpp
Georgi Gerganov [Tue, 27 Aug 2024 18:35:06 +0000 (21:35 +0300)]
sync : llama.cpp

10 months ago  ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192)
slaren [Mon, 26 Aug 2024 17:44:43 +0000 (19:44 +0200)]
ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192)

10 months ago  metal : separate scale and mask from QKT in FA kernel (llama/9189)
Georgi Gerganov [Mon, 26 Aug 2024 15:31:02 +0000 (18:31 +0300)]
metal : separate scale and mask from QKT in FA kernel (llama/9189)

* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory

10 months ago  ggml : add SSM Metal kernels (llama/8546)
Georgi Gerganov [Mon, 26 Aug 2024 14:55:36 +0000 (17:55 +0300)]
ggml : add SSM Metal kernels (llama/8546)

* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci

10 months ago  metal : gemma2 flash attention support (llama/9159)
slaren [Mon, 26 Aug 2024 09:08:59 +0000 (11:08 +0200)]
metal : gemma2 flash attention support (llama/9159)

10 months ago  CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
Johannes Gäßler [Sat, 24 Aug 2024 19:34:59 +0000 (21:34 +0200)]
CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)

* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check

10 months ago  Add a space to suppress a cmake warning (llama/9133)
Akarshan Biswas [Thu, 22 Aug 2024 14:09:47 +0000 (19:39 +0530)]
Add a space to suppress a cmake warning (llama/9133)

10 months ago  Add oneDNN primitive support (llama/9091)
luoyu-intel [Thu, 22 Aug 2024 04:50:10 +0000 (12:50 +0800)]
Add oneDNN primitive support (llama/9091)

* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc

10 months ago  llama : simplify Mamba with advanced batch splits (llama/8526)
compilade [Wed, 21 Aug 2024 21:58:11 +0000 (17:58 -0400)]
llama : simplify Mamba with advanced batch splits (llama/8526)

* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <redacted>
* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2. (A usage sketch follows this entry.)

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <redacted>
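
A small usage sketch for the llama_model_is_recurrent helper introduced above, assuming the signature exposed in llama.h; the surrounding logic is illustrative.

```cpp
// Sketch: branch on the recurrent-model helper instead of hard-coding architecture checks.
#include "llama.h"

static bool uses_kv_cache_shifting(const struct llama_model * model) {
    // recurrent models (Mamba, and later RWKV-v6) keep fixed-size per-sequence state,
    // so KV-cache style context shifting does not apply to them
    return !llama_model_is_recurrent(model);
}
```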
10 months ago  fallback mmvq (llama/9088)
Meng, Hengyu [Tue, 20 Aug 2024 15:50:17 +0000 (23:50 +0800)]
fallback mmvq (llama/9088)

* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <redacted>
---------

Co-authored-by: Alberto Cabrera Pérez <redacted>
10 months ago  Fix SYCL `im2col` and `convert` Overflow with Large Dims (llama/9052)
zhentaoyu [Tue, 20 Aug 2024 15:06:51 +0000 (23:06 +0800)]
Fix SYCL `im2col` and `convert` Overflow with Large Dims (llama/9052)

* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <redacted>
* sycl: fix convert overflow

Signed-off-by: zhentaoyu <redacted>
* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <redacted>
* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <redacted>
* sycl:refine convert

Signed-off-by: zhentaoyu <redacted>
* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <redacted>
* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <redacted>
* test: make new cases only in sycl

Signed-off-by: zhentaoyu <redacted>
* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <redacted>
---------

Signed-off-by: zhentaoyu <redacted>
10 months ago  rpc : print error message when failed to connect endpoint (llama/9042)
Radoslav Gerganov [Mon, 19 Aug 2024 07:11:45 +0000 (10:11 +0300)]
rpc : print error message when failed to connect endpoint (llama/9042)

10 months ago  rpc : prevent crashes on invalid input (llama/9040)
Radoslav Gerganov [Mon, 19 Aug 2024 07:10:21 +0000 (10:10 +0300)]
rpc : prevent crashes on invalid input (llama/9040)

Add more checks which prevent the RPC server from crashing if invalid input
is received from the client

10 months ago  ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)
Nico Bosshard [Fri, 16 Aug 2024 02:22:55 +0000 (04:22 +0200)]
ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)

* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes

10 months ago  cmake : remove unused option GGML_CURL (llama/9011)
Georgi Gerganov [Wed, 14 Aug 2024 06:14:49 +0000 (09:14 +0300)]
cmake : remove unused option GGML_CURL (llama/9011)

10 months ago  ggml : move rope type enum to ggml.h (llama/8949)
Daniel Bevenius [Tue, 13 Aug 2024 19:13:15 +0000 (21:13 +0200)]
ggml : move rope type enum to ggml.h (llama/8949)

* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and adds them back to the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change (see the sketch after this entry).

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <redacted>
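
The end result of the squashed changes above is that the rope "mode" stays a bit field and the GPT-NeoX variant is detected with a define rather than an enum; roughly (GGML_ROPE_TYPE_NEOX comes from ggml.h, the wrapper is illustrative):

```cpp
// Sketch: test the rope mode bit field against the GGML_ROPE_TYPE_NEOX define.
#include "ggml.h"

static bool rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0; // GPT-NeoX style rotation
}
```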
10 months ago  ggml: fix div-by-zero (llama/9003)
DavidKorczynski [Mon, 12 Aug 2024 12:21:41 +0000 (13:21 +0100)]
ggml: fix div-by-zero (llama/9003)

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724
In order to access the above bug you need to login using one of the
emails in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <redacted>
10 months ago  Optimize Vulkan backend for better CPU performance and less GPU synchronization overh...
Markus Tavenrath [Sun, 11 Aug 2024 08:09:09 +0000 (10:09 +0200)]
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding barriers only for shader reads/writes and transfers seems to be sufficient, looking at the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <redacted>
10 months ago  feat: ref. cross entropy, add CUDA, fix grad test (#929)
Johannes Gäßler [Tue, 27 Aug 2024 18:39:30 +0000 (20:39 +0200)]
feat: ref. cross entropy, add CUDA, fix grad test (#929)

10 months ago  tests : fix memory leaks (#936)
Salvatore Mesoraca [Tue, 27 Aug 2024 06:25:12 +0000 (08:25 +0200)]
tests : fix memory leaks (#936)

It is annoying to run the tests using the sanitizers
because of all the uninteresting reports about the memory
leaked by the tests themselves.

Signed-off-by: Salvatore Mesoraca <redacted>
10 months ago  ggml: remove bad assert (#928)
Johannes Gäßler [Sat, 24 Aug 2024 17:27:02 +0000 (19:27 +0200)]
ggml: remove bad assert (#928)

10 months ago  set NULL to ggml_context pointer to pass assert check in case some compiler does...
ucag.li [Thu, 22 Aug 2024 19:49:45 +0000 (03:49 +0800)]
set the ggml_context pointer to NULL to pass the assert check, in case some compiler does not set uninitialized pointers to NULL, for the mnist example

10 months ago  examples: add MNIST training + missing ops
Johannes Gäßler [Tue, 30 Jul 2024 13:56:35 +0000 (15:56 +0200)]
examples: add MNIST training + missing ops

10 months ago  yolo : add backend support (#924)
Radoslav Gerganov [Mon, 19 Aug 2024 07:09:33 +0000 (10:09 +0300)]
yolo : add backend support (#924)

* yolo : add backend support

* metal : add sub and sqrt kernels

---------

Co-authored-by: Georgi Gerganov <redacted>
10 months ago  ggml : fix typo in ggml-quants.c comment (#922)
Daniel Bevenius [Thu, 15 Aug 2024 07:42:38 +0000 (09:42 +0200)]
ggml : fix typo in ggml-quants.c comment (#922)

10 months ago  feat: add new `sin` and `cos` operators (#919)
Ronsor [Mon, 12 Aug 2024 13:02:08 +0000 (06:02 -0700)]
feat: add new `sin` and `cos` operators (#919)

* ggml : add sin/cos operators

* ggml-cuda : add sin/cos operators

* ggml : add corresponding tests for sin/cos

* ggml : add backward computation for sin/cos operators

* ggml-vulkan : add sin/cos operators

* ggml-vulkan : add sin/cos shader source

* metal : add sin, cos

---------

Co-authored-by: Georgi Gerganov <redacted>
10 months ago  ggml : support forward pass broadcasting in ggml_sub (#914)
Salvatore Mesoraca [Sun, 11 Aug 2024 08:08:53 +0000 (10:08 +0200)]
ggml : support forward pass broadcasting in ggml_sub (#914)

* ggml: support forward pass broadcasting in ggml_sub

Signed-off-by: Salvatore Mesoraca <redacted>
* Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32

The check is already performed in ggml_sub_impl

Signed-off-by: Salvatore Mesoraca <redacted>
---------

Signed-off-by: Salvatore Mesoraca <redacted>
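
A sketch of what forward-pass broadcasting in ggml_sub allows: subtracting a single-column tensor from every column of a matrix without first materialising a repeated copy. The broadcasting rule (src1 repeatable onto src0) is assumed from how the other ggml binary ops behave.

```cpp
// Sketch: with broadcasting, b (one column) is subtracted from every column of a.
#include "ggml.h"

struct ggml_tensor * sub_rowwise(struct ggml_context * ctx,
                                 struct ggml_tensor * a,   // shape [ne0, ne1]
                                 struct ggml_tensor * b) { // shape [ne0, 1], broadcast along dim 1
    return ggml_sub(ctx, a, b); // previously b had to be ggml_repeat()-ed to a's shape first
}
```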
10 months ago  sync : llama.cpp
Georgi Gerganov [Sun, 11 Aug 2024 08:06:21 +0000 (11:06 +0300)]
sync : llama.cpp

10 months ago  metal : fix uninitialized abort_callback (llama/8968)
slaren [Sat, 10 Aug 2024 13:42:10 +0000 (15:42 +0200)]
metal : fix uninitialized abort_callback (llama/8968)

10 months ago  sync : llama.cpp
Georgi Gerganov [Sat, 10 Aug 2024 06:51:19 +0000 (09:51 +0300)]
sync : llama.cpp

10 months ago  rpc : sanitize tensor data + warnings (llama/0)
Georgi Gerganov [Fri, 9 Aug 2024 20:03:21 +0000 (23:03 +0300)]
rpc : sanitize tensor data + warnings (llama/0)

Co-authored-by: slaren <redacted>
10 months ago  sync : whisper.cpp
Georgi Gerganov [Fri, 9 Aug 2024 07:03:29 +0000 (10:03 +0300)]
sync : whisper.cpp

10 months ago  whisper : use vulkan as gpu backend when available (whisper/2302)
Matt Stephenson [Tue, 16 Jul 2024 07:21:09 +0000 (03:21 -0400)]
whisper : use vulkan as gpu backend when available (whisper/2302)

* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <redacted>
* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <redacted>
---------

Signed-off-by: Matt Stephenson <redacted>
10 months ago  ggml : add CANN backend (llama/0)
hipudding [Thu, 8 Aug 2024 11:48:06 +0000 (14:48 +0300)]
ggml : add CANN backend (llama/0)

ggml-ci

10 months ago  sync : vulkan (llama/0)
Georgi Gerganov [Thu, 8 Aug 2024 11:46:24 +0000 (14:46 +0300)]
sync : vulkan (llama/0)

10 months ago  scripts : sync sycl (#0)
Georgi Gerganov [Thu, 8 Aug 2024 10:57:33 +0000 (13:57 +0300)]
scripts : sync sycl (#0)

10 months ago  scripts : remove obsolete header (#0)
Georgi Gerganov [Thu, 8 Aug 2024 10:54:39 +0000 (13:54 +0300)]
scripts : remove obsolete header (#0)

10 months ago  scripts : update sync scripts (#0)
Georgi Gerganov [Thu, 8 Aug 2024 10:51:09 +0000 (13:51 +0300)]
scripts : update sync scripts (#0)

10 months ago  sync : llama.cpp
Georgi Gerganov [Thu, 8 Aug 2024 10:25:41 +0000 (13:25 +0300)]
sync : llama.cpp

ggml-ci

10 months ago  ggml-backend : fix async copy from CPU (llama/8897)
slaren [Wed, 7 Aug 2024 11:29:02 +0000 (13:29 +0200)]
ggml-backend : fix async copy from CPU (llama/8897)

* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same

10 months ago  Updated SYCL device filtering (llama/8901)
Ouadie EL FAROUKI [Wed, 7 Aug 2024 10:25:36 +0000 (11:25 +0100)]
Updated SYCL device filtering (llama/8901)

* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme

10 months ago  CUDA/HIP: fix tests/test-backend-ops (llama/8896)
Johannes Gäßler [Wed, 7 Aug 2024 07:07:52 +0000 (09:07 +0200)]
CUDA/HIP: fix tests/test-backend-ops (llama/8896)

10 months ago  CUDA: fix padding logic for FP16/FP32 (llama/8884)
Johannes Gäßler [Tue, 6 Aug 2024 15:13:55 +0000 (17:13 +0200)]
CUDA: fix padding logic for FP16/FP32 (llama/8884)

10 months ago  ggml : add epsilon as a parameter for group_norm (llama/8818)
Molly Sophia [Tue, 6 Aug 2024 07:26:46 +0000 (15:26 +0800)]
ggml : add epsilon as a parameter for group_norm (llama/8818)

Signed-off-by: Molly Sophia <redacted>
10 months ago  Fix ggml_backend_cann_buffer_get_tensor (llama/8871)
Mengqing Cao [Tue, 6 Aug 2024 04:42:42 +0000 (12:42 +0800)]
Fix ggml_backend_cann_buffer_get_tensor (llama/8871)

* cann: fix ggml_backend_cann_buffer_get_tensor

 1. fix data ptr offset
 2. enable the acquisition of incomplete tensors

* fix backend cann set_tensor

10 months ago  cann: fix buffer_num and slow runtime speed error (llama/8865)
wangshuai09 [Mon, 5 Aug 2024 13:10:37 +0000 (21:10 +0800)]
cann: fix buffer_num and slow runtime speed error (llama/8865)

10 months ago  ggml : fix overflows in elu function (llama/8866)
Justine Tunney [Mon, 5 Aug 2024 12:43:40 +0000 (05:43 -0700)]
ggml : fix overflows in elu function (llama/8866)

It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
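
For illustration, the fix boils down to computing the negative branch of ELU (with alpha = 1 here) via expm1 instead of exp(x) - 1:

```cpp
// Sketch: ELU with alpha = 1, using expm1 rather than the problematic expf(x) - 1.0f.
#include <cmath>

static float elu_f32(float x) {
    return x > 0.0f ? x : std::expm1(x); // expm1(x) == exp(x) - 1, computed in one step
}
```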

10 months ago  vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (llama/8855)
0cc4m [Mon, 5 Aug 2024 05:52:55 +0000 (07:52 +0200)]
vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (llama/8855)

* Fix Vulkan mul mat vec invalid results when ncols < warp size

* Only run backend ops mul mat vec block size test if block size not already covered

10 months ago  cann: support q4_0 model (llama/8822)
wangshuai09 [Mon, 5 Aug 2024 04:22:30 +0000 (12:22 +0800)]
cann: support q4_0 model (llama/8822)

10 months ago  ggml : reading the runtime sve config of the cpu (llama/8709)
jdomke [Sat, 3 Aug 2024 16:34:41 +0000 (01:34 +0900)]
ggml : reading the runtime sve config of the cpu (llama/8709)

* ggml : reading the runtime sve config of the cpu

* change to one time init to prevent performance drop

* prefix variable to avoid possible conflicts

* revert xxhash fix and add brackets

---------

Co-authored-by: domke <redacted>
10 months ago  Fix conversion of unnormalized BF16->BF16 weights (llama/7843)
Sigbjørn Skjæret [Fri, 2 Aug 2024 19:11:39 +0000 (21:11 +0200)]
Fix conversion of unnormalized BF16->BF16 weights (llama/7843)

* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values. (A generic rounding sketch appears after this entry.)

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <redacted>
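
For reference, one common way to write the round-to-nearest-even fp32 to bf16 conversion discussed in the items above, with subnormals passed through rather than flushed to zero. This is a generic sketch, not a copy of the ggml or gguf-py implementation.

```cpp
// Sketch: fp32 -> bf16 with round-to-nearest-even; NaNs are quieted, subnormals are kept.
#include <cstdint>
#include <cstring>

static uint16_t fp32_to_bf16_rne(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if ((u & 0x7fffffffu) > 0x7f800000u) {   // NaN: keep the sign, force a quiet-NaN payload
        return (uint16_t) ((u >> 16) | 0x0040u);
    }
    // add 0x7fff plus the lowest kept bit so ties round to even, then truncate to the top 16 bits
    return (uint16_t) ((u + (0x7fffu + ((u >> 16) & 1u))) >> 16);
}
```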
10 months ago  cann: Fix ggml_cann_im2col for 1D im2col (llama/8819)
Mengqing Cao [Fri, 2 Aug 2024 08:50:53 +0000 (16:50 +0800)]
cann: Fix ggml_cann_im2col for 1D im2col (llama/8819)

* fix ggml_cann_im2col for 1D im2col

* fix build warning

10 months ago  Fixing wrong VDR iq4nl value (llama/8812)
Ouadie EL FAROUKI [Fri, 2 Aug 2024 00:55:17 +0000 (01:55 +0100)]
Fixing wrong VDR iq4nl value (llama/8812)

10 months ago  ggml-cuda: Adding support for unified memory (llama/8035)
matteo [Thu, 1 Aug 2024 21:28:28 +0000 (23:28 +0200)]
ggml-cuda: Adding support for unified memory (llama/8035)

* Adding support for unified memory

* adding again the documentation about unified memory

* refactoring: Moved the unified memory code in the correct location.

* Fixed compilation error when using hipblas

* cleaning up the documentation

* Updating the documentation

Co-authored-by: Johannes Gäßler <redacted>
* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <redacted>
Co-authored-by: Johannes Gäßler <redacted>
10 months ago  Build: Only include execinfo.h on linux systems that support it (llama/8783)
Alex O'Connell [Thu, 1 Aug 2024 16:53:46 +0000 (12:53 -0400)]
Build: Only include execinfo.h on linux systems that support it (llama/8783)

* Only enable backtrace on GLIBC linux systems

* fix missing file from copy

* use glibc macro instead of defining a custom one
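
The items above amount to gating backtrace support on glibc itself rather than on "Linux"; a rough sketch:

```cpp
// Sketch: only include execinfo.h where glibc provides it, instead of assuming every Linux libc does.
#include <cstdio>            // pull in a libc header first so the __GLIBC__ feature macro is visible
#if defined(__GLIBC__)
#include <execinfo.h>
#define HAVE_BACKTRACE 1
#else
#define HAVE_BACKTRACE 0
#endif
```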

10 months ago  cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800)
slaren [Thu, 1 Aug 2024 13:26:22 +0000 (15:26 +0200)]
cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800)

* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X

* update asserts

* only use dmmv for supported types

* add test

10 months ago  added android implementation of ggml_print_backtrace_symbols (llama/8751)
l3utterfly [Tue, 30 Jul 2024 14:40:18 +0000 (23:40 +0900)]
added android implementation of ggml_print_backtrace_symbols (llama/8751)

* added android implementation of ggml_print_backtrace_symbols

* Update ggml/src/ggml.c

Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c

Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c

Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c

Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
10 months ago  cann: update cmake (llama/8765)
wangshuai09 [Tue, 30 Jul 2024 10:37:35 +0000 (18:37 +0800)]
cann: update cmake (llama/8765)

10 months ago  Add `TIMESTEP_EMBEDDING` OP (llama/8707)
zhentaoyu [Tue, 30 Jul 2024 06:56:51 +0000 (14:56 +0800)]
Add `TIMESTEP_EMBEDDING` OP (llama/8707)

Signed-off-by: zhentaoyu <redacted>
10 months ago  ggml: bugfix: fix inactive elements being agnostic for risc-v vector (llama/8748)
CarterLi999 [Mon, 29 Jul 2024 16:38:34 +0000 (00:38 +0800)]
ggml: bugfix: fix inactive elements being agnostic for risc-v vector (llama/8748)

In this code, we want the elements to retain the value that they previously held
when mask[i] is false, so we should use the undisturbed policy. With the default
agnostic policy of the RVV intrinsics, these values may either be kept or be
overwritten with 1s.

Co-authored-by: carter.li <redacted>
10 months ago  cuda : organize vendor-specific headers into vendors directory (llama/8746)
R0CKSTAR [Mon, 29 Jul 2024 12:56:12 +0000 (20:56 +0800)]
cuda : organize vendor-specific headers into vendors directory (llama/8746)

Signed-off-by: Xiaodong Ye <redacted>
10 months ago  add conv support (llama/8688)
Meng, Hengyu [Mon, 29 Jul 2024 02:50:27 +0000 (10:50 +0800)]
add conv support (llama/8688)

10 months ago  feat: Support Moore Threads GPU (llama/8383)
R0CKSTAR [Sat, 27 Jul 2024 23:41:25 +0000 (07:41 +0800)]
feat: Support Moore Threads GPU (llama/8383)

* Update doc for MUSA

Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <redacted>
* CUDA => MUSA

Signed-off-by: Xiaodong Ye <redacted>
* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <redacted>
* Fix CI build failure

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
10 months ago  ggml : ignore more msvc warnings (#906)
Borislav Stanimirov [Wed, 7 Aug 2024 07:00:56 +0000 (10:00 +0300)]
ggml : ignore more msvc warnings (#906)

10 months ago  metal : fix struct name (#912)
Georgi Gerganov [Wed, 7 Aug 2024 06:57:00 +0000 (09:57 +0300)]
metal : fix struct name (#912)

ggml-ci

10 months ago  metal : add abort callback (#905)
Conrad Kramer [Wed, 7 Aug 2024 06:55:49 +0000 (02:55 -0400)]
metal : add abort callback (#905)

10 months ago  vulkan : implement Stable Diffusion operators (#904)
0cc4m [Sun, 4 Aug 2024 15:28:08 +0000 (17:28 +0200)]
vulkan : implement Stable Diffusion operators (#904)

* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op

11 months ago  ggml : move c parameter comment to ggml_rope_ext (#901)
Daniel Bevenius [Mon, 29 Jul 2024 13:06:06 +0000 (15:06 +0200)]
ggml : move c parameter comment to ggml_rope_ext (#901)

This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).

Signed-off-by: Daniel Bevenius <redacted>
11 months ago  examples: add TensorFlow to requirements.txt (#902)
Johannes Gäßler [Mon, 29 Jul 2024 13:03:08 +0000 (15:03 +0200)]
examples: add TensorFlow to requirements.txt (#902)

11 months ago  ggml : sync vulkan shaders (#0)
0cc4m [Sat, 27 Jul 2024 14:52:35 +0000 (17:52 +0300)]
ggml : sync vulkan shaders (#0)

ggml-ci

11 months ago  ggml : resolve sync conflicts (#0)
Georgi Gerganov [Sat, 27 Jul 2024 14:17:23 +0000 (17:17 +0300)]
ggml : resolve sync conflicts (#0)

ggml-ci

11 months ago  common : handle new quant types (#0)
Georgi Gerganov [Sat, 27 Jul 2024 14:17:04 +0000 (17:17 +0300)]
common : handle new quant types (#0)

11 months ago  ggml : add ggml-aarch64 (#0)
Dibakar Gope [Sat, 27 Jul 2024 14:16:40 +0000 (17:16 +0300)]
ggml : add ggml-aarch64 (#0)

11 months ago  cann: Fix Multi-NPU execution error (llama/8710)
wangshuai09 [Sat, 27 Jul 2024 08:36:44 +0000 (16:36 +0800)]
cann: Fix Multi-NPU execution error (llama/8710)

* cann: fix multi-npu exec error

* cann: update comment  for ggml_backend_cann_supports_buft

11 months ago  ggml : reduce hash table reset cost (llama/8698)
slaren [Sat, 27 Jul 2024 02:41:55 +0000 (04:41 +0200)]
ggml : reduce hash table reset cost (llama/8698)

* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string

11 months ago  ggml: handle ggml_init failure to fix NULL pointer deref (llama/8692)
DavidKorczynski [Thu, 25 Jul 2024 21:23:05 +0000 (22:23 +0100)]
ggml: handle ggml_init failure to fix NULL pointer deref (llama/8692)

`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_no_alloc`.

This fixes it by bailing out if no context is found.
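
The defensive pattern described above, sketched against the public ggml API (the error handling is illustrative):

```cpp
// Sketch: bail out when ggml_init() cannot find a free context instead of dereferencing NULL later.
#include "ggml.h"
#include <cstdio>

static struct ggml_context * try_init(size_t mem_size) {
    struct ggml_init_params params = { /*.mem_size =*/ mem_size, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);
    if (ctx == NULL) {
        fprintf(stderr, "ggml_init failed: no unused context available\n");
        return NULL; // caller must check, rather than passing NULL to ggml_set_no_alloc() etc.
    }
    return ctx;
}
```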

11 months ago  ggml : fix build on Windows with Snapdragon X (llama/8531)
Andreas (Andi) Kunar [Thu, 25 Jul 2024 16:01:00 +0000 (18:01 +0200)]
ggml : fix build on Windows with Snapdragon X (llama/8531)

* Improvements for Windows with Snapdragon X

* Revert "Improvements for Windows with Snapdragon X"

This reverts commit bf21397ae5ea7c73d3494db3b91505599909227d.

* Improvements for Windows with Snapdragon X

* WOA build clarifications

* Windows on ARM build clarifications

* cmake build for Windows clarifications

* Update docs/build.md

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <redacted>
11 months ago  fix multi-gpu issue on sycl (llama/8554)
Chen Xi [Thu, 25 Jul 2024 11:45:18 +0000 (11:45 +0000)]
fix multi-gpu issue on sycl (llama/8554)

---------

Signed-off-by: Chen Xi <redacted>
Co-authored-by: Meng, Hengyu <redacted>
11 months ago  ggml : add and use ggml_cpu_has_llamafile() (llama/8664)
Georgi Gerganov [Thu, 25 Jul 2024 09:37:42 +0000 (12:37 +0300)]
ggml : add and use ggml_cpu_has_llamafile() (llama/8664)