git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log


commit | commitdiff | tree

bandoti [Thu, 3 Oct 2024 15:39:03 +0000 (12:39 -0300)]

ggml: unify backend logging mechanism (#9709)

* Add scaffolding for ggml logging macros

* Metal backend now uses GGML logging

* Cuda backend now uses GGML logging

* Cann backend now uses GGML logging

* Add enum tag to parameters

* Use C memory allocation funcs

* Fix compile error

* Use GGML_LOG instead of GGML_PRINT

* Rename llama_state to llama_logger_state

* Prevent null format string

* Fix whitespace

* Remove log callbacks from ggml backends

* Remove cuda log statement
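
A rough sketch of what a single, shared logging entry point with a user-settable callback might look like. All names here (`toy_log_level`, `toy_log_set`, `toy_log`) are hypothetical and are not the actual ggml API; the only details taken from the commit message are the level tag on the parameters and the guard against a null format string.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical log levels and callback type (illustration only, not ggml's API).
enum toy_log_level { TOY_LOG_INFO, TOY_LOG_WARN, TOY_LOG_ERROR };
using toy_log_callback = void (*)(toy_log_level level, const char * text, void * user_data);

// One process-wide sink shared by every backend, instead of one per backend.
toy_log_callback g_toy_log_cb   = nullptr;
void *           g_toy_log_user = nullptr;

void toy_log_set(toy_log_callback cb, void * user_data) {
    g_toy_log_cb   = cb;
    g_toy_log_user = user_data;
}

// Formats the message and forwards it to the registered callback.
// A null format string (or a missing callback) is ignored rather than dereferenced.
void toy_log(toy_log_level level, const char * fmt, ...) {
    if (fmt == nullptr || g_toy_log_cb == nullptr) {
        return;
    }
    char buf[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, args);  // truncates long messages
    va_end(args);
    g_toy_log_cb(level, buf, g_toy_log_user);
}
```

With this shape, a backend only calls `toy_log(...)` and never needs to know where the text ends up.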

commit | commitdiff | tree

compilade [Thu, 3 Oct 2024 14:22:15 +0000 (10:22 -0400)]

convert : handle tokenizer merges format from transformers 4.45 (#9696)

commit | commitdiff | tree

Radoslav Gerganov [Thu, 3 Oct 2024 10:00:52 +0000 (13:00 +0300)]

rpc : enable vulkan (#9714)

closes #8536

commit | commitdiff | tree

Ouadie EL FAROUKI [Thu, 3 Oct 2024 06:50:44 +0000 (07:50 +0100)]

Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)

commit | commitdiff | tree

Diego Devesa [Wed, 2 Oct 2024 23:49:47 +0000 (01:49 +0200)]

ggml-backend : add device and backend reg interfaces (#9707)

Co-authored-by: Johannes Gäßler <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Wed, 2 Oct 2024 13:49:55 +0000 (15:49 +0200)]

llama : reduce compile time and binary size (#9712)

* llama : speed up compile time

* fix build

* fix build (2)

commit | commitdiff | tree

Alberto Cabrera Pérez [Wed, 2 Oct 2024 12:57:18 +0000 (13:57 +0100)]

[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658)

sycl: initial cmake support of SYCL for AMD GPUs

commit | commitdiff | tree

Radoslav Gerganov [Wed, 2 Oct 2024 10:49:16 +0000 (13:49 +0300)]

vulkan : do not use tensor->extra (#9407)

* vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.

Ref: #8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)

---------

Co-authored-by: 0cc4m <redacted>

commit | commitdiff | tree

Zhenwei Jin [Wed, 2 Oct 2024 07:21:57 +0000 (15:21 +0800)]

gguf-split : improve --split and --merge logic (#9619)

* make sure params --split and --merge are not specified at same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: slaren <redacted>
---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
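
The mutual-exclusion check between `--split` and `--merge` could look roughly like the sketch below. The flag names come from the commit message; the `split_op` enum and `parse_operation` function are hypothetical and do not reflect the actual gguf-split code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical operation selector (illustration only).
enum class split_op { NONE, SPLIT, MERGE };

// Scans the arguments and records the requested operation.
// Returns false if both --split and --merge were given.
inline bool parse_operation(const std::vector<std::string> & args, split_op & op) {
    op = split_op::NONE;
    for (const std::string & arg : args) {
        split_op requested;
        if (arg == "--split") {
            requested = split_op::SPLIT;
        } else if (arg == "--merge") {
            requested = split_op::MERGE;
        } else {
            continue;  // not an operation flag
        }
        if (op != split_op::NONE && op != requested) {
            return false;  // conflicting operations
        }
        op = requested;
    }
    return true;
}
```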

commit | commitdiff | tree

Georgi Gerganov [Wed, 2 Oct 2024 07:14:44 +0000 (10:14 +0300)]

examples : remove benchmark (#9704)

ggml-ci

commit | commitdiff | tree

Paweł Wodnicki [Tue, 1 Oct 2024 17:18:46 +0000 (12:18 -0500)]

Update README.md (#9591)

Add Bielik model.

commit | commitdiff | tree

Georgi Gerganov [Tue, 1 Oct 2024 13:09:42 +0000 (16:09 +0300)]

sync : ggml

commit | commitdiff | tree

Johannes Gäßler [Mon, 30 Sep 2024 07:55:23 +0000 (09:55 +0200)]

test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)

commit | commitdiff | tree

Salvatore Mesoraca [Mon, 30 Sep 2024 07:14:09 +0000 (09:14 +0200)]

vulkan : mul_mat: fix UB with small warps (ggml/952)

When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0,
because they are calculated as the workgroup size
multiplied by LOAD_VEC_* (which can be 1) and divided by 16,
and the workgroup size is set to be the same as the
warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <redacted>
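
The arithmetic behind this bug can be reproduced on the host side. The shader itself is GLSL; the C++ below is only an illustration, and the helper names are hypothetical. The formula (workgroup size times LOAD_VEC_*, divided by 16) and the lower bound of 16 are taken from the message above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the formula from the commit message:
// loadstride = workgroup_size * load_vec / 16, using integer division.
constexpr uint32_t load_stride(uint32_t workgroup_size, uint32_t load_vec) {
    return workgroup_size * load_vec / 16;
}

// The described fix: never let the workgroup size drop below 16,
// even when the device's warp/subgroup size is smaller.
constexpr uint32_t clamped_workgroup_size(uint32_t warp_size) {
    return warp_size < 16 ? 16 : warp_size;
}
```

With a warp size of 8 and a load vector width of 1, the unclamped stride is `8 * 1 / 16 == 0`, which is the zero loop increment described above; after clamping it becomes `16 * 1 / 16 == 1`.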

commit | commitdiff | tree

Borislav Stanimirov [Mon, 30 Sep 2024 07:11:41 +0000 (10:11 +0300)]

ggml : fix ggml_cast (ggml/973)

commit | commitdiff | tree

Johannes Gäßler [Sun, 29 Sep 2024 21:18:02 +0000 (23:18 +0200)]

ggml: fix gradient allocation logic (ggml/966)

* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg

commit | commitdiff | tree

Georgi Gerganov [Tue, 1 Oct 2024 13:00:25 +0000 (16:00 +0300)]

metal : reduce command encoding overhead (#9698)

* metal : reduce command encoding overhead

ggml-ci

* metal : add comments

commit | commitdiff | tree

Georgi Gerganov [Tue, 1 Oct 2024 08:42:01 +0000 (11:42 +0300)]

llama : print correct model type for Llama 3.2 1B and 3B

commit | commitdiff | tree

compilade [Tue, 1 Oct 2024 06:31:36 +0000 (02:31 -0400)]

convert : refactor rope_freqs generation (#9396)

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

commit | commitdiff | tree

serhii-nakon [Mon, 30 Sep 2024 18:57:12 +0000 (21:57 +0300)]

Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641)

* Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS

* Set ROCM_DOCKER_ARCH as a string, because otherwise it builds incorrectly and causes an OOM exit code

commit | commitdiff | tree

compilade [Mon, 30 Sep 2024 18:13:16 +0000 (14:13 -0400)]

ci : reduce severity of unused Pyright ignore comments (#9697)

commit | commitdiff | tree

vb [Mon, 30 Sep 2024 15:03:47 +0000 (17:03 +0200)]

py : update transformers version (#9694)

* update transformers version.

* update hfh version.

commit | commitdiff | tree

Georgi Gerganov [Mon, 30 Sep 2024 14:48:49 +0000 (17:48 +0300)]

flake.lock: Update (#9680)

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)
→ 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Ruchira Hasaranga [Mon, 30 Sep 2024 08:23:42 +0000 (13:53 +0530)]

console : utf-8 fix for windows stdin (#9690)

* utf-8 fix for windows stdin

* Update common/console.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 29 Sep 2024 18:18:23 +0000 (21:18 +0300)]

ggml : define missing HWCAP flags (#9684)

ggml-ci

Co-authored-by: Willy Tarreau <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 29 Sep 2024 18:16:07 +0000 (21:16 +0300)]

sync : ggml

commit | commitdiff | tree

Johannes Gäßler [Sun, 29 Sep 2024 17:56:17 +0000 (19:56 +0200)]

CUDA: remove bad assert (ggml/972)

commit | commitdiff | tree

Jeff Bolz [Sun, 29 Sep 2024 16:50:17 +0000 (11:50 -0500)]

vulkan : multithread pipeline creation (ggml/963)

commit | commitdiff | tree

Jeff Bolz [Fri, 27 Sep 2024 07:58:01 +0000 (02:58 -0500)]

vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)

commit | commitdiff | tree

Salvatore Mesoraca [Thu, 26 Sep 2024 06:59:42 +0000 (08:59 +0200)]

vulkan : argsort barriers must be under uniform control flow (ggml/951)

A return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 24 Sep 2024 10:23:59 +0000 (13:23 +0300)]

ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)

commit | commitdiff | tree

matiaslin [Sun, 29 Sep 2024 12:25:00 +0000 (05:25 -0700)]

common : ensure llama_batch size does not exceed max size (#9668)

A crash was observed when the number of tokens added to a batch exceeds
llama_batch size. An assertion in llama_batch_add was added to protect
against llama_batch size overflow.
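
A minimal sketch of this kind of capacity check, using a simplified stand-in type. `toy_batch` and `toy_batch_add` are hypothetical and much smaller than the real `llama_batch`; the commit adds an assertion, while this sketch returns a boolean so the behaviour is easy to exercise.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a token batch (not the real llama_batch).
struct toy_batch {
    std::vector<int32_t> tokens;        // storage allocated up front
    int32_t              n_tokens = 0;  // number of slots currently in use

    explicit toy_batch(int32_t capacity) : tokens(static_cast<std::size_t>(capacity)) {}
};

// Appends a token only if there is room left; never writes past the end.
inline bool toy_batch_add(toy_batch & batch, int32_t token) {
    if (batch.n_tokens >= static_cast<int32_t>(batch.tokens.size())) {
        return false;  // batch is full
    }
    batch.tokens[static_cast<std::size_t>(batch.n_tokens++)] = token;
    return true;
}
```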

commit | commitdiff | tree

nopperl [Sun, 29 Sep 2024 12:02:06 +0000 (12:02 +0000)]

py : add model class for Chameleon conversion (#9683)

commit | commitdiff | tree

Georgi Gerganov [Sun, 29 Sep 2024 11:38:18 +0000 (14:38 +0300)]

contrib : add Resources section (#9675)

commit | commitdiff | tree

Georgi Gerganov [Sat, 28 Sep 2024 14:42:03 +0000 (17:42 +0300)]

llama : add reranking support (#9510)

* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classification head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : avoid ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

slaren [Sat, 28 Sep 2024 12:32:46 +0000 (14:32 +0200)]

test-backend-ops : use flops for some performance tests (#9657)

* test-backend-ops : use flops for some performance tests

- parallelize tensor quantization

- use a different set of cases for performance and correctness tests

- run each test for at least one second
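
One way to turn a timed run into a FLOP/s figure is sketched below. `measure_flops` is a hypothetical helper, not the code in test-backend-ops; the `2 * m * n * k` count for a dense matrix product is the usual convention of one multiply and one add per inner-product term.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// FLOPs of a dense (m x k) * (k x n) matrix product.
constexpr uint64_t matmul_flops(uint64_t m, uint64_t n, uint64_t k) {
    return 2 * m * n * k;
}

// Hypothetical timing loop: repeat `op` until at least `min_seconds` (> 0) have
// elapsed, then report the achieved throughput in FLOP/s.
inline double measure_flops(const std::function<void()> & op, uint64_t flops_per_run, double min_seconds) {
    using clock = std::chrono::steady_clock;
    const auto start   = clock::now();
    uint64_t   runs    = 0;
    double     elapsed = 0.0;
    do {
        op();
        ++runs;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < min_seconds);
    return static_cast<double>(runs * flops_per_run) / elapsed;
}
```

Running for a minimum wall-clock time instead of a fixed iteration count keeps the measurement stable for both very fast and very slow operations.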

commit | commitdiff | tree

Georgi Gerganov [Sat, 28 Sep 2024 12:13:21 +0000 (15:13 +0300)]

llama : add comment about thread-safety [no ci] (#9449)

commit | commitdiff | tree

Zhenwei Jin [Sat, 28 Sep 2024 12:10:58 +0000 (20:10 +0800)]

vocab : refactor tokenizer to reduce init overhead (#9449)

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fields to avoid unused field build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

nopperl [Sat, 28 Sep 2024 12:08:43 +0000 (12:08 +0000)]

llama : add support for Chameleon (#8543)

* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <redacted>
* fix lint

* trigger ci

---------

Co-authored-by: compilade <redacted>

commit | commitdiff | tree

Aarni Koskela [Sat, 28 Sep 2024 12:07:14 +0000 (15:07 +0300)]

readme : add tool (#9655)

commit | commitdiff | tree

Dan Johansson [Sat, 28 Sep 2024 12:06:16 +0000 (14:06 +0200)]

ggml : add run-time detection of neon, i8mm and sve (#9331)

* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non aarch64 Arm arch

* ggml: Move definition of ggml_arm_arch_features to the global data section

commit | commitdiff | tree

Markus Tavenrath [Sat, 28 Sep 2024 10:05:05 +0000 (12:05 +0200)]

Enable use of the ReBAR feature to upload buffers to the device. (#9251)

commit | commitdiff | tree

Georgi Gerganov [Fri, 27 Sep 2024 17:57:51 +0000 (20:57 +0300)]

readme : update hot topics

commit | commitdiff | tree

Borislav Stanimirov [Fri, 27 Sep 2024 07:42:06 +0000 (10:42 +0300)]

cmake : add option for common library (#9661)

commit | commitdiff | tree

Neo Zhang Jianyu [Thu, 26 Sep 2024 09:38:31 +0000 (17:38 +0800)]

[SYCL] add missed dll file in package (#9577)

* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <redacted>

commit | commitdiff | tree

R0CKSTAR [Thu, 26 Sep 2024 01:27:40 +0000 (09:27 +0800)]

mtgpu: enable VMM (#9597)

Signed-off-by: Xiaodong Ye <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Wed, 25 Sep 2024 15:26:01 +0000 (17:26 +0200)]

ci : fix docker build number and tag name (#9638)

* ci : fix docker build number and tag name

* fine-grained permissions

commit | commitdiff | tree

Charles Xu [Wed, 25 Sep 2024 13:12:20 +0000 (15:12 +0200)]

ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream

commit | commitdiff | tree

Xuan Son Nguyen [Wed, 25 Sep 2024 12:05:13 +0000 (14:05 +0200)]

server : add more env vars, improve gen-docs (#9635)

* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT

commit | commitdiff | tree

Gabe Goodhart [Wed, 25 Sep 2024 07:06:52 +0000 (01:06 -0600)]

llama : add IBM Granite MoE architecture (#9438)

* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com
Signed-off-by: Gabe Goodhart <redacted>
* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com
Co-authored-by: Georgi Gerganov <redacted>
Signed-off-by: Gabe Goodhart <redacted>
* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <redacted>
* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
Co-authored-by: Georgi Gerganov <redacted>
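
The conversion-time split of the merged "input_linear" tensor described in this commit can be illustrated with plain vectors. This sketch assumes the two halves are stacked along the row dimension; that layout, the `matrix` alias and `split_rows_in_half` are all assumptions made for illustration, and the real conversion lives in the Python converter. The size check loosely mirrors the "sanity check on merged FFN tensor sizes" bullet.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using matrix = std::vector<std::vector<float>>;  // row-major: rows x cols

// Splits a merged matrix with 2 * n rows into two matrices of n rows each.
inline std::pair<matrix, matrix> split_rows_in_half(const matrix & merged) {
    assert(merged.size() % 2 == 0 && "merged tensor must have an even number of rows");
    const auto half = static_cast<std::ptrdiff_t>(merged.size() / 2);
    matrix first(merged.begin(), merged.begin() + half);
    matrix second(merged.begin() + half, merged.end());
    return {std::move(first), std::move(second)};
}
```

After the split, each half can be handled by the existing mixtral-style expert path without any new math on the inference side.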

commit | commitdiff | tree

Dou Xinpeng [Wed, 25 Sep 2024 03:30:38 +0000 (11:30 +0800)]

cann: fix crash when llama-bench is running on multiple cann devices (#9627)

commit | commitdiff | tree

Eric Zhang [Tue, 24 Sep 2024 08:03:21 +0000 (16:03 +0800)]

ggml : add AVX512DQ requirement for AVX512 builds (#9622)

commit | commitdiff | tree

Georgi Gerganov [Tue, 24 Sep 2024 08:01:18 +0000 (11:01 +0300)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Sep 2024 18:50:16 +0000 (21:50 +0300)]

examples : adapt to ggml.h changes (ggml/0)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 24 Sep 2024 07:16:06 +0000 (10:16 +0300)]

llama : keep track of all EOG tokens in the vocab (#9609)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 24 Sep 2024 07:15:35 +0000 (10:15 +0300)]

log : add CONT level for continuing previous log entry (#9610)

commit | commitdiff | tree

StrangeBytesDev [Tue, 24 Sep 2024 06:04:39 +0000 (23:04 -0700)]

server : add newline after chat example (#9616)

commit | commitdiff | tree

Georgi Gerganov [Tue, 24 Sep 2024 06:03:17 +0000 (09:03 +0300)]

sampling : avoid expensive softmax during greedy sampling (#9605)

* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <redacted>
* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <redacted>
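
Why greedy sampling can skip softmax: softmax is strictly increasing in each logit, so the token with the largest logit is also the token with the largest probability. The sketch below demonstrates that equivalence with hypothetical helpers; it is not the sampler code from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the largest element.
inline std::size_t argmax(const std::vector<float> & v) {
    assert(!v.empty());
    return static_cast<std::size_t>(std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

// Reference softmax, included only to show that greedy selection does not need it.
inline std::vector<float> softmax(const std::vector<float> & logits) {
    assert(!logits.empty());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);  // subtract the max for stability
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }
    return probs;
}
```

Taking `argmax(logits)` directly is a single pass over the vocabulary with no exponentials, which is where the saving comes from.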

commit | commitdiff | tree

Max Krasnyansky [Tue, 24 Sep 2024 04:18:48 +0000 (21:18 -0700)]

threads: fix msvc build without openmp (#9615)

We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.

commit | commitdiff | tree

Ivan [Tue, 24 Sep 2024 00:14:24 +0000 (03:14 +0300)]

cuda: add q8_0->f32 cpy operation (#9571)

llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.

commit | commitdiff | tree

Xuan Son Nguyen [Mon, 23 Sep 2024 20:23:54 +0000 (22:23 +0200)]

server : add --no-context-shift option (#9607)

* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <redacted>
* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Max Krasnyansky [Mon, 23 Sep 2024 18:42:43 +0000 (11:42 -0700)]

threads: improve ggml_barrier scaling with large number of threads (#9598)

Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write
in the normal case and just use thread-fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau:

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on an 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <redacted>
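
Placing the two counters on separate cache lines can be expressed with `alignas`. The struct below is a hypothetical stand-in for the relevant part of the threadpool struct, and the 64-byte line size is an assumption (it is common, but not universal).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumed cache line size in bytes.
constexpr std::size_t CACHE_LINE = 64;

// Hypothetical stand-in for the barrier counters. Each counter starts on its own
// cache line, so threads incrementing n_barrier do not invalidate the line that
// waiting threads are polling for n_barrier_passed.
struct barrier_counters {
    alignas(CACHE_LINE) std::atomic<int> n_barrier{0};
    alignas(CACHE_LINE) std::atomic<int> n_barrier_passed{0};
};
```

The cost is some padding: the struct grows to at least two cache lines, which matches the remark above that the size of the threadpool struct is not critical.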

commit | commitdiff | tree

Riceball LEE [Mon, 23 Sep 2024 15:58:17 +0000 (23:58 +0800)]

readme : add programmable prompt engine language CLI (#9599)

commit | commitdiff | tree

Georgi Gerganov [Mon, 23 Sep 2024 15:43:40 +0000 (18:43 +0300)]

flake.lock: Update (#9586)

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
→ 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Srihari-mcw [Mon, 23 Sep 2024 14:06:38 +0000 (19:36 +0530)]

ggml : AVX512 gemm for Q4_0_8_8 (#9532)

* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit comments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 23 Sep 2024 08:28:02 +0000 (11:28 +0300)]

perplexity : remove extra new lines after chunks (#9596)

commit | commitdiff | tree

Georgi Gerganov [Mon, 23 Sep 2024 08:27:47 +0000 (11:27 +0300)]

metal : use F32 prec for K*Q in vec FA (#9595)

ggml-ci

commit | commitdiff | tree

Akarshan Biswas [Mon, 23 Sep 2024 03:28:06 +0000 (08:58 +0530)]

Revert "[SYCL] fallback mmvq (#9088)" (#9579)

This reverts commit 50addec9a532a6518146ab837a85504850627316.

commit | commitdiff | tree

R0CKSTAR [Sun, 22 Sep 2024 14:55:49 +0000 (22:55 +0800)]

musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)

* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>

commit | commitdiff | tree

Molly Sophia [Sun, 22 Sep 2024 13:26:50 +0000 (21:26 +0800)]

Fix merge error in #9454 (#9589)

Signed-off-by: Molly Sophia <redacted>

commit | commitdiff | tree

Johannes Gäßler [Sun, 22 Sep 2024 07:34:52 +0000 (09:34 +0200)]

CUDA: enable Gemma FA for HIP/Pascal (#9581)

commit | commitdiff | tree

Shankar [Sun, 22 Sep 2024 02:30:34 +0000 (19:30 -0700)]

llama: remove redundant loop when constructing ubatch (#9574)

commit | commitdiff | tree

Molly Sophia [Sun, 22 Sep 2024 02:29:12 +0000 (10:29 +0800)]

RWKV v6: RWKV_WKV op CUDA implementation (#9454)

* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <redacted>
* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>

commit | commitdiff | tree

slaren [Sat, 21 Sep 2024 12:24:23 +0000 (14:24 +0200)]

ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573)

commit | commitdiff | tree

agray3 [Sat, 21 Sep 2024 00:41:07 +0000 (01:41 +0100)]

Update CUDA graph on scale change plus clear nodes/params (#9550)

* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize

commit | commitdiff | tree

Huang Qi [Sat, 21 Sep 2024 00:39:41 +0000 (08:39 +0800)]

CI: Provide prebuilt windows binary for hip (#9467)

commit | commitdiff | tree

slaren [Fri, 20 Sep 2024 18:55:36 +0000 (20:55 +0200)]

quantize : improve type name parsing (#9570)

quantize : do not ignore invalid types in arg parsing

quantize : ignore case of type and ftype arguments
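
Case-insensitive type parsing that rejects unknown names might look like the following sketch. The enum, the lookup table and `parse_type` are hypothetical and cover only three example names; they are not the parser used by the quantize tool.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical subset of type tags (illustration only).
enum class toy_type { F16, Q4_0, Q8_0 };

inline std::string to_upper(std::string s) {
    for (char & c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Looks the name up without regard to case. Returns false for unknown names
// instead of silently ignoring them.
inline bool parse_type(const std::string & name, toy_type & out) {
    static const std::vector<std::pair<std::string, toy_type>> table = {
        {"F16", toy_type::F16}, {"Q4_0", toy_type::Q4_0}, {"Q8_0", toy_type::Q8_0},
    };
    const std::string key = to_upper(name);
    for (const auto & entry : table) {
        if (entry.first == key) {
            out = entry.second;
            return true;
        }
    }
    return false;
}
```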

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Sep 2024 17:12:52 +0000 (20:12 +0300)]

ggml : fix builds (#0)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Sep 2024 16:13:02 +0000 (19:13 +0300)]

ggml : fix trailing whitespace (#0)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Sep 2024 16:06:59 +0000 (19:06 +0300)]

sync : ggml

ggml-ci

commit | commitdiff | tree

Johannes Gäßler [Fri, 20 Sep 2024 16:04:44 +0000 (19:04 +0300)]

ggml/examples: add backend support for numerical optimization (ggml/949)

* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <redacted>
* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <redacted>
* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <redacted>
* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <redacted>
* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <redacted>
* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 8 Sep 2024 08:10:43 +0000 (11:10 +0300)]

examples : add null threadpool args where needed (ggml/0)

ggml-ci

commit | commitdiff | tree

Johannes Gäßler [Fri, 20 Sep 2024 16:35:35 +0000 (18:35 +0200)]

CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562)

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Sep 2024 08:46:56 +0000 (11:46 +0300)]

examples : flush log upon ctrl+c (#9559)

commit | commitdiff | tree

Sigbjørn Skjæret [Fri, 20 Sep 2024 06:38:10 +0000 (08:38 +0200)]

perplexity : do not escape input data by default (#9548)

commit | commitdiff | tree

Georgi Gerganov [Thu, 19 Sep 2024 09:44:53 +0000 (12:44 +0300)]

server : clean-up completed tasks from waiting list (#9531)

ggml-ci

commit | commitdiff | tree

Sigbjørn Skjæret [Thu, 19 Sep 2024 07:58:14 +0000 (09:58 +0200)]

imatrix : disable prompt escape by default (#9543)

commit | commitdiff | tree

slaren [Wed, 18 Sep 2024 17:13:08 +0000 (19:13 +0200)]

ggml : fix n_threads_cur initialization with one thread (#9538)

* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <redacted>

commit | commitdiff | tree

Georgi Gerganov [Wed, 18 Sep 2024 15:34:32 +0000 (18:34 +0300)]

scripts : verify py deps at the start of compare (#9520)

commit | commitdiff | tree

Daniel Bevenius [Wed, 18 Sep 2024 11:42:36 +0000 (13:42 +0200)]

llama : use reserve/emplace_back in sampler_sample (#9534)

This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.

The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
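
The reserve/emplace_back pattern described here, sketched with a stand-in struct. `toy_token_data` and `build_candidates` are hypothetical names used for illustration; the point is that `reserve` allocates capacity without constructing any elements, so each element is written exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-token candidate record.
struct toy_token_data {
    int32_t id;
    float   logit;
    float   p;
};

// Builds one candidate per vocabulary entry.
inline std::vector<toy_token_data> build_candidates(const std::vector<float> & logits) {
    std::vector<toy_token_data> cur;
    cur.reserve(logits.size());  // single allocation, no default-constructed elements
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cur.emplace_back(toy_token_data{static_cast<int32_t>(i), logits[i], 0.0f});
    }
    return cur;
}
```

The alternative, `std::vector<toy_token_data> cur(n_vocab)` followed by assignment, value-initializes every element first and then overwrites it.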

commit | commitdiff | tree

Vinesh Janarthanan [Wed, 18 Sep 2024 06:50:34 +0000 (01:50 -0500)]

server : match OAI structured output response (#9527)

commit | commitdiff | tree

Eric Zhang [Wed, 18 Sep 2024 06:28:20 +0000 (14:28 +0800)]

server : fix OpenSSL build (remove obsolete `LOG_INFO`) (#9529)

commit | commitdiff | tree

Neo Zhang Jianyu [Wed, 18 Sep 2024 00:30:31 +0000 (08:30 +0800)]

[SYCL] set context default value to avoid memory issue, update guide (#9476)

* set context default to avoid memory issue, update guide

* Update docs/backend/SYCL.md

Co-authored-by: Meng, Hengyu <redacted>
---------

Co-authored-by: arthw <redacted>
Co-authored-by: Meng, Hengyu <redacted>

commit | commitdiff | tree

Michael Podvitskiy [Tue, 17 Sep 2024 20:41:38 +0000 (22:41 +0200)]

llama-bench: correct argument parsing error message (#9524)

commit | commitdiff | tree

Bert Wagner [Tue, 17 Sep 2024 13:35:38 +0000 (09:35 -0400)]

arg : add env variable for parallel (#9513)

* add env variable for parallel

* Update README.md with env: LLAMA_ARG_N_PARALLEL

commit | commitdiff | tree

Michael Podvitskiy [Tue, 17 Sep 2024 10:18:22 +0000 (12:18 +0200)]

llama : fix n_vocab init for 'no_vocab' case (#9511)

* llama: fixed n_vocab for `no_vocab` models

* llama: updated error output for `llama_decode_internal` and `llama_encode_internal`

* llama: log warning if there's no vocab_size in metadata

* llama: correct vocab size for logging

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Max Krasnyansky [Tue, 17 Sep 2024 08:19:46 +0000 (01:19 -0700)]

threadpool : skip polling for unused threads (#9461)

* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written
from one thread and read from other threads (not a race condition).

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through
full memory barrier (memory fence) on ggml_barrier entrance and exit.

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.

Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.

While at it add an explicit atomic_load for n_threads_cur for consistency.

* test-barrier: release threadpool before releasing the context

Fixes a use-after-free detected by the gcc thread-sanitizer on x86-64;
for some reason the llvm sanitizer does not detect this issue.
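
The "poll with relaxed memory order, then do one full fence" idea from the bullets above can be sketched with standard C++ atomics. `wait_for_generation` is a hypothetical helper and does not reproduce the ggml threadpool code; it assumes the publishing thread writes its data first and then updates the counter with release (or stronger) ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spins until `generation` differs from `last_seen`.
// The loop uses relaxed loads so polling stays cheap; the single fence after the
// loop orders all later reads after the observed update.
inline int wait_for_generation(const std::atomic<int> & generation, int last_seen) {
    int current = generation.load(std::memory_order_relaxed);
    while (current == last_seen) {
        std::this_thread::yield();
        current = generation.load(std::memory_order_relaxed);
    }
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return current;
}
```

The fence is paid once per wake-up rather than on every polling iteration, which is the saving the commit message describes.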

commit | commitdiff | tree

Yuri Khrustalev [Tue, 17 Sep 2024 06:51:15 +0000 (02:51 -0400)]

unicode : add <algorithm> (#9508)

commit | commitdiff | tree

Gabe Goodhart [Tue, 17 Sep 2024 06:44:58 +0000 (00:44 -0600)]

llama : support IBM Granite architecture (#9412)

* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>

commit | commitdiff | tree

Michael Podvitskiy [Tue, 17 Sep 2024 06:23:30 +0000 (08:23 +0200)]

llama : add llama_n_head() (#9512)

Packaging of ggml-org/llama.cpp