git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

kuronekosaiko [Sun, 21 Jan 2024 16:28:14 +0000 (00:28 +0800)]

add safetensors support to convert-lora-to-ggml.py (#5062)

* add safetensors support to convert-lora-to-ggml.py

* Update convert-lora-to-ggml.py

Remove white space in line 69.

commit | commitdiff | tree

bobqianic [Sun, 21 Jan 2024 15:17:35 +0000 (15:17 +0000)]

add `#include <string>` to unicode.h (#5051)

Co-authored-by: Jared Van Bortel <redacted>

commit | commitdiff | tree

Kawrakow [Sun, 21 Jan 2024 12:42:44 +0000 (14:42 +0200)]

Add ability to evaluate multiple choice tasks (#5047)

* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available
inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Kawrakow [Sun, 21 Jan 2024 06:01:20 +0000 (08:01 +0200)]

Slightly faster imatrix (#5050)

* imatrix: speedup by avoiding unnecessary allocations and copies

* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 21 Jan 2024 03:17:27 +0000 (05:17 +0200)]

flake.lock: Update (#5054)

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13)
→ 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Jared Van Bortel [Sat, 20 Jan 2024 23:14:18 +0000 (18:14 -0500)]

convert : partially revert PR #4818 (#5041)

commit | commitdiff | tree

Jared Van Bortel [Sat, 20 Jan 2024 15:08:08 +0000 (10:08 -0500)]

perplexity : fix MSVC build after #5020 (#5043)

* perplexity : fix MSVC build after #5020

* try a different fix

commit | commitdiff | tree

slaren [Sat, 20 Jan 2024 15:05:49 +0000 (16:05 +0100)]

llama : run all KQV ops on the CPU with no KV offload (#5049)

ggml-ci

commit | commitdiff | tree

Herman Semenov [Sat, 20 Jan 2024 08:11:31 +0000 (08:11 +0000)]

cmake : add support for ccache (#5002)

* Added ccache support to speed up recompilation

* cmake : option to disable ccache

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

adel boussaken [Sat, 20 Jan 2024 08:05:43 +0000 (09:05 +0100)]

Add a dart/flutter binding to README.md (#4882)

commit | commitdiff | tree

Kylin [Sat, 20 Jan 2024 07:01:46 +0000 (15:01 +0800)]

cuda : fix compile error in jetson platform (#4975)

* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment

commit | commitdiff | tree

Uzo Nweke [Fri, 19 Jan 2024 18:20:50 +0000 (13:20 -0500)]

finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)

* Fix issue with alloc causing max_compute_size to be calculated

* remove ggml_allocr_free as suggested in issue #4791

commit | commitdiff | tree

Georgi Gerganov [Fri, 19 Jan 2024 13:24:47 +0000 (15:24 +0200)]

imatrix : add README.md

commit | commitdiff | tree

Shijie [Fri, 19 Jan 2024 11:53:13 +0000 (19:53 +0800)]

llama : support upcoming Qwen2 (#5037)

commit | commitdiff | tree

Georgi Gerganov [Fri, 19 Jan 2024 11:52:22 +0000 (13:52 +0200)]

py : fix flake8 lint

commit | commitdiff | tree

Kawrakow [Fri, 19 Jan 2024 09:39:11 +0000 (11:39 +0200)]

winogrande: evaluate log-probs in parallel (#5036)

This is a relatively minor performance tweak resulting in
~10% speedup on my system.

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

chiranko [Fri, 19 Jan 2024 09:07:27 +0000 (17:07 +0800)]

llama : add CodeShell support (#5016)

* llama: add codeshell support

* llama.cpp: fix codeshell with NeoX rope

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Kawrakow [Fri, 19 Jan 2024 09:02:39 +0000 (11:02 +0200)]

perplexity: avoid unnecessary allocations and logit copies (#5035)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Fri, 19 Jan 2024 08:45:06 +0000 (10:45 +0200)]

perplexity : faster Winogrande via batching (#5024)

* perplexity : faster Winogrande via batching

ggml-ci

* perplexity : remove unused function

* perplexity : only tokenize selected tasks for Winogrande

commit | commitdiff | tree

John [Thu, 18 Jan 2024 22:12:15 +0000 (23:12 +0100)]

llama : fix falcon arch for tied output embeddings (#4978)

* falcon arch fix for tied output embeddings

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 21:36:07 +0000 (23:36 +0200)]

cmake : add ggml public headers (#5011)

commit | commitdiff | tree

Xuan Son Nguyen [Thu, 18 Jan 2024 20:33:05 +0000 (21:33 +0100)]

server : defer tasks when "slot unavailable" (#5018)

* server: defer task when no slot is available

* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

slaren [Thu, 18 Jan 2024 20:12:15 +0000 (21:12 +0100)]

llama : fix mlock with no-mmap with Metal (#5025)

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 19:45:51 +0000 (21:45 +0200)]

imatrix : fix assert for src0 non-cont check

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 18:49:00 +0000 (20:49 +0200)]

perplexity : fix winogrande N tasks option

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 18:45:39 +0000 (20:45 +0200)]

scripts : add get-winogrande.sh

commit | commitdiff | tree

David Sommers [Thu, 18 Jan 2024 17:20:59 +0000 (12:20 -0500)]

convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019)

PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).

Without the fix, llama2 models can't be converted. The error is:

`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`

commit | commitdiff | tree

Kawrakow [Thu, 18 Jan 2024 17:18:21 +0000 (19:18 +0200)]

HellaSwag: speed up by parallelizing log-prob evaluation (#5020)

For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 13:33:01 +0000 (15:33 +0200)]

perplexity : faster HellaSwag via batching (#5017)

* perplexity : faster HellaSwag

ggml-ci

* perplexity : clean-up

ggml-ci

* perplexity : no need for decode_helper

ggml-ci

* perplexity : add comments

* perplexity : option to specify max batched tasks via `n_parallel`

* perplexity : remove HellaSwag restriction for n_batch

commit | commitdiff | tree

Kawrakow [Thu, 18 Jan 2024 11:46:27 +0000 (13:46 +0200)]

Add Winogrande evaluation (#5015)

* winogrande: simple implementation

It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.

* winogrande: somewhat better

Score for Mistral-7B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.

* winogrande: improving

Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions
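
The ~1.4 figure in the first note above is consistent with the binomial standard error of an accuracy measured on 1267 independent tasks. A minimal sketch of that arithmetic, assuming independent tasks and using the ~60% score and the 78.4% leader board value quoted in the message (the function name is illustrative, not part of llama.cpp):

```python
import math

def binomial_sigma_percent(score_percent: float, n_tasks: int) -> float:
    """1-sigma uncertainty, in percentage points, of an accuracy measured on n_tasks."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)

# Numbers quoted in the commit message: ~60% measured on 1267 tasks,
# 78.4% reported on the HF leader board.
sigma = binomial_sigma_percent(60.0, 1267)
gap_in_sigmas = (78.4 - 60.0) / sigma

print(f"sigma = {sigma:.2f} points, gap = {gap_in_sigmas:.1f} sigma")
```

A gap of more than ten standard deviations is why the message rules out statistics as the explanation.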

---------

Co-authored-by: Iwan Kawrakow <redacted>
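
The scoring idea described in the "winogrande: improving" note (skip the choice word(s) and the trailing punctuation when averaging log-likelihoods, unless the choice is last) can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the function names, the `choice_end` index convention, and the log-prob values are all hypothetical.

```python
from typing import Sequence

def avg_logprob(logprobs: Sequence[float]) -> float:
    """Plain average log-likelihood over all tokens."""
    return sum(logprobs) / len(logprobs)

def tail_avg_logprob(logprobs: Sequence[float], choice_end: int,
                     skip_last: bool = True) -> float:
    """Average log-prob of the tokens that follow the choice word(s).

    choice_end is the index of the first token after the choice word(s).
    With skip_last=True the final token (assumed to be punctuation) is
    dropped too. If nothing is left to score, i.e. the choice is last in
    the sentence, fall back to averaging over every token.
    """
    end = len(logprobs) - 1 if skip_last else len(logprobs)
    tail = logprobs[choice_end:end]
    if not tail:
        return avg_logprob(logprobs)
    return avg_logprob(tail)

# Hypothetical per-token log-probs for the two completed sentences.
# Layout: [context, context, choice, tail, tail, punctuation]
first = [-2.0, -1.0, -0.5, -3.0, -2.5, -0.1]   # common name, poor continuation
second = [-2.0, -1.0, -6.0, -1.0, -0.5, -0.1]  # rare name, good continuation

full_scores = [avg_logprob(s) for s in (first, second)]
tail_scores = [tail_avg_logprob(s, choice_end=3) for s in (first, second)]

best_full = max(range(2), key=lambda i: full_scores[i])
best_tail = max(range(2), key=lambda i: tail_scores[i])
print(best_full, best_tail)
```

In this made-up example the full average prefers the sentence with the more common name, while the tail-only average prefers the one whose continuation fits better, which is the skew the note describes.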

commit | commitdiff | tree

Georgi Gerganov [Thu, 18 Jan 2024 09:44:49 +0000 (11:44 +0200)]

scripts : add helper script to get hellaswag data in txt format

commit | commitdiff | tree

Paul Tsochantaris [Thu, 18 Jan 2024 08:47:24 +0000 (08:47 +0000)]

metal : fix memory leak, dangling pointer and unused autorel (#5007)

* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute

* SPM header potential fix

* Reverting symlinks

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 18:54:50 +0000 (20:54 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:54:56 +0000 (18:54 +0200)]

ggml : add IQ2 to test-backend-ops + refactoring (#4990)

* ggml : add IQ2 to test-backend-ops + refactoring

ggml-ci

* cuda : update supports_op for IQ2

ggml-ci

* ci : enable LLAMA_CUBLAS=1 for CUDA nodes

ggml-ci

* cuda : fix out-of-bounds-access in `mul_mat_vec_q`

ggml-ci

* tests : avoid creating RNGs for each Q tensor

ggml-ci

* tests : avoid creating RNGs for each tensor

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:46:30 +0000 (18:46 +0200)]

imatrix : offload to GPU support (#4957)

* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when the user doesn't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* imatrix : offload to GPU support

* imatrix : fix ggml_mul_mat_id handling

ggml-ci

* ci : add imatrix test

ggml-ci

* ci : rearrange output

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:39:41 +0000 (18:39 +0200)]

backend : add eval callback (#4935)

* backend : add eval callback

ggml-ci

* backend : group nodes in a single compute when the user doesn't need them

* backend : clean-up the implementation

ggml-ci

* simple : do not perform tensor data copy if not needed

* simple : fix

* simple : no need for ggml_is_contiguous + fix bool parse

* llama : fix callback placement in llama_context_params

* backend : avoid double-ask callback calls

* simple : restore examples, imatrix will serve as a demo

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:38:39 +0000 (18:38 +0200)]

metal : create autorelease pool during library build (#4970)

* metal : create autorelease pool during library build

ggml-ci

* test : simplify

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 16:37:36 +0000 (18:37 +0200)]

py : fix whitespace

commit | commitdiff | tree

Georgi Gerganov [Wed, 17 Jan 2024 13:45:03 +0000 (15:45 +0200)]

py : fix missing added_tokens_dict for SPM and BPE vocabs (#4971)

* py : fix missing added_tokens_dict for SPM vocab

* py : pad with unknown tokens when data is missing

ggml-ci

* py : fix BPE vocab conversion

ggml-ci

* py : fix padded dummy tokens (I hope)

commit | commitdiff | tree

Kawrakow [Wed, 17 Jan 2024 10:36:37 +0000 (12:36 +0200)]

llama : use Q4_K for attn_v for Q2_K_S when n_gqa >= 4 (#4996)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Paul Tsochantaris [Wed, 17 Jan 2024 08:07:24 +0000 (08:07 +0000)]

metal : remove unnecessary nil check (#4986)

commit | commitdiff | tree

David Renshaw [Wed, 17 Jan 2024 07:17:50 +0000 (02:17 -0500)]

llama : fix copy/paste error in llama_sampling_params comment (#4994)

commit | commitdiff | tree

Georgi Gerganov [Tue, 16 Jan 2024 18:59:31 +0000 (20:59 +0200)]

py : remove unnecessary hasattr (#4903)

commit | commitdiff | tree

Philip Taron [Tue, 16 Jan 2024 17:56:21 +0000 (09:56 -0800)]

nix: remove nixConfig from flake.nix (#4984)

commit | commitdiff | tree

Daniel Bevenius [Tue, 16 Jan 2024 17:54:24 +0000 (18:54 +0100)]

finetune : add training data file to log message (#4979)

This commit adds the name of the training data file to the log message
printed when the training data is tokenized.

The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Kawrakow [Tue, 16 Jan 2024 17:51:26 +0000 (19:51 +0200)]

ggml : importance matrix support for legacy quants (#4969)

* imatrix: adding support for legacy quants

* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Maximilian Winter [Tue, 16 Jan 2024 17:41:42 +0000 (18:41 +0100)]

examples : add complete parallel function calling example (#4974)

commit | commitdiff | tree

Georgi Gerganov [Tue, 16 Jan 2024 17:34:54 +0000 (19:34 +0200)]

perplexity : fix kv cache handling for hellaswag (#4981)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Tue, 16 Jan 2024 17:13:54 +0000 (19:13 +0200)]

flake.lock: update flake-parts, flake-parts/nixpkgs-lib, and nixpkgs (#4920)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/34fed993f1674c8d06d58b37ce1e0fe5eebcb9f5' (2023-12-01)
  → 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/e92039b55bcd58469325ded85d4f58dd5a4eaf58?dir=lib' (2023-11-29)
  → 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cfc3698c31b1fb9cdcf10f36c9643460264d0ca8' (2023-12-27)
  → 'github:NixOS/nixpkgs/317484b1ead87b9c1b8ac5261a8d2dd748a0492d' (2024-01-08)

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Paul Tsochantaris [Tue, 16 Jan 2024 17:05:19 +0000 (17:05 +0000)]

metal : localized logic in `ggml_metal_graph_compute` (#4924)

* Metal: Localized logic in `ggml_metal_graph_compute`, minor performance improvement

* Whitespace

* Collecting command buffer completions on single thread

* Whitespace

* Reduce diff noise

commit | commitdiff | tree

Neuman Vong [Tue, 16 Jan 2024 13:47:34 +0000 (00:47 +1100)]

android : introduce starter project example (#4926)

* Introduce starter project for Android

Based on examples/llama.swiftui.

* Add github workflow

* Set NDK version

* Only build arm64-v8a in CI

* Sync bench code

* Rename CI prop to skip-armeabi-v7a

* Remove unused tests

commit | commitdiff | tree

Alex Azarov [Tue, 16 Jan 2024 13:41:27 +0000 (14:41 +0100)]

metal : replace loop of dispatch_async with dispatch_apply (#4934)

* Replace loop of dispatch_async with dispatch_apply

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Alex Azarov [Tue, 16 Jan 2024 13:33:02 +0000 (14:33 +0100)]

metal : log `recommendedMaxWorkingSetSize` on iOS 16+ (#4936)

* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+

* Only log on iOS and macOS, ignoring tvOS and other platforms

* Check for Xcode version before using recommendedMaxWorkingSetSize

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Maximilian Winter [Tue, 16 Jan 2024 12:10:48 +0000 (13:10 +0100)]

examples : fix and improve docs for the grammar generator (#4909)

* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.

* Fixed some issues and bugs of the grammar generator. Improved Documentation

* Update pydantic_models_to_grammar.py

commit | commitdiff | tree

Justine Tunney [Tue, 16 Jan 2024 11:16:33 +0000 (03:16 -0800)]

ggml : introduce GGML_CALL function annotation (#4850)

This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.

This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms

commit | commitdiff | tree

Daniel Bevenius [Tue, 16 Jan 2024 11:14:19 +0000 (12:14 +0100)]

finetune : use LLAMA_FILE_MAGIC_GGLA (#4961)

This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

stduhpf [Tue, 16 Jan 2024 11:04:32 +0000 (12:04 +0100)]

speculative : threading options (#4959)

* speculative: expose draft threading

* fix usage format

* accept -td and -tbd args

* speculative: revert default behavior when -td is unspecified

* fix trailing whitespace

commit | commitdiff | tree

ngc92 [Mon, 15 Jan 2024 18:40:48 +0000 (20:40 +0200)]

pass cpu-architecture arguments only to host code (C;C++) (#4943)

commit | commitdiff | tree

David Friehs [Mon, 15 Jan 2024 13:06:52 +0000 (14:06 +0100)]

llama : apply classifier-free guidance to logits directly (#4951)

commit | commitdiff | tree

Victor Z. Peng [Mon, 15 Jan 2024 12:41:46 +0000 (04:41 -0800)]

awq-py : fix typo in awq-py/README.md (#4947)

commit | commitdiff | tree

Georgi Gerganov [Mon, 15 Jan 2024 11:27:00 +0000 (13:27 +0200)]

cuda : fix dequantize kernel names (#4938)

commit | commitdiff | tree

Kawrakow [Mon, 15 Jan 2024 08:09:38 +0000 (10:09 +0200)]

llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Kawrakow [Mon, 15 Jan 2024 05:48:06 +0000 (07:48 +0200)]

CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

David Pflug [Sun, 14 Jan 2024 15:46:00 +0000 (10:46 -0500)]

llama : fix missing quotes (#4937)

commit | commitdiff | tree

Kawrakow [Sun, 14 Jan 2024 14:21:12 +0000 (16:21 +0200)]

Add ability to use importance matrix for all k-quants (#4930)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 14 Jan 2024 11:26:53 +0000 (13:26 +0200)]

llama : check LLAMA_TRACE env for extra logging (#4929)

* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sun, 14 Jan 2024 09:08:09 +0000 (11:08 +0200)]

scripts : sync-ggml-am.sh option to skip commits

commit | commitdiff | tree

Georgi Gerganov [Sun, 14 Jan 2024 09:03:19 +0000 (11:03 +0200)]

llama : use LLAMA_LOG_ macros for logging

commit | commitdiff | tree

Kawrakow [Sun, 14 Jan 2024 08:53:39 +0000 (10:53 +0200)]

Fix ffn_down quantization mix for MoE models (#4927)

* Fix ffn_down quantization mix for MoE models

In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with a different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).

* Fix the fix

* Review suggestion
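
The failure mode described in the first note can be reproduced with a toy model. The sketch below is hypothetical throughout: `more_bits` is a simplified stand-in for the "every third tensor" rule, and `assign_types` is not the llama.cpp quantization code. It only shows why indexing the rule by a running tensor counter mixes types within a layer once each layer holds several expert tensors, and why indexing by layer does not.

```python
from typing import List

def more_bits(index: int) -> bool:
    """Simplified stand-in for the 'every third tensor gets more bits' rule."""
    return index % 3 == 2

def assign_types(n_layer: int, n_expert: int, per_layer: bool) -> List[List[str]]:
    """Return, for each layer, the type chosen for each expert's ffn_down tensor."""
    types = []
    counter = 0  # running count of ffn_down tensors seen so far
    for layer in range(n_layer):
        row = []
        for _ in range(n_expert):
            # per_layer=False: rule keyed on the tensor counter (problematic for MoE)
            # per_layer=True:  rule keyed on the layer index (all experts agree)
            index = layer if per_layer else counter
            row.append("more_bits" if more_bits(index) else "base")
            counter += 1
        types.append(row)
    return types

def mixed_layers(types: List[List[str]]) -> List[int]:
    """Indices of layers whose experts do not all share one type."""
    return [i for i, row in enumerate(types) if len(set(row)) > 1]

by_counter = assign_types(n_layer=4, n_expert=8, per_layer=False)
by_layer = assign_types(n_layer=4, n_expert=8, per_layer=True)

print(mixed_layers(by_counter))  # every layer mixes types
print(mixed_layers(by_layer))    # no layer mixes types
```

With a single expert per layer the two indexing schemes coincide, which is presumably why the issue only surfaced for MoE models.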

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Alex Azarov [Sun, 14 Jan 2024 08:44:39 +0000 (09:44 +0100)]

metal : correctly set SIMD support flags on iOS (#4923)

* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS

commit | commitdiff | tree

Karthik Kumar Viswanathan [Sun, 14 Jan 2024 08:41:44 +0000 (00:41 -0800)]

llama : support WinXP build with MinGW 8.1.0 (#3419)

commit | commitdiff | tree

Kawrakow [Sun, 14 Jan 2024 07:45:56 +0000 (09:45 +0200)]

2-bit quantizations (#4897)

* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Kawrakow [Sun, 14 Jan 2024 07:44:30 +0000 (09:44 +0200)]

Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 22:14:46 +0000 (00:14 +0200)]

sync : ggml

commit | commitdiff | tree

Johannes Gäßler [Sat, 13 Jan 2024 20:41:37 +0000 (21:41 +0100)]

ggml: cache sin/cos for RoPE (#4908)

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 18:45:45 +0000 (20:45 +0200)]

metal : remove old API (#4919)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 17:31:26 +0000 (19:31 +0200)]

server : fix prompt caching with system prompt (#4914)

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 16:47:38 +0000 (18:47 +0200)]

llama : fix detokenization of non-special added-tokens (#4916)

Co-authored-by: goerch <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 16:46:37 +0000 (18:46 +0200)]

metal : disable log for loaded kernels (#4794)

commit | commitdiff | tree

David Friehs [Sat, 13 Jan 2024 16:29:43 +0000 (17:29 +0100)]

llama : minimize size used for state save/load (#4820)

* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.

* llama : only save and restore used logits

for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
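
The "around 62 MB" figure in the "only save and restore used logits" note matches the size of an `n_vocab * n_batch` logits buffer. A rough check, assuming a 32000-token vocabulary and 4-byte float logits (neither value is stated in the message), with the best case taken to be a single used row of logits:

```python
def logits_bytes(n_vocab: int, n_rows: int, bytes_per_logit: int = 4) -> int:
    """Size in bytes of a logits buffer holding n_rows rows of n_vocab values."""
    return n_vocab * n_rows * bytes_per_logit

N_VOCAB = 32000  # assumption: LLaMA-style vocabulary size
N_BATCH = 512    # batch size quoted in the commit message

reserved = logits_bytes(N_VOCAB, N_BATCH)  # full n_vocab * n_batch buffer
used = logits_bytes(N_VOCAB, 1)            # best case: one row actually used
saved_mib = (reserved - used) / 2**20

print(f"reserved = {reserved / 2**20:.1f} MiB, saved = {saved_mib:.1f} MiB")
```

Under these assumptions the saving is a little over 62 MiB per saved state.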

commit | commitdiff | tree

Someone [Sat, 13 Jan 2024 16:29:16 +0000 (16:29 +0000)]

workflows: unbreak nix-build-aarch64, and split it out (#4915)

The fix should be just the `sudo apt-get update`

commit | commitdiff | tree

Yann Follet [Sat, 13 Jan 2024 16:09:08 +0000 (00:09 +0800)]

main : add parameter --no-display-prompt (#4541)

* add the parameter --no-display-prompt; combined with --log-disable, it displays only the generated tokens

* remove empty line

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

texmex76 [Sat, 13 Jan 2024 16:06:20 +0000 (17:06 +0100)]

gguf : fix potential infinite for-loop (#4600)

Co-authored-by: Bernhard Gstrein <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 16:03:45 +0000 (18:03 +0200)]

metal : refactor kernel loading code (#4794)

* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineStatus calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci

commit | commitdiff | tree

Johannes Gäßler [Sat, 13 Jan 2024 14:52:53 +0000 (15:52 +0100)]

compare-llama-bench: tweak output format (#4910)

commit | commitdiff | tree

Ziad Ben Hadj-Alouane [Sat, 13 Jan 2024 14:20:46 +0000 (09:20 -0500)]

server : fix deadlock that occurs in multi-prompt scenarios (#4905)

* fix deadlock

* don't ruin all whitespace

commit | commitdiff | tree

makomk [Sat, 13 Jan 2024 14:16:11 +0000 (14:16 +0000)]

server : fix crash with multimodal models without BOS token (#4904)

commit | commitdiff | tree

Georgi Gerganov [Sat, 13 Jan 2024 11:44:37 +0000 (13:44 +0200)]

convert : update phi-2 to latest HF repo (#4903)

* convert : update phi-2 to latest HF repo

ggml-ci

* py : try to fix flake stuff

commit | commitdiff | tree

Georgi Gerganov [Fri, 12 Jan 2024 20:02:43 +0000 (22:02 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Fri, 12 Jan 2024 12:02:30 +0000 (14:02 +0200)]

ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)

* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix

commit | commitdiff | tree

slaren [Fri, 12 Jan 2024 19:38:34 +0000 (20:38 +0100)]

backend_sched : fix assignments

ggml-ci

commit | commitdiff | tree

Maximilian Winter [Fri, 12 Jan 2024 19:46:45 +0000 (20:46 +0100)]

examples : add pydantic models to GBNF grammar generator (#4883)

* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.

commit | commitdiff | tree

Johannes Gäßler [Fri, 12 Jan 2024 19:38:54 +0000 (20:38 +0100)]

CUDA: faster q8_0 -> f16 dequantization (#4895)

commit | commitdiff | tree

slaren [Fri, 12 Jan 2024 19:07:38 +0000 (20:07 +0100)]

llama : ggml-backend integration (#4766)

* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Johannes Gäßler <redacted>

commit | commitdiff | tree

Georgi Gerganov [Fri, 12 Jan 2024 18:54:12 +0000 (20:54 +0200)]

llama : remove redundant assert for StableLM (#4901)

commit | commitdiff | tree

Daniel Bevenius [Fri, 12 Jan 2024 17:54:53 +0000 (18:54 +0100)]

export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)

This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Zay [Fri, 12 Jan 2024 12:48:00 +0000 (05:48 -0700)]

llama.swiftui : update models layout (#4826)

* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout

commit | commitdiff | tree

Georgi Gerganov [Fri, 12 Jan 2024 12:33:21 +0000 (14:33 +0200)]

gitignore : imatrix

commit | commitdiff | tree

Johannes Gäßler [Fri, 12 Jan 2024 11:30:41 +0000 (12:30 +0100)]

CUDA: fix softmax compile for old CUDA versions (#4862)

commit | commitdiff | tree

Georgi Gerganov [Fri, 12 Jan 2024 11:10:19 +0000 (13:10 +0200)]

llama : fix typo "imp_embd" -> "inp_embd"

Packaging of ggml-org/llama.cpp
