git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
16 months agosync : ggml
Georgi Gerganov [Mon, 4 Mar 2024 08:40:04 +0000 (10:40 +0200)]
sync : ggml

16 months agoadd some new ops, fix some operators and add batch operations to certain operators...
leejet [Sun, 3 Mar 2024 12:23:52 +0000 (20:23 +0800)]
add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)

* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embedding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <redacted>
* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
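
A note on the new operators above: `ggml_timestep_embedding` computes the sinusoidal timestep embedding commonly used by diffusion models. A minimal CPU sketch of the idea (illustrative only, not the ggml kernel; layout and frequency spacing are assumptions):

```cpp
#include <cmath>
#include <vector>

// Sinusoidal timestep embedding: frequencies spaced geometrically down from 1
// to 1/max_period, cosines in the first half of the vector, sines in the second.
std::vector<float> timestep_embedding(float t, int dim, int max_period = 10000) {
    std::vector<float> out(dim, 0.0f);
    const int half = dim / 2;
    for (int i = 0; i < half; ++i) {
        const float freq = std::exp(-std::log((float) max_period) * i / half);
        out[i]        = std::cos(t * freq);
        out[half + i] = std::sin(t * freq);
    }
    return out;
}
```
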
16 months agocommon : use LLAMA_DEFAULT_SEED (#5855)
DAN™ [Mon, 4 Mar 2024 08:08:19 +0000 (03:08 -0500)]
common : use LLAMA_DEFAULT_SEED (#5855)

16 months agomain : support special tokens as reverse/anti prompt (#5847)
DAN™ [Mon, 4 Mar 2024 07:57:20 +0000 (02:57 -0500)]
main : support special tokens as reverse/anti prompt (#5847)

* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agocuda : fix data race in soft max (#5853)
slaren [Sun, 3 Mar 2024 13:26:18 +0000 (14:26 +0100)]
cuda : fix data race in soft max (#5853)

16 months agoreadme : add API changes section
Georgi Gerganov [Sun, 3 Mar 2024 10:44:03 +0000 (12:44 +0200)]
readme : add API changes section

16 months agollama : allow for user specified embedding pooling type (#5849)
Douglas Hanley [Sun, 3 Mar 2024 10:40:27 +0000 (04:40 -0600)]
llama : allow for user specified embedding pooling type (#5849)

* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agogguf-dump : support i-quants (#5841)
Nindaleth [Sun, 3 Mar 2024 08:43:42 +0000 (09:43 +0100)]
gguf-dump : support i-quants (#5841)

Co-authored-by: Black_Fox <redacted>
16 months agollama : fix llama_copy_state_data with fragmented KV cache (#5840)
compilade [Sun, 3 Mar 2024 08:41:55 +0000 (03:41 -0500)]
llama : fix llama_copy_state_data with fragmented KV cache (#5840)

The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
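
A rough sketch of the `llama_kv_cache_cell_max` idea described above (simplified, hypothetical cell type; the real logic lives in llama.cpp): it returns one past the last occupied cell, which is the correct bound for sizing saved state when the cache is fragmented, unlike `kv_self.head`.

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {        // simplified stand-in for the real KV cache cell
    int32_t pos = -1;   // -1 means the cell is empty
};

// One past the last non-empty cell; 0 for an empty cache. With a fragmented
// cache this can differ from `head`, so state rows must be sized from it.
static uint32_t kv_cache_cell_max(const std::vector<kv_cell> & cells) {
    for (uint32_t i = (uint32_t) cells.size(); i > 0; --i) {
        if (cells[i - 1].pos >= 0) {
            return i;
        }
    }
    return 0;
}
```
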

16 months agoci : schedule slow server tests only on Release or on demand (#5839)
Pierrick Hymbert [Sun, 3 Mar 2024 08:35:23 +0000 (09:35 +0100)]
ci : schedule slow server tests only on Release or on demand (#5839)

16 months agoserver : init http requests thread pool with --parallel if set (#5836)
Pierrick Hymbert [Sun, 3 Mar 2024 07:48:36 +0000 (08:48 +0100)]
server : init http requests thread pool with --parallel if set (#5836)

16 months agoflake.lock: Update (#5842)
Georgi Gerganov [Sun, 3 Mar 2024 04:11:31 +0000 (06:11 +0200)]
flake.lock: Update (#5842)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <redacted>
16 months agoserver: tests: passkey challenge / self-extend with context shift demo (#5832)
Pierrick Hymbert [Sat, 2 Mar 2024 21:00:14 +0000 (22:00 +0100)]
server: tests: passkey challenge / self-extend with context shift demo (#5832)

* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test

16 months agollama : add abort_callback to interrupt computation (#5409)
Michael Podvitskiy [Sat, 2 Mar 2024 19:52:25 +0000 (20:52 +0100)]
llama : add abort_callback to interrupt computation (#5409)

* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <redacted>
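
For context, the ggml abort callback is a plain predicate: returning true interrupts the current graph computation. A minimal sketch of how a caller might wire one up (the callback shape is assumed from ggml; the registration API added by this PR is not reproduced here):

```cpp
#include <atomic>

// Set from a signal handler or watchdog thread to request cancellation.
static std::atomic<bool> g_should_stop{false};

// ggml-style abort callback: returning true stops the computation at the next check.
static bool abort_if_requested(void * /*user_data*/) {
    return g_should_stop.load(std::memory_order_relaxed);
}
```
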
16 months agoggml : fix IQ3_S AVX implementation (#5834)
Georgi Gerganov [Sat, 2 Mar 2024 18:00:49 +0000 (20:00 +0200)]
ggml : fix IQ3_S AVX implementation (#5834)

ggml-ci

16 months agoconvert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)
Jared Van Bortel [Sat, 2 Mar 2024 17:27:26 +0000 (12:27 -0500)]
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)

16 months agoconvert-hf : make model class definitions self-contained (#5825)
Jared Van Bortel [Sat, 2 Mar 2024 17:21:47 +0000 (12:21 -0500)]
convert-hf : make model class definitions self-contained (#5825)

16 months agoggml : IQ3_S improvements (#5829)
Kawrakow [Sat, 2 Mar 2024 15:00:51 +0000 (17:00 +0200)]
ggml : IQ3_S improvements (#5829)

* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <redacted>
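
The "xor - sub trick" referenced above is the branch-free way to apply a sign to a signed 8-bit value: with a mask m equal to 0 or -1, (x ^ m) - m yields x or -x. A scalar illustration of both variants (the SIMD kernels do the same thing lane-wise):

```cpp
#include <cstdint>

// m == 0  -> (x ^ 0) - 0  ==  x
// m == -1 -> (x ^ -1) + 1 == -x   (two's complement negation)
static int8_t apply_sign_xor_sub(int8_t x, int8_t m) {
    return (int8_t) ((x ^ m) - m);
}

// The alternative the ARM_NEON path switched to: multiply by +1/-1 directly
// (vmulq_s8 per lane). Same result; faster on NEON, slower than xor-sub on AVX2.
static int8_t apply_sign_mul(int8_t x, int8_t s) {   // s is +1 or -1
    return (int8_t) (x * s);
}
```
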
16 months agoscripts : add pod-llama.sh
Georgi Gerganov [Sat, 2 Mar 2024 14:54:08 +0000 (16:54 +0200)]
scripts : add pod-llama.sh

16 months agollama : refactor internal quantization functions (#5830)
Xuan Son Nguyen [Sat, 2 Mar 2024 14:19:09 +0000 (15:19 +0100)]
llama : refactor internal quantization functions (#5830)

16 months agollama : fix segfault from unknown model arch name (#5820)
compilade [Sat, 2 Mar 2024 13:42:56 +0000 (08:42 -0500)]
llama : fix segfault from unknown model arch name (#5820)

* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <redacted>
* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <redacted>
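
A minimal illustration of the `std::map::at` point above: `operator[]` may insert and therefore has no const overload, while `at` works on a const map and throws for unknown keys, which is why the unknown architecture needs an explicit "(unknown)" entry. Names here are illustrative, not the actual tables:

```cpp
#include <map>
#include <stdexcept>
#include <string>

static const std::map<std::string, int> ARCH_EXAMPLE = {
    { "llama",     1 },
    { "(unknown)", 0 },   // explicit entry so unknown arches have a fallback
};

int arch_from_name(const std::string & name) {
    // ARCH_EXAMPLE[name];            // does not compile: operator[] is not const
    try {
        return ARCH_EXAMPLE.at(name);
    } catch (const std::out_of_range &) {
        return ARCH_EXAMPLE.at("(unknown)");
    }
}
```
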
16 months agoSupport multiple GPUs (split mode) on SYCL backend (#5806)
Neo Zhang Jianyu [Sat, 2 Mar 2024 11:49:30 +0000 (19:49 +0800)]
Support multiple GPUs (split mode) on SYCL backend (#5806)

* support multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test

* update news

* fix merge error

* update according to review comments

16 months agoworkflows : remove nocleanup arg for check-requirements.sh (#5826)
crasm [Sat, 2 Mar 2024 05:11:06 +0000 (00:11 -0500)]
workflows : remove nocleanup arg for check-requirements.sh (#5826)

Reduces peak tmpfs usage and should prevent the check from failing from
running out of space.

Fixes the 'No space left on device' issue mentioned in #5703.

16 months agobuild(nix): Introduce flake.formatter for `nix fmt` (#5687)
Tushar [Fri, 1 Mar 2024 23:18:26 +0000 (04:48 +0530)]
build(nix): Introduce flake.formatter for `nix fmt` (#5687)

* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style

16 months agoconvert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)
nold [Fri, 1 Mar 2024 21:51:12 +0000 (22:51 +0100)]
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)

16 months agollama : add StarCoder2 support (#5795)
Sourab Mangrulkar [Fri, 1 Mar 2024 19:30:46 +0000 (01:00 +0530)]
llama : add StarCoder2 support (#5795)

* Add support for starcoder2

* handle rope type

* skip rope freq and rotary embeddings from being serialized

* resolve comments

* Update llama.cpp

* remove redundant changes

* handle `rope-theta`

* llama : change starcoder2 rope type

* address comment

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agoserver : remove api_like_OAI.py proxy script (#5808)
Georgi Gerganov [Fri, 1 Mar 2024 18:00:58 +0000 (20:00 +0200)]
server : remove api_like_OAI.py proxy script (#5808)

16 months agoggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)
ddpasa [Fri, 1 Mar 2024 17:00:00 +0000 (18:00 +0100)]
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)

16 months agogemma : fix bfloat16 -> float16 conversion issue (#5810)
kunal-vaishnavi [Fri, 1 Mar 2024 14:08:08 +0000 (06:08 -0800)]
gemma : fix bfloat16 -> float16 conversion issue (#5810)

16 months agocommon : fix flag `--logits-all` to `--all-logits` (#5805)
Miwa / Ensan [Fri, 1 Mar 2024 13:48:56 +0000 (22:48 +0900)]
common : fix flag `--logits-all` to `--all-logits` (#5805)

16 months agollama : cleanup unused mmq flags (#5772)
Pierrick Hymbert [Fri, 1 Mar 2024 11:39:06 +0000 (12:39 +0100)]
llama : cleanup unused mmq flags (#5772)

* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q

* remove: mul_mat_q in compare llama bench and usage

* update llama-bench

---------

Co-authored-by: slaren <redacted>
16 months agounicode : switch to multimap based nfd_map (#5799)
Douglas Hanley [Fri, 1 Mar 2024 09:15:36 +0000 (03:15 -0600)]
unicode : switch to multimap based nfd_map (#5799)

* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* don't construct new locale every time
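
A sketch of what a multimap-based NFD table looks like (hypothetical entries; the real tables are generated in the unicode code): one source codepoint maps to several decomposed codepoints, and `equal_range` walks the decomposition without needing a map of vectors:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// codepoint -> canonical decomposition, one multimap entry per output codepoint
static const std::multimap<uint32_t, uint32_t> nfd_map_example = {
    { 0x00E9, 0x0065 }, { 0x00E9, 0x0301 },   // 'é' -> 'e' + combining acute
    { 0x00F1, 0x006E }, { 0x00F1, 0x0303 },   // 'ñ' -> 'n' + combining tilde
};

// Decompose a codepoint sequence; unmapped codepoints pass through unchanged.
std::vector<uint32_t> to_nfd(const std::vector<uint32_t> & cps) {
    std::vector<uint32_t> out;
    for (uint32_t cp : cps) {
        auto range = nfd_map_example.equal_range(cp);
        if (range.first == range.second) {
            out.push_back(cp);
        } else {
            for (auto it = range.first; it != range.second; ++it) {
                out.push_back(it->second);
            }
        }
    }
    return out;
}
```
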

16 months agoserver: allow to override threads server pool with --threads-http (#5794)
Pierrick Hymbert [Fri, 1 Mar 2024 09:08:08 +0000 (10:08 +0100)]
server: allow to override threads server pool with --threads-http (#5794)

16 months agoci : add Ubuntu 22 Vulkan CI run (#5789)
Eve [Fri, 1 Mar 2024 08:54:53 +0000 (08:54 +0000)]
ci : add Ubuntu 22 Vulkan CI run (#5789)

16 months agoserver : fix newlines in help (#5785)
Georgi Gerganov [Fri, 1 Mar 2024 07:59:43 +0000 (09:59 +0200)]
server : fix newlines in help (#5785)

16 months ago[SYCL] Use batched mul_mat pathway (#5591)
AidanBeltonS [Fri, 1 Mar 2024 07:36:47 +0000 (07:36 +0000)]
[SYCL] Use batched mul_mat pathway (#5591)

* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <redacted>
16 months agoServer: normalize naming (#5779)
Xuan Son Nguyen [Thu, 29 Feb 2024 20:42:11 +0000 (21:42 +0100)]
Server: normalize naming (#5779)

* server: normalize naming

* fix spacing

16 months agollama : constified `llama_set_state_data`'s `src` (#5774)
Marcus Dunn [Thu, 29 Feb 2024 08:17:23 +0000 (00:17 -0800)]
llama : constified `llama_set_state_data`'s `src` (#5774)

16 months agoci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
Georgi Gerganov [Wed, 28 Feb 2024 19:44:21 +0000 (21:44 +0200)]
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)

ggml-ci

16 months agomake portability_enumeration_ext apple only (#5757)
Eve [Wed, 28 Feb 2024 19:33:37 +0000 (19:33 +0000)]
make portability_enumeration_ext apple only (#5757)

16 months agollama : remove deprecated API (#5770)
Georgi Gerganov [Wed, 28 Feb 2024 16:43:38 +0000 (18:43 +0200)]
llama : remove deprecated API (#5770)

ggml-ci

16 months agoawq-py : remove (#5768)
Georgi Gerganov [Wed, 28 Feb 2024 15:36:53 +0000 (17:36 +0200)]
awq-py : remove (#5768)

16 months agosync : ggml
Georgi Gerganov [Wed, 28 Feb 2024 09:17:32 +0000 (11:17 +0200)]
sync : ggml

16 months agoadd google magika inference example (ggml/748)
slaren [Sun, 25 Feb 2024 19:41:35 +0000 (20:41 +0100)]
add google magika inference example (ggml/748)

* add magika inference example

* ggml : fix unaligned accesses in custom ops

* ggml : fix FP32 GELU for values that exceed the FP16 range

* use ggml_pool_1d

* add README

* Update README.md

* pad inputs if the files are too small

* cleanup

ggml-ci

16 months agoIntroduce backend GUIDs (ggml/743)
UEXTM.com [Sat, 24 Feb 2024 16:27:36 +0000 (11:27 -0500)]
Introduce backend GUIDs (ggml/743)

* Introduce backend GUIDs

Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)

Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID

* Remove redundant functions

Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion

* Add spaces to match style

Co-authored-by: slaren <redacted>
* Fix brace style to match

Co-authored-by: slaren <redacted>
* Add void to () in function signature

Co-authored-by: slaren <redacted>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid

* add guids to all backends

ggml-ci

---------

Co-authored-by: slaren <redacted>
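
The GUID mechanism boils down to a fixed 16-byte identifier compared bytewise, so "is this the CPU backend?" becomes an identity check instead of a name comparison. A small illustration under that assumption (types, names, and the GUID value below are made up):

```cpp
#include <cstdint>
#include <cstring>

using backend_guid = uint8_t[16];

// Arbitrary example value; each backend would carry its own constant GUID.
static const backend_guid CPU_GUID_EXAMPLE = {
    0xaa, 0x67, 0xc7, 0x43, 0x96, 0xe6, 0xa3, 0x8a,
    0xe3, 0xaf, 0xea, 0x92, 0x36, 0xbc, 0xfc, 0x89,
};

static bool guid_matches(const uint8_t * a, const uint8_t * b) {
    return std::memcmp(a, b, sizeof(backend_guid)) == 0;
}
```
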
16 months agoserver : hit Ctrl+C twice to exit (#5734)
Xuan Son Nguyen [Wed, 28 Feb 2024 08:55:37 +0000 (09:55 +0100)]
server : hit Ctrl+C twice to exit (#5734)

* server: twice ctrl+C to exit

* std::atomic_flag

* sigint: message

* sigint: stderr

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
16 months agollama : fix non-quantization of expert gating tensors (#5754)
compilade [Wed, 28 Feb 2024 08:52:56 +0000 (03:52 -0500)]
llama : fix non-quantization of expert gating tensors (#5754)

This reverts a single line from #5475

16 months agollama : improve BERT tokenization (#5740)
Douglas Hanley [Wed, 28 Feb 2024 08:51:11 +0000 (02:51 -0600)]
llama : improve BERT tokenization (#5740)

* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
16 months agoreadme : add link to LLaVA 1.6 models (#5758)
Daniel Bevenius [Wed, 28 Feb 2024 08:39:39 +0000 (09:39 +0100)]
readme : add link to LLaVA 1.6 models (#5758)

Signed-off-by: Daniel Bevenius <redacted>
16 months agoserver : add "/chat/completions" alias for "/v1/..." (#5722)
Jorge A [Wed, 28 Feb 2024 08:39:15 +0000 (01:39 -0700)]
server : add "/chat/completions" alias for "/v1/..." (#5722)

* Add "/chat/completions" as alias for "/v1/chat/completions"

* merge to upstream master

* minor : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agoggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)
Kawrakow [Wed, 28 Feb 2024 08:37:02 +0000 (10:37 +0200)]
ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)

* WIP: make i-quants work for QK_K = 64

* iq2_xs: attempt to fix AVX dot product for QK_K = 64

Tests pass, but I get gibberish.

* QK_K = 64 tests pass on ARM_NEON and Metal

Sadly, that does not mean it actually works.

* Make CUDA compile with QK_K = 64

Tests don't pass, plus we get misaligned access

* Q2_K: fixed bug in imatrix quantization for QK_K = 64

* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months agoAttempt to fix android build (#5752)
Kawrakow [Tue, 27 Feb 2024 17:16:49 +0000 (19:16 +0200)]
Attempt to fix android build (#5752)

Co-authored-by: Iwan Kawrakow <redacted>
16 months agoIQ4_XS: a 4.25 bpw quantization (#5747)
Kawrakow [Tue, 27 Feb 2024 14:34:24 +0000 (16:34 +0200)]
IQ4_XS: a 4.25 bpw quantization (#5747)

* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <redacted>
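
One way the 4.25 bpw figure works out, assuming the layout implied above (4-bit quants, a 6-bit scale per 32-weight block inside a 256-weight super-block, plus one fp16 super-block scale):

```cpp
// Bits per weight for a 256-weight IQ4_XS super-block under the assumed layout.
constexpr int weights    = 256;
constexpr int bits_quant = weights * 4;        // 4-bit non-linear quants
constexpr int bits_scale = (256 / 32) * 6;     // 6-bit scale per block of 32
constexpr int bits_d     = 16;                 // one fp16 super-block scale
static_assert(bits_quant + bits_scale + bits_d == 1088, "1088 / 256 = 4.25 bpw");
```
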
16 months agocuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)
Engininja2 [Tue, 27 Feb 2024 13:22:45 +0000 (07:22 -0600)]
cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)

16 months agoggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)
Engininja2 [Tue, 27 Feb 2024 12:50:18 +0000 (06:50 -0600)]
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)

16 months agollama : fix defrag bugs + add parameter (#5735)
Georgi Gerganov [Tue, 27 Feb 2024 12:35:51 +0000 (14:35 +0200)]
llama : fix defrag bugs + add parameter (#5735)

* llama : fix defrag bugs + enable by default

ggml-ci

* llama : add defrag_thold parameter

ggml-ci

* llama : cont

* llama : disable log message

ggml-ci

* llama : fix graph size check during defrag

16 months agoMakefile: use variables for cublas (#5689)
le.chang [Tue, 27 Feb 2024 02:03:06 +0000 (10:03 +0800)]
Makefile: use variables for cublas (#5689)

* make: use arch variable for cublas

* fix UNAME_M

* check opt first

---------

Co-authored-by: lindeer <redacted>
16 months agofix server hangs on empty prompt (#5733)
Xuan Son Nguyen [Mon, 26 Feb 2024 22:15:48 +0000 (23:15 +0100)]
fix server hangs on empty prompt (#5733)

16 months agoAdding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)
Kawrakow [Mon, 26 Feb 2024 16:28:38 +0000 (18:28 +0200)]
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)

* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>
16 months agoCUDA: fix DEBUG_CUDA_MALLOC (#5729)
Johannes Gäßler [Mon, 26 Feb 2024 14:36:38 +0000 (15:36 +0100)]
CUDA: fix DEBUG_CUDA_MALLOC (#5729)

16 months agoreadme : update ui list (#5731)
Artem [Mon, 26 Feb 2024 14:15:28 +0000 (17:15 +0300)]
readme : update ui list (#5731)

* Add LLMFarm (ui for iOS) to list

16 months ago[SYCL] Add support for soft_max ALiBi (#5639)
AidanBeltonS [Mon, 26 Feb 2024 14:02:11 +0000 (14:02 +0000)]
[SYCL] Add support for soft_max ALiBi (#5639)

* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <redacted>
16 months agounicode : reuse iterator (#5726)
Georgi Gerganov [Mon, 26 Feb 2024 12:02:12 +0000 (14:02 +0200)]
unicode : reuse iterator (#5726)

16 months agoserver: CI fix trailing space (#5728)
Pierrick Hymbert [Mon, 26 Feb 2024 10:41:34 +0000 (11:41 +0100)]
server: CI fix trailing space (#5728)

16 months agoserver: CI tests reduce build matrix (#5725)
Pierrick Hymbert [Mon, 26 Feb 2024 08:56:10 +0000 (09:56 +0100)]
server: CI tests reduce build matrix (#5725)

16 months agollama : fix Gemma rope type (#5691)
Georgi Gerganov [Mon, 26 Feb 2024 06:30:17 +0000 (08:30 +0200)]
llama : fix Gemma rope type (#5691)

16 months agoflake.lock: Update
github-actions[bot] [Sun, 25 Feb 2024 00:17:11 +0000 (00:17 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)

16 months agoserver: tests - slow inference causes timeout on the CI (#5715)
Pierrick Hymbert [Sun, 25 Feb 2024 21:48:33 +0000 (22:48 +0100)]
server: tests - slow inference causes timeout on the CI (#5715)

* server: tests - longer inference timeout for CI

16 months agoserver: docs - refresh and tease a little bit more the http server (#5718)
Pierrick Hymbert [Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)]
server: docs - refresh and tease a little bit more the http server (#5718)

* server: docs - refresh and tease a little bit more the http server

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update README.md

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agollama : refactor k-shift implementation + KV defragmentation (#5691)
Georgi Gerganov [Sun, 25 Feb 2024 20:12:24 +0000 (22:12 +0200)]
llama : refactor k-shift implementation + KV defragmentation (#5691)

* llama : refactor k-shift implementation

ggml-ci

* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add

* llama : cont k-shift refactoring + normalize type names

ggml-ci

* minor : fix MPI builds

* llama : reuse n_rot from the build context

ggml-ci

* llama : revert enum name changes from this PR

ggml-ci

* llama : update llama_rope_type

* llama : add comment about rope values

* llama : fix build

* passkey : apply kv cache updates explicitly

ggml-ci

* llama : change name to llama_kv_cache_update()

* llama : add llama_kv_cache_seq_pos_max()

* passkey : fix llama_kv_cache_seq_pos_max() usage

* llama : some llama_kv_cell simplifications

* llama : add llama_kv_cache_compress (EXPERIMENTAL)

* llama : add alternative KV cache merging (EXPERIMENTAL)

* llama : add llama_kv_cache_defrag

* llama : comments

* llama : remove llama_kv_cache_compress

will add in a separate PR

ggml-ci

* llama : defragment via non-overlapping moves

* llama : ggml_graph based defrag implementation

ggml-ci

* llama : switch the loop order in build_defrag

* llama : add comments
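
A very rough sketch of the "non-overlapping moves" defrag idea (simplified cell model, illustrative only): occupied cells are compacted towards the front, and each planned move is split so that its source and destination ranges never overlap, which lets the moves be expressed as plain copies in a compute graph:

```cpp
#include <cstdint>
#include <vector>

struct kv_cell { int32_t pos = -1; };       // -1 == empty (simplified)
struct kv_move { uint32_t src, dst, len; }; // copy cells [src, src+len) -> [dst, dst+len)

std::vector<kv_move> plan_defrag(const std::vector<kv_cell> & cells) {
    std::vector<kv_move> moves;
    uint32_t dst = 0;
    for (uint32_t src = 0; src < cells.size(); ++src) {
        if (cells[src].pos < 0) continue;   // hole, skip
        if (src != dst) {
            // extend the previous move while it stays contiguous AND the grown
            // destination range still ends before its source range begins;
            // otherwise start a fresh single-cell move
            if (!moves.empty() &&
                moves.back().src + moves.back().len == src &&
                moves.back().dst + moves.back().len == dst &&
                dst + 1 <= moves.back().src) {
                moves.back().len++;
            } else {
                moves.push_back({src, dst, 1});
            }
        }
        dst++;
    }
    return moves;
}
```
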

16 months agoserver : fix crash when system prompt is bigger than batch size (#5714)
compilade [Sun, 25 Feb 2024 18:43:50 +0000 (13:43 -0500)]
server : fix crash when system prompt is bigger than batch size (#5714)

The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache

The tokens right after the matching part would otherwise skip a pos value.
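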

16 months agoggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)
Radosław Gryta [Sun, 25 Feb 2024 18:43:00 +0000 (19:43 +0100)]
ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)

* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility

vqtbl1q_u8 is not part of arm v7 neon library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build
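
The missing intrinsic can be emulated on 32-bit NEON by splitting the 16-byte table into two 8-byte halves and using `vtbl2_u8`. A sketch of such a shim (assumed shape, not copied from the repository):

```cpp
#if defined(__ARM_NEON) && !defined(__aarch64__)
#include <arm_neon.h>

// armv7 has no vqtbl1q_u8: emulate the full 128-bit table lookup with two
// vtbl2_u8 lookups over the low/high halves of the table.
static inline uint8x16_t ggml_vqtbl1q_u8(uint8x16_t a, uint8x16_t b) {
    uint8x8x2_t tbl = { { vget_low_u8(a), vget_high_u8(a) } };
    return vcombine_u8(vtbl2_u8(tbl, vget_low_u8(b)),
                       vtbl2_u8(tbl, vget_high_u8(b)));
}
#endif
```
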

16 months agomake : fix nvcc version is empty (#5713)
kwin1412 [Sun, 25 Feb 2024 16:46:49 +0000 (00:46 +0800)]
make : fix nvcc version is empty (#5713)

fix nvcc version is empty

16 months agoreadme : add Msty to UI list (#5618)
Ashok Gelal [Sun, 25 Feb 2024 15:57:34 +0000 (10:57 -0500)]
readme : add Msty to UI list (#5618)

16 months agoserver: logs - unified format and --log-format option (#5700)
Pierrick Hymbert [Sun, 25 Feb 2024 12:50:32 +0000 (13:50 +0100)]
server: logs - unified format and --log-format option (#5700)

* server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs switch init logs to server logs macro

* server: logs ensure json value does not raise error

* server: logs reduce level VERBOSE to VERB to max 4 chars

* server: logs lower case as other log messages

* server: logs avoid static in general

Co-authored-by: Georgi Gerganov <redacted>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agoserver: concurrency fix + monitoring - add /metrics prometheus compatible endpoint...
Pierrick Hymbert [Sun, 25 Feb 2024 12:49:43 +0000 (13:49 +0100)]
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)

* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue, when 2 tasks are waiting for results, only one calling thread is notified

* server: metrics - move to a dedicated struct
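
The concurrency fix described above is the classic notify_one vs notify_all situation: when several callers each wait for their own task result, waking a single thread may wake the wrong one. A minimal illustration, unrelated to the server's actual types:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>

struct task_results {
    std::mutex mtx;
    std::condition_variable cv;
    std::map<int, int> results;   // task_id -> result

    void push(int task_id, int r) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            results[task_id] = r;
        }
        // notify_one() may wake a thread waiting for a *different* task_id, which
        // rechecks its predicate and goes back to sleep while the right waiter
        // stays blocked; notify_all() lets every waiter recheck.
        cv.notify_all();
    }

    int wait_for(int task_id) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return results.count(task_id) > 0; });
        const int r = results[task_id];
        results.erase(task_id);
        return r;
    }
};
```
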

16 months agocmake : fix compilation for Android armeabi-v7a (#5702)
Radosław Gryta [Sun, 25 Feb 2024 10:53:11 +0000 (11:53 +0100)]
cmake : fix compilation for Android armeabi-v7a (#5702)

16 months agocode : normalize enum names (#5697)
Georgi Gerganov [Sun, 25 Feb 2024 10:09:09 +0000 (12:09 +0200)]
code : normalize enum names (#5697)

* code : normalize enum names

ggml-ci

* code : cont

* code : cont

16 months agopy : fix StableLM conversion after config.json changes (#5703)
Anas Ahouzi [Sun, 25 Feb 2024 09:54:04 +0000 (10:54 +0100)]
py : fix StableLM conversion after config.json changes (#5703)

* Fix issues during StableLM models conversion

* Fix hard coded layer_norm_eps

* Support layer_norm_eps for LlavaStableLM

Co-authored-by: Jared Van Bortel <redacted>
* Add missing parenthesis

Co-authored-by: Jared Van Bortel <redacted>
* Support rotary_factor for LlavaStableLM

Co-authored-by: Jared Van Bortel <redacted>
* fix typo

* Add StableLMEpochForCausalLM for safety

Co-authored-by: compilade <redacted>
* Add StableLMEpochForCausalLM for safety 2

Co-authored-by: compilade <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: compilade <redacted>
16 months agoserver: continue to update other slots on embedding concurrent request (#5699)
Pierrick Hymbert [Sat, 24 Feb 2024 18:16:04 +0000 (19:16 +0100)]
server: continue to update other slots on embedding concurrent request (#5699)

* server: #5655 - continue to update other slots on embedding concurrent request.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs

16 months agoIQ3_S: a much better alternative to Q3_K (#5676)
Kawrakow [Sat, 24 Feb 2024 14:23:52 +0000 (16:23 +0200)]
IQ3_S: a much better alternative to Q3_K (#5676)

* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasant surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <redacted>
16 months agoserver: init functional tests (#5566)
Pierrick Hymbert [Sat, 24 Feb 2024 11:28:55 +0000 (12:28 +0100)]
server: init functional tests (#5566)

* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <redacted>
16 months agoserver : add KV cache quantization options (#5684)
AlpinDale [Fri, 23 Feb 2024 19:31:54 +0000 (19:31 +0000)]
server : add KV cache quantization options (#5684)

16 months agoconvert : fix missing ftype for gemma (#5690)
Jared Van Bortel [Fri, 23 Feb 2024 18:39:14 +0000 (13:39 -0500)]
convert : fix missing ftype for gemma (#5690)

16 months agompt : do not duplicate token_embd.weight on disk (#5670)
Jared Van Bortel [Thu, 22 Feb 2024 22:05:23 +0000 (17:05 -0500)]
mpt : do not duplicate token_embd.weight on disk (#5670)

16 months agogemma : use more bits for the token_embd.weight tensor (#5650)
Georgi Gerganov [Thu, 22 Feb 2024 21:23:46 +0000 (23:23 +0200)]
gemma : use more bits for the token_embd.weight tensor (#5650)

* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type

16 months agopy : add Gemma conversion from HF models (#5647)
Georgi Gerganov [Thu, 22 Feb 2024 21:22:48 +0000 (23:22 +0200)]
py : add Gemma conversion from HF models (#5647)

* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Aarni Koskela <redacted>
Co-authored-by: Jared Van Bortel <redacted>
16 months agoggml : always define ggml_fp16_t as uint16_t (#5666)
Georgi Gerganov [Thu, 22 Feb 2024 21:21:39 +0000 (23:21 +0200)]
ggml : always define ggml_fp16_t as uint16_t (#5666)

* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
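
Defining `ggml_fp16_t` as `uint16_t` everywhere means the storage type no longer depends on compiler support for a native half type, and conversions become explicit bit manipulation. A self-contained sketch of such an FP16 -> FP32 conversion (not the ggml implementation, which uses lookup tables and intrinsics where available):

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_fp16_t;   // plain 16-bit storage, no __fp16/_Float16 required

static float fp16_to_fp32(ggml_fp16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant =  h        & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {                       // inf / NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp == 0 && mant == 0) {       // +/- zero
        bits = sign;
    } else if (exp == 0) {                    // subnormal: renormalize the mantissa
        int shift = 0;
        while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
        mant &= 0x3FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (mant << 13);
    } else {                                  // normal: rebias exponent (127 - 15 = 112)
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));        // type-pun via memcpy, no UB
    return f;
}
```
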

16 months agosync : ggml
Georgi Gerganov [Thu, 22 Feb 2024 21:21:05 +0000 (23:21 +0200)]
sync : ggml

16 months agoggml : 32-bit arm compat (whisper/1891)
Georgi Gerganov [Thu, 22 Feb 2024 16:31:40 +0000 (18:31 +0200)]
ggml : 32-bit arm compat (whisper/1891)

* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont

16 months agonix: init singularity and docker images (#5056)
Someone [Thu, 22 Feb 2024 19:44:10 +0000 (19:44 +0000)]
nix: init singularity and docker images (#5056)

Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.

16 months agopy : minor fixes (#5668)
Georgi Gerganov [Thu, 22 Feb 2024 18:13:25 +0000 (20:13 +0200)]
py : minor fixes (#5668)

16 months agoAdd Gemma chat template (#5665)
Xuan Son Nguyen [Thu, 22 Feb 2024 18:10:21 +0000 (19:10 +0100)]
Add Gemma chat template (#5665)

* add gemma chat template

* gemma: only apply system_prompt on non-model message
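
For reference, the Gemma format as applied here uses "user"/"model" turns, and because there is no system role, a system prompt is folded into the first non-model turn. A rough rendering sketch (template details are an assumption, not the exact string from the repo):

```cpp
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

std::string format_gemma(const std::vector<chat_msg> & msgs) {
    std::string out, system_prompt;
    for (const auto & m : msgs) {
        if (m.role == "system") { system_prompt += m.content; continue; }
        const std::string role = (m.role == "assistant") ? "model" : m.role;
        out += "<start_of_turn>" + role + "\n";
        if (!system_prompt.empty() && role != "model") {
            out += system_prompt + "\n\n";    // apply system prompt on a non-model turn
            system_prompt.clear();
        }
        out += m.content + "<end_of_turn>\n";
    }
    out += "<start_of_turn>model\n";          // cue the model for its reply
    return out;
}
```
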

16 months agoworkflows: nix: hardcode cachix ids, build unconditionally (#5663)
Someone [Thu, 22 Feb 2024 16:32:09 +0000 (16:32 +0000)]
workflows: nix: hardcode cachix ids, build unconditionally (#5663)

GitHub does not expose environment and repository variables to PRs coming from forks, which means we have effectively been disabling the Nix CI actions for most PRs.

The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.

16 months agominor : fix trailing whitespace (#5638)
Georgi Gerganov [Thu, 22 Feb 2024 11:54:03 +0000 (13:54 +0200)]
minor : fix trailing whitespace (#5638)

16 months agoreadme : update hot topics
Georgi Gerganov [Thu, 22 Feb 2024 08:35:54 +0000 (10:35 +0200)]
readme : update hot topics

16 months agoserver : fallback to chatml, add AlphaMonarch chat template (#5628)
Xuan Son Nguyen [Thu, 22 Feb 2024 08:33:24 +0000 (09:33 +0100)]
server : fallback to chatml, add AlphaMonarch chat template (#5628)

* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO

16 months agoserver : clarify some params in the docs (#5640)
Alexey Parfenov [Thu, 22 Feb 2024 08:27:32 +0000 (08:27 +0000)]
server : clarify some params in the docs (#5640)

16 months agompt : add optional bias tensors (#5638)
Dat Quoc Nguyen [Thu, 22 Feb 2024 08:15:13 +0000 (18:15 +1000)]
mpt : add optional bias tensors (#5638)

Update for MPT with optional bias parameters: to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.

16 months agollama : fix loading models with shared tok_embd and output (#5651)
slaren [Wed, 21 Feb 2024 23:42:09 +0000 (00:42 +0100)]
llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci