git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
15 months ago server tests : more pythonic process management; fix bare `except:` (#6146)
Jared Van Bortel [Wed, 20 Mar 2024 05:33:49 +0000 (01:33 -0400)]
server tests : more pythonic process management; fix bare `except:` (#6146)

* server tests : remove seemingly redundant newlines in print()

* server tests : use built-in subprocess features, not os.kill and psutil

* server tests : do not catch e.g. SystemExit; use print_exc

* server tests: handle TimeoutExpired exception

* server tests: fix connect on dual-stack systems

* server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127)

* server: tests: remove the hack on windows since now we get the good socket family

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

---------

Co-authored-by: Pierrick HYMBERT <redacted>
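
The process handling described in the bullets above can be sketched as follows. This is a minimal illustration, not the test suite's actual code: `stop_server` is a hypothetical helper name, and a sleeping Python child stands in for the server binary. It asks the child to exit, waits with a timeout, and falls back to a hard kill when `subprocess.TimeoutExpired` is raised. Catching `Exception` instead of using a bare `except:` lets `SystemExit` and `KeyboardInterrupt` propagate.

```python
import subprocess
import sys
import traceback


def stop_server(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Stop a child process: polite terminate first, hard kill on timeout."""
    proc.terminate()                  # SIGTERM on POSIX, TerminateProcess on Windows
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                   # the child ignored the request; force it
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for the server binary (assumption): a Python child that sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        print("child exited with", stop_server(child))
    except Exception:                 # not a bare `except:`; SystemExit still propagates
        traceback.print_exc()
```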
15 months ago update readme sycl for new update (#6151)
Neo Zhang Jianyu [Wed, 20 Mar 2024 03:21:41 +0000 (11:21 +0800)]
update readme sycl for new update (#6151)

* update readme sycl for new update

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <redacted>
* Update README-sycl.md

Co-authored-by: Abhilash Majumder <redacted>
* Update README-sycl.md

Co-authored-by: Abhilash Majumder <redacted>
* Update README-sycl.md

Co-authored-by: Abhilash Majumder <redacted>
* Update README-sycl.md

Co-authored-by: AidanBeltonS <redacted>
* Update README-sycl.md

Co-authored-by: AidanBeltonS <redacted>
* update by review comments

* update w64devkit link

* update for verify device id part

* Update README-sycl.md

Co-authored-by: Meng, Hengyu <redacted>
---------

Co-authored-by: Abhilash Majumder <redacted>
Co-authored-by: AidanBeltonS <redacted>
Co-authored-by: Meng, Hengyu <redacted>
15 months ago increase igpu cluster limit (#6159)
Abhilash Majumder [Wed, 20 Mar 2024 02:58:49 +0000 (08:28 +0530)]
increase igpu cluster limit (#6159)

15 months ago Remove undeed header file. (#6158)
DAN™ [Tue, 19 Mar 2024 16:16:09 +0000 (12:16 -0400)]
Remove undeed header file. (#6158)

15 months ago gguf-split: split and merge gguf per batch of tensors (#6135)
Pierrick Hymbert [Tue, 19 Mar 2024 11:05:44 +0000 (12:05 +0100)]
gguf-split: split and merge gguf per batch of tensors (#6135)

* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <redacted>
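
As a rough illustration of splitting "per batch of tensors", the sketch below groups tensor names into consecutive shards capped by a `--split-max-tensors`-style limit and records the same shard total in every shard, in the spirit of the `general.split_count` KV mentioned above. The function name `plan_splits` and the dictionary layout are assumptions made for this example; this is not the `gguf-split` implementation.

```python
def plan_splits(tensor_names, split_max_tensors):
    """Group tensor names into consecutive shards of at most split_max_tensors each."""
    if split_max_tensors < 1:
        raise ValueError("split_max_tensors must be at least 1")
    shards = [
        tensor_names[i:i + split_max_tensors]
        for i in range(0, len(tensor_names), split_max_tensors)
    ]
    # Every shard carries the same total, mirroring the general.split_count KV.
    return [{"general.split_count": len(shards), "tensors": s} for s in shards]


if __name__ == "__main__":
    for shard in plan_splits(["a", "b", "c", "d", "e"], split_max_tensors=2):
        print(shard)
```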
15 months ago common : disable repeat penalties by default (#6127)
Georgi Gerganov [Tue, 19 Mar 2024 08:21:54 +0000 (10:21 +0200)]
common : disable repeat penalties by default (#6127)

15 months ago ci : exempt some labels from being tagged as stale (#6140)
slaren [Tue, 19 Mar 2024 08:06:54 +0000 (09:06 +0100)]
ci : exempt some labels from being tagged as stale (#6140)

15 months ago common : print usage on '-h' and '--help' (#6145)
DAN™ [Tue, 19 Mar 2024 05:59:36 +0000 (01:59 -0400)]
common : print usage on '-h' and '--help' (#6145)

15 months ago flake.lock: Update
github-actions[bot] [Sun, 17 Mar 2024 06:37:44 +0000 (06:37 +0000)]
flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)
  → 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)

15 months ago mpt : implement backwards compatibility with duped output tensor (#6139)
Jared Van Bortel [Mon, 18 Mar 2024 16:49:02 +0000 (12:49 -0400)]
mpt : implement backwards compatibility with duped output tensor (#6139)

15 months ago clip : fix memory leak (#6138)
Felix [Mon, 18 Mar 2024 15:40:22 +0000 (16:40 +0100)]
clip : fix memory leak (#6138)

15 months ago backend : set max split inputs to GGML_MAX_SRC (#6137)
slaren [Mon, 18 Mar 2024 15:33:44 +0000 (16:33 +0100)]
backend : set max split inputs to GGML_MAX_SRC (#6137)

15 months ago ci : disable stale issue messages (#6126)
Georgi Gerganov [Mon, 18 Mar 2024 11:45:38 +0000 (13:45 +0200)]
ci : disable stale issue messages (#6126)

15 months ago ci : temporary disable sanitizer builds (#6128)
Georgi Gerganov [Mon, 18 Mar 2024 11:45:27 +0000 (13:45 +0200)]
ci : temporary disable sanitizer builds (#6128)

15 months ago backend : offload large batches to GPU (#6083)
slaren [Mon, 18 Mar 2024 10:03:04 +0000 (11:03 +0100)]
backend : offload large batches to GPU (#6083)

* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <redacted>
* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <redacted>
15 months ago common : tidy-up argument parsing (#6105)
DAN™ [Mon, 18 Mar 2024 08:27:44 +0000 (04:27 -0400)]
common : tidy-up argument parsing (#6105)

* Tidy-up argument parsing.

* Missing ref.

* common : minor

* common : add static classifier

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago convert : add support for CamembertModel architecture (#6119)
Thérence [Mon, 18 Mar 2024 08:17:00 +0000 (09:17 +0100)]
convert : add support for CamembertModel architecture (#6119)

Adding support for CamembertModel architecture used by:
https://huggingface.co/dangvantuan/sentence-camembert-large

15 months ago convert : use f32 outtype for bf16 tensors (#6106)
Romain D [Mon, 18 Mar 2024 08:04:41 +0000 (09:04 +0100)]
convert : use f32 outtype for bf16 tensors (#6106)

The old behaviour is to use f16, but bf16 to f16 is not a lossless conversion.
Change the outtype to f32 to default to a lossless conversion.
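
The reason bf16 to f16 loses information is exponent range: bf16 keeps f32's 8 exponent bits, while f16 has only 5. A small standard-library sketch makes this concrete. The helper names are made up for this example and the converter itself is not shown; the bit pattern `0x3080` is the bf16 encoding of 2**-30.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Decode a bf16 bit pattern: bf16 is the top 16 bits of an IEEE f32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def roundtrip(value: float, fmt: str) -> float:
    """Store value in a struct format ('<e' = f16, '<f' = f32) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# 0x3080: sign 0, exponent field 97 (127 - 30), mantissa 0, i.e. 2**-30.
tiny = bf16_bits_to_float(0x3080)

print("as f32:", roundtrip(tiny, "<f") == tiny)  # f32 holds every bf16 value exactly
print("as f16:", roundtrip(tiny, "<e"))          # below f16's smallest subnormal
```

Values at the other end of the range (for example 2**100, also a valid bf16 value) cannot be stored in f16 either; `struct.pack("<e", ...)` raises `OverflowError` for them.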

15 months ago common: llama_load_model_from_url using --model-url (#6098)
Pierrick Hymbert [Sun, 17 Mar 2024 18:12:37 +0000 (19:12 +0100)]
common: llama_load_model_from_url using --model-url (#6098)

* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <redacted>
15 months ago ci : close all stale issues at once (#6115)
Georgi Gerganov [Sun, 17 Mar 2024 17:51:57 +0000 (19:51 +0200)]
ci : close all stale issues at once (#6115)

15 months ago ggml:fix finding transfer queue family index error (#6094)
GainLee [Sun, 17 Mar 2024 17:12:22 +0000 (01:12 +0800)]
ggml:fix finding transfer queue family index error (#6094)

Co-authored-by: GainLee <redacted>
15 months ago ggml : add AVX512F SIMD (#6088)
AmirAli Mirian [Sat, 16 Mar 2024 15:52:02 +0000 (11:52 -0400)]
ggml : add AVX512F SIMD (#6088)

15 months ago gritlm : add initial README.md (#6086)
Daniel Bevenius [Sat, 16 Mar 2024 15:46:29 +0000 (16:46 +0100)]
gritlm : add initial README.md (#6086)

* gritlm: add initial README.md to examples/gritlm

This commit adds a suggestion for an initial README.md for the gritlm
example.

Signed-off-by: Daniel Bevenius <redacted>
* squash! gritlm: add initial README.md to examples/gritlm

Use the `scripts/hf.sh` script to download the model file.

Signed-off-by: Daniel Bevenius <redacted>
* squash! gritlm: add initial README.md to examples/gritlm

Fix editorconfig-checker error in examples/gritlm/README.md.

Signed-off-by: Daniel Bevenius <redacted>
---------

Signed-off-by: Daniel Bevenius <redacted>
15 months ago readme : add wllama as a wasm binding (#6100)
Xuan Son Nguyen [Sat, 16 Mar 2024 15:42:08 +0000 (16:42 +0100)]
readme : add wllama as a wasm binding (#6100)

15 months ago common : refactor nested if causing error C1061 on MSVC (#6101)
DAN™ [Sat, 16 Mar 2024 15:39:15 +0000 (11:39 -0400)]
common : refactor nested if causing error C1061 on MSVC (#6101)

* Refactor nested if causing error C1061 on MSVC.

* Revert back and remove else's.

* Add flag to track found arguments.

15 months ago ci : close inactive issue with workflow (#6053)
Pierrick Hymbert [Sat, 16 Mar 2024 12:20:53 +0000 (13:20 +0100)]
ci : close inactive issue with workflow (#6053)

* issues: ci - close inactive issue with workflow

* ci: close issue, change workflow schedule time

15 months ago llama : fix Baichuan2 13B (#6092)
slaren [Fri, 15 Mar 2024 21:14:16 +0000 (22:14 +0100)]
llama : fix Baichuan2 13B (#6092)

15 months ago llama : add support for control vectors (#5970)
Theia Vogel [Fri, 15 Mar 2024 20:43:02 +0000 (13:43 -0700)]
llama : add support for control vectors (#5970)

* control vector api and implementation

* control-vectors : minor code style updates

* disable control vector when data == nullptr

use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago llama : add Command-R support (#6033)
Andrew Canis [Fri, 15 Mar 2024 20:41:22 +0000 (16:41 -0400)]
llama : add Command-R support (#6033)

Information about the Command-R 35B model (128k context) can be found at:
https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyper parameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used
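
Points 1 to 3 above can be illustrated with a toy, pure-Python sketch. It is not the llama.cpp graph code: the function names are invented for this example, the learned LayerNorm weight is omitted for brevity, the attention and FFN sub-layers are passed in as plain callables, and applying `logit_scale` as a multiplication is an assumption.

```python
import math
from typing import Callable, List

Vec = List[float]


def layer_norm(x: Vec, eps: float = 1e-5) -> Vec:
    """LayerNorm without bias: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def parallel_block(x: Vec, attn: Callable[[Vec], Vec], ffn: Callable[[Vec], Vec]) -> Vec:
    """One shared LayerNorm feeds attention and FFN; both outputs join the residual."""
    h = layer_norm(x)                 # single norm, no post-attention LayerNorm
    a, f = attn(h), ffn(h)            # both sub-layers see the same normalized input
    return [xi + ai + fi for xi, ai, fi in zip(x, a, f)]


def scale_logits(logits: Vec, logit_scale: float) -> Vec:
    """Scale the output logits by logit_scale (multiplication assumed)."""
    return [logit_scale * v for v in logits]


if __name__ == "__main__":
    # Toy stand-ins for the real sub-layers.
    toy_attn = lambda h: [0.5 * v for v in h]
    toy_ffn = lambda h: [0.25 * v for v in h]
    print(parallel_block([1.0, 2.0, 3.0], toy_attn, toy_ffn))
```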

Find GGUF files here:
https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
git lfs install
git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01

15 months ago llava : change API to pure C style for Rust FFI bindgen (#6079)
Ting Lou [Fri, 15 Mar 2024 14:31:05 +0000 (22:31 +0800)]
llava : change API to pure C style for Rust FFI bindgen (#6079)

Co-authored-by: Lou Ting <redacted>
15 months ago cuda : disable unused cudaLaunchHostFunc code (#6078)
slaren [Fri, 15 Mar 2024 12:24:03 +0000 (13:24 +0100)]
cuda : disable unused cudaLaunchHostFunc code (#6078)

15 months ago fix set main gpu error (#6073)
Neo Zhang Jianyu [Fri, 15 Mar 2024 10:53:53 +0000 (18:53 +0800)]
fix set main gpu error (#6073)

15 months ago make : ggml-metal.o depends on ggml.h
Georgi Gerganov [Fri, 15 Mar 2024 09:36:50 +0000 (11:36 +0200)]
make : ggml-metal.o depends on ggml.h

15 months ago [SYCL] Fix non-intel device selection (#6042)
AidanBeltonS [Fri, 15 Mar 2024 09:26:20 +0000 (09:26 +0000)]
[SYCL] Fix non-intel device selection (#6042)

* Fix non-intel device selection

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <redacted>
* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <redacted>
---------

Co-authored-by: Abhilash Majumder <redacted>
Co-authored-by: Neo Zhang Jianyu <redacted>
15 months ago gguf : add support for I64 and F64 arrays (#6062)
Ondřej Čertík [Fri, 15 Mar 2024 08:46:51 +0000 (02:46 -0600)]
gguf : add support for I64 and F64 arrays (#6062)

* gguf : add support for I64 and F64 arrays

GGML currently does not support I64 or F64 arrays and they are not often
used in machine learning, however if in the future the need arises, it
would be nice to add them now, so that the types are next to the other
types I8, I16, I32 in the enums, and it also reserves their type number.

Furthermore, with this addition the GGUF format becomes very usable for
most computational applications of NumPy (being compatible with the most
common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster,
and more versatile alternative to the `npz` format, and a simpler
alternative to the `hdf5` format.

The change in this PR seems small, not significantly increasing the
maintenance burden. I tested this from Python using GGUFWriter/Reader
and `gguf-dump`, as well as from C, everything seems to work.

* Fix compiler warnings

15 months ago llama : add Orion chat template (#6066)
Xuan Son Nguyen [Fri, 15 Mar 2024 08:44:57 +0000 (09:44 +0100)]
llama : add Orion chat template (#6066)

15 months ago llama-bench : use random tokens to improve accuracy with mixtral (#6069)
slaren [Fri, 15 Mar 2024 08:22:24 +0000 (09:22 +0100)]
llama-bench : use random tokens to improve accuracy with mixtral (#6069)

15 months ago llama : fix integer overflow during quantization (#6063)
Georgi Gerganov [Thu, 14 Mar 2024 20:58:41 +0000 (22:58 +0200)]
llama : fix integer overflow during quantization (#6063)

15 months ago gguf : fix resource leaks (#6061)
Steve Grubb [Thu, 14 Mar 2024 18:29:32 +0000 (14:29 -0400)]
gguf : fix resource leaks (#6061)

There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.

15 months ago gguf-py : bump version to 0.8.0 (#6060)
Ondřej Čertík [Thu, 14 Mar 2024 17:57:31 +0000 (11:57 -0600)]
gguf-py : bump version to 0.8.0 (#6060)

15 months ago llama : support models without vocabulary (#5798)
Michael Podvitskiy [Thu, 14 Mar 2024 16:21:56 +0000 (17:21 +0100)]
llama : support models without vocabulary (#5798)

* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert

15 months ago embedding : add EOS token if not present (#899)
Georgi Gerganov [Thu, 14 Mar 2024 13:14:14 +0000 (15:14 +0200)]
embedding : add EOS token if not present (#899)

15 months ago gguf-py : fix dtype check (#6045)
Georgi Gerganov [Thu, 14 Mar 2024 11:32:14 +0000 (13:32 +0200)]
gguf-py : fix dtype check (#6045)

15 months ago readme : improve readme for Llava-1.6 example (#6044)
Jian Liao [Thu, 14 Mar 2024 11:18:23 +0000 (04:18 -0700)]
readme : improve readme for Llava-1.6 example (#6044)

Co-authored-by: Jian Liao <redacted>
15 months ago server: disable debug release type sanitizer, simplify trigger (#6047)
Pierrick Hymbert [Thu, 14 Mar 2024 11:15:39 +0000 (12:15 +0100)]
server: disable debug release type sanitizer, simplify trigger (#6047)

- increase time out for server
 - do not fail fast

15 months ago llama : fix typo
Georgi Gerganov [Thu, 14 Mar 2024 11:13:06 +0000 (13:13 +0200)]
llama : fix typo

15 months ago llama : optimize defrag moves + fix fragmentation calculation (#6037)
Michael Podvitskiy [Thu, 14 Mar 2024 10:56:48 +0000 (11:56 +0100)]
llama : optimize defrag moves + fix fragmentation calculation (#6037)

* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago gguf-py : add support for I8, I16 and I32 (#6045)
Ondřej Čertík [Thu, 14 Mar 2024 10:40:14 +0000 (04:40 -0600)]
gguf-py : add support for I8, I16 and I32 (#6045)

* Refactor dtype handling to be extensible

This code is equivalent to the previous version, but it is now prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader

15 months ago ggml : designate enum vals for integer types (#6050)
Georgi Gerganov [Thu, 14 Mar 2024 10:38:37 +0000 (12:38 +0200)]
ggml : designate enum vals for integer types (#6050)

15 months ago embedding : print all resulting embeddings (#899)
Georgi Gerganov [Thu, 14 Mar 2024 10:37:20 +0000 (12:37 +0200)]
embedding : print all resulting embeddings (#899)

15 months ago metal : build metallib + fix embed path (#6015)
Georgi Gerganov [Thu, 14 Mar 2024 09:55:23 +0000 (11:55 +0200)]
metal : build metallib + fix embed path (#6015)

* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library

15 months ago embedding : print cosine similarity (#899)
Georgi Gerganov [Thu, 14 Mar 2024 08:12:29 +0000 (10:12 +0200)]
embedding : print cosine similarity (#899)

15 months ago readme : update details about running llama in Termux on Android (#6039)
Linwei Wang [Wed, 13 Mar 2024 18:34:40 +0000 (02:34 +0800)]
readme : update details about running llama in Termux on Android (#6039)

15 months ago readme : update API changes and hot topics
Georgi Gerganov [Wed, 13 Mar 2024 18:33:56 +0000 (20:33 +0200)]
readme : update API changes and hot topics

15 months ago grammar : handle missing "root" node (#6004)
Clint Herron [Wed, 13 Mar 2024 18:10:40 +0000 (14:10 -0400)]
grammar : handle missing "root" node (#6004)

15 months ago llama : add pipeline parallelism support (#6017)
slaren [Wed, 13 Mar 2024 17:54:21 +0000 (18:54 +0100)]
llama : add pipeline parallelism support (#6017)

* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <redacted>
Co-authored-by: Georgi Gerganov <redacted>
15 months ago test-backend-ops : skip CPU backend by default (#6028)
slaren [Wed, 13 Mar 2024 13:58:30 +0000 (14:58 +0100)]
test-backend-ops : skip CPU backend by default (#6028)

15 months ago Update get version (#6025)
AidanBeltonS [Wed, 13 Mar 2024 13:17:54 +0000 (13:17 +0000)]
Update get version (#6025)

15 months ago Server: Use multi-task for embeddings endpoint (#6001)
Xuan Son Nguyen [Wed, 13 Mar 2024 10:39:11 +0000 (11:39 +0100)]
Server: Use multi-task for embeddings endpoint (#6001)

* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}

15 months ago ci : remove tidy-review (#6021)
slaren [Tue, 12 Mar 2024 15:55:19 +0000 (16:55 +0100)]
ci : remove tidy-review (#6021)

15 months ago ggml : reuse quantum structs across backends (#5943)
Georgi Gerganov [Tue, 12 Mar 2024 12:27:20 +0000 (14:27 +0200)]
ggml : reuse quantum structs across backends (#5943)

* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci

15 months ago ggml : fix UB in IQ2_S and IQ3_S (#6012)
Georgi Gerganov [Tue, 12 Mar 2024 11:49:55 +0000 (13:49 +0200)]
ggml : fix UB in IQ2_S and IQ3_S (#6012)

15 months ago sycl : update IQ1_S kernels (WIP - not working!) (#5995)
Georgi Gerganov [Tue, 12 Mar 2024 09:15:05 +0000 (11:15 +0200)]
sycl : update IQ1_S kernels (WIP - not working!) (#5995)

* sycl : try to fix after IQ1_S changes

* sycl : iq1s_grid -> iq1s_grid_gpu

* sycl : fix grid type

15 months ago grammar : fix unnecessarily retained pointer to rules (#6003)
gliptic [Mon, 11 Mar 2024 19:59:03 +0000 (20:59 +0100)]
grammar : fix unnecessarily retained pointer to rules (#6003)

15 months ago 1.5 bit: we can do even better (#5999)
Kawrakow [Mon, 11 Mar 2024 15:53:15 +0000 (16:53 +0100)]
1.5 bit: we can do even better (#5999)

* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!
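
The shifted ternary levels described above can be written out as a small sketch. Only the level arithmetic (-1 + delta, delta, 1 + delta with delta = +/- 0.125) comes from the commit message; the function names, the `codes` representation (0, 1, 2 indexing the three levels) and the way the scale is applied are assumptions, and the real block layout is not shown here.

```python
def iq1s_levels(shift_sign: int, delta: float = 0.125) -> tuple:
    """Return the three quant levels shifted by +delta or -delta."""
    d = delta if shift_sign >= 0 else -delta
    return (-1.0 + d, 0.0 + d, 1.0 + d)


def dequantize(codes, scale: float, shift_sign: int) -> list:
    """Map ternary codes (0, 1, 2) to scaled, shifted weights."""
    levels = iq1s_levels(shift_sign)
    return [scale * levels[c] for c in codes]


if __name__ == "__main__":
    print(iq1s_levels(+1))
    print(iq1s_levels(-1))
    print(dequantize([0, 1, 2], scale=2.0, shift_sign=+1))
```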

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <redacted>
15 months ago llama : more consistent names of count variables (#5994)
Georgi Gerganov [Mon, 11 Mar 2024 15:49:47 +0000 (17:49 +0200)]
llama : more consistent names of count variables (#5994)

* llama : more consistent names of count variables

ggml-ci

* llama : n_parallel -> n_seq_max

* common : fix param name

* examples : fix param name

15 months ago llama : refactor unicode stuff (#5992)
Georgi Gerganov [Mon, 11 Mar 2024 15:47:47 +0000 (17:47 +0200)]
llama : refactor unicode stuff (#5992)

* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref

15 months ago Update server docker image URLs (#5997)
Jakub N [Mon, 11 Mar 2024 13:40:42 +0000 (14:40 +0100)]
Update server docker image URLs (#5997)

15 months ago Server: format error to json (#5961)
Xuan Son Nguyen [Mon, 11 Mar 2024 09:56:41 +0000 (10:56 +0100)]
Server: format error to json (#5961)

* server: format error to json

* server: do not crash on grammar error

* fix api key test case

* revert limit max n_predict

* small fix

* correct coding style

* update completion.js

* launch_slot_with_task

* update docs

* update_slots

* update webui

* update readme

15 months ago ggml, ci : Windows ARM runner and build fixes (#5979)
Michael Podvitskiy [Mon, 11 Mar 2024 09:28:51 +0000 (10:28 +0100)]
ggml, ci : Windows ARM runner and build fixes (#5979)

* windows arm ci

* fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64

* fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned`

* fix `error C2065: '__fp16': undeclared identifier`

15 months ago server : maintain chat completion id for streaming responses (#5988)
Minsoo Cheong [Mon, 11 Mar 2024 08:09:32 +0000 (17:09 +0900)]
server : maintain chat completion id for streaming responses (#5988)

* server: maintain chat completion id for streaming responses

* Update examples/server/utils.hpp

* Update examples/server/utils.hpp

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)
Gilad S [Mon, 11 Mar 2024 08:00:08 +0000 (10:00 +0200)]
cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)

15 months ago llama : fix F16/F32 downcast + improve names (#5980)
Georgi Gerganov [Mon, 11 Mar 2024 07:56:47 +0000 (09:56 +0200)]
llama : fix F16/F32 downcast + improve names (#5980)

15 months ago Better 1.5 bit quantization (#5971)
Kawrakow [Mon, 11 Mar 2024 06:51:49 +0000 (07:51 +0100)]
Better 1.5 bit quantization  (#5971)

* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <redacted>
15 months ago [SYCL] Add q3_s and q1_s (#5886)
Abhilash Majumder [Mon, 11 Mar 2024 04:57:56 +0000 (10:27 +0530)]
[SYCL] Add q3_s and q1_s (#5886)

* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space

15 months ago [SYCL] Add support for SYCL Nvidia target (#5738)
AidanBeltonS [Mon, 11 Mar 2024 01:13:57 +0000 (01:13 +0000)]
[SYCL] Add support for SYCL Nvidia target (#5738)

* Add support for nvidia target in CMake

* Update sycl read-me for Nvidia target

* Fix errors

15 months ago metal : move mm_id indices to shared mem (#5982)
Georgi Gerganov [Sun, 10 Mar 2024 21:12:48 +0000 (23:12 +0200)]
metal : move mm_id indices to shared mem (#5982)

15 months ago android : fix utf8 decoding error (#5935)
Dean [Sun, 10 Mar 2024 20:03:17 +0000 (04:03 +0800)]
android : fix utf8 decoding error (#5935)

* examples: fix utf8 decoding error

Some models have a tokenizer that decodes an id into an incomplete utf8 sequence, so the output needs to be validated and held back until the next token arrives.
One example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and an example of the token is 18137
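
The "validate and wait for the next token" idea can be illustrated in Python with the standard library's incremental UTF-8 decoder. This is an illustration of the technique only, not the Android example's code, and `stream_text` is a hypothetical helper name. Bytes that end in the middle of a multi-byte character are buffered until the following piece completes them.

```python
import codecs


def stream_text(token_pieces):
    """Yield text only once the buffered bytes form complete UTF-8 characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for piece in token_pieces:
        text = decoder.decode(piece)   # "" while a multi-byte sequence is incomplete
        if text:
            yield text


# The euro sign is the three bytes E2 82 AC; here it is split across two tokens.
pieces = [b"price: \xe2", b"\x82\xac", b"5"]
print(list(stream_text(pieces)))       # ['price: ', '€', '5']
```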

* android : minor

---------

Co-authored-by: zhangfuwen <redacted>
Co-authored-by: Georgi Gerganov <redacted>
15 months ago readme : update hot topics
Georgi Gerganov [Sun, 10 Mar 2024 18:58:26 +0000 (20:58 +0200)]
readme : update hot topics

15 months ago sync : ggml
Georgi Gerganov [Sun, 10 Mar 2024 18:10:46 +0000 (20:10 +0200)]
sync : ggml

15 months ago ggml : try fix 32-bit arm compat (whisper/1938)
Georgi Gerganov [Fri, 8 Mar 2024 21:45:07 +0000 (23:45 +0200)]
ggml : try fix 32-bit arm compat (whisper/1938)

* ggml : try fix 32-bit arm compat

* ggml : fix cont

15 months ago ggml : remove __constant__ specifier for CUDA tables (#5940)
Georgi Gerganov [Sun, 10 Mar 2024 18:09:24 +0000 (20:09 +0200)]
ggml : remove __constant__ specifier for CUDA tables (#5940)

15 months ago server: ci: windows build and tests (#5968)
Pierrick Hymbert [Sun, 10 Mar 2024 17:17:47 +0000 (18:17 +0100)]
server: ci: windows build and tests (#5968)

* server: ci: windows build and tests

* server: ci: remove tmp push branch

* server: ci: EOF EOL

* Use builti

Co-authored-by: Jared Van Bortel <redacted>
* server: tests: server graceful shutdown, then kill, then hard kill

* server: tests: remove python2 unicode string

* server: tests: remove wrong comment on server starting,  close_fds is always true

* server: tests: server kill, if pid exists

* server: tests: remove dependency to killall

* server: tests: ci windows: pid exists better handling

---------

Co-authored-by: Jared Van Bortel <redacted>
15 months ago llama : add support for GritLM (#5959)
DAN™ [Sun, 10 Mar 2024 15:56:30 +0000 (11:56 -0400)]
llama : add support for GritLM (#5959)

* add gritlm example

* gritlm results match

* tabs to spaces

* comment out debug printing

* rebase to new embed

* gritlm embeddings are back babeee

* add to gitignore

* allow to toggle embedding mode

* Clean-up GritLM sample code.

* Fix types.

* Flush stdout and output ending newline if streaming.

* mostly style fixes; correct KQ_mask comment

* add causal_attn flag to llama_cparams

* gritlm : minor

* llama : minor

---------

Co-authored-by: Douglas Hanley <redacted>
Co-authored-by: Georgi Gerganov <redacted>
15 months ago grammar : verify parsed state (#5950)
Clint Herron [Sun, 10 Mar 2024 15:17:43 +0000 (11:17 -0400)]
grammar : verify parsed state (#5950)

15 months ago nix: update flake.lock (#5969)
Georgi Gerganov [Sun, 10 Mar 2024 14:43:08 +0000 (16:43 +0200)]
nix: update flake.lock (#5969)

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
  → 'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)

Co-authored-by: github-actions[bot] <redacted>
15 months ago server: benchmark: chat/completions scenario and other llm servers comparison (#5941)
Pierrick Hymbert [Sat, 9 Mar 2024 22:41:49 +0000 (23:41 +0100)]
server: benchmark: chat/completions scenario and other llm servers comparison (#5941)

* server: bench: Init a bench scenario with K6
See #5827

* server: bench: EOL EOF

* server: bench: PR feedback and improved k6 script configuration

* server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading

server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS

server: bench: increase truncated rate to 80% before failing

* server: bench: fix doc

* server: bench: change gauge custom metrics to trend

* server: bench: change gauge custom metrics to trend
server: bench: add trend custom metrics for total tokens per second average

* server: bench: doc add an option to debug http request

* server: bench: filter dataset too short and too long sequences

* server: bench: allow to filter out conversation in the dataset based on env variable

* server: bench: fix assistant message sent instead of user message

* server: bench: fix assistant message sent instead of user message

* server : add defrag thold parameter

* server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago server : print chat template info
Georgi Gerganov [Sat, 9 Mar 2024 20:04:00 +0000 (22:04 +0200)]
server : print chat template info

15 months ago perplexity : support using multiple sequences to allow larger batch sizes (#5946)
slaren [Sat, 9 Mar 2024 18:55:54 +0000 (19:55 +0100)]
perplexity : support using multiple sequences to allow larger batch sizes (#5946)

* perplexity : support using multiple sequences to allow larger batch sizes

ggml-ci

* set cparams.n_parallel to the number of sequences

* print tested n_ctx, add assert

15 months ago readme : update hot topics
Georgi Gerganov [Sat, 9 Mar 2024 16:14:13 +0000 (18:14 +0200)]
readme : update hot topics

15 months ago ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951)
Georgi Gerganov [Sat, 9 Mar 2024 15:36:20 +0000 (17:36 +0200)]
ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951)

15 months ago server : fix metrics init (#5964)
Georgi Gerganov [Sat, 9 Mar 2024 15:34:15 +0000 (17:34 +0200)]
server : fix metrics init (#5964)

15 months ago ggml : remove old quantization functions (#5942)
Georgi Gerganov [Sat, 9 Mar 2024 13:53:59 +0000 (15:53 +0200)]
ggml : remove old quantization functions (#5942)

* ggml : remove old quantization functions

ggml-ci

* ggml : simplify ggml_quantize_chunk

ggml-ci

* ggml : restrict correctness

ggml-ci

* ggml : remove hist data from the quantization API

ggml-ci

* tests : remove hist usage in test-backend-ops

ggml-ci

* vulkan : remove hist and fix typo

15 months ago server : clarify some items in the readme (#5957)
Georgi Gerganov [Sat, 9 Mar 2024 13:47:47 +0000 (15:47 +0200)]
server : clarify some items in the readme (#5957)

* server : clarify some items in the readme

* server : fix typo

15 months ago server : normalize embeddings (#5956)
SeungWon Jeong [Sat, 9 Mar 2024 12:27:58 +0000 (21:27 +0900)]
server : normalize embeddings (#5956)

* output normalize embedding in '/v1/embeddings'

* common : reuse llama_embd_normalize

* common : better normalize impl

---------

Co-authored-by: Georgi Gerganov <redacted>
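
For readers unfamiliar with embedding normalization, here is a minimal sketch. It assumes Euclidean (L2) normalization, which the commit message does not spell out, and the function names are illustrative rather than the `llama_embd_normalize` implementation. Once vectors have unit length, cosine similarity reduces to a dot product.

```python
import math


def normalize(embd):
    """Scale a vector to unit Euclidean length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in embd))
    return [v / norm for v in embd] if norm > 0.0 else list(embd)


def cosine_similarity(a, b):
    """Dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


if __name__ == "__main__":
    print(normalize([3.0, 4.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```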
15 months ago tests : gitignore ggml-common.h
Georgi Gerganov [Sat, 9 Mar 2024 12:17:11 +0000 (14:17 +0200)]
tests : gitignore ggml-common.h

15 months ago server : fix passing prompt as tokens (#5955)
Alexey Parfenov [Sat, 9 Mar 2024 11:16:53 +0000 (11:16 +0000)]
server : fix passing prompt as tokens (#5955)

* server: fix passing prompt as tokens

* Update examples/server/server.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
15 months ago ggml : add ggml-common.h to deduplicate shared code (#5940)
Georgi Gerganov [Sat, 9 Mar 2024 10:47:57 +0000 (12:47 +0200)]
ggml : add ggml-common.h to deduplicate shared code (#5940)

* ggml : add ggml-common.h to shared code

ggml-ci

* scripts : update sync scripts

* sycl : reuse quantum tables

ggml-ci

* ggml : minor

* ggml : minor

* sycl : try to fix build

15 months ago server : simplify logic for empty prompts (#5953)
Georgi Gerganov [Sat, 9 Mar 2024 10:34:18 +0000 (12:34 +0200)]
server : simplify logic for empty prompts (#5953)

15 months ago Server: reorganize some http logic (#5939)
Xuan Son Nguyen [Sat, 9 Mar 2024 10:27:53 +0000 (11:27 +0100)]
Server: reorganize some http logic (#5939)

* refactor static file handler

* use set_pre_routing_handler for validate_api_key

* merge embedding handlers

* correct http verb for endpoints

* fix embedding response

* fix test case CORS Options

* fix code style