git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
22 months agoAllow passing grammar to completion endpoint (#2532)
Martin Krasser [Tue, 8 Aug 2023 13:29:19 +0000 (15:29 +0200)]
Allow passing grammar to completion endpoint (#2532)

* Allow passing grammar to completion endpoint

22 months agoCUDA: tighter VRAM scratch size for 65b/70b (#2551)
Johannes Gäßler [Tue, 8 Aug 2023 12:38:16 +0000 (14:38 +0200)]
CUDA: tighter VRAM scratch size for 65b/70b (#2551)

22 months agollm.vim : multiline autocompletion, get rid of "^@" (#2543)
chaihahaha [Tue, 8 Aug 2023 12:07:02 +0000 (20:07 +0800)]
llm.vim : multiline autocompletion, get rid of "^@" (#2543)

22 months agovim : bring back simple llm.vim example
Georgi Gerganov [Tue, 8 Aug 2023 12:05:30 +0000 (15:05 +0300)]
vim : bring back simple llm.vim example

22 months agovim : streaming and more (#2495)
AustinMroz [Tue, 8 Aug 2023 11:44:48 +0000 (06:44 -0500)]
vim : streaming and more (#2495)

* Update Vim plugin

* Remove getbufoneline usage, Add input bind example.

getbufoneline() is a recently added function, so it has been
replaced with getbufline() for compatibility.

An additional example was added explaining how to add a keybind that
works in insert mode.

22 months agoAdd --rope-scale parameter (#2544)
klosax [Mon, 7 Aug 2023 17:07:19 +0000 (19:07 +0200)]
Add --rope-scale parameter (#2544)

* common.cpp : Add --rope-scale parameter
* README.md : Add info about using linear rope scaling
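The linear RoPE scaling mentioned above can be sketched as follows: a scale factor of N divides the position index by N before the rotary angles are computed, stretching the effective context window. This is a minimal illustration, not the actual implementation; the default frequency base of 10000 and the function shape are assumptions.

```python
def rope_angles(pos, head_dim, rope_scale=1.0, base=10000.0):
    """Rotary angles for one position. Linear scaling divides the position
    by rope_scale, so --rope-scale 2 makes position 4 look like position 2."""
    p = pos / rope_scale  # linear position scaling
    return [p * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

With this sketch, `rope_angles(4, 8, rope_scale=2.0)` produces the same angles as `rope_angles(2, 8)`, which is exactly the stretching effect the parameter is meant to achieve.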

22 months agoggml : mul mat tweaks (#2372)
Georgi Gerganov [Mon, 7 Aug 2023 11:25:58 +0000 (14:25 +0300)]
ggml : mul mat tweaks (#2372)

* ggml : mul mat wip

ggml-ci

* ggml : alternative thread distribution for mul_mat

ggml-ci

* ggml : mul_mat block tiling attempt

* ggml : mul_mat threads yield

ggml-ci

22 months agoggml : pad result of ggml_nbytes()
Georgi Gerganov [Mon, 7 Aug 2023 11:24:42 +0000 (14:24 +0300)]
ggml : pad result of ggml_nbytes()
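Padding the result of ggml_nbytes() means rounding a tensor's byte size up to an alignment boundary so buffers laid out back-to-back stay aligned. A minimal sketch of the standard power-of-two rounding rule; the alignment value of 32 is an assumption for illustration:

```python
def pad_nbytes(n: int, align: int = 32) -> int:
    """Round n up to the next multiple of align (align must be a power of two)."""
    return (n + align - 1) & ~(align - 1)
```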

22 months agoggml : change params pointer (style change) (#2539)
Georgi Gerganov [Mon, 7 Aug 2023 10:55:18 +0000 (13:55 +0300)]
ggml : change params pointer (style change) (#2539)

ggml-ci

22 months agoggml : sync (custom ops) (#2537)
Georgi Gerganov [Mon, 7 Aug 2023 10:20:09 +0000 (13:20 +0300)]
ggml : sync (custom ops) (#2537)

ggml-ci

22 months agoFixed mmap prefetch for GPU offloading (#2529)
Johannes Gäßler [Mon, 7 Aug 2023 08:09:40 +0000 (10:09 +0200)]
Fixed mmap prefetch for GPU offloading (#2529)

22 months agometal : fix out-of-bounds access + inc concurrency nodes (#2416)
Georgi Gerganov [Mon, 7 Aug 2023 07:52:57 +0000 (10:52 +0300)]
metal : fix out-of-bounds access + inc concurrency nodes (#2416)

* metal : fix out-of-bounds access + style changes

* metal : increase concurrency nodes to 2*GGML_MAX_NODES

22 months ago[Makefile] Move ARM CFLAGS before compilation (#2536)
GiviMAD [Mon, 7 Aug 2023 06:21:46 +0000 (23:21 -0700)]
[Makefile] Move ARM CFLAGS before compilation (#2536)

22 months ago[Zig] Rewrite build for Zig 0.11 (#2514)
Henri Vasserman [Mon, 7 Aug 2023 05:35:53 +0000 (08:35 +0300)]
[Zig] Rewrite build for Zig 0.11 (#2514)

* zig build fixes

* Disable LTO on Windows.

22 months agoconsole : fix issue related to Windows 11 PowerShell console mode persistence (#2521)
DannyDaemonic [Sun, 6 Aug 2023 06:49:34 +0000 (23:49 -0700)]
console : fix issue related to Windows 11 PowerShell console mode persistence (#2521)

22 months agoconvert.py : add missing abstract methods for quantized data (#2491)
Keiichi Tabata [Sun, 6 Aug 2023 06:34:05 +0000 (15:34 +0900)]
convert.py : add missing abstract methods for quantized data (#2491)

22 months agoCUDA: faster k-quant mul_mat_q kernels (#2525)
Johannes Gäßler [Sat, 5 Aug 2023 16:20:44 +0000 (18:20 +0200)]
CUDA: faster k-quant mul_mat_q kernels (#2525)

22 months agofix firefox autoscroll (#2519)
Jonas Wunderlich [Fri, 4 Aug 2023 20:16:11 +0000 (20:16 +0000)]
fix firefox autoscroll (#2519)

22 months agoserver: regenerate completion.js.hpp (#2515)
Cebtenzzre [Fri, 4 Aug 2023 19:00:57 +0000 (15:00 -0400)]
server: regenerate completion.js.hpp (#2515)

22 months agoCUDA: use min compute capability of GPUs actually used (#2506)
Cebtenzzre [Fri, 4 Aug 2023 15:35:22 +0000 (11:35 -0400)]
CUDA: use min compute capability of GPUs actually used (#2506)
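The point of this fix is that kernel selection must honor the minimum compute capability among the GPUs that actually receive layers, not every GPU present in the system. A hedged sketch of the selection logic; the function and its inputs are hypothetical, not llama.cpp's actual code:

```python
def usable_compute_cap(gpu_ccs, used_mask):
    """Minimum compute capability over the GPUs that actually get work.
    gpu_ccs: compute capability per device (e.g. 61 for 6.1);
    used_mask: whether each device receives any layers."""
    used = [cc for cc, used in zip(gpu_ccs, used_mask) if used]
    return min(used)
```

For example, with a cc 6.1 card installed but idle and all layers on a cc 8.6 card, the usable capability is 8.6, allowing the faster kernels.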

22 months agoCUDA: check if event is NULL before cudaStreamWaitEvent (#2505)
Cebtenzzre [Fri, 4 Aug 2023 15:34:32 +0000 (11:34 -0400)]
CUDA: check if event is NULL before cudaStreamWaitEvent (#2505)

Fixes #2503

22 months agoAdd --simple-io option for subprocesses and break out console.h and cpp (#1558)
DannyDaemonic [Fri, 4 Aug 2023 15:20:12 +0000 (08:20 -0700)]
Add --simple-io option for subprocesses and break out console.h and cpp (#1558)

22 months agoFixing race condition in server and partial stream handling in frontend. (#2391)
Stephen Nichols [Fri, 4 Aug 2023 11:37:24 +0000 (06:37 -0500)]
Fixing race condition in server and partial stream handling in frontend. (#2391)

* Fixing race condition in server.cpp and partial stream handling in completion.js

* Reverting assert edits.

* Adding newline to eof

22 months agoStream save llama context data to file instead of allocating entire buffer upfront...
l3utterfly [Fri, 4 Aug 2023 11:29:52 +0000 (19:29 +0800)]
Stream save llama context data to file instead of allocating entire buffer upfront (#2488)

* added stream-saving of context data to file to avoid allocating unnecessary amounts of memory

* generalised copying state data to file or buffer

* added comments explaining how copy_state_data works

* fixed trailing whitespaces

* fixed save load state example

* updated save load state to use public function in llama.cpp

* restored the previously broken llama_copy_state_data API
- moved the new logic for copying llama state data into an internal function

* fixed function declaration order

* restored save load state example

* fixed whitespace

* removed unused llama-util.h include

* Apply suggestions from code review

Co-authored-by: slaren <redacted>
* Apply code review suggestions

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
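The idea behind this change can be sketched briefly: rather than concatenating the whole llama state (RNG, KV cache, etc.) into one buffer sized for everything, each section is written to the destination as it is serialized, so peak memory stays at one section. The section names and function below are hypothetical illustrations, not the actual API:

```python
import io

def save_state_streaming(sections, out) -> int:
    """Write each state section to `out` as it is produced, instead of
    first building one buffer the size of the entire state."""
    total = 0
    for chunk in sections:
        out.write(chunk)       # stream each piece directly
        total += len(chunk)
    return total

# Demo with two hypothetical state sections.
demo = io.BytesIO()
saved = save_state_streaming([b"rng-state", b"kv-cache"], demo)
```

The monolithic equivalent would be `out.write(b"".join(sections))`, whose peak allocation is the full state size.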
22 months agobuild : fix several cast and printf warnings (#2499)
Borislav Stanimirov [Fri, 4 Aug 2023 10:07:21 +0000 (13:07 +0300)]
build : fix several cast and printf warnings (#2499)

22 months agoexamples : generate JSON according to schema (#1887)
Evan Jones [Thu, 3 Aug 2023 02:05:44 +0000 (22:05 -0400)]
examples : generate JSON according to schema (#1887)

* examples : add JSON schema grammars

* complete JSON grammar

* ensure primitive types can be used as root of schema

* support integer type and adjust usage text

22 months agoCUDA: faster non k-quant mul_mat_q kernels (#2483)
Johannes Gäßler [Wed, 2 Aug 2023 16:04:04 +0000 (18:04 +0200)]
CUDA: faster non k-quant mul_mat_q kernels (#2483)

22 months agoCUDA: Fix models with output size != 32000 (#2480)
Johannes Gäßler [Wed, 2 Aug 2023 14:48:10 +0000 (16:48 +0200)]
CUDA: Fix models with output size != 32000 (#2480)

22 months agoreadme : add Aquila-7B model series to supported models (#2487)
ldwang [Wed, 2 Aug 2023 08:21:11 +0000 (16:21 +0800)]
readme : add Aquila-7B model series to supported models (#2487)

* support bpe tokenizer in convert

Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert

Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert, fix

Signed-off-by: ldwang <redacted>
* Add Aquila-7B models in README.md

Signed-off-by: ldwang <redacted>
* Update Aquila-7B models in README.md

Signed-off-by: ldwang <redacted>
---------

Signed-off-by: ldwang <redacted>
Co-authored-by: ldwang <redacted>
22 months agotests : Fix compilation warnings (Linux/GCC) (#2451)
Eve [Wed, 2 Aug 2023 08:06:19 +0000 (04:06 -0400)]
tests : Fix compilation warnings (Linux/GCC) (#2451)

* fix hellaswag print format, cast away warning in test-double-float

* c++11 cannot use designated initializers

* add static to test-grad0.c internal functions

* use memcpy in test-double-float.c

* port c tests to c++

* use initializer list for ggml_init_params

22 months agoreadme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475)
Yiming Cui [Wed, 2 Aug 2023 06:18:31 +0000 (14:18 +0800)]
readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475)

* add support for chinese llama-2 / alpaca-2

* remove white spaces

22 months agofix a typo in examples/server/README.md (#2478)
Bono Lv [Tue, 1 Aug 2023 12:54:28 +0000 (20:54 +0800)]
fix a typo in examples/server/README.md (#2478)

22 months agoserver : Support dark mode (#2414)
ebraminio [Tue, 1 Aug 2023 08:56:23 +0000 (01:56 -0700)]
server : Support dark mode (#2414)

* server : Support dark mode

So it respects user system light / dark settings.

* Update index.html.hpp by running ./deps.sh

22 months agometal : add gqa8 kernel to allow llama-2-70B on metal (#2459)
Matteo Boschini [Tue, 1 Aug 2023 07:43:12 +0000 (09:43 +0200)]
metal : add gqa8 kernel to allow llama-2-70B on metal (#2459)

* Added gqa8 kernel to allow llama-2-70B on metal

* Update ggml-metal.m

Co-authored-by: Cebtenzzre <redacted>
* Extend kernel_mul_mat_f16_f32 to handle gqa broadcast

* Added ne03==ne13 assertion

---------

Co-authored-by: Cebtenzzre <redacted>
22 months agoCUDA: fixed LLAMA_FAST compilation option (#2473)
Johannes Gäßler [Mon, 31 Jul 2023 19:02:19 +0000 (21:02 +0200)]
CUDA: fixed LLAMA_FAST compilation option (#2473)

22 months agoCUDA: fixed cmake F16 option (#2471)
Johannes Gäßler [Mon, 31 Jul 2023 17:52:22 +0000 (19:52 +0200)]
CUDA: fixed cmake F16 option (#2471)

23 months agoCUDA: mmq CLI option, fixed mmq build issues (#2453)
Johannes Gäßler [Mon, 31 Jul 2023 13:44:35 +0000 (15:44 +0200)]
CUDA: mmq CLI option, fixed mmq build issues (#2453)

23 months agoCUDA: Implemented row flattening for non-glm RoPE (#2468)
Johannes Gäßler [Mon, 31 Jul 2023 12:32:30 +0000 (14:32 +0200)]
CUDA: Implemented row flattening for non-glm RoPE (#2468)

23 months agoCUDA: fewer memory bank conflicts for mul_mat_q (#2458)
Johannes Gäßler [Mon, 31 Jul 2023 11:18:51 +0000 (13:18 +0200)]
CUDA: fewer memory bank conflicts for mul_mat_q (#2458)

23 months agoFix Metal backend broken from the allocator changes (#2455)
slaren [Mon, 31 Jul 2023 09:02:53 +0000 (11:02 +0200)]
Fix Metal backend broken from the allocator changes (#2455)

* fix Metal backend broken from the allocator changes

23 months agoggml : add graph tensor allocator (#2411)
slaren [Sun, 30 Jul 2023 13:58:01 +0000 (15:58 +0200)]
ggml : add graph tensor allocator (#2411)

* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset

23 months agoCUDA: Quantized matrix matrix multiplication (#2160)
Johannes Gäßler [Sat, 29 Jul 2023 21:04:44 +0000 (23:04 +0200)]
CUDA: Quantized matrix matrix multiplication (#2160)

* mmq implementation for non k-quants

* q6_K

* q2_K

* q3_k

* q4_K

* vdr

* q5_K

* faster q8_1 loading

* loop unrolling

* add __restrict__

* q2_K sc_high

* GGML_CUDA_MMQ_Y

* Updated Makefile

* Update Makefile

* DMMV_F16 -> F16

* Updated README, CMakeLists

* Fix CMakeLists.txt

* Fix CMakeLists.txt

* Fix multi GPU out-of-bounds
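The quantized matrix multiplication above operates on blocks of weights stored as low-bit integers plus a per-block scale. As a rough sketch of the storage format (simplified; the real q8_1 layout also carries a block sum, and block size, rounding, and names here are assumptions):

```python
def quantize_block_q8(xs):
    """Symmetric 8-bit block quantization: one float scale plus int8 values."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [round(x / scale) for x in xs]  # ints in [-127, 127]
    return scale, q

def dequantize_block_q8(scale, q):
    return [scale * v for v in q]

block = [0.5, -1.0, 0.25, 0.0]
scale, q = quantize_block_q8(block)
restored = dequantize_block_q8(scale, q)
```

Doing the dot products directly on the int8 values (as the mmq kernels do) avoids dequantizing weights to floats in the inner loop.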

23 months agoCUDA: faster multi GPU synchronization (#2448)
Johannes Gäßler [Sat, 29 Jul 2023 21:04:10 +0000 (23:04 +0200)]
CUDA: faster multi GPU synchronization (#2448)

23 months agoperplexity : add Hellaswag calculation (#2389)
klosax [Fri, 28 Jul 2023 18:25:36 +0000 (20:25 +0200)]
perplexity : add Hellaswag calculation (#2389)

* common.h : add hellaswag / remove perplexity-lines

* common.cpp : add hellaswag / remove perplexity-lines

* perplexity.cpp : add hellaswag scores / remove perplexity-lines

* perplexity.cpp : clean up

* common.h : change default param value

* common.cpp : Change default param

* perplexity.cpp : alter wording

* common.h : alter wording

* common.cpp : alter wording

23 months agoggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405)
Lee [Fri, 28 Jul 2023 18:17:45 +0000 (02:17 +0800)]
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405)

23 months agollama : support more diverse tokenizers? (#2420)
eric8607242 [Fri, 28 Jul 2023 18:10:05 +0000 (02:10 +0800)]
llama : support more diverse tokenizers? (#2420)

* supporting more diverse tokenizers

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
23 months agoexamples : fix whitespace
Georgi Gerganov [Fri, 28 Jul 2023 18:05:08 +0000 (21:05 +0300)]
examples : fix whitespace

23 months agoexamples : server chat mode with llama2 (#2400)
nhamanasu [Fri, 28 Jul 2023 18:02:10 +0000 (03:02 +0900)]
examples : server chat mode with llama2 (#2400)

* add: server chat mode with llama2

* fix: remove the unnecessary last \n

23 months agoreadme : fix the description of the Tail free sampling (TFS) method (#2431)
Weird Constructor [Fri, 28 Jul 2023 08:44:43 +0000 (10:44 +0200)]
readme : fix the description of the Tail free sampling (TFS) method (#2431)

23 months agollama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433)
Rand Xie [Fri, 28 Jul 2023 08:42:53 +0000 (01:42 -0700)]
llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433)

23 months agoObtaining LLaMA 2 instructions (#2308)
niansa/tuxifan [Fri, 28 Jul 2023 01:14:11 +0000 (03:14 +0200)]
Obtaining LLaMA 2 instructions (#2308)

* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models

23 months agoconvert.py : Update to support 70B HF format model files (#2427)
mj-shifu [Thu, 27 Jul 2023 20:39:17 +0000 (22:39 +0200)]
convert.py : Update to support 70B HF format model files (#2427)

* convert.py : fix llama 2 70b conversion from Huggingface

23 months agometal : disable graph concurrency optimization due to bug (#2413)
Georgi Gerganov [Thu, 27 Jul 2023 08:00:54 +0000 (11:00 +0300)]
metal : disable graph concurrency optimization due to bug (#2413)

23 months agoggml : fix assert in ggml_set_unary_op (#2410)
slaren [Wed, 26 Jul 2023 21:57:23 +0000 (23:57 +0200)]
ggml : fix assert in ggml_set_unary_op (#2410)

23 months agomake : build with -Wmissing-prototypes (#2394)
Cebtenzzre [Wed, 26 Jul 2023 18:00:04 +0000 (14:00 -0400)]
make : build with -Wmissing-prototypes (#2394)

23 months agoggml : allocate graphs in a context (#2392)
slaren [Wed, 26 Jul 2023 13:56:53 +0000 (15:56 +0200)]
ggml : allocate graphs in a context (#2392)

* ggml : graph allocation in contexts

* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx

* llama.cpp : allocate graph in the context

* add GGML_PAD

---------

Co-authored-by: Georgi Gerganov <redacted>
23 months agoAdd LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Kawrakow [Tue, 25 Jul 2023 15:35:53 +0000 (18:35 +0300)]
Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoggml : fix ggml_flash_attn to use op_params (#2387)
slaren [Tue, 25 Jul 2023 14:20:12 +0000 (16:20 +0200)]
ggml : fix ggml_flash_attn to use op_params (#2387)

* ggml : fix ggml_flash_attn to use op_params

23 months agoconvert.py : support bpe tokenizer (#2228)
ldwang [Tue, 25 Jul 2023 13:22:09 +0000 (21:22 +0800)]
convert.py : support bpe tokenizer (#2228)

* support bpe tokenizer in convert

Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert

Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert, fix

Signed-off-by: ldwang <redacted>
---------

Signed-off-by: ldwang <redacted>
Co-authored-by: ldwang <redacted>
23 months agoggml : relax contiguous constraints in activation function (#2371)
Jiahao Li [Tue, 25 Jul 2023 12:58:32 +0000 (20:58 +0800)]
ggml : relax contiguous constraints in activation function (#2371)

23 months agoggml : improve graph build time via hash table lookup (#2329)
slaren [Tue, 25 Jul 2023 12:32:20 +0000 (14:32 +0200)]
ggml : improve graph build time via hash table lookup (#2329)

* improve graph build time

* ggml_tensor : use 1 bit per flag

* use a hash table instead
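The speedup here comes from replacing a linear scan of already-visited nodes with O(1) hash lookups while expanding the graph, turning a quadratic build into a linear one. A toy sketch of the pattern (the dict-based node representation is an illustration, not ggml's):

```python
def build_graph_order(outputs):
    """Topological visit of a tensor DAG using a visited set for O(1)
    membership tests instead of scanning the node list per tensor."""
    visited, order = set(), []

    def visit(t):
        if id(t) in visited:  # hash lookup replaces the linear scan
            return
        visited.add(id(t))
        for src in t.get("srcs", []):
            visit(src)
        order.append(t["name"])

    for out in outputs:
        visit(out)
    return order

a = {"name": "a", "srcs": []}
b = {"name": "b", "srcs": [a]}
c = {"name": "c", "srcs": [a, b]}
order = build_graph_order([c])
```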

23 months agobuild : fix line breaking error in build-info.sh (#2349)
Hesen Peng [Tue, 25 Jul 2023 12:24:09 +0000 (05:24 -0700)]
build : fix line breaking error in build-info.sh (#2349)

* fix line breaking

* build number line break removal

23 months agomain : add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS (#2304)
Xiao-Yong Jin [Tue, 25 Jul 2023 12:19:11 +0000 (07:19 -0500)]
main : add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS (#2304)

* add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS

The BOS precedes the string specified by `--in-prefix`.
Model generated EOS is now kept in the context.

This provides a way to strictly follow the prompt format used in
Llama-2-chat.

The EOS handling also benefits some existing finetunes that use
EOS to mark the end of a turn.

* examples/common: move input_prefix_bos to other bools
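The input assembly described above can be sketched as follows: when `--in-prefix-bos` is set, a BOS token is placed before the `--in-prefix` text, and model-generated EOS tokens are kept in context rather than stripped. Token ids and names below are hypothetical:

```python
BOS = 1  # hypothetical BOS token id for illustration

def build_input(in_prefix_tokens, user_tokens, in_prefix_bos: bool):
    """Prepend BOS ahead of the --in-prefix tokens when --in-prefix-bos is set."""
    toks = [BOS] if in_prefix_bos else []
    return toks + in_prefix_tokens + user_tokens
```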

23 months agoci : add non-AVX scalar build/test (#2356)
Eve [Tue, 25 Jul 2023 12:16:13 +0000 (08:16 -0400)]
ci : add non-AVX scalar build/test (#2356)

* noavx build and test

* we don't need to remove f16c in windows

23 months agok_quants : add AVX support to dot functions with QK_K as 64 (#2339)
katsu560 [Tue, 25 Jul 2023 12:13:41 +0000 (21:13 +0900)]
k_quants : add AVX support to dot functions with QK_K as 64 (#2339)

* add AVX to ggml_vec_dot_q2_K_q8_K()

* add AVX to ggml_vec_dot_q3_K_q8_K()

* add AVX to ggml_vec_dot_q4_K_q8_K()

* add AVX to ggml_vec_dot_q5_K_q8_K()

* add AVX to ggml_vec_dot_q6_K_q8_K()

* refactor AVX code in ggml_vec_dot_q6_K_q8_K()

23 months agometal : concurrently dispatch commands (#2358)
Shouzheng Liu [Tue, 25 Jul 2023 12:00:19 +0000 (08:00 -0400)]
metal : concurrently dispatch commands (#2358)

* metal: concurrently dispatch commands

When `ggml_metal_graph_compute` is called for the first time, the function
`ggml_metal_graph_find_concurrency` runs and writes commands that can be
issued concurrently into the Metal context's `concur_list` array.

* metal: don't call find_concurrency automatically.

* metal : code style changes

---------

Co-authored-by: Georgi Gerganov <redacted>
23 months agoAnother speed gain for Q4_0 and Q4_1 on Metal (#2375)
Kawrakow [Tue, 25 Jul 2023 10:48:29 +0000 (13:48 +0300)]
Another speed gain for Q4_0 and Q4_1 on Metal (#2375)

* Another speed gain for Q4_0 and Q4_1 on Metal

* Have N_DST, etc., be template parameters

---------

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoFix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)
Kawrakow [Tue, 25 Jul 2023 10:48:04 +0000 (13:48 +0300)]
Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)

* Fix Q4_K and Q5_K for QK_K = 64

* Very slightly better Q5_K bit fiddling

---------

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoserver: add rms_norm_eps parameter (#2380)
slaren [Tue, 25 Jul 2023 09:36:17 +0000 (11:36 +0200)]
server: add rms_norm_eps parameter (#2380)

23 months ago[Server] Escape HTML in webchat (#2368)
Henri Vasserman [Tue, 25 Jul 2023 07:27:34 +0000 (10:27 +0300)]
[Server] Escape HTML in webchat (#2368)

* escape HTML in webchat
* add amp

23 months agomake rms_norm_eps a parameter (#2374)
slaren [Mon, 24 Jul 2023 15:57:12 +0000 (17:57 +0200)]
make rms_norm_eps a parameter (#2374)

* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci

23 months agoChat UI extras (#2366)
Aarni Koskela [Mon, 24 Jul 2023 14:54:22 +0000 (17:54 +0300)]
Chat UI extras (#2366)

* makefile: correct deps for server

* server: tighten settings layout a little

* server: expose all currently configured generation params in UI

* server: expose remaining generation params, for the adventurous

* server: embetter mirostat fields

23 months agoggml : sync (unary ops refactor, static-correctness) (#2370)
Georgi Gerganov [Mon, 24 Jul 2023 11:46:21 +0000 (14:46 +0300)]
ggml : sync (unary ops refactor, static-correctness) (#2370)

* ggml : sync (unary ops, tests)

ggml-ci

* tests : remove unnecessary funcs

23 months agoFix scalar version of Q5_K when QK_K = 64 (#2362)
Kawrakow [Mon, 24 Jul 2023 09:55:02 +0000 (12:55 +0300)]
Fix scalar version of Q5_K when QK_K = 64 (#2362)

Co-authored-by: Iwan Kawrakow <redacted>
23 months agollama : add grammar-based sampling (#1773)
Evan Jones [Mon, 24 Jul 2023 03:58:10 +0000 (23:58 -0400)]
llama : add grammar-based sampling (#1773)

* llama, main : constrain sampling to grammar

* allow loading grammar from file

* fix whitespace errors

* handle & print parser errors

* add comments to grammar syntax and allow newlines where unambiguous

* add missing include

* support alternates in root rule

* fix bugs with empty token and EOS

* adjust JSON grammar

* remove swp file

* rewrite ternary expressions

Co-authored-by: Henri Vasserman <redacted>
* use struct for grammar elements and add Unicode support

* add unicode escapes

* add inverse char ranges

* only sample full tokens (no peeking or truncation)

* llama : minor style changes

blindly applied in online editor - hopefully I didn't break something

* update help text

* add warning message if EOS is disabled

---------

Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Georgi Gerganov <redacted>
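The core of grammar-constrained sampling is a filtering step: before a token is sampled, candidates the grammar cannot accept in its current state are rejected, and sampling proceeds over the survivors. A toy sketch where a predicate stands in for the real GBNF grammar state machine:

```python
def constrain_candidates(candidates, grammar_accepts):
    """Keep only (token, logit) pairs whose token text the grammar can accept.
    `grammar_accepts` stands in for the real grammar's acceptance check."""
    return [(tok, logit) for tok, logit in candidates if grammar_accepts(tok)]

# Toy grammar: only digit tokens are allowed next.
digits_only = lambda tok: tok.isdigit()
filtered = constrain_candidates([("7", 0.9), ("a", 1.2), ("42", 0.1)], digits_only)
```

Note the commit's "only sample full tokens" point: the check applies to whole tokens, with no peeking into or truncating a token's text.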
23 months agoSome more Q4_K and Q5_K speedup on CUDA (#2346)
Kawrakow [Sun, 23 Jul 2023 21:19:47 +0000 (00:19 +0300)]
Some more Q4_K and Q5_K speedup on CUDA (#2346)

* Faster Q5_K on CUDA

* Small Q5_K improvement on older GPUs

* Speed up Q4_K on CUDA

GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t

* Speed up Q4_K on CUDA

GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080:  9.8 ms/t ->  9.5 ms/t

* Address PR comments

* Add some comments to satisfy PR reviewer

---------

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoAdd gqa parameter support to the server (#2351)
IgnacioFDM [Sun, 23 Jul 2023 20:31:17 +0000 (17:31 -0300)]
Add gqa parameter support to the server (#2351)

* Add gqa parameter support to the server
* Change help from stderr to stdout

23 months agoFix __dp4a documentation (#2348)
Johannes Gäßler [Sun, 23 Jul 2023 15:49:06 +0000 (17:49 +0200)]
Fix __dp4a documentation (#2348)

23 months agocommon : n_threads == -1 uses std::thread::hardware_concurrency() (#2347)
wzy [Sun, 23 Jul 2023 13:33:02 +0000 (21:33 +0800)]
common : n_threads == -1 uses std::thread::hardware_concurrency() (#2347)

* Fix #2345, fix incorrect n_threads

* Update examples/common.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
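The behavior fixed here, `-1` resolving to the machine's hardware thread count via `std::thread::hardware_concurrency()`, has a direct Python analogue using `os.cpu_count()`; this is a sketch of the resolution rule, not the actual C++ code:

```python
import os

def resolve_n_threads(n_threads: int) -> int:
    """-1 means 'use all hardware threads'; any other value passes through."""
    if n_threads == -1:
        return os.cpu_count() or 1  # cpu_count() may return None
    return n_threads
```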
23 months agofix n_tasks (#2342)
slaren [Sun, 23 Jul 2023 13:19:39 +0000 (15:19 +0200)]
fix n_tasks (#2342)

ggml-ci

23 months agoggml: move op parameters from tensors to ggml_tensor::op_params (#2333)
slaren [Sun, 23 Jul 2023 12:36:02 +0000 (14:36 +0200)]
ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)

* ggml: move op parameters from tensors to ggml_tensor::op_params

* alibi: use memcpy for float params

* remove `src[1] = NULL` in ops

23 months agollama : grouped-query attention + LLaMAv2 70B support (#2276)
Georgi Gerganov [Sun, 23 Jul 2023 12:09:47 +0000 (15:09 +0300)]
llama : grouped-query attention + LLaMAv2 70B support (#2276)

* CUDA: GQA implementation

* llama : support for GQA and LLaMAv2 70B

ggml-ci

* py : fix hparams parsing (if-else blocks)

ggml-ci

* py : oh boy ..

ggml-ci

* help : fix gqa value for 70B

ggml-ci

---------

Co-authored-by: JohannesGaessler <redacted>
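In grouped-query attention, `n_head` query heads share `n_head_kv` key/value heads: each group of `n_head / n_head_kv` consecutive query heads reads the same KV head. LLaMA-2 70B uses n_head=64 with n_head_kv=8. A minimal sketch of the head mapping:

```python
def kv_head_for(q_head: int, n_head: int, n_head_kv: int) -> int:
    """Map a query head to the key/value head it shares under GQA."""
    assert n_head % n_head_kv == 0
    group = n_head // n_head_kv  # query heads per KV head
    return q_head // group
```

With n_head_kv == n_head this degenerates to standard multi-head attention (each head has its own KV), and with n_head_kv == 1 to multi-query attention (all heads share one KV).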
23 months agollama : print help to stdout (#2338)
maddes8cht [Sun, 23 Jul 2023 11:59:48 +0000 (13:59 +0200)]
llama : print help to stdout (#2338)

23 months agoflake : support `nix build '.#opencl'` (#2337)
wzy [Sun, 23 Jul 2023 11:57:02 +0000 (19:57 +0800)]
flake : support `nix build '.#opencl'` (#2337)

23 months agollama : print max tensor size to stderr (#2336)
Christian Demsar [Sun, 23 Jul 2023 11:56:34 +0000 (07:56 -0400)]
llama : print max tensor size to stderr (#2336)

23 months agomake : fix CLBLAST compile support in FreeBSD (#2331)
Jose Maldonado [Sun, 23 Jul 2023 11:52:08 +0000 (07:52 -0400)]
make : fix CLBLAST compile support in FreeBSD (#2331)

* Fix Makefile for CLBLAST compile support and add instructions for compiling llama.cpp on FreeBSD

* More general use-case for CLBLAST support (Linux and FreeBSD)

23 months agoexamples : simplify vim plugin (#2327)
AustinMroz [Sun, 23 Jul 2023 11:16:48 +0000 (06:16 -0500)]
examples : simplify vim plugin (#2327)

Uses the builtin json_encode and json_decode functions to simplify escaping.
Removes the need for temp files.

23 months agometal : support bcast add & dup & cont op (#2323)
Jiahao Li [Sun, 23 Jul 2023 11:00:37 +0000 (19:00 +0800)]
metal : support bcast add & dup & cont op (#2323)

23 months agoSpeed up Q4_K (#2322)
Kawrakow [Sun, 23 Jul 2023 05:49:20 +0000 (08:49 +0300)]
Speed up Q4_K (#2322)

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoCUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313)
Johannes Gäßler [Sat, 22 Jul 2023 19:27:34 +0000 (21:27 +0200)]
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313)

23 months agollama : optimize memory buffers (#2325)
Georgi Gerganov [Sat, 22 Jul 2023 18:17:57 +0000 (21:17 +0300)]
llama : optimize memory buffers (#2325)

23 months agoPerplexity: Compute scores correlated to HellaSwag (#2312)
klosax [Sat, 22 Jul 2023 12:21:24 +0000 (14:21 +0200)]
Perplexity: Compute scores correlated to HellaSwag (#2312)

* Add parameter --perplexity-lines to perplexity.cpp

23 months agoexamples : basic VIM plugin
whoreson [Sat, 22 Jul 2023 10:34:51 +0000 (12:34 +0200)]
examples : basic VIM plugin

VIM plugin for the server executable

23 months agoci : fix args
Georgi Gerganov [Sat, 22 Jul 2023 09:00:56 +0000 (12:00 +0300)]
ci : fix args

23 months agoci : add 7B CUDA tests (#2319)
Georgi Gerganov [Sat, 22 Jul 2023 08:48:22 +0000 (11:48 +0300)]
ci : add 7B CUDA tests (#2319)

* ci : add 7B CUDA tests

ggml-ci

* ci : add Q2_K to the tests

* ci : bump CUDA ppl chunks

ggml-ci

* ci : increase CUDA TG len + add --ignore-eos

* ci : reduce CUDA ppl chunks down to 4 to save time

23 months agoexamples : add easy python script to create quantized (k-bit support) GGML models...
Richard Roberson [Fri, 21 Jul 2023 19:01:10 +0000 (13:01 -0600)]
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311)

* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

---------

Co-authored-by: Georgi Gerganov <redacted>
23 months agoCustom RoPE + better memory management for CUDA (#2295)
Kawrakow [Fri, 21 Jul 2023 14:27:51 +0000 (17:27 +0300)]
Custom RoPE + better memory management for CUDA (#2295)

* Custom RoPE + better memory management for CUDA

* Adjusted look ahead in ggml_cuda_pool_malloc to 5%

This seems to be sufficient.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.

---------

Co-authored-by: Iwan Kawrakow <redacted>
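The 5% look-ahead mentioned above is a pool-allocator trick: reserve slightly more than requested so that slightly larger future requests can reuse the same buffer instead of triggering a new allocation. A sketch of the sizing rule only (the function name is hypothetical):

```python
def pool_alloc_size(requested: int, look_ahead: float = 0.05) -> int:
    """Bytes actually reserved by the pool: the request plus 5% look-ahead,
    so a later request up to 5% larger can reuse this buffer."""
    return int(requested * (1.0 + look_ahead))
```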
23 months agoFaster Q3_K implementation on Metal (#2307)
Kawrakow [Fri, 21 Jul 2023 14:05:30 +0000 (17:05 +0300)]
Faster Q3_K implementation on Metal (#2307)

* Faster Q3_K on Metal

* Additional Q3_K speedup on Metal

* Q3_K for QK_K = 64

* Better Q3_K for QK_K = 64

21.6 ms/t -> 21.1 ms/t

---------

Co-authored-by: Iwan Kawrakow <redacted>
23 months agoggml : fix the rope fix (513f8619535a64fa9ace808cdcbcf66211535f5c)
Georgi Gerganov [Fri, 21 Jul 2023 12:16:55 +0000 (15:16 +0300)]
ggml : fix the rope fix (513f8619535a64fa9ace808cdcbcf66211535f5c)

23 months agoexamples : fix typo in minigpt4.py (#2298)
Ikko Eltociear Ashimine [Fri, 21 Jul 2023 11:53:07 +0000 (20:53 +0900)]
examples : fix typo in minigpt4.py (#2298)

promt -> prompt