git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Johannes Gäßler [Mon, 31 Jul 2023 19:02:19 +0000 (21:02 +0200)]
CUDA: fixed LLAMA_FAST compilation option (#2473)
Johannes Gäßler [Mon, 31 Jul 2023 17:52:22 +0000 (19:52 +0200)]
CUDA: fixed cmake F16 option (#2471)
Johannes Gäßler [Mon, 31 Jul 2023 13:44:35 +0000 (15:44 +0200)]
CUDA: mmq CLI option, fixed mmq build issues (#2453)
Johannes Gäßler [Mon, 31 Jul 2023 12:32:30 +0000 (14:32 +0200)]
CUDA: Implemented row flattening for non-glm RoPE (#2468)
Johannes Gäßler [Mon, 31 Jul 2023 11:18:51 +0000 (13:18 +0200)]
CUDA: fewer memory bank conflicts for mul_mat_q (#2458)
slaren [Mon, 31 Jul 2023 09:02:53 +0000 (11:02 +0200)]
Fix Metal backend broken from the allocator changes (#2455)
* fix Metal backend broken from the allocator changes
slaren [Sun, 30 Jul 2023 13:58:01 +0000 (15:58 +0200)]
ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator
* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset
* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
Johannes Gäßler [Sat, 29 Jul 2023 21:04:44 +0000 (23:04 +0200)]
CUDA: Quantized matrix matrix multiplication (#2160)
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
Johannes Gäßler [Sat, 29 Jul 2023 21:04:10 +0000 (23:04 +0200)]
CUDA: faster multi GPU synchronization (#2448)
klosax [Fri, 28 Jul 2023 18:25:36 +0000 (20:25 +0200)]
perplexity : add Hellaswag calculation (#2389)
* common.h : add hellaswag / remove perplexity-lines
* common.cpp : add hellaswag / remove perplexity-lines
* perplexity.cpp : add hellaswag scores / remove perplexity-lines
* perplexity.cpp : clean up
* common.h : change default param value
* common.cpp : Change default param
* perplexity.cpp : alter wording
* common.h : alter wording
* common.cpp : alter wording
Lee [Fri, 28 Jul 2023 18:17:45 +0000 (02:17 +0800)]
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405)
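For context, a minimal sketch of the kind of workaround this requires; the macro name and exact composition here are illustrative assumptions, not necessarily the code merged in #2405:

    #include <immintrin.h>

    // GCC < 8 lacks _mm256_set_m128i / _mm256_setr_m128i, so the 256-bit vector is
    // assembled from two 128-bit halves using intrinsics that older GCC does provide.
    #if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 8
    #define MM256_SET_M128I(hi, lo) _mm256_insertf128_si256(_mm256_castsi128_si256(lo), (hi), 1)
    #else
    #define MM256_SET_M128I(hi, lo) _mm256_set_m128i((hi), (lo))
    #endif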
eric8607242 [Fri, 28 Jul 2023 18:10:05 +0000 (02:10 +0800)]
llama : support more diverse tokenizers? (#2420)
* supporting more diverse tokenizers
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 28 Jul 2023 18:05:08 +0000 (21:05 +0300)]
examples : fix whitespace
nhamanasu [Fri, 28 Jul 2023 18:02:10 +0000 (03:02 +0900)]
examples : server chat mode with llama2 (#2400)
* add: server chat mode with llama2
* fix: remove the unnecessary last \n
Weird Constructor [Fri, 28 Jul 2023 08:44:43 +0000 (10:44 +0200)]
readme : fix the description of the Tail free sampling (TFS) method (#2431)
Rand Xie [Fri, 28 Jul 2023 08:42:53 +0000 (01:42 -0700)]
llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433)
niansa/tuxifan [Fri, 28 Jul 2023 01:14:11 +0000 (03:14 +0200)]
Obtaining LLaMA 2 instructions (#2308)
* Obtaining LLaMA 2 instructions
* Removed sharing warning for LLaMA 2
* Linked TheBloke's GGML repos
* Add LLaMA 2 to list of supported models
* Added LLaMA 2 usage instructions
* Added links to LLaMA 2 70B models
mj-shifu [Thu, 27 Jul 2023 20:39:17 +0000 (22:39 +0200)]
convert.py : Update to support 70B HF format model files (#2427)
* convert.py : fix llama 2 70b conversion from Huggingface
Georgi Gerganov [Thu, 27 Jul 2023 08:00:54 +0000 (11:00 +0300)]
metal : disable graph concurrency optimization due to bug (#2413)
slaren [Wed, 26 Jul 2023 21:57:23 +0000 (23:57 +0200)]
ggml : fix assert in ggml_set_unary_op (#2410)
Cebtenzzre [Wed, 26 Jul 2023 18:00:04 +0000 (14:00 -0400)]
make : build with -Wmissing-prototypes (#2394)
slaren [Wed, 26 Jul 2023 13:56:53 +0000 (15:56 +0200)]
ggml : allocate graphs in a context (#2392)
* ggml : graph allocation in contexts
* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx
* llama.cpp : allocate graph in the context
* add GGML_PAD
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Tue, 25 Jul 2023 15:35:53 +0000 (18:35 +0300)]
Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <redacted>
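A minimal sketch of how such a compile-time default is typically consumed; the guard pattern is illustrative and the fallback value below is a placeholder, not necessarily the value chosen in #2384:

    // Illustrative only: let the build system override the default RMS-norm epsilon.
    #ifndef LLAMA_DEFAULT_RMS_EPS
    #define LLAMA_DEFAULT_RMS_EPS 1e-6f   // placeholder fallback
    #endif

    static const float rms_norm_eps_default = LLAMA_DEFAULT_RMS_EPS;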
slaren [Tue, 25 Jul 2023 14:20:12 +0000 (16:20 +0200)]
ggml : fix ggml_flash_attn to use op_params (#2387)
* ggml : fix ggml_flash_attn to use op_params
ldwang [Tue, 25 Jul 2023 13:22:09 +0000 (21:22 +0800)]
convert.py : support bpe tokenizer (#2228)
* support bpe tokenizer in convert
Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert
Signed-off-by: ldwang <redacted>
* support bpe tokenizer in convert, fix
Signed-off-by: ldwang <redacted>
---------
Signed-off-by: ldwang <redacted>
Co-authored-by: ldwang <redacted>
Jiahao Li [Tue, 25 Jul 2023 12:58:32 +0000 (20:58 +0800)]
ggml : relax contiguous constraints in activation function (#2371)
slaren [Tue, 25 Jul 2023 12:32:20 +0000 (14:32 +0200)]
ggml : improve graph build time via hash table lookup (#2329)
* improve graph build time
* ggml_tensor : use 1 bit per flag
* use a hash table instead
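A rough sketch of the idea behind the hash-table change, assuming a visited-set keyed on the tensor pointer; the table size and function name are illustrative, not the actual ggml internals:

    #include <cstddef>

    static const size_t VISITED_HASH_SIZE = 4099;  // hypothetical table size (a prime)

    // O(1) "already visited?" check while expanding the graph, instead of a linear scan.
    // keys[] holds the tensor pointers that have already been added to the graph.
    static bool visit_once(const void * keys[VISITED_HASH_SIZE], const void * tensor) {
        size_t i = ((size_t) tensor) % VISITED_HASH_SIZE;
        while (keys[i] != NULL && keys[i] != tensor) {
            i = (i + 1) % VISITED_HASH_SIZE;   // linear probing on collisions
        }
        if (keys[i] == tensor) {
            return false;                      // already visited: skip re-expanding this subgraph
        }
        keys[i] = tensor;
        return true;                           // first visit
    }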
Hesen Peng [Tue, 25 Jul 2023 12:24:09 +0000 (05:24 -0700)]
build : fix line breaking error in build-info.sh (#2349)
* fix line breaking
* build number line break removal
Xiao-Yong Jin [Tue, 25 Jul 2023 12:19:11 +0000 (07:19 -0500)]
main : add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS (#2304)
* add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS
The BOS precedes the string specified by `--in-prefix`.
Model generated EOS is now kept in the context.
This provides a way to strictly follow the prompt format used in
Llama-2-chat.
The EOS handling also benefits some existing finetunes that use
EOS to mark the end of a turn.
* examples/common: move input_prefix_bos to other bools
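As a hedged usage illustration only (the model path and prefix text are placeholders): an invocation along the lines of ./main -m model.bin -i --in-prefix-bos --in-prefix "[INST] " would prepend BOS before the "[INST] " prefix on each user turn, while any model-generated EOS now stays in the context.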
Eve [Tue, 25 Jul 2023 12:16:13 +0000 (08:16 -0400)]
ci : add non-AVX scalar build/test (#2356)
* noavx build and test
* we don't need to remove f16c in windows
katsu560 [Tue, 25 Jul 2023 12:13:41 +0000 (21:13 +0900)]
k_quants : add AVX support to dot functions with QK_K as 64 (#2339)
* add AVX to ggml_vec_dot_q2_K_q8_K()
* add AVX to ggml_vec_dot_q3_K_q8_K()
* add AVX to ggml_vec_dot_q4_K_q8_K()
* add AVX to ggml_vec_dot_q5_K_q8_K()
* add AVX to ggml_vec_dot_q6_K_q8_K()
* refactor AVX code in ggml_vec_dot_q6_K_q8_K()
Shouzheng Liu [Tue, 25 Jul 2023 12:00:19 +0000 (08:00 -0400)]
metal : concurrently dispatch commands (#2358)
* metal: concurrently dispatch commands
When `ggml_metal_graph_compute` is called for the first time,
`ggml_metal_graph_find_concurrency` runs and writes the commands that can be
issued concurrently into the metal context's `concur_list` array.
* metal: don't call find_concurrency automatically.
* metal : code style changes
---------
Co-authored-by: Georgi Gerganov <redacted>
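A rough sketch of the call pattern described above; the exact signatures live in ggml-metal.h, and both the argument types and the variable names here (ctx_metal, gf) are placeholders:

    // Hedged sketch: find the concurrency list once, then compute as usual.
    ggml_metal_graph_find_concurrency(ctx_metal, gf);  // fills the context's concur_list
    ggml_metal_graph_compute(ctx_metal, gf);           // can now dispatch those commands concurrently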
Kawrakow [Tue, 25 Jul 2023 10:48:29 +0000 (13:48 +0300)]
Another speed gain for Q4_0 and Q4_1 on Metal (#2375)
* Another speed gain for Q4_0 and Q4_1 on Metal
* Have N_DST, etc., be template parameters
---------
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Tue, 25 Jul 2023 10:48:04 +0000 (13:48 +0300)]
Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)
* Fix Q4_K and Q5_K for QK_K = 64
* Very slightly better Q5_K bit fiddling
---------
Co-authored-by: Iwan Kawrakow <redacted>
slaren [Tue, 25 Jul 2023 09:36:17 +0000 (11:36 +0200)]
server: add rms_norm_eps parameter (#2380)
Henri Vasserman [Tue, 25 Jul 2023 07:27:34 +0000 (10:27 +0300)]
[Server] Escape HTML in webchat (#2368)
* escape HTML in webchat
* add amp
slaren [Mon, 24 Jul 2023 15:57:12 +0000 (17:57 +0200)]
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
Aarni Koskela [Mon, 24 Jul 2023 14:54:22 +0000 (17:54 +0300)]
Chat UI extras (#2366)
* makefile: correct deps for server
* server: tighten settings layout a little
* server: expose all currently configured generation params in UI
* server: expose remaining generation params, for the adventurous
* server: embetter mirostat fields
Georgi Gerganov [Mon, 24 Jul 2023 11:46:21 +0000 (14:46 +0300)]
ggml : sync (unary ops refactor, static-correctness) (#2370)
* ggml : sync (unary ops, tests)
ggml-ci
* tests : remove unnecessary funcs
Kawrakow [Mon, 24 Jul 2023 09:55:02 +0000 (12:55 +0300)]
Fix scalar version of Q5_K when QK_K = 64 (#2362)
Co-authored-by: Iwan Kawrakow <redacted>
Evan Jones [Mon, 24 Jul 2023 03:58:10 +0000 (23:58 -0400)]
llama : add grammar-based sampling (#1773)
* llama, main : constrain sampling to grammar
* allow loading grammar from file
* fix whitespace errors
* handle & print parser errors
* add comments to grammar syntax and allow newlines where unambiguous
* add missing include
* support alternates in root rule
* fix bugs with empty token and EOS
* adjust JSON grammar
* remove swp file
* rewrite ternary expressions
Co-authored-by: Henri Vasserman <redacted>
* use struct for grammar elements and add Unicode support
* add unicode escapes
* add inverse char ranges
* only sample full tokens (no peeking or truncation)
* llama : minor style changes
blindly applied in online editor - hopefully I didn't break something
* update help text
* add warning message if EOS is disabled
---------
Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Georgi Gerganov <redacted>
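For flavor, a minimal hedged example of the kind of grammar text this feature consumes (GBNF-style syntax with alternates in the root rule; the exact format is defined by the grammar parser added in the PR):

    // Illustrative only: constrain sampling to one of two words.
    const char * tiny_grammar = "root ::= \"yes\" | \"no\"";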
Kawrakow [Sun, 23 Jul 2023 21:19:47 +0000 (00:19 +0300)]
Some more Q4_K and Q5_K speedup on CUDA (#2346)
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Speed up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Speed up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <redacted>
IgnacioFDM [Sun, 23 Jul 2023 20:31:17 +0000 (17:31 -0300)]
Add gqa parameter support to the server (#2351)
* Add gqa parameter support to the server
* Change help from stderr to stdout
Johannes Gäßler [Sun, 23 Jul 2023 15:49:06 +0000 (17:49 +0200)]
Fix __dp4a documentation (#2348)
wzy [Sun, 23 Jul 2023 13:33:02 +0000 (21:33 +0800)]
common : n_threads == -1 uses std::thread::hardware_concurrency() (#2347)
* Fix #2345, fix incorrect n_threads
* Update examples/common.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
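A minimal C++ sketch of the behavior described in the title; the function and variable names are illustrative:

    #include <thread>

    // Map the sentinel value -1 to the number of hardware threads.
    int resolve_n_threads(int n_threads) {
        if (n_threads == -1) {
            n_threads = (int) std::thread::hardware_concurrency();
        }
        return n_threads;
    }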
slaren [Sun, 23 Jul 2023 13:19:39 +0000 (15:19 +0200)]
fix n_tasks (#2342)
ggml-ci
slaren [Sun, 23 Jul 2023 12:36:02 +0000 (14:36 +0200)]
ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
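A small self-contained illustration of the pattern; the field name ggml_tensor::op_params comes from the commit title, while the helper names and the epsilon example are hypothetical:

    #include <cstring>
    #include <cstdint>

    // Small scalar parameters live in a fixed-size per-tensor buffer (op_params)
    // rather than being packed into an auxiliary source tensor.
    void set_op_param_f32(int32_t * op_params, float value) {
        std::memcpy(op_params, &value, sizeof(value));     // write side, when building the op
    }

    float get_op_param_f32(const int32_t * op_params) {
        float value;
        std::memcpy(&value, op_params, sizeof(value));     // read side, inside the kernel
        return value;
    }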
Georgi Gerganov [Sun, 23 Jul 2023 12:09:47 +0000 (15:09 +0300)]
llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <redacted>
maddes8cht [Sun, 23 Jul 2023 11:59:48 +0000 (13:59 +0200)]
llama : print help to stdout (#2338)
wzy [Sun, 23 Jul 2023 11:57:02 +0000 (19:57 +0800)]
flake : support `nix build '.#opencl'` (#2337)
Christian Demsar [Sun, 23 Jul 2023 11:56:34 +0000 (07:56 -0400)]
llama : print max tensor size to stderr (#2336)
Jose Maldonado [Sun, 23 Jul 2023 11:52:08 +0000 (07:52 -0400)]
make : fix CLBLAST compile support in FreeBSD (#2331)
* Fix Makefile for CLBLAST compile support and add instructions for compiling llama.cpp on FreeBSD
* More general use-case for CLBLAST support (Linux and FreeBSD)
AustinMroz [Sun, 23 Jul 2023 11:16:48 +0000 (06:16 -0500)]
examples : simplify vim plugin (#2327)
Uses builtin json_encode and json_decode functions to simplify escaping
Removes the need for temp files
Jiahao Li [Sun, 23 Jul 2023 11:00:37 +0000 (19:00 +0800)]
metal : support bcast add & dup & cont op (#2323)
Kawrakow [Sun, 23 Jul 2023 05:49:20 +0000 (08:49 +0300)]
Speed up Q4_K (#2322)
Co-authored-by: Iwan Kawrakow <redacted>
Johannes Gäßler [Sat, 22 Jul 2023 19:27:34 +0000 (21:27 +0200)]
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313)
Georgi Gerganov [Sat, 22 Jul 2023 18:17:57 +0000 (21:17 +0300)]
llama : optimize memory buffers (#2325)
klosax [Sat, 22 Jul 2023 12:21:24 +0000 (14:21 +0200)]
Perplexity: Compute scores correlated to HellaSwag (#2312)
* Add parameter --perplexity-lines to perplexity.cpp
whoreson [Sat, 22 Jul 2023 10:34:51 +0000 (12:34 +0200)]
examples : basic VIM plugin
VIM plugin for server exe
Georgi Gerganov [Sat, 22 Jul 2023 09:00:56 +0000 (12:00 +0300)]
ci : fix args
Georgi Gerganov [Sat, 22 Jul 2023 08:48:22 +0000 (11:48 +0300)]
ci : add 7B CUDA tests (#2319)
* ci : add 7B CUDA tests
ggml-ci
* ci : add Q2_K to the tests
* ci : bump CUDA ppl chunks
ggml-ci
* ci : increase CUDA TG len + add --ignore-eos
* ci : reduce CUDA ppl chunks down to 4 to save time
Richard Roberson [Fri, 21 Jul 2023 19:01:10 +0000 (13:01 -0600)]
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311)
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Fri, 21 Jul 2023 14:27:51 +0000 (17:27 +0300)]
Custom RoPE + better memory management for CUDA (#2295)
* Custom RoPE + better memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This seems to be sufficient.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Fri, 21 Jul 2023 14:05:30 +0000 (17:05 +0300)]
Faster Q3_K implementation on Metal (#2307)
* Faster Q3_K on Metal
* Additional Q3_K speedup on Metal
* Q3_K for QK_K = 64
* Better Q3_K for QK_K = 64
21.6 ms/t -> 21.1 ms/t
---------
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Fri, 21 Jul 2023 12:16:55 +0000 (15:16 +0300)]
Ikko Eltociear Ashimine [Fri, 21 Jul 2023 11:53:07 +0000 (20:53 +0900)]
examples : fix typo in minigpt4.py (#2298)
promt -> prompt
Georgi Gerganov [Fri, 21 Jul 2023 11:51:34 +0000 (14:51 +0300)]
ggml : fix rope args order + assert (#2054)
Georgi Gerganov [Fri, 21 Jul 2023 11:42:41 +0000 (14:42 +0300)]
gitignore : fix final newline
Guillaume "Vermeille" Sanchez [Fri, 21 Jul 2023 10:58:36 +0000 (12:58 +0200)]
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280)
Jose Maldonado [Fri, 21 Jul 2023 10:53:27 +0000 (06:53 -0400)]
gitignore : changes for Poetry users + chat examples (#2284)
A fix in the Makefile for FreeBSD users: on that platform x86_64 is reported as amd64. This fix resolves compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thanks to TheBloke for the models)
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 21 Jul 2023 10:50:55 +0000 (13:50 +0300)]
make : fix indentation
Georgi Gerganov [Fri, 21 Jul 2023 10:48:18 +0000 (13:48 +0300)]
ci : fix MNT realpath usage (#2250)
Sky Yan [Fri, 21 Jul 2023 10:38:57 +0000 (18:38 +0800)]
make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275)
In certain environments, nvcc and gcc are installed under a custom path rather than the standard path
Co-authored-by: Yan Lin <redacted>
wzy [Fri, 21 Jul 2023 10:26:34 +0000 (18:26 +0800)]
flake : remove intel mkl from flake.nix due to missing files (#2277)
NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261
Currently NixOS doesn't have the Intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975
So remove it from flake.nix
Some minor changes:
- Change pkgs.python310 to pkgs.python3 to keep latest
- Add pkgconfig to devShells.default
- Remove installPhase because we have `cmake --install` from #2256
Georgi Gerganov [Fri, 21 Jul 2023 10:10:51 +0000 (13:10 +0300)]
llama : make tensor_split ptr instead of array (#2272)
Jiří Podivín [Fri, 21 Jul 2023 10:09:16 +0000 (12:09 +0200)]
make : add new target for test binaries (#2244)
Programs in the tests directory are now built with the `tests` target
and placed in the same location.
* clean target was expanded to remove new binaries
* test target binaries are listed in a variable
* Locations of binaries were added to the .gitignore
Signed-off-by: Jiri Podivin <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Hatsune Miku [Fri, 21 Jul 2023 08:13:18 +0000 (08:13 +0000)]
MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287)
* Miku.sh: Set default model to llama-2-7b-chat
* Miku.sh: Set ctx_size to 4096
* Miku.sh: Add in-prefix/in-suffix opts
* Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements
Kawrakow [Fri, 21 Jul 2023 07:44:40 +0000 (10:44 +0300)]
Faster Q2_K on Metal (#2297)
* Faster Q2_K on Metal
* Deleting unnoticed and dangerous trailing whitespace
* Fixed bug in new metal Q2_K implementation
---------
Co-authored-by: Iwan Kawrakow <redacted>
Przemysław Pawełczyk [Fri, 21 Jul 2023 07:42:21 +0000 (09:42 +0200)]
make : fix embdinput library and server examples building on MSYS2 (#2235)
* make : fix embdinput library and server examples building on MSYS2
* cmake : fix server example building on MSYS2
Kawrakow [Thu, 20 Jul 2023 15:19:45 +0000 (18:19 +0300)]
Faster Q5_K and Q6_K on Metal (#2294)
* Faster Q6_K on Metal
* Faster Q5_K on Metal
* Another Q5_K speedup
---------
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Thu, 20 Jul 2023 12:18:43 +0000 (15:18 +0300)]
Faster Q4_K on Metal (#2290)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Thu, 20 Jul 2023 10:47:26 +0000 (13:47 +0300)]
llama : fix regression from #2000 - could not load no-mmap models
Shouzheng Liu [Thu, 20 Jul 2023 10:32:22 +0000 (06:32 -0400)]
metal: minor q4 optimization and reduce code size (#2248)
* metal: use uint16_t instead of uint8_t.
The Apple GPU doesn't like uint8_t: for every operation on a uint8_t,
the GPU needs to copy it into an empty 16-bit register before it can
issue other instructions.
For the matrix-vector multiplication kernel only, we observed a
340~350 GB/s memory read speed on M1 Max after this commit, which is
very close to the reported hardware limit.
* metal: update rms_norm kernel
This commit doubles the speed of rms_norm operations by using 512 threads
per threadgroup, combined with SIMD primitives to minimize the need for
threadgroup barriers.
* metal: use template to reduce size
Revert modifications on block_q4_0 and block_q4_1.
Rinne [Wed, 19 Jul 2023 07:06:40 +0000 (15:06 +0800)]
llama : extend API to get max devices at runtime (#2253)
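A hedged usage sketch; the return type is assumed from the PR description, so check llama.h for the exact signature:

    #include "llama.h"
    #include <cstdio>

    int main() {
        // Query the maximum number of devices at runtime instead of hardcoding a constant.
        int max_devices = llama_max_devices();
        std::printf("max devices: %d\n", max_devices);
        return 0;
    }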
wzy [Wed, 19 Jul 2023 07:01:55 +0000 (15:01 +0800)]
flake : update flake.nix (#2270)
When `isx86_32 || isx86_64`, it will use MKL, otherwise OpenBLAS
According to
https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3,
add -DCMAKE_SKIP_BUILD_RPATH=ON
Fix #2261, Nix doesn't provide mkl-sdl.pc.
When we build with -DBUILD_SHARED_LIBS=ON and -DLLAMA_BLAS_VENDOR=Intel10_lp64,
replace mkl-sdl.pc with mkl-dynamic-lp64-iomp.pc
wzy [Wed, 19 Jul 2023 07:01:11 +0000 (15:01 +0800)]
cmake : install targets (#2256)
fix #2252
Georgi Gerganov [Tue, 18 Jul 2023 11:24:43 +0000 (14:24 +0300)]
ci : integrate with ggml-org/ci (#2250)
* ci : run ctest
ggml-ci
* ci : add open llama 3B-v2 tests
ggml-ci
* ci : disable wget progress output
ggml-ci
* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations
ggml-ci
* tests : try to fix tail free sampling test
ggml-ci
* ci : add K-quants
ggml-ci
* ci : add short perplexity tests
ggml-ci
* ci : add README.md
* ppl : add --chunks argument to limit max number of chunks
ggml-ci
* ci : update README
Georgi Gerganov [Tue, 18 Jul 2023 08:50:49 +0000 (11:50 +0300)]
llama : shorten quantization descriptions
Jiahao Li [Mon, 17 Jul 2023 17:39:29 +0000 (01:39 +0800)]
Support dup & cont ops on CUDA (#2242)
Alex Klinkhamer [Sun, 16 Jul 2023 21:01:45 +0000 (14:01 -0700)]
llama : fix t_start_sample_us initialization warning (#2238)
Qingyou Meng [Sun, 16 Jul 2023 19:57:28 +0000 (03:57 +0800)]
ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219)
* fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG
* remove ifdef GGML_PERF; update fmt
Jiří Podivín [Sun, 16 Jul 2023 19:54:47 +0000 (21:54 +0200)]
py : turn verify-checksum-models.py into executable (#2245)
README.md was adjusted to reflect the change.
Signed-off-by: Jiri Podivin <redacted>
Xiao-Yong Jin [Sat, 15 Jul 2023 10:34:16 +0000 (06:34 -0400)]
llama : add custom RoPE (#2054)
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with defaults matching the original:
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameters (a small illustrative sketch follows this commit message).
Recent research shows that changing these two parameters extends the context limit with minimal loss:
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and with minimal perplexity degradation.
https://www.reddit.com/user/bloc97
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
A better way to calculate the required buffer size would still be preferable.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h it uses has a limit of
CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH (8192).
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <redacted>
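Following up on the theta_i parameterization above, a minimal self-contained sketch of how the per-dimension angles can be computed from the two new parameters; it mirrors the formula in the commit message rather than the actual ggml kernel:

    #include <cmath>
    #include <vector>

    // theta_i = scale * base^(-2(i-1)/d), for i in [1, 2, ..., d/2]
    std::vector<float> rope_thetas(int d, float freq_base, float freq_scale) {
        std::vector<float> thetas(d / 2);
        for (int i = 1; i <= d / 2; ++i) {
            thetas[i - 1] = freq_scale * std::pow(freq_base, -2.0f * (i - 1) / d);
        }
        return thetas;
    }

    // The defaults reproduce the original RoPE: rope_thetas(d, 10000.0f, 1.0f).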
Dave Della Costa [Fri, 14 Jul 2023 19:13:38 +0000 (15:13 -0400)]
flake : add runHook preInstall/postInstall to installPhase so hooks function (#2224)
wzy [Fri, 14 Jul 2023 19:05:08 +0000 (03:05 +0800)]
make : use pkg-config for OpenBLAS (#2222)
Bach Le [Fri, 14 Jul 2023 19:00:58 +0000 (03:00 +0800)]
cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer (#2220)
Evan Miller [Fri, 14 Jul 2023 18:55:56 +0000 (14:55 -0400)]
ggml : fix static_assert with older compilers #2024 (#2218)
Bach Le [Fri, 14 Jul 2023 18:55:24 +0000 (02:55 +0800)]
llama : add functions that work directly on model (#2197)
* Remove vocab reference from context
* Add functions that work directly with the model
Ali Chraghi [Fri, 14 Jul 2023 18:50:58 +0000 (11:50 -0700)]
build.zig : install config header (#2216)
Shangning Xu [Fri, 14 Jul 2023 18:40:05 +0000 (02:40 +0800)]
examples : fixed path typos in embd-input (#2214)