git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log


commit | commitdiff | tree

Kawrakow [Thu, 8 Jun 2023 16:46:22 +0000 (19:46 +0300)]

metal : Q6_K implementation (#1752)

* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.

We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!
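
The throughput estimate above can be reproduced with a few lines. This is only a sketch of the arithmetic in the message; the function name is made up for illustration, and it assumes the whole model is streamed from memory once per generated token.

```python
def effective_bandwidth_gb_s(model_size_gb, matmul_ms_per_token):
    """GB/s needed to read the whole model once within the given time per token."""
    return model_size_gb / (matmul_ms_per_token / 1000.0)

# Figures quoted in the message: ~5.5 GB model, ~19 ms of matrix
# multiplications per token (28.3 ms total minus ~9 ms of other graph ops).
print(round(effective_bandwidth_gb_s(5.5, 19.0)))  # 289, i.e. roughly 290 GB/s
```

Applying the same ~9 ms subtraction to the earlier 36.5 ms / token figure gives about 200 GB/s, which matches the number quoted for the first Q6_K variant.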

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

qingfengfenga [Thu, 8 Jun 2023 07:58:53 +0000 (15:58 +0800)]

Add llama.cpp docker support for non-latin languages (#1673)

* Modify Dockerfile default character set to improve compatibility (#1673)

commit | commitdiff | tree

Steven Roussey [Thu, 8 Jun 2023 07:12:28 +0000 (00:12 -0700)]

ggml : fix fprintf warnings (#1720)

commit | commitdiff | tree

Georgi Gerganov [Thu, 8 Jun 2023 07:09:08 +0000 (10:09 +0300)]

clang-tidy : restore dot file from accidental deletion

commit | commitdiff | tree

Kawrakow [Thu, 8 Jun 2023 07:08:23 +0000 (10:08 +0300)]

metal : add Q4_K implementation (#1733)

* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

johnson442 [Thu, 8 Jun 2023 07:02:48 +0000 (08:02 +0100)]

k-quants : add missing compile definition to CMakeLists (#1748)

commit | commitdiff | tree

Georgi Gerganov [Wed, 7 Jun 2023 07:59:52 +0000 (10:59 +0300)]

k-quants : allow to optionally disable at compile time (#1734)

* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default

commit | commitdiff | tree

jacobi petrucciani [Wed, 7 Jun 2023 04:15:31 +0000 (00:15 -0400)]

flake : update to support metal on m1/m2 (#1724)

commit | commitdiff | tree

Georgi Gerganov [Wed, 7 Jun 2023 04:15:08 +0000 (07:15 +0300)]

readme : add June roadmap

commit | commitdiff | tree

Willy Tarreau [Wed, 7 Jun 2023 02:10:17 +0000 (04:10 +0200)]

main: add the possibility to open the prompt cache read-only (#1640)

The prompt cache constitutes a nice speed up when using the same prompt
prefix across multiple evaluations, but when using it, it will also be
updated, which is not always desirable. One use case is to have a large
prompt containing some context and usage rules, and a second part
containing variable data of the problem being studied. In this case it's
desirable to be able to save the first part once, and to always reuse it
as-is without updating it with the second part.

The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The parts of the prompt that match the cache are loaded
from the cache, but the cache itself is not modified. This reduced the
total analysis time from 112s to 49.7s here, without having to back up
and restore a copy of the prompt cache, which takes significant time at
500 MB.
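
The read-only behaviour described above can be sketched as follows. This is a toy model, not the code in `main`: token lists stand in for the cached session, and the names `common_prefix_len` and `run` are hypothetical.

```python
def common_prefix_len(cached, prompt):
    """Number of leading tokens shared by the cached session and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def run(cache, prompt, read_only):
    """Return (tokens reused from the cache, cache contents after the run)."""
    reused = common_prefix_len(cache, prompt)
    # Tokens after the shared prefix would be evaluated by the model here.
    new_cache = cache if read_only else list(prompt)
    return reused, new_cache

context = [1, 2, 3]            # large fixed context, saved once
prompt = context + [9, 9]      # context plus variable data
print(run(context, prompt, read_only=True))
print(run(context, prompt, read_only=False))
```

In read-only mode the saved context stays as it was, so the next run with different variable data still finds the same reusable prefix.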

Signed-off-by: Willy Tarreau <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 19:54:39 +0000 (22:54 +0300)]

llama : fix vram_scratch var

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 19:41:53 +0000 (22:41 +0300)]

llama : fix compile warnings

commit | commitdiff | tree

Johannes Gäßler [Tue, 6 Jun 2023 19:33:23 +0000 (21:33 +0200)]

Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)

* CUDA multi GPU + scratch

ggml_cuda_compute_forward

Tensor parallelism

ggml_cuda_add

ggml_cuda_rms_norm

ggml_cuda_silu

CUDA scratch buffer

--main-gpu CLI option
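
The "Tensor parallelism" item can be pictured as splitting the rows of a weight matrix across devices and concatenating the partial results. The sketch below is an illustration only: the even split and the helper names are assumptions, since the log does not say how the commit distributes rows.

```python
def split_rows(n_rows, n_devices):
    """Return [start, end) row ranges, one per device, as evenly as possible."""
    base, extra = divmod(n_rows, n_devices)
    ranges, start = [], 0
    for d in range(n_devices):
        end = start + base + (1 if d < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def matvec(rows, x):
    """Plain matrix-vector product on lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

def matvec_split(weight, x, n_devices):
    """Each device multiplies its own slice of rows; results are concatenated."""
    out = []
    for start, end in split_rows(len(weight), n_devices):
        out.extend(matvec(weight[start:end], x))  # would run on one device
    return out

weight = [[1, 2], [3, 4], [5, 6]]
print(split_rows(len(weight), 2))
print(matvec_split(weight, [1, 1], 2))
```

Because each output row depends on only one weight row, the split result is identical to the single-device product.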

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 17:16:57 +0000 (20:16 +0300)]

metal : add f16 support

commit | commitdiff | tree

LostRuins [Tue, 6 Jun 2023 17:00:01 +0000 (01:00 +0800)]

Clblast fixes + enhancements to save VRAM and offload more layers (#1675)

* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation

* Clblast fixes + enhancements to save VRAM:

1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them.
2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer
3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it.

* change max value size_t to use limits

* removed flags from the CL pool malloc, apply code tidying suggestions.
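
Points 2 and 3 of the pool malloc change amount to a best-fit policy with a resize fallback. A minimal sketch, assuming buffers can be represented by their sizes alone (the class and method names are hypothetical, and the real code manages `cl_mem` objects):

```python
class BufferPool:
    def __init__(self):
        self.free = []  # sizes of the buffers currently available for reuse

    def release(self, size):
        self.free.append(size)

    def acquire(self, size):
        """Return the size of the buffer handed out for a request of `size` bytes."""
        fitting = [s for s in self.free if s >= size]
        if fitting:
            best = min(fitting)       # smallest buffer that fits, not the first
            self.free.remove(best)
            return best
        if self.free:
            largest = max(self.free)  # nothing fits: recycle the largest one
            self.free.remove(largest)
            return size               # stands in for resizing it to `size`
        return size                   # pool empty: fresh allocation

pool = BufferPool()
for s in (64, 256, 128):
    pool.release(s)
print(pool.acquire(100), sorted(pool.free))
print(pool.acquire(1000), sorted(pool.free))
```

Picking the smallest fitting buffer keeps large buffers available for large requests, and resizing the largest free buffer avoids accumulating many buffers that are all too small.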

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 07:18:03 +0000 (10:18 +0300)]

ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710)

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 06:55:10 +0000 (09:55 +0300)]

gitignore : add .clang-tidy

commit | commitdiff | tree

Georgi Gerganov [Tue, 6 Jun 2023 06:39:38 +0000 (09:39 +0300)]

llama : temporarily disable Q6_K output quantization (#1711)

commit | commitdiff | tree

Spencer Sutton [Tue, 6 Jun 2023 03:28:17 +0000 (23:28 -0400)]

metal : add checks for buffer size (#1706)

Co-authored-by: Spencer Sutton <redacted>

commit | commitdiff | tree

Yuval Peled [Mon, 5 Jun 2023 20:32:36 +0000 (23:32 +0300)]

docs : add performance troubleshoot + example benchmark documentation (#1674)

* test anchor link

* test table

* add benchmarks

* Add performance troubleshoot & benchmark

* add benchmarks

* remove unneeded line

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Foul-Tarnished [Mon, 5 Jun 2023 20:28:37 +0000 (22:28 +0200)]

readme : fix typo (#1700)

Fix a typo in a command in README.md

commit | commitdiff | tree

mgroeber9110 [Mon, 5 Jun 2023 20:24:29 +0000 (22:24 +0200)]

llama : consistently catch and throw only exceptions deriving from std::exception (#1599)

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

kiltyj [Mon, 5 Jun 2023 20:24:04 +0000 (13:24 -0700)]

metal : use shared buffers between CPU and GPU (#1696)

* Use MTLDevice.newBufferWithBytesNoCopy to share buffers between CPU and GPU

* Page-align buffers used by Metal

* Remove trailing whitespace

* Only import unistd.h for Metal builds

* metal : remove unnecessary copies
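
Sharing a buffer without a copy generally requires page-aligned memory whose length is a whole number of pages, which is why the buffers are page-aligned here. The rounding step can be sketched as below; the helper name is hypothetical, and `mmap.PAGESIZE` reports the page size of the machine running the sketch.

```python
import mmap

def page_aligned_size(n_bytes, page_size=mmap.PAGESIZE):
    """Round n_bytes up to the next multiple of page_size."""
    return (n_bytes + page_size - 1) // page_size * page_size

# Anonymous mmap memory starts on a page boundary, so a mapping of this
# length satisfies both conditions: aligned start, whole number of pages.
size = page_aligned_size(100_000)
buf = mmap.mmap(-1, size)
print(size, len(buf), size % mmap.PAGESIZE)
buf.close()
```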

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

grahameth [Mon, 5 Jun 2023 20:11:49 +0000 (22:11 +0200)]

ggml : fix internal overflow in ggml_time_us on Windows (#1702)

Co-authored-by: grahameth <->
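
The kind of overflow named in this commit can be shown with simulated 64-bit arithmetic. Everything below is an assumption made for illustration: a 10 MHz tick counter that counts from boot, a conversion of the form ticks * 1000000 / frequency, and a remedy that subtracts the counter value taken at program start. The log itself does not describe the actual fix.

```python
INT64_MAX = 2**63 - 1

def wrap_int64(v):
    """Reduce v to the value a wrapping signed 64-bit integer would hold."""
    v &= 2**64 - 1
    return v - 2**64 if v > INT64_MAX else v

def time_us_naive(ticks, freq):
    # The intermediate product ticks * 1_000_000 is what overflows.
    return wrap_int64(ticks * 1_000_000) // freq

def time_us_relative(ticks, start_ticks, freq):
    # Subtracting the start value keeps the intermediate product small.
    return wrap_int64((ticks - start_ticks) * 1_000_000) // freq

FREQ = 10_000_000              # assumed 10 MHz counter
uptime_s = 12 * 24 * 3600      # machine has been up for 12 days
start = uptime_s * FREQ        # counter value when the program started
now = start + 5 * FREQ         # five seconds later

print(time_us_naive(now, FREQ))            # garbage (negative) after wrapping
print(time_us_relative(now, start, FREQ))  # 5000000
```

With these assumed numbers the naive product exceeds the signed 64-bit range after about 10.7 days of uptime, so the problem only shows up on long-running machines.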

commit | commitdiff | tree

Georgi Gerganov [Mon, 5 Jun 2023 20:05:05 +0000 (23:05 +0300)]

ci : disable auto tidy (#1705)

commit | commitdiff | tree

Kawrakow [Mon, 5 Jun 2023 19:56:18 +0000 (22:56 +0300)]

ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)

* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
batch mode (perplexity) is about the same.

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, performance is about the same as Q4_0
for both single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on a RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Strangely enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. Only realized something
is not quite right when I tried the larger models and started getting
nonsense back.

In any case, Q2_K single token evaluation times on an RTX 4080 in a
Ryzen 7950X box, using CUDA with the model fully loaded on the GPU, are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for the 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems the memory access pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly faster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON Q4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code
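
The "Fix quantization error test" item above rests on the fact that round-trip error grows quickly as bits are removed. The sketch below shows this with plain uniform quantization of one block. It is not the k-quants scheme, and the block size, data, and function names are assumptions chosen for illustration.

```python
import math
import random

def quantize_roundtrip(values, bits):
    """Uniformly quantize one block to 2**bits levels and dequantize it again."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]

def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(256)]
errors = {bits: rmse(block, quantize_roundtrip(block, bits)) for bits in (2, 3, 4, 5, 6)}
for bits, err in errors.items():
    print(bits, "bits:", err)
```

Each bit removed roughly doubles the step size, so a single error threshold tuned for 4-bit and larger types cannot reasonably apply to the 2- and 3-bit variants.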

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Henri Vasserman [Mon, 5 Jun 2023 10:43:08 +0000 (13:43 +0300)]

Increase 3B scratch buffers. (#1698)

The 128 MB was too optimistic.
Too bad it is not dynamically computed.

commit | commitdiff | tree

Georgi Gerganov [Mon, 5 Jun 2023 07:19:03 +0000 (10:19 +0300)]

llama : fix Metal KV cache sync (close #1695)

commit | commitdiff | tree

Georgi Gerganov [Sun, 4 Jun 2023 20:38:19 +0000 (23:38 +0300)]

readme : update hot topics

commit | commitdiff | tree

Georgi Gerganov [Sun, 4 Jun 2023 20:34:30 +0000 (23:34 +0300)]

llama : Metal inference (#1642)

* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
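
Several of the kernels named in this list (rms_norm, silu, soft_max, diag_mask_inf) compute standard operations. The following reference sketch shows the math on plain lists; it is not the Metal code, and the `eps` value and the exact masking convention are assumptions.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale x so that its root-mean-square is approximately 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def silu(x):
    """SiLU activation: v * sigmoid(v)."""
    return [v / (1.0 + math.exp(-v)) for v in x]

def soft_max(x):
    """Numerically stable softmax (subtract the maximum before exponentiating)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def diag_mask_inf(scores, n_past):
    """Mask future positions: row i keeps only columns j <= n_past + i."""
    return [[v if j <= n_past + i else -math.inf for j, v in enumerate(row)]
            for i, row in enumerate(scores)]

masked = diag_mask_inf([[1.0, 2.0], [3.0, 4.0]], n_past=0)
print(masked)
print([soft_max(row) for row in masked])
```

Masking with negative infinity before the softmax gives masked positions a weight of exactly zero, which is how causal attention is usually enforced.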

commit | commitdiff | tree

0cc4m [Sun, 4 Jun 2023 06:12:05 +0000 (08:12 +0200)]

OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)

* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation

commit | commitdiff | tree

Henri Vasserman [Sat, 3 Jun 2023 13:35:20 +0000 (16:35 +0300)]

Add info about CUDA_VISIBLE_DEVICES (#1682)

commit | commitdiff | tree

Jiří Podivín [Sat, 3 Jun 2023 12:11:53 +0000 (14:11 +0200)]

Docker: change to calling convert.py (#1641)

Deprecation disclaimer was added to convert-pth-to-ggml.py

commit | commitdiff | tree

Evan Jones [Sat, 3 Jun 2023 11:28:45 +0000 (07:28 -0400)]

Fix prompt cache saving and chat-persistent rollover (#1678)

* Fix prompt cache saving and chat-persistent rollover (fixes #1670)

* clang-tidy

Co-authored-by: github-actions[bot] <redacted>
---------

Co-authored-by: github-actions[bot] <redacted>

commit | commitdiff | tree

Henri Vasserman [Tue, 30 May 2023 18:24:22 +0000 (21:24 +0300)]

OpenLLaMA 3B support (#1588)

This adds support to llama.cpp to load the model.

Currently missing are changes that are required from convert.py to convert the model correctly. It needs some changes to start reading the JSON configuration for HF models instead of deriving the values by guessing.

Co-authored-by: FNsi <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 May 2023 16:31:44 +0000 (19:31 +0300)]

ggml : sync cgraph import / export API

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 May 2023 16:30:49 +0000 (19:30 +0300)]

ggml : fix bug in ggml_alibi

commit | commitdiff | tree

DannyDaemonic [Mon, 29 May 2023 12:13:40 +0000 (05:13 -0700)]

Work around for recalculating logits in cached prompts (Fixes #1585) (#1609)

* Work around for recalculating logits in cached prompts

commit | commitdiff | tree

Jiří Podivín [Mon, 29 May 2023 04:45:50 +0000 (06:45 +0200)]

Adding git in container package dependencies (#1621)

Git added to build packages for version information in docker image

Signed-off-by: Jiri Podivin <redacted>

commit | commitdiff | tree

Johannes Gäßler [Sun, 28 May 2023 19:01:02 +0000 (21:01 +0200)]

LLAMA_DEBUG adds debug symbols (#1617)

commit | commitdiff | tree

Kerfuffle [Sun, 28 May 2023 17:48:57 +0000 (11:48 -0600)]

Only show -ngl option when relevant + other doc/arg handling updates (#1625)

1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and add `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
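
The argument handling described in points 2 and 5 can be sketched with a generic option parser. This is an analogy only, not the parser in the examples. The `--ctx-size` spelling, the defaults, and the `supports_gpu_offload` flag are assumptions; the flag stands in for the `LLAMA_SUPPORTS_GPU_OFFLOAD` define. As a simplification, the sketch omits `-ngl` entirely when offload is unavailable instead of only hiding it from the help text.

```python
import argparse

def build_parser(supports_gpu_offload):
    p = argparse.ArgumentParser(prog="server-sketch")
    # Dash-separated spelling first, legacy underscore spelling kept as an alias.
    p.add_argument("--ctx-size", "--ctx_size", dest="ctx_size", type=int, default=512)
    p.add_argument("--memory-f32", "--memory_f32", dest="memory_f32", action="store_true")
    if supports_gpu_offload:
        p.add_argument("-ngl", "--n-gpu-layers", dest="n_gpu_layers", type=int, default=0)
    return p

args = build_parser(True).parse_args(["-ngl", "32", "--ctx_size", "2048"])
print(args.n_gpu_layers, args.ctx_size, args.memory_f32)
print("n-gpu-layers" in build_parser(False).format_help())
```

Registering both spellings on one option keeps old command lines working while the documentation moves to the dash-separated form.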

commit | commitdiff | tree

Vladimir Zorin [Sun, 28 May 2023 17:14:24 +0000 (20:14 +0300)]

examples : add --alias option to gpt_params to set user-friendly model name (#1614)

commit | commitdiff | tree

Howard Su [Sun, 28 May 2023 17:13:36 +0000 (01:13 +0800)]

opencl : no need to allocate cl_mem on heap (#1612)

commit | commitdiff | tree

Howard Su [Sun, 28 May 2023 17:09:56 +0000 (01:09 +0800)]

opencl : use strstr to check if fp16 supported (#1611)

* Use strstr to check if fp16 supported

* Ensure ext_buffer is null terminated

commit | commitdiff | tree

apcameron [Sat, 27 May 2023 20:03:25 +0000 (21:03 +0100)]

ggml : add support for the RISCV architecture (#1616)

commit | commitdiff | tree

Kerfuffle [Sat, 27 May 2023 17:04:14 +0000 (11:04 -0600)]

Include server in releases + other build system cleanups (#1610)

Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets built. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.

Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)

Fix issue where `vdot` binary wasn't removed when running `make clean`.

Fix compile warnings in `server` example.

Add `.hpp` files to trigger workflow (the server example has one).

commit | commitdiff | tree

Henri Vasserman [Sat, 27 May 2023 15:47:55 +0000 (18:47 +0300)]

Add documentation about CLBlast (#1604)

Installing, compiling and using.

commit | commitdiff | tree

Henri Vasserman [Sat, 27 May 2023 14:24:06 +0000 (17:24 +0300)]

[CI] Fix openblas (#1613)

* Fix OpenBLAS build

* Fix `LLAMA_BLAS_VENDOR` CMake variable that should be a string and not a boolean.

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 May 2023 13:19:56 +0000 (16:19 +0300)]

ggml : add ggml_tensor_overhead()

commit | commitdiff | tree

Henri Vasserman [Sat, 27 May 2023 12:18:25 +0000 (15:18 +0300)]

[CI] CLBlast: Fix directory name (#1606)

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 May 2023 09:22:05 +0000 (12:22 +0300)]

ggml : sync ggml core (minor additions, e.g. ggml_get_tensor_by_name())

commit | commitdiff | tree

Kerfuffle [Fri, 26 May 2023 02:18:01 +0000 (20:18 -0600)]

Some improvements to loading the session with --prompt-cache (#1550)

Improvements to loading the session with `--prompt-cache` in the `main` example.

1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.

commit | commitdiff | tree

Johannes Gäßler [Thu, 25 May 2023 21:07:29 +0000 (23:07 +0200)]

cuda : performance optimizations (#1530)

* xor hack

* block y dim

* loop unrolling

* Fixed cmake LLAMA_CUDA_BY option

* Removed hipblas compatibility code

* Define GGML_CUDA_DMMV_BLOCK_Y if not defined

* Fewer iters, more ops per iter

* Renamed DMMV X/Y compilation options

commit | commitdiff | tree

Henri Vasserman [Wed, 24 May 2023 07:30:09 +0000 (10:30 +0300)]

Update CLBlast to 1.6.0 (#1580)

* Update CLBlast to 1.6.0

commit | commitdiff | tree

Evan Jones [Wed, 24 May 2023 06:24:01 +0000 (02:24 -0400)]

readme : add docs for chat-persistent.sh (#1568)

* readme : add docs for chat-persistent.sh

* Update README.md

commit | commitdiff | tree

Senemu [Wed, 24 May 2023 06:16:22 +0000 (06:16 +0000)]

chat-persistent.sh : use bracket expressions in grep (#1564)

commit | commitdiff | tree

Maarten ter Huurne [Tue, 23 May 2023 16:01:15 +0000 (18:01 +0200)]

Fix handling of "invalid property" when creating OpenCL command queue (#1565)

The `clCreateCommandQueue()` function will return the code
`CL_INVALID_QUEUE_PROPERTIES` when passed unsupported properties,
not `CL_INVALID_PROPERTY` as the original code was checking for.

commit | commitdiff | tree

0cc4m [Mon, 22 May 2023 21:33:24 +0000 (23:33 +0200)]

OpenCL Token Generation Acceleration (#1459)

* Move back to C++ for OpenCL

* Refactor OpenCL code to work more like the CUDA code, add missing functions

* Deduplicate dequant kernels

* Add OpenCL compile options

* Use compile args for preprocessing constants

* Restore default platform + device selection by id behavior

---------

Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: Henri Vasserman <redacted>

commit | commitdiff | tree

Steward Garcia [Sun, 21 May 2023 17:51:18 +0000 (11:51 -0600)]

examples : add server example with REST API (#1443)

* Added httplib support

* Added readme for server example

* fixed some bugs

* Fix the build error on Macbook

* changed json11 to nlohmann-json

* removed some whitespaces

* remove trailing whitespace

* added support for custom prompts and more functions

* some corrections and added as cmake option

commit | commitdiff | tree

Stefan Sydow [Sun, 21 May 2023 14:03:44 +0000 (16:03 +0200)]

make : .PHONY clean (#1553)

commit | commitdiff | tree

Georgi Gerganov [Sun, 21 May 2023 08:56:23 +0000 (11:56 +0300)]

ggml : output 3d sizes in ggml_graph_dump_dot()

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 17:00:41 +0000 (20:00 +0300)]

ggml : update WASM SIMD

commit | commitdiff | tree

Zenix [Sat, 20 May 2023 14:58:31 +0000 (23:58 +0900)]

feature : support blis and other blas implementation (#1536)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <redacted>
* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Henri Vasserman [Sat, 20 May 2023 14:57:39 +0000 (17:57 +0300)]

OpenCL: Fixes for older devices. (#1435)

* Remove `constant`

* Rewrite platform and device selection

* Fix Q8_0

commit | commitdiff | tree

Juuso Alasuutari [Sat, 20 May 2023 12:58:15 +0000 (15:58 +0300)]

llama : define magic numbers as integer constants (#1518) (#1520)

The underlying representation of multibyte character literals is
implementation-defined. This could, at least in principle, cause
cross-build data export/import issues independent of endianness.

Define magic numbers as integer literals to be on the safe side.
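
To make the point concrete: a four-character tag corresponds to a 32-bit integer only under a particular packing convention, whereas an integer literal fixes the value outright. The sketch below uses the tag `ggjt` as an example; the log does not list which magic numbers were changed, so the tag and the helper name are assumptions.

```python
import struct

def magic_from_tag(tag):
    """Pack a 4-byte tag into a 32-bit integer, first character in the top byte."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    return int.from_bytes(tag, "big")

# A fixed integer constant does not depend on how a compiler happens to
# evaluate a multi-character literal such as 'ggjt'.
MAGIC_GGJT = 0x67676A74

print(hex(magic_from_tag(b"ggjt")))
print(struct.pack("<I", MAGIC_GGJT))  # bytes as written by a little-endian host
```

The packing in `magic_from_tag` mirrors what common compilers do for multi-character literals, but that behaviour is implementation-defined, which is exactly the risk the commit removes.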

Signed-off-by: Juuso Alasuutari <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 12:34:45 +0000 (15:34 +0300)]

ggml : add ggml_clamp() (#1539)

* ggml : add ggml_clamp()

* ggml : indentation

commit | commitdiff | tree

Johannes Gäßler [Sat, 20 May 2023 12:19:28 +0000 (14:19 +0200)]

cuda : loading models directly into VRAM, norm calculation on GPU, broadcasting for ggml_mul (#1483)

* Broadcasting for ggml_mul

* CUDA kernel for ggml_mul, norms in VRAM

* GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* define default model path once, sync path with readme (#1366)

* ~7% faster Q5_1 AVX2 code (#1477)

* convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

* benchmark-matmul: Print the average of the test results (#1490)

* Remove unused n_parts parameter (#1509)

* Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

* make kv_f16 the default for api users (#1517)

* minor : fix compile warnings

* readme : adds WizardLM to the list of supported models (#1485)

* main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1748513591ad45b175a75ed1d8089d84c8.

* Update gpt_params_parse and fix a merge error take 2

* examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <redacted>
* tests : add missing header

* ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics

* ggml : fix scalar implementation of Q4_1 dot

* llama : fix compile warnings in llama_set_state_data()

* llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <redacted>
* Update llama-util.h

Co-authored-by: github-actions[bot] <redacted>
* Code style

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: github-actions[bot] <redacted>
Co-authored-by: Georgi Gerganov <redacted>
* Fix for mingw (#1462)

* llama : add llama_init_backend() API (close #1527)

* feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
* Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace0f9da424d82e75df969642522880feb92.

* GPU weights not in RAM, direct loading with cuFile

* llama : code style fixes + progress print fix

* ggml : ggml_mul better broadcast support

* cmake : workarounds for cufile when CMake version < 3.25

* gg rebase fixup

* Loop in llama.cpp, fixed progress callback

* Attempt clang-tidy fix

* llama : fix vram size computation

* Add forgotten fclose()
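
The "Broadcasting for ggml_mul" item in this list refers to multiplying a matrix by a single row that is reused for every row, for example when applying per-channel norm weights. A minimal sketch on plain lists follows; the function name and the shape check are assumptions, not the ggml implementation.

```python
def mul_broadcast(a, b):
    """Element-wise a * b where `a` is a list of rows and `b` is one row
    that is reused (broadcast) for every row of `a`."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("row length of a must match length of b")
    return [[x * w for x, w in zip(row, b)] for row in a]

activations = [[1, 2], [3, 4]]   # one row per token
weights = [10, 100]              # one weight per channel
print(mul_broadcast(activations, weights))
```

Broadcasting avoids materialising a repeated copy of the weight row for each token, which matters when the operation runs on the GPU.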

---------

Co-authored-by: András Salamon <redacted>
Co-authored-by: Ilya Kurdyukov <redacted>
Co-authored-by: Tom Jobbins <redacted>
Co-authored-by: rankaiyx <redacted>
Co-authored-by: Stephan Walter <redacted>
Co-authored-by: DannyDaemonic <redacted>
Co-authored-by: Erik Scholz <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: David Kennedy <redacted>
Co-authored-by: Jason McCartney <redacted>
Co-authored-by: Evan Jones <redacted>
Co-authored-by: Maxime <redacted>
Co-authored-by: github-actions[bot] <redacted>
Co-authored-by: Zenix <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 09:03:48 +0000 (12:03 +0300)]

Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace0f9da424d82e75df969642522880feb92.

commit | commitdiff | tree

Zenix [Sat, 20 May 2023 09:02:48 +0000 (18:02 +0900)]

feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 08:06:11 +0000 (11:06 +0300)]

llama : add llama_init_backend() API (close #1527)

commit | commitdiff | tree

DannyDaemonic [Sat, 20 May 2023 07:40:02 +0000 (00:40 -0700)]

Fix for mingw (#1462)

commit | commitdiff | tree

Maxime [Sat, 20 May 2023 07:22:37 +0000 (09:22 +0200)]

llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <redacted>
* Update llama-util.h

Co-authored-by: github-actions[bot] <redacted>
* Code style

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: github-actions[bot] <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 07:14:31 +0000 (10:14 +0300)]

llama : fix compile warnings in llama_set_state_data()

commit | commitdiff | tree

Georgi Gerganov [Sat, 20 May 2023 07:13:19 +0000 (10:13 +0300)]

ggml : fix scalar implementation of Q4_1 dot

commit | commitdiff | tree

Georgi Gerganov [Fri, 19 May 2023 19:17:18 +0000 (22:17 +0300)]

ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
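
The effect of storing the per-block scale as F16 instead of F32 can be estimated from the block layout. The sketch below assumes blocks of 32 weights, one scale for Q4_0 and Q8_0, and a scale plus a minimum for Q4_1; these layout details are assumptions, not taken from this log.

```python
import struct

QK = 32  # weights per block (assumed)

# (number of scale fields, packed quant bytes) per block
layouts = {
    "Q4_0": (1, QK // 2),  # one scale, 4 bits per weight
    "Q4_1": (2, QK // 2),  # scale + minimum, 4 bits per weight
    "Q8_0": (1, QK),       # one scale, 8 bits per weight
}

def block_bytes(n_scales, n_quant_bytes, scale_fmt):
    """Size of one block; scale_fmt is 'f' (F32) or 'e' (F16) in struct notation."""
    return struct.calcsize("<" + scale_fmt * n_scales + "%dB" % n_quant_bytes)

for name, (n_scales, n_q) in layouts.items():
    old = block_bytes(n_scales, n_q, "f")
    new = block_bytes(n_scales, n_q, "e")
    print(name, old, "->", new, "bytes per block;",
          old * 8 / QK, "->", new * 8 / QK, "bits per weight")
```

Since the on-disk block layout changes, files written with the old layout are no longer readable as-is, which is consistent with the bump of LLAMA_FILE_VERSION in the same commit.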

commit | commitdiff | tree

Georgi Gerganov [Fri, 19 May 2023 18:17:28 +0000 (21:17 +0300)]

tests : add missing header

commit | commitdiff | tree

Evan Jones [Fri, 19 May 2023 17:39:51 +0000 (13:39 -0400)]

examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <redacted>

Jason McCartney [Fri, 19 May 2023 17:24:59 +0000 (10:24 -0700)]

main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1748513591ad45b175a75ed1d8089d84c8.

* Update gpt_params_parse and fix a merge error take 2
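The behaviour described above, a reverse prompt acting as a stop token, can be sketched as follows. This is only an illustration of the idea and not the code in `main`; `generate_until_stop` and its parameters are hypothetical names, and the token stream is simulated with a list of strings.

```python
def generate_until_stop(token_stream, reverse_prompts):
    """Accumulate generated text and stop once it ends with a reverse prompt."""
    text = ""
    for piece in token_stream:
        text += piece
        # A reverse prompt may span several tokens, so compare against the
        # tail of the accumulated text rather than the latest piece alone.
        if any(text.endswith(rp) for rp in reverse_prompts):
            break
    return text

# Simulated model output: generation stops right after "User:" appears.
print(repr(generate_until_stop(["Hi", "\n", "User", ":", " more"], ["User:"])))
```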

David Kennedy [Fri, 19 May 2023 17:16:30 +0000 (13:16 -0400)]

readme : adds WizardLM to the list of supported models (#1485)

Georgi Gerganov [Fri, 19 May 2023 17:14:51 +0000 (20:14 +0300)]

minor : fix compile warnings

Erik Scholz [Thu, 18 May 2023 17:31:01 +0000 (19:31 +0200)]

make kv_f16 the default for api users (#1517)

DannyDaemonic [Thu, 18 May 2023 17:30:40 +0000 (10:30 -0700)]

Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

Stephan Walter [Wed, 17 May 2023 22:12:01 +0000 (22:12 +0000)]

Remove unused n_parts parameter (#1509)

rankaiyx [Wed, 17 May 2023 14:47:58 +0000 (22:47 +0800)]

benchmark-matmul: Print the average of the test results (#1490)

Tom Jobbins [Tue, 16 May 2023 22:04:35 +0000 (23:04 +0100)]

convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

Ilya Kurdyukov [Tue, 16 May 2023 18:36:47 +0000 (01:36 +0700)]

~7% faster Q5_1 AVX2 code (#1477)

András Salamon [Tue, 16 May 2023 15:46:34 +0000 (16:46 +0100)]

define default model path once, sync path with readme (#1366)

sandyiscool [Tue, 16 May 2023 08:30:15 +0000 (14:00 +0530)]

Add alternate include path for openblas (#1476)

In some Linux distributions (Fedora, for example), the include path for OpenBLAS is located at '/usr/local/include'.

zrm [Mon, 15 May 2023 02:25:42 +0000 (22:25 -0400)]

fix get_num_physical_cores() (#1436)

* fix get_num_physical_cores()
had been broken on complex topologies because "cpu cores" in /proc/cpuinfo is per-"physical id"

* Add spaces to maintain consistent formatting

---------

Co-authored-by: slaren <redacted>
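The problem described above is that the "cpu cores" field in /proc/cpuinfo is reported per "physical id" (per socket), so reading it once undercounts on multi-socket machines. One way to avoid that, shown here only as a sketch and not necessarily the approach taken in the commit, is to count unique ("physical id", "core id") pairs. `count_physical_cores` is a hypothetical helper that takes the file contents as a string.

```python
def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" makes sure the last processor entry is flushed too.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():  # a blank line ends one logical processor entry
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)

# Synthetic example: 2 sockets x 2 cores x 2 hardware threads = 8 logical CPUs.
# Each entry reports "cpu cores: 2", which is the per-socket count.
SAMPLE = "\n\n".join(
    f"processor\t: {i}\nphysical id\t: {sock}\ncore id\t\t: {core}\ncpu cores\t: 2"
    for i, (sock, core) in enumerate(
        [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
    )
)
print(count_physical_cores(SAMPLE))  # prints 4
```

On a real Linux system the function could be fed `open("/proc/cpuinfo").read()`; not every architecture reports these two fields, in which case the count is 0 and a fallback is needed.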

slaren [Sun, 14 May 2023 20:46:00 +0000 (22:46 +0200)]

benchmark-matmul: fix clang-tidy issues, report results in GFLOPS (#1458)

* benchmark-matmul: fix command line parsing, replace macros with functions, report results in GFLOPS

Johannes Gäßler [Sun, 14 May 2023 18:53:23 +0000 (20:53 +0200)]

cuda : deduplicated dequantization code (#1453)

xaedes [Sun, 14 May 2023 15:55:02 +0000 (17:55 +0200)]

ggml : alternative fix for race condition bug in non-inplace ggml_compute_forward_diag_mask_f32 (#1454)

* fix race condition bug in non-inplace ggml_compute_forward_diag_mask_f32

memcpy needs to be synchronized across threads to avoid race conditions.
=> do it in INIT phase

* remove trailing whitespace

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <redacted>
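The fix described above moves the source-to-destination copy into a phase that completes before any worker starts masking. The sketch below illustrates that pattern with plain Python threads and a barrier; it is not ggml code. It assumes the copy is done by a single thread and that the mask sets every element whose column index exceeds `n_past + row` to negative infinity. `diag_mask_inf` and its parameters are hypothetical names.

```python
import threading

def diag_mask_inf(src, n_past, n_threads=4):
    """Non-inplace diagonal mask: copy once, then mask rows in parallel."""
    n_rows = len(src)
    dst = [None] * n_rows
    barrier = threading.Barrier(n_threads)

    def worker(ith):
        # INIT phase: exactly one thread copies src into dst.
        if ith == 0:
            for r in range(n_rows):
                dst[r] = list(src[r])
        # No thread may touch dst until the copy has finished.
        barrier.wait()
        # COMPUTE phase: each thread masks its own interleaved set of rows.
        for r in range(ith, n_rows, n_threads):
            for c in range(n_past + r + 1, len(dst[r])):
                dst[r][c] = float("-inf")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dst

masked = diag_mask_inf([[0.0] * 5 for _ in range(3)], n_past=1)
print(masked[0])
```

Without the barrier, a thread could mask a row and then have its work overwritten by a copy that is still in progress, which is the kind of race the commit message describes.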

Georgi Gerganov [Sun, 14 May 2023 15:22:50 +0000 (18:22 +0300)]

ggml : various fixes (#1450)

- `ggml_rope()`
- `ggml_diag_mask_inf()` multi-threaded
- compatibility with scratch buffers

katsu560 [Sun, 14 May 2023 10:03:51 +0000 (19:03 +0900)]

ggml : add AVX support based on AVX2 code (#1430)

Georgi Gerganov [Sun, 14 May 2023 07:20:19 +0000 (10:20 +0300)]

ggml : add GGML_QNT_VERSION to track quantization format changes

https://github.com/ggerganov/ggml/issues/150#issuecomment-1546625668

Georgi Gerganov [Sat, 13 May 2023 14:40:58 +0000 (17:40 +0300)]

cuda : fix convert function (#1412)

Georgi Gerganov [Sat, 13 May 2023 14:25:09 +0000 (17:25 +0300)]

make : fix PERF build with cuBLAS

Georgi Gerganov [Sat, 13 May 2023 13:55:14 +0000 (16:55 +0300)]

llama : fix unused warning

Georgi Gerganov [Sat, 13 May 2023 13:48:03 +0000 (16:48 +0300)]

ggml : multi-thread mul and diag_mask ops (#1428)

Johannes Gäßler [Sat, 13 May 2023 13:38:36 +0000 (15:38 +0200)]

ggml : GPU-accelerated token generation (#1412)

* CUDA kernel for q4_0 dequantization + matrix-vector multiplication

* Added q4_1 via template

* Added missing __syncthreads();

* --gpu_layers -> --gpu-layers

* Shorter dequantize_mul_mat_vec line

* q5_0 dequantize_mul_mat kernel

* More readable dequantize_mul_mat_vec logic

* dequantize_mul_mat_vec kernels for q5_1, q8_0, f16

* llama : offload "output" tensor to GPU too + coding style fixes

---------

Co-authored-by: Georgi Gerganov <redacted>
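The idea behind a fused dequantize + matrix-vector kernel is to expand each quantized block on the fly and accumulate the dot product immediately, without materialising the full float matrix. The CPU sketch below shows only the arithmetic, not the CUDA kernel from the commit. It assumes a q4_0-style block of 32 weights with one scale `d` and 4-bit values offset by 8, with low nibbles holding the first 16 weights and high nibbles the last 16; the nibble order and all function names are assumptions for illustration.

```python
QK = 32  # assumed number of weights per block

def dequantize_q4_0_block(d, qs):
    """Expand 16 packed bytes into 32 floats: weight = (nibble - 8) * d."""
    lo = [((b & 0x0F) - 8) * d for b in qs]  # assumed: weights 0..15
    hi = [((b >> 4) - 8) * d for b in qs]    # assumed: weights 16..31
    return lo + hi

def dequantize_mul_mat_vec(rows, y):
    """Multiply a quantized matrix (list of rows of (d, qs) blocks) by y."""
    out = []
    for blocks in rows:
        acc = 0.0
        for bi, (d, qs) in enumerate(blocks):
            base = bi * QK
            # Dequantize one block and fold it into the row's dot product.
            for i, w in enumerate(dequantize_q4_0_block(d, qs)):
                acc += w * y[base + i]
        out.append(acc)
    return out

# One row with one block: low nibbles are 8 (weight 0.0), high nibbles are 9
# (weight 0.5), so the dot product with a vector of ones is 16 * 0.5.
row = [(0.5, bytes([0x98] * 16))]
print(dequantize_mul_mat_vec([row], [1.0] * QK))  # prints [8.0]
```

On a GPU, each output row (or a slice of it) would typically be handled by a group of threads that reduce their partial sums, which is presumably where a `__syncthreads()` comes in.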

Packaging of ggml-org/llama.cpp