git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

JohnnyB [Tue, 30 Jan 2024 18:15:05 +0000 (12:15 -0600)]
server : change deps.sh xxd files to string literals (#5221)

* Changed ugly xxd to literals.

HPP files are much more readable as multiline literals rather than hex arrays.

* Dashes in literal variable names.

Replace . and - with _ in file names -> variable names.

* Comment on removing xxd.

XXD -> string literals

* XXD to string literals.

Replaced these unreadable headers with string literal versions using new deps.sh.
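
As a rough illustration of the change described above (the names here are made up, not the actual generated headers), an xxd -i style header versus a string-literal header might look like this:

```cpp
// Sketch only: illustrative names, not the real deps.sh output.
//
// xxd -i used to emit opaque byte arrays such as:
//   unsigned char index_html[] = {0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e};
//   unsigned int  index_html_len = 6;
//
// Emitting a raw string literal keeps the embedded file readable in the header:
#include <cstdio>

static const char index_html[] = R"HTML(
<html>
  <body>hello</body>
</html>
)HTML";

int main() {
    std::printf("%s", index_html); // the embedded asset is plain text in the source
    return 0;
}
```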

Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]
ggml : fix IQ3_XXS on Metal (#5219)

Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]
sync : ggml (#0)

Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]
gguf : fix comparison (ggml/715)

ggml-ci

John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]
`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)

* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcast and 4d tensor cuda copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <redacted>
Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]
gguf : add input validation, prevent integer overflows (ggml/709)

* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci

Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]
ci : fix yolo URLs + fix metal capture (ggml/712)

Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]
metal : add debug capture backend function (ggml/694)

Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]
Faster AVX2 dot product for IQ2_XS (#5187)

* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 improvement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>
Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]
SOTA 3-bit quants  (#5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <redacted>
0cc4m [Tue, 30 Jan 2024 12:59:30 +0000 (13:59 +0100)]
Vulkan Windows APU Memory Handling (#5199)

* Add basic UMA memory handling

Improve memory OOM behavior

Fix tests

* Fix UMA handling

* Also fix UMA handling for prealloc buffers

* Remove unnecessary warning message

* Remove outdated comment

Vladimir Malyutin [Tue, 30 Jan 2024 10:57:07 +0000 (17:57 +0700)]
quantize : fix typo (#5211)

Fix misprint in quantize help

divinity76 [Tue, 30 Jan 2024 09:18:02 +0000 (10:18 +0100)]
main : allow empty --prompt-cache file (#5176)

* allow empty --prompt-cache file

This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file APIs for the user.

I switched from the C fopen API to the C++ filesystem API to get around the fact that, to the best of my knowledge, C has no portable way to get the size of a file above LONG_MAX (std::ftell() returns long). There is a fallback to std::ifstream for C++ < 17.
(the project currently targets C++11, so file_exists() and file_size() can be removed when we upgrade to C++17)

* formatting

(requested in codereview)

* remove c++17, file_is_empty
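
For illustration, a minimal C++11 sketch of the kind of ifstream-based existence/size check described above (helper names are hypothetical, not necessarily what the PR landed with):

```cpp
// Hedged sketch, C++11: check that a prompt-cache file exists and whether it is empty,
// using std::ifstream instead of fopen/ftell (ftell returns long, which can overflow).
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

static bool file_exists(const std::string & path) {
    std::ifstream f(path.c_str());
    return f.good();
}

static uint64_t file_size(const std::string & path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    if (!f.good()) {
        return 0;
    }
    const std::streamoff off = f.tellg();
    return off < 0 ? 0 : static_cast<uint64_t>(off);
}

int main() {
    const std::string cache = "prompt.cache"; // hypothetical path
    if (!file_exists(cache) || file_size(cache) == 0) {
        std::cout << "prompt cache missing or empty - starting fresh\n";
    }
    return 0;
}
```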

Romain Neutron [Tue, 30 Jan 2024 09:16:38 +0000 (10:16 +0100)]
readme : minor (#5204)

This is about tuning the code formatting of the README file

Georgi Gerganov [Tue, 30 Jan 2024 09:14:44 +0000 (11:14 +0200)]
readme : update hot topics

Wu Jian Ping [Tue, 30 Jan 2024 09:11:46 +0000 (17:11 +0800)]
server : improve README (#5209)

Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]
ggml alloc: Fix for null dereference on alloc failure (#5200)

* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix

Jared Van Bortel [Mon, 29 Jan 2024 22:11:27 +0000 (17:11 -0500)]
kompute : fix fallback to CPU (#5201)

Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]
Nomic Vulkan backend (#4456)

Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
divinity76 [Mon, 29 Jan 2024 14:45:41 +0000 (15:45 +0100)]
fix typo "RLIMIT_MLOCK" (#5175)

Wu Jian Ping [Mon, 29 Jan 2024 13:48:10 +0000 (21:48 +0800)]
server : embeddings compatibility for OpenAI (#5190)

Georgi Gerganov [Mon, 29 Jan 2024 13:35:54 +0000 (15:35 +0200)]
py : fix except (#5194)

ggml-ci

Sang-Kil Park [Mon, 29 Jan 2024 09:24:19 +0000 (18:24 +0900)]
py : improve BPE tokenizer support (#5189)

slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]
ggml : add max buffer sizes to opencl and metal backends (#5181)

Eve [Mon, 29 Jan 2024 08:04:47 +0000 (08:04 +0000)]
cmake : fix Vulkan build (#5182)

Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]
metal : free metal objects (#5161)

* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix

Georgi Gerganov [Sun, 28 Jan 2024 17:48:05 +0000 (19:48 +0200)]
sync : ggml

Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]
ggml : minor type fix (int64_t -> size_t)

0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]
ggml : add Vulkan backend (#2059)

* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <redacted>
* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <redacted>
* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]
ggml : add unified SYCL backend for Intel GPUs (#2690)

* first update for migration

* update init_cublas

* add debug function, commit all helper code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue

* support main device being non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dpct::get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summarize dpct definitions in one header file to replace the dpct folder

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add known issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* rename as per review comments

* mv internal function to .cpp file

* export function print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on macOS, because the macOS CI depends on external ggml instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changes in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>
* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 28 Jan 2024 14:54:54 +0000 (16:54 +0200)]
flake.lock: Update (#5162)

Johannes Gäßler [Sun, 28 Jan 2024 08:59:49 +0000 (09:59 +0100)]
Apply min_p to unsorted tokens (#5115)

Johannes Gäßler [Sun, 28 Jan 2024 08:35:14 +0000 (09:35 +0100)]
Tests for min_p, sampling queue (#5147)

Marcus Dunn [Sun, 28 Jan 2024 08:30:44 +0000 (00:30 -0800)]
readme : add link to rust bindings (#5148)

* added link to another set of rust bindings with brief note on differences.

* fixed link name

sharpHL [Sun, 28 Jan 2024 08:00:30 +0000 (16:00 +0800)]
llama : add support for Orion-14B (#5118)

* add support for Orion-14B (https://huggingface.co/OrionStarAI/Orion-14B-Chat)

* flake8 support

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp

Co-authored-by: slaren <redacted>
* Update llama.cpp

* Update llama.cpp

---------

Co-authored-by: lixiaopu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
Kyle Mistele [Sun, 28 Jan 2024 07:55:31 +0000 (01:55 -0600)]
docker : add server-first container images (#5157)

* feat: add Dockerfiles for each platform that use ./server instead of ./main

* feat: update .github/workflows/docker.yml to build server-first docker containers

* doc: add information about running the server with Docker to README.md

* doc: add information about running with docker to the server README

* doc: update n-gpu-layers to show correct GPU usage

* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA

John [Sat, 27 Jan 2024 15:09:18 +0000 (16:09 +0100)]
llava : support for Yi-VL and fix for mobileVLM (#5093)

* Support for Yi-VL, templating fix for mobileVLM

* ws

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update llava-cli.cpp

* Update clip.cpp

bugfix for new conversions

---------

Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sat, 27 Jan 2024 14:59:20 +0000 (16:59 +0200)]
sync : ggml

Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]
ggml : check ggml_add src1 type (ggml/708)

Co-authored-by: Judd <redacted>
Michael Klimenko [Sat, 27 Jan 2024 14:25:55 +0000 (15:25 +0100)]
Remove unused data and add fixes (#5154)

* Remove unused data and add fixes

* Add missing file

* Address review comments

* Replace the scope of vq allocation

Maximilian Winter [Sat, 27 Jan 2024 13:38:05 +0000 (14:38 +0100)]
server : add self-extend support (#5104)

* Ported self extension to server example

* Update server.cpp

* Fixed prompt caching without self extend

* Update server.cpp

* Added description to server readme.

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update README.md

* Changed descriptions

* server : formatting

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update server.cpp

* Update server.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>
0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]
Add OpenCL add kernel (#5151)

* Add OpenCL add kernel

* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
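
A loose sketch of the workaround described above (illustrative names, not the actual ggml-opencl symbols): MSVC limits the size of a single string literal (error C2026), so a large embedded kernel source can be split into separate strings and joined before being passed to clCreateProgramWithSource:

```cpp
// Sketch only: keep each embedded OpenCL source fragment in its own literal so no
// single literal exceeds MSVC's string length limit, then concatenate them when
// assembling the program source.
#include <string>

static const std::string mul_kernels = R"CL(
__kernel void mul_f32(__global const float * a, __global const float * b, __global float * dst) {
    const uint i = get_global_id(0);
    dst[i] = a[i] * b[i];
}
)CL";

static const std::string add_kernels = R"CL(
__kernel void add_f32(__global const float * a, __global const float * b, __global float * dst) {
    const uint i = get_global_id(0);
    dst[i] = a[i] + b[i];
}
)CL";

// Combined once when building the OpenCL program; each individual literal stays small.
std::string build_program_source() {
    return mul_kernels + add_kernels;
}
```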

Jared Van Bortel [Fri, 26 Jan 2024 20:34:06 +0000 (15:34 -0500)]
cmake : pass CPU architecture flags to nvcc (#5146)

slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]
cuda : fix tensor size calculation for non-split buffer (#5145)

slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]
ggml-alloc : add 10% margin to the buffer sizes (#5149)

snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]
ggml : update softmax n_task calculation (#5126)

updated the n_task calculation to use the maximum number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.

Georgi Gerganov [Fri, 26 Jan 2024 15:09:44 +0000 (17:09 +0200)]
scripts : move run-with-preset.py from root to scripts folder

Georgi Gerganov [Fri, 26 Jan 2024 12:48:15 +0000 (14:48 +0200)]
tests : gitignore test-c.o

Xuan Son Nguyen [Fri, 26 Jan 2024 12:42:20 +0000 (13:42 +0100)]
server : refactored the task processing logic (#5065)

* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"

crasm [Fri, 26 Jan 2024 12:18:00 +0000 (07:18 -0500)]
ci : add model tests + script wrapper (#4586)

* scripts : add lib.sh and lib_test.sh

* scripts : stub out new ci-run.sh script

* scripts : switch to PascalCase for functions

This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.

* scripts : add some fancy conversion from snake_case to PascalCase

* Add venv to ci/run.sh

* Revert scripts work

* scripts : add wrapper script for local use of ci/run.sh

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* Update test-model-load-cancel

* ci : add ctest_with_model for debug and release

ggml-ci

* Fix gg_get_model function

ggml-ci

* got stuck on CMake

* Add get_model.cpp to tests/CMakeLists.txt

ggml-ci

* Fix README.md output for ctest_with_model

ggml-ci

* workflows : use `-L main` for all ctest

ggml-ci

* Fixes

* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
  set

* scripts : update usage text for ci-run.sh

Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]
metal : remove unused `n_buffers` and `buffers` (#5129)

Riceball LEE [Fri, 26 Jan 2024 09:10:28 +0000 (17:10 +0800)]
gguf : fix "general.alignment" type in gguf_reader.py (#5136)

Georgi Gerganov [Fri, 26 Jan 2024 08:52:33 +0000 (10:52 +0200)]
readme : update hot topics

Kawrakow [Fri, 26 Jan 2024 07:14:39 +0000 (09:14 +0200)]
Another bucket sort (#5109)

* Initial bucket sort

* Bucket sort: slightly better version

* Bucket sort: another minor improvement

---------

Co-authored-by: Iwan Kawrakow <redacted>
XiaotaoChen [Thu, 25 Jan 2024 20:14:32 +0000 (04:14 +0800)]
readme : add MobileVLM 1.7B/3B to the supported models list (#5107)

Co-authored-by: Chenxiaotao03 <redacted>
l3utterfly [Thu, 25 Jan 2024 20:06:22 +0000 (05:06 +0900)]
llama : dynamic temperature sampling (#4972)

* implemented dynamic temperature sampling from koboldcpp

* removed trailing whitespace

* removed unused temp parameter in llama_sample_entropy

* exposed exponent_val in dynamic temp sampler

* added debug check for printf statements

* use nullptr in llama_sample_softmax call during llama_sample_entropy

this avoids counting the time taken stats twice

Co-authored-by: Georgi Gerganov <redacted>
* return earlier if there is only 1 candidate (i.e. max_entropy == 0)

* reformat 't' case in llama_sample_queue

Co-authored-by: Jared Van Bortel <redacted>
* check for one or zero candidates case in llama_sample_entropy

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Jared Van Bortel [Thu, 25 Jan 2024 19:51:24 +0000 (14:51 -0500)]
examples : make pydantic scripts pass mypy and support py3.8 (#5099)

Valentin Konovalov [Thu, 25 Jan 2024 17:05:51 +0000 (12:05 -0500)]
android : use release cmake build type by default (#5123)

Kawrakow [Thu, 25 Jan 2024 15:58:53 +0000 (17:58 +0200)]
Fix Q3_K_XS for MoE models (#5113)

Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]
metal : show compile log messages

Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]
cuda : fix 2-bit quants on amd hip (#5105)

* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants

Michael Hueschen [Mon, 22 Jan 2024 23:44:10 +0000 (16:44 -0700)]
nix-shell: use addToSearchPath

thx to @SomeoneSerge for the suggestion!

Michael Hueschen [Mon, 22 Jan 2024 10:17:05 +0000 (03:17 -0700)]
nix: add cc to devShell LD_LIBRARY_PATH

this fixes the error I encountered when trying to run the convert.py
script in a venv:

```
$ nix develop

[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
  File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
    from sentencepiece import SentencePieceProcessor
  File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
    from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```

however, I am not sure this is the cleanest way to address this linker
issue...

slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]
llama : pre-allocate input tensors in a separate buffer (#5100)

Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]
metal : disable support for MUL_MAT F32 x F16

Kawrakow [Tue, 23 Jan 2024 13:17:20 +0000 (15:17 +0200)]
Additional KL-divergence statistics (#5081)

* perplexity: add top-token probability

* perplexity: add additional KL-divergence statistics

* perplexity: a better organized KL-divergence statistics output

---------

Co-authored-by: Iwan Kawrakow <redacted>
Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]
CUDA: more info when no device code (#5088)

Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]
minor : clean-up some warnings and style (#5094)

* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment

Xuan Son Nguyen [Tue, 23 Jan 2024 07:11:39 +0000 (08:11 +0100)]
devops : add intel oneapi dockerfile (#5068)

Co-authored-by: Xuan Son Nguyen <redacted>
Michael Coppola [Tue, 23 Jan 2024 06:51:27 +0000 (01:51 -0500)]
llama.vim : added api key support (#5090)

Co-authored-by: Michael Coppola <redacted>
slaren [Mon, 22 Jan 2024 22:42:41 +0000 (23:42 +0100)]
llama : fix not enough space in buffer with Qwen (#5086)

Kawrakow [Mon, 22 Jan 2024 14:10:14 +0000 (16:10 +0200)]
KL-divergence (#5076)

* kl-divergence: be able to save all logits to a file

* Add ability to compute KL-divergence

---------

Co-authored-by: Iwan Kawrakow <redacted>
Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]
ggml : parallelize FP32 conversion when using BLAS (#5045)

* make the GGML_TASK_INIT phase able to run multithreaded

* multithreaded dequantize in mul_mat when using a BLAS library

* minor fixes

* update outdated comment
* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]
llava : MobileVLM support (#4954)

* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy related code, replace the two with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <redacted>
Someone Serge [Sun, 21 Jan 2024 03:41:37 +0000 (03:41 +0000)]
flake.nix: add a comment about flakes vs nix

Someone Serge [Sun, 21 Jan 2024 03:29:38 +0000 (03:29 +0000)]
nix: add a comment on the many nixpkgs-with-cuda instances

Someone Serge [Sun, 21 Jan 2024 03:15:13 +0000 (03:15 +0000)]
nix: add a comment about makeScope

Someone Serge [Sat, 13 Jan 2024 17:45:01 +0000 (17:45 +0000)]
nix: refactor the cleanSource rules

Someone Serge [Sat, 13 Jan 2024 17:38:32 +0000 (17:38 +0000)]
workflows: nix-ci: drop the redundant "paths" filter

Someone Serge [Sat, 13 Jan 2024 17:16:54 +0000 (17:16 +0000)]
workflows: nix-build-aarch64: rate limit

Someone Serge [Sat, 13 Jan 2024 17:10:19 +0000 (17:10 +0000)]
workflows: nix-ci: rebuild on flake.lock updates

Kawrakow [Mon, 22 Jan 2024 12:18:43 +0000 (14:18 +0200)]
imatrix : keep intermediate imatrix results (#5077)

Co-authored-by: Iwan Kawrakow <redacted>
compilade [Mon, 22 Jan 2024 11:21:52 +0000 (06:21 -0500)]
llama : support StableLM 2 1.6B (#5052)

* llama : support StableLM 2 1.6B

* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]

* convert : refactor Qwen's set_vocab to use it for StableLM 2 too

* nix : add tiktoken to llama-python-extra

* convert : use presence of tokenizer.json to determine StableLM tokenizer loader

It's a less arbitrary heuristic than the vocab size.

Daniel Bevenius [Mon, 22 Jan 2024 11:11:01 +0000 (12:11 +0100)]
finetune : print sample-start/include-sample-start (#5072)

This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.

The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.

Signed-off-by: Daniel Bevenius <redacted>
Kawrakow [Mon, 22 Jan 2024 10:43:33 +0000 (12:43 +0200)]
llama : add Q3_K_XS (#5060)

* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with an 800 MB smaller quantized model size.
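
A hedged sketch of the layer-mixing rule described in this message (an illustrative helper, not the actual quantization code in llama.cpp):

```cpp
// Sketch: pick the ggml quant type for an ffn_down tensor under the Q3_K_XS mix,
// giving the first 1/8 of the layers the higher-precision Q4_K and Q3_K otherwise.
#include "ggml.h"

static enum ggml_type q3_k_xs_ffn_down_type(int i_layer, int n_layer) {
    return (i_layer < n_layer/8) ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
}
```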

---------

Co-authored-by: Iwan Kawrakow <redacted>
bobqianic [Mon, 22 Jan 2024 08:55:05 +0000 (08:55 +0000)]
ci : fix Windows CI by updating Intel SDE version (#5053)

Shijie [Mon, 22 Jan 2024 07:33:19 +0000 (15:33 +0800)]
llama : add more qwen2 models (#5071)

iSma [Sun, 21 Jan 2024 21:37:13 +0000 (22:37 +0100)]
Revert LLAMA_NATIVE to OFF in flake.nix (#5066)

kuronekosaiko [Sun, 21 Jan 2024 16:28:14 +0000 (00:28 +0800)]
add safetensors support to convert-lora-to-ggml.py (#5062)

* add safetensors support to convert-lora-to-ggml.py

* Update convert-lora-to-ggml.py

Remove white space in line 69.

bobqianic [Sun, 21 Jan 2024 15:17:35 +0000 (15:17 +0000)]
add `#include <string>` to unicode.h (#5051)

Co-authored-by: Jared Van Bortel <redacted>
Kawrakow [Sun, 21 Jan 2024 12:42:44 +0000 (14:42 +0200)]
Add ability to evaluate multiple choice tasks (#5047)

* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr's available
inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Sun, 21 Jan 2024 06:01:20 +0000 (08:01 +0200)]
Slightly faster imatrix (#5050)

* imatrix: speedup by avoiding unnecessary allocations and copies

* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Sun, 21 Jan 2024 03:17:27 +0000 (05:17 +0200)]
flake.lock: Update (#5054)

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13)
  → 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19)

Co-authored-by: github-actions[bot] <redacted>
Jared Van Bortel [Sat, 20 Jan 2024 23:14:18 +0000 (18:14 -0500)]
convert : partially revert PR #4818 (#5041)

Jared Van Bortel [Sat, 20 Jan 2024 15:08:08 +0000 (10:08 -0500)]
perplexity : fix MSVC build after #5020 (#5043)

* perplexity : fix MSVC build after #5020

* try a different fix

slaren [Sat, 20 Jan 2024 15:05:49 +0000 (16:05 +0100)]
llama : run all KQV ops on the CPU with no KV offload (#5049)

ggml-ci

Herman Semenov [Sat, 20 Jan 2024 08:11:31 +0000 (08:11 +0000)]
cmake : add support for ccache (#5002)

* Added ccache support to speed up recompilation

* cmake : option to disable ccache

---------

Co-authored-by: Georgi Gerganov <redacted>
adel boussaken [Sat, 20 Jan 2024 08:05:43 +0000 (09:05 +0100)]
Add a dart/flutter binding to README.md (#4882)

Kylin [Sat, 20 Jan 2024 07:01:46 +0000 (15:01 +0800)]
cuda : fix compile error in jetson platform (#4975)

* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment

Uzo Nweke [Fri, 19 Jan 2024 18:20:50 +0000 (13:20 -0500)]
finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)

* Fix issue with alloc causing max_compute_size to be calculated

* remove ggml_allocr_free as suggested in issue #4791