git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log


Neo Zhang Jianyu [Fri, 2 Feb 2024 07:53:27 +0000 (15:53 +0800)]

[SYCL] update guide of SYCL backend (#5254)

* update guide for make installation, memory, gguf model link, rm todo for windows build

* add vs install requirement

* update for gpu device check

* update help of llama-bench

* fix grammar issues

commit | commitdiff | tree

Ian Bull [Fri, 2 Feb 2024 07:20:13 +0000 (23:20 -0800)]

llama : fix memory leak in llama_batch_free (#5252)

The llama_batch_init allocates memory for a fixed number of tokens.
However, the llama_batch_free only frees memory for the number of
tokens that were added to the batch.

This change-set uses a null terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also changes the name of the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
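
The sentinel idea described above can be sketched as follows. This is only an illustration in Python of the control flow; the real code is C and frees heap pointers. `batch_init`, `batch_free`, and the `freed` log are hypothetical stand-ins, not the llama.cpp API, and `None` plays the role of the nullptr terminator.

```python
freed = []  # records which per-token allocations were released

def batch_init(n_tokens_alloc, n_seq_max):
    # One seq_id slot per *allocated* token, plus a trailing None sentinel
    # (standing in for the nullptr terminator).
    seq_id = [[0] * n_seq_max for _ in range(n_tokens_alloc)]
    seq_id.append(None)
    return {"n_tokens": 0, "seq_id": seq_id}

def batch_free(batch):
    # Walk until the sentinel, so every allocated slot is released,
    # no matter how many tokens were actually added (n_tokens).
    i = 0
    while batch["seq_id"][i] is not None:
        freed.append(i)
        i += 1
    batch["seq_id"] = None

batch = batch_init(n_tokens_alloc=8, n_seq_max=1)
batch["n_tokens"] = 3  # only three tokens were used
batch_free(batch)
```

Freeing by the sentinel rather than by `n_tokens` means the free path no longer depends on how full the batch happened to be.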

commit | commitdiff | tree

Neo Zhang Jianyu [Thu, 1 Feb 2024 19:48:53 +0000 (03:48 +0800)]

add --no-mmap in llama-bench (#5257)

* add --no-mmap, show sycl backend

* fix conflict

* fix code format, change print for --no-mmap

* rename no_mmap to mmap, show mmap in printer when it is not the default value

* update guide for mmap

* mv position to reduce model reload

commit | commitdiff | tree

0cc4m [Thu, 1 Feb 2024 18:25:24 +0000 (19:25 +0100)]

Vulkan Phi Fix for AMD Proprietary Drivers (#5260)

* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver

* Fix another Vulkan CPY buffer size bug
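
One common way to avoid calling tanh in a GELU approximation is the identity tanh(v) = 1 - 2 / (exp(2v) + 1). The sketch below shows that rewrite in Python; it is an assumption about the general technique, not a copy of the shader. The constant names and the `exp_inf` helper (which imitates GPU floats saturating to +inf instead of raising) are illustrative.

```python
import math

SQRT_2_OVER_PI = 0.7978845608028654
GELU_COEF_A = 0.044715

def exp_inf(v):
    # Imitate GPU float behaviour: overflow saturates to +inf instead of raising.
    try:
        return math.exp(v)
    except OverflowError:
        return math.inf

def gelu_tanh(x):
    # Reference form that calls tanh directly.
    v = SQRT_2_OVER_PI * x * (1.0 + GELU_COEF_A * x * x)
    return 0.5 * x * (1.0 + math.tanh(v))

def gelu_no_tanh(x):
    # Same value via tanh(v) = 1 - 2 / (exp(2v) + 1).
    # For large v, exp(2v) becomes inf and 2 / inf is 0, so the result stays finite.
    v = SQRT_2_OVER_PI * x * (1.0 + GELU_COEF_A * x * x)
    return 0.5 * x * (2.0 - 2.0 / (exp_inf(2.0 * v) + 1.0))
```

Both functions agree for ordinary inputs; the second one relies only on exp and a division, whose overflow behaviour is well defined.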

commit | commitdiff | tree

slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]

cuda : fix LLAMA_CUDA_F16 (#5262)

commit | commitdiff | tree

Ali Nehzat [Thu, 1 Feb 2024 15:18:53 +0000 (02:18 +1100)]

make : generate .a library for static linking (#5205)

commit | commitdiff | tree

Guoteng [Thu, 1 Feb 2024 09:19:51 +0000 (17:19 +0800)]

llama : support InternLM2 (#5184)

* support InternLM2 inference
* add add_space_prefix KV pair

commit | commitdiff | tree

Eve [Wed, 31 Jan 2024 19:21:55 +0000 (19:21 +0000)]

Fix broken Vulkan Cmake (properly) (#5230)

* build vulkan as object

* vulkan ci

commit | commitdiff | tree

Georgi Gerganov [Wed, 31 Jan 2024 16:47:10 +0000 (18:47 +0200)]

llama : reorder build_orion() at correct place (#5118)

commit | commitdiff | tree

Georgi Gerganov [Wed, 31 Jan 2024 15:30:17 +0000 (17:30 +0200)]

llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)

* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <redacted>

* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]

metal : add im2col F32 dst support (#5132)

commit | commitdiff | tree

JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]

llava : add MobileVLM support (#5132)

* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Support avg pooling in cuda
    5. HardSigmoid:
        Implement in cuda
    6. HardSwish:
        Implement in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col
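
The "count_include_pad" item above refers to a choice every padded average pool has to make: divide by the full kernel size, or only by the number of real (non-padding) elements in the window. A minimal 1D sketch of both conventions follows. The function name and signature are hypothetical, and the log does not say which convention the fix adopted.

```python
def avg_pool_1d(values, kernel, stride, pad, count_include_pad=True):
    # Average pooling over a 1D list with zero padding on both sides.
    n = len(values)
    out_len = (n + 2 * pad - kernel) // stride + 1
    out = []
    for o in range(out_len):
        start = o * stride - pad
        total = 0.0
        valid = 0  # number of window positions that fall inside the input
        for i in range(start, start + kernel):
            if 0 <= i < n:
                total += values[i]
                valid += 1
        # Either divide by the whole window, or only by the real elements.
        divisor = kernel if count_include_pad else valid
        out.append(total / divisor)
    return out
```

With `values=[2, 4, 6, 8]`, `kernel=2`, `stride=2`, `pad=1`, the two settings differ only in the border windows, which is where such bugs tend to hide.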

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Neo Zhang Jianyu [Wed, 31 Jan 2024 13:04:46 +0000 (21:04 +0800)]

format license text, restore apache license by legal suggestion (#5233)

commit | commitdiff | tree

slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]

ggml : limit n_threads to the max n_tasks (#5238)

commit | commitdiff | tree

0cc4m [Wed, 31 Jan 2024 10:44:19 +0000 (11:44 +0100)]

Vulkan Fixes (#5223)

* Fix Vulkan F16 models

* Fix Vulkan context shift crash

* Add Vulkan to common.cpp dump_non_result_info_yaml function

* Fix bug in Vulkan CPY op

* Fix small matrix multiplication errors in AMD GPUs on Windows or with amdvlk

Co-authored-by: Engininja2 <redacted>

---------

Co-authored-by: Engininja2 <redacted>

commit | commitdiff | tree

Yiming Cui [Wed, 31 Jan 2024 03:04:21 +0000 (11:04 +0800)]

Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231)

commit | commitdiff | tree

Neo Zhang Jianyu [Wed, 31 Jan 2024 02:38:07 +0000 (10:38 +0800)]

support SYCL backend windows build (#5208)

* support SYCL backend windows build

* add windows build in CI

* add for win build CI

* correct install oneMKL

* fix install issue

* fix ci

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix install cmd

* fix win build

* fix win build

* fix win build

* restore other CI part

* restore as base

* rm no new line

* fix no new line issue, add -j

* fix grammar issue

* allow to trigger manually, fix format issue

* fix format

* add newline

* fix format

* fix format

* fix format issues

---------

Co-authored-by: Abhilash Majumder <redacted>

commit | commitdiff | tree

Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]

kompute : llama-bench support and ggml_cpu_has_kompute() (#5226)

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 19:19:26 +0000 (21:19 +0200)]

Revert "server : change deps.sh xxd files to string literals (#5221)"

This reverts commit 4003be0e5feef320f3707786f22722b73cff9356.

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 18:17:30 +0000 (20:17 +0200)]

server : fix context shift (#5195)

* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : revert n_past_se changes

commit | commitdiff | tree

JohnnyB [Tue, 30 Jan 2024 18:15:05 +0000 (12:15 -0600)]

server : change deps.sh xxd files to string literals (#5221)

* Changed ugly xxd to literals.

HPP files are much more readable as multiline literals rather than hex arrays.

* Dashes in literal variable names.

Replace . and - with _ in file names -> variable names.

* Comment on removing xxd.

XXD-> string literals

* XXD to string literals.

Replaced these unreadable headers with string literal versions using new deps.sh.
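
The file-name to variable-name rule described above (replace `.` and `-` with `_`) can be sketched like this. The original is a shell script; this Python version only illustrates the mapping, and the function name and sample file names are assumptions made for the example.

```python
import re

def to_variable_name(file_name):
    # Replace every '.' and '-' with '_' so the result is a valid C++ identifier.
    return re.sub(r"[.-]", "_", file_name)
```

For example, `to_variable_name("index.html")` gives `index_html`.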

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]

ggml : fix IQ3_XXS on Metal (#5219)

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]

sync : ggml (#0)

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]

gguf : fix comparison (ggml/715)

ggml-ci

commit | commitdiff | tree

John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]

`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)

* added cuda float16->float32 upcasting to ggml_cuda_cpy

* added ability to copy 4d tensors with the cuda backend

* added tests for float16->float32 upcast and 4d tensor cuda copies

* added 4d copy test for float32->float16 copy

* applied patch suggested by @iamlemec

* simplify cpy tests

---------

Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]

gguf : add input validation, prevent integer overflows (ggml/709)

* gguf : add input validation, prevent integer overflows

ggml-ci

* gguf : fix switch default case

* gguf : sanitize info->n_dims and info->type

ggml-ci

* gguf : assert GGUF_TYPE_SIZE access

ggml-ci

* ggml : assert mallocs are successful

ggml-ci

* gguf : prevent integer overflow

* gguf : sanitize tensor info

ggml-ci

* gguf : stricter limit on the number of items

ggml-ci
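
A typical guard against the integer overflows mentioned above is to check a count against the maximum representable size before multiplying. The sketch below assumes a 64-bit `size_t`; `checked_array_bytes` is a hypothetical helper, not a function from gguf.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t

def checked_array_bytes(n_items, item_size):
    # Validate untrusted header fields before computing an allocation size.
    if n_items < 0 or item_size <= 0:
        raise ValueError("invalid count or item size")
    # Compare by division so the check itself cannot overflow in C.
    if n_items > SIZE_MAX // item_size:
        raise OverflowError("n_items * item_size overflows size_t")
    return n_items * item_size
```

Python integers do not overflow, so the exception here stands in for the wrap-around that the C code has to rule out before calling malloc.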

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]

ci : fix yolo URLs + fix metal capture (ggml/712)

commit | commitdiff | tree

Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]

metal : add debug capture backend function (ggml/694)

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]

Faster AVX2 dot product for IQ2_XS (#5187)

* iq2xs: faster AVX2 dot product

* iq2xs: small AVX2 improvement

* Speed up computing sign bits in AVX2 iq2_xs dot product

---------

Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>

commit | commitdiff | tree

Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]

SOTA 3-bit quants (#5196)

* iq3_xxs: quantize/dequantize

RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent, ARM_NEON is pathetic

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy did find an actual bug
in the AVX2 implementation.

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

0cc4m [Tue, 30 Jan 2024 12:59:30 +0000 (13:59 +0100)]

Vulkan Windows APU Memory Handling (#5199)

* Add basic UMA memory handling

Improve memory OOM behavior

Fix tests

* Fix UMA handling

* Also fix UMA handling for prealloc buffers

* Remove unnecessary warning message

* Remove outdated comment

commit | commitdiff | tree

Vladimir Malyutin [Tue, 30 Jan 2024 10:57:07 +0000 (17:57 +0700)]

quantize : fix typo (#5211)

Fix misprint in quantize help

commit | commitdiff | tree

divinity76 [Tue, 30 Jan 2024 09:18:02 +0000 (10:18 +0100)]

main : allow empty --prompt-cache file (#5176)

* allow empty --prompt-cache file

This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file APIs.

I switched from the C fopen API to the C++ filesystem API because, to the best of my knowledge, C has no portable way to get a file size above LONG_MAX (std::ftell() returns long). For C++ < 17 there is a fallback to std::ifstream.
(The project currently seems to target C++11; file_exists() and file_size() can be removed when we upgrade to C++17.)

* formatting

(requested in codereview)

* remove c++17, file_is_empty
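
The behaviour described above, treating an empty file the same as a missing cache, can be sketched with the Python tempfile API the message mentions. `prompt_cache_is_usable` is a hypothetical helper written for this example, not part of llama.cpp.

```python
import os
import tempfile

def prompt_cache_is_usable(path):
    # A missing or zero-length file means "no cached session yet".
    return os.path.exists(path) and os.path.getsize(path) > 0

# NamedTemporaryFile creates the (empty) file up front.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    cache_path = tmp.name
empty_usable = prompt_cache_is_usable(cache_path)

# Once something has been written, the cache counts as loadable.
with open(cache_path, "wb") as f:
    f.write(b"session-bytes")
filled_usable = prompt_cache_is_usable(cache_path)

os.remove(cache_path)
```

A caller would skip loading and simply write a fresh cache when the helper returns False.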

commit | commitdiff | tree

Romain Neutron [Tue, 30 Jan 2024 09:16:38 +0000 (10:16 +0100)]

readme : minor (#5204)

This is about tuning the code formatting of the README file

commit | commitdiff | tree

Georgi Gerganov [Tue, 30 Jan 2024 09:14:44 +0000 (11:14 +0200)]

readme : update hot topics

commit | commitdiff | tree

Wu Jian Ping [Tue, 30 Jan 2024 09:11:46 +0000 (17:11 +0800)]

server : improve README (#5209)

commit | commitdiff | tree

Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]

ggml alloc: Fix for null dereference on alloc failure (#5200)

* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated

* Freeing the allocated buffers rather than the pointer in ggml-alloc.c

* Fixed the fix of the fix

commit | commitdiff | tree

Jared Van Bortel [Mon, 29 Jan 2024 22:11:27 +0000 (17:11 -0500)]

kompute : fix fallback to CPU (#5201)

commit | commitdiff | tree

Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]

Nomic Vulkan backend (#4456)

Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>

commit | commitdiff | tree

divinity76 [Mon, 29 Jan 2024 14:45:41 +0000 (15:45 +0100)]

fix typo "RLIMIT_MLOCK" (#5175)

commit | commitdiff | tree

Wu Jian Ping [Mon, 29 Jan 2024 13:48:10 +0000 (21:48 +0800)]

server : embeddings compatibility for OpenAI (#5190)

commit | commitdiff | tree

Georgi Gerganov [Mon, 29 Jan 2024 13:35:54 +0000 (15:35 +0200)]

py : fix except (#5194)

ggml-ci

commit | commitdiff | tree

Sang-Kil Park [Mon, 29 Jan 2024 09:24:19 +0000 (18:24 +0900)]

py : improve BPE tokenizer support (#5189)

commit | commitdiff | tree

slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]

ggml : add max buffer sizes to opencl and metal backends (#5181)

commit | commitdiff | tree

Eve [Mon, 29 Jan 2024 08:04:47 +0000 (08:04 +0000)]

cmake : fix Vulkan build (#5182)

commit | commitdiff | tree

Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]

metal : free metal objects (#5161)

* Releasing MTLFunction references after Metal pipeline construction

* Keeping the `ggml_metal_kernel` structure

* Spacing fix

* Whitespace fix

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 17:48:05 +0000 (19:48 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]

ggml : minor type fix (int64_t -> size_t)

commit | commitdiff | tree

0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]

ggml : add Vulkan backend (#2059)

* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <redacted>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <redacted>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
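
The "split-k optimization for small matrix multiplication" item in this commit names a general technique: the shared K dimension is cut into chunks, each chunk produces a partial product, and the partials are summed. A plain-Python sketch of the idea follows. It says nothing about how the Vulkan shaders schedule the work, and `matmul_split_k` is a hypothetical name.

```python
def matmul_split_k(a, b, n_splits):
    # a is M x K, b is K x N, both given as lists of rows.
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + n_splits - 1) // n_splits  # ceil(k / n_splits)
    partials = []
    for s in range(n_splits):
        k0, k1 = s * chunk, min((s + 1) * chunk, k)
        # Each split only touches its own slice of the K dimension,
        # so the splits could run independently of each other.
        partials.append([
            [sum(a[i][kk] * b[kk][j] for kk in range(k0, k1)) for j in range(n)]
            for i in range(m)
        ])
    # Final reduction: add the partial results element by element.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]
```

When M and N are small, splitting K is a way to expose more parallel work than the output tile count alone would give.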

commit | commitdiff | tree

Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]

ggml : add unified SYCL backend for Intel GPUs (#2690)

* first update for migration

* update init_cublas

* add debug function, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macros & funcs from cuda to sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* gather dpct definitions in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add known issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export function print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, caused by the MacOS CI depending on external ggml instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changes in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <redacted>

* add copyright and MIT license declaration

* update the cmd example

---------

Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 28 Jan 2024 14:54:54 +0000 (16:54 +0200)]

flake.lock: Update (#5162)

commit | commitdiff | tree

Johannes Gäßler [Sun, 28 Jan 2024 08:59:49 +0000 (09:59 +0100)]

Apply min_p to unsorted tokens (#5115)
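
Min-p filtering only needs the largest probability, so it can run in a single pass without sorting the candidates first. The sketch below illustrates that idea under stated assumptions: `min_p_filter`, its `(token_id, probability)` input, and the sorted fallback for `min_keep` are illustrative choices, not the llama.cpp implementation.

```python
def min_p_filter(probs, min_p, min_keep=1):
    # probs: list of (token_id, probability) pairs in arbitrary order.
    max_prob = max(p for _, p in probs)
    threshold = min_p * max_prob
    # Single pass, no sort: keep every token at or above the threshold.
    kept = [(t, p) for t, p in probs if p >= threshold]
    if len(kept) < min_keep:
        # Assumed fallback: sort only when too few tokens survived.
        kept = sorted(probs, key=lambda tp: tp[1], reverse=True)[:min_keep]
    return kept
```

The surviving tokens keep their original order, which is why a later sampler must not assume the list is sorted.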

commit | commitdiff | tree

Johannes Gäßler [Sun, 28 Jan 2024 08:35:14 +0000 (09:35 +0100)]

Tests for min_p, sampling queue (#5147)

commit | commitdiff | tree

Marcus Dunn [Sun, 28 Jan 2024 08:30:44 +0000 (00:30 -0800)]

readme : add link to rust bindings (#5148)

* added link to another set of rust bindings with brief note on differences.

* fixed link name

commit | commitdiff | tree

sharpHL [Sun, 28 Jan 2024 08:00:30 +0000 (16:00 +0800)]

llama : add support for Orion-14B (#5118)

* add support for Orion-14B (https://huggingface.co/OrionStarAI/Orion-14B-Chat)

* flake8 support

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llama.cpp

Co-authored-by: slaren <redacted>

* Update llama.cpp

* Update llama.cpp

---------

Co-authored-by: lixiaopu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>

commit | commitdiff | tree

Kyle Mistele [Sun, 28 Jan 2024 07:55:31 +0000 (01:55 -0600)]

docker : add server-first container images (#5157)

* feat: add Dockerfiles for each platform that use ./server instead of ./main

* feat: update .github/workflows/docker.yml to build server-first docker containers

* doc: add information about running the server with Docker to README.md

* doc: add information about running with docker to the server README

* doc: update n-gpu-layers to show correct GPU usage

* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA

commit | commitdiff | tree

John [Sat, 27 Jan 2024 15:09:18 +0000 (16:09 +0100)]

llava : support for Yi-VL and fix for mobileVLM (#5093)

* Support for Yi-VL, templating fix for mobileVLM

* ws

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update llava-cli.cpp

* Update clip.cpp

bugfix for new conversions

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 27 Jan 2024 14:59:20 +0000 (16:59 +0200)]

sync : ggml

commit | commitdiff | tree

Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]

ggml : check ggml_add src1 type (ggml/708)

Co-authored-by: Judd <redacted>

commit | commitdiff | tree

Michael Klimenko [Sat, 27 Jan 2024 14:25:55 +0000 (15:25 +0100)]

Remove unused data and add fixes (#5154)

* Remove unused data and add fixes

* Add missing file

* Address review comments

* Replace the scope of vq allocation

commit | commitdiff | tree

Maximilian Winter [Sat, 27 Jan 2024 13:38:05 +0000 (14:38 +0100)]

server : add self-extend support (#5104)

* Ported self extension to server example

* Update server.cpp

* Fixed prompt caching without self extend

* Update server.cpp

* Added description to server readme.

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update README.md

* Changed descriptions

* server : formatting

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update server.cpp

* Update server.cpp

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]

Add OpenCL add kernel (#5151)

* Add OpenCL add kernel

* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results

commit | commitdiff | tree

Jared Van Bortel [Fri, 26 Jan 2024 20:34:06 +0000 (15:34 -0500)]

cmake : pass CPU architecture flags to nvcc (#5146)

commit | commitdiff | tree

slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]

cuda : fix tensor size calculation for non-split buffer (#5145)

commit | commitdiff | tree

slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]

ggml-alloc : add 10% margin to the buffer sizes (#5149)

commit | commitdiff | tree

snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]

ggml : update softmax n_task calculation (#5126)

updated the n_task calculation to use max number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.

commit | commitdiff | tree

Georgi Gerganov [Fri, 26 Jan 2024 15:09:44 +0000 (17:09 +0200)]

scripts : move run-with-preset.py from root to scripts folder

commit | commitdiff | tree

Georgi Gerganov [Fri, 26 Jan 2024 12:48:15 +0000 (14:48 +0200)]

tests : gitignore test-c.o

commit | commitdiff | tree

Xuan Son Nguyen [Fri, 26 Jan 2024 12:42:20 +0000 (13:42 +0100)]

server : refactored the task processing logic (#5065)

* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition caused by "request_completion"

commit | commitdiff | tree

crasm [Fri, 26 Jan 2024 12:18:00 +0000 (07:18 -0500)]

ci : add model tests + script wrapper (#4586)

* scripts : add lib.sh and lib_test.sh

* scripts : stub out new ci-run.sh script

* scripts : switch to PascalCase for functions

This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.

* scripts : add some fancy conversion from snake_case to PascalCase

* Add venv to ci/run.sh

* Revert scripts work

* scripts : add wrapper script for local use of ci/run.sh

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* Update test-model-load-cancel

* ci : add ctest_with_model for debug and release

ggml-ci

* Fix gg_get_model function

ggml-ci

* got stuck on CMake

* Add get_model.cpp to tests/CMakeLists.txt

ggml-ci

* Fix README.md output for ctest_with_model

ggml-ci

* workflows : use `-L main` for all ctest

ggml-ci

* Fixes

* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
set

* scripts : update usage text for ci-run.sh
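
The snake_case to PascalCase conversion mentioned in this commit (for shell function names, and later reverted) is a simple string transformation. A Python sketch of the mapping follows; `snake_to_pascal` is a hypothetical name and the original was a shell script.

```python
def snake_to_pascal(name):
    # Split on underscores, capitalize each non-empty part, and join.
    return "".join(part.capitalize() for part in name.split("_") if part)
```

For example, `snake_to_pascal("ctest_with_model")` gives `CtestWithModel`.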

commit | commitdiff | tree

Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]

metal : remove unused `n_buffers` and `buffers` (#5129)

commit | commitdiff | tree

Riceball LEE [Fri, 26 Jan 2024 09:10:28 +0000 (17:10 +0800)]

gguf : fix "general.alignment" type in gguf_reader.py (#5136)


Georgi Gerganov [Fri, 26 Jan 2024 08:52:33 +0000 (10:52 +0200)]

readme : update hot topics


Kawrakow [Fri, 26 Jan 2024 07:14:39 +0000 (09:14 +0200)]

Another bucket sort (#5109)

* Initial bucket sort

* Bucket sort: slightly better version

* Bucket sort: another minor improvement

---------

Co-authored-by: Iwan Kawrakow <redacted>


XiaotaoChen [Thu, 25 Jan 2024 20:14:32 +0000 (04:14 +0800)]

readme : add MobileVLM 1.7B/3B to the supported models list (#5107)

Co-authored-by: Chenxiaotao03 <redacted>


l3utterfly [Thu, 25 Jan 2024 20:06:22 +0000 (05:06 +0900)]

llama : dynamic temperature sampling (#4972)

* implemented dynamic temperature sampling from koboldcpp

* removed trailing whitespace

* removed unused temp parameter in llama_sample_entropy

* exposed exponent_val in dynamic temp sampler

* added debug check for printf statements

* use nullptr in llama_sample_softmax call during llama_sample_entropy

this avoids counting the time taken stats twice

Co-authored-by: Georgi Gerganov <redacted>

* return earlier if there is only 1 candidate (i.e. max_entropy == 0)

* reformat 't' case in llama_sample_queue

Co-authored-by: Jared Van Bortel <redacted>

* check for one or zero candidates case in llama_sample_entropy

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
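
The entropy-based temperature selection described in the commit above can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not the code from the commit: the function names, the `min_temp`/`max_temp` parameters, and the exact mapping from normalized entropy to temperature are assumptions made for illustration. Only the `exponent_val` name and the early return for one or zero candidates come from the commit message.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_temperature(logits, min_temp, max_temp, exponent_val=1.0):
    """Pick a temperature between min_temp and max_temp from the entropy.

    Returns None when there is at most one candidate, because the maximum
    possible entropy is then 0 and there is nothing to scale.
    """
    if len(logits) <= 1:
        return None
    max_entropy = math.log(len(logits))  # entropy of a uniform distribution
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Clamp to [0, 1] to absorb floating-point rounding.
    normalized = min(max(entropy / max_entropy, 0.0), 1.0)
    return min_temp + (max_temp - min_temp) * normalized ** exponent_val


def apply_dynamic_temperature(logits, min_temp, max_temp, exponent_val=1.0):
    """Scale logits by the entropy-derived temperature (assumes min_temp > 0)."""
    temp = entropy_temperature(logits, min_temp, max_temp, exponent_val)
    if temp is None:
        return list(logits)
    return [x / temp for x in logits]
```

In this sketch a flat (uncertain) distribution is sampled at a temperature near `max_temp`, while a sharply peaked one is sampled near `min_temp`; a larger `exponent_val` pushes more distributions toward the low end.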


Jared Van Bortel [Thu, 25 Jan 2024 19:51:24 +0000 (14:51 -0500)]

examples : make pydantic scripts pass mypy and support py3.8 (#5099)


Valentin Konovalov [Thu, 25 Jan 2024 17:05:51 +0000 (12:05 -0500)]

android : use release cmake build type by default (#5123)


Kawrakow [Thu, 25 Jan 2024 15:58:53 +0000 (17:58 +0200)]

Fix Q3_K_XS for MoE models (#5113)

Co-authored-by: Iwan Kawrakow <redacted>


Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]

metal : show compile log messages


Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]

cuda : fix 2-bit quants on amd hip (#5105)

* cuda : fix 2-bit quants on amd hip

* use __low2float intrinsic function for new quants


Michael Hueschen [Mon, 22 Jan 2024 23:44:10 +0000 (16:44 -0700)]

nix-shell: use addToSearchPath

thx to @SomeoneSerge for the suggestion!


Michael Hueschen [Mon, 22 Jan 2024 10:17:05 +0000 (03:17 -0700)]

nix: add cc to devShell LD_LIBRARY_PATH

this fixes the error I encountered when trying to run the convert.py
script in a venv:

```
$ nix develop

[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
  File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
    from sentencepiece import SentencePieceProcessor
  File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
    from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```

however, I am not sure this is the cleanest way to address this linker
issue...


slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]

llama : pre-allocate input tensors in a separate buffer (#5100)


Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]

metal : disable support for MUL_MAT F32 x F16


Kawrakow [Tue, 23 Jan 2024 13:17:20 +0000 (15:17 +0200)]

Additional KL-divergence statistics (#5081)

* perplexity: add top-token probability

* perplexity: add additional KL-divergence statistics

* perplexity: a better organized KL-divergence statistics output

---------

Co-authored-by: Iwan Kawrakow <redacted>


Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]

CUDA: more info when no device code (#5088)


Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]

minor : clean-up some warnings and style (#5094)

* minor : clean-up some warnings and style

ggml-ci

* ggml : add comment


Xuan Son Nguyen [Tue, 23 Jan 2024 07:11:39 +0000 (08:11 +0100)]

devops : add intel oneapi dockerfile (#5068)

Co-authored-by: Xuan Son Nguyen <redacted>


Michael Coppola [Tue, 23 Jan 2024 06:51:27 +0000 (01:51 -0500)]

llama.vim : added api key support (#5090)

Co-authored-by: Michael Coppola <redacted>


slaren [Mon, 22 Jan 2024 22:42:41 +0000 (23:42 +0100)]

llama : fix not enough space in buffer with Qwen (#5086)


Kawrakow [Mon, 22 Jan 2024 14:10:14 +0000 (16:10 +0200)]

KL-divergence (#5076)

* kl-divergence: be able to save all logits to a file

* Add ability to compute KL-divergence

---------

Co-authored-by: Iwan Kawrakow <redacted>
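
For readers unfamiliar with the metric, the comparison this commit enables can be sketched as below. This is an illustrative Python example, not the perplexity tool's code: the function names and the in-memory list-of-logits layout are assumptions, and the file format used to save logits is not modeled here. It computes KL(P || Q) per token position, where P comes from the saved base-model logits and Q from the model under test, then averages over positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, test_logits):
    """KL(P || Q) in nats, with P from base_logits and Q from test_logits."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl_divergence(base_rows, test_rows):
    """Average the per-position KL-divergence over all token positions."""
    values = [kl_divergence(b, t) for b, t in zip(base_rows, test_rows)]
    return sum(values) / len(values)
```

A value of 0 means the two models assign identical probabilities at every position; larger values indicate that the tested model (for example, a quantized one) diverges more from the base model.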


Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]

ggml : parallelize FP32 conversion when using BLAS (#5045)

* allow the GGML_TASK_INIT phase to run multithreaded

* multithreaded dequantize in mul_mat when using blas library

* minor fixes

* update outdated comment

* fix coding style

* simplify code

Co-authored-by: Georgi Gerganov <redacted>

---------

Co-authored-by: Georgi Gerganov <redacted>


XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]

llava : MobileVLM support (#4954)

* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy related code, replace the two with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <redacted>


Someone Serge [Sun, 21 Jan 2024 03:41:37 +0000 (03:41 +0000)]

flake.nix: add a comment about flakes vs nix


Someone Serge [Sun, 21 Jan 2024 03:29:38 +0000 (03:29 +0000)]

nix: add a comment on the many nixpkgs-with-cuda instances


Someone Serge [Sun, 21 Jan 2024 03:15:13 +0000 (03:15 +0000)]

nix: add a comment about makeScope


Someone Serge [Sat, 13 Jan 2024 17:45:01 +0000 (17:45 +0000)]

nix: refactor the cleanSource rules