git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
0cc4m [Thu, 1 Feb 2024 18:25:24 +0000 (19:25 +0100)]
Vulkan Phi Fix for AMD Proprietary Drivers (#5260)
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
* Fix another Vulkan CPY buffer size bug
slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]
cuda : fix LLAMA_CUDA_F16 (#5262)
Ali Nehzat [Thu, 1 Feb 2024 15:18:53 +0000 (02:18 +1100)]
make : generate .a library for static linking (#5205)
Guoteng [Thu, 1 Feb 2024 09:19:51 +0000 (17:19 +0800)]
llama : support InternLM2 (#5184)
* support InternLM2 inference
* add add_space_prefix KV pair
Eve [Wed, 31 Jan 2024 19:21:55 +0000 (19:21 +0000)]
Fix broken Vulkan Cmake (properly) (#5230)
* build vulkan as object
* vulkan ci
Georgi Gerganov [Wed, 31 Jan 2024 16:47:10 +0000 (18:47 +0200)]
llama : reorder build_orion() at correct place (#5118)
Georgi Gerganov [Wed, 31 Jan 2024 15:30:17 +0000 (17:30 +0200)]
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)
* llama : remove LLAMA_MAX_DEVICES from llama.h
ggml-ci
* Update llama.cpp
Co-authored-by: slaren <redacted>
* server : remove LLAMA_MAX_DEVICES
ggml-ci
* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
ggml-ci
* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
* readme : add deprecation notice
* readme : change deprecation notice to "remove" and fix url
* llama : remove gpu includes from llama.h
ggml-ci
---------
Co-authored-by: slaren <redacted>
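A minimal sketch of how caller code adapts to this removal, assuming the runtime queries llama_max_devices() and llama_supports_gpu_offload() that the PR adds to llama.h are the replacements (hedged; check llama.h for the exact signatures):
```cpp
#include <cstdio>
#include "llama.h"

// Assumption: the compile-time macros are replaced by these runtime queries.
static void report_device_support(void) {
    // was: #ifdef LLAMA_SUPPORTS_GPU_OFFLOAD ... LLAMA_MAX_DEVICES ...
    if (llama_supports_gpu_offload()) {
        printf("GPU offload supported, up to %zu devices\n", llama_max_devices());
    } else {
        printf("GPU offload not supported in this build\n");
    }
}
```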
Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]
metal : add im2col F32 dst support (#5132)
JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]
llava : add MobileVLM support (#5132)
* New Feature:
1. Sum_Rows:
fix cuda kernel overflow
fix block shape error when nrows too big
2. Im2Col:
Support Batch in cuda
Support f32 to f32 both in cpu && cuda
3. DepthWiseConv:
Support by Im2Col && MulMat
4. Pool_2d:
Support avg pooling in cuda
5. HardSigmoid:
Implemented in cuda
6. HardSwish:
Implemented in cuda (reference formulas for these two ops are sketched after this commit)
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* Add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <redacted>
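For reference, the HardSigmoid/HardSwish items above reduce to simple element-wise formulas. A scalar sketch assuming the common MobileNet-style definitions (the CUDA kernels in the PR implement the same math, details may differ):
```cpp
#include <algorithm>

// hardsigmoid(x) = clamp((x + 3) / 6, 0, 1)
// hardswish(x)   = x * hardsigmoid(x)
static float hardsigmoid(float x) { return std::min(1.0f, std::max(0.0f, (x + 3.0f) / 6.0f)); }
static float hardswish(float x)   { return x * hardsigmoid(x); }
```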
Neo Zhang Jianyu [Wed, 31 Jan 2024 13:04:46 +0000 (21:04 +0800)]
format license text, restore apache license by legal suggestion (#5233)
slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]
ggml : limit n_threads to the max n_tasks (#5238)
0cc4m [Wed, 31 Jan 2024 10:44:19 +0000 (11:44 +0100)]
Vulkan Fixes (#5223)
* Fix Vulkan F16 models
* Fix Vulkan context shift crash
* Add Vulkan to common.cpp dump_non_result_info_yaml function
* Fix bug in Vulkan CPY op
* Fix small matrix multiplication errors on AMD GPUs on Windows or with amdvlk
Co-authored-by: Engininja2 <redacted>
---------
Co-authored-by: Engininja2 <redacted>
Yiming Cui [Wed, 31 Jan 2024 03:04:21 +0000 (11:04 +0800)]
Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231)
Neo Zhang Jianyu [Wed, 31 Jan 2024 02:38:07 +0000 (10:38 +0800)]
support SYCL backend windows build (#5208)
* support SYCL backend windows build
* add windows build in CI
* add for win build CI
* correct install oneMKL
* fix install issue
* fix ci
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix win build
* fix win build
* fix win build
* restore other CI part
* restore as base
* rm no new line
* fix no new line issue, add -j
* fix grammar issue
* allow to trigger manually, fix format issue
* fix format
* add newline
* fix format
* fix format
* fix format issue
---------
Co-authored-by: Abhilash Majumder <redacted>
Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]
kompute : llama-bench support and ggml_cpu_has_kompute() (#5226)
Georgi Gerganov [Tue, 30 Jan 2024 19:19:26 +0000 (21:19 +0200)]
Revert "server : change deps.sh xxd files to string literals (#5221)"
This reverts commit 4003be0e5feef320f3707786f22722b73cff9356.
Georgi Gerganov [Tue, 30 Jan 2024 18:17:30 +0000 (20:17 +0200)]
server : fix context shift (#5195)
* server : fix context shift + simplify self-extend
* server : take system_tokens into account
* server : more n_past fixes
* server : revert n_past_se changes
JohnnyB [Tue, 30 Jan 2024 18:15:05 +0000 (12:15 -0600)]
server : change deps.sh xxd files to string literals (#5221)
* Changed ugly xxd to literals.
HPP files are much more readable as multiline literals rather than hex arrays.
* Dashes in literal variable names.
Replace . and - with _ in file names -> variable names.
* Comment on removing xxd.
XXD-> string literals
* XXD to string literals.
Replaced these unreadable headers with string literal versions using new deps.sh.
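A hypothetical illustration of the generated header style (the file and variable names below are made up, not the actual deps.sh output): instead of an xxd-produced hex byte array, the asset is emitted as a readable string literal, with `.` and `-` in the file name mapped to `_` for the variable name.
```cpp
// Illustrative only; the real generated headers may differ in naming and layout.
const char index_html[] = R"LITERAL(
<!DOCTYPE html>
<html><body>llama.cpp server UI goes here</body></html>
)LITERAL";
const unsigned int index_html_len = sizeof(index_html) - 1;
```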
Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]
ggml : fix IQ3_XXS on Metal (#5219)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]
sync : ggml (#0)
Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]
gguf : fix comparison (ggml/715)
ggml-ci
John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]
`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy
* added ability to copy 4d tensors with the cuda backend
* added tests for float16->float32 upcast and 4d tensor cuda copies
* added 4d copy test for float32->float16 copy
* applied patch suggested by @iamlemec
* simplify cpy tests
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]
gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows
ggml-ci
* gguf : fix switch default case
* gguf : sanitize info->n_dims and info->type
ggml-ci
* gguf : assert GGUF_TYPE_SIZE access
ggml-ci
* ggml : assert mallocs are successful
ggml-ci
* gguf : prevent integer overflow
* gguf : sanitize tensor info
ggml-ci
* gguf : stricter limit on the number of items
ggml-ci
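The general pattern behind these checks, as a hedged sketch (not the exact gguf.c code): validate the item count and the count-times-size product before allocating, so a corrupt or malicious GGUF header cannot overflow the size computation.
```cpp
#include <cstdint>
#include <cstdlib>

// Returns nullptr if the requested allocation would overflow or exceed a sane limit.
static void * checked_calloc(size_t n_items, size_t item_size, size_t max_items) {
    if (n_items > max_items) return nullptr;                                  // stricter limit on item count
    if (item_size != 0 && n_items > SIZE_MAX / item_size) return nullptr;     // n_items * item_size would overflow
    void * ptr = calloc(n_items > 0 ? n_items : 1, item_size > 0 ? item_size : 1);
    return ptr;                                                               // caller must still check for NULL
}
```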
Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]
ci : fix yolo URLs + fix metal capture (ggml/712)
Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]
metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]
Faster AVX2 dot product for IQ2_XS (#5187)
* iq2xs: faster AVX2 dot product
* iq2xs: small AVX2 improvement
* Speed up computing sign bits in AVX2 iq2_xs dot product
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>
Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]
SOTA 3-bit quants (#5196)
* iq3_xxs: quantize/dequantize
RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717
This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
* iq3_xxs: CUDA dot product
We have
PP-512: 5891 t/s
TG-128: 143.9 t/s
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
Metal performance is decent, ARM_NEON is pathetic
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
This time the dot product accuracy did find an actual bug
in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops
---------
Co-authored-by: Iwan Kawrakow <redacted>
0cc4m [Tue, 30 Jan 2024 12:59:30 +0000 (13:59 +0100)]
Vulkan Windows APU Memory Handling (#5199)
* Add basic UMA memory handling
Improve memory OOM behavior
Fix tests
* Fix UMA handling
* Also fix UMA handling for prealloc buffers
* Remove unnecessary warning message
* Remove outdated comment
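One common UMA heuristic, sketched here as an assumption rather than the exact check used in ggml-vulkan: an APU typically exposes a memory type that is both device-local and host-visible, so buffers can be mapped directly instead of going through a staging copy.
```cpp
#include <vulkan/vulkan.h>

// Hedged sketch: treat the device as UMA if any memory type is both
// DEVICE_LOCAL and HOST_VISIBLE.
static bool device_looks_uma(VkPhysicalDevice dev) {
    VkPhysicalDeviceMemoryProperties mem = {};
    vkGetPhysicalDeviceMemoryProperties(dev, &mem);
    for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
        const VkMemoryPropertyFlags f = mem.memoryTypes[i].propertyFlags;
        if ((f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) &&
            (f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)) {
            return true;
        }
    }
    return false;
}
```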
Vladimir Malyutin [Tue, 30 Jan 2024 10:57:07 +0000 (17:57 +0700)]
quantize : fix typo (#5211)
Fix misprint in quantize help
divinity76 [Tue, 30 Jan 2024 09:18:02 +0000 (10:18 +0100)]
main : allow empty --prompt-cache file (#5176)
* allow empty --prompt-cache file
This allows the use of std::tmpnam(), std::tmpfile(), Python's tempfile.NamedTemporaryFile(), and similar create-empty-file APIs for the user.
I switched from the C fopen API to the C++ filesystem API because, to the best of my knowledge, C has no portable way to query a file size above LONG_MAX (std::ftell() returns long); for C++ < 17 there is a fallback to std::ifstream (a sketch follows after this commit).
(The project currently seems to target C++11 - file_exists() and file_size() can be removed when we upgrade to C++17.)
* formatting
(requested in codereview)
* remove c++17, file_is_empty
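A C++11 sketch of helpers like the file_exists()/file_is_empty() mentioned above (the exact signatures in the final code may differ): an std::ifstream-based fallback avoids both the LONG_MAX limit of ftell() and the C++17 <filesystem> requirement.
```cpp
#include <fstream>
#include <string>

static bool file_exists(const std::string & path) {
    std::ifstream f(path.c_str());
    return f.good();
}

static bool file_is_empty(const std::string & path) {
    std::ifstream f(path.c_str(), std::ios::binary | std::ios::ate);
    return !f.good() || f.tellg() == std::streampos(0);   // seek-to-end position equals the size
}
```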
Romain Neutron [Tue, 30 Jan 2024 09:16:38 +0000 (10:16 +0100)]
readme : minor (#5204)
This is about tuning the code formatting of the README file
Georgi Gerganov [Tue, 30 Jan 2024 09:14:44 +0000 (11:14 +0200)]
readme : update hot topics
Wu Jian Ping [Tue, 30 Jan 2024 09:11:46 +0000 (17:11 +0800)]
server : improve README (#5209)
Paul Tsochantaris [Mon, 29 Jan 2024 22:19:29 +0000 (22:19 +0000)]
ggml alloc: Fix for null dereference on alloc failure (#5200)
* Fix for a null pointer dereference if a metal GGML buffer fails to be allocated
* Freeing the allocated buffers rather than the pointer in ggml-alloc.c
* Fixed the fix of the fix
Jared Van Bortel [Mon, 29 Jan 2024 22:11:27 +0000 (17:11 -0500)]
kompute : fix fallback to CPU (#5201)
Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]
Nomic Vulkan backend (#4456)
Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
divinity76 [Mon, 29 Jan 2024 14:45:41 +0000 (15:45 +0100)]
fix typo "RLIMIT_MLOCK" (#5175)
Wu Jian Ping [Mon, 29 Jan 2024 13:48:10 +0000 (21:48 +0800)]
server : embeddings compatibility for OpenAI (#5190)
Georgi Gerganov [Mon, 29 Jan 2024 13:35:54 +0000 (15:35 +0200)]
py : fix except (#5194)
ggml-ci
Sang-Kil Park [Mon, 29 Jan 2024 09:24:19 +0000 (18:24 +0900)]
py : improve BPE tokenizer support (#5189)
slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]
ggml : add max buffer sizes to opencl and metal backends (#5181)
Eve [Mon, 29 Jan 2024 08:04:47 +0000 (08:04 +0000)]
cmake : fix Vulkan build (#5182)
Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]
metal : free metal objects (#5161)
* Releasing MTLFunction references after Metal pipeline construction
* Keeping the `ggml_metal_kernel` structure
* Spacing fix
* Whitespace fix
Georgi Gerganov [Sun, 28 Jan 2024 17:48:05 +0000 (19:48 +0200)]
sync : ggml
Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]
ggml : minor type fix (int64_t -> size_t)
0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]
ggml : add Vulkan backend (#2059)
* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication (the idea is sketched after this commit)
Use semaphores for synchronization instead of fences or waitidle
Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
Co-authored-by: Concedo <redacted>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
Co-authored-by: Georgi Gerganov <redacted>
* llama : fix trailing whitespace
---------
Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
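A scalar illustration of the split-k idea mentioned in the list above, under the assumption that each K chunk maps to a separate dispatch whose partial results are then reduced; the real shaders split the work across workgroups, this only shows the arithmetic.
```cpp
#include <algorithm>
#include <vector>

// C = A (M x K) * B (K x N), with the K dimension split into split_k chunks.
static std::vector<float> matmul_split_k(const std::vector<float> & A,   // row-major M x K
                                         const std::vector<float> & B,   // row-major K x N
                                         int M, int N, int K, int split_k) {
    std::vector<float> C(M * N, 0.0f);
    const int k_chunk = (K + split_k - 1) / split_k;
    for (int s = 0; s < split_k; ++s) {                 // each chunk could run in parallel
        const int k0 = s * k_chunk;
        const int k1 = std::min(K, k0 + k_chunk);
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (int k = k0; k < k1; ++k) acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] += acc;                    // reduction of partial sums
            }
        }
    }
    return C;
}
```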
Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug function, commit all helper code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in
658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add known issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export function print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on macOS, because the macOS CI depends on external ggml instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math (a bit-pattern sketch follows after this commit).
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
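On the std::isinf item above: with fast-math the compiler may assume no infinities exist and fold std::isinf() to a constant, so a bit-pattern check is a common workaround. A minimal sketch of that technique (not necessarily the exact code in the PR):
```cpp
#include <cstdint>
#include <cstring>

// IEEE-754 single precision: infinity has an all-ones exponent and a zero mantissa.
static bool isinf_bits(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return (bits & 0x7fffffffu) == 0x7f800000u;
}
```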
Georgi Gerganov [Sun, 28 Jan 2024 14:54:54 +0000 (16:54 +0200)]
flake.lock: Update (#5162)
Johannes Gäßler [Sun, 28 Jan 2024 08:59:49 +0000 (09:59 +0100)]
Apply min_p to unsorted tokens (#5115)
Johannes Gäßler [Sun, 28 Jan 2024 08:35:14 +0000 (09:35 +0100)]
Tests for min_p, sampling queue (#5147)
Marcus Dunn [Sun, 28 Jan 2024 08:30:44 +0000 (00:30 -0800)]
readme : add link to rust bindings (#5148)
* added link to another set of rust bindings with brief note on differences.
* fixed link name
sharpHL [Sun, 28 Jan 2024 08:00:30 +0000 (16:00 +0800)]
llama : add support for Orion-14B (#5118)
* add support for Orion-14B (https://huggingface.co/OrionStarAI/Orion-14B-Chat)
* flake8 support
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: slaren <redacted>
* Update llama.cpp
* Update llama.cpp
---------
Co-authored-by: lixiaopu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
Kyle Mistele [Sun, 28 Jan 2024 07:55:31 +0000 (01:55 -0600)]
docker : add server-first container images (#5157)
* feat: add Dockerfiles for each platform that use ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
John [Sat, 27 Jan 2024 15:09:18 +0000 (16:09 +0100)]
llava : support for Yi-VL and fix for mobileVLM (#5093)
* Support for Yi-VL, templating fix for mobileVLM
* ws
* Update examples/llava/clip.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llava-cli.cpp
* Update clip.cpp
bugfix for new conversions
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sat, 27 Jan 2024 14:59:20 +0000 (16:59 +0200)]
sync : ggml
Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]
ggml : check ggml_add src1 type (ggml/708)
Co-authored-by: Judd <redacted>
Michael Klimenko [Sat, 27 Jan 2024 14:25:55 +0000 (15:25 +0100)]
Remove unused data and add fixes (#5154)
* Remove unused data and add fixes
* Add missing file
* Address review comments
* Replace the scope of vq allocation
Maximilian Winter [Sat, 27 Jan 2024 13:38:05 +0000 (14:38 +0100)]
server : add self-extend support (#5104)
* Ported self extension to server example
* Update server.cpp
* Fixed prompt caching without self extend
* Update server.cpp
* Added description to server readme.
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update README.md
* Changed descriptions
* server : formatting
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update server.cpp
* Update server.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]
Add OpenCL add kernel (#5151)
* Add OpenCL add kernel
* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
Jared Van Bortel [Fri, 26 Jan 2024 20:34:06 +0000 (15:34 -0500)]
cmake : pass CPU architecture flags to nvcc (#5146)
slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]
cuda : fix tensor size calculation for non-split buffer (#5145)
slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]
ggml-alloc : add 10% margin to the buffer sizes (#5149)
snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]
ggml : update softmax n_task calculation (#5126)
Updated the n_task calculation to use the maximum number of threads possible.
This improved prompt eval performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.
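A hedged sketch of the kind of change described above (not the exact ggml.c code): let soft_max use as many tasks as there are rows to process, capped only by the thread count, rather than a smaller fixed task count.
```cpp
#include <algorithm>
#include <cstdint>

// Assumed shape of the change: task count = min(available threads, rows).
static int soft_max_n_tasks(int n_threads, int64_t n_rows) {
    return (int) std::min<int64_t>(n_threads, n_rows);
}
```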
Georgi Gerganov [Fri, 26 Jan 2024 15:09:44 +0000 (17:09 +0200)]
scripts : move run-with-preset.py from root to scripts folder
Georgi Gerganov [Fri, 26 Jan 2024 12:48:15 +0000 (14:48 +0200)]
tests : gitignore test-c.o
Xuan Son Nguyen [Fri, 26 Jan 2024 12:42:20 +0000 (13:42 +0100)]
server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct
* server: add llama_server_response_event
* server: add comments
* server: move all mutexes away from server.cpp
* server: correct multitask response
* server: only add back deferred tasks when one slot is available
* server: fix a race condition caused by "request_completion"
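A hedged sketch of the pattern this refactor describes; the struct and member names below are illustrative, not the actual llama_server_queue API. One queue object owns the mutex, a worker loop pops tasks, and deferred tasks are re-queued only when a slot becomes available.
```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct task_queue_sketch {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;
    std::deque<std::function<void()>> deferred;

    void post(std::function<void()> task) {        // enqueue a runnable task
        std::lock_guard<std::mutex> lock(mtx);
        tasks.push_back(std::move(task));
        cv.notify_one();
    }
    void defer(std::function<void()> task) {       // park a task that has no free slot yet
        std::lock_guard<std::mutex> lock(mtx);
        deferred.push_back(std::move(task));
    }
    void on_slot_available() {                     // add back deferred tasks only when a slot frees up
        std::lock_guard<std::mutex> lock(mtx);
        for (auto & t : deferred) tasks.push_back(std::move(t));
        deferred.clear();
        cv.notify_one();
    }
    void run_one() {                               // worker loop body
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !tasks.empty(); });
        auto task = std::move(tasks.front());
        tasks.pop_front();
        lock.unlock();
        task();                                    // execute outside the lock
    }
};
```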
crasm [Fri, 26 Jan 2024 12:18:00 +0000 (07:18 -0500)]
ci : add model tests + script wrapper (#4586)
* scripts : add lib.sh and lib_test.sh
* scripts : stub out new ci-run.sh script
* scripts : switch to PascalCase for functions
This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.
* scripts : add some fancy conversion from snake_case to PascalCase
* Add venv to ci/run.sh
* Revert scripts work
* scripts : add wrapper script for local use of ci/run.sh
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* Update test-model-load-cancel
* ci : add ctest_with_model for debug and release
ggml-ci
* Fix gg_get_model function
ggml-ci
* got stuck on CMake
* Add get_model.cpp to tests/CMakeLists.txt
ggml-ci
* Fix README.md output for ctest_with_model
ggml-ci
* workflows : use `-L main` for all ctest
ggml-ci
* Fixes
* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
set
* scripts : update usage text for ci-run.sh
Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]
metal : remove unused `n_buffers` and `buffers` (#5129)
Riceball LEE [Fri, 26 Jan 2024 09:10:28 +0000 (17:10 +0800)]
gguf : fix "general.alignment" type in gguf_reader.py (#5136)
Georgi Gerganov [Fri, 26 Jan 2024 08:52:33 +0000 (10:52 +0200)]
readme : update hot topics
Kawrakow [Fri, 26 Jan 2024 07:14:39 +0000 (09:14 +0200)]
Another bucket sort (#5109)
* Initial bucket sort
* Bucket sort: slightly better version
* Bucket sort: another minor improvement
---------
Co-authored-by: Iwan Kawrakow <redacted>
XiaotaoChen [Thu, 25 Jan 2024 20:14:32 +0000 (04:14 +0800)]
readme : add MobileVLM 1.7B/3B to the supported models list (#5107)
Co-authored-by: Chenxiaotao03 <redacted>
l3utterfly [Thu, 25 Jan 2024 20:06:22 +0000 (05:06 +0900)]
llama : dynamic temperature sampling (#4972)
* implemented dynamic temperature sampling from koboldcpp
* removed trailing whitespace
* removed unused temp parameter in llama_sample_entropy
* exposed exponent_val in dynamic temp sampler
* added debug check for printf statements
* use nullptr in llama_sample_softmax call during llama_sample_entropy
this avoids counting the time taken stats twice
Co-authored-by: Georgi Gerganov <redacted>
* return earlier if there is only 1 candidate (i.e. max_entropy == 0)
* reformat 't' case in llama_sample_queue
Co-authored-by: Jared Van Bortel <redacted>
* check for one or zero candidates case in llama_sample_entropy
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
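A sketch of the general idea of entropy-scaled dynamic temperature as I understand it from the koboldcpp approach referenced above; llama_sample_entropy() may differ in details, and the parameter names here mirror the commit message rather than the final API.
```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Map the normalized entropy of the candidate distribution onto [temp_min, temp_max].
static float dynamic_temperature(const std::vector<float> & probs,
                                 float temp_min, float temp_max, float exponent_val) {
    if (probs.size() <= 1) return temp_min;                  // single candidate: max_entropy == 0, return early
    double entropy = 0.0;
    for (float p : probs) if (p > 0.0f) entropy -= p * std::log((double) p);
    const double max_entropy = std::log((double) probs.size());
    const double norm = std::min(1.0, std::max(0.0, entropy / max_entropy));
    return temp_min + (temp_max - temp_min) * (float) std::pow(norm, (double) exponent_val);
}
```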
Jared Van Bortel [Thu, 25 Jan 2024 19:51:24 +0000 (14:51 -0500)]
examples : make pydantic scripts pass mypy and support py3.8 (#5099)
Valentin Konovalov [Thu, 25 Jan 2024 17:05:51 +0000 (12:05 -0500)]
android : use release cmake build type by default (#5123)
Kawrakow [Thu, 25 Jan 2024 15:58:53 +0000 (17:58 +0200)]
Fix Q3_K_XS for MoE models (#5113)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]
metal : show compile log messages
Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]
cuda : fix 2-bit quants on amd hip (#5105)
* cuda : fix 2-bit quants on amd hip
* use __low2float intrinsic function for new quants
Michael Hueschen [Mon, 22 Jan 2024 23:44:10 +0000 (16:44 -0700)]
nix-shell: use addToSearchPath
thx to @SomeoneSerge for the suggestion!
Michael Hueschen [Mon, 22 Jan 2024 10:17:05 +0000 (03:17 -0700)]
nix: add cc to devShell LD_LIBRARY_PATH
this fixes the error I encountered when trying to run the convert.py
script in a venv:
```
$ nix develop
[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
from sentencepiece import SentencePieceProcessor
File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```
however, I am not sure this is the cleanest way to address this linker
issue...
slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]
llama : pre-allocate input tensors in a separate buffer (#5100)
Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]
metal : disable support for MUL_MAT F32 x F16
Kawrakow [Tue, 23 Jan 2024 13:17:20 +0000 (15:17 +0200)]
Additional KL-divergence statistics (#5081)
* perplexity: add top-token probability
* perplexity: add additional KL-divergence statistics
* perplexity: a better organized KL-divergence statistics output
---------
Co-authored-by: Iwan Kawrakow <redacted>
Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]
CUDA: more info when no device code (#5088)
Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]
minor : clean-up some warnings and style (#5094)
* minor : clean-up some warnings and style
ggml-ci
* ggml : add comment
Xuan Son Nguyen [Tue, 23 Jan 2024 07:11:39 +0000 (08:11 +0100)]
devops : add intel oneapi dockerfile (#5068)
Co-authored-by: Xuan Son Nguyen <redacted>
Michael Coppola [Tue, 23 Jan 2024 06:51:27 +0000 (01:51 -0500)]
llama.vim : added api key support (#5090)
Co-authored-by: Michael Coppola <redacted>
slaren [Mon, 22 Jan 2024 22:42:41 +0000 (23:42 +0100)]
llama : fix not enough space in buffer with Qwen (#5086)
Kawrakow [Mon, 22 Jan 2024 14:10:14 +0000 (16:10 +0200)]
KL-divergence (#5076)
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence
---------
Co-authored-by: Iwan Kawrakow <redacted>
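For reference, the statistic being computed over the saved logits is the textbook KL divergence, D_KL(P || Q) = sum_i p_i * log(p_i / q_i); the perplexity tool's exact bookkeeping is not shown here.
```cpp
#include <cmath>
#include <vector>

// KL divergence between two probability vectors of the same length.
static double kl_divergence(const std::vector<double> & p, const std::vector<double> & q) {
    double d = 0.0;
    for (size_t i = 0; i < p.size() && i < q.size(); ++i) {
        if (p[i] > 0.0 && q[i] > 0.0) {
            d += p[i] * std::log(p[i] / q[i]);
        }
    }
    return d;
}
```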
Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]
ggml : parallelize FP32 conversion when using BLAS (#5045)
* make the GGML_TASK_INIT phase able to run multithreaded
* multithreaded dequantize in mul_mat when using blas library
* minor fixes
* update outdated comment
* fix coding style
* simplify code
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
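An illustrative sketch (not the ggml.c implementation) of the row partitioning this change enables: the dequantize-to-FP32 step that precedes the BLAS sgemm is split across threads during the INIT phase, with each thread handling a contiguous slice of rows.
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

typedef void (*dequant_row_fn)(const void * src_row, float * dst_row, int64_t n);

static void dequantize_rows_parallel(int ith, int nth, int64_t n_rows, int64_t n_cols,
                                     const char * src, size_t src_row_size,
                                     float * dst, dequant_row_fn dequantize_row) {
    const int64_t rows_per_thread = (n_rows + nth - 1) / nth;
    const int64_t r0 = ith * rows_per_thread;
    const int64_t r1 = std::min(n_rows, r0 + rows_per_thread);
    for (int64_t r = r0; r < r1; ++r) {
        dequantize_row(src + r * src_row_size, dst + r * n_cols, n_cols);
    }
}
```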
XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]
llava : MobileVLM support (#4954)
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy related code, replace the two with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks
---------
Co-authored-by: Chenxiaotao03 <redacted>
Someone Serge [Sun, 21 Jan 2024 03:41:37 +0000 (03:41 +0000)]
flake.nix: add a comment about flakes vs nix
Someone Serge [Sun, 21 Jan 2024 03:29:38 +0000 (03:29 +0000)]
nix: add a comment on the many nixpkgs-with-cuda instances
Someone Serge [Sun, 21 Jan 2024 03:15:13 +0000 (03:15 +0000)]
nix: add a comment about makeScope
Someone Serge [Sat, 13 Jan 2024 17:45:01 +0000 (17:45 +0000)]
nix: refactor the cleanSource rules
Someone Serge [Sat, 13 Jan 2024 17:38:32 +0000 (17:38 +0000)]
workflows: nix-ci: drop the redundant "paths" filter
Someone Serge [Sat, 13 Jan 2024 17:16:54 +0000 (17:16 +0000)]
workflows: nix-build-aarch64: rate limit
Someone Serge [Sat, 13 Jan 2024 17:10:19 +0000 (17:10 +0000)]
workflows: nix-ci: rebuild on flake.lock updates
Kawrakow [Mon, 22 Jan 2024 12:18:43 +0000 (14:18 +0200)]
imatrix : keep intermediate imatrix results (#5077)
Co-authored-by: Iwan Kawrakow <redacted>
compilade [Mon, 22 Jan 2024 11:21:52 +0000 (06:21 -0500)]
llama : support StableLM 2 1.6B (#5052)
* llama : support StableLM 2 1.6B
* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]
* convert : refactor Qwen's set_vocab to use it for StableLM 2 too
* nix : add tiktoken to llama-python-extra
* convert : use presence of tokenizer.json to determine StableLM tokenizer loader
It's a less arbitrary heuristic than the vocab size.