git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Jared Van Bortel [Mon, 29 Jan 2024 22:11:27 +0000 (17:11 -0500)]
kompute : fix fallback to CPU (#5201)
Jared Van Bortel [Mon, 29 Jan 2024 20:50:50 +0000 (15:50 -0500)]
Nomic Vulkan backend (#4456)
Signed-off-by: Jared Van Bortel <redacted>
Co-authored-by: niansa <redacted>
Co-authored-by: Adam Treat <redacted>
Co-authored-by: Aaron Miller <redacted>
Co-authored-by: ToKiNoBug <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
divinity76 [Mon, 29 Jan 2024 14:45:41 +0000 (15:45 +0100)]
fix typo "RLIMIT_MLOCK" (#5175)
Wu Jian Ping [Mon, 29 Jan 2024 13:48:10 +0000 (21:48 +0800)]
server : embeddings compatibility for OpenAI (#5190)
Georgi Gerganov [Mon, 29 Jan 2024 13:35:54 +0000 (15:35 +0200)]
py : fix except (#5194)
ggml-ci
Sang-Kil Park [Mon, 29 Jan 2024 09:24:19 +0000 (18:24 +0900)]
py : improve BPE tokenizer support (#5189)
slaren [Mon, 29 Jan 2024 08:05:13 +0000 (09:05 +0100)]
ggml : add max buffer sizes to opencl and metal backends (#5181)
Eve [Mon, 29 Jan 2024 08:04:47 +0000 (08:04 +0000)]
cmake : fix Vulkan build (#5182)
Paul Tsochantaris [Sun, 28 Jan 2024 19:50:16 +0000 (19:50 +0000)]
metal : free metal objects (#5161)
* Releasing MTLFunction references after Metal pipeline construction
* Keeping the `ggml_metal_kernel` structure
* Spacing fix
* Whitespace fix
Georgi Gerganov [Sun, 28 Jan 2024 17:48:05 +0000 (19:48 +0200)]
sync : ggml
Georgi Gerganov [Sun, 28 Jan 2024 16:44:58 +0000 (18:44 +0200)]
ggml : minor type fix (int64_t -> size_t)
0cc4m [Sun, 28 Jan 2024 17:03:59 +0000 (18:03 +0100)]
ggml : add Vulkan backend (#2059)
* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication
Use semaphores for synchronization instead of fences or waitidle
Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
Co-authored-by: Concedo <redacted>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
Co-authored-by: Georgi Gerganov <redacted>
* llama : fix trailing whitespace
---------
Co-authored-by: Henri Vasserman <redacted>
Co-authored-by: Concedo <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Abhilash Majumder [Sun, 28 Jan 2024 15:56:23 +0000 (21:26 +0530)]
ggml : add unified SYCL backend for Intel GPUs (#2690)
* first update for migration
* update init_cublas
* add debug function, commit all helper code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix non-zero id issue
* support main device being non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error when selecting a non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summarize dpct definitions in one header file to replace the dpct folder
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm trailing space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add known issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export function print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on macOS, since the macOS CI depends on external ggml instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changes in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <redacted>
* add copyright and MIT license declaration
* update the cmd example
---------
Co-authored-by: jianyuzh <redacted>
Co-authored-by: luoyu-intel <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 28 Jan 2024 14:54:54 +0000 (16:54 +0200)]
flake.lock: Update (#5162)
Johannes Gäßler [Sun, 28 Jan 2024 08:59:49 +0000 (09:59 +0100)]
Apply min_p to unsorted tokens (#5115)
Johannes Gäßler [Sun, 28 Jan 2024 08:35:14 +0000 (09:35 +0100)]
Tests for min_p, sampling queue (#5147)
Marcus Dunn [Sun, 28 Jan 2024 08:30:44 +0000 (00:30 -0800)]
readme : add link to rust bindings (#5148)
* added link to another set of rust bindings with brief note on differences.
* fixed link name
sharpHL [Sun, 28 Jan 2024 08:00:30 +0000 (16:00 +0800)]
llama : add support for Orion-14B (#5118)
* add support for Orion-14B (https://huggingface.co/OrionStarAI/Orion-14B-Chat)
* flake8 support
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
Co-authored-by: slaren <redacted>
* Update llama.cpp
* Update llama.cpp
---------
Co-authored-by: lixiaopu <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
Kyle Mistele [Sun, 28 Jan 2024 07:55:31 +0000 (01:55 -0600)]
docker : add server-first container images (#5157)
* feat: add Dockerfiles for each platform that use ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
John [Sat, 27 Jan 2024 15:09:18 +0000 (16:09 +0100)]
llava : support for Yi-VL and fix for mobileVLM (#5093)
* Support for Yi-VL, templating fix for mobileVLM
* ws
* Update examples/llava/clip.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llava-cli.cpp
* Update clip.cpp
bugfix for new conversions
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sat, 27 Jan 2024 14:59:20 +0000 (16:59 +0200)]
sync : ggml
Judd [Fri, 26 Jan 2024 13:04:01 +0000 (21:04 +0800)]
ggml : check ggml_add src1 type (ggml/708)
Co-authored-by: Judd <redacted>
Michael Klimenko [Sat, 27 Jan 2024 14:25:55 +0000 (15:25 +0100)]
Remove unused data and add fixes (#5154)
* Remove unused data and add fixes
* Add missing file
* Address review comments
* Replace the scope of vq allocation
Maximilian Winter [Sat, 27 Jan 2024 13:38:05 +0000 (14:38 +0100)]
server : add self-extend support (#5104)
* Ported self extension to server example
* Update server.cpp
* Fixed prompt caching without self extend
* Update server.cpp
* Added description to server readme.
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update server.cpp
* Update README.md
* Changed descriptions
* server : formatting
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update server.cpp
* Update server.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
0cc4m [Fri, 26 Jan 2024 22:07:32 +0000 (23:07 +0100)]
Add OpenCL add kernel (#5151)
* Add OpenCL add kernel
* Put add kernel into different string to stay within MSVC string length limit, disable float16 support due to bad results
Jared Van Bortel [Fri, 26 Jan 2024 20:34:06 +0000 (15:34 -0500)]
cmake : pass CPU architecture flags to nvcc (#5146)
slaren [Fri, 26 Jan 2024 17:59:43 +0000 (18:59 +0100)]
cuda : fix tensor size calculation for non-split buffer (#5145)
slaren [Fri, 26 Jan 2024 17:18:26 +0000 (18:18 +0100)]
ggml-alloc : add 10% margin to the buffer sizes (#5149)
snadampal [Fri, 26 Jan 2024 17:17:59 +0000 (11:17 -0600)]
ggml : update softmax n_task calculation (#5126)
Updated the n_task calculation to use the maximum number of threads possible. This has improved prompt eval performance by around 5% for DOT kernels and around 10% for MMLA kernels on AWS Graviton3.
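A minimal illustration of the idea, not the actual ggml code (the struct and function names below are hypothetical): the task count planned for the soft_max op is simply tied to the available thread count, bounded by the amount of work.
```
// Illustrative only: pick the task count for a soft_max op so that all
// available threads can participate. Names are hypothetical, not ggml's.
#include <algorithm>
#include <cstdint>

struct op_info {
    int64_t n_rows;   // number of rows the op works on
};

static int plan_soft_max_tasks(const op_info & op, int n_threads) {
    // Previously a smaller fixed task count could leave threads idle;
    // using min(rows, n_threads) lets every thread pick up a slice of rows.
    return static_cast<int>(std::min<int64_t>(op.n_rows, n_threads));
}
```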
Georgi Gerganov [Fri, 26 Jan 2024 15:09:44 +0000 (17:09 +0200)]
scripts : move run-with-preset.py from root to scripts folder
Georgi Gerganov [Fri, 26 Jan 2024 12:48:15 +0000 (14:48 +0200)]
tests : gitignore test-c.o
Xuan Son Nguyen [Fri, 26 Jan 2024 12:42:20 +0000 (13:42 +0100)]
server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct
* server: add llama_server_response_event
* server: add comments
* server: move all mutexes away from server.cpp
* server: correct multitask response
* server: only add back deferred tasks when one slot is available
* server: fix a race condition caused by "request_completion"
crasm [Fri, 26 Jan 2024 12:18:00 +0000 (07:18 -0500)]
ci : add model tests + script wrapper (#4586)
* scripts : add lib.sh and lib_test.sh
* scripts : stub out new ci-run.sh script
* scripts : switch to PascalCase for functions
This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.
* scripts : add some fancy conversion from snake_case to PascalCase
* Add venv to ci/run.sh
* Revert scripts work
* scripts : add wrapper script for local use of ci/run.sh
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* Update test-model-load-cancel
* ci : add ctest_with_model for debug and release
ggml-ci
* Fix gg_get_model function
ggml-ci
* got stuck on CMake
* Add get_model.cpp to tests/CMakeLists.txt
ggml-ci
* Fix README.md output for ctest_with_model
ggml-ci
* workflows : use `-L main` for all ctest
ggml-ci
* Fixes
* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show a warning rather than failing if the model file variable is not set
* scripts : update usage text for ci-run.sh
Paul Tsochantaris [Fri, 26 Jan 2024 12:16:07 +0000 (12:16 +0000)]
metal : remove unused `n_buffers` and `buffers` (#5129)
Riceball LEE [Fri, 26 Jan 2024 09:10:28 +0000 (17:10 +0800)]
gguf : fix "general.alignment" type in gguf_reader.py (#5136)
Georgi Gerganov [Fri, 26 Jan 2024 08:52:33 +0000 (10:52 +0200)]
readme : update hot topics
Kawrakow [Fri, 26 Jan 2024 07:14:39 +0000 (09:14 +0200)]
Another bucket sort (#5109)
* Initial bucket sort
* Bucket sort: slightly better version
* Bucket sort: another minor improvement
---------
Co-authored-by: Iwan Kawrakow <redacted>
XiaotaoChen [Thu, 25 Jan 2024 20:14:32 +0000 (04:14 +0800)]
readme : add MobileVLM 1.7B/3B to the supported models list (#5107)
Co-authored-by: Chenxiaotao03 <redacted>
l3utterfly [Thu, 25 Jan 2024 20:06:22 +0000 (05:06 +0900)]
llama : dynamic temperature sampling (#4972)
* implemented dynamic temperature sampling from koboldcpp
* removed trailing whitespace
* removed unused temp parameter in llama_sample_entropy
* exposed exponent_val in dynamic temp sampler
* added debug check for printf statements
* use nullptr in llama_sample_softmax call during llama_sample_entropy
this avoids counting the time taken stats twice
Co-authored-by: Georgi Gerganov <redacted>
* return earlier if there is only 1 candidate (i.e. max_entropy == 0)
* reformat 't' case in llama_sample_queue
Co-authored-by: Jared Van Bortel <redacted>
* check for one or zero candidates case in llama_sample_entropy
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
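A minimal, self-contained sketch of entropy-based dynamic temperature as described in this entry; the real logic lives in llama_sample_entropy, and the function and parameter names below are illustrative assumptions rather than the library API.
```
// Illustrative sketch of entropy-based dynamic temperature.
// Assumption: probabilities are already softmax-normalized.
#include <cmath>
#include <vector>

static float dynamic_temperature(const std::vector<float> & probs,
                                 float min_temp, float max_temp, float exponent) {
    if (probs.size() <= 1) {
        return min_temp; // single candidate: entropy is 0, nothing to scale
    }
    double entropy = 0.0;
    for (float p : probs) {
        if (p > 0.0f) {
            entropy -= p * std::log(p);
        }
    }
    const double max_entropy = std::log((double) probs.size()); // uniform distribution
    const double norm = entropy / max_entropy;                   // in [0, 1]
    // Confident (low-entropy) distributions get a low temperature,
    // high-entropy ones approach max_temp; the exponent shapes the curve.
    return (float) (min_temp + (max_temp - min_temp) * std::pow(norm, (double) exponent));
}
```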
Jared Van Bortel [Thu, 25 Jan 2024 19:51:24 +0000 (14:51 -0500)]
examples : make pydantic scripts pass mypy and support py3.8 (#5099)
Valentin Konovalov [Thu, 25 Jan 2024 17:05:51 +0000 (12:05 -0500)]
android : use release cmake build type by default (#5123)
Kawrakow [Thu, 25 Jan 2024 15:58:53 +0000 (17:58 +0200)]
Fix Q3_K_XS for MoE models (#5113)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Thu, 25 Jan 2024 09:26:17 +0000 (11:26 +0200)]
metal : show compile log messages
Engininja2 [Wed, 24 Jan 2024 22:18:15 +0000 (16:18 -0600)]
cuda : fix 2-bit quants on amd hip (#5105)
* cuda : fix 2-bit quants on amd hip
* use __low2float intrinsic function for new quants
Michael Hueschen [Mon, 22 Jan 2024 23:44:10 +0000 (16:44 -0700)]
nix-shell: use addToSearchPath
thx to @SomeoneSerge for the suggestion!
Michael Hueschen [Mon, 22 Jan 2024 10:17:05 +0000 (03:17 -0700)]
nix: add cc to devShell LD_LIBRARY_PATH
this fixes the error I encountered when trying to run the convert.py
script in a venv:
```
$ nix develop
[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
from sentencepiece import SentencePieceProcessor
File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```
however, I am not sure this is the cleanest way to address this linker
issue...
slaren [Wed, 24 Jan 2024 11:48:14 +0000 (12:48 +0100)]
llama : pre-allocate input tensors in a separate buffer (#5100)
Georgi Gerganov [Tue, 23 Jan 2024 13:50:56 +0000 (15:50 +0200)]
metal : disable support for MUL_MAT F32 x F16
Kawrakow [Tue, 23 Jan 2024 13:17:20 +0000 (15:17 +0200)]
Additional KL-divergence statistics (#5081)
* perplexity: add top-token probability
* perplexity: add additional KL-divergence statistics
* perplexity: a better organized KL-divergence statistics output
---------
Co-authored-by: Iwan Kawrakow <redacted>
Johannes Gäßler [Tue, 23 Jan 2024 12:31:56 +0000 (13:31 +0100)]
CUDA: more info when no device code (#5088)
Georgi Gerganov [Tue, 23 Jan 2024 12:12:57 +0000 (14:12 +0200)]
minor : clean-up some warnings and style (#5094)
* minor : clean-up some warnings and style
ggml-ci
* ggml : add comment
Xuan Son Nguyen [Tue, 23 Jan 2024 07:11:39 +0000 (08:11 +0100)]
devops : add intel oneapi dockerfile (#5068)
Co-authored-by: Xuan Son Nguyen <redacted>
Michael Coppola [Tue, 23 Jan 2024 06:51:27 +0000 (01:51 -0500)]
llama.vim : added api key support (#5090)
Co-authored-by: Michael Coppola <redacted>
slaren [Mon, 22 Jan 2024 22:42:41 +0000 (23:42 +0100)]
llama : fix not enough space in buffer with Qwen (#5086)
Kawrakow [Mon, 22 Jan 2024 14:10:14 +0000 (16:10 +0200)]
KL-divergence (#5076)
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence
---------
Co-authored-by: Iwan Kawrakow <redacted>
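For reference, the quantity computed in this entry is the KL divergence between the base model's token distribution P and the quantized model's distribution Q, KL(P||Q) = Σ p_i (log p_i − log q_i), averaged over evaluated tokens. A minimal sketch from two saved logit vectors (function names are illustrative):
```
// Illustrative KL(P||Q) from two logit vectors for the same context.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

static std::vector<double> log_softmax(const std::vector<float> & logits) {
    double max_l = logits[0];
    for (float l : logits) max_l = std::max<double>(max_l, l);
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - max_l);
    const double log_z = std::log(sum) + max_l;
    std::vector<double> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) out[i] = logits[i] - log_z;
    return out;
}

static double kl_divergence(const std::vector<float> & base_logits,
                            const std::vector<float> & quant_logits) {
    const auto log_p = log_softmax(base_logits);
    const auto log_q = log_softmax(quant_logits);
    double kl = 0.0;
    for (size_t i = 0; i < log_p.size(); ++i) {
        kl += std::exp(log_p[i]) * (log_p[i] - log_q[i]); // p_i * (log p_i - log q_i)
    }
    return kl;
}
```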
Reinforce-II [Mon, 22 Jan 2024 13:15:08 +0000 (21:15 +0800)]
ggml : parallelize FP32 conversion when using BLAS (#5045)
* allow the GGML_TASK_INIT phase to run multithreaded
* multithreaded dequantize in mul_mat when using blas library
* minor fixes
* update outdated comment
* fix coding style
* simplify code
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
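The core idea in this entry is that the FP32 conversion (dequantization) performed before handing matrices to BLAS can be split across threads instead of running single-threaded. A rough sketch under the assumption of a simple interleaved row split (not the actual ggml threading code):
```
// Illustrative row-split dequantization across threads before a BLAS call.
#include <thread>
#include <vector>

// Hypothetical per-row dequantizer: converts one row of quantized data to FP32.
using dequant_row_fn = void (*)(const void * src_row, float * dst_row, int n_cols);

static void dequantize_parallel(const void * src, float * dst,
                                int n_rows, int n_cols, size_t src_row_size,
                                dequant_row_fn dequant_row, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=]() {
            // Each thread handles an interleaved subset of rows.
            for (int r = t; r < n_rows; r += n_threads) {
                const char * src_row = (const char *) src + (size_t) r * src_row_size;
                dequant_row(src_row, dst + (size_t) r * n_cols, n_cols);
            }
        });
    }
    for (auto & w : workers) w.join();
    // The resulting FP32 buffer can now be passed to a BLAS sgemm call.
}
```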
XiaotaoChen [Mon, 22 Jan 2024 13:09:35 +0000 (21:09 +0800)]
llava : MobileVLM support (#4954)
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy related code, replace them with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks
---------
Co-authored-by: Chenxiaotao03 <redacted>
Someone Serge [Sun, 21 Jan 2024 03:41:37 +0000 (03:41 +0000)]
flake.nix: add a comment about flakes vs nix
Someone Serge [Sun, 21 Jan 2024 03:29:38 +0000 (03:29 +0000)]
nix: add a comment on the many nixpkgs-with-cuda instances
Someone Serge [Sun, 21 Jan 2024 03:15:13 +0000 (03:15 +0000)]
nix: add a comment about makeScope
Someone Serge [Sat, 13 Jan 2024 17:45:01 +0000 (17:45 +0000)]
nix: refactor the cleanSource rules
Someone Serge [Sat, 13 Jan 2024 17:38:32 +0000 (17:38 +0000)]
workflows: nix-ci: drop the redundant "paths" filter
Someone Serge [Sat, 13 Jan 2024 17:16:54 +0000 (17:16 +0000)]
workflows: nix-build-aarch64: rate limit
Someone Serge [Sat, 13 Jan 2024 17:10:19 +0000 (17:10 +0000)]
workflows: nix-ci: rebuild on flake.lock updates
Kawrakow [Mon, 22 Jan 2024 12:18:43 +0000 (14:18 +0200)]
imatrix : keep intermediate imatrix results (#5077)
Co-authored-by: Iwan Kawrakow <redacted>
compilade [Mon, 22 Jan 2024 11:21:52 +0000 (06:21 -0500)]
llama : support StableLM 2 1.6B (#5052)
* llama : support StableLM 2 1.6B
* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]
* convert : refactor Qwen's set_vocab to use it for StableLM 2 too
* nix : add tiktoken to llama-python-extra
* convert : use presence of tokenizer.json to determine StableLM tokenizer loader
It's a less arbitrary heuristic than the vocab size.
Daniel Bevenius [Mon, 22 Jan 2024 11:11:01 +0000 (12:11 +0100)]
finetune : print sample-start/include-sample-start (#5072)
This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.
The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.
Signed-off-by: Daniel Bevenius <redacted>
Kawrakow [Mon, 22 Jan 2024 10:43:33 +0000 (12:43 +0200)]
llama : add Q3_K_XS (#5060)
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K
Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with an 800 MB smaller quantized model size.
---------
Co-authored-by: Iwan Kawrakow <redacted>
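The mixing rule described in this entry (keep the first 1/8 of the ffn_down layers at the higher-precision Q4_K and quantize the rest more aggressively) can be pictured with a small helper; the enum and function below are illustrative, not the actual llama.cpp quantization logic.
```
// Illustrative per-layer quant-type selection for a Q3_K_XS-style mix.
enum class quant_type { Q2_K, Q3_K, Q4_K };

// Hypothetical helper: higher precision for the earliest ffn_down layers.
static quant_type choose_ffn_down_type(int layer_index, int n_layers) {
    if (layer_index < n_layers / 8) {
        return quant_type::Q4_K; // first 1/8 of layers kept at higher precision
    }
    return quant_type::Q3_K;     // remaining layers quantized more aggressively
}
```
Combined with an importance matrix guiding the remaining layers, this is what brings the reported perplexity below the old Q2_K at a smaller file size.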
bobqianic [Mon, 22 Jan 2024 08:55:05 +0000 (08:55 +0000)]
ci : fix Windows CI by updating Intel SDE version (#5053)
Shijie [Mon, 22 Jan 2024 07:33:19 +0000 (15:33 +0800)]
llama : add more qwen2 models (#5071)
iSma [Sun, 21 Jan 2024 21:37:13 +0000 (22:37 +0100)]
Revert LLAMA_NATIVE to OFF in flake.nix (#5066)
kuronekosaiko [Sun, 21 Jan 2024 16:28:14 +0000 (00:28 +0800)]
add safetensors support to convert-lora-to-ggml.py (#5062)
* add safetensors support to convert-lora-to-ggml.py
* Update convert-lora-to-ggml.py
Remove white space in line 69.
bobqianic [Sun, 21 Jan 2024 15:17:35 +0000 (15:17 +0000)]
add `#include <string>` to unicode.h (#5051)
Co-authored-by: Jared Van Bortel <redacted>
Kawrakow [Sun, 21 Jan 2024 12:42:44 +0000 (14:42 +0200)]
Add ability to evaluate multiple choice tasks (#5047)
* TruthfulQA: 1st attempt, does not look like it is working
The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.
* TruthfulQA: works but the result is bad
I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.
* TruthfulQA: fix random sample
* TruthfulQA: prepare tasks in parallel for large test datasets
* Rename truthful_qa to multiple_choice
* Make MSVC happy
I had forgotten that MSVC does not make constexpr's available
inside a lambda.
---------
Co-authored-by: Iwan Kawrakow <redacted>
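Setting aside the dataset-format details above, multiple-choice scoring of this kind generally appends each candidate answer to the question, sums the model's log-probabilities over the answer tokens, and picks the highest-scoring answer. A minimal sketch with a hypothetical scoring callback:
```
// Illustrative multiple-choice scoring: pick the answer whose tokens the
// model assigns the highest total log-probability, given the question.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: returns the sum of log-probs of `answer` tokens
// when it follows `question` (e.g. computed from one batched evaluation).
using answer_logprob_fn =
    std::function<double(const std::string & question, const std::string & answer)>;

static size_t pick_answer(const std::string & question,
                          const std::vector<std::string> & answers,
                          const answer_logprob_fn & logprob) {
    size_t best = 0;
    double best_score = -1e300;
    for (size_t i = 0; i < answers.size(); ++i) {
        const double score = logprob(question, answers[i]);
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```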
Kawrakow [Sun, 21 Jan 2024 06:01:20 +0000 (08:01 +0200)]
Slightly faster imatrix (#5050)
* imatrix: speedup by avoiding unnecessary allocations and copies
* imatrix: add --no-ppl option to skip PPL calculations altogether
---------
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Sun, 21 Jan 2024 03:17:27 +0000 (05:17 +0200)]
flake.lock: Update (#5054)
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13)
→ 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19)
Co-authored-by: github-actions[bot] <redacted>
Jared Van Bortel [Sat, 20 Jan 2024 23:14:18 +0000 (18:14 -0500)]
convert : partially revert PR #4818 (#5041)
Jared Van Bortel [Sat, 20 Jan 2024 15:08:08 +0000 (10:08 -0500)]
perplexity : fix MSVC build after #5020 (#5043)
* perplexity : fix MSVC build after #5020
* try a different fix
slaren [Sat, 20 Jan 2024 15:05:49 +0000 (16:05 +0100)]
llama : run all KQV ops on the CPU with no KV offload (#5049)
ggml-ci
Herman Semenov [Sat, 20 Jan 2024 08:11:31 +0000 (08:11 +0000)]
cmake : add support for ccache (#5002)
* Added ccache support to speed up recompilation
* cmake : option to disable ccache
---------
Co-authored-by: Georgi Gerganov <redacted>
adel boussaken [Sat, 20 Jan 2024 08:05:43 +0000 (09:05 +0100)]
Add a dart/flutter binding to README.md (#4882)
Kylin [Sat, 20 Jan 2024 07:01:46 +0000 (15:01 +0800)]
cuda : fix compile error in jetson platform (#4975)
* cuda: fix compile error in jetson platform
* cuda: update comment in ggml-cuda.cu
* cuda: update ggml-cuda.cu comment
Uzo Nweke [Fri, 19 Jan 2024 18:20:50 +0000 (13:20 -0500)]
finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791
Georgi Gerganov [Fri, 19 Jan 2024 13:24:47 +0000 (15:24 +0200)]
imatrix : add README.md
Shijie [Fri, 19 Jan 2024 11:53:13 +0000 (19:53 +0800)]
llama : support upcoming Qwen2 (#5037)
Georgi Gerganov [Fri, 19 Jan 2024 11:52:22 +0000 (13:52 +0200)]
py : fix flake8 lint
Kawrakow [Fri, 19 Jan 2024 09:39:11 +0000 (11:39 +0200)]
winogrande: evaluate log-probs in parallel (#5036)
This is a relatively minor performance tweak resulting in
~10% speedup on my system.
Co-authored-by: Iwan Kawrakow <redacted>
chiranko [Fri, 19 Jan 2024 09:07:27 +0000 (17:07 +0800)]
llama : add CodeShell support (#5016)
* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Fri, 19 Jan 2024 09:02:39 +0000 (11:02 +0200)]
perplexity: avoid unnecessary allocations and logit copies (#5035)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Fri, 19 Jan 2024 08:45:06 +0000 (10:45 +0200)]
perplexity : faster Winogrande via batching (#5024)
* perplexity : faster Winogrande via batching
ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande
John [Thu, 18 Jan 2024 22:12:15 +0000 (23:12 +0100)]
llama : fix falcon arch for tied output embeddings (#4978)
* falcon arch fix for tied output embeddings
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 18 Jan 2024 21:36:07 +0000 (23:36 +0200)]
cmake : add ggml public headers (#5011)
Xuan Son Nguyen [Thu, 18 Jan 2024 20:33:05 +0000 (21:33 +0100)]
server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available
* remove unnecessary log
---------
Co-authored-by: Xuan Son Nguyen <redacted>
slaren [Thu, 18 Jan 2024 20:12:15 +0000 (21:12 +0100)]
llama : fix mlock with no-mmap with Metal (#5025)
Georgi Gerganov [Thu, 18 Jan 2024 19:45:51 +0000 (21:45 +0200)]
imatrix : fix assert for src0 non-cont check
Georgi Gerganov [Thu, 18 Jan 2024 18:49:00 +0000 (20:49 +0200)]
perplexity : fix winogrande N tasks option
Georgi Gerganov [Thu, 18 Jan 2024 18:45:39 +0000 (20:45 +0200)]
scripts : add get-winogrande.sh
David Sommers [Thu, 18 Jan 2024 17:20:59 +0000 (12:20 -0500)]
convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019)
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).
Without the fix, llama2 models can't be converted. The error is:
`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
Kawrakow [Thu, 18 Jan 2024 17:18:21 +0000 (19:18 +0200)]
HellaSwag: speed up by parallelizing log-prob evaluation (#5020)
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Thu, 18 Jan 2024 13:33:01 +0000 (15:33 +0200)]
perplexity : faster HellaSwag via batching (#5017)
* perplexity : faster HellaSwag
ggml-ci
* perplexity : clean-up
ggml-ci
* perplexity : no need for decode_helper
ggml-ci
* perplexity : add comments
* perplexity : option to specify max batched tasks via `n_parallel`
* perplexity : remove HellaSwag restriction for n_batch