git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Georgi Gerganov [Sat, 10 Feb 2024 07:30:36 +0000 (09:30 +0200)]
sync : ggml
Michael Podvitskiy [Fri, 9 Feb 2024 09:42:27 +0000 (10:42 +0100)]
ggml : add abort_callback for cpu backend (ggml/725)
* a way to use abort_callback with the cpu backend
* whisper update
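A rough sketch of how such a callback might be wired up from application code. The function and typedef names (ggml_backend_cpu_set_abort_callback, ggml_abort_callback) and the exact signature are assumptions about the ggml backend API, not taken from this commit:

    #include "ggml-backend.h"
    #include <atomic>

    // Assumed callback shape: return true to make the CPU backend stop computing the graph early.
    static std::atomic<bool> g_should_abort{false};

    static bool my_abort_callback(void * /*user_data*/) {
        return g_should_abort.load();  // could be set from a signal handler or another thread
    }

    void setup_cpu_abort(ggml_backend_t cpu_backend) {
        // Hypothetical usage; signature assumed: (backend, callback, user_data).
        ggml_backend_cpu_set_abort_callback(cpu_backend, my_abort_callback, /*user_data=*/nullptr);
    }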
Neuman Vong [Fri, 9 Feb 2024 18:30:19 +0000 (05:30 +1100)]
vulkan: Set limit for task concurrency (#5427)
A common default for the maximum number of open files is 256, which can
lead to `asyncio.gather(*tasks)` failing with Too many open files.
$ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
ggml_vulkan: Generating and compiling shaders to SPIR-V
Traceback (most recent call last):
File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
asyncio.run(main())
File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
await asyncio.gather(*tasks)
[...snip...]
OSError: [Errno 24] Too many open files
This change sets a reasonable concurrency limit for tasks (and therefore
open files), without significant impact on run time.
Daniel Bevenius [Fri, 9 Feb 2024 13:00:59 +0000 (14:00 +0100)]
llava : add requirements.txt and update README.md (#5428)
* llava: add requirements.txt and update README.md
This commit adds a `requirements.txt` file to the `examples/llava`
directory. This file contains the required Python packages to run the
scripts in the `examples/llava` directory.
The motivation for this is to make it easier for users to run the scripts in
`examples/llava`, and to avoid them running into missing-package issues when
the required packages are not installed on their system.
Signed-off-by: Daniel Bevenius <redacted>
* llava: fix typo in llava-surgery.py output
Signed-off-by: Daniel Bevenius <redacted>
---------
Signed-off-by: Daniel Bevenius <redacted>
Riley Stewart [Fri, 9 Feb 2024 10:49:49 +0000 (02:49 -0800)]
server : fix prompt caching for repeated prompts (#5420)
Paul Tsochantaris [Fri, 9 Feb 2024 10:48:06 +0000 (10:48 +0000)]
llama : do not cap thread count when MoE on CPU (#5419)
* Not capping thread count when MoE inference is running on CPU
* Whitespace
Marko Tasic [Fri, 9 Feb 2024 10:17:00 +0000 (11:17 +0100)]
readme : add JavaScript/Wasm repo (#5415)
Michael Podvitskiy [Fri, 9 Feb 2024 09:56:43 +0000 (10:56 +0100)]
ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404)
0cc4m [Fri, 9 Feb 2024 05:52:33 +0000 (06:52 +0100)]
Fix Vulkan crash on APUs with very little device memory (#5424)
* Fix Vulkan crash on APUs with very little device memory
* Fix debug output function names
Johannes Gäßler [Thu, 8 Feb 2024 20:56:40 +0000 (21:56 +0100)]
CUDA: more warps for mmvq on NVIDIA (#5394)
slaren [Thu, 8 Feb 2024 20:33:03 +0000 (21:33 +0100)]
llama : do not print "offloading layers" message in CPU-only builds (#5416)
Abhilash Majumder [Thu, 8 Feb 2024 17:09:10 +0000 (22:39 +0530)]
Fix f16_sycl cpy call from Arc (#5411)
* fix f16_sycl cpy call
* rm old logic
* add fp16 build CI
* use macro
* format fix
Daniel Bevenius [Thu, 8 Feb 2024 14:20:03 +0000 (15:20 +0100)]
llava : add missing .py, and fix paths in README.md (#5414)
This commit adds the missing .py extension to the convert-image-encoder-to-gguf
script. It also fixes the paths for the `model` and `mmproj` options in the
example llava-cli command.
Signed-off-by: Daniel Bevenius <redacted>
Johannes Gäßler [Thu, 8 Feb 2024 10:36:54 +0000 (11:36 +0100)]
fix trailing whitespace (#5407)
runfuture [Thu, 8 Feb 2024 10:36:19 +0000 (18:36 +0800)]
llama : fix MiniCPM (#5392)
* fix bug for norm_rms_eps missing
* to align with the same order as convert.py for model write
* fix: undo HF models permute tensor
* update for flake8 lint
Daniel Bevenius [Thu, 8 Feb 2024 08:58:19 +0000 (09:58 +0100)]
llava: fix typo/formatting in README.md (#5405)
This commit fixes a typo in the README.md file for the llava example
which is causing the formatting to look a little off:
Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
Signed-off-by: Daniel Bevenius <redacted>
Johannes Gäßler [Thu, 8 Feb 2024 08:46:30 +0000 (09:46 +0100)]
sampling: fix top_k <= 0 (#5388)
* sampling: fix top_k <= 0
* Update llama.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
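For context, a minimal sketch of the convention such a fix typically enforces, assuming a non-positive top_k is meant to disable the filter and keep the full vocabulary; the function below is illustrative, not the actual llama.cpp code:

    #include <cstdint>

    // Illustrative only: treat top_k <= 0 as "consider all candidates".
    int32_t effective_top_k(int32_t top_k, int32_t n_vocab) {
        return top_k <= 0 ? n_vocab : top_k;
    }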
Georgi Gerganov [Thu, 8 Feb 2024 07:46:47 +0000 (09:46 +0200)]
tests : .gitignore obj files
Michael Podvitskiy [Wed, 7 Feb 2024 21:39:23 +0000 (22:39 +0100)]
CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393)
Co-authored-by: Jared Van Bortel <redacted>
Ebey Abraham [Wed, 7 Feb 2024 21:11:30 +0000 (21:11 +0000)]
fix typo in readme (#5399)
Co-authored-by: Ebey Abraham <redacted>
Kamil Tomšík [Wed, 7 Feb 2024 18:44:52 +0000 (19:44 +0100)]
Add Ava to the list of llama.cpp UIs (#4362)
Johannes Gäßler [Wed, 7 Feb 2024 11:40:26 +0000 (12:40 +0100)]
CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386)
Neo Zhang Jianyu [Wed, 7 Feb 2024 10:16:55 +0000 (18:16 +0800)]
[SYCL] update install make by w64devkit (#5297)
Xiao-Yong Jin [Wed, 7 Feb 2024 08:17:25 +0000 (02:17 -0600)]
llava-cli : always tokenize special tokens (#5382)
* llava-cli: tokenize special tokens in prompt
* llava-cli: use the escape CLI argument, remove incomplete separate escaping process
0cc4m [Wed, 7 Feb 2024 06:54:50 +0000 (07:54 +0100)]
Basic Vulkan Multi-GPU implementation (#5321)
* Initial Vulkan multi-gpu implementation
Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only do device info print in the beginning and initialize one backend for cpu assist
Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <redacted>
Eve [Wed, 7 Feb 2024 06:21:30 +0000 (06:21 +0000)]
readme : modernize (#5379)
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
Ben Williams [Wed, 7 Feb 2024 06:16:48 +0000 (22:16 -0800)]
readme : update ui list (#5354)
runfuture [Wed, 7 Feb 2024 06:15:56 +0000 (14:15 +0800)]
llama : add MiniCPM support (#5346)
* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-gguf.py
* try to make tokenizer work
* fix bug for quantize minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm
Justin Parker [Wed, 7 Feb 2024 06:15:19 +0000 (01:15 -0500)]
server : update `/props` with "total_slots" value (#5373)
* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section
Sang-Kil Park [Wed, 7 Feb 2024 04:28:00 +0000 (13:28 +0900)]
convert : fix TypeError on GPT-2 vocab.json (#5288)
Alexey Parfenov [Tue, 6 Feb 2024 18:08:38 +0000 (18:08 +0000)]
server : remove model.json endpoint (#5371)
Johannes Gäßler [Tue, 6 Feb 2024 17:43:06 +0000 (18:43 +0100)]
CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370)
Kawrakow [Tue, 6 Feb 2024 17:00:16 +0000 (19:00 +0200)]
Update README.md (#5366)
Add some links to quantization related PRs
Kawrakow [Tue, 6 Feb 2024 15:28:02 +0000 (17:28 +0200)]
Slight quantization improvement for Q4_K and Q5_K (#5361)
* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <redacted>
BarfingLemurs [Tue, 6 Feb 2024 14:06:48 +0000 (09:06 -0500)]
readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362)
Johannes Gäßler [Tue, 6 Feb 2024 13:44:06 +0000 (14:44 +0100)]
CUDA: mul_mat_vec_q for batch sizes > 1 (#5351)
Justin Parker [Tue, 6 Feb 2024 09:20:59 +0000 (04:20 -0500)]
server : include total "num_slots" in props endpoint (#5349)
Michael Coppola [Tue, 6 Feb 2024 09:20:00 +0000 (04:20 -0500)]
server : add `dynatemp_range` and `dynatemp_exponent` (#5352)
* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <redacted>
Niall Coates [Tue, 6 Feb 2024 08:16:23 +0000 (08:16 +0000)]
server : various fixes for the prompt field in /completion (#5300)
server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert
Georgi Gerganov [Tue, 6 Feb 2024 05:47:22 +0000 (07:47 +0200)]
py : handle byte tokens in `get_token_type` (#5341)
* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg
Johannes Gäßler [Mon, 5 Feb 2024 18:33:00 +0000 (19:33 +0100)]
make: Use ccache for faster compilation (#5318)
* make: Use ccache for faster compilation
Johannes Gäßler [Mon, 5 Feb 2024 14:55:10 +0000 (15:55 +0100)]
README: updated introduction (#5343)
* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Mon, 5 Feb 2024 12:09:47 +0000 (14:09 +0200)]
ggml : make use of ggml-quants.h possible in C++ code (#5338)
* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <redacted>
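Background for the second bullet: static_assert is a keyword in C++, so a header shared with C must only provide a fallback macro when compiled as C. A hedged sketch of the usual guard pattern (not the literal ggml change):

    // In a header included from both C and C++ translation units:
    #if !defined(__cplusplus) && !defined(static_assert)
    // C11 spells it _Static_assert; C++ already has the keyword built in.
    #define static_assert(cond, msg) _Static_assert(cond, msg)
    #endif

    static_assert(sizeof(int) >= 4, "int is at least 32 bits");  // compiles as both C11 and C++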
Dr. Tom Murphy VII Ph.D [Mon, 5 Feb 2024 11:13:57 +0000 (06:13 -0500)]
ggml : avoid duplicating function calls using MIN/MAX macros (#5325)
* Avoid duplicating function calls when using MIN/MAX macros.
Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice.
By explicitly evaluating the expression first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
https://godbolt.org/z/Ee4KMrvKh
Code behaves exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <redacted>
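An illustration of the double-evaluation problem described above (a made-up example, not the ggml code itself):

    #include <cmath>
    #include <cstdio>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    // Any non-trivial call used as a macro argument...
    static double expensive(double x) {
        std::printf("expensive() called\n");
        return std::sqrt(x) * 42.0;
    }

    int main() {
        const double x = 2.0;

        // ...is textually duplicated by the macro, so it can run twice:
        const double y1 = MAX(0.0, expensive(x));   // prints "expensive() called" twice

        // Evaluating the expression once first avoids the duplicate call:
        const double t  = expensive(x);
        const double y2 = MAX(0.0, t);              // prints once

        return (y1 == y2) ? 0 : 1;
    }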
Kawrakow [Mon, 5 Feb 2024 10:32:27 +0000 (12:32 +0200)]
iq3_xxs: guards for the no-imatrix situation (#5334)
Co-authored-by: Iwan Kawrakow <redacted>
Guoteng [Mon, 5 Feb 2024 09:04:06 +0000 (17:04 +0800)]
py : fix internlm2-hf convert to gguf (#5305)
* py : fix internlm2-hf convert to gguf
* ggml-ci
Kawrakow [Mon, 5 Feb 2024 08:46:06 +0000 (10:46 +0200)]
iq2_xxs: tune quantization (#5320)
We get slightly better PPL, and we cut quantization time nearly in half.
The trick is to first quantize without forcing points onto the E8-lattice.
We can then use a narrower search range around the block scale that we
got that way.
Co-authored-by: Iwan Kawrakow <redacted>
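A heavily simplified sketch of that two-pass idea; the helper functions here are made-up placeholders (the real IQ2_XXS code snaps values to E8-lattice grid points, not plain rounding):

    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <vector>

    // Placeholder pass 1: pick a scale while ignoring the lattice constraint.
    static float best_scale_unconstrained(const std::vector<float> & block) {
        float amax = 0.0f;
        for (float v : block) amax = std::max(amax, std::fabs(v));
        return amax > 0.0f ? amax / 3.0f : 1.0f;   // made-up heuristic
    }

    // Placeholder constrained quantizer error: the real code uses grid points.
    static float error_on_grid(const std::vector<float> & block, float scale) {
        float err = 0.0f;
        for (float v : block) {
            const float q = std::round(v / scale);
            const float d = v - q * scale;
            err += d * d;
        }
        return err;
    }

    // Two-pass idea: get a rough block scale cheaply, then run the expensive
    // constrained search only over a narrow window around it.
    static float tune_block_scale(const std::vector<float> & block) {
        const float s0 = best_scale_unconstrained(block);
        float best_s = s0, best_err = std::numeric_limits<float>::max();
        for (int i = -4; i <= 4; ++i) {
            const float s   = s0 * (1.0f + 0.02f * i);
            const float err = error_on_grid(block, s);
            if (err < best_err) { best_err = err; best_s = s; }
        }
        return best_s;
    }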
Alexey Parfenov [Mon, 5 Feb 2024 08:10:22 +0000 (08:10 +0000)]
server : allow to get default generation settings for completion (#5307)
l3utterfly [Mon, 5 Feb 2024 08:00:47 +0000 (17:00 +0900)]
common : add dynamic temperature parameters to main example cli (#5295)
* added dynamic temp params in main
* added help text
Georgi Gerganov [Mon, 5 Feb 2024 07:48:03 +0000 (09:48 +0200)]
scripts : fix typos, cleanup (#5303)
Нияз Гарифзянов [Mon, 5 Feb 2024 07:43:57 +0000 (10:43 +0300)]
scripts : add non-interactive server-llm.sh (#5303)
* Update server-llm.sh
Add flag --non-interactive that allows running the script without asking for permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <redacted>
chiranko [Mon, 5 Feb 2024 07:41:38 +0000 (15:41 +0800)]
readme : add CodeShell models to the supported models list (#5330)
AidanBeltonS [Mon, 5 Feb 2024 07:08:24 +0000 (07:08 +0000)]
[SYCL] Fix cpy with dims of 3 (#5289)
* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <redacted>
github-actions[bot] [Sun, 4 Feb 2024 00:17:24 +0000 (00:17 +0000)]
flake.lock: Update
Flake lock file updates:
• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
Kawrakow [Sun, 4 Feb 2024 08:39:58 +0000 (10:39 +0200)]
Adding some imatrix tools (#5302)
* imatrix: adding --combine and --continue-from
* imatrix: be able to start from a specific chunk
---------
Co-authored-by: Iwan Kawrakow <redacted>
Welby Seely [Sun, 4 Feb 2024 04:18:51 +0000 (23:18 -0500)]
cmake : use set() for LLAMA_WIN_VER (#5298)
option() is specifically for booleans.
Fixes #5158
Johannes Gäßler [Sat, 3 Feb 2024 19:15:13 +0000 (20:15 +0100)]
make: add nvcc info print (#5310)
Johannes Gäßler [Sat, 3 Feb 2024 19:14:59 +0000 (20:14 +0100)]
make: fix nvcc optimization flags for host code (#5309)
Martin Schwaighofer [Sun, 28 Jan 2024 11:59:43 +0000 (12:59 +0100)]
add Vulkan support to Nix flake
0cc4m [Sat, 3 Feb 2024 17:15:00 +0000 (18:15 +0100)]
Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301)
* Fix Vulkan on Intel ARC
Optimize matmul for Intel ARC
Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
Michael Klimenko [Sat, 3 Feb 2024 11:23:37 +0000 (12:23 +0100)]
refactor : switch to emplace_back to avoid extra object (#5291)
Jared Van Bortel [Sat, 3 Feb 2024 11:22:06 +0000 (06:22 -0500)]
YaRN : store rope scaling type as int32_t in memory (#5285)
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
BADR [Sat, 3 Feb 2024 11:20:26 +0000 (12:20 +0100)]
readme : add tenere in the ui tools list (#5284)
AidanBeltonS [Sat, 3 Feb 2024 08:11:37 +0000 (08:11 +0000)]
Fix im2col with 32fp (#5286)
kalomaze [Fri, 2 Feb 2024 14:15:30 +0000 (08:15 -0600)]
perplexity : fix KL divergence calculations on Windows (#5273)
Georgi Gerganov [Fri, 2 Feb 2024 12:23:40 +0000 (14:23 +0200)]
scripts : parse wtype in server-llm.sh (#5167)
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
Mirror Azure [Fri, 2 Feb 2024 11:39:09 +0000 (14:39 +0300)]
py : add check for '.attn.masked_bias' layers to GPT2model (#5281)
AidanBeltonS [Fri, 2 Feb 2024 08:39:48 +0000 (08:39 +0000)]
Tidy ggml-sycl (#5261)
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments
---------
Co-authored-by: Abhilash Majumder <redacted>
Xuan Son Nguyen [Fri, 2 Feb 2024 07:56:31 +0000 (08:56 +0100)]
docker : add build for SYCL, Vulkan + update readme (#5228)
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
Meng, Hengyu [Fri, 2 Feb 2024 07:54:14 +0000 (15:54 +0800)]
[SYCL] get MAX_MEM_ALLOC from device property (#5270)
* get max alloc size from device prop
* fix macro typo
Neo Zhang Jianyu [Fri, 2 Feb 2024 07:53:27 +0000 (15:53 +0800)]
[SYCL] update guide of SYCL backend (#5254)
* update guide for make installation, memory, gguf model link, rm todo for windows build
* add vs install requirement
* update for gpu device check
* update help of llama-bench
* fix grammar issues
Ian Bull [Fri, 2 Feb 2024 07:20:13 +0000 (23:20 -0800)]
llama : fix memory leak in llama_batch_free (#5252)
The llama_batch_init allocates memory for a fixed number of tokens.
However, the llama_batch_free only frees memory for the number of
tokens that were added to the batch.
This change-set uses a null-terminated array for the batch seq_id, and
frees all the elements until the nullptr is reached. This change-set
also changes the name of the first parameter from `n_tokens` to
`n_tokens_alloc` to more clearly indicate that this value is the number
of tokens allocated to the batch, not the number of tokens in the batch.
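A simplified sketch of that pattern, freeing the per-token seq_id arrays until the null terminator is reached; the struct below is a stripped-down stand-in, not the real llama_batch:

    #include <cstdint>
    #include <cstdlib>

    struct toy_batch {
        int32_t *  token;
        int32_t ** seq_id;   // n_tokens_alloc + 1 entries; the extra one stays nullptr
    };

    toy_batch toy_batch_init(int32_t n_tokens_alloc) {
        toy_batch b{};
        b.token  = (int32_t *)  malloc(sizeof(int32_t) * n_tokens_alloc);
        // One extra, zero-initialized slot acts as the nullptr terminator.
        b.seq_id = (int32_t **) calloc(n_tokens_alloc + 1, sizeof(int32_t *));
        for (int32_t i = 0; i < n_tokens_alloc; ++i) {
            b.seq_id[i] = (int32_t *) malloc(sizeof(int32_t));
        }
        return b;
    }

    void toy_batch_free(toy_batch & b) {
        free(b.token);
        if (b.seq_id) {
            // Walk until the terminator so every allocated entry is freed,
            // regardless of how many tokens were actually added to the batch.
            for (int32_t i = 0; b.seq_id[i] != nullptr; ++i) {
                free(b.seq_id[i]);
            }
            free(b.seq_id);
        }
    }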
Neo Zhang Jianyu [Thu, 1 Feb 2024 19:48:53 +0000 (03:48 +0800)]
add --no-mmap in llama-bench (#5257)
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* ren no_mmap to mmap, show mmap when not default value in printer
* update guide for mmap
* mv position to reduce model reload
0cc4m [Thu, 1 Feb 2024 18:25:24 +0000 (19:25 +0100)]
Vulkan Phi Fix for AMD Proprietary Drivers (#5260)
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
* Fix another Vulkan CPY buffer size bug
slaren [Thu, 1 Feb 2024 17:30:17 +0000 (18:30 +0100)]
cuda : fix LLAMA_CUDA_F16 (#5262)
Ali Nehzat [Thu, 1 Feb 2024 15:18:53 +0000 (02:18 +1100)]
make : generate .a library for static linking (#5205)
Guoteng [Thu, 1 Feb 2024 09:19:51 +0000 (17:19 +0800)]
llama : support InternLM2 (#5184)
* support InternLM2 inference
* add add_space_prefix KV pair
Eve [Wed, 31 Jan 2024 19:21:55 +0000 (19:21 +0000)]
Fix broken Vulkan Cmake (properly) (#5230)
* build vulkan as object
* vulkan ci
Georgi Gerganov [Wed, 31 Jan 2024 16:47:10 +0000 (18:47 +0200)]
llama : reorder build_orion() at correct place (#5118)
Georgi Gerganov [Wed, 31 Jan 2024 15:30:17 +0000 (17:30 +0200)]
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240)
* llama : remove LLAMA_MAX_DEVICES from llama.h
ggml-ci
* Update llama.cpp
Co-authored-by: slaren <redacted>
* server : remove LLAMA_MAX_DEVICES
ggml-ci
* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
ggml-ci
* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
* readme : add deprecation notice
* readme : change deprecation notice to "remove" and fix url
* llama : remove gpu includes from llama.h
ggml-ci
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Wed, 31 Jan 2024 13:35:41 +0000 (15:35 +0200)]
metal : add im2col F32 dst support (#5132)
JidongZhang-THU [Wed, 31 Jan 2024 13:10:15 +0000 (21:10 +0800)]
llava : add MobileVLM support (#5132)
* New Feature:
1. Sum_Rows:
fix cuda kernel overflow
fix block shape error when nrows too big
2. Im2Col:
Support Batch in cuda
Support f32 to f32 both in cpu && cuda
3. DepthWiseConv:
Support by Im2Col && MulMat
4. Pool_2d:
Support avg pooling in cuda
5. HardSigmoid:
Imp in cuda
6. HardSwish:
Imp in cuda
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* Add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <redacted>
Neo Zhang Jianyu [Wed, 31 Jan 2024 13:04:46 +0000 (21:04 +0800)]
format license text, restore apache license by legal suggestion (#5233)
slaren [Wed, 31 Jan 2024 12:43:03 +0000 (13:43 +0100)]
ggml : limit n_threads to the max n_tasks (#5238)
0cc4m [Wed, 31 Jan 2024 10:44:19 +0000 (11:44 +0100)]
Vulkan Fixes (#5223)
* Fix Vulkan F16 models
* Fix Vulkan context shift crash
* Add Vulkan to common.cpp dump_non_result_info_yaml function
* Fix bug in Vulkan CPY op
* Fix small matrix multiplication errors in AMD GPUs on Windows or with amdvlk
Co-authored-by: Engininja2 <redacted>
---------
Co-authored-by: Engininja2 <redacted>
Yiming Cui [Wed, 31 Jan 2024 03:04:21 +0000 (11:04 +0800)]
Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231)
Neo Zhang Jianyu [Wed, 31 Jan 2024 02:38:07 +0000 (10:38 +0800)]
support SYCL backend windows build (#5208)
* support SYCL backend windows build
* add windows build in CI
* add for win build CI
* correct install oneMKL
* fix install issue
* fix ci
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix install cmd
* fix win build
* fix win build
* fix win build
* restore other CI part
* restore as base
* rm no new line
* fix no new line issue, add -j
* fix grammar issue
* allow to trigger manually, fix format issue
* fix format
* add newline
* fix format
* fix format
* fix format issue
---------
Co-authored-by: Abhilash Majumder <redacted>
Jared Van Bortel [Wed, 31 Jan 2024 00:04:37 +0000 (19:04 -0500)]
kompute : llama-bench support and ggml_cpu_has_kompute() (#5226)
Georgi Gerganov [Tue, 30 Jan 2024 19:19:26 +0000 (21:19 +0200)]
Revert "server : change deps.sh xxd files to string literals (#5221)"
This reverts commit 4003be0e5feef320f3707786f22722b73cff9356.
Georgi Gerganov [Tue, 30 Jan 2024 18:17:30 +0000 (20:17 +0200)]
server : fix context shift (#5195)
* server : fix context shift + simplify self-extend
* server : take system_tokens into account
* server : more n_past fixes
* server : revert n_past_se changes
JohnnyB [Tue, 30 Jan 2024 18:15:05 +0000 (12:15 -0600)]
server : change deps.sh xxd files to string literals (#5221)
* Changed ugly xxd to literals.
HPP files are much more readable as multiline literals rather than hex arrays.
* Dashes in literal variable names.
Replace . and - with _ in file names -> variable names.
* Comment on removing xxd.
XXD-> string literals
* XXD to string literals.
Replaced these unreadable headers with string literal versions using new deps.sh.
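For illustration, the difference between the two embedding styles (file and variable names below are made up):

    // Old style: output of `xxd -i` — a hex byte array that is hard to review.
    unsigned char page_html[] = { 0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e /* "<html>" ... */ };
    unsigned int  page_html_len = 6;

    // New style: the same content embedded as a readable multiline raw string literal.
    const char page_html_str[] = R"html(
    <html>
      <body>hello</body>
    </html>
    )html";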
Kawrakow [Tue, 30 Jan 2024 17:15:28 +0000 (19:15 +0200)]
ggml : fix IQ3_XXS on Metal (#5219)
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Tue, 30 Jan 2024 14:21:57 +0000 (16:21 +0200)]
sync : ggml (#0)
Georgi Gerganov [Mon, 29 Jan 2024 19:08:18 +0000 (21:08 +0200)]
gguf : fix comparison (ggml/715)
ggml-ci
John Balis [Mon, 29 Jan 2024 12:37:33 +0000 (06:37 -0600)]
`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)
* added cuda float16->float32 upcasting to ggml_cuda_cpy
* added ability to copy 4d tensors with the cuda backend
* added tests for float16->float32 upcast and 4d tensor cuda copies
* added 4d copy test for float32->float16 copy
* applied patch suggested by @iamlemec
* simplify cpy tests
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Mon, 29 Jan 2024 12:00:10 +0000 (14:00 +0200)]
gguf : add input validation, prevent integer overflows (ggml/709)
* gguf : add input validation, prevent integer overflows
ggml-ci
* gguf : fix switch default case
* gguf : sanitize info->n_dims and info->type
ggml-ci
* gguf : assert GGUF_TYPE_SIZE access
ggml-ci
* ggml : assert mallocs are successful
ggml-ci
* gguf : prevent integer overflow
* gguf : sanitize tensor info
ggml-ci
* gguf : stricter limit on the number of items
ggml-ci
Georgi Gerganov [Mon, 29 Jan 2024 11:29:46 +0000 (13:29 +0200)]
ci : fix yolo URLs + fix metal capture (ggml/712)
Jack Mousseau [Mon, 29 Jan 2024 09:22:23 +0000 (01:22 -0800)]
metal : add debug capture backend function (ggml/694)
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Tue, 30 Jan 2024 13:15:07 +0000 (15:15 +0200)]
Faster AVX2 dot product for IQ2_XS (#5187)
* iq2xs: faster AVX2 dot product
* iq2xs: small AVX2 improvement
* Speed up computing sign bits in AVX2 iq2_xs dot product
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Peter Reid <redacted>
Kawrakow [Tue, 30 Jan 2024 13:14:12 +0000 (15:14 +0200)]
SOTA 3-bit quants (#5196)
* iq3_xxs: quantize/dequantize
RMSE seems a bit high-ish at about half-way between q2_K and
q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717
This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
* iq3_xxs: CUDA dot product
We have
PP-512: 5891 t/s
TG-128: 143.9 t/s
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
Metal performance is decent, ARM_NEON is pathetic
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
This time the dot product accuracy did find an actual bug
in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops
---------
Co-authored-by: Iwan Kawrakow <redacted>