git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Xuan Son Nguyen [Fri, 18 Oct 2024 21:18:01 +0000 (23:18 +0200)]
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one
* adapt all examples
* fix simple.cpp
* fix llama_bench
* fix
* fix context shifting
* free batch before return
* use common_batch_add, reuse llama_batch in loop
* null terminated seq_id list
* fix save-load-state example
* fix perplexity
* correct token pos in llama_batch_allocr
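The refactor above replaces the implicit all_pos_0/all_pos_1/all_seq_id handling with explicit per-token positions. As rough orientation (a minimal sketch using the common.h helpers named in the bullets, not the code from the PR), the encouraged pattern is to reuse one llama_batch across iterations, fill it with common_batch_add, and free it before returning:
```
#include <algorithm>
#include <vector>

#include "common.h"   // common_batch_clear, common_batch_add
#include "llama.h"    // llama_batch_init, llama_decode, llama_batch_free

// Decode `tokens` in chunks of up to n_batch, tracking positions explicitly
// instead of relying on the removed all_pos_0/all_pos_1/all_seq_id fields.
static bool decode_tokens(llama_context * ctx, const std::vector<llama_token> & tokens, int n_batch) {
    llama_batch batch = llama_batch_init(n_batch, /*embd =*/ 0, /*n_seq_max =*/ 1);

    llama_pos pos = 0;
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        common_batch_clear(batch); // reuse the same llama_batch every iteration

        const size_t n = std::min((size_t) n_batch, tokens.size() - i);
        for (size_t j = 0; j < n; ++j) {
            // explicit position and sequence id, logits only for the last token
            common_batch_add(batch, tokens[i + j], pos++, { 0 }, j == n - 1);
        }

        if (llama_decode(ctx, batch) != 0) {
            llama_batch_free(batch); // free the batch before returning
            return false;
        }
    }

    llama_batch_free(batch);
    return true;
}
```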
Radoslav Gerganov [Fri, 18 Oct 2024 11:33:58 +0000 (14:33 +0300)]
rpc : backend refactoring (#9912)
* rpc : refactor backend
Use structs for RPC request/response messages
* rpc : refactor server
Ouadie EL FAROUKI [Fri, 18 Oct 2024 05:46:16 +0000 (06:46 +0100)]
[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
* implemented missing SYCL event APIs
* sycl : Added device and backend reg interfaces
* Restructured ggml-sycl.cpp
Ma Mingfei [Fri, 18 Oct 2024 05:34:36 +0000 (13:34 +0800)]
add amx kernel for gemm (#8998)
add intel amx isa detection
add vnni kernel for gemv cases
add vnni and amx kernel support for block_q8_0
code cleanup
fix packing B issue
enable openmp
fine tune amx kernel
switch to aten parallel pattern
add error message for nested parallelism
code cleanup
add f16 support in ggml-amx
add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
update CMakeList
update README
fix some compilation warnings
fix compiler warning when amx is not enabled
minor change
ggml-ci
move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp
ggml-ci
update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16
ggml-ci
add amx as a ggml-backend
update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h
minor change
update CMakeLists.txt
minor change
apply weight prepacking in set_tensor method in ggml-backend
fix compile error
ggml-ci
minor change
ggml-ci
update CMakeLists.txt
ggml-ci
add march dependency
minor change
ggml-ci
change ggml_backend_buffer_is_host to return false for amx backend
ggml-ci
fix supports_op
use device reg for AMX backend
ggml-ci
minor change
ggml-ci
minor change
fix rebase
set .buffer_from_host_ptr to be false for AMX backend
Georgi Gerganov [Fri, 18 Oct 2024 04:32:19 +0000 (07:32 +0300)]
server : add n_indent parameter for line indentation requirement (#9929)
ggml-ci
Daniel Bevenius [Thu, 17 Oct 2024 23:41:51 +0000 (01:41 +0200)]
llama : rename batch_all to batch (#8881)
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
Georgi Gerganov [Thu, 17 Oct 2024 20:43:05 +0000 (23:43 +0300)]
readme : remove --memory-f32 references (#9925)
Georgi Gerganov [Thu, 17 Oct 2024 20:26:32 +0000 (23:26 +0300)]
llama : change warning to debug log
Georgi Gerganov [Thu, 17 Oct 2024 19:32:47 +0000 (22:32 +0300)]
llama : infill sampling handle very long tokens (#9924)
* llama : infill sampling handle very long tokens
ggml-ci
* cont : better indices
ggml-ci
Tim Wang [Thu, 17 Oct 2024 06:57:14 +0000 (17:57 +1100)]
readme : update bindings list (#9918)
Co-authored-by: Tim Wang <redacted>
Diego Devesa [Thu, 17 Oct 2024 00:46:58 +0000 (02:46 +0200)]
vulkan : add backend registry / device interfaces (#9721)
* vulkan : add backend registry / device interfaces
* llama : print devices used on model load
Gilad S. [Wed, 16 Oct 2024 23:34:22 +0000 (02:34 +0300)]
fix: allocating CPU buffer with size `0` (#9917)
Gilad S. [Wed, 16 Oct 2024 22:36:51 +0000 (01:36 +0300)]
fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages work
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment
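As a rough sketch of the allocation strategy those bullets describe (illustrative only, not the ggml implementation; TENSOR_ALIGNMENT below is a stand-in for ggml's alignment constant): use vm_allocate on Apple platforms so buffers are page-aligned and freed with a matching vm_deallocate, and fall back to posix_memalign elsewhere so plain free() keeps working:
```
#include <cstddef>
#include <cstdlib>

#if defined(__APPLE__)
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#endif

#define TENSOR_ALIGNMENT 32  // stand-in for ggml's alignment constant

static void * aligned_alloc_sketch(size_t size) {
#if defined(__APPLE__)
    // page-aligned allocation via the Mach VM API
    vm_address_t addr = 0;
    if (vm_allocate(mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE) != KERN_SUCCESS) {
        return nullptr;
    }
    return (void *) addr;
#else
    void * ptr = nullptr;
    if (posix_memalign(&ptr, TENSOR_ALIGNMENT, size) != 0) {
        return nullptr;
    }
    return ptr;
#endif
}

static void aligned_free_sketch(void * ptr, size_t size) {
#if defined(__APPLE__)
    if (ptr != nullptr) {
        vm_deallocate(mach_task_self(), (vm_address_t) ptr, size);
    }
#else
    (void) size;
    free(ptr);  // posix_memalign memory is released with plain free()
#endif
}
```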
Daniel Bevenius [Wed, 16 Oct 2024 17:34:28 +0000 (19:34 +0200)]
llama : suppress conversion from 'size_t' to 'int' (#9046)
* llama : suppress conversion from 'size_t' to 'int'
This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:
```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
```
This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.
The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.
* squash! llama : suppress conversion from 'size_t' to 'int'
Move cast into for loop.
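A hedged illustration of the change (hypothetical names, not the llama-vocab.cpp code): the loop bound comes from symbols.size(), a size_t, so casting it to int in the loop condition silences MSVC's C4267 warning:
```
#include <cstddef>
#include <vector>

struct llm_symbol_sketch { int prev; int next; };  // hypothetical stand-in

static size_t count_adjacent_pairs(const std::vector<llm_symbol_sketch> & symbols) {
    size_t n = 0;
    // the (int) cast on symbols.size() is what silences MSVC warning C4267;
    // safe as long as the symbol count stays below INT_MAX
    for (int i = 1; i < (int) symbols.size(); ++i) {
        ++n; // the real tokenizer would try to merge symbols[i - 1] and symbols[i] here
    }
    return n;
}
```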
Daniel Bevenius [Wed, 16 Oct 2024 17:24:05 +0000 (19:24 +0200)]
llava : fix typo in error message [no ci] (#9884)
Joe Eli McIlvain [Wed, 16 Oct 2024 16:03:24 +0000 (09:03 -0700)]
grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:
```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```
Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).
This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):
```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
Molly Sophia [Wed, 16 Oct 2024 10:10:21 +0000 (18:10 +0800)]
llama : add tensor name for "result_norm" (#9907)
Signed-off-by: Molly Sophia <redacted>
Alexey Parfenov [Wed, 16 Oct 2024 08:35:53 +0000 (08:35 +0000)]
server : fix the disappearance of the end of the text (#9867)
* server: fix the disappearance of the end of the text when streaming with stop strings
* simplify "send text" checks
Georgi Gerganov [Wed, 16 Oct 2024 08:28:14 +0000 (11:28 +0300)]
sync : ggml
Daniel Bevenius [Wed, 9 Oct 2024 14:40:35 +0000 (16:40 +0200)]
ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
This commit removes the buffer_id field from the leaf_alloc struct.
The motivation for this is that the field is only written to and never
read or used, as far as I can tell. Each tensor_alloc has a buffer_id field,
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.
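For orientation, a sketch of the two structs as the message describes them (field names follow the description above, not necessarily the exact ggml-alloc.c layout):
```
#include <cstddef>

// tensor_alloc keeps its buffer_id (it is actually read when assigning buffers)
struct tensor_alloc {
    int    buffer_id;
    size_t offset;
    size_t size_max;
};

struct leaf_alloc {
    // int buffer_id;        // removed: only written to, never read
    struct tensor_alloc leaf;
};
```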
leo-pony [Wed, 16 Oct 2024 00:51:46 +0000 (08:51 +0800)]
[CANN] Fix cann compilation error (#9891)
Fix CANN compilation error after merging llama.cpp support for dynamically loadable backends.
Georgi Gerganov [Tue, 15 Oct 2024 13:35:33 +0000 (16:35 +0300)]
llama : add infill sampler (#9896)
ggml-ci
Georgi Gerganov [Tue, 15 Oct 2024 13:28:55 +0000 (16:28 +0300)]
server : improve infill context reuse (#9894)
ggml-ci
MaggotHATE [Tue, 15 Oct 2024 10:54:55 +0000 (15:54 +0500)]
sampling : add XTC sampler (#9742)
* Initial XTC commit
Adds the XTC sampler; it is not activated by default, but recommended settings are provided as defaults.
* Cleanup
* Simplified chances calculation
To be more in line with the original implementation, the chance is calculated once at the beginning.
* First fixes by comments
Still need to look into sorting
* Fixed trailing backspaces
Fixed RNG to be reproducible
Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`
Moved from conditions to a simple check at the end.
* Fixed broken randomization
Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm
Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework (a simplified sketch follows this commit's notes)
1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable one above the threshold)
3. Shift all tokens to override the remaining penalizable ones
4. Penalize them and put them at the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
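The sketch referenced in the algorithm-rework notes above: a simplified, self-contained take on the XTC idea (exclude all but the least probable of the "top choices"), assuming candidates sorted by probability in descending order. It is not the llama.cpp implementation; names are hypothetical.
```
#include <cstddef>
#include <random>
#include <vector>

struct xtc_candidate { int id; float p; };  // hypothetical candidate type

// `cands` is assumed to be sorted by probability, highest first
static void apply_xtc(std::vector<xtc_candidate> & cands, float xtc_probability,
                      float xtc_threshold, size_t min_keep, std::mt19937 & rng) {
    // probability 0 disables the sampler; a threshold above 0.5 switches it off
    if (xtc_probability <= 0.0f || xtc_threshold > 0.5f || cands.size() < 2) {
        return;
    }
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    if (dist(rng) >= xtc_probability) {
        return; // the sampler only fires with probability xtc_probability
    }

    // count the "top choices": candidates at or above the threshold
    size_t n_above = 0;
    while (n_above < cands.size() && cands[n_above].p >= xtc_threshold) {
        ++n_above;
    }
    if (n_above < 2) {
        return; // need at least two top choices before excluding any
    }

    // remove all top choices except the least probable one (the last above
    // threshold), while keeping at least min_keep candidates overall
    size_t n_remove = n_above - 1;
    if (cands.size() - n_remove < min_keep) {
        n_remove = cands.size() > min_keep ? cands.size() - min_keep : 0;
    }
    cands.erase(cands.begin(), cands.begin() + n_remove);
}
```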
Georgi Gerganov [Tue, 15 Oct 2024 09:48:44 +0000 (12:48 +0300)]
server : update preact (#9895)
Michał Tuszyński [Tue, 15 Oct 2024 08:20:34 +0000 (10:20 +0200)]
readme : update bindings list (#9889)
VoidIsVoid [Mon, 14 Oct 2024 07:04:36 +0000 (15:04 +0800)]
server : handle "logprobs" field with false value (#9871)
Co-authored-by: Gimling <redacted>
agray3 [Mon, 14 Oct 2024 00:49:08 +0000 (01:49 +0100)]
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel
Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Georgi Gerganov [Sun, 13 Oct 2024 18:31:35 +0000 (21:31 +0300)]
server : accept extra_context for the infill endpoint (#9874)
* server : accept extra_context for the infill endpoint
ggml-ci
* server : update readme [no ci]
* server : use repo-level FIM pattern if possible
ggml-ci
Georgi Gerganov [Sun, 13 Oct 2024 15:52:48 +0000 (18:52 +0300)]
server : reuse cached context chunks (#9866)
ggml-ci
Georgi Gerganov [Sun, 13 Oct 2024 03:11:26 +0000 (06:11 +0300)]
flake.lock: Update (#9870)
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
→ 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)
Co-authored-by: github-actions[bot] <redacted>
Georgi Gerganov [Sat, 12 Oct 2024 13:14:27 +0000 (16:14 +0300)]
server : add option to time limit the generation phase (#9865)
ggml-ci
Georgi Gerganov [Sat, 12 Oct 2024 13:06:31 +0000 (16:06 +0300)]
server : remove self-extend features (#9860)
* server : remove self-extend
ggml-ci
* server : fix context limit check to use slot.n_past
ggml-ci
Georgi Gerganov [Sat, 12 Oct 2024 11:51:54 +0000 (14:51 +0300)]
server : remove legacy system_prompt feature (#9857)
* server : remove legacy system_prompt feature
ggml-ci
* readme : update [no ci]
* server : fix non-transformer logic + remove response from /props
Georgi Gerganov [Sat, 12 Oct 2024 05:21:51 +0000 (08:21 +0300)]
llama : improve infill support and special token detection (#9798)
* llama : improve infill support
ggml-ci
* llama : add more FIM token strings
ggml-ci
* server : update prompt on slot restore (#9800)
* gguf : deprecate old FIM token KVs
R0CKSTAR [Sat, 12 Oct 2024 05:09:53 +0000 (13:09 +0800)]
musa : update doc (#9856)
Signed-off-by: Xiaodong Ye <redacted>
Diego Devesa [Fri, 11 Oct 2024 13:34:45 +0000 (15:34 +0200)]
ggml : move more prints to the ggml log system (#9839)
* ggml : move more prints to the ggml log system
* show BLAS OpenMP warnings in all builds using debug print
Diego Devesa [Thu, 10 Oct 2024 20:57:42 +0000 (22:57 +0200)]
common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions
---------
Co-authored-by: Georgi Gerganov <redacted>
Diego Devesa [Thu, 10 Oct 2024 18:14:55 +0000 (20:14 +0200)]
rpc : add backend registry / device interfaces (#9812)
* rpc : add backend registry / device interfaces
* llama : add llama_supports_rpc API
* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
R0CKSTAR [Thu, 10 Oct 2024 18:10:37 +0000 (02:10 +0800)]
musa: add docker image support (#9685)
* mtgpu: add docker image support
Signed-off-by: Xiaodong Ye <redacted>
* mtgpu: enable docker workflow
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Diego Devesa [Thu, 10 Oct 2024 17:50:49 +0000 (19:50 +0200)]
examples : do not use common library in simple example (#9803)
* examples : do not use common library in simple example
* add command line parser, simplify code
Diego Devesa [Wed, 9 Oct 2024 16:49:52 +0000 (18:49 +0200)]
cmake : do not build common library by default when standalone (#9804)
Georgi Gerganov [Wed, 9 Oct 2024 14:00:18 +0000 (17:00 +0300)]
perplexity : fix integer overflow (#9783)
* perplexity : fix integer overflow
ggml-ci
* perplexity : keep n_vocab as int and make appropriate casts
ggml-ci
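An illustrative sketch of the overflow pattern being fixed (hypothetical function and variable names, not the perplexity.cpp code): n_vocab stays an int, and any product that can exceed INT_MAX is widened with an explicit cast.
```
#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical helper, not the perplexity.cpp code
static double sum_logits(const std::vector<float> & logits, int n_vocab, int n_ctx) {
    // without the (int64_t) cast, n_vocab * n_ctx is evaluated as a 32-bit int
    // and can overflow for large contexts/vocabularies
    const int64_t n_total = (int64_t) n_vocab * n_ctx;

    double sum = 0.0;
    for (int64_t i = 0; i < n_total; ++i) {
        sum += logits[(size_t) i];
    }
    return sum;
}
```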
Georgi Gerganov [Wed, 9 Oct 2024 07:55:42 +0000 (10:55 +0300)]
examples : remove llama.vim
An updated version will be added in #9787
Diego Devesa [Tue, 8 Oct 2024 12:21:43 +0000 (14:21 +0200)]
ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float
* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies
* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits
it's not really internal if everybody uses it
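A hedged sketch of the kind of check this enables (not the exact ggml-blas code): with ggml_get_type_traits() returning a pointer, a backend can test whether a source type provides a to_float conversion before claiming support for a matmul.
```
#include "ggml.h"

// can the BLAS path handle a matmul whose source tensor has this type?
static bool blas_supports_src_type(enum ggml_type type) {
    const struct ggml_type_traits * traits = ggml_get_type_traits(type);
    // F32 needs no conversion; anything else must provide a to_float routine
    return type == GGML_TYPE_F32 || traits->to_float != nullptr;
}
```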
Xuan Son Nguyen [Tue, 8 Oct 2024 11:27:04 +0000 (13:27 +0200)]
server : better security control for public deployments (#9776)
* server : more explicit endpoint access settings
* protect /props endpoint
* fix tests
* update server docs
* fix typo
* fix tests
standby24x7 [Tue, 8 Oct 2024 06:19:53 +0000 (15:19 +0900)]
scripts : fix spelling typo in messages and comments (#9782)
Signed-off-by: Masanari Iida <redacted>
Diego Devesa [Mon, 7 Oct 2024 19:55:08 +0000 (21:55 +0200)]
ggml : add backend registry / device interfaces to BLAS backend (#9752)
* ggml : add backend registry / device interfaces to BLAS backend
* fix mmap usage when using host buffers
Andrew Minh Nguyen [Mon, 7 Oct 2024 16:37:31 +0000 (09:37 -0700)]
Update building for Android (#9672)
* docs : clarify building Android on Termux
* docs : update building Android on Termux
* docs : add cross-compiling for Android
* cmake : link dl explicitly for Android
Georgi Gerganov [Mon, 7 Oct 2024 16:35:42 +0000 (19:35 +0300)]
flake.lock: Update (#9753)
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/bcef6817a8b2aa20a5a6dbb19b43e63c5bf8619a?narHash=sha256-HO4zgY0ekfwO5bX0QH/3kJ/h4KvUDFZg8YpkNwIbg1U%3D' (2024-09-12)
→ 'github:hercules-ci/flake-parts/3d04084d54bedc3d6b8b736c70ef449225c361b1?narHash=sha256-K5ZLCyfO/Zj9mPFldf3iwS6oZStJcU4tSpiXTMYaaL0%3D' (2024-10-01)
• Updated input 'flake-parts/nixpkgs-lib':
'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
→ 'https://github.com/NixOS/nixpkgs/archive/fb192fec7cc7a4c26d51779e9bab07ce6fa5597a.tar.gz?narHash=sha256-0xHYkMkeLVQAMa7gvkddbPqpxph%2BhDzdu1XdGPJR%2BOs%3D' (2024-10-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)
→ 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
Co-authored-by: github-actions[bot] <redacted>
Georgi Gerganov [Mon, 7 Oct 2024 15:27:51 +0000 (18:27 +0300)]
ggml : add metal backend registry / device (#9713)
* ggml : add metal backend registry / device
ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances
ggml-ci
* cont : alternative initialization of global objects
ggml-ci
* llama : adapt to backend changes
ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available
ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses
ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context
ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff
---------
Co-authored-by: slaren <redacted>
Paul Tsochantaris [Mon, 7 Oct 2024 12:26:31 +0000 (13:26 +0100)]
metal : single allocation of encode_async block (#9747)
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m
* Moving Block_release to the deallocation code
* Release encode block when re-setting encoding buffer count if needed
* Update ggml/src/ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 6 Oct 2024 11:15:27 +0000 (14:15 +0300)]
contrib : simplify + minor edits [no ci]
Georgi Gerganov [Sun, 6 Oct 2024 10:49:41 +0000 (13:49 +0300)]
readme : fix typo [no ci]
Georgi Gerganov [Sun, 6 Oct 2024 09:53:28 +0000 (12:53 +0300)]
sync : llama.cpp
SRHMorris [Sun, 6 Oct 2024 07:34:20 +0000 (08:34 +0100)]
vulkan : retry allocation with fallback flags (whisper/2451)
Co-authored-by: Samuel Morris <redacted>
Georgi Gerganov [Sat, 5 Oct 2024 12:55:04 +0000 (15:55 +0300)]
rerank : use [SEP] token instead of [BOS] (#9737)
* rerank : use [SEP] token instead of [BOS]
ggml-ci
* common : sanity check for non-NULL tokens
ggml-ci
* ci : adjust rank score interval
ggml-ci
* ci : add shebang to run.sh
ggml-ci
Georgi Gerganov [Sat, 5 Oct 2024 12:53:49 +0000 (15:53 +0300)]
sync : ggml
Georgi Gerganov [Sat, 5 Oct 2024 11:33:54 +0000 (14:33 +0300)]
metal : zero-init buffer contexts (whisper/0)
Viet-Anh NGUYEN (Andrew) [Fri, 4 Oct 2024 18:29:35 +0000 (01:29 +0700)]
Add Llama Assistant (#9744)
Georgi Gerganov [Fri, 4 Oct 2024 15:50:25 +0000 (18:50 +0300)]
sync : ggml
Daniel Bevenius [Fri, 4 Oct 2024 13:46:18 +0000 (15:46 +0200)]
ggml : fix typo in example usage ggml_gallocr_new (ggml/984)
Diego Devesa [Fri, 4 Oct 2024 06:41:40 +0000 (08:41 +0200)]
ggml : fixes after sync (ggml/983)
ggml : remove test-backend-buffer
ggml : fix CUDA build warnings
Xuan Son Nguyen [Fri, 4 Oct 2024 09:47:19 +0000 (11:47 +0200)]
ci : fine-grant permission (#9710)
Daniel Kleine [Fri, 4 Oct 2024 08:54:44 +0000 (10:54 +0200)]
Fixed RNG seed docs (#9723)
* Update README.md
fixed RNG seed info
* changed print format to unsigned
Georgi Gerganov [Thu, 3 Oct 2024 18:18:19 +0000 (21:18 +0300)]
metal : remove abort (skip) (ggml/0)
Georgi Gerganov [Thu, 3 Oct 2024 18:17:49 +0000 (21:17 +0300)]
sync : ggml
Johannes Gäßler [Thu, 3 Oct 2024 15:29:59 +0000 (17:29 +0200)]
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
Johannes Gäßler [Wed, 2 Oct 2024 13:32:39 +0000 (15:32 +0200)]
ggml: refactor cross entropy loss CPU impl. (ggml/976)
Jack Mousseau [Thu, 3 Oct 2024 18:01:46 +0000 (11:01 -0700)]
metal : fix compute pass descriptor autorelease crash (#9718)
Diego Devesa [Thu, 3 Oct 2024 15:39:18 +0000 (17:39 +0200)]
ggml-backend : add device description to CPU backend (#9720)
bandoti [Thu, 3 Oct 2024 15:39:03 +0000 (12:39 -0300)]
ggml: unify backend logging mechanism (#9709)
* Add scaffolding for ggml logging macros
* Metal backend now uses GGML logging
* Cuda backend now uses GGML logging
* Cann backend now uses GGML logging
* Add enum tag to parameters
* Use C memory allocation funcs
* Fix compile error
* Use GGML_LOG instead of GGML_PRINT
* Rename llama_state to llama_logger_state
* Prevent null format string
* Fix whitespace
* Remove log callbacks from ggml backends
* Remove cuda log statement
compilade [Thu, 3 Oct 2024 14:22:15 +0000 (10:22 -0400)]
convert : handle tokenizer merges format from transformers 4.45 (#9696)
Radoslav Gerganov [Thu, 3 Oct 2024 10:00:52 +0000 (13:00 +0300)]
rpc : enable vulkan (#9714)
closes #8536
Ouadie EL FAROUKI [Thu, 3 Oct 2024 06:50:44 +0000 (07:50 +0100)]
Fixed dequant precision issues in Q4_1 and Q5_1 (#9711)
Diego Devesa [Wed, 2 Oct 2024 23:49:47 +0000 (01:49 +0200)]
ggml-backend : add device and backend reg interfaces (#9707)
Co-authored-by: Johannes Gäßler <redacted>
Xuan Son Nguyen [Wed, 2 Oct 2024 13:49:55 +0000 (15:49 +0200)]
llama : reduce compile time and binary size (#9712)
* llama : speed up compile time
* fix build
* fix build (2)
Alberto Cabrera Pérez [Wed, 2 Oct 2024 12:57:18 +0000 (13:57 +0100)]
[SYCL] Initial cmake support of SYCL for AMD GPUs (#9658)
sycl: initial cmake support of SYCL for AMD GPUs
Radoslav Gerganov [Wed, 2 Oct 2024 10:49:16 +0000 (13:49 +0300)]
vulkan : do not use tensor->extra (#9407)
* vulkan : do not use tensor->extra
This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.
Ref: #8536
* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)
---------
Co-authored-by: 0cc4m <redacted>
Zhenwei Jin [Wed, 2 Oct 2024 07:21:57 +0000 (15:21 +0800)]
gguf-split : improve --split and --merge logic (#9619)
* make sure params --split and --merge are not specified at the same time
* update gguf-split params parse logic
* Update examples/gguf-split/gguf-split.cpp
Co-authored-by: slaren <redacted>
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
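A minimal sketch of the mutual-exclusion check described in the first bullet (hypothetical names, not the gguf-split.cpp code):
```
#include <cstdio>
#include <cstdlib>

static void validate_split_params(bool do_split, bool do_merge) {
    if (do_split && do_merge) {
        fprintf(stderr, "error: --split and --merge cannot be specified at the same time\n");
        exit(EXIT_FAILURE);
    }
}
```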
Georgi Gerganov [Wed, 2 Oct 2024 07:14:44 +0000 (10:14 +0300)]
examples : remove benchmark (#9704)
ggml-ci
Paweł Wodnicki [Tue, 1 Oct 2024 17:18:46 +0000 (12:18 -0500)]
Update README.md (#9591)
Add Bielik model.
Georgi Gerganov [Tue, 1 Oct 2024 13:09:42 +0000 (16:09 +0300)]
sync : ggml
Johannes Gäßler [Mon, 30 Sep 2024 07:55:23 +0000 (09:55 +0200)]
test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)
Salvatore Mesoraca [Mon, 30 Sep 2024 07:14:09 +0000 (09:14 +0200)]
vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0,
because they are calculated as the workgroup size
multiplied by LOAD_VEC_* (which can be 1) and divided by 16,
and the workgroup size is set to be the same as the
warp/subgroup size.
The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.
When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.
We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).
Signed-off-by: Salvatore Mesoraca <redacted>
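A simplified host-side sketch of the fix (hypothetical helper name, not the ggml-vulkan code): clamp the chosen workgroup size to at least 16 so that loadstride = workgroup_size * LOAD_VEC_* / 16 never becomes 0.
```
#include <algorithm>
#include <cstdint>

// e.g. a device with subgroup size 8 would otherwise end up with loadstride == 0
static uint32_t pick_mul_mm_workgroup_size(uint32_t device_subgroup_size) {
    return std::max<uint32_t>(device_subgroup_size, 16);
}
```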
Borislav Stanimirov [Mon, 30 Sep 2024 07:11:41 +0000 (10:11 +0300)]
ggml : fix ggml_cast (ggml/973)
Johannes Gäßler [Sun, 29 Sep 2024 21:18:02 +0000 (23:18 +0200)]
ggml: fix gradient allocation logic (ggml/966)
* ggml: fix gradient allocation logic
* gradient allocation in ggml_build_backward_expand
* fixup
* fix test-backend-ops grad
* suggestions by slaren
* fix test1.c
* fix legacy opt API
* fix test-grad0
* remove keep arg
Georgi Gerganov [Tue, 1 Oct 2024 13:00:25 +0000 (16:00 +0300)]
metal : reduce command encoding overhead (#9698)
* metal : reduce command encoding overhead
ggml-ci
* metal : add comments
Georgi Gerganov [Tue, 1 Oct 2024 08:42:01 +0000 (11:42 +0300)]
llama : print correct model type for Llama 3.2 1B and 3B
compilade [Tue, 1 Oct 2024 06:31:36 +0000 (02:31 -0400)]
convert : refactor rope_freqs generation (#9396)
* convert : refactor rope_freqs generation
This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion
MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code, which mixes tokenization
and tool calls in the same file.
gguf-py : add long and short RoPE factors to tensor mappings
Empty, but the key names are used to populate the mappings.
serhii-nakon [Mon, 30 Sep 2024 18:57:12 +0000 (21:57 +0300)]
Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641)
* Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS
* Set ROCM_DOCKER_ARCH as a string, because otherwise it builds incorrectly and causes an OOM exit code
compilade [Mon, 30 Sep 2024 18:13:16 +0000 (14:13 -0400)]
ci : reduce severity of unused Pyright ignore comments (#9697)
vb [Mon, 30 Sep 2024 15:03:47 +0000 (17:03 +0200)]
py : update transformers version (#9694)
* update transformers version.
* update hfh version.
Georgi Gerganov [Mon, 30 Sep 2024 14:48:49 +0000 (17:48 +0300)]
flake.lock: Update (#9680)
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)
→ 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)
Co-authored-by: github-actions[bot] <redacted>
Ruchira Hasaranga [Mon, 30 Sep 2024 08:23:42 +0000 (13:53 +0530)]
console : utf-8 fix for windows stdin (#9690)
* utf-8 fix for windows stdin
* Update common/console.cpp
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 29 Sep 2024 18:18:23 +0000 (21:18 +0300)]
ggml : define missing HWCAP flags (#9684)
ggml-ci
Co-authored-by: Willy Tarreau <redacted>
Georgi Gerganov [Sun, 29 Sep 2024 18:16:07 +0000 (21:16 +0300)]
sync : ggml
Johannes Gäßler [Sun, 29 Sep 2024 17:56:17 +0000 (19:56 +0200)]
CUDA: remove bad assert (ggml/972)
Jeff Bolz [Sun, 29 Sep 2024 16:50:17 +0000 (11:50 -0500)]
vulkan : multithread pipeline creation (ggml/963)
Jeff Bolz [Fri, 27 Sep 2024 07:58:01 +0000 (02:58 -0500)]
vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)