git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Georgi Gerganov [Fri, 8 Mar 2024 10:40:02 +0000 (12:40 +0200)]
server : fix EOS token detection with disabled cache (#5938)
UEXTM.com [Fri, 8 Mar 2024 09:35:04 +0000 (04:35 -0500)]
log : fix MSVC compile errors (#5643)
MSVC gives the following error with the existing macros:
`Error C2059 : syntax error: ','`
This patch adds `##` as a prefix to `__VA_ARGS__` to address this error.
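The fix relies on the `##__VA_ARGS__` extension, which drops the trailing comma when the variadic argument list is empty. A minimal illustration, using a made-up LOG_EXAMPLE macro rather than the actual macros from the log header:

    #include <cstdio>

    // Without `##`, LOG_EXAMPLE("no args\n") would expand to
    // fprintf(stderr, "no args\n", ) -- the dangling comma is what MSVC
    // reports as "C2059: syntax error: ','". The `##` removes the comma
    // whenever __VA_ARGS__ is empty.
    #define LOG_EXAMPLE(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)

    int main() {
        LOG_EXAMPLE("no args\n");          // expands cleanly, no trailing comma
        LOG_EXAMPLE("one arg: %d\n", 42);  // behaves like a plain fprintf call
        return 0;
    }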
Georgi Gerganov [Thu, 7 Mar 2024 14:32:38 +0000 (16:32 +0200)]
llama-bench : add embeddings option (#5924)
* llama-bench : add embeddings option
* llama-bench : do not hard code embd default value
---------
Co-authored-by: slaren <redacted>
Neo Zhang Jianyu [Thu, 7 Mar 2024 11:14:49 +0000 (19:14 +0800)]
Revert "[SYCL] fix error when set main gpu to non-zero (#5901)" (#5918)
This reverts commit ceca1aef0738b57951cd12c603c3477e75312dec.
Minsoo Cheong [Thu, 7 Mar 2024 10:42:39 +0000 (19:42 +0900)]
server : add `/v1/completions` endpoint (#5914)
* add-`/v1/completions`-endpoint
* add legacy comment to `/completion` endpoint
Georgi Gerganov [Thu, 7 Mar 2024 09:41:53 +0000 (11:41 +0200)]
server : refactor (#5882)
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fix prompt so it does not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can time out
* server: refactor: clean up http code (#5912)
* server : avoid n_available var
ggml-ci
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests
ggml-ci
* server : add comments
---------
Co-authored-by: Pierrick Hymbert <redacted>
Neo Zhang Jianyu [Thu, 7 Mar 2024 08:34:31 +0000 (16:34 +0800)]
[SYCL] fix error when set main gpu to non-zero (#5901)
* fix error when set main gpu to non-zero
* fix delete condition
Jared Van Bortel [Wed, 6 Mar 2024 20:42:23 +0000 (15:42 -0500)]
ggml : use SYS_get_cpu if SYS_getcpu is not defined (#5906)
Fixes #5694
Fixes ggerganov/whisper.cpp#1894
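A minimal sketch of the fallback described in the title, assuming the usual Linux getcpu(2) raw-syscall arguments; this is not the actual ggml patch:

    #include <sys/syscall.h>
    #include <unistd.h>

    // Some C libraries only provide the SYS_get_cpu spelling, so map the
    // missing SYS_getcpu macro onto it before issuing the raw syscall.
    #if !defined(SYS_getcpu) && defined(SYS_get_cpu)
    #define SYS_getcpu SYS_get_cpu
    #endif

    // Returns the CPU the calling thread is running on, or -1 on failure.
    static int current_cpu(void) {
        unsigned int cpu  = 0;
        unsigned int node = 0;
        if (syscall(SYS_getcpu, &cpu, &node, nullptr) != 0) {
            return -1;
        }
        return (int) cpu;
    }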
bobqianic [Wed, 6 Mar 2024 07:35:07 +0000 (07:35 +0000)]
ggml : use `uint8x16_t` return type for `ggml_vqtbl1q_u8` (#5894)
* use uint8x16_t
* Update ggml-quants.c
Georgi Gerganov [Wed, 6 Mar 2024 07:12:25 +0000 (09:12 +0200)]
convert : remove AWQ remnants (#5768)
Neo Zhang Jianyu [Wed, 6 Mar 2024 04:08:32 +0000 (12:08 +0800)]
add wait() to make code stable (#5895)
slaren [Tue, 5 Mar 2024 21:27:29 +0000 (22:27 +0100)]
compare-llama-bench.py : remove mul_mat_q (#5892)
Jared Van Bortel [Tue, 5 Mar 2024 16:56:37 +0000 (11:56 -0500)]
quants : use MM256_SET_M128I consistently to fix gcc 7 build (#5889)
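The helper named in the title is commonly defined as the cast-plus-insert combination shown below, because gcc 7 lacks `_mm256_set_m128i`; this is an assumed reconstruction, not necessarily ggml's exact source:

    #include <immintrin.h>

    // Build a 256-bit integer vector from two 128-bit halves without
    // _mm256_set_m128i: cast the low half up, then insert the high half.
    #ifndef MM256_SET_M128I
    #define MM256_SET_M128I(a, b) _mm256_insertf128_si256(_mm256_castsi128_si256(b), (a), 1)
    #endif

    __m256i combine_halves(__m128i hi, __m128i lo) {
        return MM256_SET_M128I(hi, lo);  // lo -> bits [0,128), hi -> bits [128,256)
    }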
ExtReMLapin [Tue, 5 Mar 2024 16:33:08 +0000 (17:33 +0100)]
grammars : blacklist control character set (#5888)
* Prevent control characters from being served in json string
* Prevent control characters from being served in json string (array)
Georgi Gerganov [Tue, 5 Mar 2024 13:56:24 +0000 (15:56 +0200)]
Revert "grammars : don't allow to output unescaped new line in string (#5885)"
This reverts commit b1a4e994fde929300d4aeb1deb8320c59cb6edec.
ExtReMLapin [Tue, 5 Mar 2024 13:44:29 +0000 (14:44 +0100)]
grammars : don't allow outputting an unescaped newline in a string (#5885)
* Don't allow grammar json array to output unescaped new line in string
* Don't allow new line in json object string
0cc4m [Tue, 5 Mar 2024 12:33:42 +0000 (13:33 +0100)]
Vulkan Improvements (#5835)
* Improve dequant shaders, add fast q4_0 dequant
* Optimize dmmv non-kquants for GCN
Remove unnecessary SPIR-V shader duplication
* Fix q4_0 dequant dispatch sizes
Fix backend free bug
* Optimize dequant shaders for q4_1, q5_0, q5_1 and q8_0
* Add unary and binary op shader templates
* Fix Vulkan check results
* Enable non-contiguous support for simple ops
* Add argsort
Basic q4_0 mmq shader and unit test
* Speed up q4_0 dequant code, enable mmq for q4_0
* Rework matmul pipeline selection
* Add soft_max alibi support
* Add q4_1, q5_0, q5_1 and q8_0 dequant mat mat mul shaders
* Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size
Rename GGML_VULKAN_DISABLE_F16 to GGML_VK_DISABLE_F16 for consistency
Neo Zhang Jianyu [Tue, 5 Mar 2024 08:08:35 +0000 (16:08 +0800)]
[SYCL] fix mul_mat fault in CI/unit-test (#5862)
* fix mul_mat fault in cpy_f32_f16
* rm unused function
* add wait() for memcpy
* restore ci/run.sh, rename struct definition, fix bug in ggml_sycl_op_mul_mat_sycl
* fix format issue
* llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <redacted>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <redacted>
* llama : refactor internal quantization functions (#5830)
* scripts : add pod-llama.sh
* ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product
On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <redacted>
* convert-hf : make model class definitions self-contained (#5825)
* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)
* ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
* llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation
* format fix
* a brief explaining comment
---------
Co-authored-by: Georgi Gerganov <redacted>
* server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
* flake.lock: Update (#5842)
Flake lock file updates:
• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
    → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
    → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
    → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
Co-authored-by: github-actions[bot] <redacted>
* server : init http requests thread pool with --parallel if set (#5836)
* ci : schedule slow server tests only on Release or on demand (#5839)
* llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
* gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <redacted>
* llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <redacted>
* readme : add API changes section
* cuda : fix data race in soft max (#5853)
* main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once.
* main : minor
---------
Co-authored-by: Georgi Gerganov <redacted>
* common : use LLAMA_DEFAULT_SEED (#5855)
* add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm
* cuda: add batch inference support for ggml_pad/ggml_upscale
* add ggml_arange
* add ggml_timestep_embedding
* update ggml_arange/ggml_timestep_embedding tests
* cuda: fix im2col
* add ggml_arange/ggml_timestep_embedding support for metal backend
* fix some bugs
* fix some bugs
* Update ggml.h
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-cuda.cu
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.metal
Co-authored-by: Georgi Gerganov <redacted>
* modify according to the review comments
* ggml : fix compile warnings + code style
* ggml : normalize compute_forward calls + fix seg fault in debug
* minor
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
* sync : ggml
* add alias for chat template (#5858)
* speculative : implement stochastic speculative sampling (#5625)
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix #5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix r random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
* cmake : handle cases where git index is not found in .git (#5844)
* Update CMakeLists.txt
* Update CMakeLists.txt
* ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros
* update return type from enum to unsigned int
* indentation fix
* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast
* ggml_status to string cast
* GGML_CALL was removed
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
* sync : ggml
ggml-ci
* ggml : fix unknown status (#0)
* flake : fix
* llama : fix embeddings (#5796)
* llama : fix embeddings
ggml-ci
* llama : do not use KV cache for non-causal models
ggml-ci
* embeddings : fix llama_batch_init arg
* llama : add pooling switch
* llama : distinguish token vs sequence embeddings
ggml-ci
* llama : assert pooling tensor
* llama : simplify causal mask condition
ggml-ci
* llama : assert input batch with pooling enabled
* readme : update API changes list
* nix: static build (#5814)
* fix speculative decoding build on windows (#5874)
* rebase and rm trailing space
---------
Co-authored-by: LiangtaoJin <redacted>
Co-authored-by: compilade <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Kawrakow <redacted>
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Michael Podvitskiy <redacted>
Co-authored-by: Pierrick Hymbert <redacted>
Co-authored-by: github-actions[bot] <redacted>
Co-authored-by: Nindaleth <redacted>
Co-authored-by: Black_Fox <redacted>
Co-authored-by: Douglas Hanley <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: DAN™ <redacted>
Co-authored-by: leejet <redacted>
Co-authored-by: Minsoo Cheong <redacted>
Co-authored-by: Dane Madsen <redacted>
Co-authored-by: hutli <redacted>
Co-authored-by: Jeffrey Quesnelle <redacted>
Minsoo Cheong [Tue, 5 Mar 2024 06:12:23 +0000 (15:12 +0900)]
fix editorconfig check break (#5879)
Jeffrey Quesnelle [Tue, 5 Mar 2024 03:23:06 +0000 (19:23 -0800)]
fix speculative decoding build on windows (#5874)
hutli [Tue, 5 Mar 2024 01:33:08 +0000 (02:33 +0100)]
nix: static build (#5814)
Georgi Gerganov [Mon, 4 Mar 2024 20:31:20 +0000 (22:31 +0200)]
llama : fix embeddings (#5796)
* llama : fix embeddings
ggml-ci
* llama : do not use KV cache for non-causal models
ggml-ci
* embeddings : fix llama_batch_init arg
* llama : add pooling switch
* llama : distinguish token vs sequence embeddings
ggml-ci
* llama : assert pooling tensor
* llama : simplify causal mask condition
ggml-ci
* llama : assert input batch with pooling enabled
* readme : update API changes list
Georgi Gerganov [Mon, 4 Mar 2024 19:50:50 +0000 (21:50 +0200)]
flake : fix
Georgi Gerganov [Mon, 4 Mar 2024 18:53:27 +0000 (20:53 +0200)]
ggml : fix unknown status (#0)
Georgi Gerganov [Mon, 4 Mar 2024 09:06:39 +0000 (11:06 +0200)]
sync : ggml
ggml-ci
Michael Podvitskiy [Mon, 4 Mar 2024 09:05:42 +0000 (10:05 +0100)]
ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros
* update return type from enum to unsigned int
* indentation fix
* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast
* ggml_status to string cast
* GGML_CALL was removed
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Dane Madsen [Mon, 4 Mar 2024 18:26:55 +0000 (05:26 +1100)]
cmake : handle cases where git index is not found in .git (#5844)
* Update CMakeLists.txt
* Update CMakeLists.txt
Minsoo Cheong [Mon, 4 Mar 2024 18:24:00 +0000 (03:24 +0900)]
speculative : implement stochastic speculative sampling (#5625)
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix #5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix r random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
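The entry above implements stochastic speculative sampling; the sketch below shows the generic accept/residual-resample rule that technique is built on, with made-up types and no llama.cpp internals:

    #include <algorithm>
    #include <random>
    #include <vector>

    // Keep the drafted token with probability min(1, p_target/p_draft); on
    // rejection, draw from the normalized residual max(0, p_target - p_draft),
    // which preserves the target distribution overall.
    int accept_or_resample(int draft_token,
                           const std::vector<float> & p_target,
                           const std::vector<float> & p_draft,
                           std::mt19937 & rng) {
        std::uniform_real_distribution<float> unif(0.0f, 1.0f);

        const float pt = p_target[draft_token];
        const float pd = p_draft[draft_token];
        if (pd > 0.0f && unif(rng) < std::min(1.0f, pt / pd)) {
            return draft_token;  // accepted: keep the drafted token
        }

        std::vector<float> residual(p_target.size());
        for (size_t i = 0; i < residual.size(); ++i) {
            residual[i] = std::max(0.0f, p_target[i] - p_draft[i]);
        }
        std::discrete_distribution<int> dist(residual.begin(), residual.end());
        return dist(rng);  // resampled from the residual distribution
    }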
Xuan Son Nguyen [Mon, 4 Mar 2024 11:22:08 +0000 (12:22 +0100)]
add alias for chat template (#5858)
Georgi Gerganov [Mon, 4 Mar 2024 08:40:04 +0000 (10:40 +0200)]
sync : ggml
leejet [Sun, 3 Mar 2024 12:23:52 +0000 (20:23 +0800)]
add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm
* cuda: add batch inference support for ggml_pad/ggml_upscale
* add ggml_arange
* add ggml_timestep_embedding
* update ggml_arange/ggml_timestep_embedding tests
* cuda: fix im2col
* add ggml_arange/ggml_timestep_embedding support for metal backend
* fix some bugs
* fix some bugs
* Update ggml.h
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-cuda.cu
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml-metal.metal
Co-authored-by: Georgi Gerganov <redacted>
* modify according to the review comments
* ggml : fix compile warnings + code style
* ggml : normalize compute_forward calls + fix seg fault in debug
* minor
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
DAN™ [Mon, 4 Mar 2024 08:08:19 +0000 (03:08 -0500)]
common : use LLAMA_DEFAULT_SEED (#5855)
DAN™ [Mon, 4 Mar 2024 07:57:20 +0000 (02:57 -0500)]
main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once.
* main : minor
---------
Co-authored-by: Georgi Gerganov <redacted>
slaren [Sun, 3 Mar 2024 13:26:18 +0000 (14:26 +0100)]
cuda : fix data race in soft max (#5853)
Georgi Gerganov [Sun, 3 Mar 2024 10:44:03 +0000 (12:44 +0200)]
readme : add API changes section
Douglas Hanley [Sun, 3 Mar 2024 10:40:27 +0000 (04:40 -0600)]
llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <redacted>
Nindaleth [Sun, 3 Mar 2024 08:43:42 +0000 (09:43 +0100)]
gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <redacted>
compilade [Sun, 3 Mar 2024 08:41:55 +0000 (03:41 -0500)]
llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
Pierrick Hymbert [Sun, 3 Mar 2024 08:35:23 +0000 (09:35 +0100)]
ci : schedule slow server tests only on Release or on demand (#5839)
Pierrick Hymbert [Sun, 3 Mar 2024 07:48:36 +0000 (08:48 +0100)]
server : init http requests thread pool with --parallel if set (#5836)
Georgi Gerganov [Sun, 3 Mar 2024 04:11:31 +0000 (06:11 +0200)]
flake.lock: Update (#5842)
Flake lock file updates:
• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
    → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
    → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
    → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
Co-authored-by: github-actions[bot] <redacted>
Pierrick Hymbert [Sat, 2 Mar 2024 21:00:14 +0000 (22:00 +0100)]
server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
Michael Podvitskiy [Sat, 2 Mar 2024 19:52:25 +0000 (20:52 +0100)]
llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation
* format fix
* a brief explaining comment
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sat, 2 Mar 2024 18:00:49 +0000 (20:00 +0200)]
ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
Jared Van Bortel [Sat, 2 Mar 2024 17:27:26 +0000 (12:27 -0500)]
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)
Jared Van Bortel [Sat, 2 Mar 2024 17:21:47 +0000 (12:21 -0500)]
convert-hf : make model class definitions self-contained (#5825)
Kawrakow [Sat, 2 Mar 2024 15:00:51 +0000 (17:00 +0200)]
ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product
On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Sat, 2 Mar 2024 14:54:08 +0000 (16:54 +0200)]
scripts : add pod-llama.sh
Xuan Son Nguyen [Sat, 2 Mar 2024 14:19:09 +0000 (15:19 +0100)]
llama : refactor internal quantization functions (#5830)
compilade [Sat, 2 Mar 2024 13:42:56 +0000 (08:42 -0500)]
llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <redacted>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <redacted>
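The const-map point in the entry above (operator[] is unavailable on a const std::map, while at() works and throws for a missing key) can be shown with a short standalone snippet, unrelated to the actual llama.cpp tables:

    #include <iostream>
    #include <map>
    #include <string>

    static const std::map<int, std::string> ARCH_NAMES = {
        { 0, "llama" },
        { 1, "(unknown)" },
    };

    int main() {
        // ARCH_NAMES[0] would not compile: operator[] may insert a default
        // value, so it is not provided for const maps. at() only reads and
        // throws std::out_of_range for a missing key.
        std::cout << ARCH_NAMES.at(0) << "\n";
        return 0;
    }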
Neo Zhang Jianyu [Sat, 2 Mar 2024 11:49:30 +0000 (19:49 +0800)]
Support multiple GPUs (split mode) on SYCL backend (#5806)
* support multiple cards: split-mode - layer|row
* rm warning
* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test
* update news
* fix merge error
* update according to review comments
crasm [Sat, 2 Mar 2024 05:11:06 +0000 (00:11 -0500)]
workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing due to
running out of space.
Fixes the 'No space left on device' issue mentioned in #5703.
Tushar [Fri, 1 Mar 2024 23:18:26 +0000 (04:48 +0530)]
build(nix): Introduce flake.formatter for `nix fmt` (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
nold [Fri, 1 Mar 2024 21:51:12 +0000 (22:51 +0100)]
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)
Sourab Mangrulkar [Fri, 1 Mar 2024 19:30:46 +0000 (01:00 +0530)]
llama : add StarCoder2 support (#5795)
* Add support for starcoder2
* handle rope type
* skip rope freq and rotary embeddings from being serialized
* resolve comments
* Update llama.cpp
* remove redundant changes
* handle `rope-theta`
* llama : change starcoder2 rope type
* address comment
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 1 Mar 2024 18:00:58 +0000 (20:00 +0200)]
server : remove api_like_OAI.py proxy script (#5808)
ddpasa [Fri, 1 Mar 2024 17:00:00 +0000 (18:00 +0100)]
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)
kunal-vaishnavi [Fri, 1 Mar 2024 14:08:08 +0000 (06:08 -0800)]
gemma : fix bfloat16 -> float16 conversion issue (#5810)
Miwa / Ensan [Fri, 1 Mar 2024 13:48:56 +0000 (22:48 +0900)]
common : fix flag `--logits-all` to `--all-logits` (#5805)
Pierrick Hymbert [Fri, 1 Mar 2024 11:39:06 +0000 (12:39 +0100)]
llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q
* remove: mul_mat_q in compare llama bench and usage
* update llama-bench
---------
Co-authored-by: slaren <redacted>
Douglas Hanley [Fri, 1 Mar 2024 09:15:36 +0000 (03:15 -0600)]
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* dont construct new locale every time
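The entry above swaps a nested-container NFD table for a multimap to reduce compile time; below is a generic illustration of the multimap lookup with one hard-coded decomposition, not the real Unicode data or llama.cpp identifiers:

    #include <cstdint>
    #include <map>
    #include <vector>

    using codepoint = uint32_t;

    // U+00E9 'é' decomposes to U+0065 'e' + U+0301 COMBINING ACUTE ACCENT;
    // a multimap stores one entry per decomposed codepoint instead of a
    // map from codepoint to vector, keeping the initializer flat.
    static const std::multimap<codepoint, codepoint> nfd_map = {
        { 0x00E9, 0x0065 },
        { 0x00E9, 0x0301 },
    };

    static std::vector<codepoint> decompose(codepoint cp) {
        const auto range = nfd_map.equal_range(cp);
        if (range.first == range.second) {
            return { cp };  // no decomposition recorded, keep as-is
        }
        std::vector<codepoint> out;
        for (auto it = range.first; it != range.second; ++it) {
            out.push_back(it->second);
        }
        return out;
    }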
Pierrick Hymbert [Fri, 1 Mar 2024 09:08:08 +0000 (10:08 +0100)]
server: allow to override threads server pool with --threads-http (#5794)
Eve [Fri, 1 Mar 2024 08:54:53 +0000 (08:54 +0000)]
ci : add Ubuntu 22 Vulkan CI run (#5789)
Georgi Gerganov [Fri, 1 Mar 2024 07:59:43 +0000 (09:59 +0200)]
server : fix newlines in help (#5785)
AidanBeltonS [Fri, 1 Mar 2024 07:36:47 +0000 (07:36 +0000)]
[SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway
* rm extra line
* Explicitly state scaled data type
---------
Co-authored-by: Abhilash Majumder <redacted>
Xuan Son Nguyen [Thu, 29 Feb 2024 20:42:11 +0000 (21:42 +0100)]
Server: normalize naming (#5779)
* server: normalize naming
* fix spacing
Marcus Dunn [Thu, 29 Feb 2024 08:17:23 +0000 (00:17 -0800)]
llama : constified `llama_set_state_data`'s `src` (#5774)
Georgi Gerganov [Wed, 28 Feb 2024 19:44:21 +0000 (21:44 +0200)]
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
ggml-ci
Eve [Wed, 28 Feb 2024 19:33:37 +0000 (19:33 +0000)]
make portability_enumeration_ext apple only (#5757)
Georgi Gerganov [Wed, 28 Feb 2024 16:43:38 +0000 (18:43 +0200)]
llama : remove deprecated API (#5770)
ggml-ci
Georgi Gerganov [Wed, 28 Feb 2024 15:36:53 +0000 (17:36 +0200)]
awq-py : remove (#5768)
Georgi Gerganov [Wed, 28 Feb 2024 09:17:32 +0000 (11:17 +0200)]
sync : ggml
slaren [Sun, 25 Feb 2024 19:41:35 +0000 (20:41 +0100)]
add google magika inference example (ggml/748)
* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup
ggml-ci
UEXTM.com [Sat, 24 Feb 2024 16:27:36 +0000 (11:27 -0500)]
Introduce backend GUIDs (ggml/743)
* Introduce backend GUIDs
Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)
Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID
* Remove redundant functions
Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
* Add spaces to match style
Co-authored-by: slaren <redacted>
* Fix brace style to match
Co-authored-by: slaren <redacted>
* Add void to () in function signature
Co-authored-by: slaren <redacted>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends
ggml-ci
---------
Co-authored-by: slaren <redacted>
Xuan Son Nguyen [Wed, 28 Feb 2024 08:55:37 +0000 (09:55 +0100)]
server : hit Ctrl+C twice to exit (#5734)
* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Jared Van Bortel <redacted>
compilade [Wed, 28 Feb 2024 08:52:56 +0000 (03:52 -0500)]
llama : fix non-quantization of expert gating tensors (#5754)
This reverts a single line from #5475
Douglas Hanley [Wed, 28 Feb 2024 08:51:11 +0000 (02:51 -0600)]
llama : improve BERT tokenization (#5740)
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Jared Van Bortel <redacted>
Daniel Bevenius [Wed, 28 Feb 2024 08:39:39 +0000 (09:39 +0100)]
readme : add link to LLaVA 1.6 models (#5758)
Signed-off-by: Daniel Bevenius <redacted>
Jorge A [Wed, 28 Feb 2024 08:39:15 +0000 (01:39 -0700)]
server : add "/chat/completions" alias for "/v1/...` (#5722)
* Add "/chat/completions" as alias for "/v1/chat/completions"
* merge to upstream master
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Wed, 28 Feb 2024 08:37:02 +0000 (10:37 +0200)]
ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)
* WIP: make i-quants work for QK_K = 64
* iq2_xs: attempt to fix AVX dot product for QK_K = 64
Tests pass, but I get gibberish.
* QK_K = 64 tests pass on ARM_NEON and Metal
Sadly, that does not mean it actually works.
* Make CUDA compile with QK_K = 64
Tests don't pass, plus we get misaligned access
* Q2_K: fixed bug in imatrix quantization for QK_K = 64
* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)
---------
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Tue, 27 Feb 2024 17:16:49 +0000 (19:16 +0200)]
Attempt to fix android build (#5752)
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Tue, 27 Feb 2024 14:34:24 +0000 (16:34 +0200)]
IQ4_XS: a 4.25 bpw quantization (#5747)
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <redacted>
Engininja2 [Tue, 27 Feb 2024 13:22:45 +0000 (07:22 -0600)]
cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)
Engininja2 [Tue, 27 Feb 2024 12:50:18 +0000 (06:50 -0600)]
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)
Georgi Gerganov [Tue, 27 Feb 2024 12:35:51 +0000 (14:35 +0200)]
llama : fix defrag bugs + add parameter (#5735)
* llama : fix defrag bugs + enable by default
ggml-ci
* llama : add defrag_thold parameter
ggml-ci
* llama : cont
* llama : disable log message
ggml-ci
* llama : fix graph size check during defrag
le.chang [Tue, 27 Feb 2024 02:03:06 +0000 (10:03 +0800)]
Makefile: use variables for cublas (#5689)
* make: use arch variable for cublas
* fix UNAME_M
* check opt first
---------
Co-authored-by: lindeer <redacted>
Xuan Son Nguyen [Mon, 26 Feb 2024 22:15:48 +0000 (23:15 +0100)]
fix server hangs on empty prompt (#5733)
Kawrakow [Mon, 26 Feb 2024 16:28:38 +0000 (18:28 +0200)]
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Mon, 26 Feb 2024 14:36:38 +0000 (15:36 +0100)]
CUDA: fix DEBUG_CUDA_MALLOC (#5729)
Artem [Mon, 26 Feb 2024 14:15:28 +0000 (17:15 +0300)]
readme : update ui list (#5731)
* Add LLMFarm (ui for iOS) to list
AidanBeltonS [Mon, 26 Feb 2024 14:02:11 +0000 (14:02 +0000)]
[SYCL] Add support for soft_max ALiBi (#5639)
* Add support for bias
* Update pre-processor
* rm commented code
* fix format
* fix CI
---------
Co-authored-by: Abhilash Majumder <redacted>
Georgi Gerganov [Mon, 26 Feb 2024 12:02:12 +0000 (14:02 +0200)]
unicode : reuse iterator (#5726)
Pierrick Hymbert [Mon, 26 Feb 2024 10:41:34 +0000 (11:41 +0100)]
server: CI fix trailing space (#5728)
Pierrick Hymbert [Mon, 26 Feb 2024 08:56:10 +0000 (09:56 +0100)]
server: CI tests reduce build matrix (#5725)
Georgi Gerganov [Mon, 26 Feb 2024 06:30:17 +0000 (08:30 +0200)]
llama : fix Gemma rope type (#5691)
github-actions[bot] [Sun, 25 Feb 2024 00:17:11 +0000 (00:17 +0000)]
flake.lock: Update
Flake lock file updates:
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
    → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
Pierrick Hymbert [Sun, 25 Feb 2024 21:48:33 +0000 (22:48 +0100)]
server: tests - slow inference causes timeout on the CI (#5715)
* server: tests - longer inference timeout for CI
Pierrick Hymbert [Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)]
server: docs - refresh and tease the http server a little bit more (#5718)
* server: docs - refresh and tease the http server a little bit more
* Rephrase README.md server doc
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <redacted>
* Update README.md
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 25 Feb 2024 20:12:24 +0000 (22:12 +0200)]
llama : refactor k-shift implementation + KV defragmentation (#5691)
* llama : refactor k-shift implementation
ggml-ci
* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
* llama : cont k-shift refactoring + normalize type names
ggml-ci
* minor : fix MPI builds
* llama : reuse n_rot from the build context
ggml-ci
* llama : revert enum name changes from this PR
ggml-ci
* llama : update llama_rope_type
* llama : add comment about rope values
* llama : fix build
* passkey : apply kv cache updates explicitly
ggml-ci
* llama : change name to llama_kv_cache_update()
* llama : add llama_kv_cache_seq_pos_max()
* passkey : fix llama_kv_cache_seq_pos_max() usage
* llama : some llama_kv_cell simplifications
* llama : add llama_kv_cache_compress (EXPERIMENTAL)
* llama : add alternative KV cache merging (EXPERIMENTAL)
* llama : add llama_kv_cache_defrag
* llama : comments
* llama : remove llama_kv_cache_compress
will add in a separate PR
ggml-ci
* llama : defragment via non-overlapping moves
* llama : ggml_graph based defrag implementation
ggml-ci
* llama : switch the loop order in build_defrag
* llama : add comments
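The "defragment via non-overlapping moves" step above can be illustrated with a generic compaction planner; all names are made up and this is not the llama.cpp implementation:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Plan moves that compact occupied cells toward the front: each hole near
    // the head is filled by the last occupied cell near the tail, so every
    // move copies from an index strictly greater than its destination.
    std::vector<std::pair<uint32_t, uint32_t>> plan_defrag(const std::vector<bool> & occupied) {
        std::vector<std::pair<uint32_t, uint32_t>> moves;  // (src, dst)
        uint32_t dst = 0;
        uint32_t src = (uint32_t) occupied.size();
        while (true) {
            while (dst < src && occupied[dst])      dst++;  // first hole from the front
            while (src > dst && !occupied[src - 1]) src--;  // last occupied cell from the back
            if (dst >= src) {
                break;  // everything before src is already packed
            }
            moves.emplace_back(src - 1, dst);
            dst++;
            src--;
        }
        return moves;
    }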
compilade [Sun, 25 Feb 2024 18:43:50 +0000 (13:43 -0500)]
server : fix crash when system prompt is bigger than batch size (#5714)
The system prompt is now decoded in batches.
* server : fix off-by-one n_past when start of prompt matches whole cache
The tokens right after the matching part would otherwise skip a pos value.
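The fix above decodes the system prompt in n_batch-sized chunks; a generic sketch of that chunking pattern, with a hypothetical decode_chunk stand-in rather than the server's actual code:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for the real decode call; it only reports what a
    // real implementation would feed to the model for each chunk.
    static void decode_chunk(const int32_t * /*tokens*/, size_t n_tokens, size_t n_past) {
        printf("decode %zu tokens at n_past=%zu\n", n_tokens, n_past);
    }

    // Feed a prompt of arbitrary length in chunks of at most n_batch tokens.
    static void decode_prompt(const std::vector<int32_t> & tokens, size_t n_batch) {
        for (size_t i = 0; i < tokens.size(); i += n_batch) {
            const size_t n_eval = std::min(tokens.size() - i, n_batch);
            decode_chunk(tokens.data() + i, n_eval, /*n_past=*/ i);
        }
    }

    int main() {
        std::vector<int32_t> prompt(10, 1);  // dummy token ids
        decode_prompt(prompt, /*n_batch=*/ 4);
        return 0;
    }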