git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
DAN™ [Mon, 4 Mar 2024 08:08:19 +0000 (03:08 -0500)]
common : use LLAMA_DEFAULT_SEED (#5855)
DAN™ [Mon, 4 Mar 2024 07:57:20 +0000 (02:57 -0500)]
main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once.
* main : minor
---------
Co-authored-by: Georgi Gerganov <redacted>
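A minimal sketch of the idea in the commit above: antiprompts that map to single special tokens are tokenized once up front, and generation then only compares the last sampled token id against that list (names here are illustrative, not the actual main.cpp code):

    #include <cstdint>
    #include <vector>

    typedef int32_t llama_token_id;   // stand-in for llama.cpp's token type

    // return true when the last sampled token matches one of the pre-tokenized
    // special-token antiprompts, signalling that generation should stop
    static bool is_antiprompt_token(llama_token_id last,
                                    const std::vector<llama_token_id> & antiprompt_ids) {
        for (const llama_token_id id : antiprompt_ids) {
            if (id == last) {
                return true;
            }
        }
        return false;
    }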
slaren [Sun, 3 Mar 2024 13:26:18 +0000 (14:26 +0100)]
cuda : fix data race in soft max (#5853)
Georgi Gerganov [Sun, 3 Mar 2024 10:44:03 +0000 (12:44 +0200)]
readme : add API changes section
Douglas Hanley [Sun, 3 Mar 2024 10:40:27 +0000 (04:40 -0600)]
llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <redacted>
Nindaleth [Sun, 3 Mar 2024 08:43:42 +0000 (09:43 +0100)]
gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <redacted>
compilade [Sun, 3 Mar 2024 08:41:55 +0000 (03:41 -0500)]
llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
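As a sketch of what the fix above describes (illustrative types, not the actual llama.cpp internals): a cell_max-style helper that scans the KV cache for the last occupied cell, returns uint32_t, and can return 1 when only cell 0 is in use:

    #include <cstdint>
    #include <vector>

    struct kv_cell {
        int32_t pos = -1;   // -1 marks an unused cell (hypothetical layout)
    };

    // one past the last used cell, never less than 1, as a uint32_t
    static uint32_t kv_cache_cell_max(const std::vector<kv_cell> & cells) {
        for (uint32_t i = (uint32_t) cells.size(); i > 0; --i) {
            if (cells[i - 1].pos >= 0) {
                return i;
            }
        }
        return 1;
    }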
Pierrick Hymbert [Sun, 3 Mar 2024 08:35:23 +0000 (09:35 +0100)]
ci : schedule slow server tests only on Release or on demand (#5839)
Pierrick Hymbert [Sun, 3 Mar 2024 07:48:36 +0000 (08:48 +0100)]
server : init http requests thread pool with --parallel if set (#5836)
Georgi Gerganov [Sun, 3 Mar 2024 04:11:31 +0000 (06:11 +0200)]
flake.lock: Update (#5842)
Flake lock file updates:
• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
Co-authored-by: github-actions[bot] <redacted>
Pierrick Hymbert [Sat, 2 Mar 2024 21:00:14 +0000 (22:00 +0100)]
server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
Michael Podvitskiy [Sat, 2 Mar 2024 19:52:25 +0000 (20:52 +0100)]
llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation
* format fix
* a brief explaining comment
---------
Co-authored-by: Georgi Gerganov <redacted>
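A rough sketch of the abort-callback pattern the commit above wires in, using hypothetical names (the real hook is passed through ggml/llama parameters):

    #include <atomic>

    typedef bool (*abort_callback_t)(void * data);   // returns true to stop

    static std::atomic<bool> g_stop{false};

    static bool should_abort(void * /*data*/) {
        return g_stop.load();
    }

    // the compute loop polls the callback between nodes and bails out early
    static void compute_graph(abort_callback_t cb, void * cb_data, int n_nodes) {
        for (int i = 0; i < n_nodes; ++i) {
            if (cb && cb(cb_data)) {
                return;   // computation interrupted by the caller
            }
            // ... evaluate node i ...
        }
    }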
Georgi Gerganov [Sat, 2 Mar 2024 18:00:49 +0000 (20:00 +0200)]
ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
Jared Van Bortel [Sat, 2 Mar 2024 17:27:26 +0000 (12:27 -0500)]
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)
Jared Van Bortel [Sat, 2 Mar 2024 17:21:47 +0000 (12:21 -0500)]
convert-hf : make model class definitions self-contained (#5825)
Kawrakow [Sat, 2 Mar 2024 15:00:51 +0000 (17:00 +0200)]
ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product
On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <redacted>
Georgi Gerganov [Sat, 2 Mar 2024 14:54:08 +0000 (16:54 +0200)]
scripts : add pod-llama.sh
Xuan Son Nguyen [Sat, 2 Mar 2024 14:19:09 +0000 (15:19 +0100)]
llama : refactor internal quantization functions (#5830)
compilade [Sat, 2 Mar 2024 13:42:56 +0000 (08:42 -0500)]
llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <redacted>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <redacted>
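A small sketch of the const-map pattern described above, with illustrative entries: a const std::map has no operator[], so lookups go through at() or iteration, and giving LLM_ARCH_UNKNOWN a real "(unknown)" entry avoids at() errors when printing the arch name:

    #include <map>
    #include <string>

    enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_UNKNOWN };

    static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
        { LLM_ARCH_LLAMA,   "llama"     },
        { LLM_ARCH_UNKNOWN, "(unknown)" },   // non-empty name for the unknown arch
    };

    static const char * llm_arch_name(llm_arch arch) {
        return LLM_ARCH_NAMES.at(arch);      // operator[] does not exist on a const map
    }

    static llm_arch llm_arch_from_string(const std::string & name) {
        for (const auto & kv : LLM_ARCH_NAMES) {
            if (name == kv.second) {
                return kv.first;
            }
        }
        return LLM_ARCH_UNKNOWN;             // unknown arch name no longer segfaults
    }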
Neo Zhang Jianyu [Sat, 2 Mar 2024 11:49:30 +0000 (19:49 +0800)]
Support multiple GPUs (split mode) on SYCL backend (#5806)
* support multiple cards: split-mode - layer|row
* rm warning
* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test
* update news
* fix merge error
* update according to review comments
crasm [Sat, 2 Mar 2024 05:11:06 +0000 (00:11 -0500)]
workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing from
running out of space.
Fixes the 'No space left on device' issue mentioned in #5703.
Tushar [Fri, 1 Mar 2024 23:18:26 +0000 (04:48 +0530)]
build(nix): Introduce flake.formatter for `nix fmt` (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
nold [Fri, 1 Mar 2024 21:51:12 +0000 (22:51 +0100)]
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)
Sourab Mangrulkar [Fri, 1 Mar 2024 19:30:46 +0000 (01:00 +0530)]
llama : add StarCoder2 support (#5795)
* Add support for starcoder2
* handle rope type
* skip rope freq and rotary embeddings from being serialized
* resolve comments
* Update llama.cpp
* remove redundant changes
* handle `rope-theta`
* llama : change starcoder2 rope type
* address comment
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 1 Mar 2024 18:00:58 +0000 (20:00 +0200)]
server : remove api_like_OAI.py proxy script (#5808)
ddpasa [Fri, 1 Mar 2024 17:00:00 +0000 (18:00 +0100)]
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)
kunal-vaishnavi [Fri, 1 Mar 2024 14:08:08 +0000 (06:08 -0800)]
gemma : fix bfloat16 -> float16 conversion issue (#5810)
Miwa / Ensan [Fri, 1 Mar 2024 13:48:56 +0000 (22:48 +0900)]
common : fix flag `--logits-all` to `--all-logits` (#5805)
Pierrick Hymbert [Fri, 1 Mar 2024 11:39:06 +0000 (12:39 +0100)]
llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q
* remove: mul_mat_q in compare llama bench and usage
* update llama-bench
---------
Co-authored-by: slaren <redacted>
Douglas Hanley [Fri, 1 Mar 2024 09:15:36 +0000 (03:15 -0600)]
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* don't construct a new locale every time
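A sketch of the multimap idea (the entries below are made up for illustration; the real table holds the Unicode NFD data): one composed codepoint can decompose into several codepoints, which a std::multimap expresses directly:

    #include <cstdint>
    #include <map>
    #include <vector>

    static const std::multimap<uint32_t, uint32_t> nfd_map = {
        { 0x00E9, 0x0065 }, // é -> e
        { 0x00E9, 0x0301 }, // é -> combining acute accent
    };

    static std::vector<uint32_t> nfd_decompose(uint32_t cp) {
        const auto range = nfd_map.equal_range(cp);
        if (range.first == range.second) {
            return { cp };                    // no decomposition: keep the codepoint
        }
        std::vector<uint32_t> out;
        for (auto it = range.first; it != range.second; ++it) {
            out.push_back(it->second);
        }
        return out;
    }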
Pierrick Hymbert [Fri, 1 Mar 2024 09:08:08 +0000 (10:08 +0100)]
server: allow to override threads server pool with --threads-http (#5794)
Eve [Fri, 1 Mar 2024 08:54:53 +0000 (08:54 +0000)]
ci : add Ubuntu 22 Vulkan CI run (#5789)
Georgi Gerganov [Fri, 1 Mar 2024 07:59:43 +0000 (09:59 +0200)]
server : fix newlines in help (#5785)
AidanBeltonS [Fri, 1 Mar 2024 07:36:47 +0000 (07:36 +0000)]
[SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway
* rm extra line
* Explicitly state scaled data type
---------
Co-authored-by: Abhilash Majumder <redacted>
Xuan Son Nguyen [Thu, 29 Feb 2024 20:42:11 +0000 (21:42 +0100)]
Server: normalize naming (#5779)
* server: normalize naming
* fix spacing
Marcus Dunn [Thu, 29 Feb 2024 08:17:23 +0000 (00:17 -0800)]
llama : constified `llama_set_state_data`'s `src` (#5774)
Georgi Gerganov [Wed, 28 Feb 2024 19:44:21 +0000 (21:44 +0200)]
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
ggml-ci
Eve [Wed, 28 Feb 2024 19:33:37 +0000 (19:33 +0000)]
make portability_enumeration_ext apple only (#5757)
Georgi Gerganov [Wed, 28 Feb 2024 16:43:38 +0000 (18:43 +0200)]
llama : remove deprecated API (#5770)
ggml-ci
Georgi Gerganov [Wed, 28 Feb 2024 15:36:53 +0000 (17:36 +0200)]
awq-py : remove (#5768)
Georgi Gerganov [Wed, 28 Feb 2024 09:17:32 +0000 (11:17 +0200)]
sync : ggml
slaren [Sun, 25 Feb 2024 19:41:35 +0000 (20:41 +0100)]
add google magika inference example (ggml/748)
* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup
ggml-ci
UEXTM.com [Sat, 24 Feb 2024 16:27:36 +0000 (11:27 -0500)]
Introduce backend GUIDs (ggml/743)
* Introduce backend GUIDs
Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)
Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID
* Remove redundant functions
Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
* Add spaces to match style
Co-authored-by: slaren <redacted>
* Fix brace style to match
Co-authored-by: slaren <redacted>
* Add void to () in function signature
Co-authored-by: slaren <redacted>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends
ggml-ci
---------
Co-authored-by: slaren <redacted>
Xuan Son Nguyen [Wed, 28 Feb 2024 08:55:37 +0000 (09:55 +0100)]
server : hit Ctrl+C twice to exit (#5734)
* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Jared Van Bortel <redacted>
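A minimal sketch of the two-stage Ctrl+C behaviour; the commit uses std::atomic_flag inside the server, while this standalone version uses a volatile sig_atomic_t to keep the handler trivial:

    #include <csignal>
    #include <cstdlib>

    static volatile std::sig_atomic_t g_interrupted = 0;

    static void sigint_handler(int /*signum*/) {
        if (g_interrupted) {
            std::_Exit(130);        // second Ctrl+C: terminate immediately
        }
        g_interrupted = 1;          // first Ctrl+C: request a graceful shutdown
    }

    int main() {
        std::signal(SIGINT, sigint_handler);
        while (!g_interrupted) {
            // server loop: accept and process requests, then re-check the flag
        }
        // graceful cleanup happens here
    }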
compilade [Wed, 28 Feb 2024 08:52:56 +0000 (03:52 -0500)]
llama : fix non-quantization of expert gating tensors (#5754)
This reverts a single line from #5475
Douglas Hanley [Wed, 28 Feb 2024 08:51:11 +0000 (02:51 -0600)]
llama : improve BERT tokenization (#5740)
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Jared Van Bortel <redacted>
Daniel Bevenius [Wed, 28 Feb 2024 08:39:39 +0000 (09:39 +0100)]
readme : add link to LLaVA 1.6 models (#5758)
Signed-off-by: Daniel Bevenius <redacted>
Jorge A [Wed, 28 Feb 2024 08:39:15 +0000 (01:39 -0700)]
server : add "/chat/completions" alias for "/v1/...` (#5722)
* Add "/chat/completions" as alias for "/v1/chat/completions"
* merge to upstream master
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <redacted>
Kawrakow [Wed, 28 Feb 2024 08:37:02 +0000 (10:37 +0200)]
ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760)
* WIP: make i-quants work for QK_K = 64
* iq2_xs: attempt to fix AVX dot product for QK_K = 64
Tests pass, but I get gibberish.
* QK_K = 64 tests pass on ARM_NEON and Metal
Sadly, that does not mean it actually works.
* Make CUDA compile with QK_K = 64
Tests don't pass, plus we get misaligned access
* Q2_K: fixed bug in imatrix quantization for QK_K = 64
* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)
---------
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Tue, 27 Feb 2024 17:16:49 +0000 (19:16 +0200)]
Attempt to fix android build (#5752)
Co-authored-by: Iwan Kawrakow <redacted>
Kawrakow [Tue, 27 Feb 2024 14:34:24 +0000 (16:34 +0200)]
IQ4_XS: a 4.25 bpw quantization (#5747)
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <redacted>
Engininja2 [Tue, 27 Feb 2024 13:22:45 +0000 (07:22 -0600)]
cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)
Engininja2 [Tue, 27 Feb 2024 12:50:18 +0000 (06:50 -0600)]
ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)
Georgi Gerganov [Tue, 27 Feb 2024 12:35:51 +0000 (14:35 +0200)]
llama : fix defrag bugs + add parameter (#5735)
* llama : fix defrag bugs + enable by default
ggml-ci
* llama : add defrag_thold parameter
ggml-ci
* llama : cont
* llama : disable log message
ggml-ci
* llama : fix graph size check during defrag
le.chang [Tue, 27 Feb 2024 02:03:06 +0000 (10:03 +0800)]
Makefile: use variables for cublas (#5689)
* make: use arch variable for cublas
* fix UNAME_M
* check opt first
---------
Co-authored-by: lindeer <redacted>
Xuan Son Nguyen [Mon, 26 Feb 2024 22:15:48 +0000 (23:15 +0100)]
fix server hangs on empty prompt (#5733)
Kawrakow [Mon, 26 Feb 2024 16:28:38 +0000 (18:28 +0200)]
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Mon, 26 Feb 2024 14:36:38 +0000 (15:36 +0100)]
CUDA: fix DEBUG_CUDA_MALLOC (#5729)
Artem [Mon, 26 Feb 2024 14:15:28 +0000 (17:15 +0300)]
readme : update ui list (#5731)
* Add LLMFarm (ui for iOS) to list
AidanBeltonS [Mon, 26 Feb 2024 14:02:11 +0000 (14:02 +0000)]
[SYCL] Add support for soft_max ALiBi (#5639)
* Add support for bias
* Update pre-processor
* rm commented code
* fix format
* fix CI
---------
Co-authored-by: Abhilash Majumder <redacted>
Georgi Gerganov [Mon, 26 Feb 2024 12:02:12 +0000 (14:02 +0200)]
unicode : reuse iterator (#5726)
Pierrick Hymbert [Mon, 26 Feb 2024 10:41:34 +0000 (11:41 +0100)]
server: CI fix trailing space (#5728)
Pierrick Hymbert [Mon, 26 Feb 2024 08:56:10 +0000 (09:56 +0100)]
server: CI tests reduce build matrix (#5725)
Georgi Gerganov [Mon, 26 Feb 2024 06:30:17 +0000 (08:30 +0200)]
llama : fix Gemma rope type (#5691)
github-actions[bot] [Sun, 25 Feb 2024 00:17:11 +0000 (00:17 +0000)]
flake.lock: Update
Flake lock file updates:
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
  → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
Pierrick Hymbert [Sun, 25 Feb 2024 21:48:33 +0000 (22:48 +0100)]
server: tests - slow inference causes timeout on the CI (#5715)
* server: tests - longer inference timeout for CI
Pierrick Hymbert [Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)]
server: docs - refresh and tease a little bit more the http server (#5718)
* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <redacted>
* Update README.md
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sun, 25 Feb 2024 20:12:24 +0000 (22:12 +0200)]
llama : refactor k-shift implementation + KV defragmentation (#5691)
* llama : refactor k-shift implementation
ggml-ci
* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
* llama : cont k-shift refactoring + normalize type names
ggml-ci
* minor : fix MPI builds
* llama : reuse n_rot from the build context
ggml-ci
* llama : revert enum name changes from this PR
ggml-ci
* llama : update llama_rope_type
* llama : add comment about rope values
* llama : fix build
* passkey : apply kv cache updates explicitly
ggml-ci
* llama : change name to llama_kv_cache_update()
* llama : add llama_kv_cache_seq_pos_max()
* passkey : fix llama_kv_cache_seq_pos_max() usage
* llama : some llama_kv_cell simplifications
* llama : add llama_kv_cache_compress (EXPERIMENTAL)
* llama : add alternative KV cache merging (EXPERIMENTAL)
* llama : add llama_kv_cache_defrag
* llama : comments
* llama : remove llama_kv_cache_compress
will add in a separate PR
ggml-ci
* llama : defragment via non-overlapping moves
* llama : ggml_graph based defrag implementation
ggml-ci
* llama : switch the loop order in build_defrag
* llama : add comments
compilade [Sun, 25 Feb 2024 18:43:50 +0000 (13:43 -0500)]
server : fix crash when system prompt is bigger than batch size (#5714)
The system prompt is now decoded in batches.
* server : fix off-by-one n_past when start of prompt matches whole cache
The tokens right after the matching part would otherwise skip a pos value.
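A sketch of decoding a long prompt in n_batch-sized chunks, the approach the fix above describes; decode_chunk is a hypothetical stand-in for the llama_decode call the server makes:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // feed the prompt to the model at most n_batch tokens at a time
    static bool decode_prompt_batched(const std::vector<int32_t> & tokens, uint32_t n_batch,
                                      bool (*decode_chunk)(const int32_t * tok, uint32_t n, uint32_t pos)) {
        for (uint32_t i = 0; i < tokens.size(); i += n_batch) {
            const uint32_t n = std::min<uint32_t>(n_batch, (uint32_t) tokens.size() - i);
            if (!decode_chunk(tokens.data() + i, n, /*pos=*/i)) {
                return false;
            }
        }
        return true;
    }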
Radosław Gryta [Sun, 25 Feb 2024 18:43:00 +0000 (19:43 +0100)]
ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility
vqtbl1q_u8 is not part of arm v7 neon library
* [android-example] Remove abi filter after arm v7a fix
* [github-workflows] Do not skip Android armeabi-v7a build
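For context, one way such a compatibility shim can be written on ARMv7 is to split the 128-bit table across vtbl2_u8, which, like vqtbl1q_u8, yields 0 for out-of-range indices; whether ggml-quants uses exactly this construction is an assumption, and the name below is illustrative:

    #include <arm_neon.h>

    // ARMv7 fallback for the AArch64-only vqtbl1q_u8 intrinsic
    static inline uint8x16_t vqtbl1q_u8_compat(uint8x16_t t, uint8x16_t idx) {
        const uint8x8x2_t tab = { { vget_low_u8(t), vget_high_u8(t) } };
        return vcombine_u8(vtbl2_u8(tab, vget_low_u8(idx)),
                           vtbl2_u8(tab, vget_high_u8(idx)));
    }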
kwin1412 [Sun, 25 Feb 2024 16:46:49 +0000 (00:46 +0800)]
make : fix nvcc version is empty (#5713)
Fix handling of an empty nvcc version string.
Ashok Gelal [Sun, 25 Feb 2024 15:57:34 +0000 (10:57 -0500)]
readme : add Msty to UI list (#5618)
Pierrick Hymbert [Sun, 25 Feb 2024 12:50:32 +0000 (13:50 +0100)]
server: logs - unified format and --log-format option (#5700)
* server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs: ensure json values do not raise errors
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
Co-authored-by: Georgi Gerganov <redacted>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
---------
Co-authored-by: Georgi Gerganov <redacted>
Pierrick Hymbert [Sun, 25 Feb 2024 12:49:43 +0000 (13:49 +0100)]
server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue: when 2 tasks are waiting for results, only one caller thread is notified
* server: metrics - move to a dedicated struct
Radosław Gryta [Sun, 25 Feb 2024 10:53:11 +0000 (11:53 +0100)]
cmake : fix compilation for Android armeabi-v7a (#5702)
Georgi Gerganov [Sun, 25 Feb 2024 10:09:09 +0000 (12:09 +0200)]
code : normalize enum names (#5697)
* code : normalize enum names
ggml-ci
* code : cont
* code : cont
Anas Ahouzi [Sun, 25 Feb 2024 09:54:04 +0000 (10:54 +0100)]
py : fix StableLM conversion after config.json changes (#5703)
* Fix issues during StableLM models conversion
* Fix hard coded layer_norm_eps
* Support layer_norm_eps for LlavaStableLM
Co-authored-by: Jared Van Bortel <redacted>
* Add missing parenthesis
Co-authored-by: Jared Van Bortel <redacted>
* Support rotary_factor for LlavaStableLM
Co-authored-by: Jared Van Bortel <redacted>
* fix typo
* Add StableLMEpochForCausalLM for safety
Co-authored-by: compilade <redacted>
* Add StableLMEpochForCausalLM for safety 2
Co-authored-by: compilade <redacted>
---------
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: compilade <redacted>
Pierrick Hymbert [Sat, 24 Feb 2024 18:16:04 +0000 (19:16 +0100)]
server: continue to update other slots on embedding concurrent request (#5699)
* server: #5655 - continue to update other slots on embedding concurrent request.
* server: tests: add multi users embeddings as fixed
* server: tests: adding OAI compatible embedding concurrent endpoint
* server: tests: adding OAI compatible embedding with multiple inputs
Kawrakow [Sat, 24 Feb 2024 14:23:52 +0000 (16:23 +0200)]
IQ3_S: a much better alternative to Q3_K (#5676)
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasant surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?
---------
Co-authored-by: Iwan Kawrakow <redacted>
Pierrick Hymbert [Sat, 24 Feb 2024 11:28:55 +0000 (12:28 +0100)]
server: init functional tests (#5566)
* server: tests: init scenarios
- health and slots endpoints
- completion endpoint
- OAI compatible chat completion requests w/ and without streaming
- completion multi users scenario
- multi users scenario on OAI compatible endpoint with streaming
- multi users with total number of tokens to predict exceeds the KV Cache size
- server wrong usage scenario, like in Infinite loop of "context shift" #3969
- slots shifting
- continuous batching
- embeddings endpoint
- multi users embedding endpoint: Segmentation fault #5655
- OpenAI-compatible embeddings API
- tokenize endpoint
- CORS and api key scenario
* server: CI GitHub workflow
---------
Co-authored-by: Georgi Gerganov <redacted>
AlpinDale [Fri, 23 Feb 2024 19:31:54 +0000 (19:31 +0000)]
server : add KV cache quantization options (#5684)
Jared Van Bortel [Fri, 23 Feb 2024 18:39:14 +0000 (13:39 -0500)]
convert : fix missing ftype for gemma (#5690)
Jared Van Bortel [Thu, 22 Feb 2024 22:05:23 +0000 (17:05 -0500)]
mpt : do not duplicate token_embd.weight on disk (#5670)
Georgi Gerganov [Thu, 22 Feb 2024 21:23:46 +0000 (23:23 +0200)]
gemma : use more bits for the token_embd.weight tensor (#5650)
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
Georgi Gerganov [Thu, 22 Feb 2024 21:22:48 +0000 (23:22 +0200)]
py : add Gemma conversion from HF models (#5647)
* py : add gemma conversion from HF models
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py
Co-authored-by: Jared Van Bortel <redacted>
---------
Co-authored-by: Aarni Koskela <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Georgi Gerganov [Thu, 22 Feb 2024 21:21:39 +0000 (23:21 +0200)]
ggml : always define ggml_fp16_t as uint16_t (#5666)
* ggml : always define ggml_fp16_t as uint16_t
ggml-ci
* ggml : cont
ggml-ci
* ggml : cont
* ggml : cont
ggml-ci
* ggml : cont
ggml-ci
* cuda : no longer ggml headers last
ggml-ci
* ggml : fix q6_K FP16 -> FP32 conversion
ggml-ci
* ggml : more FP16 -> FP32 conversion fixes
ggml-ci
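As a reference for what a uint16_t-backed fp16 type implies, here is a minimal, self-contained fp16 to fp32 bit conversion; ggml itself may use lookup tables or hardware conversions where available, so treat this only as an illustration of the bit layout, with hypothetical names:

    #include <cstdint>
    #include <cstring>

    typedef uint16_t my_fp16_t;   // IEEE half stored as a plain 16-bit integer

    static float my_fp16_to_fp32(my_fp16_t h) {
        const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
        const uint32_t exp  = (h >> 10) & 0x1f;
        const uint32_t mant = h & 0x3ff;
        uint32_t bits;
        if (exp == 0 && mant == 0) {
            bits = sign;                                  // +/- zero
        } else if (exp == 0x1f) {
            bits = sign | 0x7f800000u | (mant << 13);     // inf / NaN
        } else if (exp == 0) {
            // subnormal fp16: renormalize into an fp32 normal number
            int e = -1;
            uint32_t m = mant;
            do { e++; m <<= 1; } while ((m & 0x400) == 0);
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | ((m & 0x3ff) << 13);
        } else {
            bits = sign | ((exp + 112) << 23) | (mant << 13); // normal numbers
        }
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }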
Georgi Gerganov [Thu, 22 Feb 2024 21:21:05 +0000 (23:21 +0200)]
sync : ggml
Georgi Gerganov [Thu, 22 Feb 2024 16:31:40 +0000 (18:31 +0200)]
ggml : 32-bit arm compat (whisper/1891)
* ggml : 32-bit arm compat
* ggml : add ggml_vqtbl1q_s8 impl
* ggml : cont
Someone [Thu, 22 Feb 2024 19:44:10 +0000 (19:44 +0000)]
nix: init singularity and docker images (#5056)
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
Georgi Gerganov [Thu, 22 Feb 2024 18:13:25 +0000 (20:13 +0200)]
py : minor fixes (#5668)
Xuan Son Nguyen [Thu, 22 Feb 2024 18:10:21 +0000 (19:10 +0100)]
Add Gemma chat template (#5665)
* add gemma chat template
* gemma: only apply system_prompt on non-model message
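A sketch of applying a Gemma-style chat template as the commit describes: the turn markers follow Gemma's published prompt format, the assistant role is rendered as "model", and the system prompt is folded into the first non-model turn (function and struct names are illustrative, and the blank-line separator is an assumption):

    #include <string>
    #include <vector>

    struct chat_msg { std::string role, content; };

    static std::string apply_gemma_template(const std::vector<chat_msg> & msgs) {
        std::string out, system;
        for (const auto & m : msgs) {
            if (m.role == "system") { system += m.content; continue; }
            const std::string role = (m.role == "assistant") ? std::string("model") : m.role;
            out += "<start_of_turn>" + role + "\n";
            if (!system.empty() && role != "model") {     // system prompt only on non-model turns
                out += system + "\n\n";
                system.clear();
            }
            out += m.content + "<end_of_turn>\n";
        }
        out += "<start_of_turn>model\n";                  // prompt the model to reply
        return out;
    }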
Someone [Thu, 22 Feb 2024 16:32:09 +0000 (16:32 +0000)]
workflows: nix: hardcode cachix ids, build unconditionally (#5663)
The fact that GitHub does not expose environment and repository variables to PRs coming from forks means that we have effectively been disabling the Nix CI actions for most PRs.
The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
Georgi Gerganov [Thu, 22 Feb 2024 11:54:03 +0000 (13:54 +0200)]
minor : fix trailing whitespace (#5638)
Georgi Gerganov [Thu, 22 Feb 2024 08:35:54 +0000 (10:35 +0200)]
readme : update hot topics
Xuan Son Nguyen [Thu, 22 Feb 2024 08:33:24 +0000 (09:33 +0100)]
server : fallback to chatml, add AlphaMonarch chat template (#5628)
* server: fallback to chatml
* add new chat template
* server: add AlphaMonarch to test chat template
* server: only check model template if there is no custom tmpl
* remove TODO
Alexey Parfenov [Thu, 22 Feb 2024 08:27:32 +0000 (08:27 +0000)]
server : clarify some params in the docs (#5640)
Dat Quoc Nguyen [Thu, 22 Feb 2024 08:15:13 +0000 (18:15 +1000)]
mpt : add optional bias tensors (#5638)
Update MPT to support optional bias parameters, so that it works with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
slaren [Wed, 21 Feb 2024 23:42:09 +0000 (00:42 +0100)]
llama : fix loading models with shared tok_embd and output (#5651)
ggml-ci
Xuan Son Nguyen [Wed, 21 Feb 2024 23:31:00 +0000 (00:31 +0100)]
Add docs for llama_chat_apply_template (#5645)
* add docs for llama_chat_apply_template
* fix typo
slaren [Wed, 21 Feb 2024 21:52:39 +0000 (22:52 +0100)]
llama : fix session save/load with quantized KV (#5649)