git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Daniel Bevenius [Mon, 16 Sep 2024 11:07:13 +0000 (13:07 +0200)]
llama : rename n_embed to n_embd in rwkv6_time_mix (#9504)
This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix.
The motivation for this change is consistency with the other rwkv6
functions like build_rwkv6 (and other parts of the code base).
Michael Podvitskiy [Mon, 16 Sep 2024 11:06:50 +0000 (13:06 +0200)]
ggml : link MATH_LIBRARY not by its full path (#9339)
compilade [Mon, 16 Sep 2024 07:30:22 +0000 (03:30 -0400)]
convert : identify missing model files (#9397)
Georgi Gerganov [Mon, 16 Sep 2024 07:27:50 +0000 (10:27 +0300)]
cmake : do not hide GGML options + rename option (#9465)
* cmake : do not hide GGML options
ggml-ci
* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS
for consistency
ggml-ci
Eve [Mon, 16 Sep 2024 06:48:24 +0000 (06:48 +0000)]
ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)
* squashed
re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049
have ggml_vec_dot_q4_0 do two blocks per loop for AVX (see the sketch below)
try out f16c ggml_vec_dot_iq4_nl, but it's not really faster; as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue
* shuffle
* remove f16c iq4_nl as I can't make it faster than before
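A minimal sketch of the "two blocks per loop" idea from this change (illustrative only, not the actual ggml kernel): two independent accumulators let the CPU overlap the work of adjacent blocks.

```cpp
// Illustrative scalar version of processing two blocks per iteration;
// the real kernel does the same with AVX intrinsics on quantized blocks.
float dot_two_blocks_per_iter(int nb, int qk, const float * x, const float * y) {
    float acc0 = 0.0f, acc1 = 0.0f;
    int i = 0;
    for (; i + 1 < nb; i += 2) {            // two blocks per iteration
        for (int j = 0; j < qk; ++j) {
            acc0 += x[(i + 0)*qk + j] * y[(i + 0)*qk + j];
            acc1 += x[(i + 1)*qk + j] * y[(i + 1)*qk + j];
        }
    }
    for (; i < nb; ++i) {                   // tail for an odd block count
        for (int j = 0; j < qk; ++j) {
            acc0 += x[i*qk + j] * y[i*qk + j];
        }
    }
    return acc0 + acc1;
}
```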
Shane A [Mon, 16 Sep 2024 06:47:37 +0000 (23:47 -0700)]
llama : support OLMoE (#9462)
CarryFun [Mon, 16 Sep 2024 06:45:20 +0000 (14:45 +0800)]
llama : support MiniCPM3 (#9322)
Co-authored-by: 范睿凯 <redacted>
Vinesh Janarthanan [Mon, 16 Sep 2024 06:20:01 +0000 (01:20 -0500)]
main : option to disable context shift (#9484)
* added cli arg to disable context shift
* reverted precommit
* updated README.md for main
* white space
* allow disabling context shift in the server
* Update common/arg.cpp
no-context-shift only works for main example
Co-authored-by: Georgi Gerganov <redacted>
* added server example to --no-context-shift args
* removed server changes
* white space
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Mon, 16 Sep 2024 06:05:56 +0000 (09:05 +0300)]
metal : handle zero-sized allocs (#9466)
Georgi Gerganov [Mon, 16 Sep 2024 02:14:23 +0000 (05:14 +0300)]
flake.lock: Update (#9488)
Georgi Gerganov [Sun, 15 Sep 2024 17:46:12 +0000 (20:46 +0300)]
common : reimplement logging (#9418)
https://github.com/ggerganov/llama.cpp/pull/9418
slaren [Sun, 15 Sep 2024 17:02:27 +0000 (19:02 +0200)]
gguf-split : add basic checks (#9499)
* gguf-split : do not overwrite existing files when merging
* gguf-split : error when too many arguments are passed
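A rough sketch of what the two checks amount to (function name and argument layout are assumptions, not the actual gguf-split code):

```cpp
#include <cstdio>
#include <filesystem>
#include <string>

// Hypothetical helper illustrating the two checks above.
static bool check_merge_args(int argc, const std::string & output_path) {
    if (argc > 3) { // expected: program name + input + output (assumed layout)
        fprintf(stderr, "error: too many arguments\n");
        return false;
    }
    if (std::filesystem::exists(output_path)) {
        fprintf(stderr, "error: %s already exists, not overwriting\n", output_path.c_str());
        return false;
    }
    return true;
}
```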
Michael Podvitskiy [Sun, 15 Sep 2024 16:55:52 +0000 (18:55 +0200)]
cmake : correct order of sycl flags (#9497)
Csaba Kecskemeti [Sun, 15 Sep 2024 07:48:25 +0000 (00:48 -0700)]
py : add "LLaMAForCausalLM" conversion support (#9485)
Co-authored-by: Csaba Kecskemeti <redacted>
OSecret [Sun, 15 Sep 2024 07:36:53 +0000 (10:36 +0300)]
readme : update tools list (#9475)
* Added link to proprietary wrapper for Unity3d into README.md
Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).
* Update README.md
Fixes upon review
Michael Podvitskiy [Sun, 15 Sep 2024 07:06:38 +0000 (09:06 +0200)]
cmake : try to fix sycl+intel build (#9487)
Yuri Khrustalev [Sat, 14 Sep 2024 09:54:37 +0000 (05:54 -0400)]
ggml : ggml_type_name return "NONE" for invalid values (#9458)
When running on Windows, the quantization utility attempts to print types that are not set, which leads to a crash.
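A minimal sketch of the fix (table contents illustrative, not the real ggml type table): out-of-range or unset values map to "NONE" instead of indexing past the table.

```cpp
const char * type_name_sketch(int type) {
    static const char * names[] = { "F32", "F16", "Q4_0", "Q4_1" /* ... */ };
    const int n = (int) (sizeof(names)/sizeof(names[0]));
    if (type < 0 || type >= n || names[type] == nullptr) {
        return "NONE"; // invalid value: never dereference outside the table
    }
    return names[type];
}
```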
VoidIsVoid [Sat, 14 Sep 2024 09:36:44 +0000 (17:36 +0800)]
server: add data: [DONE] to /chat/completions stream response (#9459)
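OpenAI-style clients stop reading the server-sent-event stream when they see a literal `[DONE]` sentinel after the last JSON chunk; a sketch of the termination this change adds (the helper name is hypothetical):

```cpp
#include <ostream>

void end_chat_completion_stream(std::ostream & out) {
    out << "data: [DONE]\n\n"; // SSE event terminating the stream
    out.flush();
}
```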
Georgi Gerganov [Sat, 14 Sep 2024 07:55:05 +0000 (10:55 +0300)]
cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)
* cmake : use list(APPEND ...) instead of set() + dedup linker
ggml-ci
* cmake : try fix sycl
* cmake : try to fix sycl 2
* cmake : fix sycl build (#9469)
* try fix sycl build
* use CMAKE_CXX_FLAGS as a string variable
---------
Co-authored-by: Georgi Gerganov <redacted>
* one more CMAKE_CXX_FLAGS fix (#9471)
---------
Co-authored-by: Michael Podvitskiy <redacted>
Daniel Bevenius [Sat, 14 Sep 2024 07:50:12 +0000 (09:50 +0200)]
llama : make cell_id const in inp_s_mask block (#9470)
This commit makes the cell_id variable const in the inp_s_mask block.
The motivation for this change is consistency with the code in the
inp_s_copy block.
Xuan Son Nguyen [Fri, 13 Sep 2024 12:23:11 +0000 (14:23 +0200)]
server : add loading html page while model is loading (#9468)
* Adding loading page for '/' server requests
* set content when model is loading
* removed loading html file
* updated cmakelist
* updated makefile
* cleaned up whitespace
* cleanup for PR removed error
* updated server test to handle 503 HTML
* catch 503 before parsing json
* revert test
* account for both api and web browser requests (sketched below)
* precommit corrections
* eol fix
* revert changes to pre-commit
* removed print statement
* made loading message more descriptive
* also support .html files
---------
Co-authored-by: VJHack <redacted>
Co-authored-by: Vinesh Janarthanan <redacted>
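A sketch of the behavior this PR describes (types and names are hypothetical, not the actual server code): while the model loads, browsers get an HTML loading page and API clients get a JSON 503.

```cpp
#include <string>

struct response { int status; std::string content_type; std::string body; };

response respond_while_loading(const std::string & accept_header) {
    if (accept_header.find("text/html") != std::string::npos) {
        return {503, "text/html", "<html><body>Loading model...</body></html>"};
    }
    return {503, "application/json",
            R"({"error":{"code":503,"message":"Loading model"}})"};
}
```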
Georgi Gerganov [Fri, 13 Sep 2024 06:53:38 +0000 (09:53 +0300)]
llama : llama_perf + option to disable timings during decode (#9355)
* llama : llama_perf + option to disable timings during decode
ggml-ci
* common : add llama_arg
* Update src/llama.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
* perf : separate functions in the API
ggml-ci
* perf : safer pointer handling + naming update
ggml-ci
* minor : better local var name
* perf : abort on invalid sampler pointer
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <redacted>
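Hedged usage sketch: after this PR, context and sampler timings are printed through separate llama_perf_* calls (names as introduced here; check the llama.h of your version before relying on them).

```cpp
#include "llama.h"

void print_timings(llama_context * ctx, llama_sampler * smpl) {
    llama_perf_context_print(ctx);  // decode timings; can be disabled per context
    llama_perf_sampler_print(smpl); // timings for the sampler chain
}
```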
Gilad S. [Fri, 13 Sep 2024 01:54:49 +0000 (04:54 +0300)]
feat: remove a sampler from a chain (#9445)
* feat: remove a sampler from a chain
* fix: return removed sampler
* fix: safer casting
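Hedged sketch of the new chain operation (signature as added by this PR; verify against your llama.h): removing a sampler from a chain returns it and hands ownership back to the caller.

```cpp
#include "llama.h"

void drop_first_sampler(llama_sampler * chain) {
    llama_sampler * removed = llama_sampler_chain_remove(chain, 0);
    llama_sampler_free(removed); // the chain no longer owns it
}
```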
Mathijs Henquet [Thu, 12 Sep 2024 20:30:11 +0000 (22:30 +0200)]
server : Add option to return token pieces in /tokenize endpoint (#9108)
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fixes windows ci?
---------
Co-authored-by: Xuan Son Nguyen <redacted>
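The UTF-8 case above matters because a single token's piece can end mid-character; a minimal validity check like the sketch below (it ignores overlong encodings and surrogates, so it is the decision rule only) decides whether a piece can be sent as a JSON string or must fall back to an array of byte values.

```cpp
#include <string>

bool is_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = (unsigned char) s[i];
        size_t len = c < 0x80 ? 1              // ASCII
                   : (c >> 5) == 0x6  ? 2      // 110xxxxx
                   : (c >> 4) == 0xE  ? 3      // 1110xxxx
                   : (c >> 3) == 0x1E ? 4 : 0; // 11110xxx, else invalid lead
        if (len == 0 || i + len > s.size()) return false;
        for (size_t j = 1; j < len; ++j) {
            if (((unsigned char) s[i + j] >> 6) != 0x2) return false; // 10xxxxxx
        }
        i += len;
    }
    return true;
}
```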
Dou Xinpeng [Thu, 12 Sep 2024 11:46:43 +0000 (19:46 +0800)]
cann: Add host buffer type for Ascend NPU (#9406)
* feat: Add host buffer type for Ascend NPU(CANN backend)
* fix some checking errors
* Add a few comments
fengerhu1 [Thu, 12 Sep 2024 11:34:22 +0000 (19:34 +0800)]
llava : fix the script error in MobileVLM README (#9054)
Signed-off-by: Erhu Feng <redacted>
Xuan Son Nguyen [Thu, 12 Sep 2024 11:33:57 +0000 (13:33 +0200)]
lora : raise error if lm_head is ignored (#9103)
* lora : raise error if lm_head is ignored
* fix style
* clarify comment
Michael Podvitskiy [Thu, 12 Sep 2024 11:30:01 +0000 (13:30 +0200)]
cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338)
* `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC`
* Update CMakeLists.txt, spaces fix
Huang Qi [Thu, 12 Sep 2024 11:28:43 +0000 (19:28 +0800)]
ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329)
daminho [Thu, 12 Sep 2024 11:28:20 +0000 (20:28 +0900)]
py : add Phi-1.5/Phi-2 tokenizer (#9361)
* add phi2 tokenizer
* add phi name to convert_hf_to_gguf_update.py
* make tokenizer_pre consistent; make llama.cpp work
Trivikram Kamat [Thu, 12 Sep 2024 11:27:45 +0000 (04:27 -0700)]
ci : bump actions/checkout to v4 (#9377)
Michael Podvitskiy [Thu, 12 Sep 2024 11:27:14 +0000 (13:27 +0200)]
cmake : fixed the order of linking libraries for llama-quantize (#9450)
Molly Sophia [Thu, 12 Sep 2024 11:25:16 +0000 (19:25 +0800)]
py : add special tokens in hf_converter for RWKV v6 (#9428)
Signed-off-by: Molly Sophia <redacted>
Ahmad Tameem [Thu, 12 Sep 2024 11:24:31 +0000 (16:24 +0500)]
riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)
- Added ggml_cpu_has_riscv_v() in GGML to print system info in the log
- Modified Makefile to only use the flag when cross-compiling for RISC-V
Georgi Gerganov [Thu, 12 Sep 2024 11:23:49 +0000 (14:23 +0300)]
ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)
* ggml : hide ggml_object, ggml_cgraph, ggml_hash_set
ggml-ci
* ggml : add ggml-impl.h to backends
* ggml : fix compiler warnings
ggml-ci
* ggml : add assert upon adding nodes
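Hedged sketch of the consequence for callers: with ggml_cgraph now opaque, code goes through accessor functions instead of struct fields (accessor names as introduced around this change; verify against your ggml.h).

```cpp
#include <cstdio>
#include "ggml.h"

void dump_graph_nodes(struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        printf("node %3d: %s\n", i, ggml_get_name(node));
    }
}
```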
Neo Zhang Jianyu [Thu, 12 Sep 2024 09:44:17 +0000 (17:44 +0800)]
enhance run script to make it easy to change the parameters (#9448)
Co-authored-by: arthw <redacted>
Xinpeng Dou [Thu, 12 Sep 2024 01:02:35 +0000 (09:02 +0800)]
cann: Fix error when running a non-existent op (#9424)
Faisal Zaghloul [Thu, 12 Sep 2024 00:29:53 +0000 (20:29 -0400)]
Add Jais to list of supported models (#9439)
Co-authored-by: fmz <redacted>
slaren [Wed, 11 Sep 2024 15:52:13 +0000 (17:52 +0200)]
llama : skip token bounds check when evaluating embeddings (#9437)
Pavel Zloi [Wed, 11 Sep 2024 12:29:51 +0000 (15:29 +0300)]
py : support converting local models (#7547)
* Support for converting local models added to convert-hf-to-gguf-update.py
* Description fixed
* shutil added to imports
Xuan Son Nguyen [Wed, 11 Sep 2024 10:59:13 +0000 (12:59 +0200)]
llava : correct args for minicpmv-cli (#9429)
Xuan Son Nguyen [Wed, 11 Sep 2024 10:02:09 +0000 (12:02 +0200)]
files : remove accidentally added `lora_test` submodule (#9430)
Farbod Bijary [Wed, 11 Sep 2024 09:22:37 +0000 (12:52 +0330)]
feat: Implements retrying logic for downloading models using --model-url flag (#9255)
* feat: Implements retrying logic for downloading models using --model-url flag
* Update common/common.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
* Update common/common.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
* apply comments
* implements a retry function to avoid duplication (see the sketch below)
* fix editorconfig
* change function name
---------
Co-authored-by: farbod <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
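Hedged sketch of the shared retry helper described above (name, attempt count, and backoff are illustrative; the real common.cpp logic differs): wrap the download attempt and back off between tries.

```cpp
#include <chrono>
#include <functional>
#include <thread>

bool with_retries(const std::function<bool()> & try_download, int max_attempts = 3) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_download()) return true;
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(std::chrono::seconds(2 * attempt));
        }
    }
    return false;
}
```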
Johannes Gäßler [Wed, 11 Sep 2024 08:22:40 +0000 (10:22 +0200)]
CUDA: fix --split-mode row race condition (#9413)
Georgi Gerganov [Wed, 11 Sep 2024 07:03:54 +0000 (10:03 +0300)]
batched-bench : remove unused code (#9305)
R0CKSTAR [Wed, 11 Sep 2024 01:46:55 +0000 (09:46 +0800)]
musa: remove Clang builtins mapping (#9421)
Signed-off-by: Xiaodong Ye <redacted>
Alberto Cabrera Pérez [Wed, 11 Sep 2024 00:53:42 +0000 (01:53 +0100)]
sycl : update support conditions (#9394)
* sycl : update support condition to im2col
Signed-off-by: Alberto Cabrera <redacted>
* Added TODO to remind supporting FP32 im2col
---------
Signed-off-by: Alberto Cabrera <redacted>
Georgi Gerganov [Tue, 10 Sep 2024 22:46:59 +0000 (01:46 +0300)]
flake.lock: Update (#9360)
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
→ 'github:hercules-ci/flake-parts/567b938d64d4b4112ee253b9274472dc3a346eb6?narHash=sha256-%2Bebgonl3NbiKD2UD0x4BszCZQ6sTfL4xioaM49o5B3Y%3D' (2024-09-01)
• Updated input 'flake-parts/nixpkgs-lib':
'https://github.com/NixOS/nixpkgs/archive/a5d394176e64ab29c852d03346c1fc9b0b7d33eb.tar.gz?narHash=sha256-uFf2QeW7eAHlYXuDktm9c25OxOyCoUOQmh5SZ9amE5Q%3D' (2024-08-01)
→ 'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)
→ 'github:NixOS/nixpkgs/574d1eac1c200690e27b8eb4e24887f8df7ac27c?narHash=sha256-v3rIhsJBOMLR8e/RNWxr828tB%2BWywYIoajrZKFM%2B0Gg%3D' (2024-09-06)
Co-authored-by: github-actions[bot] <redacted>
Xuan Son Nguyen [Tue, 10 Sep 2024 20:41:29 +0000 (22:41 +0200)]
arg : bring back missing ifdef (#9411)
* arg : bring back missing ifdef
* replace with llama_supports_gpu_offload
matteo [Tue, 10 Sep 2024 20:40:59 +0000 (22:40 +0200)]
enable --special arg for llama-server (#9419)
Co-authored-by: matteo serva <redacted>
slaren [Tue, 10 Sep 2024 16:04:25 +0000 (18:04 +0200)]
llama : move random seed generation to the samplers (#9398)
* llama_sampler_penalties : clamp penalty_last_n to zero
Georgi Gerganov [Tue, 10 Sep 2024 07:17:03 +0000 (10:17 +0300)]
metal : fix compile warning with GGML_METAL_NDEBUG (#0)
Daniel Bevenius [Tue, 10 Sep 2024 07:03:21 +0000 (09:03 +0200)]
llama : update llm_build_copy_mask_state comment [no ci] (#9385)
This commit updates a comment in the copy_mask_state function that seems to contain a typo or to be outdated, changing the variable name n_rs to n_kv.
I believe this change is correct: what the comment wants to convey is that we copy the states that are not going to be used in the upcoming processing, i.e. the token states from n_seqs up to the number of possible token states, n_kv.
Molly Sophia [Tue, 10 Sep 2024 07:02:30 +0000 (15:02 +0800)]
RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)
Signed-off-by: Molly Sophia <redacted>
slaren [Tue, 10 Sep 2024 06:23:33 +0000 (08:23 +0200)]
make : do not run llama-gen-docs when building (#9399)
Xuan Son Nguyen [Mon, 9 Sep 2024 21:36:09 +0000 (23:36 +0200)]
common : move arg parser code to `arg.cpp` (#9388)
* common : move arg parser to arg.cpp
* better categorize args
* add cmake
* missing climits
* missing cstdarg
* common : more explicit includes
* fix build
* refactor gpt_params_parse
* update server readme
* fix test
---------
Co-authored-by: Georgi Gerganov <redacted>
Radoslav Gerganov [Mon, 9 Sep 2024 15:40:10 +0000 (18:40 +0300)]
rpc : fix segfault with nkvo (#9389)
* rpc : fix nkvo
* rpc : buf_size must not be static
ref: #9337
---------
Co-authored-by: slaren <redacted>
Prashant Vithule [Mon, 9 Sep 2024 15:37:18 +0000 (21:07 +0530)]
ggml : vector length agnostic SVE support (#9290)
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Removed WhiteSpaces
* ggml : style changes + fix 512-bit nb loop check
- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts
* Update ggml/src/ggml-quants.c
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
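Hedged sketch of the vector-length dispatch described above (kernel bodies elided; not the actual ggml-quants code): query the runtime SVE vector width once and branch to a loop specialized for it.

```cpp
#include <arm_sve.h> // requires an SVE-enabled AArch64 toolchain

void vec_dot_sve_dispatch(void) {
    switch (svcntb() * 8) {   // svcntb(): SVE vector length in bytes
        case 512: /* 512-bit loop: 4x fewer iterations than NEON */ break;
        case 256: /* 256-bit loop */                                break;
        default:  /* 128-bit loop, the SVE minimum */               break;
    }
}
```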
slaren [Mon, 9 Sep 2024 15:10:46 +0000 (17:10 +0200)]
llama : minor sampling refactor (2) (#9386)
Georgi Gerganov [Mon, 9 Sep 2024 12:51:37 +0000 (15:51 +0300)]
readme : update hot topics
Johannes Gäßler [Mon, 9 Sep 2024 12:22:53 +0000 (14:22 +0200)]
CUDA: fix variable name conflict for Windows build (#9382)
Antonis Makropoulos [Mon, 9 Sep 2024 11:21:38 +0000 (14:21 +0300)]
readme : add LLMUnity to UI projects (#9381)
* add LLMUnity to UI projects
* add newline to examples/rpc/README.md to fix editorconfig-checker unit test
Radoslav Gerganov [Mon, 9 Sep 2024 08:04:39 +0000 (11:04 +0300)]
rpc : update README [no ci] (#9320)
Update README with instructions on how to offload model layers to both local and remote devices
Dan Johansson [Mon, 9 Sep 2024 07:02:45 +0000 (09:02 +0200)]
Arm AArch64: Documentation updates (#9321)
* Arm AArch64: Documentation updates
* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels
* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats
* Add newline to the end of docs/build.md
Markus Tavenrath [Sun, 8 Sep 2024 19:43:48 +0000 (21:43 +0200)]
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (#9118)
* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
* fix compile issues
* Fix issues where the last submit wasn't executed or handled properly.
* remove trailing whitespace
* Repair GGML_VULKAN_CHECK_RESULTS
* Increase submit counter only if actual work has been submitted and increase submit count to 100.
* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
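Hedged sketch of the overlap strategy above (helper names are hypothetical, not the actual Vulkan backend): instead of recording the whole graph into one command buffer and submitting at the end, flush every N recorded ops so the GPU starts executing while the CPU keeps recording. Per the commit, the counter only advances when real work was submitted, and the threshold is 100.

```cpp
const int SUBMIT_THRESHOLD = 100;

void record_graph(int n_nodes) {
    int recorded = 0;
    for (int i = 0; i < n_nodes; ++i) {
        // record_node(cmd, i);                // encode work for node i
        if (++recorded >= SUBMIT_THRESHOLD) {
            // submit(cmd); cmd = begin_new(); // early submit overlaps GPU/CPU
            recorded = 0;                      // reset only after a real submit
        }
    }
    // submit(cmd);                            // final submit covers the tail
}
```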
Georgi Gerganov [Sun, 8 Sep 2024 19:01:02 +0000 (22:01 +0300)]
cuda : fix FA Q src index (1 -> 0) (#9374)
Xuan Son Nguyen [Sun, 8 Sep 2024 16:08:55 +0000 (18:08 +0200)]
common : bring back missing args, add env var duplication check (#9375)
* common : bring back missing args
* move duplication check to test-arg-parser
* add check for duplicated env var
* correct default values
slaren [Sun, 8 Sep 2024 14:44:42 +0000 (16:44 +0200)]
common : restore --n-gpu-layers (#9371)
slaren [Sun, 8 Sep 2024 13:52:07 +0000 (15:52 +0200)]
llama : refactor samplers internal implementation (#9370)
Neo Zhang Jianyu [Sun, 8 Sep 2024 11:05:29 +0000 (19:05 +0800)]
[SYCL] add check malloc result on device (#9346)
* add check malloc result on device
* update for review comments, check all malloc_device() result
---------
Co-authored-by: arthw <redacted>
slaren [Sun, 8 Sep 2024 10:41:51 +0000 (12:41 +0200)]
llama : sanitize tokens in the upper bound (#9359)
Xuan Son Nguyen [Sun, 8 Sep 2024 10:12:17 +0000 (12:12 +0200)]
imatrix : fix arg parser for imatrix (#9366)
* imatrix : fix arg parser
* beautify printing first arg
Georgi Gerganov [Sun, 8 Sep 2024 06:57:57 +0000 (09:57 +0300)]
metal : update support condition for im2col + fix warning (#0)
Georgi Gerganov [Sun, 8 Sep 2024 06:38:56 +0000 (09:38 +0300)]
sync : ggml
Georgi Gerganov [Sun, 8 Sep 2024 06:38:42 +0000 (09:38 +0300)]
scripts : option to increase git patch context
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:25 +0000 (14:34 +0200)]
vulkan: add dryrun support to sin and cos ops (ggml/947)
sin and cos failed test-backend-ops because they
tried to dereference a context pointer that is null
on dry runs.
This commit prevents that segfault.
Signed-off-by: Salvatore Mesoraca <redacted>
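Hedged sketch of the guard (types and names hypothetical, not the actual backend code): on a dry run there is no usable context yet, so the op does its bookkeeping and returns before dereferencing the pointer that used to segfault.

```cpp
struct vk_ctx; // opaque; null during dry runs per the commit message

void op_sin_sketch(vk_ctx * ctx, bool dryrun) {
    if (dryrun) {
        // request_pipeline(...);  // size/pipeline bookkeeping only
        return;                    // ctx may be null here: don't touch it
    }
    // ... actual dispatch through ctx ...
    (void) ctx;
}
```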
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:07 +0000 (14:34 +0200)]
vulkan: correctly report support for OP_CONT (ggml/946)
test-backend-ops fails because ggml_cont aborts when invoked with an unsupported type.
This commit makes the ggml_cont tests pass.
Signed-off-by: Salvatore Mesoraca <redacted>
Johannes Gäßler [Tue, 3 Sep 2024 15:21:46 +0000 (17:21 +0200)]
tests: add gradient tests for all backends (ggml/932)
* tests: add gradient checking to test-backend-ops
* remove old comment
* reorder includes
* adjust SIN/COS parameters
* add documentation, use supports_op if possible
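A sketch of the finite-difference side of gradient checking (the general technique; test-backend-ops' actual tolerances and batching differ): compare the backend's analytic gradient against a central difference of the forward pass.

```cpp
#include <cmath>

double numeric_grad(double (*f)(double), double x, double eps = 1e-4) {
    return (f(x + eps) - f(x - eps)) / (2.0 * eps); // central difference
}

// The check passes when the relative error stays under a per-op tolerance.
bool grad_ok(double analytic, double numeric, double tol = 1e-3) {
    return std::fabs(analytic - numeric) <= tol * std::fmax(1.0, std::fabs(numeric));
}
```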
Johannes Gäßler [Sat, 31 Aug 2024 12:35:42 +0000 (14:35 +0200)]
ggml: fix ggml_graph_cpy undefined behavior (ggml/943)
Georgi Gerganov [Wed, 28 Aug 2024 15:45:01 +0000 (18:45 +0300)]
cann : fix doxy (ggml/0)
Mengqing Cao [Fri, 9 Aug 2024 12:21:56 +0000 (20:21 +0800)]
cann : add Ascend NPU support (whisper/2336)
* enable Ascend NPU in src/whisper.cpp
* sync test-backend-ops with llama.cpp
Georgi Gerganov [Wed, 28 Aug 2024 14:08:03 +0000 (17:08 +0300)]
cuda : mark BF16 CONT as unsupported
Salvatore Mesoraca [Wed, 28 Aug 2024 08:23:02 +0000 (10:23 +0200)]
ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)
* ggml_cont: fix issue with transposed tensors when one dimension is 1
When using multiple threads, it is not enough to check that the tensors are contiguous for ggml_compute_forward_dup_same_cont to work correctly; the tensors' strides also need to match.
Signed-off-by: Salvatore Mesoraca <redacted>
* Add ggml_cont tests
Signed-off-by: Salvatore Mesoraca <redacted>
* Remove dead code
it isn't possible to reach this code because
all these functions are invoked by ggml_compute_forward_dup
if and only if src0->type != dst->type
Signed-off-by: Salvatore Mesoraca <redacted>
* Make ggml_compute_forward_dup_same_cont work with contiguous tensors
Co-authored-by: Georgi Gerganov <redacted>
Signed-off-by: Salvatore Mesoraca <redacted>
---------
Signed-off-by: Salvatore Mesoraca <redacted>
Co-authored-by: Georgi Gerganov <redacted>
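Hedged sketch of the stronger precondition described above (not the actual ggml check): the "same contiguous layout" fast path needs the shapes and byte strides of src and dst to match, not just each tensor to be contiguous on its own.

```cpp
#include "ggml.h"

static bool same_layout(const struct ggml_tensor * src, const struct ggml_tensor * dst) {
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        if (src->ne[i] != dst->ne[i] || src->nb[i] != dst->nb[i]) {
            return false; // shapes and byte strides must both match
        }
    }
    return true;
}
```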
Kevin Gibbons [Sun, 8 Sep 2024 05:51:00 +0000 (22:51 -0700)]
llama : set attrs of mislabelled EOT/EOM tokens (#9348)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:50 +0000 (00:33 +0300)]
llama.android : fix build (#9350)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:33 +0000 (00:33 +0300)]
llama : fix empty ring buffer push (#9358)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:13 +0000 (00:33 +0300)]
llama : sanitize invalid tokens (#9357)
* common : do not add null tokens during warmup
ggml-ci
* llama : check that the input tokens are valid
ggml-ci
* tests : fix batch size of bert model
ggml-ci
Eve [Sat, 7 Sep 2024 19:02:26 +0000 (19:02 +0000)]
llamafile : disable sgemm for batch-size 1 (#9330)
Xuan Son Nguyen [Sat, 7 Sep 2024 18:43:51 +0000 (20:43 +0200)]
common : refactor arg parser (#9308)
* (wip) argparser v3
* migrated
* add test
* handle env
* fix linux build
* add export-docs example
* fix build (2)
* skip build test-arg-parser on windows
* update server docs
* bring back missing --alias
* bring back --n-predict
* clarify test-arg-parser
* small correction
* add comments
* fix args with 2 values
* refine example-specific args
* no more lambda capture
Co-authored-by: slaren@users.noreply.github.com
* params.sparams
* optimize more
* export-docs --> gen-docs
slaren [Sat, 7 Sep 2024 18:23:07 +0000 (20:23 +0200)]
ggml : always check bounds on get_rows operations (#9354)
Georgi Gerganov [Sat, 7 Sep 2024 12:16:19 +0000 (15:16 +0300)]
llama : refactor sampling v2 (#9294)
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
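Hedged usage sketch of the chain API (constructor names as introduced by this PR; verify against your llama.h): individual samplers are created, added to a chain, and the chain is then used as a single llama_sampler.

```cpp
#include "llama.h"

llama_sampler * make_chain() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain; // release with llama_sampler_free
}
```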
Xuan Son Nguyen [Sat, 7 Sep 2024 10:01:34 +0000 (12:01 +0200)]
ggml : fix missing `cpu_set_t` on emscripten (#9336)
* ggml : fix missing cpu_set_t on emscripten
* better version
* bring back android part
slaren [Sat, 7 Sep 2024 07:48:54 +0000 (09:48 +0200)]
ci : disable rocm image creation (#9340)
Xuan Son Nguyen [Fri, 6 Sep 2024 21:21:29 +0000 (23:21 +0200)]
server : simplify state machine for slot (#9283)
* server : simplify state machine for slot
* add SLOT_STATE_DONE_PROMPT
* pop_deferred_task
* add missing notify_one
* fix passkey test
* metrics : add n_busy_slots_per_decode
* fix test step
* add test
* maybe fix AddressSanitizer?
* fix deque ?
* missing lock
* pop_deferred_task: also notify
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Aarni Koskela [Fri, 6 Sep 2024 21:03:01 +0000 (00:03 +0300)]
llama-bench : log benchmark progress (#9287)
* llama-bench : add optional progress messages
Aarni Koskela [Fri, 6 Sep 2024 15:59:58 +0000 (18:59 +0300)]
batched-bench : add `--output-format jsonl` option (#9293)
`--output-format` is modeled after `llama-bench`'s options
Changyeon Kim [Fri, 6 Sep 2024 12:54:50 +0000 (21:54 +0900)]
ggml : fix build break for the vulkan-debug (#9265)
- windows build : Ok.
- linux build : Ok.
Signed-off-by: Changyeon Kim <redacted>
Xuan Son Nguyen [Fri, 6 Sep 2024 12:06:04 +0000 (14:06 +0200)]
server : fix missing lock (#9334)
Markus Tavenrath [Fri, 6 Sep 2024 06:56:17 +0000 (08:56 +0200)]
Improve Vulkan shader build system (#9239)
* Improve Vulkan shader build system
- Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
- Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools
* remove not required self dependency
compilade [Fri, 6 Sep 2024 01:48:47 +0000 (21:48 -0400)]
ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151)
* ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
* ggml-quants : faster 1.625 bpw AVX2 vec_dot
Not using a lookup table anymore makes it match q4_0 speed.
* gguf-py : fix formatting
* llama : remove spaces on empty line
* ggml-quants : subtract 1 when back in epi8
This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
* ggml-quants : Q2_2 now faster than Q4_K with AVX2
* ggml-quants : cleanup Q1_3 code formatting
* ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
* ggml-quants : use ceiling division when quantizing q1_3
* convert-hf : simplify BitNet pre-quantization
This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.
* convert-hf : allow converting the weird BitNet 1.3B
Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
* bitnet : replace 1.58b with b1.58, as in the paper
* ggml-quants : fix build failure on Windows
* ggml-quants : attempt to fix Arm 32-bit support
* ggml : add some informative comments in q1_3 vec_dot
* ggml : add TQ1_0 and TQ2_0 ternary quantization types
* ggml : even faster TQ2_0
* ggml : also faster TQ1_0
Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.
* ggml : fix build issues in certain environments
* ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
* ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.
* ggml : remove q1_3 and q2_2
No more 1.625 bpw and 2.000 bpw,
now instead using 1.6875 bpw and 2.0625 bpw
with TQ1_0 and TQ2_0, respectively.
* llama : remove the separate scale tensors of BitNet b1.58
They won't be needed, since the remaining ternary quant types have
built-in scales.
* ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
* ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.
* ggml-quants : remove comment about possible format change of TQ2_0
Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.
* gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0
This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.
* convert : allow direct conversion to TQ1_0 and TQ2_0
The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.
* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0
Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.
* ggml-quants : allow using ARM dot product instructions for TQ1_0
* ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
* ggml : remove unused ggml_mul special case
It would otherwise conflict with the more general
optimization coming with Mamba-2.
* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators
* test-backend-ops : add TQ1_0 and TQ2_0 comments for later
Not adding them uncommented yet, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT
(and Metal also doesn't handle it with GGML_OP_GET_ROWS).
Support for TQ1_0 and TQ2_0 for other backends than CPU
will be added in follow-up pull requests.
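A back-of-envelope check of the sizes quoted above (a 256-weight block with one f16 scale is an assumption taken from the usual ggml quant layout): base-3 packing fits 5 ternary digits per byte because 3^5 = 243 <= 256.

```cpp
constexpr int block     = 256;
constexpr int tq1_bytes = (block + 4) / 5 + 2; // 52 packed bytes + 2 scale = 54
constexpr int tq2_bytes = block / 4 + 2;       // 64 bytes at 2 bits/weight + 2 scale = 66
static_assert(tq1_bytes * 8 * 16 == block * 27, "TQ1_0: 1.6875 bpw = 27/16");
static_assert(tq2_bytes * 8 * 16 == block * 33, "TQ2_0: 2.0625 bpw = 33/16");
```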