git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
9 months ago  cmake : do not hide GGML options + rename option (#9465)
Georgi Gerganov [Mon, 16 Sep 2024 07:27:50 +0000 (10:27 +0300)]
cmake : do not hide GGML options + rename option (#9465)

* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci

9 months ago  ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)
Eve [Mon, 16 Sep 2024 06:48:24 +0000 (06:48 +0000)]
ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)

* squashed

readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as I can't make it faster than before

9 months ago  llama : support OLMoE (#9462)
Shane A [Mon, 16 Sep 2024 06:47:37 +0000 (23:47 -0700)]
llama : support OLMoE (#9462)

9 months ago  llama : support MiniCPM3 (#9322)
CarryFun [Mon, 16 Sep 2024 06:45:20 +0000 (14:45 +0800)]
llama : support MiniCPM3 (#9322)

Co-authored-by: 范睿凯 <redacted>
9 months ago  main : option to disable context shift (#9484)
Vinesh Janarthanan [Mon, 16 Sep 2024 06:20:01 +0000 (01:20 -0500)]
main : option to disable context shift (#9484)

* added cli arg to disable context shift

* reverted precommit

* updated README.md for main

* white space

* allow disabling context shift in the server

* Update common/arg.cpp

no-context-shift only works for main example

Co-authored-by: Georgi Gerganov <redacted>
* added server example to --no-context-shift args

* removed server changes

* white space

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  metal : handle zero-sized allocs (#9466)
Georgi Gerganov [Mon, 16 Sep 2024 06:05:56 +0000 (09:05 +0300)]
metal : handle zero-sized allocs (#9466)

9 months ago  flake.lock: Update (#9488)
Georgi Gerganov [Mon, 16 Sep 2024 02:14:23 +0000 (05:14 +0300)]
flake.lock: Update (#9488)

9 months ago  common : reimplement logging (#9418)
Georgi Gerganov [Sun, 15 Sep 2024 17:46:12 +0000 (20:46 +0300)]
common : reimplement logging (#9418)

https://github.com/ggerganov/llama.cpp/pull/9418

9 months ago  gguf-split : add basic checks (#9499)
slaren [Sun, 15 Sep 2024 17:02:27 +0000 (19:02 +0200)]
gguf-split : add basic checks (#9499)

* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed

9 months ago  cmake : correct order of sycl flags (#9497)
Michael Podvitskiy [Sun, 15 Sep 2024 16:55:52 +0000 (18:55 +0200)]
cmake : correct order of sycl flags (#9497)

9 months ago  py : add "LLaMAForCausalLM" conversion support (#9485)
Csaba Kecskemeti [Sun, 15 Sep 2024 07:48:25 +0000 (00:48 -0700)]
py : add "LLaMAForCausalLM" conversion support (#9485)

Co-authored-by: Csaba Kecskemeti <redacted>
9 months ago  readme : update tools list (#9475)
OSecret [Sun, 15 Sep 2024 07:36:53 +0000 (10:36 +0300)]
readme : update tools list (#9475)

* Added link to proprietary wrapper for Unity3d into README.md

Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).

* Update README.md

Fixes upon review

9 months ago  cmake : try to fix sycl+intel build (#9487)
Michael Podvitskiy [Sun, 15 Sep 2024 07:06:38 +0000 (09:06 +0200)]
cmake : try to fix sycl+intel build (#9487)

9 months ago  ggml : ggml_type_name return "NONE" for invalid values (#9458)
Yuri Khrustalev [Sat, 14 Sep 2024 09:54:37 +0000 (05:54 -0400)]
ggml : ggml_type_name return "NONE" for invalid values (#9458)

When running on Windows, the quantization utility attempts to print the types that are not set, which leads to a crash.

9 months ago  server: add data: [DONE] to /chat/completions stream response (#9459)
VoidIsVoid [Sat, 14 Sep 2024 09:36:44 +0000 (17:36 +0800)]
server: add data: [DONE] to /chat/completions stream response (#9459)

9 months ago  cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)
Georgi Gerganov [Sat, 14 Sep 2024 07:55:05 +0000 (10:55 +0300)]
cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)

* cmake : use list(APPEND ...) instead of set() + dedup linker

ggml-ci

* cmake : try fix sycl

* cmake : try to fix sycl 2

* cmake : fix sycl build (#9469)

* try fix sycl build

* use CMAKE_CXX_FLAGS as a string variable

---------

Co-authored-by: Georgi Gerganov <redacted>
* one more CMAKE_CXX_FLAGS fix (#9471)

---------

Co-authored-by: Michael Podvitskiy <redacted>
9 months ago  llama : make cell_id const in inp_s_mask block (#9470)
Daniel Bevenius [Sat, 14 Sep 2024 07:50:12 +0000 (09:50 +0200)]
llama : make cell_id const in inp_s_mask block (#9470)

This commit makes the cell_id variable const in the inp_s_mask block.

The motivation for this change is consistency with the code in the
inp_s_copy block.

9 months ago  server : add loading html page while model is loading (#9468)
Xuan Son Nguyen [Fri, 13 Sep 2024 12:23:11 +0000 (14:23 +0200)]
server : add loading html page while model is loading (#9468)

* Adding loading page for '/' server requests

* set content when model is loading

* removed loading html file

* updated cmakelist

* updated makefile

* cleaned up whitespace

* cleanup for PR removed error

* updated server test to handle 503 HTML

* updated server test to handle 503 HTML

* catch 503 before parsing json

* revert test

* account for both api and web browser requests

* precommit corrections

* eol fix

* revert changes to pre-commit

* removed print statement

* made loading message more descriptive

* also support .html files

---------

Co-authored-by: VJHack <redacted>
Co-authored-by: Vinesh Janarthanan <redacted>
9 months ago  llama : llama_perf + option to disable timings during decode (#9355)
Georgi Gerganov [Fri, 13 Sep 2024 06:53:38 +0000 (09:53 +0300)]
llama : llama_perf + option to disable timings during decode (#9355)

* llama : llama_perf + option to disable timings during decode

ggml-ci

* common : add llama_arg

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* perf : separate functions in the API

ggml-ci

* perf : safer pointer handling + naming update

ggml-ci

* minor : better local var name

* perf : abort on invalid sampler pointer

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  feat: remove a sampler from a chain (#9445)
Gilad S. [Fri, 13 Sep 2024 01:54:49 +0000 (04:54 +0300)]
feat: remove a sampler from a chain (#9445)

* feat: remove a sampler from a chain

* fix: return removed sampler

* fix: safer casting

9 months ago  server : Add option to return token pieces in /tokenize endpoint (#9108)
Mathijs Henquet [Thu, 12 Sep 2024 20:30:11 +0000 (22:30 +0200)]
server : Add option to return token pieces in /tokenize endpoint (#9108)

* server : added with_pieces functionality to /tokenize endpoint

* server : Add tokenize with pieces tests to server.feature

* Handle case if tokenizer splits along utf8 continuation bytes

* Add example of token splitting

* Remove trailing ws

* Fix trailing ws

* Maybe fix ci

* maybe this fixes windows ci?

---------

Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  cann: Add host buffer type for Ascend NPU (#9406)
Dou Xinpeng [Thu, 12 Sep 2024 11:46:43 +0000 (19:46 +0800)]
cann: Add host buffer type for Ascend NPU (#9406)

* feat: Add host buffer type for Ascend NPU(CANN backend)

* fix some checking errors

* Add a few comments

9 months ago  llava : fix the script error in MobileVLM README (#9054)
fengerhu1 [Thu, 12 Sep 2024 11:34:22 +0000 (19:34 +0800)]
llava : fix the script error in MobileVLM README (#9054)

Signed-off-by: Erhu Feng <redacted>
9 months ago  lora : raise error if lm_head is ignored (#9103)
Xuan Son Nguyen [Thu, 12 Sep 2024 11:33:57 +0000 (13:33 +0200)]
lora : raise error if lm_head is ignored (#9103)

* lora : raise error if lm_head is ignored

* fix style

* clarify comment

9 months ago  cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338)
Michael Podvitskiy [Thu, 12 Sep 2024 11:30:01 +0000 (13:30 +0200)]
cmake : fix for builds without `GGML_CDEF_PUBLIC` (#9338)

* `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC`

* Update CMakeLists.txt, spaces fix

9 months ago  ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329)
Huang Qi [Thu, 12 Sep 2024 11:28:43 +0000 (19:28 +0800)]
ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329)

9 months ago  py : add Phi-1.5/Phi-2 tokenizer (#9361)
daminho [Thu, 12 Sep 2024 11:28:20 +0000 (20:28 +0900)]
py : add Phi-1.5/Phi-2 tokenizer (#9361)

* add phi2 tokenizer

* add phi name to convert_hf_to_gguf_update.py

* make tokenizer_pre consistent; llama.cpp work

9 months ago  ci : bump actions/checkout to v4 (#9377)
Trivikram Kamat [Thu, 12 Sep 2024 11:27:45 +0000 (04:27 -0700)]
ci : bump actions/checkout to v4 (#9377)

9 months ago  cmake : fixed the order of linking libraries for llama-quantize (#9450)
Michael Podvitskiy [Thu, 12 Sep 2024 11:27:14 +0000 (13:27 +0200)]
cmake : fixed the order of linking libraries for llama-quantize (#9450)

9 months ago  py : add special tokens in hf_converter for RWKV v6 (#9428)
Molly Sophia [Thu, 12 Sep 2024 11:25:16 +0000 (19:25 +0800)]
py : add special tokens in hf_converter for RWKV v6 (#9428)

Signed-off-by: Molly Sophia <redacted>
9 months ago  riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)
Ahmad Tameem [Thu, 12 Sep 2024 11:24:31 +0000 (16:24 +0500)]
riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)

- Added ggml_cpu_has_riscv_v() in GGML to print system info in log
- Modified Makefile to only use flag when cross compiling for RISC-V

9 months ago  ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)
Georgi Gerganov [Thu, 12 Sep 2024 11:23:49 +0000 (14:23 +0300)]
ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)

* ggml : hide ggml_object, ggml_cgraph, ggml_hash_set

ggml-ci

* ggml : add ggml-impl.h to backends

* ggml : fix compiler warnings

ggml-ci

* ggml : add assert upon adding nodes

9 months ago  enhance run script to be easy to change the parameters (#9448)
Neo Zhang Jianyu [Thu, 12 Sep 2024 09:44:17 +0000 (17:44 +0800)]
enhance run script to be easy to change the parameters (#9448)

Co-authored-by: arthw <redacted>
9 months ago  cann: Fix error when running a non-exist op (#9424)
Xinpeng Dou [Thu, 12 Sep 2024 01:02:35 +0000 (09:02 +0800)]
cann: Fix error when running a non-exist op (#9424)

9 months ago  Add Jais to list of supported models (#9439)
Faisal Zaghloul [Thu, 12 Sep 2024 00:29:53 +0000 (20:29 -0400)]
Add Jais to list of supported models (#9439)

Co-authored-by: fmz <redacted>
9 months ago  llama : skip token bounds check when evaluating embeddings (#9437)
slaren [Wed, 11 Sep 2024 15:52:13 +0000 (17:52 +0200)]
llama : skip token bounds check when evaluating embeddings (#9437)

9 months ago  py : support converting local models (#7547)
Pavel Zloi [Wed, 11 Sep 2024 12:29:51 +0000 (15:29 +0300)]
py : support converting local models (#7547)

* Support of converting local models added to convert-hf-to-gguf-update.py

* Description fixed

* shutil added to imports

9 months ago  llava : correct args for minicpmv-cli (#9429)
Xuan Son Nguyen [Wed, 11 Sep 2024 10:59:13 +0000 (12:59 +0200)]
llava : correct args for minicpmv-cli (#9429)

9 months ago  files : remove accidentally added `lora_test` submodule (#9430)
Xuan Son Nguyen [Wed, 11 Sep 2024 10:02:09 +0000 (12:02 +0200)]
files : remove accidentally added `lora_test` submodule (#9430)

9 months ago  feat: Implements retrying logic for downloading models using --model-url flag (#9255)
Farbod Bijary [Wed, 11 Sep 2024 09:22:37 +0000 (12:52 +0330)]
feat: Implements retrying logic for downloading models using --model-url flag (#9255)

* feat: Implements retrying logic for downloading models using --model-url flag

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <redacted>
* apply comments

* implements a retry function to avoid duplication

* fix editorconfig

* change function name

---------

Co-authored-by: farbod <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: slaren <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
9 months ago  CUDA: fix --split-mode row race condition (#9413)
Johannes Gäßler [Wed, 11 Sep 2024 08:22:40 +0000 (10:22 +0200)]
CUDA: fix --split-mode row race condition (#9413)

9 months ago  batched-bench : remove unused code (#9305)
Georgi Gerganov [Wed, 11 Sep 2024 07:03:54 +0000 (10:03 +0300)]
batched-bench : remove unused code (#9305)

9 months ago  musa: remove Clang builtins mapping (#9421)
R0CKSTAR [Wed, 11 Sep 2024 01:46:55 +0000 (09:46 +0800)]
musa: remove Clang builtins mapping (#9421)

Signed-off-by: Xiaodong Ye <redacted>
9 months ago  sycl : update support conditions (#9394)
Alberto Cabrera Pérez [Wed, 11 Sep 2024 00:53:42 +0000 (01:53 +0100)]
sycl : update support conditions (#9394)

* sycl : update support condition to im2col

Signed-off-by: Alberto Cabrera <redacted>
* Added TODO to remind supporting FP32 im2col

---------

Signed-off-by: Alberto Cabrera <redacted>
9 months ago  flake.lock: Update (#9360)
Georgi Gerganov [Tue, 10 Sep 2024 22:46:59 +0000 (01:46 +0300)]
flake.lock: Update (#9360)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
  → 'github:hercules-ci/flake-parts/567b938d64d4b4112ee253b9274472dc3a346eb6?narHash=sha256-%2Bebgonl3NbiKD2UD0x4BszCZQ6sTfL4xioaM49o5B3Y%3D' (2024-09-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/a5d394176e64ab29c852d03346c1fc9b0b7d33eb.tar.gz?narHash=sha256-uFf2QeW7eAHlYXuDktm9c25OxOyCoUOQmh5SZ9amE5Q%3D' (2024-08-01)
  → 'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)
  → 'github:NixOS/nixpkgs/574d1eac1c200690e27b8eb4e24887f8df7ac27c?narHash=sha256-v3rIhsJBOMLR8e/RNWxr828tB%2BWywYIoajrZKFM%2B0Gg%3D' (2024-09-06)

Co-authored-by: github-actions[bot] <redacted>
9 months ago  arg : bring back missing ifdef (#9411)
Xuan Son Nguyen [Tue, 10 Sep 2024 20:41:29 +0000 (22:41 +0200)]
arg : bring back missing ifdef (#9411)

* arg : bring back missing ifdef

* replace with llama_supports_gpu_offload

9 months ago  enable --special arg for llama-server (#9419)
matteo [Tue, 10 Sep 2024 20:40:59 +0000 (22:40 +0200)]
enable --special arg for llama-server (#9419)

Co-authored-by: matteo serva <redacted>
9 months ago  llama : move random seed generation to the samplers (#9398)
slaren [Tue, 10 Sep 2024 16:04:25 +0000 (18:04 +0200)]
llama : move random seed generation to the samplers (#9398)

* llama_sampler_penalties : clamp penalty_last_n to zero

9 months ago  metal : fix compile warning with GGML_METAL_NDEBUG (#0)
Georgi Gerganov [Tue, 10 Sep 2024 07:17:03 +0000 (10:17 +0300)]
metal : fix compile warning with GGML_METAL_NDEBUG (#0)

9 months ago  llama : update llm_build_copy_mask_state comment [no ci] (#9385)
Daniel Bevenius [Tue, 10 Sep 2024 07:03:21 +0000 (09:03 +0200)]
llama : update llm_build_copy_mask_state comment [no ci] (#9385)

This commit updates the comment, which seems to contain a typo or be an
outdated comment, in the copy_mask_state function changing the variable
n_rs to n_kv.

I believe this change is correct and what the comment wants to
convey is to copy the states that are not going to be used in the
upcoming processing, which are the tokens states from n_seqs up to
the number of possible token states n_kv.

9 months ago  RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)
Molly Sophia [Tue, 10 Sep 2024 07:02:30 +0000 (15:02 +0800)]
RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)

Signed-off-by: Molly Sophia <redacted>
9 months ago  make : do not run llama-gen-docs when building (#9399)
slaren [Tue, 10 Sep 2024 06:23:33 +0000 (08:23 +0200)]
make : do not run llama-gen-docs when building (#9399)

9 months ago  common : move arg parser code to `arg.cpp` (#9388)
Xuan Son Nguyen [Mon, 9 Sep 2024 21:36:09 +0000 (23:36 +0200)]
common : move arg parser code to `arg.cpp` (#9388)

* common : move arg parser to arg.cpp

* better categorize args

* add cmake

* missing climits

* missing cstdarg

* common : more explicit includes

* fix build

* refactor gpt_params_parse

* update server readme

* fix test

---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  rpc : fix segfault with nkvo (#9389)
Radoslav Gerganov [Mon, 9 Sep 2024 15:40:10 +0000 (18:40 +0300)]
rpc : fix segfault with nkvo (#9389)

* rpc : fix nkvo

* rpc : buf_size must not be static

ref: #9337

---------

Co-authored-by: slaren <redacted>
9 months ago  ggml : vector length agnostic SVE support (#9290)
Prashant Vithule [Mon, 9 Sep 2024 15:37:18 +0000 (21:07 +0530)]
ggml : vector length agnostic SVE support (#9290)

* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Removed WhiteSpaces

* ggml : style changes + fix 512-bit nb loop check

- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts

* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  llama : minor sampling refactor (2) (#9386)
slaren [Mon, 9 Sep 2024 15:10:46 +0000 (17:10 +0200)]
llama : minor sampling refactor (2) (#9386)

9 months ago  readme : update hot topics
Georgi Gerganov [Mon, 9 Sep 2024 12:51:37 +0000 (15:51 +0300)]
readme : update hot topics

9 months ago  CUDA: fix variable name conflict for Windows build (#9382)
Johannes Gäßler [Mon, 9 Sep 2024 12:22:53 +0000 (14:22 +0200)]
CUDA: fix variable name conflict for Windows build (#9382)

9 months ago  readme : add LLMUnity to UI projects (#9381)
Antonis Makropoulos [Mon, 9 Sep 2024 11:21:38 +0000 (14:21 +0300)]
readme : add LLMUnity to UI projects (#9381)

* add LLMUnity to UI projects

* add newline to examples/rpc/README.md to fix editorconfig-checker unit test

9 months ago  rpc : update README [no ci] (#9320)
Radoslav Gerganov [Mon, 9 Sep 2024 08:04:39 +0000 (11:04 +0300)]
rpc : update README [no ci] (#9320)

Update README with instructions on how to offload model layers to both
local and remote devices

9 months ago  Arm AArch64: Documentation updates (#9321)
Dan Johansson [Mon, 9 Sep 2024 07:02:45 +0000 (09:02 +0200)]
Arm AArch64: Documentation updates (#9321)

* Arm AArch64: Documentation updates

* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels

* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats

* Add newline to the end of docs/build.md

9 months ago  Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting...
Markus Tavenrath [Sun, 8 Sep 2024 19:43:48 +0000 (21:43 +0200)]
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (#9118)

* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.

9 months ago  cuda : fix FA Q src index (1 -> 0) (#9374)
Georgi Gerganov [Sun, 8 Sep 2024 19:01:02 +0000 (22:01 +0300)]
cuda : fix FA Q src index (1 -> 0) (#9374)

9 months ago  common : bring back missing args, add env var duplication check (#9375)
Xuan Son Nguyen [Sun, 8 Sep 2024 16:08:55 +0000 (18:08 +0200)]
common : bring back missing args, add env var duplication check (#9375)

* common : bring back missing args

* move duplication check to test-arg-parser

* add check for duplicated env var

* correct default values

9 months ago  common : restore --n-gpu-layers (#9371)
slaren [Sun, 8 Sep 2024 14:44:42 +0000 (16:44 +0200)]
common : restore --n-gpu-layers (#9371)

9 months ago  llama : refactor samplers internal implementation (#9370)
slaren [Sun, 8 Sep 2024 13:52:07 +0000 (15:52 +0200)]
llama : refactor samplers internal implementation (#9370)

9 months ago  [SYCL] add check malloc result on device (#9346)
Neo Zhang Jianyu [Sun, 8 Sep 2024 11:05:29 +0000 (19:05 +0800)]
[SYCL] add check malloc result on device (#9346)

* add check malloc result on device

* update for review comments, check all malloc_device() result

---------

Co-authored-by: arthw <redacted>
9 months ago  llama : sanitize tokens in the upper bound (#9359)
slaren [Sun, 8 Sep 2024 10:41:51 +0000 (12:41 +0200)]
llama : sanitize tokens in the upper bound (#9359)

9 months ago  imatrix : fix arg parser for imatrix (#9366)
Xuan Son Nguyen [Sun, 8 Sep 2024 10:12:17 +0000 (12:12 +0200)]
imatrix : fix arg parser for imatrix (#9366)

* imatrix : fix arg parser

* beautify printing first arg

9 months ago  metal : update support condition for im2col + fix warning (#0)
Georgi Gerganov [Sun, 8 Sep 2024 06:57:57 +0000 (09:57 +0300)]
metal : update support condition for im2col + fix warning (#0)

9 months ago  sync : ggml
Georgi Gerganov [Sun, 8 Sep 2024 06:38:56 +0000 (09:38 +0300)]
sync : ggml

9 months ago  scripts : option to increase git patch context
Georgi Gerganov [Sun, 8 Sep 2024 06:38:42 +0000 (09:38 +0300)]
scripts : option to increase git patch context

9 months ago  vulkan: add dryrun support to sin and cos ops (ggml/947)
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:25 +0000 (14:34 +0200)]
vulkan: add dryrun support to sin and cos ops (ggml/947)

sin and cos failed test-backend-ops because they
tried to dereference a context pointer that is null
on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca <redacted>
9 months ago  vulkan: correctly report support for OP_CONT (ggml/946)
Salvatore Mesoraca [Fri, 6 Sep 2024 12:34:07 +0000 (14:34 +0200)]
vulkan: correctly report support for OP_CONT (ggml/946)

test-backend-ops fails because ggml_cont aborts
when invoked passing an unsupported type.

This commit makes ggml_cont tests pass

Signed-off-by: Salvatore Mesoraca <redacted>
9 months ago  tests: add gradient tests for all backends (ggml/932)
Johannes Gäßler [Tue, 3 Sep 2024 15:21:46 +0000 (17:21 +0200)]
tests: add gradient tests for all backends (ggml/932)

* tests: add gradient checking to test-backend-ops

* remove old comment

* reorder includes

* adjust SIN/COS parameters

* add documentation, use supports_op if possible

9 months ago  ggml: fix ggml_graph_cpy undefined behavior (ggml/943)
Johannes Gäßler [Sat, 31 Aug 2024 12:35:42 +0000 (14:35 +0200)]
ggml: fix ggml_graph_cpy undefined behavior (ggml/943)

9 months ago  cann : fix doxy (ggml/0)
Georgi Gerganov [Wed, 28 Aug 2024 15:45:01 +0000 (18:45 +0300)]
cann : fix doxy (ggml/0)

9 months ago  cann : add Ascend NPU support (whisper/2336)
Mengqing Cao [Fri, 9 Aug 2024 12:21:56 +0000 (20:21 +0800)]
cann : add Ascend NPU support (whisper/2336)

* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp

9 months ago  cuda : mark BF16 CONT as unsupported
Georgi Gerganov [Wed, 28 Aug 2024 14:08:03 +0000 (17:08 +0300)]
cuda : mark BF16 CONT as unsupported

9 months ago  ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)
Salvatore Mesoraca [Wed, 28 Aug 2024 08:23:02 +0000 (10:23 +0200)]
ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)

* ggml_cont: fix issue with transposed tensors when one dimension is 1

when using multiple threads, it is not enough
to check for the tensors to be contiguous for
ggml_compute_forward_dup_same_cont to work correctly.
The tensors strides also need to match.

Signed-off-by: Salvatore Mesoraca <redacted>
* Add ggml_cont tests

Signed-off-by: Salvatore Mesoraca <redacted>
* Remove dead code

it isn't possible to reach this code because
all these functions are invoked by ggml_compute_forward_dup
if and only if src0->type != dst->type

Signed-off-by: Salvatore Mesoraca <redacted>
* Make ggml_compute_forward_dup_same_cont work with contiguous tensors

Co-authored-by: Georgi Gerganov <redacted>
Signed-off-by: Salvatore Mesoraca <redacted>
---------

Signed-off-by: Salvatore Mesoraca <redacted>
Co-authored-by: Georgi Gerganov <redacted>
9 months ago  llama : set attrs of mislabelled EOT/EOM tokens (#9348)
Kevin Gibbons [Sun, 8 Sep 2024 05:51:00 +0000 (22:51 -0700)]
llama : set attrs of mislabelled EOT/EOM tokens (#9348)

9 months ago  llama.android : fix build (#9350)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:50 +0000 (00:33 +0300)]
llama.android : fix build (#9350)

9 months ago  llama : fix empty ring buffer push (#9358)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:33 +0000 (00:33 +0300)]
llama : fix empty ring buffer push (#9358)

9 months ago  llama : sanitize invalid tokens (#9357)
Georgi Gerganov [Sat, 7 Sep 2024 21:33:13 +0000 (00:33 +0300)]
llama : sanitize invalid tokens (#9357)

* common : do not add null tokens during warmup

ggml-ci

* llama : check that the input tokens are valid

ggml-ci

* tests : fix batch size of bert model

ggml-ci

9 months ago  llamafile : disable sgemm for batch-size 1 (#9330)
Eve [Sat, 7 Sep 2024 19:02:26 +0000 (19:02 +0000)]
llamafile : disable sgemm for batch-size 1 (#9330)

9 months ago  common : refactor arg parser (#9308)
Xuan Son Nguyen [Sat, 7 Sep 2024 18:43:51 +0000 (20:43 +0200)]
common : refactor arg parser (#9308)

* (wip) argparser v3

* migrated

* add test

* handle env

* fix linux build

* add export-docs example

* fix build (2)

* skip build test-arg-parser on windows

* update server docs

* bring back missing --alias

* bring back --n-predict

* clarify test-arg-parser

* small correction

* add comments

* fix args with 2 values

* refine example-specific args

* no more lambda capture

Co-authored-by: slaren@users.noreply.github.com
* params.sparams

* optimize more

* export-docs --> gen-docs

9 months ago  ggml : always check bounds on get_rows operations (#9354)
slaren [Sat, 7 Sep 2024 18:23:07 +0000 (20:23 +0200)]
ggml : always check bounds on get_rows operations (#9354)

9 months ago  llama : refactor sampling v2 (#9294)
Georgi Gerganov [Sat, 7 Sep 2024 12:16:19 +0000 (15:16 +0300)]
llama : refactor sampling v2 (#9294)

- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`

9 months ago  ggml : fix missing `cpu_set_t` on emscripten (#9336)
Xuan Son Nguyen [Sat, 7 Sep 2024 10:01:34 +0000 (12:01 +0200)]
ggml : fix missing `cpu_set_t` on emscripten (#9336)

* ggml : fix missing cpu_set_t on emscripten

* better version

* bring back android part

9 months ago  ci : disable rocm image creation (#9340)
slaren [Sat, 7 Sep 2024 07:48:54 +0000 (09:48 +0200)]
ci : disable rocm image creation (#9340)

9 months ago  server : simplify state machine for slot (#9283)
Xuan Son Nguyen [Fri, 6 Sep 2024 21:21:29 +0000 (23:21 +0200)]
server : simplify state machine for slot (#9283)

* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
9 months ago  llama-bench : log benchmark progress (#9287)
Aarni Koskela [Fri, 6 Sep 2024 21:03:01 +0000 (00:03 +0300)]
llama-bench : log benchmark progress (#9287)

* llama-bench : add optional progress messages

9 months ago  batched-bench : add `--output-format jsonl` option (#9293)
Aarni Koskela [Fri, 6 Sep 2024 15:59:58 +0000 (18:59 +0300)]
batched-bench : add `--output-format jsonl` option (#9293)

`--output-format` is modeled after `llama-bench`'s options

9 months ago  ggml : fix build break for the vulkan-debug (#9265)
Changyeon Kim [Fri, 6 Sep 2024 12:54:50 +0000 (21:54 +0900)]
ggml : fix build break for the vulkan-debug (#9265)

- windows build : Ok.
- linux build : Ok.

Signed-off-by: Changyeon Kim <redacted>
9 months ago  server : fix missing lock (#9334)
Xuan Son Nguyen [Fri, 6 Sep 2024 12:06:04 +0000 (14:06 +0200)]
server : fix missing lock (#9334)

9 months ago  Improve Vulkan shader build system (#9239)
Markus Tavenrath [Fri, 6 Sep 2024 06:56:17 +0000 (08:56 +0200)]
Improve Vulkan shader build system (#9239)

* Improve Vulkan shader builds system

- Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
- Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools

* remove not required self dependency

9 months ago  ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151)
compilade [Fri, 6 Sep 2024 01:48:47 +0000 (21:48 -0400)]
ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151)

* ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b

* ggml-quants : faster 1.625 bpw AVX2 vec_dot

Not using a lookup table anymore makes it match q4_0 speed.

* gguf-py : fix formatting

* llama : remove spaces on empty line

* ggml-quants : subtract 1 when back in epi8

This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.

* ggml-quants : Q2_2 now faster than Q4_K on with AVX2

* ggml-quants : cleanup Q1_3 code formatting

* ggml-quants : ARM NEON vec_dot for q2_2 and q1_3

* ggml-quants : use ceiling division when quantizing q1_3

* convert-hf : simplify BitNet pre-quantization

This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.

* convert-hf : allow converting the weird BitNet 1.3B

Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.

* bitnet : replace 1.58b with b1.58, as in the paper

* ggml-quants : fix build failure on Windows

* ggml-quants : attempt to fix Arm 32-bit support

* ggml : add some informative comments in q1_3 vec_dot

* ggml : add TQ1_0 and TQ2_0 ternary quantization types

* ggml : even faster TQ2_0

* ggml : also faster TQ1_0

Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.

* ggml : fix build issues in certain environments

* ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0

* ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat

The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.

* ggml : remove q1_3 and q2_2

No more 1.625 bpw and 2.000 bpw,
now instead using 1.6875 bpw and 2.0625 bpw
with TQ1_0 and TQ2_0, respectively.

* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.

* ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency

* ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot

Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.

* gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0

* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.

* convert : allow direct conversion to TQ1_0 and TQ2_0

The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.

* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.

* ggml-quants : allow using ARM dot product instructions for TQ1_0

* ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support

* ggml : remove unused ggml_mul special case

It would otherwise conflict with the more general
optimization coming with Mamba-2.

* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators

* test-backend-ops : add TQ1_0 and TQ2_0 comments for later

Not yet adding uncommented, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
(and Metal also doesn't handle it with GGML_OP_GET_ROWS)
Support for TQ1_0 and TQ2_0 for other backends than CPU
will be added in follow-up pull requests.

9 months ago  Update build.yml (#9184)
awatuna [Thu, 5 Sep 2024 22:34:36 +0000 (06:34 +0800)]
Update build.yml (#9184)

build rpc-server for windows cuda

9 months ago  CMake fix: host for msvc compiler can only be x86 or x64 (#8624)
Michael Podvitskiy [Thu, 5 Sep 2024 22:14:12 +0000 (00:14 +0200)]
CMake fix: host for msvc compiler can only be x86 or x64 (#8624)

9 months ago  cuda : fix defrag with quantized KV (#9319)
slaren [Thu, 5 Sep 2024 09:13:11 +0000 (11:13 +0200)]
cuda : fix defrag with quantized KV (#9319)