git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Xuan Son Nguyen [Mon, 4 Nov 2024 15:33:29 +0000 (16:33 +0100)]
server : clarify /slots endpoint, add is_processing (#10162)
* server : clarify /slots endpoint, add is_processing
* fix tests
snadampal [Mon, 4 Nov 2024 15:08:33 +0000 (09:08 -0600)]
fix build break on arm64 linux (#10166)
This fixes the build break from the recent changes
to move the CPU backend to separate files
https://github.com/ggerganov/llama.cpp/pull/10144
Diego Devesa [Mon, 4 Nov 2024 12:10:23 +0000 (13:10 +0100)]
cuda : clear error after changing peer access (#10153)
Georgi Gerganov [Mon, 4 Nov 2024 11:49:34 +0000 (13:49 +0200)]
metal : simplify f16 and f32 dequant kernels (#0)
Georgi Gerganov [Mon, 4 Nov 2024 11:43:32 +0000 (13:43 +0200)]
metal : move dequantize templates to beginning of MSL source (#0)
leo-pony [Mon, 4 Nov 2024 11:08:22 +0000 (19:08 +0800)]
CANN: adjust backend registry refactor. (#10158)
Remove buffer->iface.get_name that was used in CANN, as it was removed in the backend registry refactor PR.
Georgi Gerganov [Mon, 4 Nov 2024 08:33:37 +0000 (10:33 +0200)]
sync : ggml
Yuri Khrustalev [Sat, 2 Nov 2024 09:09:12 +0000 (05:09 -0400)]
cmake : make it possible to link ggml as an external lib (ggml/1003)
Plamen Minev [Fri, 1 Nov 2024 14:55:10 +0000 (16:55 +0200)]
metal : fix minor string leaks (ggml/1004)
Diego Devesa [Sun, 3 Nov 2024 18:34:08 +0000 (19:34 +0100)]
ggml : move CPU backend to a separate file (#10144)
Georgi Gerganov [Sun, 3 Nov 2024 13:18:40 +0000 (15:18 +0200)]
metal : minor fixup in FA kernel (#10143)
* metal : minor fixup in FA kernel
ggml-ci
* metal : use the unrolled loop variable
* metal : remove unused var
Georgi Gerganov [Sun, 3 Nov 2024 13:14:15 +0000 (15:14 +0200)]
flake.lock: Update (#10146)
Christian Köhnenkamp [Sat, 2 Nov 2024 22:35:31 +0000 (23:35 +0100)]
Add apple arm to presets (#10134)
* Add apple arm to presets
* Add final new line
sasha0552 [Sat, 2 Nov 2024 16:34:56 +0000 (16:34 +0000)]
server : fix slot selection by lru (#10126)
* server : fix slot selection by lru, migrate lcs to `size_t`
* minor debug log fix
Georgi Gerganov [Sat, 2 Nov 2024 16:34:00 +0000 (18:34 +0200)]
server : fix endpoint checks (#10135)
ggml-ci
Georgi Gerganov [Sat, 2 Nov 2024 13:18:56 +0000 (15:18 +0200)]
llama : adjust default context size + print warnings (#10136)
* llama : adjust default context size + print warnings
ggml-ci
* ggml-ci : add missing gpu-layers + adjust context sizes
Diego Devesa [Sat, 2 Nov 2024 12:08:53 +0000 (13:08 +0100)]
simple-chat : only add bos on first prompt (#10129)
Xuan Son Nguyen [Sat, 2 Nov 2024 11:53:17 +0000 (12:53 +0100)]
convert-lora : make `--base` optional (#10110)
* convert-lora : make `--base` optional
* lint
* handle case where base_model_name_or_path is invalid
* do not include metadata from base model
* clarify unspecified --base
* add small comment [no ci]
* trigger ci
Diego Devesa [Fri, 1 Nov 2024 22:50:59 +0000 (23:50 +0100)]
llama : add simple-chat example (#10124)
* llama : add simple-chat example
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Diego Devesa [Fri, 1 Nov 2024 22:48:26 +0000 (23:48 +0100)]
llama : use smart pointers for ggml resources (#10117)
Shupei Fan [Fri, 1 Nov 2024 18:33:14 +0000 (02:33 +0800)]
vulkan : improve ggml_vk_create_buffer error handling (#9898)
Georgi Gerganov [Fri, 1 Nov 2024 15:31:51 +0000 (17:31 +0200)]
readme : update hot topics
sasha0552 [Fri, 1 Nov 2024 13:33:14 +0000 (13:33 +0000)]
server : fix smart selection of available slot (#10120)
* Fix smart selection of available slot
* minor fix
* replace vectors of tokens with shorthands
Georgi Gerganov [Fri, 1 Nov 2024 10:58:45 +0000 (12:58 +0200)]
ggml : remove ggml_scratch (#10121)
ggml-ci
Georgi Gerganov [Fri, 1 Nov 2024 08:28:24 +0000 (10:28 +0200)]
sync : ggml
Georgi Gerganov [Fri, 1 Nov 2024 08:23:05 +0000 (10:23 +0200)]
ggml : alloc ggml_contexts on the heap (whisper/2525)
Zhenwei Jin [Fri, 1 Nov 2024 03:09:59 +0000 (11:09 +0800)]
build: fix build error in Windows env with OneAPI setup (#10107)
Diego Devesa [Thu, 31 Oct 2024 23:49:53 +0000 (00:49 +0100)]
llama : improve output buffer type selection (#10098)
Diego Devesa [Thu, 31 Oct 2024 23:45:34 +0000 (00:45 +0100)]
quantize : fix --keep-split (#10114)
Diego Devesa [Thu, 31 Oct 2024 21:54:23 +0000 (22:54 +0100)]
llama : fix buffer checks for mamba and rwk (#10111)
* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
Zhenwei Jin [Thu, 31 Oct 2024 18:50:39 +0000 (02:50 +0800)]
loader: refactor tensor weights storage (#9935)
* loader: refactor tensor weights storage
* use sorted map, sort weights by layer
---------
Co-authored-by: slaren <redacted>
Kevin Gibbons [Thu, 31 Oct 2024 13:02:35 +0000 (06:02 -0700)]
server : include scheme when printing URL (#10106)
Diego Devesa [Thu, 31 Oct 2024 10:40:59 +0000 (11:40 +0100)]
ggml : check tensor name lengths in gguf files (#10100)
Sergio López [Thu, 31 Oct 2024 09:09:52 +0000 (10:09 +0100)]
kompute: add mul_mat_q4_k shader (#10097)
This is a more or less direct translation from the Metal implementation
to GLSL.
Signed-off-by: Sergio Lopez <redacted>
Sergio López [Wed, 30 Oct 2024 16:01:52 +0000 (17:01 +0100)]
kompute: add backend registry / device interfaces (#10045)
Get in line with the other backends by supporting the newer
backend/device registry interfaces.
Signed-off-by: Sergio Lopez <redacted>
Diego Devesa [Wed, 30 Oct 2024 13:51:21 +0000 (14:51 +0100)]
ggml : fix memory leaks when loading invalid gguf files (#10094)
* ggml : fix gguf string leak when reading kv pairs fails
* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type
* ggml : avoid crashing on failed memory allocations when loading a gguf file
Rich Dougherty [Wed, 30 Oct 2024 12:22:39 +0000 (01:22 +1300)]
readme : more lora detail in main example readme (#10064)
Rich Dougherty [Wed, 30 Oct 2024 12:22:21 +0000 (01:22 +1300)]
convert : more detailed convert lora usage docs (#10065)
xctan [Wed, 30 Oct 2024 07:00:40 +0000 (15:00 +0800)]
ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029)
* ggml : RISC-V vector gemv for q4_0_8x8
* ggml : Added WIP rvv q4_0_8x8 gemm
* ggml : Added initial implementation of rvv gemm
* ggml : optimize gemm to avoid register spillover
* ggml : Fix GCC rvv load alignment issue
* ggml : Format gemm rvv code
* ggml : Fix a typo in RVV q4_0_8_8 GEMM
Diego Devesa [Wed, 30 Oct 2024 01:01:23 +0000 (02:01 +0100)]
llama : refactor model loader with backend registry (#10026)
Changyeon Kim [Tue, 29 Oct 2024 08:52:56 +0000 (17:52 +0900)]
ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
* ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend.
- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
Signed-off-by: Changyeon Kim <redacted>
* [fix] Correct the incorrect order of the parameters.
fix casting to int.
Signed-off-by: Changyeon Kim <redacted>
---------
Signed-off-by: Changyeon Kim <redacted>
Georgi Gerganov [Tue, 29 Oct 2024 08:42:05 +0000 (10:42 +0200)]
llama : remove Tail-Free sampling (#10071)
ggml-ci
arch-btw [Mon, 28 Oct 2024 17:45:33 +0000 (10:45 -0700)]
llama : Add IBM granite template (#10013)
* Add granite template to llama.cpp
* Add granite template to test-chat-template.cpp
* Update src/llama.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
* Update tests/test-chat-template.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
* Added proper template and expected output
* Small change to \n
* Add code space &
Co-authored-by: Xuan Son Nguyen <redacted>
* Fix spacing
* Apply suggestions from code review
* Update src/llama.cpp
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Georgi Gerganov [Mon, 28 Oct 2024 15:41:24 +0000 (17:41 +0200)]
flake.lock: Update (#10063)
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
→ 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)
Co-authored-by: github-actions[bot] <redacted>
R0CKSTAR [Mon, 28 Oct 2024 09:02:48 +0000 (17:02 +0800)]
musa: workaround for Guilty Lockup in cleaning src0 (#10042)
Signed-off-by: Xiaodong Ye <redacted>
Georgi Gerganov [Mon, 28 Oct 2024 06:49:32 +0000 (08:49 +0200)]
server : don't overfill the batch during infill (#10018)
ggml-ci
Georgi Gerganov [Sun, 27 Oct 2024 18:59:58 +0000 (20:59 +0200)]
llama : switch KQ multiplication to F32 precision by default (#10015)
ggml-ci
Georgi Gerganov [Sat, 26 Oct 2024 07:34:08 +0000 (10:34 +0300)]
sync : ggml
bssrdf [Wed, 23 Oct 2024 18:34:00 +0000 (14:34 -0400)]
increase cuda_cpy block size (ggml/996)
Co-authored-by: bssrdf <redacted>
Georgi Gerganov [Sat, 26 Oct 2024 07:33:31 +0000 (10:33 +0300)]
scripts : fix amx sync [no ci]
Georgi Gerganov [Fri, 25 Oct 2024 19:26:15 +0000 (22:26 +0300)]
metal : support permuted matrix multiplications (#10033)
* metal : support permuted matrix multiplications
ggml-ci
* cont : use nb01 directly for row steps
ggml-ci
* cont : add comments [no ci]
* metal : minor refactor
* metal : minor
wwoodsTM [Fri, 25 Oct 2024 16:07:34 +0000 (10:07 -0600)]
llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)
* DRY: Trying to fix coauthors, removed unneeded line
* DRY: Fixed redundant code
* DRY: Fixed crash issue due to DRY being in chain but uninitialized
---------
Co-authored-by: l3utterfly <redacted>
Co-authored-by: pi6am <redacted>
Michael Podvitskiy [Fri, 25 Oct 2024 15:57:54 +0000 (17:57 +0200)]
llama: string_split fix (#10022)
* llama: Refactor string_split to use template specialization, fixes parsing strings with spaces
* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
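To make the pattern concrete, here is a minimal sketch of the approach this commit describes, with illustrative names and signatures rather than the actual llama.cpp code: a generic `string_split` template plus a `std::string` specialization that preserves spaces, and a `static_assert` that guards against the generic path being instantiated for `std::string`.
```cpp
#include <sstream>
#include <string>
#include <type_traits>
#include <vector>

// Generic version: parse each delimited token into T via a stringstream.
template <typename T>
std::vector<T> string_split(const std::string & input, char separator) {
    static_assert(!std::is_same<T, std::string>::value,
                  "use the std::string specialization so that spaces are preserved");
    std::vector<T> parts;
    std::istringstream stream(input);
    std::string token;
    while (std::getline(stream, token, separator)) {
        std::istringstream token_stream(token);
        T value;
        token_stream >> value;
        parts.push_back(value);
    }
    return parts;
}

// Specialization for std::string: keep the raw tokens, spaces included.
template <>
std::vector<std::string> string_split<std::string>(const std::string & input, char separator) {
    std::vector<std::string> parts;
    std::istringstream stream(input);
    std::string token;
    while (std::getline(stream, token, separator)) {
        parts.push_back(token);
    }
    return parts;
}
```
With this in place, `string_split<std::string>("a b,c", ',')` yields `{"a b", "c"}` instead of splitting at the space.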
Srihari-mcw [Fri, 25 Oct 2024 07:27:41 +0000 (12:57 +0530)]
llamafile : extend sgemm.cpp support for Q5_0 models (#10010)
Georgi Gerganov [Fri, 25 Oct 2024 07:13:46 +0000 (10:13 +0300)]
server : check that the prompt fits in the slot's context (#10030)
ggml-ci
Xuan Son Nguyen [Thu, 24 Oct 2024 19:51:22 +0000 (21:51 +0200)]
server : refactor slot input data, move tokenizer to HTTP thread (#10023)
* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
Georgi Gerganov [Thu, 24 Oct 2024 18:23:33 +0000 (21:23 +0300)]
ci : fix cmake flags for SYCL
Johannes Gäßler [Thu, 24 Oct 2024 12:40:23 +0000 (14:40 +0200)]
CUDA: fix insufficient buffer clearing for MMQ (#10032)
Johannes Gäßler [Thu, 24 Oct 2024 09:09:36 +0000 (11:09 +0200)]
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
* CUDA: fix MMQ for non-contiguous src0, add tests
* revise test code
wwoodsTM [Wed, 23 Oct 2024 19:27:51 +0000 (13:27 -0600)]
server : samplers accept the prompt correctly (#10019)
Georgi Gerganov [Wed, 23 Oct 2024 14:23:55 +0000 (17:23 +0300)]
sync : ggml
Georgi Gerganov [Wed, 23 Oct 2024 14:16:56 +0000 (17:16 +0300)]
llama.vim : bump generation time limit to 3s [no ci]
Johannes Gäßler [Fri, 18 Oct 2024 07:24:44 +0000 (09:24 +0200)]
CUDA: fix 1D im2col, add tests (ggml/993)
Daniel Bevenius [Wed, 16 Oct 2024 18:10:01 +0000 (20:10 +0200)]
ggml : remove redundant set of contexts used field (ggml/978)
This commit removes the setting of the `used` field of the contexts in
the global state (g_state) in `ggml_init`.
The motivation for this change is that I believe that this additional
initialization might not be required after the changes in Commit
45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from
whisper.cpp"), which changed the initialization of the contexts field
from `{ 0 }` to `{ { 0 } }`:
```console
g_state = (struct ggml_state) {
- /*.contexts =*/ { 0 },
+ /*.contexts =*/ { { 0 } },
};
```
My understanding is that the `{0}` initialization might not have
zero-initialized all the nested fields in every array element because of
compiler differences, and might have been the reason for having the
explicit setting of the `used` fields to false.
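As a standalone illustration of the initialization rule discussed above (hypothetical stand-in types, not the actual ggml definitions): in both C and C++, aggregate initialization value-initializes every element that is not given an explicit initializer, so the `used` flags in the nested array start out false without any explicit loop. Whether older compilers honored this reliably is a separate question, which is presumably why the explicit assignment existed in the first place.
```cpp
#include <cstdio>

// Hypothetical stand-ins for the real ggml types, just to show the pattern.
struct ggml_context_container {
    bool   used;
    void * ctx;
};

struct ggml_state {
    ggml_context_container contexts[64];
};

int main() {
    // Only the first element is spelled out; the remaining 63 elements are
    // value-initialized, i.e. their `used` flags are false and `ctx` is null.
    ggml_state g_state = { { { false, nullptr } } };

    int used_count = 0;
    for (const auto & c : g_state.contexts) {
        used_count += c.used ? 1 : 0;
    }
    std::printf("contexts marked used: %d\n", used_count); // prints 0
    return 0;
}
```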
Michael Coppola [Wed, 23 Oct 2024 11:09:26 +0000 (07:09 -0400)]
llama.vim : add classic vim support (#9995)
* added classic vim support
* fixed ring update, removed blank line
* minor
* minor
* minor doc update
* removed unneeded var
* minor
* minor
* fixed job_start creating new scratch buffers
* fixed job_start creating new scratch buffers
* fixed ghost text indenting when expandtab is on
* removed unused code
* minor
* unified fim_on_exit
* minor
* vim ghost text rendering now uses pos_x and pos_y parameters
* renamed *_hlgroup to hlgroup_*
* renamed *_ghost_text to ghost_text_*, moved nvim/vim detection to llama#init()
* minor
---------
Co-authored-by: Michael Coppola <redacted>
Jun Hee Yoo [Wed, 23 Oct 2024 10:33:45 +0000 (19:33 +0900)]
metal : add POOL2D and fix IM2COL (#9943)
* add pool_2d
Signed-off-by: Junhee Yoo <redacted>
* fix im2col and add unittest for N>=1024
Signed-off-by: Junhee Yoo <redacted>
* add tests for N % 1024 != 0
Signed-off-by: Junhee Yoo <redacted>
* remove trailing whitespaces
Signed-off-by: Junhee Yoo <redacted>
* apply suggestions
Signed-off-by: Junhee Yoo <redacted>
* apply more optimization
- original IM2COL kernel + _ext with MIN()
Signed-off-by: Junhee Yoo <redacted>
* apply review: change kernel name of pool_2d
Signed-off-by: Junhee Yoo <redacted>
* apply review
Signed-off-by: Junhee Yoo <redacted>
* fix more formatting and enhance readability
Signed-off-by: Junhee Yoo <redacted>
---------
Signed-off-by: Junhee Yoo <redacted>
github-actions[bot] [Sun, 20 Oct 2024 00:22:59 +0000 (00:22 +0000)]
flake.lock: Update
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)
→ 'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
Xuan Son Nguyen [Tue, 22 Oct 2024 14:59:02 +0000 (16:59 +0200)]
llama : fix empty batch causing llama_batch_allocr to crash (#9966)
* llama : fix empty batch causing llama_batch_allocr to crash
* move batch_allocr inside decode/encode_internal
* fix build
* add GGML_ASSERT
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Daniel Bevenius [Tue, 22 Oct 2024 13:31:06 +0000 (15:31 +0200)]
llama : rename batch to ubatch (#9950)
This commit renames the member field batch in llm_build_context to
ubatch, and likewise renames the batch parameter of llama_build_graph
and llama_set_inputs to ubatch.
The motivation for this change is to make the code more readable
(considering there are the structs llama_batch and llama_sbatch), and
consistent with other parts of the code base where parameters/fields of
type llama_ubatch are named ubatch.
Molly Sophia [Tue, 22 Oct 2024 13:22:26 +0000 (21:22 +0800)]
Rwkv chat template fix (#10001)
* llama: remove useless template matching for rwkv-world
Signed-off-by: Molly Sophia <redacted>
* converter: Add comment about the hack for rwkv models
Signed-off-by: Molly Sophia <redacted>
* Update src/llama.cpp
Co-authored-by: Xuan Son Nguyen <redacted>
---------
Signed-off-by: Molly Sophia <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Xuan Son Nguyen [Tue, 22 Oct 2024 11:08:41 +0000 (13:08 +0200)]
lora : warn user if new token is added in the adapter (#9948)
Molly Sophia [Tue, 22 Oct 2024 10:33:37 +0000 (18:33 +0800)]
llama : add chat template for RWKV-World + fix EOT (#9968)
* Add chat template for RWKV-World
Signed-off-by: Molly Sophia <redacted>
* RWKV: Fix the chat template not being used
Signed-off-by: Molly Sophia <redacted>
* RWKV v6: Set EOT token to ``\n\n``
Signed-off-by: Molly Sophia <redacted>
* readme: add rwkv into supported model list
Signed-off-by: Molly Sophia <redacted>
---------
Signed-off-by: Molly Sophia <redacted>
leo-pony [Tue, 22 Oct 2024 08:16:01 +0000 (16:16 +0800)]
[CANN] Adapt to dynamically loadable backends mechanism (#9970)
* [CANN] Adapt to dynamically loadable backends mechanism
* Fix the bug where inference results are garbled when running in debug mode for LM models whose type is Q4_0 class
* Handle the review comments of this pull request
Daniel Bevenius [Tue, 22 Oct 2024 07:40:02 +0000 (09:40 +0200)]
arg : fix typo in embeddings argument help [no ci] (#9994)
This commit fixes two typos in the help text for the `--embd-normalize`
and `--embd-separator` arguments. It also updates common.h, which contains
the same typo in two comments.
Georgi Gerganov [Mon, 21 Oct 2024 21:35:25 +0000 (00:35 +0300)]
llama.vim : fix info text display [no ci] (#9787)
Georgi Gerganov [Mon, 21 Oct 2024 19:52:22 +0000 (22:52 +0300)]
llama.vim : move info to the right of screen [no ci] (#9787)
'eol' messes up the rendering with nvim v0.10.2 for some reason
Asghar Ghorbani [Mon, 21 Oct 2024 18:20:59 +0000 (20:20 +0200)]
readme : update UI list (#9972)
add PocketPal AI app
Daniel Bevenius [Mon, 21 Oct 2024 18:12:52 +0000 (20:12 +0200)]
arg : fix attention non-causal arg value hint (#9985)
This commit updates the argument value hint for the `--attention`
argument to `non-causal`.
The motivation for this change is that the only values for this argument
are `causal` and `non-causal`.
Georgi Gerganov [Mon, 21 Oct 2024 17:25:02 +0000 (20:25 +0300)]
llama.vim : plugin for Neovim (#9787)
Georgi Gerganov [Mon, 21 Oct 2024 13:20:46 +0000 (16:20 +0300)]
ggml : add asserts for type conversion in fattn kernels (#9971)
ggml-ci
Radoslav Gerganov [Mon, 21 Oct 2024 10:35:40 +0000 (13:35 +0300)]
rpc : pack only RPC structs (#9959)
Georgi Gerganov [Mon, 21 Oct 2024 06:46:40 +0000 (09:46 +0300)]
llama : default sampling changes + greedy update (#9897)
* llama : deprecate softmax sampler + fix dist sampler
ggml-ci
* tests : replace macros with functions
ggml-ci
* sampling : change temperature sampler logic
For t <= 0.0f, keep the max logit intact and set the rest to -inf
* cont : no need for special "greedy" logic
top-k == 1 is the same
* tests : init prob correctly
* llama : handle temp <= 0.0 in the temp_ext sampler too
ggml-ci
* cont : avoid extra loop in temperature sampler for sub-zero temp
ggml-ci
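A minimal sketch of the `temp <= 0.0` behaviour described in the bullets above, operating on a plain logit vector rather than the actual llama.cpp sampler API (names and signature are illustrative): the maximum logit is left intact and every other logit is pushed to -inf, so the downstream dist step becomes effectively greedy without a dedicated code path.
```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Simplified stand-in; the real sampler operates on llama_token_data arrays.
void apply_temperature(std::vector<float> & logits, float temp) {
    if (logits.empty()) {
        return;
    }
    if (temp <= 0.0f) {
        // Keep the max logit intact, push everything else to -inf so that
        // sampling from the resulting distribution is effectively greedy.
        const size_t max_i = static_cast<size_t>(std::distance(
            logits.begin(), std::max_element(logits.begin(), logits.end())));
        for (size_t i = 0; i < logits.size(); ++i) {
            if (i != max_i) {
                logits[i] = -std::numeric_limits<float>::infinity();
            }
        }
        return;
    }
    // Regular temperature scaling.
    for (float & l : logits) {
        l /= temp;
    }
}
```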
Georgi Gerganov [Mon, 21 Oct 2024 06:37:12 +0000 (09:37 +0300)]
speculative : fix handling of some input params (#9963)
* speculative : fix batch sizes at initialization
ggml-ci
* speculative : handle params.n_predict == -1
* speculative : limit batch size to llama_n_batch
Neo Zhang Jianyu [Mon, 21 Oct 2024 06:26:09 +0000 (14:26 +0800)]
fix mul_mat_vec_q and *_vec_q error (#9939)
Co-authored-by: arthw <redacted>
Loïc Carrère [Sun, 20 Oct 2024 16:25:41 +0000 (18:25 +0200)]
readme : update bindings list (#9951)
Update the binding list by adding LM-Kit.NET (C# & VB.NET)
icppWorld [Sun, 20 Oct 2024 16:01:34 +0000 (12:01 -0400)]
readme : update infra list (#9942)
llama_cpp_canister allows you to run llama.cpp as a Smart Contract on the Internet Computer. The smart contract runs as WebAssembly in a so-called 'canister'.
Xuan Son Nguyen [Fri, 18 Oct 2024 21:18:01 +0000 (23:18 +0200)]
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one
* adapt all examples
* fix simple.cpp
* fix llama_bench
* fix
* fix context shifting
* free batch before return
* use common_batch_add, reuse llama_batch in loop
* null terminated seq_id list
* fix save-load-state example
* fix perplexity
* correct token pos in llama_batch_allocr
Radoslav Gerganov [Fri, 18 Oct 2024 11:33:58 +0000 (14:33 +0300)]
rpc : backend refactoring (#9912)
* rpc : refactor backend
Use structs for RPC request/response messages
* rpc : refactor server
Ouadie EL FAROUKI [Fri, 18 Oct 2024 05:46:16 +0000 (06:46 +0100)]
[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
* implemented missing SYCL event APIs
* sycl : Added device and backend reg interfaces
* Restructured ggml-sycl.cpp
Ma Mingfei [Fri, 18 Oct 2024 05:34:36 +0000 (13:34 +0800)]
add amx kernel for gemm (#8998)
add intel amx isa detection
add vnni kernel for gemv cases
add vnni and amx kernel support for block_q8_0
code cleanup
fix packing B issue
enable openmp
fine tune amx kernel
switch to aten parallel pattern
add error message for nested parallelism
code cleanup
add f16 support in ggml-amx
add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
update CMakeList
update README
fix some compilation warning
fix compiler warning when amx is not enabled
minor change
ggml-ci
move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp
ggml-ci
update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16
ggml-ci
add amx as a ggml-backend
update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h
minor change
update CMakeLists.txt
minor change
apply weight prepacking in set_tensor method in ggml-backend
fix compile error
ggml-ci
minor change
ggml-ci
update CMakeLists.txt
ggml-ci
add march dependency
minor change
ggml-ci
change ggml_backend_buffer_is_host to return false for amx backend
ggml-ci
fix supports_op
use device reg for AMX backend
ggml-ci
minor change
ggml-ci
minor change
fix rebase
set .buffer_from_host_ptr to be false for AMX backend
Georgi Gerganov [Fri, 18 Oct 2024 04:32:19 +0000 (07:32 +0300)]
server : add n_indent parameter for line indentation requirement (#9929)
ggml-ci
Daniel Bevenius [Thu, 17 Oct 2024 23:41:51 +0000 (01:41 +0200)]
llama : rename batch_all to batch (#8881)
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
Georgi Gerganov [Thu, 17 Oct 2024 20:43:05 +0000 (23:43 +0300)]
readme : remove --memory-f32 references (#9925)
Georgi Gerganov [Thu, 17 Oct 2024 20:26:32 +0000 (23:26 +0300)]
llama : change warning to debug log
Georgi Gerganov [Thu, 17 Oct 2024 19:32:47 +0000 (22:32 +0300)]
llama : infill sampling handle very long tokens (#9924)
* llama : infill sampling handle very long tokens
ggml-ci
* cont : better indices
ggml-ci
Tim Wang [Thu, 17 Oct 2024 06:57:14 +0000 (17:57 +1100)]
readme : update bindings list (#9918)
Co-authored-by: Tim Wang <redacted>
Diego Devesa [Thu, 17 Oct 2024 00:46:58 +0000 (02:46 +0200)]
vulkan : add backend registry / device interfaces (#9721)
* vulkan : add backend registry / device interfaces
* llama : print devices used on model load
Gilad S. [Wed, 16 Oct 2024 23:34:22 +0000 (02:34 +0300)]
fix: allocating CPU buffer with size `0` (#9917)
Gilad S. [Wed, 16 Oct 2024 22:36:51 +0000 (01:36 +0300)]
fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages work
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment
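As a rough, hypothetical sketch of the allocation pattern mentioned in the `posix_memalign` bullet above (not the actual ggml implementation): memory obtained via `posix_memalign` can be released with plain `free()`, which is what allows the existing `free()` call sites to keep working.
```cpp
#include <stddef.h>
#include <stdlib.h>

// Hypothetical aligned allocator (POSIX-only); alignment must be a power of
// two and a multiple of sizeof(void *).
static void * aligned_malloc(size_t size, size_t alignment) {
    void * ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}

static void aligned_free(void * ptr) {
    free(ptr); // posix_memalign allocations are free()-compatible
}
```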
Daniel Bevenius [Wed, 16 Oct 2024 17:34:28 +0000 (19:34 +0200)]
llama : suppress conversion from 'size_t' to 'int' (#9046)
* llama : suppress conversion from 'size_t' to 'int'
This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:
```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
conversion from 'size_t' to 'int', possible loss of data
```
This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.
The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.
* squash! llama : suppress conversion from 'size_t' to 'int'
Move cast into for loop.
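A small, hypothetical illustration of the kind of cast described above (not the actual llm_tokenizer_spm code): converting the `size_t` returned by `size()` to `int` at the loop boundary silences MSVC's C4267 warning, and is safe under the assumption stated in the commit message that the symbol count stays well below INT_MAX.
```cpp
#include <string>
#include <vector>

// Hypothetical example; the real code iterates over the tokenizer's symbols.
void process_symbols(const std::vector<std::string> & symbols) {
    // Cast once at the boundary: symbols holds at most one entry per UTF-8
    // character of the input, so exceeding INT_MAX is not a realistic concern.
    const int n = static_cast<int>(symbols.size());
    for (int i = 0; i < n; ++i) {
        // ... pass index i to APIs that expect an int ...
        (void) symbols[i];
    }
}
```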