git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
10 months ago  sync : ggml
Georgi Gerganov [Tue, 27 Aug 2024 19:01:45 +0000 (22:01 +0300)]
sync : ggml

10 months ago  Fix minicpm example directory (#9111)
Xie Yanbo [Tue, 27 Aug 2024 12:33:08 +0000 (20:33 +0800)]
Fix minicpm example directory (#9111)

10 months ago  llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156)
compilade [Tue, 27 Aug 2024 10:09:23 +0000 (06:09 -0400)]
llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156)

10 months ago  server : add some missing env variables (#9116)
Xuan Son Nguyen [Tue, 27 Aug 2024 09:07:01 +0000 (11:07 +0200)]
server : add some missing env variables (#9116)

* server : add some missing env variables

* add LLAMA_ARG_HOST to server dockerfile

* also add LLAMA_ARG_CONT_BATCHING

10 months ago  llama : fix ChatGLM4 wrong shape (#9194)
CausalLM [Tue, 27 Aug 2024 06:58:22 +0000 (14:58 +0800)]
llama : fix ChatGLM4 wrong shape (#9194)

This should fix THUDM/glm-4-9b-chat-1m and CausalLM/miniG

10 months ago  llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)
Carsten Kragelund Jørgensen [Tue, 27 Aug 2024 06:53:40 +0000 (08:53 +0200)]
llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)

* fix: llama3.1 rope_freqs not respecting custom head_dim

* fix: use potential head_dim for Exaone

10 months ago  common : Update stb_image.h to latest version (#9161)
arch-btw [Tue, 27 Aug 2024 05:58:50 +0000 (22:58 -0700)]
common : Update stb_image.h to latest version (#9161)

* Update stb_image.h to latest version

Fixes https://github.com/ggerganov/llama.cpp/issues/7431

* Update .ecrc

10 months ago  ggml : do not crash when quantizing q4_x_x with an imatrix (#9192)
slaren [Mon, 26 Aug 2024 17:44:43 +0000 (19:44 +0200)]
ggml : do not crash when quantizing q4_x_x with an imatrix (#9192)

10 months ago  metal : separate scale and mask from QKT in FA kernel (#9189)
Georgi Gerganov [Mon, 26 Aug 2024 15:31:02 +0000 (18:31 +0300)]
metal : separate scale and mask from QKT in FA kernel (#9189)

* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory

10 months ago  ggml : add SSM Metal kernels (#8546)
Georgi Gerganov [Mon, 26 Aug 2024 14:55:36 +0000 (17:55 +0300)]
ggml : add SSM Metal kernels (#8546)

* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci

10 months ago  tests : fix compile warnings for unreachable code (#9185)
Georgi Gerganov [Mon, 26 Aug 2024 13:30:25 +0000 (16:30 +0300)]
tests : fix compile warnings for unreachable code (#9185)

ggml-ci

10 months ago  ci : add VULKAN support to ggml-ci (#9055)
Georgi Gerganov [Mon, 26 Aug 2024 09:19:39 +0000 (12:19 +0300)]
ci : add VULKAN support to ggml-ci (#9055)

10 months ago  server : update deps (#9183)
Georgi Gerganov [Mon, 26 Aug 2024 09:16:57 +0000 (12:16 +0300)]
server : update deps (#9183)

10 months ago  metal : gemma2 flash attention support (#9159)
slaren [Mon, 26 Aug 2024 09:08:59 +0000 (11:08 +0200)]
metal : gemma2 flash attention support (#9159)

10 months ago  ggml-ci : try to improve build time (#9160)
slaren [Mon, 26 Aug 2024 09:03:30 +0000 (11:03 +0200)]
ggml-ci : try to improve build time (#9160)

10 months ago  llama : fix time complexity of string replacement (#9163)
Justine Tunney [Mon, 26 Aug 2024 06:09:53 +0000 (23:09 -0700)]
llama : fix time complexity of string replacement (#9163)

This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. The previous algorithm was quadratic:
calling s.replace() in a loop memmove()s the tail of the string on every
match. It seems most search results and LLM responses actually provide the
O(n**2) algorithm, which is a great tragedy. Using a builder string fixes
things (see the sketch below).
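
A minimal sketch of the builder-string approach described above (the helper name and exact shape are illustrative, not necessarily the code that was merged): unchanged spans and replacements are appended to a fresh string in a single O(n) pass.

```cpp
#include <string>
#include <utility>

// O(n) replace-all: append untouched spans and replacements to a builder
// string, instead of editing `s` in place with s.replace(), which memmoves
// the tail of the string on every match (quadratic overall).
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return;
    }
    std::string builder;
    builder.reserve(s.size());
    size_t last = 0;
    size_t pos;
    while ((pos = s.find(search, last)) != std::string::npos) {
        builder.append(s, last, pos - last); // copy the unchanged span
        builder.append(replace);             // then the replacement
        last = pos + search.size();
    }
    builder.append(s, last, std::string::npos); // trailing remainder
    s = std::move(builder);
}
```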

10 months ago  common: fix --n-gpu-layers-draft argument not being found (#9175)
Herman Semenov [Sun, 25 Aug 2024 22:54:37 +0000 (22:54 +0000)]
common: fix --n-gpu-layers-draft argument not being found (#9175)

10 months ago  CUDA: fix Gemma 2 numerical issues for FA (#9166)
Johannes Gäßler [Sun, 25 Aug 2024 20:11:48 +0000 (22:11 +0200)]
CUDA: fix Gemma 2 numerical issues for FA (#9166)

10 months ago  CPU/CUDA: Gemma 2 FlashAttention support (#8542)
Johannes Gäßler [Sat, 24 Aug 2024 19:34:59 +0000 (21:34 +0200)]
CPU/CUDA: Gemma 2 FlashAttention support (#8542)

* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
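
For context, the soft-capping Gemma 2 applies to attention scores, as generally published, is `cap * tanh(score / cap)`; "apply logit_softcap to scale in kernel" refers to folding this into the FA kernel. A small sketch (the helper name is illustrative):

```cpp
#include <cmath>

// Squash attention scores smoothly into (-cap, +cap) before mask/softmax.
static inline float logit_softcap(float score, float cap) {
    return cap * std::tanh(score / cap);
}
```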

10 months ago  quantize : fix typo in usage help of `quantize.cpp` (#9145)
João Dinis Ferreira [Sat, 24 Aug 2024 06:22:45 +0000 (07:22 +0100)]
quantize : fix typo in usage help of `quantize.cpp` (#9145)

10 months ago  lora : fix llama conversion script with ROPE_FREQS (#9117)
Xuan Son Nguyen [Fri, 23 Aug 2024 10:58:53 +0000 (12:58 +0200)]
lora : fix llama conversion script with ROPE_FREQS (#9117)

10 months ago  llama : use F32 precision in GLM4 attention and no FA (#9130)
piDack [Fri, 23 Aug 2024 07:27:17 +0000 (15:27 +0800)]
llama : use F32 precision in GLM4 attention and no FA (#9130)

10 months ago  [SYCL] Add a space to suppress a cmake warning (#9133)
Akarshan Biswas [Thu, 22 Aug 2024 14:09:47 +0000 (19:39 +0530)]
[SYCL] Add a space to suppress a cmake warning (#9133)

10 months ago  [SYCL] Add oneDNN primitive support (#9091)
luoyu-intel [Thu, 22 Aug 2024 04:50:10 +0000 (12:50 +0800)]
[SYCL] Add oneDNN primitive support (#9091)

* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc

10 months ago  llama : simplify Mamba with advanced batch splits (#8526)
compilade [Wed, 21 Aug 2024 21:58:11 +0000 (17:58 -0400)]
llama : simplify Mamba with advanced batch splits (#8526)

* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <redacted>
* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <redacted>
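
A hedged usage sketch of the llama_model_is_recurrent() API added in this change (the surrounding helper is hypothetical; only the API call itself comes from the commit):

```cpp
#include "llama.h"

// Recurrent models (e.g. Mamba) can only process one sequence per ubatch
// when computing pooled embeddings, as noted above, so callers can gate
// on the new API:
static bool needs_single_seq_ubatches(const llama_model * model, bool pooled_embeddings) {
    return pooled_embeddings && llama_model_is_recurrent(model);
}
```
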
10 months ago  server : support reading arguments from environment variables (#9105)
Xuan Son Nguyen [Wed, 21 Aug 2024 09:04:34 +0000 (11:04 +0200)]
server : support reading arguments from environment variables (#9105)

* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
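
A sketch of the CLI-argument/environment-variable fallback this feature provides (the helper is illustrative; the server's actual parsing code is structured differently):

```cpp
#include <cstdlib>
#include <string>

// Resolve a setting: an explicit CLI flag wins, otherwise fall back to an
// environment variable such as LLAMA_ARG_HOST, otherwise use the default.
static std::string arg_or_env(int argc, char ** argv, const std::string & flag,
                              const char * env_name, const std::string & def) {
    for (int i = 1; i + 1 < argc; i++) {
        if (flag == argv[i]) {
            return argv[i + 1];
        }
    }
    if (const char * v = std::getenv(env_name)) {
        return v;
    }
    return def;
}
```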

10 months ago  llama : support for `falcon-mamba` architecture (#9074)
Younes Belkada [Wed, 21 Aug 2024 08:06:36 +0000 (12:06 +0400)]
llama : support for `falcon-mamba` architecture (#9074)

* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <redacted>
* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <redacted>
* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <redacted>
* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <redacted>
10 months ago  llava : zero-initialize clip_ctx structure fields with aggregate initialization (#8908)
fairydreaming [Wed, 21 Aug 2024 07:45:49 +0000 (09:45 +0200)]
llava : zero-initialize clip_ctx structure fields with aggregate initialization (#8908)

Co-authored-by: Stanisław Szymczyk <redacted>
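
The C++ idiom the fix relies on, sketched with an illustrative struct (not the actual clip_ctx definition):

```cpp
// Member initializers plus aggregate/value initialization guarantee zeroed
// fields; a plain uninitialized instance would leave members indeterminate.
struct clip_ctx_like {
    bool   has_vision_encoder = false;
    float  image_mean[3]      = {};
    void * backend            = nullptr;
};

clip_ctx_like ctx{}; // every field is zero-initialized
```
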
10 months ago  llama : std::move llm_bigram_bpe from work_queue (#9062)
Daniel Bevenius [Wed, 21 Aug 2024 07:32:58 +0000 (09:32 +0200)]
llama : std::move llm_bigram_bpe from work_queue (#9062)

* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
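
The resulting technique, sketched from the squash history above (close to what was merged, though details may differ): std::priority_queue::top() returns a const reference, so a thin subclass exposes the protected container to move the top element out.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

template<typename T, typename Container = std::vector<T>,
         typename Compare = std::less<typename Container::value_type>>
class llama_priority_queue : public std::priority_queue<T, Container, Compare> {
public:
    using std::priority_queue<T, Container, Compare>::priority_queue;

    // Move the top element out instead of copying it via top().
    T pop_move() {
        T item = std::move(this->c.front());
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        this->c.pop_back();
        return item;
    }
};
```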

10 months ago  llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)
Changyeon Kim [Tue, 20 Aug 2024 19:00:00 +0000 (04:00 +0900)]
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)

* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <redacted>
* fix-up coding style.

Signed-off-by: Changyeon Kim <redacted>
* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <redacted>
* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <redacted>
* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <redacted>
* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <redacted>
Co-authored-by: 0cc4m <redacted>
10 months ago  [SYCL] fallback mmvq (#9088)
Meng, Hengyu [Tue, 20 Aug 2024 15:50:17 +0000 (23:50 +0800)]
[SYCL] fallback mmvq (#9088)

* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <redacted>
---------

Co-authored-by: Alberto Cabrera Pérez <redacted>
10 months ago  [SYCL] Fix SYCL `im2col` and `convert` Overflow with Large Dims (#9052)
zhentaoyu [Tue, 20 Aug 2024 15:06:51 +0000 (23:06 +0800)]
[SYCL] Fix SYCL `im2col` and `convert` Overflow with Large Dims (#9052)

* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <redacted>
* sycl: fix convert overflow

Signed-off-by: zhentaoyu <redacted>
* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <redacted>
* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <redacted>
* sycl: refine convert

Signed-off-by: zhentaoyu <redacted>
* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <redacted>
* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <redacted>
* test: make new cases only in sycl

Signed-off-by: zhentaoyu <redacted>
* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <redacted>
---------

Signed-off-by: zhentaoyu <redacted>
10 months ago  tests : add missing comma in grammar integration tests (#9099)
fairydreaming [Tue, 20 Aug 2024 09:09:55 +0000 (11:09 +0200)]
tests : add missing comma in grammar integration tests (#9099)

Co-authored-by: Stanisław Szymczyk <redacted>
10 months ago  cann: add doc for cann backend (#8867)
wangshuai09 [Mon, 19 Aug 2024 08:46:38 +0000 (16:46 +0800)]
cann: add doc for cann backend (#8867)

Co-authored-by: xuedinge233 <redacted>
Co-authored-by: hipudding <redacted>
10 months ago  rpc : print error message when failed to connect endpoint (#9042)
Radoslav Gerganov [Mon, 19 Aug 2024 07:11:45 +0000 (10:11 +0300)]
rpc : print error message when failed to connect endpoint (#9042)

10 months ago  rpc : prevent crashes on invalid input (#9040)
Radoslav Gerganov [Mon, 19 Aug 2024 07:10:21 +0000 (10:10 +0300)]
rpc : prevent crashes on invalid input (#9040)

Add more checks to prevent the RPC server from crashing when invalid
input is received from a client.
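
The checks follow the usual defensive-deserialization shape; an illustrative sketch under assumed message framing (not the RPC server's actual protocol):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Validate a length-prefixed payload from an untrusted client before use:
// reject messages that are too short, or whose declared size exceeds the
// bytes actually received, instead of crashing later.
static bool parse_payload(const std::vector<uint8_t> & msg, std::vector<uint8_t> & out) {
    uint64_t n = 0;
    if (msg.size() < sizeof(n)) {
        return false;
    }
    std::memcpy(&n, msg.data(), sizeof(n));
    if (n > msg.size() - sizeof(n)) {
        return false;
    }
    const size_t len = static_cast<size_t>(n);
    out.assign(msg.begin() + sizeof(n), msg.begin() + sizeof(n) + len);
    return true;
}
```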

10 months ago  flake.lock: Update (#9068)
Georgi Gerganov [Sun, 18 Aug 2024 14:43:32 +0000 (17:43 +0300)]
flake.lock: Update (#9068)

10 months ago  tests : add integration test for lora adapters (#8957)
ltoniazzi [Sun, 18 Aug 2024 09:58:04 +0000 (10:58 +0100)]
tests : add integration test for lora adapters (#8957)

* Add printing to check weights match torch version

* minor code style changes

---------

Co-authored-by: Xuan Son Nguyen <redacted>
10 months ago  Fix incorrect use of ctx_split for bias tensors (#9063)
Yoshi Suhara [Sat, 17 Aug 2024 13:34:21 +0000 (06:34 -0700)]
Fix incorrect use of ctx_split for bias tensors (#9063)

10 months ago  server : refactor middleware and /health endpoint (#9056)
Xuan Son Nguyen [Fri, 16 Aug 2024 15:19:05 +0000 (17:19 +0200)]
server : refactor middleware and /health endpoint (#9056)

* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <redacted>
* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <redacted>
10 months ago  llava : support MiniCPM-V-2.6 (#8967)
tc-mb [Fri, 16 Aug 2024 13:34:41 +0000 (21:34 +0800)]
llava : support MiniCPM-V-2.6 (#8967)

* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update cmakelist

* update cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* address review comments

* address review comments

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove the directory reference at line 33 of the top-level CMakeLists.txt (not the example one, the one in the main dir)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <redacted>
Co-authored-by: harvestingmoon <redacted>
10 months ago  py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928)
Farbod Bijary [Fri, 16 Aug 2024 10:36:30 +0000 (14:06 +0330)]
py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928)

Co-authored-by: farbod <redacted>
10 months ago  Fix inference example lacking required parameters (#9035)
Aisuko [Fri, 16 Aug 2024 09:08:59 +0000 (19:08 +1000)]
Fix inference example lacking required parameters (#9035)

Signed-off-by: Aisuko <redacted>
10 months ago  gguf-py : bump version from 0.9.1 to 0.10.0 (#9051)
compilade [Fri, 16 Aug 2024 06:36:11 +0000 (02:36 -0400)]
gguf-py : bump version from 0.9.1 to 0.10.0 (#9051)

10 months ago  llama : add EXAONE model support (#9025)
Minsoo Cheong [Fri, 16 Aug 2024 06:35:18 +0000 (15:35 +0900)]
llama : add EXAONE model support (#9025)

* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <redacted>
* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <redacted>
* fix lint

Co-Authored-By: compilade <redacted>
* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: compilade <redacted>
Co-authored-by: compilade <redacted>
10 months ago  common : add support for cpu_get_num_physical_cores() on Windows (#8771)
Liu Jia [Fri, 16 Aug 2024 06:23:12 +0000 (14:23 +0800)]
common : add support for cpu_get_num_physical_cores() on Windows (#8771)

* Add support for cpu_get_num_physical_cores() on Windows

* fix build bug on msys2-clang64 and ucrt64

* avoid adding new function

* add new macros to avoid windows+mingw64

* Add error checking to return default value

10 months ago  Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)
Yoshi Suhara [Fri, 16 Aug 2024 02:23:33 +0000 (19:23 -0700)]
Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)

* Add nemotron GGUF conversion & inference support

* Fix formatting issues

* Remove unnecessary write_tensors()

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <redacted>
* Update src/llama.cpp

Co-authored-by: compilade <redacted>
* Address comments by @compilade

* Replace ggml_mul_mat()->llm_build_lora_mm()

* Remove mutable variable

* Use  for bias tensors

* Cover corner case for rope_scaling not in config.json

---------

Co-authored-by: compilade <redacted>
10 months ago  ggml : dynamic ggml_sched_max_splits based on graph_size (#9047)
Nico Bosshard [Fri, 16 Aug 2024 02:22:55 +0000 (04:22 +0200)]
ggml : dynamic ggml_sched_max_splits based on graph_size (#9047)

* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes

10 months ago  retrieval : fix memory leak in retrieval query handling (#8955)
gtygo [Thu, 15 Aug 2024 07:40:12 +0000 (15:40 +0800)]
retrieval : fix memory leak in retrieval query handling (#8955)

* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused white space

10 months ago  server : fix duplicated n_predict key in the generation_settings (#8994)
Riceball LEE [Thu, 15 Aug 2024 07:28:05 +0000 (15:28 +0800)]
server : fix duplicated n_predict key in the generation_settings (#8994)

10 months ago  common : remove duplicate function llama_should_add_bos_token (#8778)
Zhenwei Jin [Thu, 15 Aug 2024 07:23:23 +0000 (15:23 +0800)]
common : remove duplicate function llama_should_add_bos_token (#8778)

10 months ago  llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850)
Esko Toivonen [Thu, 15 Aug 2024 07:17:12 +0000 (10:17 +0300)]
llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850)

10 months ago  ci : disable bench workflow (#9010)
Georgi Gerganov [Thu, 15 Aug 2024 07:11:11 +0000 (10:11 +0300)]
ci : disable bench workflow (#9010)

10 months ago  server : init stop and error fields of the result struct (#9026)
Jiří Podivín [Thu, 15 Aug 2024 06:21:57 +0000 (08:21 +0200)]
server : init stop and error fields of the result struct (#9026)

Signed-off-by: Jiri Podivin <redacted>
10 months ago  Vulkan Optimizations and Fixes (#8959)
0cc4m [Wed, 14 Aug 2024 16:32:53 +0000 (18:32 +0200)]
Vulkan Optimizations and Fixes (#8959)

* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces

10 months ago  server : fix segfault on long system prompt (#8987)
compilade [Wed, 14 Aug 2024 06:51:02 +0000 (02:51 -0400)]
server : fix segfault on long system prompt (#8987)

* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment

10 months ago  cmake : remove unused option GGML_CURL (#9011)
Georgi Gerganov [Wed, 14 Aug 2024 06:14:49 +0000 (09:14 +0300)]
cmake : remove unused option GGML_CURL (#9011)

10 months ago  ggml : move rope type enum to ggml.h (#8949)
Daniel Bevenius [Tue, 13 Aug 2024 19:13:15 +0000 (21:13 +0200)]
ggml : move rope type enum to ggml.h (#8949)

* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and adds them back to the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <redacted>
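
The end state described by the squash history, sketched (the value of the define matches ggml.h at the time; the helper function is illustrative):

```cpp
// ggml.h now exposes a define instead of an enum; `mode` acts as a bit field.
#define GGML_ROPE_TYPE_NEOX 2

static inline bool ggml_rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0;
}
```
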
10 months ago  export-lora : throw error if lora is quantized (#9002)
Xuan Son Nguyen [Tue, 13 Aug 2024 09:41:14 +0000 (11:41 +0200)]
export-lora : throw error if lora is quantized (#9002)

10 months ago  ci : fix github workflow vulnerable to script injection (#9008)
Diogo Teles Sant'Anna [Mon, 12 Aug 2024 16:28:23 +0000 (13:28 -0300)]
ci : fix github workflow vulnerable to script injection (#9008)

Signed-off-by: Diogo Teles Sant'Anna <redacted>
10 months ago  ci : enable RPC in all of the released builds (#9006)
Radoslav Gerganov [Mon, 12 Aug 2024 16:17:03 +0000 (19:17 +0300)]
ci : enable RPC in all of the released builds (#9006)

ref: #8912

10 months ago  llama : model-based max number of graph nodes calculation (#8970)
Nico Bosshard [Mon, 12 Aug 2024 15:13:59 +0000 (17:13 +0200)]
llama : model-based max number of graph nodes calculation (#8970)

* llama : model-based max number of graph nodes calculation

* Update src/llama.cpp

---------

Co-authored-by: slaren <redacted>
10 months ago  docs: introduce gpustack and gguf-parser (#8873)
Frank Mai [Mon, 12 Aug 2024 12:45:50 +0000 (20:45 +0800)]
docs: introduce gpustack and gguf-parser (#8873)

* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <redacted>
* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <redacted>
---------

Signed-off-by: thxCode <redacted>
10 months ago  grammar-parser : fix possible null-deref (#9004)
DavidKorczynski [Mon, 12 Aug 2024 12:36:41 +0000 (13:36 +0100)]
grammar-parser : fix possible null-deref (#9004)

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680
Signed-off-by: David Korczynski <redacted>
10 months ago  ggml: fix div-by-zero (#9003)
DavidKorczynski [Mon, 12 Aug 2024 12:21:41 +0000 (13:21 +0100)]
ggml: fix div-by-zero (#9003)

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724
In order to access the above bug you need to log in using one of the
emails listed in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <redacted>
10 months ago  Fix a spelling mistake (#9001)
Liu Jia [Mon, 12 Aug 2024 09:46:03 +0000 (17:46 +0800)]
Fix a spelling mistake (#9001)

10 months ago  py : fix requirements check '==' -> '~=' (#8982)
Georgi Gerganov [Mon, 12 Aug 2024 08:02:01 +0000 (11:02 +0300)]
py : fix requirements check '==' -> '~=' (#8982)

* py : fix requirements check '==' -> '~='

* cont : fix the fix

* ci : run on all requirements.txt

10 months ago  server : handle models with missing EOS token (#8997)
Georgi Gerganov [Mon, 12 Aug 2024 07:21:50 +0000 (10:21 +0300)]
server : handle models with missing EOS token (#8997)

ggml-ci

10 months ago  gguf-py : Numpy dequantization for most types (#8939)
compilade [Sun, 11 Aug 2024 18:45:41 +0000 (14:45 -0400)]
gguf-py : Numpy dequantization for most types (#8939)

* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants

10 months ago  flake.lock: Update (#8979)
Georgi Gerganov [Sun, 11 Aug 2024 13:58:58 +0000 (16:58 +0300)]
flake.lock: Update (#8979)

10 months ago  update guide (#8909)
Neo Zhang [Sun, 11 Aug 2024 08:37:43 +0000 (16:37 +0800)]
update guide (#8909)

Co-authored-by: Neo Zhang <>
10 months ago  llama : check all graph nodes when searching for result_embd_pooled (#8956)
fairydreaming [Sun, 11 Aug 2024 08:35:26 +0000 (10:35 +0200)]
llama : check all graph nodes when searching for result_embd_pooled (#8956)

Co-authored-by: Stanisław Szymczyk <redacted>
10 months ago  Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943)
Markus Tavenrath [Sun, 11 Aug 2024 08:09:09 +0000 (10:09 +0200)]
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging from the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <redacted>
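
A sketch of the targeted-barrier idea from the second bullet (illustrative Vulkan API usage, not the backend's actual code): a buffer memory barrier scoped to compute-shader reads/writes replaces a full pipeline sync.

```cpp
#include <vulkan/vulkan.h>

// Record a compute->compute hazard barrier on a single buffer; keeping the
// src/dst stages within the compute stage avoids serializing unrelated work.
static void barrier_buffer_rw(VkCommandBuffer cmd, VkBuffer buf) {
    VkBufferMemoryBarrier b = {};
    b.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    b.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    b.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
    b.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    b.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    b.buffer              = buf;
    b.offset              = 0;
    b.size                = VK_WHOLE_SIZE;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0, 0, nullptr, 1, &b, 0, nullptr);
}
```
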
10 months ago  metal : fix uninitialized abort_callback (#8968)
slaren [Sat, 10 Aug 2024 13:42:10 +0000 (15:42 +0200)]
metal : fix uninitialized abort_callback (#8968)

10 months ago  llama : default n_swa for phi-3 (#8931)
Xuan Son Nguyen [Sat, 10 Aug 2024 11:04:40 +0000 (13:04 +0200)]
llama : default n_swa for phi-3 (#8931)

* default n_swa for phi-3

* fix

* double check swa

10 months ago  Add support for encoder-only T5 models (#8900)
fairydreaming [Sat, 10 Aug 2024 09:43:26 +0000 (11:43 +0200)]
Add support for encoder-only T5 models (#8900)

* gguf-py : add T5ENCODER model architecture

* common : call llama_decode() during warmup only if the model has decoder

* convert-hf : add T5EncoderModel

* llama : add llama_model_has_decoder() API function

* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()

* llama : add support for LLM_ARCH_T5ENCODER

* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE

* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <redacted>
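
A hedged sketch of the warmup logic the second bullet describes, using the llama_model_has_decoder() API added here together with the pre-existing llama_model_has_encoder() (error handling omitted; the wrapper is illustrative):

```cpp
#include "llama.h"

// Run the encoder pass when present, and call llama_decode() only when the
// model actually has a decoder (encoder-only T5 models skip it).
static void warmup(llama_model * model, llama_context * ctx, llama_batch batch) {
    if (llama_model_has_encoder(model)) {
        llama_encode(ctx, batch);
    }
    if (llama_model_has_decoder(model)) {
        llama_decode(ctx, batch);
    }
}
```
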
10 months ago  gguf-py : fix double call to add_architecture() (#8952)
Matteo Mortari [Sat, 10 Aug 2024 05:58:49 +0000 (07:58 +0200)]
gguf-py : fix double call to add_architecture() (#8952)

Signed-off-by: tarilabs <redacted>
10 months ago  Merge commit from fork
Georgi Gerganov [Fri, 9 Aug 2024 20:03:21 +0000 (23:03 +0300)]
Merge commit from fork

10 months ago  llama : add support for lora adapters in T5 model (#8938)
fairydreaming [Fri, 9 Aug 2024 16:53:09 +0000 (18:53 +0200)]
llama : add support for lora adapters in T5 model (#8938)

Co-authored-by: Stanisław Szymczyk <redacted>
10 months ago  make : fix llava obj file race (#8946)
Georgi Gerganov [Fri, 9 Aug 2024 15:24:30 +0000 (18:24 +0300)]
make : fix llava obj file race (#8946)

ggml-ci

10 months ago  llama : better replace_all (cont) (#8926)
Georgi Gerganov [Fri, 9 Aug 2024 15:23:52 +0000 (18:23 +0300)]
llama : better replace_all (cont) (#8926)

* llama : better replace_all (cont)

ggml-ci

* code : deduplicate replace_all

ggml-ci

10 months ago  llava : support MiniCPM-V-2.5 (#7599)
tc-mb [Fri, 9 Aug 2024 10:33:53 +0000 (18:33 +0800)]
llava : support MiniCPM-V-2.5 (#7599)

* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update cmakelist

* update cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* address review comments

* address review comments

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove the directory reference at line 33 of the top-level CMakeLists.txt (not the example one, the one in the main dir)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* clip : style changes

* del common.h in clip

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix makefile error

* fix ubuntu-make error

* try fix clip

* try fix 1

---------

Co-authored-by: Hongji Zhu <redacted>
Co-authored-by: harvestingmoon <redacted>
Co-authored-by: Georgi Gerganov <redacted>
10 months ago  sync : ggml
Georgi Gerganov [Fri, 9 Aug 2024 07:03:48 +0000 (10:03 +0300)]
sync : ggml

10 months ago  whisper : use vulkan as gpu backend when available (whisper/2302)
Matt Stephenson [Tue, 16 Jul 2024 07:21:09 +0000 (03:21 -0400)]
whisper : use vulkan as gpu backend when available (whisper/2302)

* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <redacted>
* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <redacted>
---------

Signed-off-by: Matt Stephenson <redacted>
10 months ago  embedding : add --pooling option to README.md [no ci] (#8934)
Daniel Bevenius [Fri, 9 Aug 2024 06:33:30 +0000 (08:33 +0200)]
embedding : add --pooling option to README.md [no ci] (#8934)

This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.

The motivation for adding this option is that currently, if the model
used does not specify a pooling type the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples
section.

10 months ago  llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)
Daniel Bevenius [Fri, 9 Aug 2024 06:32:23 +0000 (08:32 +0200)]
llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)

10 months ago  server : add one level list nesting for embeddings (#8936)
Mathieu Geli [Fri, 9 Aug 2024 06:32:02 +0000 (08:32 +0200)]
server : add one level list nesting for embeddings (#8936)

10 months ago  llama : reduce useless copies when saving session (#8916)
compilade [Fri, 9 Aug 2024 03:54:00 +0000 (23:54 -0400)]
llama : reduce useless copies when saving session (#8916)

* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer
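
The first change, sketched (illustrative; the real writer interface in llama.cpp differs): a size-measuring "dummy" pass should count bytes rather than copy them.

```cpp
#include <cstddef>

// A dummy session writer only needs the total size, so it accumulates byte
// counts instead of memcpy-ing every tensor into a scratch buffer.
struct dummy_session_writer {
    size_t size_written = 0;

    void write(const void * /* src */, size_t n) {
        size_written += n; // no copy: just measure
    }
};
```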

10 months ago  gguf-py : simplify support for quant types (#8838)
compilade [Thu, 8 Aug 2024 17:33:09 +0000 (13:33 -0400)]
gguf-py : simplify support for quant types (#8838)

* gguf-py : use classes for quants

* convert_hf : simplify internal quantization type selection

* gguf-py : fix flake8 lint

* gguf-py : fix BF16 numpy view type

* gguf-py : remove LlamaFileTypeMap

Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.

* gguf-py : add generic quantize and dequantize functions

The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.

10 months ago  scripts : sync cann files (#0)
Georgi Gerganov [Thu, 8 Aug 2024 11:56:52 +0000 (14:56 +0300)]
scripts : sync cann files (#0)

10 months ago  scripts : fix sync filenames (#0)
Georgi Gerganov [Thu, 8 Aug 2024 11:40:12 +0000 (14:40 +0300)]
scripts : fix sync filenames (#0)

10 months ago  sync : ggml
Georgi Gerganov [Thu, 8 Aug 2024 10:19:47 +0000 (13:19 +0300)]
sync : ggml

10 months ago  ggml : ignore more msvc warnings (ggml/906)
Borislav Stanimirov [Wed, 7 Aug 2024 07:00:56 +0000 (10:00 +0300)]
ggml : ignore more msvc warnings (ggml/906)

10 months ago  metal : fix struct name (ggml/912)
Georgi Gerganov [Wed, 7 Aug 2024 06:57:00 +0000 (09:57 +0300)]
metal : fix struct name (ggml/912)

ggml-ci

10 months ago  metal : add abort callback (ggml/905)
Conrad Kramer [Wed, 7 Aug 2024 06:55:49 +0000 (02:55 -0400)]
metal : add abort callback (ggml/905)

10 months ago  make : clean llamafile objects (#8923)
Pablo Duboue [Thu, 8 Aug 2024 08:44:51 +0000 (04:44 -0400)]
make : clean llamafile objects (#8923)

`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`

10 months ago  make : use C compiler to build metal embed object (#8899)
slaren [Wed, 7 Aug 2024 16:24:05 +0000 (18:24 +0200)]
make : use C compiler to build metal embed object (#8899)

* make : use C compiler to build metal embed object

* use rm + rmdir to avoid -r flag in rm

10 months ago  ggml-backend : fix async copy from CPU (#8897)
slaren [Wed, 7 Aug 2024 11:29:02 +0000 (13:29 +0200)]
ggml-backend : fix async copy from CPU (#8897)

* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same

10 months ago  [SYCL] Updated SYCL device filtering (#8901)
Ouadie EL FAROUKI [Wed, 7 Aug 2024 10:25:36 +0000 (11:25 +0100)]
[SYCL] Updated SYCL device filtering (#8901)

* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme

10 months ago  CUDA/HIP: fix tests/test-backend-ops (#8896)
Johannes Gäßler [Wed, 7 Aug 2024 07:07:52 +0000 (09:07 +0200)]
CUDA/HIP: fix tests/test-backend-ops (#8896)