git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Diogo Teles Sant'Anna [Mon, 12 Aug 2024 16:28:23 +0000 (13:28 -0300)]
ci : fix github workflow vulnerable to script injection (#9008)
Signed-off-by: Diogo Teles Sant'Anna <redacted>
Radoslav Gerganov [Mon, 12 Aug 2024 16:17:03 +0000 (19:17 +0300)]
ci : enable RPC in all of the released builds (#9006)
ref: #8912
Nico Bosshard [Mon, 12 Aug 2024 15:13:59 +0000 (17:13 +0200)]
llama : model-based max number of graph nodes calculation (#8970)
* llama : model-based max number of graph nodes calculation
* Update src/llama.cpp
---------
Co-authored-by: slaren <redacted>
Frank Mai [Mon, 12 Aug 2024 12:45:50 +0000 (20:45 +0800)]
docs: introduce gpustack and gguf-parser (#8873)
* readme: introduce gpustack
GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.
Signed-off-by: thxCode <redacted>
* readme: introduce gguf-parser
GGUF Parser is a tool to review/check GGUF files and estimate their
memory usage without downloading the whole model.
Signed-off-by: thxCode <redacted>
---------
Signed-off-by: thxCode <redacted>
DavidKorczynski [Mon, 12 Aug 2024 12:36:41 +0000 (13:36 +0100)]
grammar-parser : fix possible null-deref (#9004)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680
Signed-off-by: David Korczynski <redacted>
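A hedged sketch of this class of fix, with hypothetical helpers rather than the actual grammar-parser source: a sub-parser that can return a null pointer on malformed input must be checked before it is dereferenced.
```cpp
// Hypothetical parser helpers illustrating the null-deref pattern; not the
// actual grammar-parser code.
static const char * skip_spaces(const char * p) {
    while (*p == ' ') ++p;
    return *p ? p : nullptr; // nullptr signals "unexpected end of input"
}

static bool expect_char(const char * p, char expected) {
    p = skip_spaces(p);
    return p != nullptr && *p == expected; // the fix: check for nullptr before *p
}
```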
DavidKorczynski [Mon, 12 Aug 2024 12:21:41 +0000 (13:21 +0100)]
ggml: fix div-by-zero (#9003)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724
To access the above bug, you need to log in using one of the emails
listed in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5
Signed-off-by: David Korczynski <redacted>
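The general pattern behind this class of fix, sketched with hypothetical names (the actual ggml change differs): a divisor derived from untrusted file input must be validated before dividing.
```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical example: reject a zero (or negative) divisor coming from
// untrusted input instead of dividing by it.
static bool compute_rows(int64_t n_elements, int64_t n_per_row, int64_t & n_rows) {
    if (n_per_row <= 0) {
        std::fprintf(stderr, "invalid n_per_row: %lld\n", (long long) n_per_row);
        return false;
    }
    n_rows = n_elements / n_per_row;
    return true;
}
```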
Liu Jia [Mon, 12 Aug 2024 09:46:03 +0000 (17:46 +0800)]
Fix a spelling mistake (#9001)
Georgi Gerganov [Mon, 12 Aug 2024 08:02:01 +0000 (11:02 +0300)]
py : fix requirements check '==' -> '~=' (#8982)
* py : fix requirements check '==' -> '~='
* cont : fix the fix
* ci : run on all requirements.txt
Georgi Gerganov [Mon, 12 Aug 2024 07:21:50 +0000 (10:21 +0300)]
server : handle models with missing EOS token (#8997)
ggml-ci
compilade [Sun, 11 Aug 2024 18:45:41 +0000 (14:45 -0400)]
gguf-py : Numpy dequantization for most types (#8939)
* gguf-py : Numpy dequantization for most types
* gguf-py : Numpy dequantization for grid-based i-quants
Georgi Gerganov [Sun, 11 Aug 2024 13:58:58 +0000 (16:58 +0300)]
flake.lock: Update (#8979)
Neo Zhang [Sun, 11 Aug 2024 08:37:43 +0000 (16:37 +0800)]
update guide (#8909)
Co-authored-by: Neo Zhang <>
fairydreaming [Sun, 11 Aug 2024 08:35:26 +0000 (10:35 +0200)]
llama : check all graph nodes when searching for result_embd_pooled (#8956)
Co-authored-by: Stanisław Szymczyk <redacted>
Markus Tavenrath [Sun, 11 Aug 2024 08:09:09 +0000 (10:09 +0200)]
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943)
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Judging by the code, which either launches compute kernels or copies tensors, adding barriers only for shader reads/writes and transfers seems to be sufficient.
* Fix small typo
---------
Co-authored-by: 0cc4m <redacted>
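A minimal sketch of the first optimization, with illustrative code rather than the actual Vulkan backend: keep one reusable buffer in a context object so hot paths stop allocating a temporary std::vector per call.
```cpp
#include <vector>

// Illustrative only: the scratch vector grows once and keeps its capacity,
// so repeated submissions no longer allocate.
struct submit_ctx {
    std::vector<int> scratch;
};

static void submit(submit_ctx & ctx, int n) {
    ctx.scratch.clear(); // clear() preserves capacity: no per-call allocation
    for (int i = 0; i < n; ++i) {
        ctx.scratch.push_back(i);
    }
    // ... record and submit work built from ctx.scratch ...
}
```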
slaren [Sat, 10 Aug 2024 13:42:10 +0000 (15:42 +0200)]
metal : fix uninitialized abort_callback (#8968)
Xuan Son Nguyen [Sat, 10 Aug 2024 11:04:40 +0000 (13:04 +0200)]
llama : default n_swa for phi-3 (#8931)
* default n_swa for phi-3
* fix
* double check swa
fairydreaming [Sat, 10 Aug 2024 09:43:26 +0000 (11:43 +0200)]
Add support for encoder-only T5 models (#8900)
* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models
---------
Co-authored-by: Stanisław Szymczyk <redacted>
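A sketch of how the new API function gates warmup, per the bullets above; llama_model_has_encoder()/llama_model_has_decoder() are real llama.h entry points, while the surrounding function is illustrative.
```cpp
#include "llama.h"

static void warmup(llama_model * model, llama_context * ctx, llama_batch batch) {
    if (llama_model_has_encoder(model)) {
        llama_encode(ctx, batch); // for T5ENCODER this is the only pass
    }
    if (llama_model_has_decoder(model)) {
        llama_decode(ctx, batch); // skipped for encoder-only models
    }
}
```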
Matteo Mortari [Sat, 10 Aug 2024 05:58:49 +0000 (07:58 +0200)]
gguf-py : fix double call to add_architecture() (#8952)
Signed-off-by: tarilabs <redacted>
Georgi Gerganov [Fri, 9 Aug 2024 20:03:21 +0000 (23:03 +0300)]
Merge commit from fork
fairydreaming [Fri, 9 Aug 2024 16:53:09 +0000 (18:53 +0200)]
llama : add support for lora adapters in T5 model (#8938)
Co-authored-by: Stanisław Szymczyk <redacted>
Georgi Gerganov [Fri, 9 Aug 2024 15:24:30 +0000 (18:24 +0300)]
make : fix llava obj file race (#8946)
ggml-ci
Georgi Gerganov [Fri, 9 Aug 2024 15:23:52 +0000 (18:23 +0300)]
llama : better replace_all (cont) (#8926)
* llama : better replace_all (cont)
ggml-ci
* code : deduplicate replace_all
ggml-ci
tc-mb [Fri, 9 Aug 2024 10:33:53 +0000 (18:33 +0800)]
llava : support MiniCPM-V-2.5 (#7599)
* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update CMakeLists
* update CMakeLists
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when the image needs to be resized smaller
* address review comments
* address review comments
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove the directory at line 33 of the top-level CMakeLists.txt (in the main dir, not in examples)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* clip : style changes
* del common.h in clip
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix makefile error
* fix ubuntu-make error
* try fix clip
* try fix 1
---------
Co-authored-by: Hongji Zhu <redacted>
Co-authored-by: harvestingmoon <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 9 Aug 2024 07:03:48 +0000 (10:03 +0300)]
sync : ggml
Matt Stephenson [Tue, 16 Jul 2024 07:21:09 +0000 (03:21 -0400)]
whisper : use vulkan as gpu backend when available (whisper/2302)
* ggml: use vulkan as gpu backend when available
Signed-off-by: Matt Stephenson <redacted>
* whisper: enable using vk as default buffer type
Signed-off-by: Matt Stephenson <redacted>
---------
Signed-off-by: Matt Stephenson <redacted>
Daniel Bevenius [Fri, 9 Aug 2024 06:33:30 +0000 (08:33 +0200)]
embedding : add --pooling option to README.md [no ci] (#8934)
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.
The motivation for adding this option is that currently, if the model
used does not specify a pooling type, the embedding example fails
with the following error message:
```console
main: error: pooling type NONE not supported
```
This commit also updates the name of the executable in the examples
section.
Daniel Bevenius [Fri, 9 Aug 2024 06:32:23 +0000 (08:32 +0200)]
llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)
Mathieu Geli [Fri, 9 Aug 2024 06:32:02 +0000 (08:32 +0200)]
server : add one level list nesting for embeddings (#8936)
compilade [Fri, 9 Aug 2024 03:54:00 +0000 (23:54 -0400)]
llama : reduce useless copies when saving session (#8916)
* llama : avoid useless copies in dummy session writer
* llama : avoid double tensor copy when saving session to buffer
compilade [Thu, 8 Aug 2024 17:33:09 +0000 (13:33 -0400)]
gguf-py : simplify support for quant types (#8838)
* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.
* gguf-py : add generic quantize and dequantize functions
The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
Georgi Gerganov [Thu, 8 Aug 2024 11:56:52 +0000 (14:56 +0300)]
scripts : sync cann files (#0)
Georgi Gerganov [Thu, 8 Aug 2024 11:40:12 +0000 (14:40 +0300)]
scripts : fix sync filenames (#0)
Georgi Gerganov [Thu, 8 Aug 2024 10:19:47 +0000 (13:19 +0300)]
sync : ggml
Borislav Stanimirov [Wed, 7 Aug 2024 07:00:56 +0000 (10:00 +0300)]
ggml : ignore more msvc warnings (ggml/906)
Georgi Gerganov [Wed, 7 Aug 2024 06:57:00 +0000 (09:57 +0300)]
metal : fix struct name (ggml/912)
ggml-ci
Conrad Kramer [Wed, 7 Aug 2024 06:55:49 +0000 (02:55 -0400)]
metal : add abort callback (ggml/905)
Pablo Duboue [Thu, 8 Aug 2024 08:44:51 +0000 (04:44 -0400)]
make : clean llamafile objects (#8923)
`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`
slaren [Wed, 7 Aug 2024 16:24:05 +0000 (18:24 +0200)]
make : use C compiler to build metal embed object (#8899)
* make : use C compiler to build metal embed object
* use rm + rmdir to avoid -r flag in rm
slaren [Wed, 7 Aug 2024 11:29:02 +0000 (13:29 +0200)]
ggml-backend : fix async copy from CPU (#8897)
* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same
Ouadie EL FAROUKI [Wed, 7 Aug 2024 10:25:36 +0000 (11:25 +0100)]
[SYCL] Updated SYCL device filtering (#8901)
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
Johannes Gäßler [Wed, 7 Aug 2024 07:07:52 +0000 (09:07 +0200)]
CUDA/HIP: fix tests/test-backend-ops (#8896)
Zhenwei Jin [Wed, 7 Aug 2024 01:01:06 +0000 (09:01 +0800)]
llama-bench : add support for getting cpu info on Windows (#8824)
* Add support for getting cpu info on Windows for llama_bench
* refactor
---------
Co-authored-by: slaren <redacted>
Daniel Bevenius [Tue, 6 Aug 2024 23:43:00 +0000 (01:43 +0200)]
quantize : update usage comment in quantize.cpp (#8889)
This commit updates the usage comment in quantize.cpp to reflect the
new name of the executable, which is llama-quantize.
Nexes the Old [Tue, 6 Aug 2024 23:41:54 +0000 (01:41 +0200)]
typo correction (#8891)
Xuan Son Nguyen [Tue, 6 Aug 2024 15:33:39 +0000 (17:33 +0200)]
server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style
Johannes Gäßler [Tue, 6 Aug 2024 15:13:55 +0000 (17:13 +0200)]
CUDA: fix padding logic for FP16/FP32 (#8884)
Daniel Bevenius [Tue, 6 Aug 2024 14:44:35 +0000 (16:44 +0200)]
simple : update name of executable to llama-simple (#8885)
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
Jaeden Amero [Tue, 6 Aug 2024 13:21:47 +0000 (17:21 +0400)]
cmake : Link vulkan-shaders-gen with pthreads (#8835)
When using CMake to build with Vulkan support, compiling
vulkan-shaders-gen fails due to a missing CMakeLists.txt specification
to link vulkan-shaders-gen with the threading library, resulting in the
following error.
[5/172] Linking CXX executable bin/vulkan-shaders-gen
FAILED: bin/vulkan-shaders-gen
: && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && :
ld: error: undefined symbol: pthread_create
>>> referenced by vulkan-shaders-gen.cpp
>>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread**,
>>> void* (*)(void*), void*))
c++: error: linker command failed with exit code 1 (use -v to see invocation)
[6/172] Generating build details from Git
-- Found Git: /usr/local/bin/git (found version "2.45.2")
ninja: build stopped: subcommand failed.
Add the CMakeLists.txt specification to link vulkan-shaders-gen with the
threading library and fix the above error.
Fixes #8834
MaggotHATE [Tue, 6 Aug 2024 11:32:03 +0000 (16:32 +0500)]
[Vulkan] Fix compilation of `vulkan-shaders-gen` on w64devkit after `e31a4f6` (#8880)
* Fix compilation issue in `vulkan-shaders-gen`
https://github.com/ggerganov/llama.cpp/commit/e31a4f679779220312c165b0f5994c680a610e38 broke compilation on w64devkit. Including `algorithm` seems to fix that.
* Guard it under `#ifdef _WIN32`
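The gist of the fix, sketched from the two bullets above (not the exact diff):
```cpp
// w64devkit's standard headers do not pull in <algorithm> transitively,
// so include it explicitly there; guarded to leave other platforms untouched.
#ifdef _WIN32
#include <algorithm>
#endif
```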
Georgi Gerganov [Tue, 6 Aug 2024 08:48:01 +0000 (11:48 +0300)]
contributing : add note about write access
Molly Sophia [Tue, 6 Aug 2024 07:26:46 +0000 (15:26 +0800)]
ggml : add epsilon as a parameter for group_norm (#8818)
Signed-off-by: Molly Sophia <redacted>
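A sketch of the resulting API shape, with the parameter order assumed from the commit title rather than copied from ggml.h: epsilon becomes an explicit argument instead of a constant hard-coded inside the op.
```cpp
// Assumed declaration; see ggml.h for the authoritative signature.
struct ggml_tensor * ggml_group_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   n_groups,
        float                 eps);
```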
Douglas Hanley [Tue, 6 Aug 2024 07:20:54 +0000 (02:20 -0500)]
convert : add support for XLMRoberta embedding models (#8658)
* add conversion for bge-m3; small fix in unigram tokenizer
* clean up and simplify XLMRoberta conversion
Mengqing Cao [Tue, 6 Aug 2024 04:42:42 +0000 (12:42 +0800)]
[CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871)
* cann: fix ggml_backend_cann_buffer_get_tensor
1. fix data ptr offset
2. enable the acquisition of incomplete tensors
* fix backend cann set_tensor
Neo Zhang [Tue, 6 Aug 2024 01:09:12 +0000 (09:09 +0800)]
[SYCL] correct cmd name (#8877)
Liu Jia [Mon, 5 Aug 2024 16:14:10 +0000 (00:14 +0800)]
common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)
Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
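A minimal sketch of the replacement described above; the field names are assumptions, the tuple type is quoted from the message.
```cpp
// Assumed shape of the struct that replaces
// std::tuple<struct llama_model *, struct llama_context *>:
struct llama_init_result {
    struct llama_model   * model;
    struct llama_context * context;
};
```
Returning a named struct instead of a tuple makes call sites self-documenting: `res.model` reads better than `std::get<0>(res)`.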
wangshuai09 [Mon, 5 Aug 2024 13:10:37 +0000 (21:10 +0800)]
cann: fix buffer_num and slow runtime speed (#8865)
Eric Curtin [Mon, 5 Aug 2024 12:45:01 +0000 (13:45 +0100)]
readme : add ramalama to the available UIs (#8811)
ramalama is a repo-agnostic, boring CLI tool that supports pulling from
ollama, huggingface and OCI registries.
Signed-off-by: Eric Curtin <redacted>
Justine Tunney [Mon, 5 Aug 2024 12:43:40 +0000 (05:43 -0700)]
ggml : fix overflows in elu function (#8866)
It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
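A sketch of the idea (ELU with alpha = 1; not the exact ggml code): expm1 computes exp(x) - 1 directly, staying accurate for tiny |x| and avoiding the intermediate issues the message refers to.
```cpp
#include <cmath>

// Reference ELU using expm1 instead of the naive expf(x) - 1.0f.
static inline float elu_ref(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```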
Brian [Mon, 5 Aug 2024 11:15:28 +0000 (21:15 +1000)]
py: Add more authorship metadata from model card (#8810)
* py: add more authorship metadata from model card
* fixup! py: add more authorship metadata from model card
fairydreaming [Mon, 5 Aug 2024 07:38:01 +0000 (09:38 +0200)]
Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support (#8858)
* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token
* llama : find Llama-3.1 <|eom_id|> token id during vocab loading
* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation
---------
Co-authored-by: Stanisław Szymczyk <redacted>
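A sketch of the effect on a generation loop; llama_token_is_eog() is the real llama.h check, while sample_next() is a hypothetical stand-in for the sampling code.
```cpp
#include "llama.h"

extern llama_token sample_next(llama_context * ctx); // hypothetical sampler

static void generate(llama_model * model, llama_context * ctx) {
    while (true) {
        const llama_token tok = sample_next(ctx);
        if (llama_token_is_eog(model, tok)) { // now also true for <|eom_id|>
            break;
        }
        // ... accept tok and keep decoding ...
    }
}
```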
stduhpf [Mon, 5 Aug 2024 06:18:27 +0000 (08:18 +0200)]
cmake: fix paths for vulkan shaders compilation on Windows (#8573)
* Vulkan-shaders: attempt to fix compilation on Windows
* fix mismatched parenthesis
BarfingLemurs [Mon, 5 Aug 2024 05:54:10 +0000 (01:54 -0400)]
readme : update model list (#8851)
Georgi Gerganov [Mon, 5 Aug 2024 05:53:39 +0000 (08:53 +0300)]
llama : better replace_all (#8852)
0cc4m [Mon, 5 Aug 2024 05:52:55 +0000 (07:52 +0200)]
vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855)
* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered
Georgi Gerganov [Sun, 4 Aug 2024 16:13:25 +0000 (19:13 +0300)]
sync : ggml
ggml-ci
0cc4m [Sun, 4 Aug 2024 15:28:08 +0000 (17:28 +0200)]
vulkan : implement Stable Diffusion operators (ggml/904)
* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op
Daniel Bevenius [Mon, 29 Jul 2024 13:06:06 +0000 (15:06 +0200)]
ggml : move c parameter comment to ggml_rope_ext (ggml/901)
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).
Signed-off-by: Daniel Bevenius <redacted>
wangshuai09 [Mon, 5 Aug 2024 04:22:30 +0000 (12:22 +0800)]
cann: support q4_0 model (#8822)
Brandon Squizzato [Sun, 4 Aug 2024 18:17:16 +0000 (14:17 -0400)]
Install curl in runtime layer (#8693)
ardfork [Sun, 4 Aug 2024 18:16:23 +0000 (18:16 +0000)]
Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params
* Add fallback for max_tokens
Brian Cunnie [Sun, 4 Aug 2024 10:55:03 +0000 (03:55 -0700)]
batched-bench : handle empty `-npl` (#8839)
* [example] batched-bench "segmentation fault"
When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.
The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.
This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size
to 1; otherwise it is set to the original value, the result of `max_element()`.
The following crash, seen when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`, is fixed:
```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
69 llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
70
71 // ensure enough sequences are available
-> 72 ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
* Update examples/batched-bench/batched-bench.cpp
Co-authored-by: compilade <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: compilade <redacted>
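The guard described above, condensed into a sketch (n_pl is the vector of parallel-prompt counts from the excerpt):
```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Calling max_element() on an empty vector dereferences end() and crashes;
// fall back to 1 when no -npl values were given.
static uint32_t pick_n_seq_max(const std::vector<int> & n_pl) {
    return n_pl.empty() ? 1 : (uint32_t) *std::max_element(n_pl.begin(), n_pl.end());
}
```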
Daniel Bevenius [Sat, 3 Aug 2024 13:07:47 +0000 (15:07 +0200)]
baby-llama : remove duplicate vector include
Georgi Gerganov [Sun, 4 Aug 2024 02:53:20 +0000 (05:53 +0300)]
flake.lock: Update (#8847)
jdomke [Sat, 3 Aug 2024 16:34:41 +0000 (01:34 +0900)]
ggml : reading the runtime sve config of the cpu (#8709)
* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets
---------
Co-authored-by: domke <redacted>
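A sketch of the one-time runtime query described above; svcntb() is the real ACLE intrinsic returning the SVE register width in bytes, while the caching and naming are assumptions.
```cpp
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

static int ggml_sve_vl_bytes = 0; // prefixed to avoid conflicts, per the bullets

// Query the vector length once and cache it, avoiding the performance drop
// from re-reading the config in hot loops.
static int get_sve_vl_bytes() {
    if (ggml_sve_vl_bytes == 0) {
        ggml_sve_vl_bytes = (int) svcntb();
    }
    return ggml_sve_vl_bytes;
}
#endif
```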
Sigbjørn Skjæret [Fri, 2 Aug 2024 19:11:39 +0000 (21:11 +0200)]
Fix conversion of unnormalized BF16->BF16 weights (#7843)
* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup
---------
Co-authored-by: Francis Couture-Harpin <redacted>
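A reference fp32 -> bf16 conversion in the spirit of the bullets above: round to nearest even, quiet NaNs, and no flushing of subnormals to zero. This is the standard construction, not necessarily the exact ggml code.
```cpp
#include <cstdint>
#include <cstring>

static uint16_t fp32_to_bf16_ref(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if ((u & 0x7fffffff) > 0x7f800000) {
        return (uint16_t) ((u >> 16) | 64); // NaN: set a mantissa bit to keep it quiet
    }
    return (uint16_t) ((u + (0x7fff + ((u >> 16) & 1))) >> 16); // round to nearest even
}
```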
Mengqing Cao [Fri, 2 Aug 2024 08:50:53 +0000 (16:50 +0800)]
cann: Fix ggml_cann_im2col for 1D im2col (#8819)
* fix ggml_cann_im2col for 1D im2col
* fix build warning
Ouadie EL FAROUKI [Fri, 2 Aug 2024 00:55:17 +0000 (01:55 +0100)]
[SYCL] Fixing wrong VDR iq4nl value (#8812)
matteo [Thu, 1 Aug 2024 21:28:28 +0000 (23:28 +0200)]
ggml-cuda: Adding support for unified memory (#8035)
* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
Co-authored-by: Johannes Gäßler <redacted>
* adding one more case where the feature should not be enabled
---------
Co-authored-by: matteo serva <redacted>
Co-authored-by: Johannes Gäßler <redacted>
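A sketch of the core idea (the real integration lives in ggml-cuda and is gated behind an environment variable; the helper below is illustrative): managed allocations can exceed dedicated VRAM and migrate between host and device on demand.
```cpp
#include <cuda_runtime.h>

// Illustrative allocator: unified memory via cudaMallocManaged as an
// alternative to plain device memory.
static void * alloc_device(size_t size, bool unified) {
    void * ptr = nullptr;
    cudaError_t err = unified
        ? cudaMallocManaged(&ptr, size) // pageable between host and device
        : cudaMalloc(&ptr, size);       // default path: device-only memory
    return err == cudaSuccess ? ptr : nullptr;
}
```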
Alex O'Connell [Thu, 1 Aug 2024 16:53:46 +0000 (12:53 -0400)]
Build: Only include execinfo.h on linux systems that support it (#8783)
* Only enable backtrace on GLIBC linux systems
* fix missing file from copy
* use glibc macro instead of defining a custom one
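The gist of the fix, per the bullets above: gate on the macro glibc itself defines instead of a custom one (sketch, not the exact ggml diff).
```cpp
#include <cstdio> // any libc header defines __GLIBC__ on glibc systems

#if defined(__linux__) && defined(__GLIBC__)
#include <execinfo.h> // backtrace(), backtrace_symbols()
#endif
```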
slaren [Thu, 1 Aug 2024 13:26:22 +0000 (15:26 +0200)]
cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (#8800)
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X
* update asserts
* only use dmmv for supported types
* add test
wangshuai09 [Thu, 1 Aug 2024 02:39:05 +0000 (10:39 +0800)]
cann: support q8_0 for Ascend backend (#8805)
Igor Okulist [Wed, 31 Jul 2024 23:59:09 +0000 (18:59 -0500)]
server : update llama-server embedding flag documentation (#8779)
Fixes #8763
Clint Herron [Wed, 31 Jul 2024 19:51:06 +0000 (15:51 -0400)]
Build: Fix potential race condition (#8781)
* Fix potential race condition as pointed out by @fairydreaming in #8776
* Reference the .o rather than rebuilding every time.
* Adding in CXXFLAGS and LDFLAGS
* Removing unnecessary linker flags.
pculliton [Wed, 31 Jul 2024 15:12:10 +0000 (11:12 -0400)]
Adding Gemma 2 2B configs (#8784)
* Adding Gemma 2 2B configs
Updates to Q scaling and Gemma 2 model sizes to match v2 2B model.
* Update src/llama.cpp
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Borislav Stanimirov [Wed, 31 Jul 2024 13:40:08 +0000 (16:40 +0300)]
cmake : fix use of external ggml (#8787)
Someone [Tue, 30 Jul 2024 20:35:30 +0000 (23:35 +0300)]
nix: cuda: rely on propagatedBuildInputs (#8772)
Listing individual outputs is no longer necessary to reduce the runtime closure size after https://github.com/NixOS/nixpkgs/pull/323056.
Brian [Tue, 30 Jul 2024 14:57:03 +0000 (00:57 +1000)]
py: add_array() will not add to kv store if value is an empty array (#8774)
* gguf_writer.py: add_array() should not add to kv store if empty
* Apply suggestions from code review
I was wondering if there was a specific reason for `if val`, but good to hear we can safely use `len(val) == 0`.
Co-authored-by: compilade <redacted>
---------
Co-authored-by: compilade <redacted>
l3utterfly [Tue, 30 Jul 2024 14:40:18 +0000 (23:40 +0900)]
added android implementation of ggml_print_backtrace_symbols (#8751)
* added android implementation of ggml_print_backtrace_symbols
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Tue, 30 Jul 2024 12:58:57 +0000 (15:58 +0300)]
flake.lock: Update (#8729)
wangshuai09 [Tue, 30 Jul 2024 10:37:35 +0000 (18:37 +0800)]
cann: update cmake (#8765)
zhentaoyu [Tue, 30 Jul 2024 06:56:51 +0000 (14:56 +0800)]
[SYCL] Add `TIMESTEP_EMBEDDING` OP (#8707)
Signed-off-by: zhentaoyu <redacted>
CarterLi999 [Mon, 29 Jul 2024 16:38:34 +0000 (00:38 +0800)]
ggml: bugfix: fix inactive elements being handled with the agnostic policy in the RISC-V vector code (#8748)
In this code we want inactive elements to retain the value they
previously held when mask[i] is false, so we should use the undisturbed
policy. With the default agnostic policy of the RVV intrinsics, these
values may either be preserved or be overwritten with 1s (see the
sketch after this entry).
Co-authored-by: carter.li <redacted>
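A scalar model of the two masking policies, illustrative only (no actual RVV intrinsics): mask-undisturbed keeps inactive elements, while the default agnostic policy is free to overwrite them.
```cpp
// Scalar analogue: the "undisturbed" policy is the if-without-else below;
// "agnostic" would permit writing all-1s into the skipped elements.
static void masked_update(float * dst, const float * src, const bool * mask, int n) {
    for (int i = 0; i < n; i++) {
        if (mask[i]) {
            dst[i] = src[i]; // active element: updated
        }
        // inactive element: left untouched
    }
}
```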
R0CKSTAR [Mon, 29 Jul 2024 12:56:12 +0000 (20:56 +0800)]
cuda : organize vendor-specific headers into vendors directory (#8746)
Signed-off-by: Xiaodong Ye <redacted>
Meng, Hengyu [Mon, 29 Jul 2024 02:50:27 +0000 (10:50 +0800)]
[SYCL] add conv support (#8688)
Johannes Gäßler [Sun, 28 Jul 2024 20:32:44 +0000 (22:32 +0200)]
cmake: use 1 more thread for non-ggml in CI (#8740)
Austin [Sun, 28 Jul 2024 07:52:42 +0000 (03:52 -0400)]
chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477)
* chore: Fix compiler warnings, add help text, improve CLI options
* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions
* chore : Add ignore rule for vulkan shader generator
Signed-off-by: teleprint-me <redacted>
* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
Co-authored-by: 0cc4m <redacted>
* chore : Remove void and apply C++ style empty parameters
* chore : Remove void and apply C++ style empty parameters
---------
Signed-off-by: teleprint-me <redacted>
Co-authored-by: 0cc4m <redacted>
compilade [Sun, 28 Jul 2024 04:42:05 +0000 (00:42 -0400)]
llama : refactor session file management (#8699)
* llama : refactor session file management
* llama : saving and restoring state checks for overflow
The size of the buffers is now given to the functions working with
them, since otherwise a truncated file could cause out-of-bounds reads.
* llama : stream from session file instead of copying into a big buffer
Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.
* llama : fix uint64_t format type
* llama : various integer type cast and format string fixes
Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.
* llama : remove _context suffix for llama_data_context
* llama : fix session file loading
llama_state_get_size cannot be used to get the max size anymore.
* llama : more graceful error handling of invalid session files
* llama : remove LLAMA_MAX_RNG_STATE
It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.
* llama : cast seq_id in comparison with unsigned n_seq_max
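A sketch of the overflow checks described above, with a hypothetical reader type: the buffer size is passed in and every read is validated against it, so a truncated session file fails cleanly instead of reading out of bounds.
```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>

struct buf_reader {
    const uint8_t * data;
    size_t          size;
    size_t          off = 0;

    void read(void * dst, size_t n) {
        if (n > size - off) { // off <= size is an invariant, so no wraparound
            throw std::runtime_error("session file too short");
        }
        std::memcpy(dst, data + off, n);
        off += n;
    }
};
```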
R0CKSTAR [Sat, 27 Jul 2024 23:41:25 +0000 (07:41 +0800)]
feat: Support Moore Threads GPU (#8383)
* Update doc for MUSA
Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in Makefile
Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in CMake
Signed-off-by: Xiaodong Ye <redacted>
* CUDA => MUSA
Signed-off-by: Xiaodong Ye <redacted>
* MUSA adds support for __vsubss4
Signed-off-by: Xiaodong Ye <redacted>
* Fix CI build failure
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Georgi Gerganov [Sat, 27 Jul 2024 15:08:31 +0000 (18:08 +0300)]
scripts : sync vulkan-shaders (#0)
Georgi Gerganov [Sat, 27 Jul 2024 14:19:35 +0000 (17:19 +0300)]
scripts : sync ggml-aarch64 sources