3 months ago  rpc : update README for cache usage (#12620)
Radoslav Gerganov [Fri, 28 Mar 2025 07:44:13 +0000 (09:44 +0200)]
rpc : update README for cache usage (#12620)

3 months ago  llamafile : ppc64le GEMV forwarding for FP32. (#12594)
amritahs-ibm [Fri, 28 Mar 2025 07:43:22 +0000 (13:13 +0530)]
llamafile : ppc64le GEMV forwarding for FP32. (#12594)

This patch enables usage of MMA when one of the
dimensions of the matrix (i.e., either M or N) is 1.
This is useful for token generation, where N < 2.

The concept of 'GEMV forwarding' is used: when one
of the matrices has a single row/column, the elements
are broadcast instead of using a packing routine to
prepack the matrix elements.
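
A minimal C++ sketch of the forwarding idea (names and shapes here are illustrative, not the actual ppc64le MMA kernel code):

```cpp
#include <cstddef>

// GEMV forwarding: when B has a single column (N == 1), skip the packing
// routine and broadcast B's elements directly into the per-row dot
// products instead of prepacking a tile.
static void gemv_forward(const float * A, const float * b, float * c,
                         size_t M, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        float sum = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            sum += A[i*K + k] * b[k];  // b[k] is reused (broadcast) for every row
        }
        c[i] = sum;
    }
}

static void gemm(const float * A, const float * B, float * C,
                 size_t M, size_t N, size_t K) {
    if (N == 1) {  // token generation: forward to the GEMV path
        gemv_forward(A, B, C, M, K);
        return;
    }
    // general path: prepack B into tiles, then run the MMA kernel ...
}
```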

This change results in a 5% - 15% improvement in total
speed (i.e., all tokens / total time) across various
batch sizes, compared with the corresponding dot-product
implementation.

The patch was tested with FP32 models of Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf on an IBM POWER10 machine.

Signed-off-by: Amrita H S <redacted>
3 months ago  rpc : send hash when tensor data is above some fixed threshold (#12496)
Radoslav Gerganov [Fri, 28 Mar 2025 06:18:04 +0000 (08:18 +0200)]
rpc : send hash when tensor data is above some fixed threshold (#12496)

* rpc : send hash when tensor data is above some fixed threshold (see the sketch after this entry)

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
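
A hedged client-side sketch of the idea (the hash function, threshold value, and cache layout below are assumptions, not the PR's exact choices):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <filesystem>
#include <string>

// Assumed FNV-1a 64-bit hash; the hash actually used by the RPC backend
// may differ.
static uint64_t fnv1a64(const uint8_t * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Illustrative threshold; the PR only says "some fixed threshold".
static constexpr size_t HASH_THRESHOLD = 10 * 1024 * 1024;

// Large tensors are referenced by hash: the server looks the blob up in
// its local cache (under $HOME/.cache/llama.cpp) and requests the full
// data only on a cache miss.
static bool should_send_hash(size_t tensor_bytes) {
    return tensor_bytes > HASH_THRESHOLD;
}

static std::filesystem::path cache_path_for(uint64_t hash) {
    const char * home = std::getenv("HOME");
    return std::filesystem::path(home ? home : ".") /
           ".cache" / "llama.cpp" / std::to_string(hash);
}
```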

3 months ago  server : Support listening on a unix socket (#12613)
Piotr [Thu, 27 Mar 2025 22:41:04 +0000 (23:41 +0100)]
server : Support listening on a unix socket (#12613)

* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <redacted>
* server : Allow running the server example on a unix socket (see the sketch after this entry)

Signed-off-by: Piotr Stankiewicz <redacted>
---------

Signed-off-by: Piotr Stankiewicz <redacted>
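
A short sketch of what this enables with cpp-httplib (the route and socket path are illustrative; requires a cpp-httplib version with AF_UNIX support):

```cpp
#include <sys/socket.h>  // AF_UNIX
#include "httplib.h"     // cpp-httplib (version with AF_UNIX support)

int main() {
    httplib::Server svr;
    svr.Get("/health", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("ok", "text/plain");
    });
    // With AF_UNIX the "host" argument is a filesystem path; the port is
    // ignored.
    svr.set_address_family(AF_UNIX);
    svr.listen("/tmp/llama-server.sock", 80);
    return 0;
}
```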
3 months ago  media : add SVG logo [no ci] (#12616)
Georgi Gerganov [Thu, 27 Mar 2025 21:09:05 +0000 (23:09 +0200)]
media : add SVG logo [no ci] (#12616)

3 months ago  opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)
lhez [Thu, 27 Mar 2025 15:08:08 +0000 (08:08 -0700)]
opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)

* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope

3 months ago  llama : add PLM GGUF Conversion & Inference Support (#12457)
Si1w [Thu, 27 Mar 2025 10:49:15 +0000 (10:49 +0000)]
llama : add PLM GGUF Conversion & Inference Support (#12457)

* add edgellm model arch [conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refactor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig errors

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch

3 months ago  model : restore support for T5Encoder (#12590)
HighDoping [Thu, 27 Mar 2025 10:43:33 +0000 (18:43 +0800)]
model : restore support for T5Encoder (#12590)

3 months ago  convert : Support Qwen2_5_VLForConditionalGeneration (#12595)
Csaba Kecskemeti [Thu, 27 Mar 2025 10:11:23 +0000 (03:11 -0700)]
convert : Support Qwen2_5_VLForConditionalGeneration (#12595)

3 months ago  sync : ggml
Georgi Gerganov [Thu, 27 Mar 2025 07:36:13 +0000 (09:36 +0200)]
sync : ggml

ggml-ci

3 months ago  scripts : update sync + fix cmake merge
Georgi Gerganov [Thu, 27 Mar 2025 07:22:30 +0000 (09:22 +0200)]
scripts : update sync + fix cmake merge

ggml-ci

3 months ago  sync : ggml
Georgi Gerganov [Thu, 27 Mar 2025 07:01:21 +0000 (09:01 +0200)]
sync : ggml

ggml-ci

3 months ago  cmake : sync/merge PowerPC build commands (#0)
Georgi Gerganov [Thu, 27 Mar 2025 07:00:57 +0000 (09:00 +0200)]
cmake : sync/merge PowerPC build commands (#0)

3 months ago  llamafile : ppc64le MMA implementation for Q4_0. (#12489)
amritahs-ibm [Thu, 27 Mar 2025 06:51:47 +0000 (12:21 +0530)]
llamafile : ppc64le MMA implementation for Q4_0. (#12489)

This change upstreams llamafile's CPU matrix
multiplication kernels for the ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between the quantised datatypes block_q4_0 and
block_q8_0.

This change results in a 5% - 50% improvement
in total speed (i.e., all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <redacted>
3 months ago  ggml : riscv: add 128-bit RVV support (#12530)
xctan [Thu, 27 Mar 2025 06:38:34 +0000 (14:38 +0800)]
ggml : riscv: add 128-bit RVV support (#12530)

* ggml : add 128-bit RVV support

* ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl

* remove trailing whitespaces

* restructure vector length selection code

3 months ago  llama : make loras compatible with repacking (#12593)
Georgi Gerganov [Thu, 27 Mar 2025 06:24:10 +0000 (08:24 +0200)]
llama : make loras compatible with repacking (#12593)

* llama : make loras compatible with repacking

ggml-ci

* cont : simplify

ggml-ci

* cont : add TODO [no ci]

3 months ago  SYCL: implement memset ggml backend buffer interface (#12580)
Akarshan Biswas [Thu, 27 Mar 2025 01:46:00 +0000 (07:16 +0530)]
SYCL: implement memset ggml backend buffer interface (#12580)

* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation

3 months ago  HIP: Add support for RDNA4 targets (#12372)
Slobodan Josic [Wed, 26 Mar 2025 22:46:30 +0000 (23:46 +0100)]
HIP: Add support for RDNA4 targets (#12372)

3 months ago  metal : refactor mat-vec code (#12569)
Georgi Gerganov [Wed, 26 Mar 2025 19:38:38 +0000 (21:38 +0200)]
metal : refactor mat-vec code (#12569)

* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci

3 months ago  upgrade to llguidance 0.7.10 (#12576)
Michał Moskal [Wed, 26 Mar 2025 18:06:09 +0000 (11:06 -0700)]
upgrade to llguidance 0.7.10 (#12576)

3 months ago  clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)
Ivy233 [Wed, 26 Mar 2025 14:06:04 +0000 (22:06 +0800)]
clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)

* [Fix] Compiling clip-quantize-cli and running it in a CUDA environment causes ggml_fp16_to_fp32 to report an error when it tries to access video memory, so quantization needs to run on the CPU backend.
After the fix, it automatically runs on the CPU backend and is no longer bound to CUDA.

* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.

3 months ago  convert : fix squeeze for ssm_conv tensors (#12573)
Georgi Gerganov [Wed, 26 Mar 2025 12:21:05 +0000 (14:21 +0200)]
convert : fix squeeze for ssm_conv tensors (#12573)

* convert : fix squeeze for ssm_conv tensors

* convert : match ssm_conv tensors by type

---------

Co-authored-by: Francis Couture-Harpin <redacted>
3 months ago  ggml : fix MUL_MAT_ID repack with Q8_K (#12544)
Georgi Gerganov [Wed, 26 Mar 2025 11:02:00 +0000 (13:02 +0200)]
ggml : fix MUL_MAT_ID repack with Q8_K (#12544)

* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci

3 months ago  doc: [MUSA] minor changes (#12583)
R0CKSTAR [Wed, 26 Mar 2025 07:09:48 +0000 (15:09 +0800)]
doc: [MUSA] minor changes (#12583)

Signed-off-by: Xiaodong Ye <redacted>
3 months ago  convert: fix Mistral3/Gemma3 model hparams init (#12571)
Sigbjørn Skjæret [Tue, 25 Mar 2025 22:03:10 +0000 (23:03 +0100)]
convert: fix Mistral3/Gemma3 model hparams init (#12571)

* Fix Mistral3/Gemma3 model hparams init

* set positional args correctly

* use existing hparams if passed

3 months ago  run: de-duplicate fmt and format functions and optimize (#11596)
Eric Curtin [Tue, 25 Mar 2025 17:46:11 +0000 (17:46 +0000)]
run: de-duplicate fmt and format functions and optimize (#11596)

3 months ago  ggml-cpu : update KleidiAI to v1.5.0 (#12568)
Dan Johansson [Tue, 25 Mar 2025 11:10:18 +0000 (12:10 +0100)]
ggml-cpu : update KleidiAI to v1.5.0 (#12568)

ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <redacted>
3 months ago  SYCL: disable Q4_0 reorder optimization (#12560)
Akarshan Biswas [Tue, 25 Mar 2025 10:40:18 +0000 (16:10 +0530)]
SYCL: disable Q4_0 reorder optimization (#12560)

ggml-ci

3 months ago  docs : add build instructions for KleidiAI (#12563)
Dan Johansson [Tue, 25 Mar 2025 09:35:20 +0000 (10:35 +0100)]
docs : add build instructions for KleidiAI (#12563)

Signed-off-by: Dan Johansson <redacted>
3 months ago  ci: [MUSA] add CI and update doc (#12562)
R0CKSTAR [Tue, 25 Mar 2025 07:45:08 +0000 (15:45 +0800)]
ci: [MUSA] add CI and update doc (#12562)

Signed-off-by: Xiaodong Ye <redacted>
3 months ago  context : fix worst-case reserve outputs (#12545)
Georgi Gerganov [Tue, 25 Mar 2025 07:19:23 +0000 (09:19 +0200)]
context : fix worst-case reserve outputs (#12545)

ggml-ci

3 months ago  ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547)
Akarshan Biswas [Mon, 24 Mar 2025 17:35:38 +0000 (23:05 +0530)]
ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547)

3 months ago  opencl: simplify kernel embedding logic in cmakefile (#12503)
lhez [Mon, 24 Mar 2025 16:20:47 +0000 (09:20 -0700)]
opencl: simplify kernel embedding logic in cmakefile (#12503)

Co-authored-by: Max Krasnyansky <redacted>
3 months ago  CI: fix SYCL build (#12546)
Akarshan Biswas [Mon, 24 Mar 2025 12:58:32 +0000 (18:28 +0530)]
CI: fix SYCL build (#12546)

3 months ago  docs: update: improve the Fedora CUDA guide (#12536)
Tei Home [Mon, 24 Mar 2025 11:02:26 +0000 (19:02 +0800)]
docs: update: improve the Fedora CUDA guide (#12536)

* docs: update fedora-cuda guide

- Rename and place into Backend Folder.
- Update Host-Supplied Packages.
- Expand Recommended Users Section.

* docs: improve the flow of CUDA-FEDORA.md

3 months ago  llama-vocab : add SuperBPE pre-tokenizer (#12532)
compilade [Mon, 24 Mar 2025 10:47:24 +0000 (06:47 -0400)]
llama-vocab : add SuperBPE pre-tokenizer (#12532)

3 months ago  CUDA: Fix clang warnings (#12540)
R0CKSTAR [Mon, 24 Mar 2025 10:28:34 +0000 (18:28 +0800)]
CUDA: Fix clang warnings (#12540)

Signed-off-by: Xiaodong Ye <redacted>
3 months ago  mmap : skip resource limit checks on AIX (#12541)
Prajwal B Mehendarkar [Mon, 24 Mar 2025 10:17:10 +0000 (15:47 +0530)]
mmap : skip resource limit checks on AIX (#12541)

3 months ago  vulkan: fix mul_mat_vec failure in backend tests (#12529)
Jeff Bolz [Mon, 24 Mar 2025 06:56:17 +0000 (01:56 -0500)]
vulkan: fix mul_mat_vec failure in backend tests (#12529)

The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple of
new backend tests that hit this failure on NVIDIA GPUs.
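
A C++ illustration of the general hazard (not the shader code): bounds must hold for every lane of an unrolled step, which is easiest to guarantee by sizing the unrolled main loop accordingly and handling the remainder separately.

```cpp
#include <cstddef>

// If an unrolled loop processes UNROLL elements per step, a bounds check
// computed once per step can miss that only some lanes of the final step
// are in range. Running the unrolled body only while all lanes are in
// bounds, with a scalar tail, avoids the out-of-bounds access.
constexpr size_t UNROLL = 4;

static float sum_unrolled(const float * x, size_t n) {
    float acc = 0.0f;
    size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {  // every lane in bounds
        acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    }
    for (; i < n; ++i) {                    // remainder: no overread
        acc += x[i];
    }
    return acc;
}
```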

3 months ago  server : Add verbose output to OAI compatible chat endpoint. (#12246)
Marius Gerdes [Sun, 23 Mar 2025 18:30:26 +0000 (19:30 +0100)]
server : Add verbose output to OAI compatible chat endpoint. (#12246)

Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.

3 months ago  install : add macports (#12518)
Lars Sonchocky-Helldorf [Sun, 23 Mar 2025 08:21:48 +0000 (09:21 +0100)]
install : add macports (#12518)

MacPorts section added

3 months ago  llama : gemma3 : use output tensor if it exists in model weight (#12506)
Xuan-Son Nguyen [Sat, 22 Mar 2025 22:28:19 +0000 (23:28 +0100)]
llama : gemma3 : use output tensor if it exists in model weight (#12506)

* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names

3 months ago  ggml : fix quantized cpy op (#12310)
Georgi Gerganov [Sat, 22 Mar 2025 14:23:26 +0000 (16:23 +0200)]
ggml : fix quantized cpy op (#12310)

* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci

3 months ago  musa: refine compute capability (#12493)
R0CKSTAR [Sat, 22 Mar 2025 09:11:37 +0000 (17:11 +0800)]
musa: refine compute capability (#12493)

* musa: refine compute capability

Signed-off-by: Xiaodong Ye <redacted>
* Address review comments

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
3 months ago  vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
Jeff Bolz [Sat, 22 Mar 2025 08:40:11 +0000 (03:40 -0500)]
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)

* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
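
A scalar C++ sketch of the GQA load-reuse idea behind the p021 change (the real code is a Vulkan compute shader; names are illustrative):

```cpp
#include <cstddef>

// With grouped-query attention, `group` query heads share one KV head.
// Loading each shared A (KV) element once and reusing it for every head
// in the group turns `group` loads into one.
static void mat_vec_grouped(const float * A,  // shared KV data, size K
                            const float * x,  // queries, group * K
                            float * y,        // outputs, one per head
                            size_t K, size_t group) {
    for (size_t h = 0; h < group; ++h) y[h] = 0.0f;
    for (size_t k = 0; k < K; ++k) {
        const float a = A[k];                 // one load, reused below
        for (size_t h = 0; h < group; ++h) {
            y[h] += a * x[h*K + k];
        }
    }
}
```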

3 months ago  Vulkan: RTE rounding for cpy to quant (#12480)
stduhpf [Fri, 21 Mar 2025 19:34:50 +0000 (20:34 +0100)]
Vulkan: RTE rounding for cpy to quant (#12480)

* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <redacted>
* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <redacted>
3 months ago  vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472)
Eve [Fri, 21 Mar 2025 19:27:47 +0000 (19:27 +0000)]
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472)

3 months ago  model : do not repack if a GPU device is present (#12498)
Georgi Gerganov [Fri, 21 Mar 2025 14:14:29 +0000 (16:14 +0200)]
model : do not repack if a GPU device is present (#12498)

ggml-ci

3 months ago  chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
Sigbjørn Skjæret [Fri, 21 Mar 2025 09:21:36 +0000 (10:21 +0100)]
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)

3 months ago  llama-tts : avoid crashes related to bad model file paths (#12482)
marcoStocchi [Fri, 21 Mar 2025 09:12:45 +0000 (10:12 +0100)]
llama-tts : avoid crashes related to bad model file paths (#12482)

3 months ago  [SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
蕭澧邦 [Fri, 21 Mar 2025 06:58:47 +0000 (14:58 +0800)]
[SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)

* [SYCL] Fix build on Windows when ccache enabled (#9954)

* take effect only on windows and force it to icl

---------

Co-authored-by: Romain Biessy <redacted>
3 months ago  sycl: cleanup oneDNN related code (#12097)
Svetlozar Georgiev [Fri, 21 Mar 2025 02:15:56 +0000 (02:15 +0000)]
sycl: cleanup oneDNN related code (#12097)

3 months ago  webui : Prevent rerendering on textarea input (#12299)
Woof Dog [Thu, 20 Mar 2025 14:57:43 +0000 (14:57 +0000)]
webui : Prevent rerendering on textarea input (#12299)

* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <redacted>
3 months ago  llama : make Qwen2MoE QKV bias optional (#12477)
Sigbjørn Skjæret [Thu, 20 Mar 2025 11:49:59 +0000 (12:49 +0100)]
llama : make Qwen2MoE QKV bias optional (#12477)

3 months ago  ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture...
Srihari-mcw [Thu, 20 Mar 2025 11:35:34 +0000 (17:05 +0530)]
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)

* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments

3 months ago  convert : avoid calls to tokenizer.added_tokens_decoder (#12473)
Bartowski [Thu, 20 Mar 2025 06:36:37 +0000 (02:36 -0400)]
convert : avoid calls to tokenizer.added_tokens_decoder (#12473)

tokenizer.added_tokens_decoder returns a fresh dict every time, relatively slowly (~0.04 s on average), which results in massive slowdowns when there is a huge number of added tokens
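
The fix lives in the Python conversion script; the underlying pattern, call the expensive accessor once and reuse the result, looks like this in a C++ sketch (names are stand-ins for tokenizer.added_tokens_decoder):

```cpp
#include <map>
#include <string>

// Calling an accessor that rebuilds its result on every call (~0.04 s in
// the PR's measurements) inside a per-token loop multiplies the cost;
// call it once and reuse the result.
struct Tokenizer {
    std::map<int, std::string> added_tokens_decoder() const {
        return {{1, "<bos>"}, {2, "<eos>"}};  // stand-in for the slow rebuild
    }
};

int main() {
    Tokenizer tok;
    const auto added = tok.added_tokens_decoder();  // fetched once, not per token
    for (const auto & [id, text] : added) {
        (void)id; (void)text;                       // ... per-token work ...
    }
    return 0;
}
```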

3 months ago  context : clear sets containing encoder output sequence ids before storing new values...
fairydreaming [Wed, 19 Mar 2025 20:01:57 +0000 (21:01 +0100)]
context : clear sets containing encoder output sequence ids before storing new values (#12470)

Co-authored-by: Stanisław Szymczyk <redacted>
3 months ago  CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
Gaurav Garg [Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)]
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)

- Find the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use this value to determine the optimal parallel_blocks value (see the sketch after this entry).
- Prefer vector flash attention kernels over the MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <redacted>
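
A hedged host-side sketch of the occupancy query (the kernel and launch parameters are placeholders, not the actual flash decoding kernel):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void flash_decode_kernel() { /* placeholder kernel body */ }

int main() {
    int dev = 0, num_sm = 0, active_blocks = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, dev);

    const int    block_size = 128;  // placeholder launch configuration
    const size_t smem_bytes = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &active_blocks, flash_decode_kernel, block_size, smem_bytes);

    // Size parallel_blocks so the whole GPU stays busy in the BS=1 case.
    const int parallel_blocks = active_blocks * num_sm;
    printf("parallel_blocks = %d\n", parallel_blocks);
    return 0;
}
```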
3 months ago  vulkan: optimize iq1 coopmat2 dequant functions (#12427)
Jeff Bolz [Wed, 19 Mar 2025 18:56:23 +0000 (13:56 -0500)]
vulkan: optimize iq1 coopmat2 dequant functions (#12427)

3 months ago  Fix visionOS build and add CI (#12415)
Guus Waals [Wed, 19 Mar 2025 10:15:23 +0000 (10:15 +0000)]
Fix visionOS build and add CI (#12415)

* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <redacted>
3 months ago  llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
Sigbjørn Skjæret [Wed, 19 Mar 2025 08:08:49 +0000 (09:08 +0100)]
llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)

* Add support for GPT2, Bloom and CodeShell tied word embeddings

* Deduplicate tied word embeddings weights

* Workaround for incorrect weight map

It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.

* check++

* fatfingers--

3 months ago  convert : Support chat_template.json (#12460)
Sigbjørn Skjæret [Wed, 19 Mar 2025 07:58:13 +0000 (08:58 +0100)]
convert : Support chat_template.json (#12460)

3 months ago  vulkan: Submit once enough matmul work has been recorded (#12406)
Jeff Bolz [Wed, 19 Mar 2025 07:26:26 +0000 (02:26 -0500)]
vulkan: Submit once enough matmul work has been recorded (#12406)

I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
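
A small C++ sketch of such a byte-based flush heuristic (the exact thresholds and ramp-up schedule are assumptions based on the text above):

```cpp
#include <cstddef>

// Illustrative byte-based submit heuristic: count how many bytes of
// weight matrices have been recorded and flush once a threshold is hit,
// ramping the threshold up over the first few submits.
struct submit_tracker {
    size_t bytes_since_submit = 0;
    int    submits            = 0;

    size_t threshold() const {
        const size_t initial = 1024 * 1024;        // assumed ramp-up start
        const size_t full    = 100 * 1024 * 1024;  // steady state: 100 MB
        return submits < 3 ? initial << submits : full;
    }

    bool should_flush(size_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit < threshold()) {
            return false;
        }
        bytes_since_submit = 0;
        submits++;
        return true;
    }
};
```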

3 months ago  opencl: improve profiling (#12442)
lhez [Tue, 18 Mar 2025 19:54:55 +0000 (12:54 -0700)]
opencl: improve profiling (#12442)

* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing

3 months ago  graph : normalize Q, K, V shapes + sync cross attention (#12449)
Georgi Gerganov [Tue, 18 Mar 2025 19:35:19 +0000 (21:35 +0200)]
graph : normalize Q, K, V shapes + sync cross attention (#12449)

* graph : normalize Q, K, V shapes and add comments

ggml-ci

* context : synchronize before getting cross attention data

* model : fix command-r attention norm check

3 months ago  musa: override warp_size of musa device to 32 (#12445)
R0CKSTAR [Tue, 18 Mar 2025 18:28:26 +0000 (02:28 +0800)]
musa: override warp_size of musa device to 32 (#12445)

Signed-off-by: Xiaodong Ye <redacted>
3 months ago  llama : support converting Mistral Small text-only (#12450)
Xuan-Son Nguyen [Tue, 18 Mar 2025 18:16:19 +0000 (19:16 +0100)]
llama : support converting Mistral Small text-only (#12450)

3 months ago  speculative : fix seg fault in certain cases (#12454)
Georgi Gerganov [Tue, 18 Mar 2025 17:35:11 +0000 (19:35 +0200)]
speculative : fix seg fault in certain cases (#12454)

3 months ago  llama : add support for EXAONE tied word embeddings (#12451)
Xuan-Son Nguyen [Tue, 18 Mar 2025 16:24:33 +0000 (17:24 +0100)]
llama : add support for EXAONE tied word embeddings (#12451)

3 months ago  context : always use non-causal attention for encoder graphs (#12447)
Georgi Gerganov [Tue, 18 Mar 2025 11:05:49 +0000 (13:05 +0200)]
context : always use non-causal attention for encoder graphs (#12447)

* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci

3 months ago  SYCL: using graphs is configurable by environment variable and compile option (#12371)
Łukasz Ślusarczyk [Tue, 18 Mar 2025 10:16:31 +0000 (11:16 +0100)]
SYCL: using graphs is configurable by environment variable and compile option (#12371)

* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <redacted>
* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <redacted>
3 months ago  server : fix warmup draft cache type (#12446)
Georgi Gerganov [Tue, 18 Mar 2025 10:05:42 +0000 (12:05 +0200)]
server : fix warmup draft cache type (#12446)

ggml-ci

3 months ago  cmake : fix PowerPC build (#12241)
Prajwal B Mehendarkar [Tue, 18 Mar 2025 09:37:33 +0000 (15:07 +0530)]
cmake : fix PowerPC build (#12241)

Closes #12240

3 months ago  ggml : add SVE support for q6_K_q8_K (#12361)
fj-y-saito [Tue, 18 Mar 2025 08:14:39 +0000 (17:14 +0900)]
ggml : add SVE support for q6_K_q8_K (#12361)

3 months ago  Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver...
0cc4m [Tue, 18 Mar 2025 06:21:40 +0000 (07:21 +0100)]
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434)

3 months ago  fixed compilation warnings in ggml-sycl (#12424)
Łukasz Ślusarczyk [Tue, 18 Mar 2025 00:51:25 +0000 (01:51 +0100)]
fixed compilation warnings in ggml-sycl (#12424)

3 months ago  llama: Add support for RWKV v7 architecture (#12412)
Molly Sophia [Mon, 17 Mar 2025 23:27:50 +0000 (07:27 +0800)]
llama: Add support for RWKV v7 architecture (#12412)

* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <redacted>
* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <redacted>
* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <redacted>
* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <redacted>
* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <redacted>
* Apply code-format changes

Signed-off-by: Molly Sophia <redacted>
* fix MUSA build

Signed-off-by: Molly Sophia <redacted>
* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
3 months ago  docs : bring llama-cli conversation/template docs up-to-date (#12426)
Sigbjørn Skjæret [Mon, 17 Mar 2025 20:14:32 +0000 (21:14 +0100)]
docs : bring llama-cli conversation/template docs up-to-date (#12426)

3 months ago  cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)
Gaurav Garg [Mon, 17 Mar 2025 18:25:13 +0000 (23:55 +0530)]
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)

* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API was changed in 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support for CTK versions < 12.x by using the older API when building against them (see the sketch after this list).

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
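
A hedged sketch of the version gating (assuming the pre-12.x four-argument signature of cudaGraphExecUpdate; variable names are illustrative):

```cpp
#include <cuda_runtime.h>

// The 12.x runtime replaced the four-argument cudaGraphExecUpdate with a
// variant taking a cudaGraphExecUpdateResultInfo, so the call is gated on
// the toolkit version at compile time.
static bool try_update_graph(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info) == cudaSuccess;
#else
    cudaGraphNode_t           err_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &err_node, &result) == cudaSuccess;
#endif
}
```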

3 months ago  ggml-vulkan: remove unused find_program(glslc) (#12416)
Guus Waals [Mon, 17 Mar 2025 16:35:43 +0000 (00:35 +0800)]
ggml-vulkan: remove unused find_program(glslc) (#12416)

It's already found by FindVulkan.cmake in the parent CMakeLists

3 months ago  vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312)
Jeff Bolz [Mon, 17 Mar 2025 14:26:18 +0000 (09:26 -0500)]
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312)

3 months ago  vulkan: subgroup size tuning (#12087)
Daniele [Mon, 17 Mar 2025 11:42:33 +0000 (12:42 +0100)]
vulkan: subgroup size tuning (#12087)

* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <redacted>
3 months ago  vulkan: use fp32 in coopmat2 q4_k dequant function (#12309)
Jeff Bolz [Mon, 17 Mar 2025 09:43:35 +0000 (04:43 -0500)]
vulkan: use fp32 in coopmat2 q4_k dequant function (#12309)

3 months ago  vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking ...
Jeff Bolz [Mon, 17 Mar 2025 09:41:59 +0000 (04:41 -0500)]
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273)

* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking

3 months ago  vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258)
Jeff Bolz [Mon, 17 Mar 2025 09:35:00 +0000 (04:35 -0500)]
vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258)

3 months ago  cmake : enable building llama.cpp using system libggml (#12321)
Christian Kastner [Mon, 17 Mar 2025 09:05:23 +0000 (10:05 +0100)]
cmake : enable building llama.cpp using system libggml (#12321)

* cmake: Factor out compiler flag function from ggml

llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.

3 months ago  SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
Akarshan Biswas [Mon, 17 Mar 2025 01:45:12 +0000 (07:15 +0530)]
SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)

* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface

3 months ago  llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400)
Sigbjørn Skjæret [Sun, 16 Mar 2025 17:46:36 +0000 (18:46 +0100)]
llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400)

3 months ago  context : fix init of n_outputs (#12397)
Georgi Gerganov [Sun, 16 Mar 2025 17:29:36 +0000 (19:29 +0200)]
context : fix init of n_outputs (#12397)

ggml-ci

3 months ago  ci : add --symlinks to xcframework zip command (#12409)
Daniel Bevenius [Sun, 16 Mar 2025 17:22:05 +0000 (18:22 +0100)]
ci : add --symlinks to xcframework zip command (#12409)

This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to create symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file rather than as a symlink, which
causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```

Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377

3 months ago  llama-tts : add '-o' option (#12398)
marcoStocchi [Sat, 15 Mar 2025 16:23:11 +0000 (17:23 +0100)]
llama-tts : add '-o' option (#12398)

* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded by this one.

3 months ago  SYCL: Delete redundant plus sign and space (#12391)
aubreyli [Sat, 15 Mar 2025 14:49:03 +0000 (22:49 +0800)]
SYCL: Delete redundant plus sign and space (#12391)

3 months ago  SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
fairydreaming [Sat, 15 Mar 2025 14:19:30 +0000 (15:19 +0100)]
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)

* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <redacted>
3 months ago  [CANN]MUL_MAT optimization (#12382)
Chenguang Li [Sat, 15 Mar 2025 01:31:08 +0000 (09:31 +0800)]
[CANN]MUL_MAT optimization (#12382)

3 months ago  Add CLI arg to llama-run to adjust the number of threads used (#12370)
Eric Curtin [Fri, 14 Mar 2025 16:41:20 +0000 (16:41 +0000)]
Add CLI arg to llama-run to adjust the number of threads used (#12370)

We default to 4; sometimes we want to adjust this manually

Signed-off-by: Eric Curtin <redacted>
3 months ago  main : add -sysf / --system-prompt-file (#12249) (#12250)
Sigbjørn Skjæret [Fri, 14 Mar 2025 15:57:05 +0000 (16:57 +0100)]
main : add -sysf / --system-prompt-file (#12249) (#12250)

* add system_prompt_file

* add -sysf / --system-prompt-file

* remove system_prompt_file

3 months ago  Load all MoE experts during warmup (#11571)
fairydreaming [Fri, 14 Mar 2025 12:47:05 +0000 (13:47 +0100)]
Load all MoE experts during warmup (#11571)

* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup (see the usage sketch after this entry)

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <redacted>
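
A hedged usage sketch of the new API (context/batch setup is elided; see the PR's changes in common for the real warmup code):

```cpp
#include "llama.h"

// Enable warmup mode so a dummy decode routes through all MoE experts,
// then disable it before real inference.
static void warm_up(llama_context * ctx, llama_batch & batch) {
    llama_set_warmup(ctx, true);   // use all experts, not just the top-k
    llama_decode(ctx, batch);      // dummy decode faults-in every expert's weights
    llama_set_warmup(ctx, false);  // restore normal expert routing
    // then clear the KV cache before processing real prompts
}
```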
3 months ago  server: fix "--grammar-file" parameter (#12285)
Victor [Fri, 14 Mar 2025 10:21:17 +0000 (11:21 +0100)]
server: fix "--grammar-file" parameter (#12285)

3 months ago  graph : simplify attn input build for unified KV cache (#12381)
Georgi Gerganov [Fri, 14 Mar 2025 08:47:44 +0000 (10:47 +0200)]
graph : simplify attn input build for unified KV cache (#12381)

ggml-ci

3 months ago  hparams : add SWA rope parameters (#12374)
Georgi Gerganov [Fri, 14 Mar 2025 07:03:24 +0000 (09:03 +0200)]
hparams : add SWA rope parameters (#12374)

ggml-ci