git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
5 weeks ago  tests : avoid github urls due to throttling (#13654)
Sigbjørn Skjæret [Tue, 20 May 2025 10:03:17 +0000 (12:03 +0200)]
tests : avoid github urls due to throttling (#13654)

5 weeks ago  sycl: disable reorder for sycl mulmat (#13536)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (#13536)

5 weeks ago  Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)
0cc4m [Tue, 20 May 2025 08:11:56 +0000 (10:11 +0200)]
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)

5 weeks ago  metal : fix typo in FA kernel comments (#13651)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (#13651)

5 weeks ago  kv-cache : add SWA support (#13194)
Georgi Gerganov [Tue, 20 May 2025 05:05:46 +0000 (08:05 +0300)]
kv-cache : add SWA support (#13194)

* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
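As a side note on what the SWA cache enforces: sliding-window attention lets a query attend only to the most recent W key positions, which is what allows the dedicated SWA cache to stay small. A minimal sketch of that visibility rule (illustrative only, not llama.cpp's actual kv-cache code):

```cpp
#include <cstdint>

// Sliding-window attention mask rule: a query at position q_pos may attend
// to a key at position k_pos only if the key is not in the future and lies
// within the last `window` positions. Keys older than the window can be
// evicted from the SWA cache. (Illustrative rule, not llama.cpp internals.)
inline bool swa_visible(int64_t q_pos, int64_t k_pos, int64_t window) {
    return k_pos <= q_pos && q_pos - k_pos < window;
}
```

This is also why partial SWA context use is disallowed in the server above: positions that fell out of the window are gone and cannot be re-attended.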

5 weeks ago  CANN: Update CANN model support (#13162)
Xinpeng Dou [Tue, 20 May 2025 03:43:43 +0000 (11:43 +0800)]
CANN: Update CANN model support (#13162)

* Update CANN model support status

* Update of model support

* update

* fix format of CANN.md

5 weeks ago  sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)

* Remove mmap workaround on windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux, so the Windows-specific workaround is removed since
it is no longer necessary.

* Update llama-bench README

The SYCL backend change allows llama-bench to run without
specifying the `--mmap 0` flag

5 weeks ago  common : add load_progress_callback (#13617)
psocolovsky [Mon, 19 May 2025 19:17:36 +0000 (21:17 +0200)]
common : add load_progress_callback (#13617)
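A load-progress callback typically follows the C-style callback-plus-user-data pattern. A self-contained sketch of that pattern (a stand-in callback type and a fake loader, not the actual llama.cpp signature):

```cpp
#include <vector>

// C-style progress callback: returns false to cancel loading.
typedef bool (*progress_cb)(float progress, void * user_data);

// Stand-in for a loader that reports progress in `steps` increments.
bool fake_load(int steps, progress_cb cb, void * ud) {
    for (int i = 1; i <= steps; ++i) {
        if (!cb((float) i / steps, ud)) {
            return false; // callback asked to cancel
        }
    }
    return true;
}

// Example callback: records each reported fraction into a vector.
bool record_progress(float p, void * ud) {
    static_cast<std::vector<float> *>(ud)->push_back(p);
    return true; // keep loading
}
```

The `user_data` pointer lets callers thread any state (a progress bar, a logger) through the C API without globals.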

6 weeks ago  Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)

5 weeks ago  sycl : backend documentation review (#13544)
Alberto Cabrera Pérez [Mon, 19 May 2025 13:38:20 +0000 (14:38 +0100)]
sycl : backend documentation review (#13544)

* sycl: reviewing and updating docs

* Updates Runtime error codes

* Improves OOM troubleshooting entry

* Added a llama 3 sample

* Updated supported models

* Updated releases table

5 weeks ago  mtmd : add vision support for llama 4 (#13282)
Xuan-Son Nguyen [Mon, 19 May 2025 11:04:14 +0000 (13:04 +0200)]
mtmd : add vision support for llama 4 (#13282)

* wip llama 4 conversion

* rm redundant __init__

* fix conversion

* fix conversion

* test impl

* try this

* reshape patch_embeddings_0

* fix view

* rm ffn_post_norm

* cgraph ok

* f32 for pos embd

* add image marker tokens

* Llama4UnfoldConvolution

* correct pixel shuffle

* fix merge conflicts

* correct

* add debug_graph

* logits matched, but it still perceives the image incorrectly

* fix style

* add image_grid_pinpoints

* handle llama 4 preprocessing

* rm load_image_size

* rm unused line

* fix

* small fix 2

* add test & docs

* fix llava-1.6 test

* test: add notion of huge models

* add comment

* add warn about degraded quality

5 weeks ago  ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)
Alberto Cabrera Pérez [Mon, 19 May 2025 10:46:09 +0000 (11:46 +0100)]
ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)

5 weeks ago  sync : ggml
Georgi Gerganov [Mon, 19 May 2025 09:50:29 +0000 (12:30 +0300)]
sync : ggml

ggml-ci

5 weeks ago  mnist: fix segmentation fault (ggml/1227)
Johannes Gäßler [Mon, 19 May 2025 07:33:35 +0000 (09:33 +0200)]
mnist: fix segmentation fault (ggml/1227)

5 weeks ago  ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
Diego Devesa [Mon, 19 May 2025 01:30:13 +0000 (18:30 -0700)]
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)

5 weeks ago  ggml : Fix missing backtrace on Linux (ggml/1228)
Daniel Tang [Sat, 17 May 2025 23:06:26 +0000 (19:06 -0400)]
ggml : Fix missing backtrace on Linux (ggml/1228)

* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols

5 weeks ago  fix: check model pointer validity before use (#13631)
Nick [Mon, 19 May 2025 10:25:41 +0000 (18:25 +0800)]
fix: check model pointer validity before use (#13631)

5 weeks ago  CANN: Support MOE Model MUL_MAT_ID (#13042)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (#13042)

Signed-off-by: noemotiovon <redacted>
5 weeks ago  server : added --no-prefill-assistant flag (#13608)
Isaac McFadyen [Sat, 17 May 2025 21:59:48 +0000 (17:59 -0400)]
server : added --no-prefill-assistant flag (#13608)

* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md

6 weeks ago  cmake: use the current build config for vulkan-shaders-gen (#13595)
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (#13595)

* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`

6 weeks ago  parallel : add option for non-shared and larger prompts (#13598)
Georgi Gerganov [Sat, 17 May 2025 09:58:55 +0000 (12:58 +0300)]
parallel : add option for non-shared and larger prompts (#13598)

* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci

6 weeks ago  vulkan: move common FA code to flash_attn_base.comp (#13556)
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (#13556)

* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix

6 weeks ago  vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)

6 weeks ago  llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)
Z [Fri, 16 May 2025 20:56:28 +0000 (14:56 -0600)]
llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)

6 weeks ago  server : do not return error out of context (with ctx shift disabled) (#13577)
Xuan-Son Nguyen [Fri, 16 May 2025 19:50:00 +0000 (21:50 +0200)]
server : do not return error out of context (with ctx shift disabled) (#13577)

6 weeks ago  webui : improve accessibility for visually impaired people (#13551)
Xuan-Son Nguyen [Fri, 16 May 2025 19:49:01 +0000 (21:49 +0200)]
webui : improve accessibility for visually impaired people (#13551)

* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content

6 weeks ago  readme : add list of dependencies and their license (#13591)
Xuan-Son Nguyen [Fri, 16 May 2025 18:04:18 +0000 (20:04 +0200)]
readme : add list of dependencies and their license (#13591)

6 weeks ago  releases : use arm version of curl for arm releases (#13592)
Diego Devesa [Fri, 16 May 2025 17:36:51 +0000 (10:36 -0700)]
releases : use arm version of curl for arm releases (#13592)

6 weeks ago  metal : add FA-vec kernel for head size 64 (#13583)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (#13583)

ggml-ci

6 weeks ago  llama : print hint when loading a model when no backends are loaded (#13589)
Diego Devesa [Fri, 16 May 2025 14:38:07 +0000 (07:38 -0700)]
llama : print hint when loading a model when no backends are loaded (#13589)

6 weeks ago  ci : add ppc64el to build-linux-cross (#13575)
Sigbjørn Skjæret [Fri, 16 May 2025 12:54:23 +0000 (14:54 +0200)]
ci : add ppc64el to build-linux-cross (#13575)

6 weeks ago  sycl : fixed compilation warnings (#13582)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (#13582)

6 weeks ago  minja: sync (qwen3) (#13573)
Olivier Chafik [Thu, 15 May 2025 22:29:10 +0000 (23:29 +0100)]
minja: sync (qwen3) (#13573)

* minja: sync https://github.com/google/minja/commit/f06140fa52fd140fe38e531ec373d8dc9c86aa06

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <redacted>
6 weeks ago  gguf : use ggml log system (#13571)
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (#13571)

* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages

6 weeks ago  gguf-py : fix disconnect-before-connect in editor-gui (#13569)
Daniel Tang [Thu, 15 May 2025 16:47:10 +0000 (12:47 -0400)]
gguf-py : fix disconnect-before-connect in editor-gui (#13569)

The bug caused a crash on load in venvs created with
--system-site-packages that use
python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.

6 weeks ago  convert : fix conversion for llama 4 (#13567)
Xuan-Son Nguyen [Thu, 15 May 2025 15:40:07 +0000 (17:40 +0200)]
convert : fix conversion for llama 4 (#13567)

6 weeks ago  sycl: simplify bin_bcast_kernel (#13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (#13383)

6 weeks ago  sycl: reordered Q4_K MMVQ (#13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (#13109)

6 weeks ago  sycl: use oneDNN for matrices multiplication (#12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (#12972)

6 weeks ago  llama-bench : fix -ot with dl backends (#13563)
Diego Devesa [Thu, 15 May 2025 13:46:55 +0000 (06:46 -0700)]
llama-bench : fix -ot with dl backends (#13563)

6 weeks ago  webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
Xuan-Son Nguyen [Thu, 15 May 2025 12:24:50 +0000 (14:24 +0200)]
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)

* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages

6 weeks ago  server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)
Piotr Wilkin (ilintar) [Thu, 15 May 2025 06:40:58 +0000 (08:40 +0200)]
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)

6 weeks ago  bench : handle decode errors (#13548)
Georgi Gerganov [Thu, 15 May 2025 02:57:02 +0000 (05:57 +0300)]
bench : handle decode errors (#13548)

ggml-ci

6 weeks ago  `server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
Olivier Chafik [Thu, 15 May 2025 01:39:51 +0000 (02:39 +0100)]
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)

* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <redacted>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
6 weeks ago  kv-cache : fix out-of-bounds view during reserve graph (#13547)
Georgi Gerganov [Wed, 14 May 2025 20:15:15 +0000 (23:15 +0300)]
kv-cache : fix out-of-bounds view during reserve graph (#13547)

* kv-cache : fix reserve graph out-of-bounds access

ggml-ci

* cont : add comment

* cont : fix comments [no ci]

* cont : more correct comment [no ci]

6 weeks ago  arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)

This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on Neoverse N2 with a Llama 3 8B Q6_K quantized model:
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```

6 weeks ago  `common`: add partial regex support (#12808)
Olivier Chafik [Wed, 14 May 2025 18:50:57 +0000 (19:50 +0100)]
`common`: add partial regex support (#12808)

* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <redacted>
* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Georgi Gerganov <redacted>
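The `string_find_partial_stop` helper moved to common above addresses a streaming problem: the tail of the generated text may be the beginning of a stop word and must be held back rather than emitted. A simplified sketch of that idea (not the exact common/ implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns the index where a (possibly partial) occurrence of `stop` begins
// at the very end of `text`, or std::string::npos if the text cannot end
// mid-stop-word. The caller holds back text from that index onward.
size_t find_partial_stop(const std::string & text, const std::string & stop) {
    const size_t max_len = std::min(text.size(), stop.size());
    for (size_t len = max_len; len > 0; --len) {
        // do the last `len` chars of `text` equal the first `len` of `stop`?
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

For example, with stop word `"world"`, the text `"Hello wor"` cannot be fully emitted: the trailing `"wor"` might become `"world"` once more tokens arrive.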
6 weeks ago  editorconfig : fix trailing whitespace from #13542 (#13546)
Sigbjørn Skjæret [Wed, 14 May 2025 18:22:49 +0000 (20:22 +0200)]
editorconfig : fix trailing whitespace from #13542 (#13546)

6 weeks ago  fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)
Gilad S. [Wed, 14 May 2025 16:18:18 +0000 (19:18 +0300)]
fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)

6 weeks ago  CUDA: fix crash on large batch size for quant. MoE (#13537)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (#13537)

6 weeks ago  llama : fix quantize with dl backends (#13539)
Diego Devesa [Wed, 14 May 2025 14:12:36 +0000 (07:12 -0700)]
llama : fix quantize with dl backends (#13539)

6 weeks ago  CUDA: faster Deepseek FA, add Turing support (#13435)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (#13435)

6 weeks ago  fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
Gabe Goodhart [Wed, 14 May 2025 12:53:59 +0000 (06:53 -0600)]
fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)

This matches how others do it, but will still avoid the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <redacted>
6 weeks ago  server : passthrough the /models endpoint during loading (#13535)
Georgi Gerganov [Wed, 14 May 2025 12:42:10 +0000 (15:42 +0300)]
server : passthrough the /models endpoint during loading (#13535)

* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field

6 weeks ago  server : fix cache_tokens bug with no cache_prompt (#13533)
Xuan-Son Nguyen [Wed, 14 May 2025 11:35:07 +0000 (13:35 +0200)]
server : fix cache_tokens bug with no cache_prompt (#13533)

6 weeks ago  cmake: simplify vulkan shader test logic (#13263)
bandoti [Wed, 14 May 2025 10:53:57 +0000 (07:53 -0300)]
cmake: simplify vulkan shader test logic (#13263)

6 weeks ago  vulkan: KHR_coopmat flash attention (#13506)
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (#13506)

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
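For context, flash-attention kernels like this one avoid materializing the full attention matrix by streaming over K/V blocks while maintaining a running max and softmax normalizer. A scalar CPU reference of that online-softmax recurrence (illustrative sketch only, unrelated to the actual shader source):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Streaming (online-softmax) attention for a single query row.
// K and V are row-major: n_kv rows of dimension d.
std::vector<float> attn_row(const std::vector<float> & q,
                            const std::vector<std::vector<float>> & K,
                            const std::vector<std::vector<float>> & V) {
    const size_t d = q.size();
    float m = -INFINITY;               // running max of attention scores
    float l = 0.0f;                    // running softmax normalizer
    std::vector<float> acc(d, 0.0f);   // running weighted sum of V rows

    for (size_t i = 0; i < K.size(); ++i) {
        float s = 0.0f;                // score = q . K[i]
        for (size_t j = 0; j < d; ++j) s += q[j] * K[i][j];

        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);  // rescale previous state
        const float p     = std::exp(s - m_new);

        l = l * corr + p;
        for (size_t j = 0; j < d; ++j) acc[j] = acc[j] * corr + p * V[i][j];
        m = m_new;
    }
    for (size_t j = 0; j < d; ++j) acc[j] /= l;   // final normalization
    return acc;
}
```

The same rescale-and-accumulate trick is what lets the Q*K^T tiles be processed one block at a time in shared memory.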

6 weeks ago  webui : use fflate for more deterministic gzip compress (#13525)
Xuan-Son Nguyen [Wed, 14 May 2025 08:26:12 +0000 (10:26 +0200)]
webui : use fflate for more deterministic gzip compress (#13525)

* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako

6 weeks ago  webui: Allow pasting file from clipboard (#13526)
Luca Stefani [Wed, 14 May 2025 08:07:31 +0000 (10:07 +0200)]
webui: Allow pasting file from clipboard (#13526)

* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <redacted>
6 weeks ago  docs: Update link to ggml-org in multimodal.md (#13513)
ddpasa [Wed, 14 May 2025 07:59:12 +0000 (09:59 +0200)]
docs: Update link to ggml-org in multimodal.md (#13513)

* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
6 weeks ago  scripts : fix compare-llama-bench.py show parameter (#13514)
Sigbjørn Skjæret [Wed, 14 May 2025 06:41:01 +0000 (08:41 +0200)]
scripts : fix compare-llama-bench.py show parameter (#13514)

6 weeks ago  vulkan: workaround FA compile failures on macos (#13517)
Jeff Bolz [Wed, 14 May 2025 04:15:50 +0000 (13:15 +0900)]
vulkan: workaround FA compile failures on macos (#13517)

6 weeks ago  quantize : improve tensor-type pattern matching (#13033)
Ed Addario [Tue, 13 May 2025 17:12:31 +0000 (18:12 +0100)]
quantize : improve tensor-type pattern matching (#13033)

6 weeks ago  clip : clip.h become private API (⚠️ breaking change) (#13510)
Xuan-Son Nguyen [Tue, 13 May 2025 15:07:21 +0000 (17:07 +0200)]
clip : clip.h become private API (⚠️ breaking change) (#13510)

6 weeks ago  metal : use FA-vec kernel up to batch size 20 (#13496)
Georgi Gerganov [Tue, 13 May 2025 15:04:39 +0000 (18:04 +0300)]
metal : use FA-vec kernel up to batch size 20 (#13496)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci

6 weeks ago  metal : optimize multi-sequence FA vec kernel (#13493)
Georgi Gerganov [Tue, 13 May 2025 15:04:00 +0000 (18:04 +0300)]
metal : optimize multi-sequence FA vec kernel (#13493)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

6 weeks ago  ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Dan Johansson [Tue, 13 May 2025 15:02:28 +0000 (17:02 +0200)]
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)

Signed-off-by: Dan Johansson <redacted>
6 weeks ago  batched-bench : fix pp batch contents (#13492)
Georgi Gerganov [Tue, 13 May 2025 15:01:53 +0000 (18:01 +0300)]
batched-bench : fix pp batch contents (#13492)

6 weeks ago  mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)
Xuan-Son Nguyen [Tue, 13 May 2025 13:33:58 +0000 (15:33 +0200)]
mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)

* mtmd : remove libllava, remove clip-quantize-cli

* rm clip_model_quantize

6 weeks ago  scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)
Sigbjørn Skjæret [Tue, 13 May 2025 13:31:12 +0000 (15:31 +0200)]
scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)

6 weeks ago  model : Granite MoE shared (#13269)
Gabe Goodhart [Tue, 13 May 2025 13:12:01 +0000 (07:12 -0600)]
model : Granite MoE shared (#13269)

* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix the input to the shared experts

I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it warns on some compilers

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
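The shared-expert fix above is a dataflow change: the routed MoE branch and the shared-expert FFN must both consume the same layer input, with their outputs summed, rather than feeding the MoE output into the shared experts. A toy sketch of the corrected wiring (stand-in functions with hypothetical math, not the actual graph code):

```cpp
// Stand-ins for the two branches of a shared-expert MoE layer.
float routed_moe(float x) { return 2.0f * x; }   // hypothetical routed-expert output
float shared_ffn(float x) { return x + 1.0f; }   // hypothetical shared-expert FFN

// Corrected wiring: both branches see the same layer input `x`,
// NOT shared_ffn(routed_moe(x)).
float moe_shared_layer(float x) {
    return routed_moe(x) + shared_ffn(x);
}
```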
6 weeks ago  sync : ggml
Georgi Gerganov [Tue, 13 May 2025 11:01:45 +0000 (14:01 +0300)]
sync : ggml

6 weeks ago  llama-bench : add defrag-thold, check for invalid ranges (#13487)
Diego Devesa [Mon, 12 May 2025 22:31:37 +0000 (15:31 -0700)]
llama-bench : add defrag-thold, check for invalid ranges (#13487)

6 weeks ago  opencl: remove unnecessary assert for `add` (#13257)
lhez [Mon, 12 May 2025 20:13:49 +0000 (13:13 -0700)]
opencl: remove unnecessary assert for `add` (#13257)

6 weeks ago  clip : cap max image size 1024 for qwen vl model (#13478)
Xuan-Son Nguyen [Mon, 12 May 2025 13:06:51 +0000 (15:06 +0200)]
clip : cap max image size 1024 for qwen vl model (#13478)

6 weeks ago  llama/ggml: add LLM training support (#10544)
Johannes Gäßler [Mon, 12 May 2025 12:44:49 +0000 (14:44 +0200)]
llama/ggml: add LLM training support (#10544)

* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period

6 weeks ago  context : fix state io for memory-less contexts (#13470)
Georgi Gerganov [Mon, 12 May 2025 12:12:27 +0000 (15:12 +0300)]
context : fix state io for memory-less contexts (#13470)

ggml-ci

6 weeks ago  server : allow content to be null in oaicompat_completion_params_parse (#13477)
Anudit Nagar [Mon, 12 May 2025 11:56:42 +0000 (18:56 +0700)]
server : allow content to be null in oaicompat_completion_params_parse (#13477)

6 weeks ago  llama-bench : accept ranges for integer parameters (#13410)
Diego Devesa [Mon, 12 May 2025 11:08:22 +0000 (13:08 +0200)]
llama-bench : accept ranges for integer parameters (#13410)

6 weeks ago  ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
Dan Johansson [Mon, 12 May 2025 11:06:19 +0000 (13:06 +0200)]
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)

* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <redacted>
* code review fixes

Signed-off-by: Dan Johansson <redacted>
* adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <redacted>
---------

Signed-off-by: Dan Johansson <redacted>
Co-authored-by: Charles Xu <redacted>
6 weeks ago  CUDA: fix misaligned synchronization in FA (#13469)
Johannes Gäßler [Mon, 12 May 2025 08:51:21 +0000 (10:51 +0200)]
CUDA: fix misaligned synchronization in FA (#13469)

6 weeks ago  ggml : add mrope kernel for metal (#13457)
Xuan-Son Nguyen [Mon, 12 May 2025 08:29:13 +0000 (10:29 +0200)]
ggml : add mrope kernel for metal (#13457)

6 weeks ago  enable dpcpp nightly builds with libraries (#13406)
Atharva Dubey [Mon, 12 May 2025 05:15:32 +0000 (06:15 +0100)]
enable dpcpp nightly builds with libraries (#13406)

6 weeks ago  mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)
City [Sun, 11 May 2025 22:39:06 +0000 (00:39 +0200)]
mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)

6 weeks ago  tools : fix uninitialized llama_batch in server (#13436)
Anthony Umfer [Sun, 11 May 2025 15:08:26 +0000 (11:08 -0400)]
tools : fix uninitialized llama_batch in server (#13436)

* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* use C++11 initializer syntax

* switch from Copy-list-initialization to Direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
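The fix works because C++ direct-list-initialization with `{}` value-initializes every member: integers become zero and pointers become null, so a later cleanup that frees those pointers is safe even if the batch was never populated. A minimal illustration (toy struct, not the real llama_batch layout):

```cpp
#include <cstdint>

// Toy struct mirroring the problem: raw pointer members freed in cleanup.
struct batch_like {
    int32_t   n_tokens;
    int32_t * token;
    float   * embd;
};

// Direct-list-initialization `{}` value-initializes all members:
// integers become 0 and pointers become nullptr, so freeing them
// (free(nullptr) is a no-op) can never be an invalid free.
batch_like make_empty_batch() {
    batch_like b {};
    return b;
}
```

Without the `{}`, an automatic-storage `batch_like b;` leaves the pointers indeterminate, which is exactly the invalid-free the commit describes.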
6 weeks ago  scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)
Sigbjørn Skjæret [Sun, 11 May 2025 14:20:39 +0000 (16:20 +0200)]
scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)

6 weeks ago  CUDA: fix crash with partial offloading of MoE (#13439)
Johannes Gäßler [Sun, 11 May 2025 14:09:33 +0000 (16:09 +0200)]
CUDA: fix crash with partial offloading of MoE (#13439)

6 weeks ago  Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)
David Huang [Sun, 11 May 2025 12:18:39 +0000 (20:18 +0800)]
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)

6 weeks ago  mtmd : support InternVL 3 38B and 78B mmproj (#13443)
City [Sun, 11 May 2025 09:35:52 +0000 (11:35 +0200)]
mtmd : support InternVL 3 38B and 78B mmproj (#13443)

* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together

6 weeks ago  mtmd : move helpers to dedicated file (#13442)
Xuan-Son Nguyen [Sun, 11 May 2025 09:34:23 +0000 (11:34 +0200)]
mtmd : move helpers to dedicated file (#13442)

* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include

6 weeks ago  docs : Fix typo in InternVL3 model name (#13440)
Thomas Germer [Sat, 10 May 2025 20:26:46 +0000 (22:26 +0200)]
docs : Fix typo in InternVL3 model name (#13440)

6 weeks ago  CUDA: fix race conditions FlashAttention kernels (#13438)
Johannes Gäßler [Sat, 10 May 2025 20:22:48 +0000 (22:22 +0200)]
CUDA: fix race conditions FlashAttention kernels (#13438)

6 weeks ago  vocab : add ByteDance-Seed/Seed-Coder (#13423)
Sigbjørn Skjæret [Sat, 10 May 2025 20:08:07 +0000 (22:08 +0200)]
vocab : add ByteDance-Seed/Seed-Coder (#13423)

7 weeks ago  mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
Xuan-Son Nguyen [Sat, 10 May 2025 17:57:54 +0000 (19:57 +0200)]
mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)

* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo

7 weeks ago  server : update docs (#13432)
Xuan-Son Nguyen [Sat, 10 May 2025 16:44:49 +0000 (18:44 +0200)]
server : update docs (#13432)

7 weeks ago  llguidance : set tokenizer slices to default (#13424)
Sigbjørn Skjæret [Sat, 10 May 2025 15:19:52 +0000 (17:19 +0200)]
llguidance : set tokenizer slices to default (#13424)

7 weeks ago  ci: free_disk_space flag enabled for intel variant (#13426)
Thammachart Chinvarapon [Sat, 10 May 2025 14:34:48 +0000 (21:34 +0700)]
ci: free_disk_space flag enabled for intel variant (#13426)

free disk space before cleanup: 20G
free disk space after cleanup: 44G
free disk space after all images are built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245

7 weeks ago  mtmd : support InternVL 2.5 and 3 (#13422)
Xuan-Son Nguyen [Sat, 10 May 2025 14:26:42 +0000 (16:26 +0200)]
mtmd : support InternVL 2.5 and 3 (#13422)

* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps

7 weeks ago  CUDA: fix FlashAttention on Turing (#13415)
Johannes Gäßler [Sat, 10 May 2025 07:16:52 +0000 (09:16 +0200)]
CUDA: fix FlashAttention on Turing (#13415)

7 weeks ago  arg : add env var to control mmproj (#13416)
Xuan-Son Nguyen [Sat, 10 May 2025 06:16:29 +0000 (08:16 +0200)]
arg : add env var to control mmproj (#13416)

* arg : add env var to control mmproj

* small note about -hf --mmproj