git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
6 weeks ago gguf-py : fix disconnect-before-connect in editor-gui (#13569)
Daniel Tang [Thu, 15 May 2025 16:47:10 +0000 (12:47 -0400)]
gguf-py : fix disconnect-before-connect in editor-gui (#13569)

The bug caused a crash on load in venvs created with
--system-site-packages that use
python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.

6 weeks ago convert : fix conversion for llama 4 (#13567)
Xuan-Son Nguyen [Thu, 15 May 2025 15:40:07 +0000 (17:40 +0200)]
convert : fix conversion for llama 4 (#13567)

6 weeks ago sycl: simplify bin_bcast_kernel (#13383)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (#13383)

6 weeks ago sycl: reordered Q4_K MMVQ (#13109)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (#13109)

6 weeks ago sycl: use oneDNN for matrices multiplication (#12972)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (#12972)

6 weeks ago llama-bench : fix -ot with dl backends (#13563)
Diego Devesa [Thu, 15 May 2025 13:46:55 +0000 (06:46 -0700)]
llama-bench : fix -ot with dl backends (#13563)

6 weeks ago webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
Xuan-Son Nguyen [Thu, 15 May 2025 12:24:50 +0000 (14:24 +0200)]
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)

* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages

6 weeks ago server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)
Piotr Wilkin (ilintar) [Thu, 15 May 2025 06:40:58 +0000 (08:40 +0200)]
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)

6 weeks ago bench : handle decode errors (#13548)
Georgi Gerganov [Thu, 15 May 2025 02:57:02 +0000 (05:57 +0300)]
bench : handle decode errors (#13548)

ggml-ci

6 weeks ago `server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
Olivier Chafik [Thu, 15 May 2025 01:39:51 +0000 (02:39 +0100)]
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)

* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <redacted>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
6 weeks ago kv-cache : fix out-of-bounds view during reserve graph (#13547)
Georgi Gerganov [Wed, 14 May 2025 20:15:15 +0000 (23:15 +0300)]
kv-cache : fix out-of-bounds view during reserve graph (#13547)

* kv-cache : fix reserve graph out-of-bounds access

ggml-ci

* cont : add comment

* cont : fix comments [no ci]

* cont : more correct comment [no ci]

6 weeks ago arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)

This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on neoverse-n2 with a llama3 8b q6_k-quantized model:
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
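
As background for the numbers above: the core of the i8mm speedup is the SMMLA instruction, which computes a 2x2 int32 accumulator block from two 2x8 int8 tiles in a single instruction. A minimal standalone sketch of that primitive via the `vmmlaq_s32` intrinsic (illustration only, not the PR's kernel; needs a CPU and toolchain with i8mm, e.g. `-march=armv8.6-a+i8mm`):

```cpp
#include <arm_neon.h>
#include <cstdio>

int main() {
    // a: 2x8 int8 tile of A (row-major); b: 2x8 tile holding two columns
    // of B (i.e. two rows of B^T).
    const int8_t a_rows[16] = { 1, 2, 3, 4, 5, 6, 7, 8,
                               -1,-2,-3,-4,-5,-6,-7,-8 };
    const int8_t b_cols[16] = { 1, 1, 1, 1, 1, 1, 1, 1,
                                2, 2, 2, 2, 2, 2, 2, 2 };
    const int8x16_t a = vld1q_s8(a_rows);
    const int8x16_t b = vld1q_s8(b_cols);

    // SMMLA: acc (2x2 int32, row-major) += a (2x8 int8) * b^T (8x2 int8)
    int32x4_t acc = vdupq_n_s32(0);
    acc = vmmlaq_s32(acc, a, b);

    int32_t c[4];
    vst1q_s32(c, acc);
    // c = { dot(a0,b0), dot(a0,b1), dot(a1,b0), dot(a1,b1) } = { 36, 72, -36, -72 }
    std::printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```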

6 weeks ago `common`: add partial regex support (#12808)
Olivier Chafik [Wed, 14 May 2025 18:50:57 +0000 (19:50 +0100)]
`common`: add partial regex support (#12808)

* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <redacted>
* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Georgi Gerganov <redacted>
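
For context on what "partial matches" buys the server: when streaming, the tail of the generated text may be an incomplete prefix of a stop sequence, and those bytes must be held back rather than emitted. A minimal sketch of that idea, assuming a simplified shape for the `string_find_partial_stop` helper named above (the real implementation lives in `common/`):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <string_view>

// Returns the index where a suffix of `text` matches a prefix of `stop`,
// or std::string::npos if no suffix could grow into the stop sequence.
static size_t find_partial_stop(std::string_view text, std::string_view stop) {
    if (!stop.empty() && !text.empty()) {
        // try the longest possible overlap first
        const size_t max_len = std::min(text.size(), stop.size());
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop.substr(0, len)) == 0) {
                return text.size() - len;
            }
        }
    }
    return std::string::npos;
}

int main() {
    // While streaming, "Hello <|" could be the start of stop "<|eot|>",
    // so everything from index 6 must be held back, not emitted yet.
    assert(find_partial_stop("Hello <|", "<|eot|>") == 6);
    assert(find_partial_stop("Hello", "<|eot|>") == std::string::npos);
    return 0;
}
```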
6 weeks ago editorconfig : fix trailing whitespace from #13542 (#13546)
Sigbjørn Skjæret [Wed, 14 May 2025 18:22:49 +0000 (20:22 +0200)]
editorconfig : fix trailing whitespace from #13542 (#13546)

6 weeks ago fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)
Gilad S. [Wed, 14 May 2025 16:18:18 +0000 (19:18 +0300)]
fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)

6 weeks ago CUDA: fix crash on large batch size for quant. MoE (#13537)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (#13537)

6 weeks ago llama : fix quantize with dl backends (#13539)
Diego Devesa [Wed, 14 May 2025 14:12:36 +0000 (07:12 -0700)]
llama : fix quantize with dl backends (#13539)

6 weeks ago CUDA: faster Deepseek FA, add Turing support (#13435)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (#13435)

6 weeks ago fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
Gabe Goodhart [Wed, 14 May 2025 12:53:59 +0000 (06:53 -0600)]
fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)

This matches how others do it, but will still avoid the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <redacted>
6 weeks ago server : passthrough the /models endpoint during loading (#13535)
Georgi Gerganov [Wed, 14 May 2025 12:42:10 +0000 (15:42 +0300)]
server : passthrough the /models endpoint during loading (#13535)

* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field

6 weeks ago server : fix cache_tokens bug with no cache_prompt (#13533)
Xuan-Son Nguyen [Wed, 14 May 2025 11:35:07 +0000 (13:35 +0200)]
server : fix cache_tokens bug with no cache_prompt (#13533)

6 weeks ago cmake: simplify vulkan shader test logic (#13263)
bandoti [Wed, 14 May 2025 10:53:57 +0000 (07:53 -0300)]
cmake: simplify vulkan shader test logic (#13263)

6 weeks ago vulkan: KHR_coopmat flash attention (#13506)
Jeff Bolz [Wed, 14 May 2025 09:55:26 +0000 (18:55 +0900)]
vulkan: KHR_coopmat flash attention (#13506)

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons, so I haven't done it. Performance for this
shader is around 2.5x better than the scalar shader for prompt processing.
Some of the benefit may come from other optimizations, like staging through
shared memory or splitting by rows.

6 weeks ago webui : use fflate for more deterministic gzip compress (#13525)
Xuan-Son Nguyen [Wed, 14 May 2025 08:26:12 +0000 (10:26 +0200)]
webui : use fflate for more deterministic gzip compress (#13525)

* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako

6 weeks ago webui: Allow pasting file from clipboard (#13526)
Luca Stefani [Wed, 14 May 2025 08:07:31 +0000 (10:07 +0200)]
webui: Allow pasting file from clipboard (#13526)

* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <redacted>
6 weeks ago docs: Update link to ggml-org in multimodal.md (#13513)
ddpasa [Wed, 14 May 2025 07:59:12 +0000 (09:59 +0200)]
docs: Update link to ggml-org in multimodal.md (#13513)

* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
6 weeks ago scripts : fix compare-llama-bench.py show parameter (#13514)
Sigbjørn Skjæret [Wed, 14 May 2025 06:41:01 +0000 (08:41 +0200)]
scripts : fix compare-llama-bench.py show parameter (#13514)

6 weeks ago vulkan: workaround FA compile failures on macos (#13517)
Jeff Bolz [Wed, 14 May 2025 04:15:50 +0000 (13:15 +0900)]
vulkan: workaround FA compile failures on macos (#13517)

6 weeks ago quantize : improve tensor-type pattern matching (#13033)
Ed Addario [Tue, 13 May 2025 17:12:31 +0000 (18:12 +0100)]
quantize : improve tensor-type pattern matching (#13033)

6 weeks ago clip : clip.h become private API (⚠️ breaking change) (#13510)
Xuan-Son Nguyen [Tue, 13 May 2025 15:07:21 +0000 (17:07 +0200)]
clip : clip.h become private API (⚠️ breaking change) (#13510)

6 weeks ago metal : use FA-vec kernel up to batch size 20 (#13496)
Georgi Gerganov [Tue, 13 May 2025 15:04:39 +0000 (18:04 +0300)]
metal : use FA-vec kernel up to batch size 20 (#13496)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci

6 weeks ago metal : optimize multi-sequence FA vec kernel (#13493)
Georgi Gerganov [Tue, 13 May 2025 15:04:00 +0000 (18:04 +0300)]
metal : optimize multi-sequence FA vec kernel (#13493)

* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

6 weeks ago ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Dan Johansson [Tue, 13 May 2025 15:02:28 +0000 (17:02 +0200)]
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)

Signed-off-by: Dan Johansson <redacted>
6 weeks ago batched-bench : fix pp batch contents (#13492)
Georgi Gerganov [Tue, 13 May 2025 15:01:53 +0000 (18:01 +0300)]
batched-bench : fix pp batch contents (#13492)

6 weeks ago mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)
Xuan-Son Nguyen [Tue, 13 May 2025 13:33:58 +0000 (15:33 +0200)]
mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)

* mtmd : remove libllava, remove clip-quantize-cli

* rm clip_model_quantize

6 weeks ago scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)
Sigbjørn Skjæret [Tue, 13 May 2025 13:31:12 +0000 (15:31 +0200)]
scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)

6 weeks ago model : Granite MoE shared (#13269)
Gabe Goodhart [Tue, 13 May 2025 13:12:01 +0000 (07:12 -0600)]
model : Granite MoE shared (#13269)

* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix the input to the shared experts

I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it triggers warnings on some compilers

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
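
For clarity on the dataflow fix described above ("shared experts take the inputs _before_ the standard MoE layer"): the shared expert is a parallel branch on the layer input, summed with the routed output. A toy sketch with made-up stand-in FFNs, not the actual ggml graph code:

```cpp
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

// Stand-ins for the routed top-k experts and the always-on shared expert.
static Vec routed_moe(const Vec & x) { Vec y(x); for (float & v : y) v *= 2.0f; return y; }
static Vec shared_ffn(const Vec & x) { Vec y(x); for (float & v : y) v += 1.0f; return y; }

// The bug fed shared_ffn(routed_moe(x)); the fix feeds both branches from x
// and sums their outputs.
static Vec moe_layer(const Vec & x) {
    const Vec routed = routed_moe(x);
    const Vec shared = shared_ffn(x); // parallel branch on the layer input
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = routed[i] + shared[i];
    return y;
}

int main() {
    const Vec y = moe_layer({1.0f, 2.0f});
    std::printf("%.1f %.1f\n", y[0], y[1]); // 4.0 7.0, i.e. 2x + (x + 1)
    return 0;
}
```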
6 weeks ago sync : ggml
Georgi Gerganov [Tue, 13 May 2025 11:01:45 +0000 (14:01 +0300)]
sync : ggml

6 weeks ago llama-bench : add defrag-thold, check for invalid ranges (#13487)
Diego Devesa [Mon, 12 May 2025 22:31:37 +0000 (15:31 -0700)]
llama-bench : add defrag-thold, check for invalid ranges (#13487)

6 weeks ago opencl: remove unnecessary assert for `add` (#13257)
lhez [Mon, 12 May 2025 20:13:49 +0000 (13:13 -0700)]
opencl: remove unnecessary assert for `add` (#13257)

6 weeks ago clip : cap max image size 1024 for qwen vl model (#13478)
Xuan-Son Nguyen [Mon, 12 May 2025 13:06:51 +0000 (15:06 +0200)]
clip : cap max image size 1024 for qwen vl model (#13478)

6 weeks ago llama/ggml: add LLM training support (#10544)
Johannes Gäßler [Mon, 12 May 2025 12:44:49 +0000 (14:44 +0200)]
llama/ggml: add LLM training support (#10544)

* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period

6 weeks ago context : fix state io for memory-less contexts (#13470)
Georgi Gerganov [Mon, 12 May 2025 12:12:27 +0000 (15:12 +0300)]
context : fix state io for memory-less contexts (#13470)

ggml-ci

6 weeks ago server : allow content to be null in oaicompat_completion_params_parse (#13477)
Anudit Nagar [Mon, 12 May 2025 11:56:42 +0000 (18:56 +0700)]
server : allow content to be null in oaicompat_completion_params_parse (#13477)

6 weeks ago llama-bench : accept ranges for integer parameters (#13410)
Diego Devesa [Mon, 12 May 2025 11:08:22 +0000 (13:08 +0200)]
llama-bench : accept ranges for integer parameters (#13410)

6 weeks ago ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
Dan Johansson [Mon, 12 May 2025 11:06:19 +0000 (13:06 +0200)]
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)

* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <redacted>
* code review fixes

Signed-off-by: Dan Johansson <redacted>
* adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <redacted>
---------

Signed-off-by: Dan Johansson <redacted>
Co-authored-by: Charles Xu <redacted>
6 weeks ago CUDA: fix misaligned synchronization in FA (#13469)
Johannes Gäßler [Mon, 12 May 2025 08:51:21 +0000 (10:51 +0200)]
CUDA: fix misaligned synchronization in FA (#13469)

6 weeks ago ggml : add mrope kernel for metal (#13457)
Xuan-Son Nguyen [Mon, 12 May 2025 08:29:13 +0000 (10:29 +0200)]
ggml : add mrope kernel for metal (#13457)

6 weeks ago enable dpcpp nightly builds with libraries (#13406)
Atharva Dubey [Mon, 12 May 2025 05:15:32 +0000 (06:15 +0100)]
enable dpcpp nightly builds with libraries (#13406)

6 weeks ago mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)
City [Sun, 11 May 2025 22:39:06 +0000 (00:39 +0200)]
mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)

6 weeks ago tools : fix uninitialized llama_batch in server (#13436)
Anthony Umfer [Sun, 11 May 2025 15:08:26 +0000 (11:08 -0400)]
tools : fix uninitialized llama_batch in server (#13436)

* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* use C++11 initializer syntax

* switch from Copy-list-initialization to Direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
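
For illustration, a stand-in sketch of the failure mode and the fix (not the real `llama_batch` or `server_context`): a raw-pointer member left uninitialized holds garbage, so a teardown call to free it is an invalid free. Value-initializing the member with C++11 direct-list-initialization zeroes it, and freeing a null pointer is a no-op:

```cpp
#include <cstdlib>

struct fake_batch {
    int   n_tokens;
    int * token;                    // owned buffer, freed on teardown
};

struct server_context_sketch {
    fake_batch batch {};            // C++11 in-class direct-list-init: zeroed

    ~server_context_sketch() {
        std::free(batch.token);     // safe: free(nullptr) is a no-op
    }
};

int main() {
    server_context_sketch ctx;      // without `{}`, batch.token would be garbage
    return 0;                       // destructor frees nullptr safely
}
```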
6 weeks ago scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)
Sigbjørn Skjæret [Sun, 11 May 2025 14:20:39 +0000 (16:20 +0200)]
scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)

6 weeks ago CUDA: fix crash with partial offloading of MoE (#13439)
Johannes Gäßler [Sun, 11 May 2025 14:09:33 +0000 (16:09 +0200)]
CUDA: fix crash with partial offloading of MoE (#13439)

6 weeks ago Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)
David Huang [Sun, 11 May 2025 12:18:39 +0000 (20:18 +0800)]
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)

6 weeks ago mtmd : support InternVL 3 38B and 78B mmproj (#13443)
City [Sun, 11 May 2025 09:35:52 +0000 (11:35 +0200)]
mtmd : support InternVL 3 38B and 78B mmproj (#13443)

* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together

6 weeks ago mtmd : move helpers to dedicated file (#13442)
Xuan-Son Nguyen [Sun, 11 May 2025 09:34:23 +0000 (11:34 +0200)]
mtmd : move helpers to dedicated file (#13442)

* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include

6 weeks ago docs : Fix typo in InternVL3 model name (#13440)
Thomas Germer [Sat, 10 May 2025 20:26:46 +0000 (22:26 +0200)]
docs : Fix typo in InternVL3 model name (#13440)

6 weeks ago CUDA: fix race conditions FlashAttention kernels (#13438)
Johannes Gäßler [Sat, 10 May 2025 20:22:48 +0000 (22:22 +0200)]
CUDA: fix race conditions FlashAttention kernels (#13438)

6 weeks ago vocab : add ByteDance-Seed/Seed-Coder (#13423)
Sigbjørn Skjæret [Sat, 10 May 2025 20:08:07 +0000 (22:08 +0200)]
vocab : add ByteDance-Seed/Seed-Coder (#13423)

7 weeks ago mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
Xuan-Son Nguyen [Sat, 10 May 2025 17:57:54 +0000 (19:57 +0200)]
mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)

* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo

7 weeks ago server : update docs (#13432)
Xuan-Son Nguyen [Sat, 10 May 2025 16:44:49 +0000 (18:44 +0200)]
server : update docs (#13432)

7 weeks ago llguidance : set tokenizer slices to default (#13424)
Sigbjørn Skjæret [Sat, 10 May 2025 15:19:52 +0000 (17:19 +0200)]
llguidance : set tokenizer slices to default (#13424)

7 weeks ago ci: free_disk_space flag enabled for intel variant (#13426)
Thammachart Chinvarapon [Sat, 10 May 2025 14:34:48 +0000 (21:34 +0700)]
ci: free_disk_space flag enabled for intel variant (#13426)

free disk space before cleanup: 20G
free disk space after cleanup: 44G
free disk space after all images are built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245

7 weeks ago mtmd : support InternVL 2.5 and 3 (#13422)
Xuan-Son Nguyen [Sat, 10 May 2025 14:26:42 +0000 (16:26 +0200)]
mtmd : support InternVL 2.5 and 3 (#13422)

* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps

7 weeks ago CUDA: fix FlashAttention on Turing (#13415)
Johannes Gäßler [Sat, 10 May 2025 07:16:52 +0000 (09:16 +0200)]
CUDA: fix FlashAttention on Turing (#13415)

7 weeks ago arg : add env var to control mmproj (#13416)
Xuan-Son Nguyen [Sat, 10 May 2025 06:16:29 +0000 (08:16 +0200)]
arg : add env var to control mmproj (#13416)

* arg : add env var to control mmproj

* small note about -hf --mmproj

7 weeks ago vulkan: scalar flash attention implementation (#13324)
Jeff Bolz [Sat, 10 May 2025 06:07:07 +0000 (23:07 -0700)]
vulkan: scalar flash attention implementation (#13324)

* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA

7 weeks ago chore(llguidance): use tagged version that does not break the build (#13413)
Helton Reis [Fri, 9 May 2025 20:15:39 +0000 (17:15 -0300)]
chore(llguidance): use tagged version that does not break the build (#13413)

7 weeks ago server : vision support via libmtmd (#12898)
Xuan-Son Nguyen [Fri, 9 May 2025 17:29:37 +0000 (19:29 +0200)]
server : vision support via libmtmd (#12898)

* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip : fix confused naming of ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* rm can_be_detokenized

* on prompt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <redacted>
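
On the "use FNV hash, now hash bitmap instead of file data" step above: hashing the decoded pixels identifies an image independently of its container encoding. A minimal FNV-1a (64-bit) sketch; whether the server uses exactly this variant and width is an assumption here:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t fnv1a_64(const uint8_t * data, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;        // FNV offset basis
    for (size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}

int main() {
    // e.g. RGB pixels of a decoded bitmap; two decodes that produce the
    // same pixels hash the same even if the source files differ.
    std::vector<uint8_t> rgb = {255, 0, 0, 0, 255, 0, 0, 0, 255};
    std::printf("%016llx\n", (unsigned long long) fnv1a_64(rgb.data(), rgb.size()));
    return 0;
}
```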
7 weeks ago sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
Alberto Cabrera Pérez [Fri, 9 May 2025 15:34:08 +0000 (16:34 +0100)]
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)

* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <redacted>
* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <redacted>
* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <redacted>
Co-authored-by: romain.biessy <redacted>
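
The `safe_div -> ceil_div` rename above presumably refers to the usual integer ceil-division idiom (the exact definition in the SYCL code is an assumption):

```cpp
#include <cassert>

// Round-up division for positive integers, e.g. for computing block counts.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main() {
    assert(ceil_div(7, 4) == 2);   // 7 elements in blocks of 4 -> 2 blocks
    assert(ceil_div(8, 4) == 2);
    return 0;
}
```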
7 weeks ago metal : optimize MoE for large batches (#13388)
Georgi Gerganov [Fri, 9 May 2025 12:14:56 +0000 (15:14 +0300)]
metal : optimize MoE for large batches (#13388)

ggml-ci

7 weeks ago CUDA: FA support for Deepseek (Ampere or newer) (#13306)
Johannes Gäßler [Fri, 9 May 2025 11:34:58 +0000 (13:34 +0200)]
CUDA: FA support for Deepseek (Ampere or newer) (#13306)

* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template

7 weeks ago llama : do not crash if there is no CPU backend (#13395)
Diego Devesa [Fri, 9 May 2025 11:02:07 +0000 (13:02 +0200)]
llama : do not crash if there is no CPU backend (#13395)

* llama : do not crash if there is no CPU backend

* add checks to examples

7 weeks ago CUDA: fix crash on large batch size for MoE models (#13384)
Johannes Gäßler [Fri, 9 May 2025 10:14:04 +0000 (12:14 +0200)]
CUDA: fix crash on large batch size for MoE models (#13384)

7 weeks ago imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
Bartowski [Fri, 9 May 2025 09:53:58 +0000 (05:53 -0400)]
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)

* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace

7 weeks ago llama-run: add support for downloading models from ModelScope (#13370)
R0CKSTAR [Fri, 9 May 2025 09:25:50 +0000 (17:25 +0800)]
llama-run: add support for downloading models from ModelScope (#13370)

Signed-off-by: Xiaodong Ye <redacted>
7 weeks ago mtmd : fix batch_view for m-rope (#13397)
Xuan-Son Nguyen [Fri, 9 May 2025 09:18:02 +0000 (11:18 +0200)]
mtmd : fix batch_view for m-rope (#13397)

* mtmd : fix batch_view for m-rope

* nits : fix comment

7 weeks ago llama : one-off chat template fix for Mistral-Small-2503 (#13398)
Xuan-Son Nguyen [Fri, 9 May 2025 09:17:51 +0000 (11:17 +0200)]
llama : one-off chat template fix for Mistral-Small-2503 (#13398)

* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken

7 weeks ago rpc : add rpc_msg_set_tensor_hash_req (#13353)
Radoslav Gerganov [Fri, 9 May 2025 07:31:07 +0000 (10:31 +0300)]
rpc : add rpc_msg_set_tensor_hash_req (#13353)

* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH, which
makes the code cleaner.

* fix

7 weeks ago vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
Jeff Bolz [Fri, 9 May 2025 07:23:41 +0000 (02:23 -0500)]
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The offending tensor is 8 x 512 = 4096 entries, which exceeds the 3072 cap, so the array size is increased to 4096 to accommodate it.

7 weeks ago server : (webui) rename has_multimodal --> modalities (#13393)
Xuan-Son Nguyen [Fri, 9 May 2025 07:06:37 +0000 (09:06 +0200)]
server : (webui) rename has_multimodal --> modalities (#13393)

* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code

7 weeks ago ci : limit write permission to only the release step + fixes (#13392) [upstream/0.0.5318]
Diego Devesa [Thu, 8 May 2025 21:45:22 +0000 (23:45 +0200)]
ci : limit write permission to only the release step + fixes (#13392)

* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators

7 weeks ago mtmd : Expose helper_decode_image_chunk (#13366)
Matt Clayton [Thu, 8 May 2025 18:25:39 +0000 (14:25 -0400)]
mtmd : Expose helper_decode_image_chunk (#13366)

* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups

7 weeks ago server : (webui) fix a very small misalignment (#13387)
Xuan-Son Nguyen [Thu, 8 May 2025 16:51:45 +0000 (18:51 +0200)]
server : (webui) fix a very small misalignment (#13387)

* server : (webui) fix a very small misalignment

* restore font-bold

7 weeks ago server : (webui) revamp the input area, plus many small UI improvements (#13365)
Xuan-Son Nguyen [Thu, 8 May 2025 13:37:29 +0000 (15:37 +0200)]
server : (webui) revamp the input area, plus many small UI improvements (#13365)

* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log

7 weeks ago convert : support rope_scaling type and rope_type (#13349)
Sigbjørn Skjæret [Thu, 8 May 2025 13:34:29 +0000 (15:34 +0200)]
convert : support rope_scaling type and rope_type (#13349)

7 weeks ago mtmd : fix the calculation of n_tokens for smolvlm (#13381)
welix [Thu, 8 May 2025 13:03:53 +0000 (22:03 +0900)]
mtmd : fix the calculation of n_tokens for smolvlm (#13381)

Co-authored-by: Taichi Nishimura <redacted>
7 weeks ago context : allow cache-less context for embeddings (#13108)
Georgi Gerganov [Thu, 8 May 2025 11:28:33 +0000 (14:28 +0300)]
context : allow cache-less context for embeddings (#13108)

* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]

7 weeks ago context : remove logits_all flag (#13284)
Georgi Gerganov [Thu, 8 May 2025 11:26:50 +0000 (14:26 +0300)]
context : remove logits_all flag (#13284)

* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci

7 weeks ago ci : move release workflow to a separate file (#13362)
Diego Devesa [Thu, 8 May 2025 11:15:28 +0000 (13:15 +0200)]
ci : move release workflow to a separate file (#13362)

7 weeks ago llama : print size and type of overridden tensors (#13364)
Diego Devesa [Thu, 8 May 2025 11:15:15 +0000 (13:15 +0200)]
llama : print size and type of overridden tensors (#13364)

7 weeks ago sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
Alberto Cabrera Pérez [Thu, 8 May 2025 09:08:01 +0000 (10:08 +0100)]
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)

* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel

7 weeks ago docker : disable arm64 and intel images (#13356)
Diego Devesa [Wed, 7 May 2025 14:36:33 +0000 (16:36 +0200)]
docker : disable arm64 and intel images (#13356)

7 weeks ago sync : ggml
Georgi Gerganov [Wed, 7 May 2025 13:39:36 +0000 (16:39 +0300)]
sync : ggml

ggml-ci

7 weeks ago whisper: remove MSVC warnings pragmas (whisper/3090)
Daniel Bevenius [Mon, 5 May 2025 11:09:35 +0000 (13:09 +0200)]
whisper: remove MSVC warnings pragmas (whisper/3090)

* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.

7 weeks ago cmake : removed stdc++fs (whisper/3097)
Jared Tweed [Fri, 2 May 2025 09:41:35 +0000 (02:41 -0700)]
cmake : removed stdc++fs (whisper/3097)

* removed stdc++fs

* kept line, but removed stdc++fs

7 weeks ago llama : deci : support ffn-free with attention (#13296)
Sigbjørn Skjæret [Wed, 7 May 2025 10:49:27 +0000 (12:49 +0200)]
llama : deci : support ffn-free with attention (#13296)

7 weeks ago common : Add a warning when we can't match samplers from a string or char. (#13330)
Ycros [Wed, 7 May 2025 08:23:28 +0000 (18:23 +1000)]
common : Add a warning when we can't match samplers from a string or char. (#13330)

7 weeks ago cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
R0CKSTAR [Wed, 7 May 2025 07:48:23 +0000 (15:48 +0800)]
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)

Signed-off-by: Xiaodong Ye <redacted>
7 weeks ago examples : remove infill (#13283)
Georgi Gerganov [Wed, 7 May 2025 07:28:02 +0000 (10:28 +0300)]
examples : remove infill (#13283)

ggml-ci