git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
6 weeks ago scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)
Sigbjørn Skjæret [Tue, 13 May 2025 13:31:12 +0000 (15:31 +0200)]
scripts : support arbitrary input file formats in compare-llama-bench.py (#13455)

6 weeks ago model : Granite MoE shared (#13269)
Gabe Goodhart [Tue, 13 May 2025 13:12:01 +0000 (07:12 -0600)]
model : Granite MoE shared (#13269)

* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix the input to the shared experts

I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.
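
In symbols, the fix changes the dataflow to the following (a sketch; E_i are the routed experts, g_i their gating weights, E_shared the shared expert):

    y = \sum_{i \in \text{top-}k} g_i \, E_i(x) + E_{\text{shared}}(x)
    \qquad \text{rather than} \qquad
    y = E_{\text{shared}}(\text{MoE}(x))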

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it warns on some compilers

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the C++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <redacted>
---------

Signed-off-by: Gabe Goodhart <redacted>
6 weeks ago sync : ggml
Georgi Gerganov [Tue, 13 May 2025 11:01:45 +0000 (14:01 +0300)]
sync : ggml

6 weeks ago llama-bench : add defrag-thold, check for invalid ranges (#13487)
Diego Devesa [Mon, 12 May 2025 22:31:37 +0000 (15:31 -0700)]
llama-bench : add defrag-thold, check for invalid ranges (#13487)

6 weeks ago opencl: remove unnecessary assert for `add` (#13257)
lhez [Mon, 12 May 2025 20:13:49 +0000 (13:13 -0700)]
opencl: remove unnecessary assert for `add` (#13257)

6 weeks ago clip : cap max image size 1024 for qwen vl model (#13478)
Xuan-Son Nguyen [Mon, 12 May 2025 13:06:51 +0000 (15:06 +0200)]
clip : cap max image size 1024 for qwen vl model (#13478)

6 weeks ago llama/ggml: add LLM training support (#10544)
Johannes Gäßler [Mon, 12 May 2025 12:44:49 +0000 (14:44 +0200)]
llama/ggml: add LLM training support (#10544)

* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period

6 weeks ago context : fix state io for memory-less contexts (#13470)
Georgi Gerganov [Mon, 12 May 2025 12:12:27 +0000 (15:12 +0300)]
context : fix state io for memory-less contexts (#13470)

ggml-ci

6 weeks ago server : allow content to be null in oaicompat_completion_params_parse (#13477)
Anudit Nagar [Mon, 12 May 2025 11:56:42 +0000 (18:56 +0700)]
server : allow content to be null in oaicompat_completion_params_parse (#13477)

6 weeks ago llama-bench : accept ranges for integer parameters (#13410)
Diego Devesa [Mon, 12 May 2025 11:08:22 +0000 (13:08 +0200)]
llama-bench : accept ranges for integer parameters (#13410)

6 weeks ago ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
Dan Johansson [Mon, 12 May 2025 11:06:19 +0000 (13:06 +0200)]
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)

* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <redacted>
* code review fixes

Signed-off-by: Dan Johansson <redacted>
* adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <redacted>
---------

Signed-off-by: Dan Johansson <redacted>
Co-authored-by: Charles Xu <redacted>
6 weeks ago CUDA: fix misaligned synchronization in FA (#13469)
Johannes Gäßler [Mon, 12 May 2025 08:51:21 +0000 (10:51 +0200)]
CUDA: fix misaligned synchronization in FA (#13469)

6 weeks ago ggml : add mrope kernel for metal (#13457)
Xuan-Son Nguyen [Mon, 12 May 2025 08:29:13 +0000 (10:29 +0200)]
ggml : add mrope kernel for metal (#13457)

6 weeks ago enable dpcpp nightly builds with libraries (#13406)
Atharva Dubey [Mon, 12 May 2025 05:15:32 +0000 (06:15 +0100)]
enable dpcpp nightly builds with libraries (#13406)

6 weeks ago mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)
City [Sun, 11 May 2025 22:39:06 +0000 (00:39 +0200)]
mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459)

6 weeks ago tools : fix uninitialized llama_batch in server (#13436)
Anthony Umfer [Sun, 11 May 2025 15:08:26 +0000 (11:08 -0400)]
tools : fix uninitialized llama_batch in server (#13436)

* add constructor to initialize server_context::batch, preventing the destructor's call to llama_batch_free from causing an invalid free() (see the sketch after this entry)

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* use C++11 initializer syntax

* switch from Copy-list-initialization to Direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
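
A minimal C++ sketch of the bug class fixed above, with hypothetical names (not the server's actual code). A default-initialized llama_batch member holds garbage pointers, so the destructor's llama_batch_free() performs an invalid free(); value-initializing the member zeroes the pointers, and free(NULL) is a no-op:

    #include "llama.h"

    struct server_context_sketch {
        // BAD: plain 'llama_batch batch;' leaves token/embd/pos/... as garbage
        // FIX: direct-list-initialization value-initializes every field to zero
        llama_batch batch {};

        ~server_context_sketch() {
            // llama_batch_free() calls free() on the internal arrays;
            // with null pointers this is well-defined and does nothing
            llama_batch_free(batch);
        }
    };
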
6 weeks ago scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)
Sigbjørn Skjæret [Sun, 11 May 2025 14:20:39 +0000 (16:20 +0200)]
scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451)

6 weeks ago CUDA: fix crash with partial offloading of MoE (#13439)
Johannes Gäßler [Sun, 11 May 2025 14:09:33 +0000 (16:09 +0200)]
CUDA: fix crash with partial offloading of MoE (#13439)

6 weeks ago Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)
David Huang [Sun, 11 May 2025 12:18:39 +0000 (20:18 +0800)]
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)

6 weeks ago mtmd : support InternVL 3 38B and 78B mmproj (#13443)
City [Sun, 11 May 2025 09:35:52 +0000 (11:35 +0200)]
mtmd : support InternVL 3 38B and 78B mmproj (#13443)

* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together

6 weeks ago mtmd : move helpers to dedicated file (#13442)
Xuan-Son Nguyen [Sun, 11 May 2025 09:34:23 +0000 (11:34 +0200)]
mtmd : move helpers to dedicated file (#13442)

* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include

7 weeks ago docs : Fix typo in InternVL3 model name (#13440)
Thomas Germer [Sat, 10 May 2025 20:26:46 +0000 (22:26 +0200)]
docs : Fix typo in InternVL3 model name (#13440)

7 weeks ago CUDA: fix race conditions FlashAttention kernels (#13438)
Johannes Gäßler [Sat, 10 May 2025 20:22:48 +0000 (22:22 +0200)]
CUDA: fix race conditions FlashAttention kernels (#13438)

7 weeks ago vocab : add ByteDance-Seed/Seed-Coder (#13423)
Sigbjørn Skjæret [Sat, 10 May 2025 20:08:07 +0000 (22:08 +0200)]
vocab : add ByteDance-Seed/Seed-Coder (#13423)

7 weeks ago mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
Xuan-Son Nguyen [Sat, 10 May 2025 17:57:54 +0000 (19:57 +0200)]
mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)

* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo

7 weeks ago server : update docs (#13432)
Xuan-Son Nguyen [Sat, 10 May 2025 16:44:49 +0000 (18:44 +0200)]
server : update docs (#13432)

7 weeks ago llguidance : set tokenizer slices to default (#13424)
Sigbjørn Skjæret [Sat, 10 May 2025 15:19:52 +0000 (17:19 +0200)]
llguidance : set tokenizer slices to default (#13424)

7 weeks ago ci: free_disk_space flag enabled for intel variant (#13426)
Thammachart Chinvarapon [Sat, 10 May 2025 14:34:48 +0000 (21:34 +0700)]
ci: free_disk_space flag enabled for intel variant (#13426)

free disk space before cleanup: 20G
free disk space after cleanup: 44G
free disk space after all images are built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245

7 weeks ago mtmd : support InternVL 2.5 and 3 (#13422)
Xuan-Son Nguyen [Sat, 10 May 2025 14:26:42 +0000 (16:26 +0200)]
mtmd : support InternVL 2.5 and 3 (#13422)

* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps

7 weeks ago CUDA: fix FlashAttention on Turing (#13415)
Johannes Gäßler [Sat, 10 May 2025 07:16:52 +0000 (09:16 +0200)]
CUDA: fix FlashAttention on Turing (#13415)

7 weeks ago arg : add env var to control mmproj (#13416)
Xuan-Son Nguyen [Sat, 10 May 2025 06:16:29 +0000 (08:16 +0200)]
arg : add env var to control mmproj (#13416)

* arg : add env var to control mmproj

* small note about -hf --mmproj

7 weeks ago vulkan: scalar flash attention implementation (#13324)
Jeff Bolz [Sat, 10 May 2025 06:07:07 +0000 (23:07 -0700)]
vulkan: scalar flash attention implementation (#13324)

* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA

7 weeks ago chore(llguidance): use tagged version that does not break the build (#13413)
Helton Reis [Fri, 9 May 2025 20:15:39 +0000 (17:15 -0300)]
chore(llguidance): use tagged version that does not break the build (#13413)

7 weeks ago server : vision support via libmtmd (#12898)
Xuan-Son Nguyen [Fri, 9 May 2025 17:29:37 +0000 (19:29 +0200)]
server : vision support via libmtmd (#12898)

* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data (see the FNV-1a sketch after this entry)

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip : fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* rm can_be_detokenized

* on prompt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <redacted>
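
The FNV hash mentioned above ("use FNV hash, now hash bitmap instead of file data") is the well-known FNV-1a scheme; a minimal 64-bit version over raw bitmap bytes, illustrative rather than the exact mtmd code:

    #include <cstddef>
    #include <cstdint>

    // FNV-1a, 64-bit: standard offset basis and prime
    static uint64_t fnv1a_64(const uint8_t * data, size_t n) {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < n; ++i) {
            h ^= data[i];
            h *= 0x100000001b3ULL;
        }
        return h;
    }

Hashing the decoded bitmap pixels rather than the encoded file makes the resulting ID stable across different encodings of the same image.
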
7 weeks ago sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
Alberto Cabrera Pérez [Fri, 9 May 2025 15:34:08 +0000 (16:34 +0100)]
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)

* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <redacted>
* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <redacted>
* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <redacted>
Co-authored-by: romain.biessy <redacted>
7 weeks ago metal : optimize MoE for large batches (#13388)
Georgi Gerganov [Fri, 9 May 2025 12:14:56 +0000 (15:14 +0300)]
metal : optimize MoE for large batches (#13388)

ggml-ci

7 weeks ago CUDA: FA support for Deepseek (Ampere or newer) (#13306)
Johannes Gäßler [Fri, 9 May 2025 11:34:58 +0000 (13:34 +0200)]
CUDA: FA support for Deepseek (Ampere or newer) (#13306)

* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template

7 weeks ago llama : do not crash if there is no CPU backend (#13395)
Diego Devesa [Fri, 9 May 2025 11:02:07 +0000 (13:02 +0200)]
llama : do not crash if there is no CPU backend (#13395)

* llama : do not crash if there is no CPU backend

* add checks to examples

7 weeks ago CUDA: fix crash on large batch size for MoE models (#13384)
Johannes Gäßler [Fri, 9 May 2025 10:14:04 +0000 (12:14 +0200)]
CUDA: fix crash on large batch size for MoE models (#13384)

7 weeks ago imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
Bartowski [Fri, 9 May 2025 09:53:58 +0000 (05:53 -0400)]
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)

* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace

7 weeks ago llama-run: add support for downloading models from ModelScope (#13370)
R0CKSTAR [Fri, 9 May 2025 09:25:50 +0000 (17:25 +0800)]
llama-run: add support for downloading models from ModelScope (#13370)

Signed-off-by: Xiaodong Ye <redacted>
7 weeks ago mtmd : fix batch_view for m-rope (#13397)
Xuan-Son Nguyen [Fri, 9 May 2025 09:18:02 +0000 (11:18 +0200)]
mtmd : fix batch_view for m-rope (#13397)

* mtmd : fix batch_view for m-rope

* nits : fix comment

7 weeks ago llama : one-off chat template fix for Mistral-Small-2503 (#13398)
Xuan-Son Nguyen [Fri, 9 May 2025 09:17:51 +0000 (11:17 +0200)]
llama : one-off chat template fix for Mistral-Small-2503 (#13398)

* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken

7 weeks ago rpc : add rpc_msg_set_tensor_hash_req (#13353)
Radoslav Gerganov [Fri, 9 May 2025 07:31:07 +0000 (10:31 +0300)]
rpc : add rpc_msg_set_tensor_hash_req (#13353)

* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.
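
A sketch of what such a dedicated request struct can look like; the field set here is an assumption based on what RPC_CMD_SET_TENSOR_HASH needs, not necessarily the exact definition:

    // hypothetical layout; the actual fields may differ
    struct rpc_msg_set_tensor_hash_req {
        rpc_tensor tensor; // serialized view of the target tensor
        uint64_t   offset; // write offset into the tensor data
        uint64_t   hash;   // hash of the payload, checked against the local cache
    };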

* fix

7 weeks ago vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
Jeff Bolz [Fri, 9 May 2025 07:23:41 +0000 (02:23 -0500)]
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)

This assert fired when running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512, i.e. nei0 * nei1 = 4096, which exceeds the old limit of 3072. Increase the array size to 4096 to accommodate it.

7 weeks ago server : (webui) rename has_multimodal --> modalities (#13393)
Xuan-Son Nguyen [Fri, 9 May 2025 07:06:37 +0000 (09:06 +0200)]
server : (webui) rename has_multimodal --> modalities (#13393)

* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code

7 weeks ago ci : limit write permission to only the release step + fixes (#13392) [upstream/0.0.5318]
Diego Devesa [Thu, 8 May 2025 21:45:22 +0000 (23:45 +0200)]
ci : limit write permission to only the release step + fixes (#13392)

* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators

7 weeks ago mtmd : Expose helper_decode_image_chunk (#13366)
Matt Clayton [Thu, 8 May 2025 18:25:39 +0000 (14:25 -0400)]
mtmd : Expose helper_decode_image_chunk (#13366)

* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups

7 weeks ago server : (webui) fix a very small misalignment (#13387)
Xuan-Son Nguyen [Thu, 8 May 2025 16:51:45 +0000 (18:51 +0200)]
server : (webui) fix a very small misalignment (#13387)

* server : (webui) fix a very small misalignment

* restore font-bold

7 weeks ago server : (webui) revamp the input area, plus many small UI improvements (#13365)
Xuan-Son Nguyen [Thu, 8 May 2025 13:37:29 +0000 (15:37 +0200)]
server : (webui) revamp the input area, plus many small UI improvements (#13365)

* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log

7 weeks ago convert : support rope_scaling type and rope_type (#13349)
Sigbjørn Skjæret [Thu, 8 May 2025 13:34:29 +0000 (15:34 +0200)]
convert : support rope_scaling type and rope_type (#13349)

7 weeks ago mtmd : fix the calculation of n_tokens for smolvlm (#13381)
welix [Thu, 8 May 2025 13:03:53 +0000 (22:03 +0900)]
mtmd : fix the calculation of n_tokens for smolvlm (#13381)

Co-authored-by: Taichi Nishimura <redacted>
7 weeks ago context : allow cache-less context for embeddings (#13108)
Georgi Gerganov [Thu, 8 May 2025 11:28:33 +0000 (14:28 +0300)]
context : allow cache-less context for embeddings (#13108)

* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate (see the sketch after this entry)

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
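
For example, an embeddings-only workload can go through llama_encode() instead of llama_decode(), since no KV cache state has to be carried between calls. A minimal sketch, assuming the context was created with embeddings and pooling enabled and tokens holds the tokenized input:

    #include <cstdio>
    #include <vector>
    #include "llama.h"

    // sketch: embed one sequence with a cache-less context
    static int embed_tokens(llama_context * ctx, std::vector<llama_token> & tokens) {
        llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
        if (llama_encode(ctx, batch) != 0) {
            fprintf(stderr, "llama_encode() failed\n");
            return 1;
        }
        const float * embd = llama_get_embeddings_seq(ctx, 0); // pooled embedding, seq 0
        (void) embd; // use the embedding here
        return 0;
    }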

7 weeks ago context : remove logits_all flag (#13284)
Georgi Gerganov [Thu, 8 May 2025 11:26:50 +0000 (14:26 +0300)]
context : remove logits_all flag (#13284)

* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci

7 weeks ago ci : move release workflow to a separate file (#13362)
Diego Devesa [Thu, 8 May 2025 11:15:28 +0000 (13:15 +0200)]
ci : move release workflow to a separate file (#13362)

7 weeks ago llama : print size and type of overridden tensors (#13364)
Diego Devesa [Thu, 8 May 2025 11:15:15 +0000 (13:15 +0200)]
llama : print size and type of overridden tensors (#13364)

7 weeks ago sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
Alberto Cabrera Pérez [Thu, 8 May 2025 09:08:01 +0000 (10:08 +0100)]
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)

* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel

7 weeks ago docker : disable arm64 and intel images (#13356)
Diego Devesa [Wed, 7 May 2025 14:36:33 +0000 (16:36 +0200)]
docker : disable arm64 and intel images (#13356)

7 weeks ago sync : ggml
Georgi Gerganov [Wed, 7 May 2025 13:39:36 +0000 (16:39 +0300)]
sync : ggml

ggml-ci

7 weeks ago whisper: remove MSVC warnings pragmas (whisper/3090)
Daniel Bevenius [Mon, 5 May 2025 11:09:35 +0000 (13:09 +0200)]
whisper: remove MSVC warnings pragmas (whisper/3090)

* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.

7 weeks ago cmake : removed stdc++fs (whisper/3097)
Jared Tweed [Fri, 2 May 2025 09:41:35 +0000 (02:41 -0700)]
cmake : removed stdc++fs (whisper/3097)

* removed stdc++fs

* kept line, but removed stdc++fs

7 weeks ago llama : deci : support ffn-free with attention (#13296)
Sigbjørn Skjæret [Wed, 7 May 2025 10:49:27 +0000 (12:49 +0200)]
llama : deci : support ffn-free with attention (#13296)

7 weeks ago common : Add a warning when we can't match samplers from a string or char. (#13330)
Ycros [Wed, 7 May 2025 08:23:28 +0000 (18:23 +1000)]
common : Add a warning when we can't match samplers from a string or char. (#13330)

7 weeks ago cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
R0CKSTAR [Wed, 7 May 2025 07:48:23 +0000 (15:48 +0800)]
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)

Signed-off-by: Xiaodong Ye <redacted>
7 weeks ago examples : remove infill (#13283)
Georgi Gerganov [Wed, 7 May 2025 07:28:02 +0000 (10:28 +0300)]
examples : remove infill (#13283)

ggml-ci

7 weeks ago llama : support tie embedding for chatglm models (#13328)
piDack [Wed, 7 May 2025 07:23:11 +0000 (15:23 +0800)]
llama : support tie embedding for chatglm models (#13328)

7 weeks ago CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135)
Johannes Gäßler [Tue, 6 May 2025 21:35:51 +0000 (23:35 +0200)]
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135)

7 weeks ago clip : refactor graph builder (#13321)
Xuan-Son Nguyen [Tue, 6 May 2025 20:40:24 +0000 (22:40 +0200)]
clip : refactor graph builder (#13321)

* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)

7 weeks ago sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)
DocShotgun [Tue, 6 May 2025 20:36:24 +0000 (13:36 -0700)]
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)

7 weeks ago sampling : don't consider -infinity values in top_n_sigma (#13344)
oobabooga [Tue, 6 May 2025 18:24:15 +0000 (15:24 -0300)]
sampling : don't consider -infinity values in top_n_sigma (#13344)

7 weeks ago cmake : remove arm64 msvc presets (#13342)
Diego Devesa [Tue, 6 May 2025 18:15:31 +0000 (20:15 +0200)]
cmake : remove arm64 msvc presets (#13342)

7 weeks ago SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
Akarshan Biswas [Tue, 6 May 2025 14:57:06 +0000 (20:27 +0530)]
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)

* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default

7 weeks ago llama : fix build_ffn without gate (#13336)
Xuan-Son Nguyen [Tue, 6 May 2025 12:25:40 +0000 (14:25 +0200)]
llama : fix build_ffn without gate (#13336)

* llama : fix build_ffn without gate

* fix build on windows

* Revert "fix build on windows"

This reverts commit fc420d3c7eef3481d3d2f313fef2757cb33a7c56.

7 weeks ago CUDA: fix bad asserts for partial offload (#13337)
Johannes Gäßler [Tue, 6 May 2025 11:58:51 +0000 (13:58 +0200)]
CUDA: fix bad asserts for partial offload (#13337)

7 weeks ago convert : qwen2/3moe : set yarn metadata if present (#13331)
Sigbjørn Skjæret [Tue, 6 May 2025 09:12:06 +0000 (11:12 +0200)]
convert : qwen2/3moe : set yarn metadata if present (#13331)

* set yarn metadata if present

* add comment about enabling YaRN

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
7 weeks ago CUDA: fix --split-mode row for MMQ (#13323)
Johannes Gäßler [Tue, 6 May 2025 06:36:46 +0000 (08:36 +0200)]
CUDA: fix --split-mode row for MMQ (#13323)

7 weeks ago gguf-py : avoid requiring pyside6 for other scripts (#13036)
compilade [Tue, 6 May 2025 02:27:31 +0000 (22:27 -0400)]
gguf-py : avoid requiring pyside6 for other scripts (#13036)

- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed

Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.

7 weeks ago CUDA: fix logic for clearing padding with -ngl 0 (#13320)
Johannes Gäßler [Mon, 5 May 2025 20:32:13 +0000 (22:32 +0200)]
CUDA: fix logic for clearing padding with -ngl 0 (#13320)

7 weeks ago sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
oobabooga [Mon, 5 May 2025 20:12:19 +0000 (17:12 -0300)]
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)

* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler (oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0 (see the sketch after this entry)

---------

Co-authored-by: CasualAutopsy <redacted>
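
For reference, Top-nσ keeps only the candidates whose logit lies within n standard deviations of the maximum logit. A minimal sketch of that rule (illustrative, not the llama.cpp sampler code), including the no-op behavior for nsigma <= 0 from the related commits:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // mask out tokens with logit < max_logit - nsigma * stddev(logits)
    void top_n_sigma_mask(std::vector<float> & logits, float nsigma) {
        if (nsigma <= 0.0f || logits.size() <= 1) {
            return; // no-op for nsigma <= 0 or a single candidate
        }
        float maxl = logits[0], mean = 0.0f;
        for (float l : logits) { maxl = std::max(maxl, l); mean += l; }
        mean /= (float) logits.size();
        float var = 0.0f;
        for (float l : logits) { var += (l - mean) * (l - mean); }
        const float sigma = std::sqrt(var / (float) logits.size());
        for (float & l : logits) {
            if (l < maxl - nsigma * sigma) {
                l = -INFINITY; // excluded before softmax
            }
        }
    }
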
7 weeks ago server : Webui - change setText command from parent window to also send the message. (#13309)
igardev [Mon, 5 May 2025 14:03:31 +0000 (17:03 +0300)]
server : Webui - change setText command from parent window to also send the message. (#13309)

* setText command from parent window for llama-vscode now sends the message automatically.

* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.

* Fix code formatting.

* Add index.html.gz changes.

* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."

This reverts commit 67687b7fda8a293724ba92ea30bb151677406bc8.

* easier approach

* add setTimeout

---------

Co-authored-by: igardev <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
7 weeks ago mtmd : rename llava directory to mtmd (#13311)
Xuan-Son Nguyen [Mon, 5 May 2025 14:02:55 +0000 (16:02 +0200)]
mtmd : rename llava directory to mtmd (#13311)

* mv llava to mtmd

* change ref everywhere

7 weeks ago clip : fix confused naming ffn_up and ffn_down (#13290)
Xuan-Son Nguyen [Mon, 5 May 2025 10:54:44 +0000 (12:54 +0200)]
clip : fix confused naming ffn_up and ffn_down (#13290)

* clip : fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

7 weeks ago convert : bailingmoe : set yarn metadata if present (#13312)
Sigbjørn Skjæret [Mon, 5 May 2025 10:34:26 +0000 (12:34 +0200)]
convert : bailingmoe : set yarn metadata if present (#13312)

7 weeks ago SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
Akarshan Biswas [Mon, 5 May 2025 08:09:10 +0000 (13:39 +0530)]
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)

ggml-ci

7 weeks ago mtmd : add C public API (#13184)
Xuan-Son Nguyen [Sun, 4 May 2025 21:43:42 +0000 (23:43 +0200)]
mtmd : add C public API (#13184)

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* add const to various places

* add warning about breaking changes

* helper: use mtmd_image_tokens_get_n_pos

7 weeks ago rpc : use backend registry, support dl backends (#13304)
Diego Devesa [Sun, 4 May 2025 19:25:43 +0000 (21:25 +0200)]
rpc : use backend registry, support dl backends (#13304)

7 weeks ago ggml : activate s390x simd for Q3_K (#13301)
Aaron Teo [Sun, 4 May 2025 17:49:12 +0000 (01:49 +0800)]
ggml : activate s390x simd for Q3_K (#13301)

Signed-off-by: Aaron Teo <redacted>
7 weeks ago llava/mtmd : fixes to fully support dl backends (#13303)
Diego Devesa [Sun, 4 May 2025 15:05:20 +0000 (17:05 +0200)]
llava/mtmd : fixes to fully support dl backends (#13303)

7 weeks ago llama : build windows releases with dl backends (#13220)
Diego Devesa [Sun, 4 May 2025 12:20:49 +0000 (14:20 +0200)]
llama : build windows releases with dl backends (#13220)

7 weeks ago CUDA: fix race condition in MMQ stream-k fixup (#13299)
Johannes Gäßler [Sun, 4 May 2025 12:16:39 +0000 (14:16 +0200)]
CUDA: fix race condition in MMQ stream-k fixup (#13299)

7 weeks ago CUDA: fix race condition in MMQ ids_dst (#13294)
Johannes Gäßler [Sun, 4 May 2025 11:58:38 +0000 (13:58 +0200)]
CUDA: fix race condition in MMQ ids_dst (#13294)

7 weeks ago vulkan: Additional type support for unary, binary, and copy (#13266)
Jeff Bolz [Sun, 4 May 2025 05:17:16 +0000 (00:17 -0500)]
vulkan: Additional type support for unary, binary, and copy (#13266)

Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.

8 weeks ago imatrix: fix oob writes if src1 is not contiguous (#13286)
Johannes Gäßler [Sat, 3 May 2025 22:50:37 +0000 (00:50 +0200)]
imatrix: fix oob writes if src1 is not contiguous (#13286)

8 weeks ago clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking change) (#13259)
Xuan-Son Nguyen [Sat, 3 May 2025 18:07:54 +0000 (20:07 +0200)]
clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking change) (#13259)

8 weeks ago llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843)
ymcki [Sat, 3 May 2025 15:39:51 +0000 (23:39 +0800)]
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843)

8 weeks ago llama : move end-user examples to tools directory (#13249)
Diego Devesa [Fri, 2 May 2025 18:27:13 +0000 (20:27 +0200)]
llama : move end-user examples to tools directory (#13249)

* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <redacted>
8 weeks ago sync : ggml (#13268)
Georgi Gerganov [Fri, 2 May 2025 17:54:30 +0000 (20:54 +0300)]
sync : ggml (#13268)

* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) (see the note after this entry)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <redacted>
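
For reference, the depthwise 2D convolution (CONV_2D_DW) implemented by these kernels applies one filter per channel with no mixing across channels; with strides s_h, s_w and a K_h x K_w kernel:

    y_{c,i,j} = \sum_{u=0}^{K_h-1} \sum_{v=0}^{K_w-1} w_{c,u,v} \, x_{c,\, i s_h + u,\, j s_w + v}
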
8 weeks ago context : fix reorder logic (#13267)
Georgi Gerganov [Fri, 2 May 2025 17:54:13 +0000 (20:54 +0300)]
context : fix reorder logic (#13267)

ggml-ci

8 weeks ago ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)
shalinib-ibm [Fri, 2 May 2025 16:53:12 +0000 (22:23 +0530)]
ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)

This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.

The change results in 9x - 40x gains in total speed S t/s (i.e. all tokens / total time) across the various batch sizes tested with the llama-batched-bench benchmark.

The patch was tested with the Meta-Llama-3-8B and Mistral-7B models (BF16 models generated with llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <redacted>
8 weeks ago llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245)
Jared Van Bortel [Fri, 2 May 2025 15:42:30 +0000 (11:42 -0400)]
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245)