git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Max Krasnyansky [Sat, 31 May 2025 22:39:19 +0000 (15:39 -0700)]
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.
Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.
Co-authored-by: Diego Devesa <redacted>
* threading: disable SetThreadInfo() calls for older Windows versions
* Update tools/llama-bench/llama-bench.cpp
Co-authored-by: Diego Devesa <redacted>
---------
Co-authored-by: Diego Devesa <redacted>
Jiří Podivín [Sat, 31 May 2025 16:58:35 +0000 (18:58 +0200)]
docs : note that libcurl must be installed for a standard build. (#13945)
Signed-off-by: Jiri Podivin <redacted>
Olivier Chafik [Sat, 31 May 2025 15:26:10 +0000 (08:26 -0700)]
server: allow unclosed thinking tags (#13931)
Georgi Gerganov [Sat, 31 May 2025 12:58:33 +0000 (15:58 +0300)]
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
Georgi Gerganov [Sat, 31 May 2025 12:57:44 +0000 (15:57 +0300)]
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
igardev [Sat, 31 May 2025 09:56:08 +0000 (12:56 +0300)]
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.
* use Modal Provider to simplify the use of confirm and alert modals.
* Increase the z index of the modal dialogs.
* Update index.html.gz
* also add showPrompt
* rebuild
---------
Co-authored-by: igardev <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Georgi Gerganov [Sat, 31 May 2025 09:55:57 +0000 (12:55 +0300)]
llama : auto-batch preparation (#13845)
* llama : auto-batch
ggml-ci
* context : simplify if branching
Xuan-Son Nguyen [Sat, 31 May 2025 08:14:29 +0000 (10:14 +0200)]
mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header
* no object
* apply suggestion from Georgi
* rm mtmd-helper, merge it to mtmd
* missing vendor include dir
Georgi Gerganov [Sat, 31 May 2025 07:24:04 +0000 (10:24 +0300)]
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
Shawn yang [Sat, 31 May 2025 06:48:04 +0000 (14:48 +0800)]
CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU (#13856) (#13895)
* 1. add "integrated" in ggml_cuda_device_info to distinguish whether the device is an integrated GPU or a discrete GPU
2. adjust the func "ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted code indentation
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fixed incorrect setting of variable types
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the judgment logic
Co-authored-by: Johannes Gäßler <redacted>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add a defensive security assert
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the support judgment logic.
Co-authored-by: Johannes Gäßler <redacted>
* revert the suggested commit changes, as they are not applicable on Jetson devices
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add parentheses to enforce operator precedence
Co-authored-by: Diego Devesa <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fix CI bug: add a space
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: yangxiao <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: yangxiao <redacted>
Co-authored-by: Diego Devesa <redacted>
Johannes Gäßler [Fri, 30 May 2025 19:22:03 +0000 (21:22 +0200)]
CUDA: fix typo in FlashAttention code (#13926)
Diego Devesa [Fri, 30 May 2025 16:56:19 +0000 (09:56 -0700)]
sched : avoid changing cur_copy when a graph is already allocated (#13922)
Georgi Gerganov [Fri, 30 May 2025 16:38:07 +0000 (19:38 +0300)]
parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
Diego Devesa [Fri, 30 May 2025 14:37:18 +0000 (07:37 -0700)]
cuda : prevent using split buffers with 3d/4d matrices (#13919)
Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
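As a rough illustration of the pairwise rotation such rope kernels vectorize (a Python sketch only, not the SYCL code; `rope_neox_pair` and the theta formula are assumptions for illustration — the vectorized kernel loads and stores both halves of each pair together, halving memory accesses):

```python
import math

def rope_neox_pair(x, pos, dim, base=10000.0):
    # rotate element pairs (i, i + dim//2) by a position-dependent angle;
    # each iteration touches two elements, which is what sycl::vec batches
    half = dim // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```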
Georgi Gerganov [Fri, 30 May 2025 13:25:45 +0000 (16:25 +0300)]
sync : vendor (#13901)
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
Sigbjørn Skjæret [Fri, 30 May 2025 12:50:43 +0000 (14:50 +0200)]
convert : fix rwkv bos/eos token (#13844)
Xuan-Son Nguyen [Fri, 30 May 2025 10:24:37 +0000 (12:24 +0200)]
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
Đinh Trọng Huy [Fri, 30 May 2025 09:56:02 +0000 (18:56 +0900)]
llama : add support for DistilBert (#13907)
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <redacted>
zhangkaihuo [Fri, 30 May 2025 08:31:48 +0000 (16:31 +0800)]
llama : use llm_build_granite for minicpm (#13911)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)
Sigbjørn Skjæret [Thu, 29 May 2025 19:42:31 +0000 (21:42 +0200)]
llama : add support for jina-reranker-v2 (#13900)
Sigbjørn Skjæret [Thu, 29 May 2025 13:36:05 +0000 (15:36 +0200)]
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k gemm kernel with the arm64 i8mm instruction.
Tested on neoverse-n2 with a llama3 8b q4_k_m quantized model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
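The i8mm SMMLA instruction this kernel builds on multiplies a 2×8 int8 tile by another 2×8 tile (taken transposed) and accumulates a 2×2 int32 block; a minimal Python model of that single step (a sketch, with hypothetical names):

```python
def smmla_2x2(acc, a, b):
    # models one arm64 i8mm SMMLA step: acc (2x2 int32) += a (2x8 int8) * b^T
    for r in range(2):
        for c in range(2):
            acc[r][c] += sum(a[r][k] * b[c][k] for k in range(8))
    return acc
```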
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture
The tests and results match exactly those of ggml/src/CMakeLists.txt
* Switch arch detection over to new function
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 29 May 2025 09:17:16 +0000 (12:17 +0300)]
tests : remove json.hpp from a test (#13880)
ggml-ci
Sigbjørn Skjæret [Thu, 29 May 2025 08:00:57 +0000 (10:00 +0200)]
convert : workaround for AutoConfig dummy labels (#13881)
Sigbjørn Skjæret [Thu, 29 May 2025 06:15:01 +0000 (08:15 +0200)]
llama : add RobertaForSequenceClassification reranker support (#13875)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
Beinsezii [Wed, 28 May 2025 21:50:20 +0000 (14:50 -0700)]
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:31 +0000 (22:35 +0200)]
llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl
* add ref to the PR
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:22 +0000 (22:35 +0200)]
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
bandoti [Wed, 28 May 2025 18:46:47 +0000 (15:46 -0300)]
ci: disable LLAMA_CURL for Linux cross-builds (#13871)
Đinh Trọng Huy [Wed, 28 May 2025 17:01:58 +0000 (02:01 +0900)]
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <redacted>
Đinh Trọng Huy [Wed, 28 May 2025 14:34:18 +0000 (23:34 +0900)]
convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <redacted>
Sky [Wed, 28 May 2025 14:33:54 +0000 (22:33 +0800)]
server: fix to effectively remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal model mode (#13853)
[fix]: effectively remove 'image_url'/'input_audio' for 'llama_params' in multimodal model mode
Xuan-Son Nguyen [Wed, 28 May 2025 14:12:35 +0000 (16:12 +0200)]
convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion
* fix typo
Alex Fanthome [Wed, 28 May 2025 13:49:28 +0000 (14:49 +0100)]
tests : change umlaut test (#11600)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)
Xuan-Son Nguyen [Wed, 28 May 2025 08:05:54 +0000 (10:05 +0200)]
convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision
* add comment
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (#13837)
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
Georgi Gerganov [Tue, 27 May 2025 16:08:44 +0000 (19:08 +0300)]
cmake : add llama-cparams.cpp to build (#13832)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel
* refactor code
Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print
---------
Co-authored-by: Atharva Dubey <redacted>
Georgi Gerganov [Tue, 27 May 2025 15:04:38 +0000 (18:04 +0300)]
sync : ggml
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (#13824)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage
Xuan-Son Nguyen [Tue, 27 May 2025 12:06:10 +0000 (14:06 +0200)]
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* more strict validate of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
bandoti [Tue, 27 May 2025 11:52:40 +0000 (08:52 -0300)]
docs: remove link for llama-cli function calling (#13810)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (#13811)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (#13814)
Georgi Gerganov [Tue, 27 May 2025 10:49:41 +0000 (13:49 +0300)]
kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
Georgi Gerganov [Tue, 27 May 2025 09:07:52 +0000 (12:07 +0300)]
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
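The invariant these fixes enforce can be pictured with a min-p-style filter (a Python sketch, not the llama.cpp sampler; `min_p_filter` is a hypothetical name):

```python
def min_p_filter(probs, p, min_keep=1):
    # drop tokens whose probability falls below p * max(probs), but never
    # return an empty candidate set: even min_keep == 0 yields one token
    threshold = p * max(probs)
    kept = [i for i, q in enumerate(probs) if q >= threshold]
    n_min = max(min_keep, 1)
    if len(kept) < n_min:
        # fall back to the highest-probability tokens so at least one survives
        kept = sorted(range(len(probs)), key=lambda i: -probs[i])[:n_min]
    return kept
```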
Georgi Gerganov [Tue, 27 May 2025 06:40:59 +0000 (09:40 +0300)]
llama : validate seq id batch input (#13809)
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
Olivier Chafik [Mon, 26 May 2025 21:34:27 +0000 (14:34 -0700)]
server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <redacted>
Georgi Gerganov [Mon, 26 May 2025 19:24:01 +0000 (22:24 +0300)]
scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (#13791)
ggml-ci
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
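The shared offset helper these commits converge on amounts to the standard strided-index computation (a Python sketch; `calculate_offset` is used here as an illustrative name):

```python
def calculate_offset(strides, idx):
    # element offset in a possibly non-contiguous tensor: for a contiguous
    # layout the strides are row-major, but the same formula also covers
    # permuted or broadcast views, which is what lets one helper serve both
    return sum(i * s for i, s in zip(idx, strides))
```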
Olivier Chafik [Mon, 26 May 2025 15:03:57 +0000 (08:03 -0700)]
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
standby24x7 [Mon, 26 May 2025 14:55:24 +0000 (23:55 +0900)]
examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.
Signed-off-by: Masanari Iida <redacted>
Olivier Chafik [Mon, 26 May 2025 13:56:49 +0000 (06:56 -0700)]
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
Olivier Chafik [Mon, 26 May 2025 13:16:37 +0000 (06:16 -0700)]
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
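The distinction ("partial stop words aren't erased, full stops are") can be sketched as classifying the tail of a streamed message against the stop words (Python, hypothetical helper; the server's actual diff logic is more involved):

```python
def classify_stop(text, stops):
    # returns ("full", clean_text) when a stop word completed (it is erased),
    # ("partial", text) when the tail could still grow into a stop word
    # (kept, not erased), or ("none", text) otherwise
    for s in stops:
        i = text.find(s)
        if i != -1:
            return "full", text[:i]
    for s in stops:
        for k in range(len(s) - 1, 0, -1):
            if text.endswith(s[:k]):
                return "partial", text
    return "none", text
```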
Georgi Gerganov [Mon, 26 May 2025 11:03:54 +0000 (14:03 +0300)]
examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
Georgi Gerganov [Mon, 26 May 2025 09:57:50 +0000 (12:57 +0300)]
llama : clarify deprecation message (#13794)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (#13640)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (#13783)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (#13627)
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
Olivier Chafik [Sun, 25 May 2025 23:30:51 +0000 (00:30 +0100)]
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Xuan-Son Nguyen [Sun, 25 May 2025 17:02:18 +0000 (19:02 +0200)]
webui : bump max upload file size to 500MB (#13779)
Sigbjørn Skjæret [Sun, 25 May 2025 14:22:29 +0000 (16:22 +0200)]
tests : improve UGM tokenizer test coverage (#13773)
Georgi Gerganov [Sun, 25 May 2025 13:34:36 +0000 (16:34 +0300)]
kv-cache : rework kv_cell (#13706)
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
Percy Piper [Sun, 25 May 2025 12:35:53 +0000 (13:35 +0100)]
rpc : Fix build on OpenBSD (#13541)
Xuan-Son Nguyen [Sun, 25 May 2025 12:06:32 +0000 (14:06 +0200)]
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
ddpasa [Sun, 25 May 2025 12:04:49 +0000 (14:04 +0200)]
docs : add Moondream2 pre-quantized link (#13745)
* Multimodal: Added Moondream2 model and fixed ggml.org link
* Apply suggestions from code review
---------
Co-authored-by: name <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Olivier Chafik [Sun, 25 May 2025 09:45:49 +0000 (10:45 +0100)]
server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <redacted>
Piotr Jasiukajtis [Sun, 25 May 2025 08:29:43 +0000 (10:29 +0200)]
llama : add support for Qwen3 MoE tied word embeddings (#13768)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation
This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.
ggml-ci
Olivier Chafik [Sun, 25 May 2025 00:48:08 +0000 (01:48 +0100)]
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Olivier Chafik <redacted>
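The "truncated json healing" mentioned for common_json can be pictured as closing whatever strings and brackets a cut-off tool call left open (a Python sketch under that assumption; `heal_truncated_json` is a hypothetical name, not the C++ API):

```python
def heal_truncated_json(s: str) -> str:
    # scan once, tracking open brackets and unterminated strings, then append
    # the matching closers so a partially streamed object parses
    stack, in_str, esc = [], False, False
    for ch in s:
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()
    out = s + ('"' if in_str else "")
    for ch in reversed(stack):
        out += "}" if ch == "{" else "]"
    return out
```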
Diego Devesa [Sat, 24 May 2025 22:55:16 +0000 (15:55 -0700)]
releases : bundle llvm omp library in windows release (#13763)
Diego Devesa [Sat, 24 May 2025 20:27:03 +0000 (13:27 -0700)]
releases : enable openmp in windows cpu backend build (#13756)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (#13758)
0cc4m [Sat, 24 May 2025 14:49:12 +0000 (16:49 +0200)]
Move GLM4 f32 attention fix to the correct function (#13750)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
Sigbjørn Skjæret [Sat, 24 May 2025 10:29:09 +0000 (12:29 +0200)]
vocab : fix ugm tokenizer precision (#13743)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (#13742)
Diego Devesa [Fri, 23 May 2025 20:14:00 +0000 (13:14 -0700)]
ci : enable winget package updates (#13734)
Diego Devesa [Fri, 23 May 2025 20:09:38 +0000 (13:09 -0700)]
ci : add winget package updater (#13732)
Georgi Gerganov [Fri, 23 May 2025 17:16:13 +0000 (20:16 +0300)]
hparams : initialize arrays (#13728)
ggml-ci
Xuan-Son Nguyen [Fri, 23 May 2025 15:07:04 +0000 (17:07 +0200)]
llama : allow custom list of swa_layers (#13726)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)
* server : support audio input
* add audio support on webui
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <redacted>
* codestyle adjustment
Signed-off-by: noemotiovon <redacted>
---------
Signed-off-by: noemotiovon <redacted>
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
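The element-count scaling described here can be sketched in a few lines (Python; `scaled_dispatch` is an illustrative name, not the Vulkan backend's):

```python
def scaled_dispatch(n_elems, type_size, unit_size=4):
    # a same-type copy of n_elems elements of type_size bytes can reuse a
    # unit_size-wide (e.g. f32) copy shader by rescaling the element count
    total = n_elems * type_size
    assert total % unit_size == 0, "total byte count must be a multiple of the shader's unit"
    return total // unit_size
```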
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)
* release : fix windows hip release
* make single hip release with multiple targets
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci