git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
4 weeks agollama : deprecate explicit kv_self defrag/update calls (#13921)
Georgi Gerganov [Sat, 31 May 2025 12:58:33 +0000 (15:58 +0300)]
llama : deprecate explicit kv_self defrag/update calls (#13921)

ggml-ci

4 weeks agollama : use n_swa + n_ubatch cells for SWA cache (#13833)
Georgi Gerganov [Sat, 31 May 2025 12:57:44 +0000 (15:57 +0300)]
llama : use n_swa + n_ubatch cells for SWA cache (#13833)

* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts

4 weeks agowebui : Replace alert and confirm with custom modals. (#13711)
igardev [Sat, 31 May 2025 09:56:08 +0000 (12:56 +0300)]
webui : Replace alert and confirm with custom modals. (#13711)

* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
4 weeks agollama : auto-batch preparation (#13845)
Georgi Gerganov [Sat, 31 May 2025 09:55:57 +0000 (12:55 +0300)]
llama : auto-batch preparation (#13845)

* llama : auto-batch

ggml-ci

* context : simplify if branching

4 weeks agomtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
Xuan-Son Nguyen [Sat, 31 May 2025 08:14:29 +0000 (10:14 +0200)]
mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)

* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir

4 weeks agokv-cache : refactor + add llama_memory_state_i (#13746)
Georgi Gerganov [Sat, 31 May 2025 07:24:04 +0000 (10:24 +0300)]
kv-cache : refactor + add llama_memory_state_i (#13746)

* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci

4 weeks agoCUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU in CUDA (#13856) (#13895)
Shawn yang [Sat, 31 May 2025 06:48:04 +0000 (14:48 +0800)]
CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU in CUDA (#13856) (#13895)

* 1. add "integrated" to ggml_cuda_device_info to distinguish integrated from discrete GPUs
2. adjust ggml_backend_cuda_device_supports_buft for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <redacted>
* add a host_buft assert for the integrated CUDA device case in evaluate_and_capture_cuda_graph()

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <redacted>
* revert the suggested commit changes since they are not applicable on Jetson devices

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence

Co-authored-by: Diego Devesa <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix CI bug: add a space

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: yangxiao <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: yangxiao <redacted>
Co-authored-by: Diego Devesa <redacted>
4 weeks agoCUDA: fix typo in FlashAttention code (#13926)
Johannes Gäßler [Fri, 30 May 2025 19:22:03 +0000 (21:22 +0200)]
CUDA: fix typo in FlashAttention code (#13926)

4 weeks agosched : avoid changing cur_copy when a graph is already allocated (#13922)
Diego Devesa [Fri, 30 May 2025 16:56:19 +0000 (09:56 -0700)]
sched : avoid changing cur_copy when a graph is already allocated (#13922)

4 weeks agoparallel : increase the variability of the prompt lengths (#13927)
Georgi Gerganov [Fri, 30 May 2025 16:38:07 +0000 (19:38 +0300)]
parallel : increase the variability of the prompt lengths (#13927)

ggml-ci

4 weeks agocuda : prevent using split buffers with 3d/4d matrices (#13919)
Diego Devesa [Fri, 30 May 2025 14:37:18 +0000 (07:37 -0700)]
cuda : prevent using split buffers with 3d/4d matrices (#13919)

4 weeks agoSYCL: Add mrope kernel (#13755)
Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (#13755)

* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
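
The pairing that makes the 2-wide `sycl::vec` loads pay off, and the `ceil_div` grid sizing, can be sketched as follows (an illustrative Python sketch, not the actual SYCL kernel; function names here are hypothetical):

```python
import math

def ceil_div(a, b):
    # round-up division, as used to size the kernel grid over element pairs
    return (a + b - 1) // b

def rope_rotate_pairs(x, theta):
    # RoPE-style kernels rotate elements two at a time, which is why
    # loading/storing them as 2-wide vectors is a natural fit
    out = list(x)
    c, s = math.cos(theta), math.sin(theta)
    for i in range(0, len(x) - 1, 2):
        x0, x1 = x[i], x[i + 1]
        out[i]     = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```

For `ne` elements the grid covers `ceil_div(ne, 2)` pairs; each work-item then does one vectorized load, one rotation, and one vectorized store.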

4 weeks agosync : vendor (#13901)
Georgi Gerganov [Fri, 30 May 2025 13:25:45 +0000 (16:25 +0300)]
sync : vendor (#13901)

* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci

4 weeks agoconvert : fix rwkv bos/eos token (#13844)
Sigbjørn Skjæret [Fri, 30 May 2025 12:50:43 +0000 (14:50 +0200)]
convert : fix rwkv bos/eos token (#13844)

4 weeks agoconvert : allow partial update to the chkhsh pre-tokenizer list (#13847)
Xuan-Son Nguyen [Fri, 30 May 2025 10:24:37 +0000 (12:24 +0200)]
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)

* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import

4 weeks agollama : add support for DistilBert (#13907)
Đinh Trọng Huy [Fri, 30 May 2025 09:56:02 +0000 (18:56 +0900)]
llama : add support for DistilBert (#13907)

* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <redacted>
4 weeks agollama : use llm_build_granite for minicpm (#13911)
zhangkaihuo [Fri, 30 May 2025 08:31:48 +0000 (16:31 +0800)]
llama : use llm_build_granite for minicpm (#13911)

4 weeks agocmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)

4 weeks agollama : add support for jina-reranker-v2 (#13900)
Sigbjørn Skjæret [Thu, 29 May 2025 19:42:31 +0000 (21:42 +0200)]
llama : add support for jina-reranker-v2 (#13900)

4 weeks agogguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)
Sigbjørn Skjæret [Thu, 29 May 2025 13:36:05 +0000 (15:36 +0200)]
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)

4 weeks agoarm64: optimize q4_k_q8_k kernel with i8mm (#13886)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)

This PR improves the q4_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on Neoverse-N2 with a Llama 3 8B Q4_K_M quantized model:
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity does not change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
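
The quoted uplift ranges can be cross-checked directly from the table above (a quick arithmetic sketch, using only the numbers reported in this commit):

```python
# S_PP (original, this pr) for all batch sizes, from the table above
s_pp = [(110.12, 147.83), (121.16, 172.42), (120.15, 169.75),
        (130.97, 196.81), (131.01, 196.88), (130.85, 196.51)]
# S_TG (original, this pr) for batch sizes 4 and above
s_tg = [(74.68, 84.00), (91.04, 114.74), (101.43, 135.79), (106.97, 147.29)]

def uplift_range(pairs):
    # percentage speedup of "this pr" over "original"
    ups = [(new / old - 1) * 100 for old, new in pairs]
    return min(ups), max(ups)

pp_lo, pp_hi = uplift_range(s_pp)  # roughly 34% .. 50%
tg_lo, tg_hi = uplift_range(s_tg)  # roughly 12% .. 38%
```

These reproduce the "34% ~ 50% S_PP" and "12% ~ 37% S_TG" figures quoted in the commit message.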

4 weeks agocmake: Factor out CPU architecture detection (#13883)
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (#13883)

* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function

4 weeks agoggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)

* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <redacted>
4 weeks agotests : remove json.hpp from a test (#13880)
Georgi Gerganov [Thu, 29 May 2025 09:17:16 +0000 (12:17 +0300)]
tests : remove json.hpp from a test (#13880)

ggml-ci

4 weeks agoconvert : workaround for AutoConfig dummy labels (#13881)
Sigbjørn Skjæret [Thu, 29 May 2025 08:00:57 +0000 (10:00 +0200)]
convert : workaround for AutoConfig dummy labels (#13881)

4 weeks agollama : add RobertaForSequenceClassification reranker support (#13875)
Sigbjørn Skjæret [Thu, 29 May 2025 06:15:01 +0000 (08:15 +0200)]
llama : add RobertaForSequenceClassification reranker support (#13875)

4 weeks agoggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)

* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE

4 weeks agogguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)
Beinsezii [Wed, 28 May 2025 21:50:20 +0000 (14:50 -0700)]
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)

4 weeks agollama : fix KV shift for qwen2vl (#13870)
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:31 +0000 (22:35 +0200)]
llama : fix KV shift for qwen2vl (#13870)

* llama : fix KV shift for qwen2vl

* add ref to the PR

4 weeks agomtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:22 +0000 (22:35 +0200)]
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)

* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code

4 weeks agoci: disable LLAMA_CURL for Linux cross-builds (#13871)
bandoti [Wed, 28 May 2025 18:46:47 +0000 (15:46 -0300)]
ci: disable LLAMA_CURL for Linux cross-builds (#13871)

4 weeks agollama : add support for BertForSequenceClassification reranker (#13858)
Đinh Trọng Huy [Wed, 28 May 2025 17:01:58 +0000 (02:01 +0900)]
llama : add support for BertForSequenceClassification reranker (#13858)

* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <redacted>
4 weeks agoconvert: small addition to support LlamaModel (#13838)
Đinh Trọng Huy [Wed, 28 May 2025 14:34:18 +0000 (23:34 +0900)]
convert: small addition to support LlamaModel (#13838)

Co-authored-by: dinhhuy <redacted>
4 weeks agoserver: correctly remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal model mode (#13853)
Sky [Wed, 28 May 2025 14:33:54 +0000 (22:33 +0800)]
server: correctly remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal model mode (#13853)

[fix]: correctly remove 'image_url'/'input_audio' from 'llama_params' in multimodal model mode

4 weeks agoconvert : fix qwen omni conversion (#13859)
Xuan-Son Nguyen [Wed, 28 May 2025 14:12:35 +0000 (16:12 +0200)]
convert : fix qwen omni conversion (#13859)

* convert : fix qwen omni conversion

* fix typo

4 weeks agotests : change umlaut test (#11600)
Alex Fanthome [Wed, 28 May 2025 13:49:28 +0000 (14:49 +0100)]
tests : change umlaut test (#11600)

4 weeks agoCUDA: fix FA tg at long context for CC >= 8.9 (#13852)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)

4 weeks agoconvert : fix tensor naming conflict for llama 4 vision (#13836)
Xuan-Son Nguyen [Wed, 28 May 2025 08:05:54 +0000 (10:05 +0200)]
convert : fix tensor naming conflict for llama 4 vision (#13836)

* convert : fix tensor naming conflict for llama 4 vision

* add comment

4 weeks agoCANN: Add SOC TYPE printing in cmake configuration (#13837)
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (#13837)

4 weeks agoopencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)

* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`

4 weeks agoopencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)

4 weeks agovulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)

Also change it to be controlled by an environment variable rather than a CMake flag

4 weeks agocmake : add llama-cparams.cpp to build (#13832)
Georgi Gerganov [Tue, 27 May 2025 16:08:44 +0000 (19:08 +0300)]
cmake : add llama-cparams.cpp to build (#13832)

4 weeks agoSYCL: add gelu_erf kernel (#13749)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (#13749)

* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <redacted>
4 weeks agosync : ggml
Georgi Gerganov [Tue, 27 May 2025 15:04:38 +0000 (18:04 +0300)]
sync : ggml

4 weeks agoggml : add ggml_repeat_4d (#13824)
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (#13824)

4 weeks agoggml : riscv: add xtheadvector support (#13720)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (#13720)

* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage

4 weeks agomtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
Xuan-Son Nguyen [Tue, 27 May 2025 12:06:10 +0000 (14:06 +0200)]
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)

* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens

4 weeks agodocs: remove link for llama-cli function calling (#13810)
bandoti [Tue, 27 May 2025 11:52:40 +0000 (08:52 -0300)]
docs: remove link for llama-cli function calling (#13810)

4 weeks agoggml-cpu: x86 feature detection is specific to x86 (#13811)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (#13811)

4 weeks agoggml : allow CUDA graphs when using pipeline parallelism (#13814)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (#13814)

4 weeks agokv-cells : track min/max used cells and per-sequence positions (#13808)
Georgi Gerganov [Tue, 27 May 2025 10:49:41 +0000 (13:49 +0300)]
kv-cells : track min/max used cells and per-sequence positions (#13808)

* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci

4 weeks agosampling : make sure samplers return at least 1 token (#13822)
Georgi Gerganov [Tue, 27 May 2025 09:07:52 +0000 (12:07 +0300)]
sampling : make sure samplers return at least 1 token (#13822)

* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
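
The guarantee can be sketched as a filter with a floor on the number of survivors (a minimal Python sketch of a min-p-style sampler, not the actual llama.cpp implementation; the function name is hypothetical):

```python
def min_p_filter(probs, min_p=0.05, min_keep=1):
    # keep tokens whose probability is at least min_p times the top
    # probability, but never return fewer than one candidate, even when
    # the threshold filters everything out or min_keep == 0
    floor = max(min_keep, 1)
    thr = max(probs) * min_p
    kept = [i for i, p in enumerate(probs) if p >= thr]
    if len(kept) < floor:
        # fall back to the top-probability tokens
        kept = sorted(range(len(probs)), key=lambda i: -probs[i])[:floor]
    return kept
```

Even a degenerate threshold (e.g. `min_p > 1`) or `min_keep == 0` then still yields at least the single most likely token.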

4 weeks agollama : validate seq id batch input (#13809)
Georgi Gerganov [Tue, 27 May 2025 06:40:59 +0000 (09:40 +0300)]
llama : validate seq id batch input (#13809)

* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci

4 weeks agoserver: --offline mode (#13804)
Olivier Chafik [Mon, 26 May 2025 21:34:27 +0000 (14:34 -0700)]
server: --offline mode (#13804)

* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks agoscripts : add option to compare commits in Debug (#13806)
Georgi Gerganov [Mon, 26 May 2025 19:24:01 +0000 (22:24 +0300)]
scripts : add option to compare commits in Debug (#13806)

* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS

4 weeks agocuda : avoid cuGetErrorString (#13791)
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (#13791)

ggml-ci

4 weeks agoSYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)

* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes

ggml-ci

* address review comments: make it more SYCL-like

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div

4 weeks agoserver: fix streaming crashes (#13786)
Olivier Chafik [Mon, 26 May 2025 15:03:57 +0000 (08:03 -0700)]
server: fix streaming crashes (#13786)

* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash

4 weeks agoexamples/training: Fix file name in README (#13803)
standby24x7 [Mon, 26 May 2025 14:55:24 +0000 (23:55 +0900)]
examples/training: Fix file name in README (#13803)

This patch fixes binary file names in README.md.

Signed-off-by: Masanari Iida <redacted>
4 weeks ago`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
Olivier Chafik [Mon, 26 May 2025 13:56:49 +0000 (06:56 -0700)]
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)

* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
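
What the "id location" fix means can be illustrated with a first streamed delta in the OpenAI-compatible shape: `id` and `type` belong on the tool_call object itself, not inside `function` (a hedged sketch; the id and tool name below are made up):

```python
import json

# first streamed tool-call delta, OpenAI-compatible shape
first_delta = {
    "tool_calls": [{
        "index": 0,
        "id": "call_abc123",        # hypothetical id: lives on the tool_call
        "type": "function",         # function type, set on the tool_call
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "arguments": "",        # populated even when still empty
        },
    }]
}
print(json.dumps(first_delta))
```

Subsequent deltas for the same call then carry only `index` and incremental `function.arguments` text.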

4 weeks agoserver: fix regression on streamed non-chat completion w/ stops (#13785)
Olivier Chafik [Mon, 26 May 2025 13:16:37 +0000 (06:16 -0700)]
server: fix regression on streamed non-chat completion w/ stops (#13785)

* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
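
The "more forgiving message diffs" idea can be sketched as: hold back any emitted suffix that could still grow into a stop word, and truncate only once a full stop word appears (an illustrative Python sketch, not the server's actual implementation; the helper name is hypothetical):

```python
def split_stream_chunk(text, stops):
    # a full stop word is present: emit the text before it and signal stop
    for s in stops:
        i = text.find(s)
        if i != -1:
            return text[:i], "", True
    # otherwise hold back the longest suffix that is a prefix of a stop
    # word, since the next chunk may complete it
    held = ""
    for s in stops:
        for n in range(min(len(s) - 1, len(text)), 0, -1):
            if text.endswith(s[:n]):
                held = max(held, s[:n], key=len)
                break
    return text[:len(text) - len(held)], held, False
```

Partial matches are thus delayed rather than erased: if the held-back suffix never completes a stop word, it is released with the following chunk.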

4 weeks agoexamples : allow extracting embeddings from decoder contexts (#13797)
Georgi Gerganov [Mon, 26 May 2025 11:03:54 +0000 (14:03 +0300)]
examples : allow extracting embeddings from decoder contexts (#13797)

ggml-ci

4 weeks agollama : clarify deprecation message (#13794)
Georgi Gerganov [Mon, 26 May 2025 09:57:50 +0000 (12:57 +0300)]
llama : clarify deprecation message (#13794)

4 weeks agosycl: Add more debug prints (#13640)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (#13640)

4 weeks agovulkan: mark IM2COL as supporting non-contig (#13783)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (#13783)

4 weeks agoCANN: Add the basic supports of Flash Attention kernel (#13627)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (#13627)

* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline

4 weeks ago`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
Olivier Chafik [Sun, 25 May 2025 23:30:51 +0000 (00:30 +0100)]
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks agowebui : bump max upload file size to 500MB (#13779)
Xuan-Son Nguyen [Sun, 25 May 2025 17:02:18 +0000 (19:02 +0200)]
webui : bump max upload file size to 500MB (#13779)

4 weeks agotests : improve UGM tokenizer test coverage (#13773)
Sigbjørn Skjæret [Sun, 25 May 2025 14:22:29 +0000 (16:22 +0200)]
tests : improve UGM tokenizer test coverage (#13773)

4 weeks agokv-cache : rework kv_cell (#13706)
Georgi Gerganov [Sun, 25 May 2025 13:34:36 +0000 (16:34 +0300)]
kv-cache : rework kv_cell (#13706)

* kv-cache : rework kv_cell

ggml-ci

* kv-cells : use "shift" instead of "delta" consistently

ggml-ci

* llama : add llama_max_parallel_sequences()

ggml-ci

* kv-cells : update comments [no ci]

* context : fail upon construction if sequences exceed max value

ggml-ci

* kv-cells : get_pos() -> pos_get() + comments

ggml-ci

* kv-cells : fix tracking of "used" cells

ggml-ci

4 weeks agorpc : Fix build on OpenBSD (#13541)
Percy Piper [Sun, 25 May 2025 12:35:53 +0000 (13:35 +0100)]
rpc : Fix build on OpenBSD (#13541)

4 weeks agomtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
Xuan-Son Nguyen [Sun, 25 May 2025 12:06:32 +0000 (14:06 +0200)]
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)

* mtmd : add Qwen2-Audio support

* small clean up

* update discussion link

* clarify mtmd_get_output_embd

* clarification in multimodal.md

* fix ultravox bug

* ggml_cont

4 weeks agodocs : add Moondream2 pre-quantized link (#13745)
ddpasa [Sun, 25 May 2025 12:04:49 +0000 (14:04 +0200)]
docs : add Moondream2 pre-quantized link (#13745)

* Multimodal: Added Moondream2 model and fixed ggml.org link

* Apply suggestions from code review

---------

Co-authored-by: name <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
4 weeks agoserver: fix/test add_generation_prompt (#13770)
Olivier Chafik [Sun, 25 May 2025 09:45:49 +0000 (10:45 +0100)]
server: fix/test add_generation_prompt (#13770)

Co-authored-by: ochafik <redacted>
4 weeks agollama : add support for Qwen3 MoE tied word embeddings (#13768)
Piotr Jasiukajtis [Sun, 25 May 2025 08:29:43 +0000 (10:29 +0200)]
llama : add support for Qwen3 MoE tied word embeddings (#13768)

4 weeks agoSYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)

Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci

5 weeks ago`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
Olivier Chafik [Sun, 25 May 2025 00:48:08 +0000 (01:48 +0100)]
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)

* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <redacted>
Co-authored-by: Olivier Chafik <redacted>
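
The "truncated json healing" step can be sketched as: track open containers and in-string state while scanning, then append the missing closers (a naive Python sketch, not the actual common_json; it does not handle values cut mid-token such as a dangling colon):

```python
import json

def heal_truncated_json(s):
    stack = []               # closers owed for each open { or [
    in_str = esc = False
    for ch in s:
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    healed = s + ('"' if in_str else "")  # close an unterminated string
    return healed + "".join(reversed(stack))
```

This is enough to turn a mid-stream fragment like `{"a": [1, 2` back into parseable JSON so partial tool-call arguments can be surfaced while streaming.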
5 weeks agoreleases : bundle llvm omp library in windows release (#13763)
Diego Devesa [Sat, 24 May 2025 22:55:16 +0000 (15:55 -0700)]
releases : bundle llvm omp library in windows release (#13763)

5 weeks agoreleases : enable openmp in windows cpu backend build (#13756)
Diego Devesa [Sat, 24 May 2025 20:27:03 +0000 (13:27 -0700)]
releases : enable openmp in windows cpu backend build (#13756)

5 weeks agoggml-cpu : set openmp wait time if not set (#13758)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (#13758)

5 weeks agoMove GLM4 f32 attention fix to the correct function (#13750)
0cc4m [Sat, 24 May 2025 14:49:12 +0000 (16:49 +0200)]
Move GLM4 f32 attention fix to the correct function (#13750)

5 weeks agoggml : add ggml_gelu_erf() CUDA kernel (#13719)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (#13719)

* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon

5 weeks agovocab : fix ugm tokenizer precision (#13743)
Sigbjørn Skjæret [Sat, 24 May 2025 10:29:09 +0000 (12:29 +0200)]
vocab : fix ugm tokenizer precision (#13743)

5 weeks agoCUDA: fix race condition in FA vector kernels (#13742)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (#13742)

5 weeks agoci : enable winget package updates (#13734)
Diego Devesa [Fri, 23 May 2025 20:14:00 +0000 (13:14 -0700)]
ci : enable winget package updates (#13734)

5 weeks agoci : add winget package updater (#13732)
Diego Devesa [Fri, 23 May 2025 20:09:38 +0000 (13:09 -0700)]
ci : add winget package updater (#13732)

5 weeks agohparams : initialize arrays (#13728)
Georgi Gerganov [Fri, 23 May 2025 17:16:13 +0000 (20:16 +0300)]
hparams : initialize arrays (#13728)

ggml-ci

5 weeks agollama : allow custom list of swa_layers (#13726)
Xuan-Son Nguyen [Fri, 23 May 2025 15:07:04 +0000 (17:07 +0200)]
llama : allow custom list of swa_layers (#13726)

5 weeks agoserver : support audio input (#13714)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)

* server : support audio input

* add audio support on webui

5 weeks agoCANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)

* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <redacted>
* codestyle adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
5 weeks agoggml : fix the order of ggml_unary_op (#13718)
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)

5 weeks agovulkan: support CPY from any type to itself (#13695)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
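
The element-count scaling can be sketched as: treat the source as raw bytes and dispatch the 2-byte (f16) or 4-byte (f32) copy shader over bytes divided by the shader's element size (an illustrative sketch; the helper name is hypothetical, not the actual Vulkan backend code):

```python
def copy_shader_elem_count(n_bytes, shader_elem_size=2):
    # reuse the f16 copy shader (2-byte elements) for any type by copying
    # its raw bytes as n_bytes / 2 "elements"
    assert n_bytes % shader_elem_size == 0
    return n_bytes // shader_elem_size
```

For example, a same-type copy of 4096 bytes of quantized data would run the f16 shader over 2048 elements, with no type-specific shader needed.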

4 weeks agovulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)

5 weeks agouse LOG_WARN to replace `std::cerr` (#13657)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)

5 weeks agorelease : fix windows hip release (#13707)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)

* release : fix windows hip release

* make single hip release with multiple targets

5 weeks agotts : fix n_ubatch + make WavTokenizer cache-less (#13713)
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)

ggml-ci

5 weeks agomtmd : add ultravox audio input (#13623)
Xuan-Son Nguyen [Thu, 22 May 2025 18:42:48 +0000 (20:42 +0200)]
mtmd : add ultravox audio input (#13623)

* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()

5 weeks agocommon: Include torch package for s390x (#13699)
Aaron Teo [Thu, 22 May 2025 18:31:29 +0000 (02:31 +0800)]
common: Include torch package for s390x (#13699)

* common: update requirements.txt to include pytorch nightly for s390x

Signed-off-by: Aaron Teo <redacted>
* common: fix torch installation via pip for s390x

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
5 weeks agoserver : pad small embedding batches (#13692)
Georgi Gerganov [Thu, 22 May 2025 13:33:39 +0000 (16:33 +0300)]
server : pad small embedding batches (#13692)

ggml-ci