git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Uilian Ries [Wed, 24 Sep 2025 06:53:47 +0000 (08:53 +0200)]
common : add missing chrono header for common.cpp (#16211)
Signed-off-by: Uilian Ries <redacted>
Sigbjørn Skjæret [Wed, 24 Sep 2025 06:53:20 +0000 (08:53 +0200)]
codeowners : match all requirements files (#16214)
Jie Fu (傅杰) [Wed, 24 Sep 2025 06:46:52 +0000 (14:46 +0800)]
model-conversion : run-org-model.py fails to run on mac m1 (#16213)
Signed-off-by: Jie Fu <redacted>
Daniel Bevenius [Wed, 24 Sep 2025 06:10:09 +0000 (08:10 +0200)]
codeowners : use slash prefix for root files [no ci] (#16210)
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.
The motivation for this is that, without the leading slash, these
patterns could also match same-named files in subdirectories and
override their other/additional owners.
Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
Jie Fu (傅杰) [Wed, 24 Sep 2025 04:19:23 +0000 (12:19 +0800)]
model-conversion : fix the make targets in the README.md (#16209)
Fix two incorrect make targets in the readme.
Signed-off-by: Jie Fu <redacted>
Georgi Gerganov [Tue, 23 Sep 2025 17:41:40 +0000 (20:41 +0300)]
ci : disable AMD workflows + update NVIDIA workflows (#16200)
* ci : disable AMD workflows + update NVIDIA workflows
* cont : fixes
* cont : update nvidia vulkan workflows
Georgi Gerganov [Tue, 23 Sep 2025 10:44:25 +0000 (13:44 +0300)]
ci : enable Vulkan workflow on Mac (#16194)
Xiangyan Sun [Tue, 23 Sep 2025 08:58:12 +0000 (01:58 -0700)]
ggml-cpu: Respect cpumask settings (#16164)
Sigbjørn Skjæret [Tue, 23 Sep 2025 08:25:20 +0000 (10:25 +0200)]
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl
* change initialization to true
Aaron Teo [Tue, 23 Sep 2025 06:53:05 +0000 (14:53 +0800)]
zdnn: refactor codebase + add docs (#16178)
* zdnn: initial matmul refactor
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rm static from funcs
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update ggml-zdnn.h
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: change header files to hpp
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: switch to common.hpp
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: move mulmat forward around
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rm inline from utils
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: code cleanup
Signed-off-by: Aaron Teo <redacted>
* docs: add zDNN docs
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Daniel Bevenius [Tue, 23 Sep 2025 06:13:22 +0000 (08:13 +0200)]
codeowners : add @danbev to model-conversion example [no ci] (#16190)
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
Aaron Teo [Tue, 23 Sep 2025 05:59:34 +0000 (13:59 +0800)]
devops: add s390x containers (#15915)
* devops: add s390x dockerfile
Signed-off-by: Aaron Teo <redacted>
* devops: add missing ninja
Signed-off-by: Aaron Teo <redacted>
* devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <redacted>
* devops: rework s390x docker
Signed-off-by: Aaron Teo <redacted>
* devops: copy more tools
Signed-off-by: Aaron Teo <redacted>
* devops: add server build step
Signed-off-by: Aaron Teo <redacted>
* devops: remove apt clean steps as distroless misses it
Signed-off-by: Aaron Teo <redacted>
* devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <redacted>
* devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <redacted>
* devops: use correct libs path
Signed-off-by: Aaron Teo <redacted>
* devops: fix shared libs
Signed-off-by: Aaron Teo <redacted>
* devops: add collector stage
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing stage ref
Signed-off-by: Aaron Teo <redacted>
* devops: fix permission issue
Signed-off-by: Aaron Teo <redacted>
* devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing ggml shared object
failure to load model
Signed-off-by: Aaron Teo <redacted>
* devops: remove move shared objects
Signed-off-by: Aaron Teo <redacted>
* devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <redacted>
* devops: finalise hardened server stage
Signed-off-by: Aaron Teo <redacted>
* devops: add cli target
Signed-off-by: Aaron Teo <redacted>
* devops: fix typos
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <redacted>
* devops: update debian target
Signed-off-by: Aaron Teo <redacted>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <redacted>
* Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af8466a15f318ff209e02ac3c4e551cc18.
Signed-off-by: Aaron Teo <redacted>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 0a7664af8466a15f318ff209e02ac3c4e551cc18)
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <redacted>
* devops: fix copying process
Signed-off-by: Aaron Teo <redacted>
* devops: make build dir an argument
Signed-off-by: Aaron Teo <redacted>
* Revert "devops: make build dir an argument"
This reverts commit 438698976b8a5181c1e8179600527cfd5a50cc23.
Signed-off-by: Aaron Teo <redacted>
* devops: add build stage for gguf-py
Signed-off-by: Aaron Teo <redacted>
* devops: move gguf-py installation into build stage
Signed-off-by: Aaron Teo <redacted>
* devops: break system packages?
Signed-off-by: Aaron Teo <redacted>
* devops: add rust compiler installer
Signed-off-by: Aaron Teo <redacted>
* devops: fix rustc not found
Signed-off-by: Aaron Teo <redacted>
* devops: remove cache mount to allow rustc to persist
Signed-off-by: Aaron Teo <redacted>
* devops: move rustc installation to another layer
Signed-off-by: Aaron Teo <redacted>
* devops: move gguf-py installation to full stage, fix copying
Signed-off-by: Aaron Teo <redacted>
* devops: remove rustc installation in build
Signed-off-by: Aaron Teo <redacted>
* devops: disable full target for now
Signed-off-by: Aaron Teo <redacted>
* devops: attempting static build
Signed-off-by: Aaron Teo <redacted>
* devops: merge s390x dockerfile into cpu for now
Signed-off-by: Aaron Teo <redacted>
* devops: switch to gcc image for build step
Signed-off-by: Aaron Teo <redacted>
* devops: remove build essentials
Signed-off-by: Aaron Teo <redacted>
* devops: install openblas into base target
Signed-off-by: Aaron Teo <redacted>
* devops: go back to s390x dockerfile
Signed-off-by: Aaron Teo <redacted>
* devops: remove libggml and libblas
Signed-off-by: Aaron Teo <redacted>
* devops: add full target
Signed-off-by: Aaron Teo <redacted>
* devops: add break system packages
Signed-off-by: Aaron Teo <redacted>
* devops: add libjpeg
Signed-off-by: Aaron Teo <redacted>
* devops: add missing cmake dep
Signed-off-by: Aaron Teo <redacted>
* devops: finalise docker images for s390x
Signed-off-by: Aaron Teo <redacted>
* devops: add custom openblas patch
Signed-off-by: Aaron Teo <redacted>
* devops: use libopenblas-dev instead of libopenblas-openmp-dev
Signed-off-by: Aaron Teo <redacted>
* devops: add s390x docker build
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Daniel Bevenius [Tue, 23 Sep 2025 03:59:03 +0000 (05:59 +0200)]
ggml-cpu : fix typo in gemm comments [no ci] (#16189)
Gabe Goodhart [Mon, 22 Sep 2025 18:40:10 +0000 (12:40 -0600)]
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177)
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (i.e. Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the C++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.
Branch: gabe-l-hart/GraniteNonHybridConversion
Signed-off-by: Gabe Goodhart <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Haiyue Wang [Mon, 22 Sep 2025 17:57:46 +0000 (01:57 +0800)]
clang-tidy : disable warning about performance enum size (#16127)
Disable 'performance-enum-size' checking:
Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
Sigbjørn Skjæret [Mon, 22 Sep 2025 17:13:00 +0000 (19:13 +0200)]
ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <redacted>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jeff Bolz <redacted>
Georgi Gerganov [Mon, 22 Sep 2025 15:20:21 +0000 (18:20 +0300)]
codeowners : update + cleanup (#16174)
---------
Co-authored-by: slaren <redacted>
Adrien Gallouët [Mon, 22 Sep 2025 12:13:51 +0000 (14:13 +0200)]
common : enable `--offline` mode without curl support (#16137)
* common : use the json parser
Signed-off-by: Adrien Gallouët <redacted>
* common : enable --offline mode without CURL support
This change refactors the download logic to properly support offline mode
even when the project is built without CURL.
Without this commit, using `--offline` would give the following error:
error: built without CURL, cannot download model from the internet
even if all the files are already cached.
Signed-off-by: Adrien Gallouët <redacted>
---------
Signed-off-by: Adrien Gallouët <redacted>
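A minimal sketch of the intended flow (illustrative only, not the actual common/ download code; the helper name `resolve_model` is made up): with `--offline`, a cached file is used without touching the network, so the "built without CURL" error should only trigger when a download is actually required.
```cpp
#include <filesystem>
#include <stdexcept>
#include <string>

// Hypothetical helper: decide whether a cached model can be used.
static std::string resolve_model(const std::string & cache_path, bool offline, bool have_curl) {
    if (std::filesystem::exists(cache_path)) {
        return cache_path; // already cached: works offline and without curl
    }
    if (offline) {
        throw std::runtime_error("model not cached and --offline is set");
    }
    if (!have_curl) {
        throw std::runtime_error("built without CURL, cannot download model from the internet");
    }
    // ... perform the download here ...
    return cache_path;
}
```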
Quentin Bramas [Mon, 22 Sep 2025 08:53:13 +0000 (10:53 +0200)]
webui : fix handling incomplete chunks (#16107)
GideonSerf [Mon, 22 Sep 2025 08:49:58 +0000 (10:49 +0200)]
embedding : fix typos in README (#16171)
Haiyue Wang [Mon, 22 Sep 2025 08:48:42 +0000 (16:48 +0800)]
common : remove unused local variables (#16140)
These two local variables 'arg' and 'arg_prefix' have been overridden by:
1. for (const auto & arg : opt.args)
2. for (int i = 1; i < argc; i++) {
const std::string arg_prefix = "--";
std::string arg = argv[i];
Georgi Gerganov [Mon, 22 Sep 2025 08:12:37 +0000 (11:12 +0300)]
ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph
* cont : fix wrong bounds check condition
* cont : remove unnecessary overload
Georgi Gerganov [Mon, 22 Sep 2025 08:12:09 +0000 (11:12 +0300)]
ggml : add ggml_op_is_empty (#16122)
* ggml : add ggml_op_is_empty
* ggml : move to ggml-impl.h
Xuan-Son Nguyen [Mon, 22 Sep 2025 08:10:58 +0000 (15:10 +0700)]
codeowners : update ownership for @ngxson and @allozuar (#16128)
Shin-myoung-serp [Mon, 22 Sep 2025 08:04:01 +0000 (17:04 +0900)]
Vulkan: add conv_transpose_2d operation (#16022)
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
Sigbjørn Skjæret [Mon, 22 Sep 2025 07:59:05 +0000 (09:59 +0200)]
codeowners : claim responsibility for ci, models, gguf-py and convert (#16124)
* claim responsibility for ci, gguf-py and convert
* add myself to various src/llama- files
Georgi Gerganov [Mon, 22 Sep 2025 07:58:02 +0000 (10:58 +0300)]
contrib : update roles (#16113)
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions
Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
Georgi Gerganov [Mon, 22 Sep 2025 07:16:05 +0000 (10:16 +0300)]
ci : remove vulkaninfo calls (#16169)
Georgi Gerganov [Mon, 22 Sep 2025 06:11:39 +0000 (09:11 +0300)]
ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b
* ci : use smaller model for some tests
Jeff Bolz [Mon, 22 Sep 2025 05:37:17 +0000 (00:37 -0500)]
vulkan: add RTE variants of exp shader (#16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
Georgi Gerganov [Mon, 22 Sep 2025 05:31:40 +0000 (08:31 +0300)]
ci : adjust params for less runtime (#16167)
* ci : adjust params for less runtime
* ci : gate BF16 on some hardware
* ci : move extra tests to Arm runner
Ruben Ortlam [Mon, 22 Sep 2025 05:22:43 +0000 (07:22 +0200)]
vulkan: vec dot matrix multiplication fix (#16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching
* add odd m/n + odd k test with batching
lhez [Sun, 21 Sep 2025 23:42:10 +0000 (16:42 -0700)]
opencl: fix concat crash on win arm64 with Adreno (#15944)
lhez [Sun, 21 Sep 2025 21:48:44 +0000 (14:48 -0700)]
opencl: initial `q8_0` mv support (#15732)
Georgi Gerganov [Sun, 21 Sep 2025 16:00:27 +0000 (19:00 +0300)]
ci : add label for the RISC-V runner (#16150)
Georgi Gerganov [Sun, 21 Sep 2025 13:50:45 +0000 (16:50 +0300)]
ci : migrate ggml ci to self-hosted runners (#16116)
* ci : migrate ggml ci to self-hosted runners
* ci : add T4 runner
* ci : add instructions for adding self-hosted runners
* ci : disable test-backend-ops from debug builds due to slowness
* ci : add AMD V710 runner (vulkan)
* cont : add ROCM workflow
* ci : switch to qwen3 0.6b model
* cont : fix the context size
Giuseppe Scrivano [Sun, 21 Sep 2025 06:31:55 +0000 (08:31 +0200)]
vulkan: optimize UMA buffer operations and fix driver hangs (#16059)
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <redacted>
Jeff Bolz [Sun, 21 Sep 2025 06:23:37 +0000 (01:23 -0500)]
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086)
Georgi Gerganov [Sat, 20 Sep 2025 09:55:47 +0000 (12:55 +0300)]
sync : ggml
Daniel Bevenius [Tue, 16 Sep 2025 04:16:52 +0000 (06:16 +0200)]
ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This will trigger the
github release workflow which is triggered for new tags which will
then publish a new release on github.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
Refs: https://github.com/ggml-org/ggml/issues/1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
Gregor Jasny [Wed, 10 Sep 2025 15:21:11 +0000 (17:21 +0200)]
CUDA : conditionally add cuda architectures (ggml/1341)
Ruben Ortlam [Sat, 20 Sep 2025 08:42:56 +0000 (10:42 +0200)]
vulkan: use vec dot for matrix matrix multiplications (#16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
Benni [Sat, 20 Sep 2025 05:56:30 +0000 (07:56 +0200)]
server: fix SSE and OpenAI compatibility for error messages when streaming (#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming
* server: remove obsolete event parameter and use required data fieldname instead
ssweens [Fri, 19 Sep 2025 22:15:21 +0000 (15:15 -0700)]
llama-bench: add --devices and --list-devices support (#16039)
* llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience
* fix: field display ordering restored
* fix: integrated the rpc devices
- aimed to mimic the server as much as possible
* cleanup: defaults for list-devices
- handle dup device listing with RPC
* cleanup: remove dup device load calls
* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there
* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
shun095 [Fri, 19 Sep 2025 15:57:30 +0000 (00:57 +0900)]
chat: Fix streaming parser for granite models (#15682)
* fix(chat): fix streaming parser for granite models
* tests: add test cases for Granite models chat parser
Aleksander Grygier [Fri, 19 Sep 2025 07:52:27 +0000 (09:52 +0200)]
feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog
* chore: update webui build output
* fix: Linting errors
* chore: update webui build output
Xuan-Son Nguyen [Fri, 19 Sep 2025 06:02:51 +0000 (13:02 +0700)]
chat : fix build on arm64 (#16101)
Xuan-Son Nguyen [Fri, 19 Sep 2025 04:31:56 +0000 (11:31 +0700)]
ggml : refactor forward_dup for cpu backend (#16062)
* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test
Adrien Gallouët [Thu, 18 Sep 2025 21:07:26 +0000 (23:07 +0200)]
ggml-amx : fix ggml_amx_init() on generic Linux (#16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <redacted>
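A rough sketch of the guard described above (not the actual `ggml_amx_init()` code; the arch_prctl constants are written out as an assumption rather than pulled from the kernel headers): the Linux-only AMX permission request is compiled under `__linux__`, which covers glibc and musl alike, and every other platform conservatively reports no AMX support.
```cpp
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

bool amx_init() {
#if defined(__linux__)
    // Ask the kernel for permission to use AMX tile data via arch_prctl(2)
    // (x86_64 only). Constant values assumed from the Linux uapi headers.
    const unsigned long ARCH_REQ_XCOMP_PERM = 0x1023;
    const unsigned long XFEATURE_XTILEDATA  = 18;
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
#else
    return false; // unknown/untested OS: do not claim AMX support
#endif
}
```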
Adrien Gallouët [Thu, 18 Sep 2025 21:07:18 +0000 (23:07 +0200)]
cmake : fix static linking for OpenMP on Unix-like systems (#16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:
$ cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DGGML_CCACHE=OFF \
-DGGML_NATIVE=OFF \
-DGGML_STATIC=ON
$ ldd llama-server
linux-vdso.so.1 (0x0000e1a434e3b000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
/lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Signed-off-by: Adrien Gallouët <redacted>
Shawn Gu [Thu, 18 Sep 2025 19:03:34 +0000 (12:03 -0700)]
opencl: optimize mxfp4 kernels (#16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations
---------
Co-authored-by: Li He <redacted>
Jeff Bolz [Thu, 18 Sep 2025 18:46:17 +0000 (13:46 -0500)]
rename optimize_graph to graph_optimize (#16082)
Bowen Han [Thu, 18 Sep 2025 18:26:03 +0000 (11:26 -0700)]
CUDA: Optimize PAD_REFLECT_1D (#15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <redacted>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <redacted>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <redacted>
Johannes Gäßler [Thu, 18 Sep 2025 17:28:32 +0000 (19:28 +0200)]
CUDA: fix compilation on CC 6.0 (#16091)
Eric Curtin [Thu, 18 Sep 2025 15:22:50 +0000 (16:22 +0100)]
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads
Signed-off-by: Eric Curtin <redacted>
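A simplified sketch of the resume flow described above (assuming libcurl; this is not the actual `common_download_file_single` implementation and the helper name is made up): the size of an existing `.downloadInProgress` file becomes the start of a `bytes=<start>-` Range request, and the partial file is opened in append mode.
```cpp
#include <curl/curl.h>
#include <cstdio>
#include <string>
#include <sys/stat.h>

// Hypothetical helper illustrating the Range-based resume described above.
static bool download_resumable(const std::string & url, const std::string & path) {
    const std::string part = path + ".downloadInProgress";

    struct stat st{};
    const long long offset = (stat(part.c_str(), &st) == 0) ? (long long) st.st_size : 0;

    FILE * f = std::fopen(part.c_str(), offset > 0 ? "ab" : "wb"); // append when resuming
    if (!f) {
        return false;
    }

    CURL * curl = curl_easy_init();
    if (!curl) {
        std::fclose(f);
        return false;
    }
    curl_easy_setopt(curl, CURLOPT_URL,            url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA,      f); // default write callback appends via fwrite
    std::string range;
    if (offset > 0) {
        // only meaningful if the server advertised "Accept-Ranges: bytes"
        range = std::to_string(offset) + "-";
        curl_easy_setopt(curl, CURLOPT_RANGE, range.c_str());
    }
    const CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    std::fclose(f);

    return res == CURLE_OK && std::rename(part.c_str(), path.c_str()) == 0;
}
```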
Georgi Gerganov [Thu, 18 Sep 2025 13:28:41 +0000 (16:28 +0300)]
metal : use function constants for mul_mv_ext kernels (#16074)
* metal : use function constants for mul_mv_ext kernels
ggml-ci
* metal : remove NW template argument
ggml-ci
* metal : adjust constants
ggml-ci
Sigbjørn Skjæret [Thu, 18 Sep 2025 11:28:22 +0000 (13:28 +0200)]
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (#16060)
Radoslav Gerganov [Thu, 18 Sep 2025 10:36:57 +0000 (13:36 +0300)]
server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them
When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true}} is set in the request when
deciding whether we should send usage statistics.
closes: #16048
* add unit test
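A minimal sketch of the gating described above, assuming nlohmann::json and the OpenAI-style field names quoted in the commit (the helper name is illustrative, not the actual server code):
```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Return true only when the client explicitly asked for usage statistics:
// {"stream_options": {"include_usage": true}}
static bool want_usage_stats(const json & request) {
    if (!request.contains("stream_options")) {
        return false;
    }
    return request.at("stream_options").value("include_usage", false);
}
```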
Georgi Gerganov [Thu, 18 Sep 2025 09:47:56 +0000 (12:47 +0300)]
llama : bump max seq limit from 64 to 256 (#15916)
ggml-ci
Georgi Gerganov [Thu, 18 Sep 2025 09:33:45 +0000 (12:33 +0300)]
metal : improve F32, F16 and BF16 mat-vec multiplication (#16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication
ggml-ci
* metal : make the NSG a function constant in mul_mv kernels
ggml-ci
Jhen-Jie Hong [Thu, 18 Sep 2025 07:06:48 +0000 (15:06 +0800)]
metal : avoid call free for non-owned buffer (#16067)
Georgi Gerganov [Thu, 18 Sep 2025 07:03:24 +0000 (10:03 +0300)]
metal : handle nil cv during pipeline creation (#16065)
ggml-ci
Chenguang Li [Thu, 18 Sep 2025 01:26:33 +0000 (09:26 +0800)]
CANN: Remove print (#16044)
Signed-off-by: noemotiovon <redacted>
Reese Levine [Wed, 17 Sep 2025 20:09:40 +0000 (13:09 -0700)]
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (#16018)
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <redacted>
Georgi Gerganov [Wed, 17 Sep 2025 17:38:12 +0000 (20:38 +0300)]
metal : refactor + optimize v2 (#15995)
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store pipelines
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanup
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
Aleksander Grygier [Wed, 17 Sep 2025 17:29:13 +0000 (19:29 +0200)]
SvelteKit-based WebUI (#14839)
Xuan-Son Nguyen [Wed, 17 Sep 2025 17:18:21 +0000 (00:18 +0700)]
convert : add Llama4ForCausalLM (#16042)
* convert : add Llama4ForCausalLM
* handle swa
* half working version
* fix use_kq_norm
* fix use_kq_norm
Johannes Gäßler [Wed, 17 Sep 2025 13:32:42 +0000 (15:32 +0200)]
CUDA: fix FA occupancy, optimize tile kernel (#15982)
David Ribeiro Alves [Wed, 17 Sep 2025 08:08:02 +0000 (01:08 -0700)]
common : Fix corrupted memory error on json grammar initialization (#16038)
Initializing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads,
as can be seen in the ASan trace below. This fixes the initialization
to make it thread-safe.
#0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
...
==45482==Register values:
x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738
x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001
x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001
x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
#0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
...
==45482==ABORTING
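One common shape for this kind of fix, sketched here only to illustrate the thread-safety point (the names are stand-ins, not the actual json-schema-to-grammar code): a function-local static is initialized exactly once even under concurrent callers (guaranteed since C++11), unlike a shared container that is filled lazily on first use.
```cpp
#include <string>
#include <unordered_set>

static bool is_reserved_name(const std::string & name) {
    // Initialization of a function-local static is thread-safe in C++11+,
    // so concurrent first calls cannot corrupt the set.
    static const std::unordered_set<std::string> RESERVED_NAMES = {
        "root", "space", "char", // illustrative entries only
    };
    return RESERVED_NAMES.count(name) > 0;
}
```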
Eve [Wed, 17 Sep 2025 07:35:37 +0000 (07:35 +0000)]
vulkan: automatically remove unsupported devices (#15976)
* remove unsupported vulkan devices
* make this happen during selection instead
* pass by reference
Daniel Bevenius [Wed, 17 Sep 2025 07:34:09 +0000 (09:34 +0200)]
ci : revert back to macos-13 for macOS-latest-cmake-x64 (#16040)
This commit reverts the change of the runs-on parameter for the
macOS-latest-cmake-x64 job back to macos-13 that was made in
commit 51abc96bdc52ba8cd6ad78dcf12ed9a041d7b442 ("ci : update
macos-latest* jobs to use macos-latest (#15938)").
The motivation for this is that using macos-latest will cause an ARM
based runner to be used, and not an x64 based runner.
Refs: https://github.com/ggml-org/llama.cpp/pull/15938#issuecomment-3300805127
Jie Fu (傅杰) [Wed, 17 Sep 2025 07:30:55 +0000 (15:30 +0800)]
llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <redacted>
Jie Fu (傅杰) [Wed, 17 Sep 2025 07:29:00 +0000 (15:29 +0800)]
examples : support encoder-decoder models in the simple example (#16002)
Signed-off-by: Jie Fu <redacted>
Shane A [Wed, 17 Sep 2025 07:01:58 +0000 (00:01 -0700)]
model : add OLMo3 support (#16015)
* Add HF to gguf conversion logic for Olmo3
* Add Olmo3 implementation
* Update rope comment
* Fix indentation
Co-authored-by: Sigbjørn Skjæret <redacted>
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Chenguang Li [Wed, 17 Sep 2025 06:33:08 +0000 (14:33 +0800)]
CANN: Optimize ggml_cann_set_device (#15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
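The redundancy check can be sketched generically like this (hypothetical `backend_set_device`, standing in for the real CANN `aclrtSetDevice` call): the last device id set on the calling thread is cached, and the expensive switch is skipped when it already matches.
```cpp
#include <cstdint>

// Stand-in for the real, comparatively expensive runtime call (hypothetical).
static void backend_set_device(int32_t /*id*/) { /* e.g. the actual device switch */ }

static void set_device_cached(int32_t id) {
    static thread_local int32_t current = -1; // -1: nothing set on this thread yet
    if (current == id) {
        return; // already on the requested device: skip the redundant switch
    }
    backend_set_device(id);
    current = id;
}
```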
jacekpoplawski [Tue, 16 Sep 2025 14:17:08 +0000 (16:17 +0200)]
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support
Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
Daniel Bevenius [Tue, 16 Sep 2025 13:27:52 +0000 (15:27 +0200)]
ci : use macos-latest for arm64 webgpu build (#16029)
This commit updates the runs-on field for the macOS arm64 webgpu build
job to use macos-latest instead of just latest.
The motivation for this is that this job can wait for a runner to pick
up the job for a very long time, sometimes over 7 hours. This is an
attempt to see if this change can help reduce the wait time.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17754163447/job/50454257570?pr=16004
Daniel Bevenius [Tue, 16 Sep 2025 13:25:57 +0000 (15:25 +0200)]
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that the extra dimension was
unnecessary for odd dimensions and caused issues in kernels that were
not expecting it, resulting in uninitialized memory for the
second-to-last dimension (a rough sketch of the corrected behaviour
follows this list).
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
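A rough CPU-side sketch of the behaviour described in this entry (not the ggml kernels themselves; `max_period` and the cos/sin split follow the usual timestep-embedding convention and are assumptions here): the first half of the output holds cosines, the second half sines, and for an odd dimension the final element is explicitly left at zero instead of spilling into an extra padding dimension.
```cpp
#include <cmath>
#include <vector>

static std::vector<float> timestep_embedding(float t, int dim, float max_period = 10000.0f) {
    std::vector<float> out(dim, 0.0f);
    const int half = dim / 2;
    for (int j = 0; j < half; ++j) {
        const float freq = std::exp(-std::log(max_period) * j / half);
        out[j]        = std::cos(t * freq);
        out[j + half] = std::sin(t * freq);
    }
    // For odd dim, out[dim - 1] stays 0.0f: the zero padding the kernels now apply.
    return out;
}
```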
Daniel Bevenius [Tue, 16 Sep 2025 11:41:38 +0000 (13:41 +0200)]
ci : upload xcframework artifact from ios-xcode-build job (#16010)
This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.
The motivation for this changes is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
Bowen Han [Tue, 16 Sep 2025 06:59:19 +0000 (23:59 -0700)]
fix: apply clang-format to CUDA macros (#16017)
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:
template<int D, int ncols, int nwarps, int VKQ_stride,
typename KQ_acc_t, bool use_logit_softcap>
__launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)
This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
Daniel Bevenius [Tue, 16 Sep 2025 03:57:16 +0000 (05:57 +0200)]
ci : update macos-latest* jobs to use macos-latest (#15938)
* ci : update macos-latest* jobs to use macos-latest
This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead of explicit versions.
The motivation for this is that there is currently a mixture of
versions in this workflow and there are jobs that are failing because
they require a newer version.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759
* ci : add xcodebuild -downloadPlatform iOS command
Yuri Khrustalev [Tue, 16 Sep 2025 02:54:44 +0000 (22:54 -0400)]
cmake : Do not install tools on iOS targets (#15903)
Aman Gupta [Tue, 16 Sep 2025 02:38:28 +0000 (10:38 +0800)]
Add LLaDA-7b-MoE diffusion model (#16003)
Jake Karnes [Mon, 15 Sep 2025 22:28:31 +0000 (16:28 -0600)]
CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956)
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
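The stride fix can be illustrated with a small stand-alone sketch (the struct below mimics ggml's `ne[]`/`nb[]` convention but is not the real `ggml_tensor`, and it uses byte strides rather than the element strides mentioned above): addressing goes through the true per-dimension strides, so non-contiguous views resolve to the right elements instead of assuming a compact layout.
```cpp
#include <cstddef>
#include <cstdint>

struct tensor_view {
    int64_t ne[4]; // dimension sizes
    size_t  nb[4]; // strides in bytes per dimension
    void  * data;
};

// Address element (i0, i1, i2, i3) via the true strides; with a compact
// layout this matches the old products-of-dims indexing, but it also stays
// correct for views where nb[k] != ne[0]*...*ne[k-1]*sizeof(element).
static inline float * element_f32(const tensor_view & t,
                                  int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (float *) ((char *) t.data + i0*t.nb[0] + i1*t.nb[1] + i2*t.nb[2] + i3*t.nb[3]);
}
```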
Diego Devesa [Mon, 15 Sep 2025 21:38:52 +0000 (14:38 -0700)]
docker : enable rocWMMA in ROCm images, add gfx1151 (#15997)
Diego Devesa [Mon, 15 Sep 2025 21:38:42 +0000 (14:38 -0700)]
releases : switch to rocWMMA develop branch, add gfx1151 (#15992)
* releases : switch to rocWMMA develop branch, add gfx1151
* remove unused variable ROCM_VERSION
yael-works [Mon, 15 Sep 2025 16:51:35 +0000 (19:51 +0300)]
SYCL: Add COUNT_EQUAL operator support (#15991)
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
Nikolay Popov [Mon, 15 Sep 2025 10:08:30 +0000 (13:08 +0300)]
llama-run: Fix model download on Windows (#15988)
* llama-run: Fix model download on Windows
* fix SSL error (SSL peer certificate or SSH remote key was not OK)
* fix program crash on std::filesystem::rename
* llama-run: create a separate method to utilize RAII
* llama-run: handle rename exception
Aman Gupta [Mon, 15 Sep 2025 09:35:11 +0000 (17:35 +0800)]
CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (#15926)
ddh0 [Mon, 15 Sep 2025 07:54:57 +0000 (02:54 -0500)]
fix KLD percentile output (#15999)
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
Sigbjørn Skjæret [Sun, 14 Sep 2025 21:00:59 +0000 (23:00 +0200)]
model : add grok-2 support (#15539)
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
Sigbjørn Skjæret [Sun, 14 Sep 2025 19:17:04 +0000 (21:17 +0200)]
server : only attempt to enable thinking if using jinja (#15967)
Georgi Gerganov [Sun, 14 Sep 2025 19:02:32 +0000 (22:02 +0300)]
metal : remove memory pools (#15966)
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
Adam [Sun, 14 Sep 2025 18:43:54 +0000 (04:43 +1000)]
rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series (#15994)
* rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series
https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os
states the Radeon RX 9000 series is supported from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4.
This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.
Ruben Ortlam [Sun, 14 Sep 2025 14:56:28 +0000 (16:56 +0200)]
Vulkan: Clean up mul_mm shader (#15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions
* improve mul_mm vector load code
* fix debug mode issues and warnings
lcy [Sun, 14 Sep 2025 14:20:35 +0000 (22:20 +0800)]
build: fix the build failures of Windows HIP release job (#15984)
* build: fix the cache keys for Windows HIP release job
Update the cache keys to include the HIP SDK version, preventing the
use of outdated ROCm installation caches.
* build: sync changes from release.yml to build.yml
- Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2
- Update the cache keys to reflect the new versions
* build: remove Windows HIP release for gfx1151
since the current stable rocWMMA does not support gfx1151.
Georgi Gerganov [Sun, 14 Sep 2025 12:33:22 +0000 (15:33 +0300)]
metal : fix kernel requirements (#15983)
* metal : fix kernel requirements
ggml-ci
* cont : fix supports_op
* cont : fix supports_op for ARGMAX
Radoslav Gerganov [Sun, 14 Sep 2025 09:28:18 +0000 (12:28 +0300)]
rpc : fix regression when --device is used (#15981)
Fix regression introduced with commit 50f4281a6
Diego Devesa [Sun, 14 Sep 2025 09:21:59 +0000 (02:21 -0700)]
releases : update ROCM, add gfx1200, gfx1201, gfx1151 (#15972)
* releases : update ROCM, add gfx1200, gfx1201, gfx1151
* releases : set target to 13.3 for macos-x64
* add hipblaslt.dll to release
* add hipblaslt/library to release
Radoslav Gerganov [Sun, 14 Sep 2025 09:10:07 +0000 (12:10 +0300)]
doc : update documentation for --tensor-split (#15980)
* doc : update documentation for --tensor-split
* Update tools/main/README.md
Co-authored-by: Johannes Gäßler <redacted>
* Update tools/main/README.md
Co-authored-by: Diego Devesa <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: Diego Devesa <redacted>