git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Uilian Ries [Wed, 24 Sep 2025 06:53:47 +0000 (08:53 +0200)]
common : add missing chrono header for common.cpp (#16211)
Signed-off-by: Uilian Ries <redacted>
Sigbjørn Skjæret [Wed, 24 Sep 2025 06:53:20 +0000 (08:53 +0200)]
codeowners : match all requirements files (#16214)
Jie Fu (傅杰) [Wed, 24 Sep 2025 06:46:52 +0000 (14:46 +0800)]
model-conversion : run-org-model.py fails to run on mac m1 (#16213)
Signed-off-by: Jie Fu <redacted>
Daniel Bevenius [Wed, 24 Sep 2025 06:10:09 +0000 (08:10 +0200)]
codeowners : use slash prefix for root files [no ci] (#16210)
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.
The motivation for this is that, without the leading slash, these
patterns could also match same-named files in subdirectories and
override their other/additional owners.
Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
Jie Fu (傅杰) [Wed, 24 Sep 2025 04:19:23 +0000 (12:19 +0800)]
model-conversion : fix the make targets in the README.md (#16209)
Fix two incorrect make targets in the readme.
Signed-off-by: Jie Fu <redacted>
Georgi Gerganov [Tue, 23 Sep 2025 17:41:40 +0000 (20:41 +0300)]
ci : disable AMD workflows + update NVIDIA workflows (#16200)
* ci : disable AMD workflows + update NVIDIA workflows
* cont : fixes
* cont : update nvidia vulkan workflows
Georgi Gerganov [Tue, 23 Sep 2025 10:44:25 +0000 (13:44 +0300)]
ci : enable Vulkan workflow on Mac (#16194)
Xiangyan Sun [Tue, 23 Sep 2025 08:58:12 +0000 (01:58 -0700)]
ggml-cpu: Respect cpumask settings (#16164)
Sigbjørn Skjæret [Tue, 23 Sep 2025 08:25:20 +0000 (10:25 +0200)]
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl
* change initialization to true
Aaron Teo [Tue, 23 Sep 2025 06:53:05 +0000 (14:53 +0800)]
zdnn: refactor codebase + add docs (#16178)
* zdnn: initial matmul refactor
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rm static from funcs
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update ggml-zdnn.h
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: change header files to hpp
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: switch to common.hpp
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: move mulmat forward around
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rm inline from utils
Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: code cleanup
Signed-off-by: Aaron Teo <redacted>
* docs: add zDNN docs
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Daniel Bevenius [Tue, 23 Sep 2025 06:13:22 +0000 (08:13 +0200)]
codeowners : add @danbev to model-conversion example [no ci] (#16190)
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
Aaron Teo [Tue, 23 Sep 2025 05:59:34 +0000 (13:59 +0800)]
devops: add s390x containers (#15915)
* devops: add s390x dockerfile
Signed-off-by: Aaron Teo <redacted>
* devops: add missing ninja
Signed-off-by: Aaron Teo <redacted>
* devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <redacted>
* devops: rework s390x docker
Signed-off-by: Aaron Teo <redacted>
* devops: copy more tools
Signed-off-by: Aaron Teo <redacted>
* devops: add server build step
Signed-off-by: Aaron Teo <redacted>
* devops: remove apt clean steps as distroless misses it
Signed-off-by: Aaron Teo <redacted>
* devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <redacted>
* devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <redacted>
* devops: use correct libs path
Signed-off-by: Aaron Teo <redacted>
* devops: fix shared libs
Signed-off-by: Aaron Teo <redacted>
* devops: add collector stage
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing stage ref
Signed-off-by: Aaron Teo <redacted>
* devops: fix permission issue
Signed-off-by: Aaron Teo <redacted>
* devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing ggml shared object
failure to load model
Signed-off-by: Aaron Teo <redacted>
* devops: remove move shared objects
Signed-off-by: Aaron Teo <redacted>
* devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <redacted>
* devops: finalise hardened server stage
Signed-off-by: Aaron Teo <redacted>
* devops: add cli target
Signed-off-by: Aaron Teo <redacted>
* devops: fix typos
Signed-off-by: Aaron Teo <redacted>
* devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <redacted>
* devops: update debian target
Signed-off-by: Aaron Teo <redacted>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <redacted>
* Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af8466a15f318ff209e02ac3c4e551cc18.
Signed-off-by: Aaron Teo <redacted>
* devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 0a7664af8466a15f318ff209e02ac3c4e551cc18)
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <redacted>
* devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <redacted>
* devops: fix copying process
Signed-off-by: Aaron Teo <redacted>
* devops: make build dir an argument
Signed-off-by: Aaron Teo <redacted>
* Revert "devops: make build dir an argument"
This reverts commit 438698976b8a5181c1e8179600527cfd5a50cc23.
Signed-off-by: Aaron Teo <redacted>
* devops: add build stage for gguf-py
Signed-off-by: Aaron Teo <redacted>
* devops: move gguf-py installation into build stage
Signed-off-by: Aaron Teo <redacted>
* devops: break system packages?
Signed-off-by: Aaron Teo <redacted>
* devops: add rust compiler installer
Signed-off-by: Aaron Teo <redacted>
* devops: fix rustc not found
Signed-off-by: Aaron Teo <redacted>
* devops: remove cache mount to allow rustc to persist
Signed-off-by: Aaron Teo <redacted>
* devops: move rustc installation to another layer
Signed-off-by: Aaron Teo <redacted>
* devops: move gguf-py installation to full stage, fix copying
Signed-off-by: Aaron Teo <redacted>
* devops: remove rustc installation in build
Signed-off-by: Aaron Teo <redacted>
* devops: disable full target for now
Signed-off-by: Aaron Teo <redacted>
* devops: attempting static build
Signed-off-by: Aaron Teo <redacted>
* devops: merge s390x dockerfile into cpu for now
Signed-off-by: Aaron Teo <redacted>
* devops: switch to gcc image for build step
Signed-off-by: Aaron Teo <redacted>
* devops: remove build essentials
Signed-off-by: Aaron Teo <redacted>
* devops: install openblas into base target
Signed-off-by: Aaron Teo <redacted>
* devops: go back to s390x dockerfile
Signed-off-by: Aaron Teo <redacted>
* devops: remove libggml and libblas
Signed-off-by: Aaron Teo <redacted>
* devops: add full target
Signed-off-by: Aaron Teo <redacted>
* devops: add break system packages
Signed-off-by: Aaron Teo <redacted>
* devops: add libjpeg
Signed-off-by: Aaron Teo <redacted>
* devops: add missing cmake dep
Signed-off-by: Aaron Teo <redacted>
* devops: finalise docker images for s390x
Signed-off-by: Aaron Teo <redacted>
* devops: add custom openblas patch
Signed-off-by: Aaron Teo <redacted>
* devops: use libopenblas-dev instead of libopenblas-openmp-dev
Signed-off-by: Aaron Teo <redacted>
* devops: add s390x docker build
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Daniel Bevenius [Tue, 23 Sep 2025 03:59:03 +0000 (05:59 +0200)]
ggml-cpu : fix typo in gemm comments [no ci] (#16189)
Gabe Goodhart [Mon, 22 Sep 2025 18:40:10 +0000 (12:40 -0600)]
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177)
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (i.e. Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the C++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.
Branch: gabe-l-hart/GraniteNonHybridConversion
Signed-off-by: Gabe Goodhart <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Haiyue Wang [Mon, 22 Sep 2025 17:57:46 +0000 (01:57 +0800)]
clang-tidy : disable warning about performance enum size (#16127)
Disable 'performance-enum-size' checking:
Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
Sigbjørn Skjæret [Mon, 22 Sep 2025 17:13:00 +0000 (19:13 +0200)]
ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <redacted>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jeff Bolz <redacted>
Georgi Gerganov [Mon, 22 Sep 2025 15:20:21 +0000 (18:20 +0300)]
codeowners : update + cleanup (#16174)
---------
Co-authored-by: slaren <redacted>
Adrien Gallouët [Mon, 22 Sep 2025 12:13:51 +0000 (14:13 +0200)]
common : enable `--offline` mode without curl support (#16137)
* common : use the json parser
Signed-off-by: Adrien Gallouët <redacted>
* common : enable --offline mode without CURL support
This change refactors the download logic to properly support offline mode
even when the project is built without CURL.
Without this commit, using `--offline` would give the following error:
error: built without CURL, cannot download model from the internet
even if all the files are already cached.
Signed-off-by: Adrien Gallouët <redacted>
---------
Signed-off-by: Adrien Gallouët <redacted>
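A minimal sketch of the intended flow (illustrative only, not the actual common/ download code; the helper name `resolve_model` is made up): with `--offline`, a cached file is used without touching the network, so the "built without CURL" error should only trigger when a download is actually required.
```cpp
#include <filesystem>
#include <stdexcept>
#include <string>

// Hypothetical helper: decide whether a cached model can be used.
static std::string resolve_model(const std::string & cache_path, bool offline, bool have_curl) {
    if (std::filesystem::exists(cache_path)) {
        return cache_path; // already cached: works offline and without curl
    }
    if (offline) {
        throw std::runtime_error("model not cached and --offline is set");
    }
    if (!have_curl) {
        throw std::runtime_error("built without CURL, cannot download model from the internet");
    }
    // ... perform the download here ...
    return cache_path;
}
```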
Quentin Bramas [Mon, 22 Sep 2025 08:53:13 +0000 (10:53 +0200)]
webui : fix handling incomplete chunks (#16107)
GideonSerf [Mon, 22 Sep 2025 08:49:58 +0000 (10:49 +0200)]
embedding : fix typos in README (#16171)
Haiyue Wang [Mon, 22 Sep 2025 08:48:42 +0000 (16:48 +0800)]
common : remove unused local variables (#16140)
These two local variables 'arg' and 'arg_prefix' have been overridden by:
1. for (const auto & arg : opt.args)
2. for (int i = 1; i < argc; i++) {
const std::string arg_prefix = "--";
std::string arg = argv[i];
Georgi Gerganov [Mon, 22 Sep 2025 08:12:37 +0000 (11:12 +0300)]
ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph
* cont : fix wrong bounds check condition
* cont : remove unnecessary overload
Georgi Gerganov [Mon, 22 Sep 2025 08:12:09 +0000 (11:12 +0300)]
ggml : add ggml_op_is_empty (#16122)
* ggml : add ggml_op_is_empty
* ggml : move to ggml-impl.h
Xuan-Son Nguyen [Mon, 22 Sep 2025 08:10:58 +0000 (15:10 +0700)]
codeowners : update ownership for @ngxson and @allozuar (#16128)
Shin-myoung-serp [Mon, 22 Sep 2025 08:04:01 +0000 (17:04 +0900)]
Vulkan: add conv_transpose_2d operation (#16022)
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
Sigbjørn Skjæret [Mon, 22 Sep 2025 07:59:05 +0000 (09:59 +0200)]
codeowners : claim responsibility for ci, models, gguf-py and convert (#16124)
* claim responsibility for ci, gguf-py and convert
* add myself to various src/llama- files
Georgi Gerganov [Mon, 22 Sep 2025 07:58:02 +0000 (10:58 +0300)]
contrib : update roles (#16113)
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions
Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
Georgi Gerganov [Mon, 22 Sep 2025 07:16:05 +0000 (10:16 +0300)]
ci : remove vulkaninfo calls (#16169)
Georgi Gerganov [Mon, 22 Sep 2025 06:11:39 +0000 (09:11 +0300)]
ci : use smaller model (#16168)
* ci : switch from gemma to qwen3 0.6b
* ci : use smaller model for some tests
Jeff Bolz [Mon, 22 Sep 2025 05:37:17 +0000 (00:37 -0500)]
vulkan: add RTE variants of exp shader (#16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
Georgi Gerganov [Mon, 22 Sep 2025 05:31:40 +0000 (08:31 +0300)]
ci : adjust params for less runtime (#16167)
* ci : adjust params for less runtime
* ci : gate BF16 on some hardware
* ci : move extra tests to Arm runner
Ruben Ortlam [Mon, 22 Sep 2025 05:22:43 +0000 (07:22 +0200)]
vulkan: vec dot matrix multiplication fix (#16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching
* add odd m/n + odd k test with batching
lhez [Sun, 21 Sep 2025 23:42:10 +0000 (16:42 -0700)]
opencl: fix concat crash on win arm64 with Adreno (#15944)
lhez [Sun, 21 Sep 2025 21:48:44 +0000 (14:48 -0700)]
opencl: initial `q8_0` mv support (#15732)
Georgi Gerganov [Sun, 21 Sep 2025 16:00:27 +0000 (19:00 +0300)]
ci : add label for the RISC-V runner (#16150)
Georgi Gerganov [Sun, 21 Sep 2025 13:50:45 +0000 (16:50 +0300)]
ci : migrate ggml ci to self-hosted runners (#16116)
* ci : migrate ggml ci to self-hosted runners
* ci : add T4 runner
* ci : add instructions for adding self-hosted runners
* ci : disable test-backend-ops from debug builds due to slowness
* ci : add AMD V710 runner (vulkan)
* cont : add ROCM workflow
* ci : switch to qwen3 0.6b model
* cont : fix the context size
Giuseppe Scrivano [Sun, 21 Sep 2025 06:31:55 +0000 (08:31 +0200)]
vulkan: optimize UMA buffer operations and fix driver hangs (#16059)
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <redacted>
Jeff Bolz [Sun, 21 Sep 2025 06:23:37 +0000 (01:23 -0500)]
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086)
Georgi Gerganov [Sat, 20 Sep 2025 09:55:47 +0000 (12:55 +0300)]
sync : ggml
Daniel Bevenius [Tue, 16 Sep 2025 04:16:52 +0000 (06:16 +0200)]
ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This will trigger the
github release workflow which is triggered for new tags which will
then publish a new release on github.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
Refs: https://github.com/ggml-org/ggml/issues/1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
Gregor Jasny [Wed, 10 Sep 2025 15:21:11 +0000 (17:21 +0200)]
CUDA : conditionally add cuda architectures (ggml/1341)
Ruben Ortlam [Sat, 20 Sep 2025 08:42:56 +0000 (10:42 +0200)]
vulkan: use vec dot for matrix matrix multiplications (#16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
Benni [Sat, 20 Sep 2025 05:56:30 +0000 (07:56 +0200)]
server: fix SSE and OpenAI compatibility for error messages when streaming (#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming
* server: remove obsolete event parameter and use required data fieldname instead
ssweens [Fri, 19 Sep 2025 22:15:21 +0000 (15:15 -0700)]
llama-bench: add --devices and --list-devices support (#16039)
* llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience
* fix: field display ordering restored
* fix: integrated the rpc devices
- aimed to mimic the server as much as possible
* cleanup: defaults for list-devices
- handle dup device listing with RPC
* cleanup: remove dup device load calls
* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there
* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
shun095 [Fri, 19 Sep 2025 15:57:30 +0000 (00:57 +0900)]
chat: Fix streaming parser for granite models (#15682)
* fix(chat): fix streaming parser for granite models
* tests: add test cases for Granite models chat parser
Aleksander Grygier [Fri, 19 Sep 2025 07:52:27 +0000 (09:52 +0200)]
feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog
* chore: update webui build output
* fix: Linting errors
* chore: update webui build output
Xuan-Son Nguyen [Fri, 19 Sep 2025 06:02:51 +0000 (13:02 +0700)]
chat : fix build on arm64 (#16101)
Xuan-Son Nguyen [Fri, 19 Sep 2025 04:31:56 +0000 (11:31 +0700)]
ggml : refactor forward_dup for cpu backend (#16062)
* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test
Adrien Gallouët [Thu, 18 Sep 2025 21:07:26 +0000 (23:07 +0200)]
ggml-amx : fix ggml_amx_init() on generic Linux (#16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <redacted>
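A rough sketch of the guard described above (not the actual `ggml_amx_init()` code; the arch_prctl constants are written out as an assumption rather than pulled from the kernel headers): the Linux-only AMX permission request is compiled under `__linux__`, which covers glibc and musl alike, and every other platform conservatively reports no AMX support.
```cpp
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

bool amx_init() {
#if defined(__linux__)
    // Ask the kernel for permission to use AMX tile data via arch_prctl(2)
    // (x86_64 only). Constant values assumed from the Linux uapi headers.
    const unsigned long ARCH_REQ_XCOMP_PERM = 0x1023;
    const unsigned long XFEATURE_XTILEDATA  = 18;
    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
#else
    return false; // unknown/untested OS: do not claim AMX support
#endif
}
```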
Adrien Gallouët [Thu, 18 Sep 2025 21:07:18 +0000 (23:07 +0200)]
cmake : fix static linking for OpenMP on Unix-like systems (#16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:
$ cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DGGML_CCACHE=OFF \
-DGGML_NATIVE=OFF \
-DGGML_STATIC=ON
$ ldd llama-server
linux-vdso.so.1 (0x0000e1a434e3b000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
/lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Signed-off-by: Adrien Gallouët <redacted>
Shawn Gu [Thu, 18 Sep 2025 19:03:34 +0000 (12:03 -0700)]
opencl: optimize mxfp4 kernels (#16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations
---------
Co-authored-by: Li He <redacted>
Jeff Bolz [Thu, 18 Sep 2025 18:46:17 +0000 (13:46 -0500)]
rename optimize_graph to graph_optimize (#16082)
Bowen Han [Thu, 18 Sep 2025 18:26:03 +0000 (11:26 -0700)]
CUDA: Optimize PAD_REFLECT_1D (#15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <redacted>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <redacted>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <redacted>
Johannes Gäßler [Thu, 18 Sep 2025 17:28:32 +0000 (19:28 +0200)]
CUDA: fix compilation on CC 6.0 (#16091)
Eric Curtin [Thu, 18 Sep 2025 15:22:50 +0000 (16:22 +0100)]
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads
Signed-off-by: Eric Curtin <redacted>
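A simplified sketch of the resume flow described above (assuming libcurl; this is not the actual `common_download_file_single` implementation and the helper name is made up): the size of an existing `.downloadInProgress` file becomes the start of a `bytes=<start>-` Range request, and the partial file is opened in append mode.
```cpp
#include <curl/curl.h>
#include <cstdio>
#include <string>
#include <sys/stat.h>

// Hypothetical helper illustrating the Range-based resume described above.
static bool download_resumable(const std::string & url, const std::string & path) {
    const std::string part = path + ".downloadInProgress";

    struct stat st{};
    const long long offset = (stat(part.c_str(), &st) == 0) ? (long long) st.st_size : 0;

    FILE * f = std::fopen(part.c_str(), offset > 0 ? "ab" : "wb"); // append when resuming
    if (!f) {
        return false;
    }

    CURL * curl = curl_easy_init();
    if (!curl) {
        std::fclose(f);
        return false;
    }
    curl_easy_setopt(curl, CURLOPT_URL,            url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA,      f); // default write callback appends via fwrite
    std::string range;
    if (offset > 0) {
        // only meaningful if the server advertised "Accept-Ranges: bytes"
        range = std::to_string(offset) + "-";
        curl_easy_setopt(curl, CURLOPT_RANGE, range.c_str());
    }
    const CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    std::fclose(f);

    return res == CURLE_OK && std::rename(part.c_str(), path.c_str()) == 0;
}
```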
Georgi Gerganov [Thu, 18 Sep 2025 13:28:41 +0000 (16:28 +0300)]
metal : use function constants for mul_mv_ext kernels (#16074)
* metal : use function constants for mul_mv_ext kernels
ggml-ci
* metal : remove NW template argument
ggml-ci
* metal : adjust constants
ggml-ci
Sigbjørn Skjæret [Thu, 18 Sep 2025 11:28:22 +0000 (13:28 +0200)]
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (#16060)
Radoslav Gerganov [Thu, 18 Sep 2025 10:36:57 +0000 (13:36 +0300)]
server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them
When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true}} is set in the request when
deciding whether we should send usage statistics.
closes: #16048
* add unit test
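A minimal sketch of the gating described above, assuming nlohmann::json and the OpenAI-style field names quoted in the commit (the helper name is illustrative, not the actual server code):
```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Return true only when the client explicitly asked for usage statistics:
// {"stream_options": {"include_usage": true}}
static bool want_usage_stats(const json & request) {
    if (!request.contains("stream_options")) {
        return false;
    }
    return request.at("stream_options").value("include_usage", false);
}
```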
Georgi Gerganov [Thu, 18 Sep 2025 09:47:56 +0000 (12:47 +0300)]
llama : bump max seq limit from 64 to 256 (#15916)
ggml-ci
Georgi Gerganov [Thu, 18 Sep 2025 09:33:45 +0000 (12:33 +0300)]
metal : improve F32, F16 and BF16 mat-vec multiplication (#16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication
ggml-ci
* metal : make the NSG a function constant in mul_mv kernels
ggml-ci
Jhen-Jie Hong [Thu, 18 Sep 2025 07:06:48 +0000 (15:06 +0800)]
metal : avoid call free for non-owned buffer (#16067)
Georgi Gerganov [Thu, 18 Sep 2025 07:03:24 +0000 (10:03 +0300)]
metal : handle nil cv during pipeline creation (#16065)
ggml-ci
Chenguang Li [Thu, 18 Sep 2025 01:26:33 +0000 (09:26 +0800)]
CANN: Remove print (#16044)
Signed-off-by: noemotiovon <redacted>
Reese Levine [Wed, 17 Sep 2025 20:09:40 +0000 (13:09 -0700)]
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (#16018)
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <redacted>
Georgi Gerganov [Wed, 17 Sep 2025 17:38:12 +0000 (20:38 +0300)]
metal : refactor + optimize v2 (#15995)
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store pipelines
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanup
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
Aleksander Grygier [Wed, 17 Sep 2025 17:29:13 +0000 (19:29 +0200)]
SvelteKit-based WebUI (#14839)
Xuan-Son Nguyen [Wed, 17 Sep 2025 17:18:21 +0000 (00:18 +0700)]
convert : add Llama4ForCausalLM (#16042)
* convert : add Llama4ForCausalLM
* handle swa
* half working version
* fix use_kq_norm
* fix use_kq_norm
Johannes Gäßler [Wed, 17 Sep 2025 13:32:42 +0000 (15:32 +0200)]
CUDA: fix FA occupancy, optimize tile kernel (#15982)
David Ribeiro Alves [Wed, 17 Sep 2025 08:08:02 +0000 (01:08 -0700)]
common : Fix corrupted memory error on json grammar initialization (#16038)
Initializing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads,
as can be seen in the ASan trace below. This fixes the initialization
to make it thread-safe.
#0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
...
==45482==Register values:
x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738
x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001
x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001
x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
#0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
...
==45482==ABORTING
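One common shape for this kind of fix, sketched here only to illustrate the thread-safety point (the names are stand-ins, not the actual json-schema-to-grammar code): a function-local static is initialized exactly once even under concurrent callers (guaranteed since C++11), unlike a shared container that is filled lazily on first use.
```cpp
#include <string>
#include <unordered_set>

static bool is_reserved_name(const std::string & name) {
    // Initialization of a function-local static is thread-safe in C++11+,
    // so concurrent first calls cannot corrupt the set.
    static const std::unordered_set<std::string> RESERVED_NAMES = {
        "root", "space", "char", // illustrative entries only
    };
    return RESERVED_NAMES.count(name) > 0;
}
```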
Eve [Wed, 17 Sep 2025 07:35:37 +0000 (07:35 +0000)]
vulkan: automatically remove unsupported devices (#15976)
* remove unsupported vulkan devices
* make this happen during selection instead
* pass by reference
Daniel Bevenius [Wed, 17 Sep 2025 07:34:09 +0000 (09:34 +0200)]
ci : revert back to macos-13 for macOS-latest-cmake-x64 (#16040)
This commit reverts the change of the runs-on parameter for the
macOS-latest-cmake-x64 job back to macos-13 that was made in
commit 51abc96bdc52ba8cd6ad78dcf12ed9a041d7b442 ("ci : update
macos-latest* jobs to use macos-latest (#15938)").
The motivation for this is that using macos-latest will cause an ARM
based runner to be used, and not an x64 based runner.
Refs: https://github.com/ggml-org/llama.cpp/pull/15938#issuecomment-3300805127
Jie Fu (傅杰) [Wed, 17 Sep 2025 07:30:55 +0000 (15:30 +0800)]
llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <redacted>
Jie Fu (傅杰) [Wed, 17 Sep 2025 07:29:00 +0000 (15:29 +0800)]
examples : support encoder-decoder models in the simple example (#16002)
Signed-off-by: Jie Fu <redacted>
Shane A [Wed, 17 Sep 2025 07:01:58 +0000 (00:01 -0700)]
model : add OLMo3 support (#16015)
* Add HF to gguf conversion logic for Olmo3
* Add Olmo3 implementation
* Update rope comment
* Fix indentation
Co-authored-by: Sigbjørn Skjæret <redacted>
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Chenguang Li [Wed, 17 Sep 2025 06:33:08 +0000 (14:33 +0800)]
CANN: Optimize ggml_cann_set_device (#15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
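The redundancy check can be sketched generically like this (hypothetical `backend_set_device`, standing in for the real CANN `aclrtSetDevice` call): the last device id set on the calling thread is cached, and the expensive switch is skipped when it already matches.
```cpp
#include <cstdint>

// Stand-in for the real, comparatively expensive runtime call (hypothetical).
static void backend_set_device(int32_t /*id*/) { /* e.g. the actual device switch */ }

static void set_device_cached(int32_t id) {
    static thread_local int32_t current = -1; // -1: nothing set on this thread yet
    if (current == id) {
        return; // already on the requested device: skip the redundant switch
    }
    backend_set_device(id);
    current = id;
}
```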
jacekpoplawski [Tue, 16 Sep 2025 14:17:08 +0000 (16:17 +0200)]
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support
Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
Daniel Bevenius [Tue, 16 Sep 2025 13:27:52 +0000 (15:27 +0200)]
ci : use macos-latest for arm64 webgpu build (#16029)
This commit updates the runs-on field for the macOS arm64 webgpu build
job to use macos-latest instead of just latest.
The motivation for this is that this job can wait for a runner to pick
up the job for a very long time, sometimes over 7 hours. This is an
attempt to see if this change can help reduce the wait time.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17754163447/job/50454257570?pr=16004
Daniel Bevenius [Tue, 16 Sep 2025 13:25:57 +0000 (15:25 +0200)]
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that the extra dimension was
unnecessary for odd dimensions and caused issues in kernels that were
not expecting it, resulting in uninitialized memory for the
second-to-last dimension (a rough sketch of the corrected behaviour
follows this list).
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
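A rough CPU-side sketch of the behaviour described in this entry (not the ggml kernels themselves; `max_period` and the cos/sin split follow the usual timestep-embedding convention and are assumptions here): the first half of the output holds cosines, the second half sines, and for an odd dimension the final element is explicitly left at zero instead of spilling into an extra padding dimension.
```cpp
#include <cmath>
#include <vector>

static std::vector<float> timestep_embedding(float t, int dim, float max_period = 10000.0f) {
    std::vector<float> out(dim, 0.0f);
    const int half = dim / 2;
    for (int j = 0; j < half; ++j) {
        const float freq = std::exp(-std::log(max_period) * j / half);
        out[j]        = std::cos(t * freq);
        out[j + half] = std::sin(t * freq);
    }
    // For odd dim, out[dim - 1] stays 0.0f: the zero padding the kernels now apply.
    return out;
}
```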
Daniel Bevenius [Tue, 16 Sep 2025 11:41:38 +0000 (13:41 +0200)]
ci : upload xcframework artifact from ios-xcode-build job (#16010)
This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.
The motivation for this changes is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
Bowen Han [Tue, 16 Sep 2025 06:59:19 +0000 (23:59 -0700)]
fix: apply clang-format to CUDA macros (#16017)
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:
template<int D, int ncols, int nwarps, int VKQ_stride,
typename KQ_acc_t, bool use_logit_softcap>
__launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)
This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
Daniel Bevenius [Tue, 16 Sep 2025 03:57:16 +0000 (05:57 +0200)]
ci : update macos-latest* jobs to use macos-latest (#15938)
* ci : update macos-latest* jobs to use macos-latest
This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead of explicit versions.
The motivation for this is that there is currently a mixture of
versions in this workflow and there are jobs that are failing because
they require a newer version.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759
* ci : add xcodebuild -downloadPlatform iOS command
Yuri Khrustalev [Tue, 16 Sep 2025 02:54:44 +0000 (22:54 -0400)]
cmake : Do not install tools on iOS targets (#15903)
Aman Gupta [Tue, 16 Sep 2025 02:38:28 +0000 (10:38 +0800)]
Add LLaDA-7b-MoE diffusion model (#16003)
Jake Karnes [Mon, 15 Sep 2025 22:28:31 +0000 (16:28 -0600)]
CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956)
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
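The stride fix can be illustrated with a small stand-alone sketch (the struct below mimics ggml's `ne[]`/`nb[]` convention but is not the real `ggml_tensor`, and it uses byte strides rather than the element strides mentioned above): addressing goes through the true per-dimension strides, so non-contiguous views resolve to the right elements instead of assuming a compact layout.
```cpp
#include <cstddef>
#include <cstdint>

struct tensor_view {
    int64_t ne[4]; // dimension sizes
    size_t  nb[4]; // strides in bytes per dimension
    void  * data;
};

// Address element (i0, i1, i2, i3) via the true strides; with a compact
// layout this matches the old products-of-dims indexing, but it also stays
// correct for views where nb[k] != ne[0]*...*ne[k-1]*sizeof(element).
static inline float * element_f32(const tensor_view & t,
                                  int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (float *) ((char *) t.data + i0*t.nb[0] + i1*t.nb[1] + i2*t.nb[2] + i3*t.nb[3]);
}
```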
Diego Devesa [Mon, 15 Sep 2025 21:38:52 +0000 (14:38 -0700)]
docker : enable rocWMMA in ROCm images, add gfx1151 (#15997)
Diego Devesa [Mon, 15 Sep 2025 21:38:42 +0000 (14:38 -0700)]
releases : switch to rocWMMA develop branch, add gfx1151 (#15992)
* releases : switch to rocWMMA develop branch, add gfx1151
* remove unused variable ROCM_VERSION
yael-works [Mon, 15 Sep 2025 16:51:35 +0000 (19:51 +0300)]
SYCL: Add COUNT_EQUAL operator support (#15991)
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
Nikolay Popov [Mon, 15 Sep 2025 10:08:30 +0000 (13:08 +0300)]
llama-run: Fix model download on Windows (#15988)
* llama-run: Fix model download on Windows
* fix SSL error (SSL peer certificate or SSH remote key was not OK)
* fix program crash on std::filesystem::rename
* llama-run: create a separate method to utilize RAII
* llama-run: handle rename exception
Aman Gupta [Mon, 15 Sep 2025 09:35:11 +0000 (17:35 +0800)]
CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (#15926)
ddh0 [Mon, 15 Sep 2025 07:54:57 +0000 (02:54 -0500)]
fix KLD percentile output (#15999)
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
Sigbjørn Skjæret [Sun, 14 Sep 2025 21:00:59 +0000 (23:00 +0200)]
model : add grok-2 support (#15539)
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
Sigbjørn Skjæret [Sun, 14 Sep 2025 19:17:04 +0000 (21:17 +0200)]
server : only attempt to enable thinking if using jinja (#15967)
Georgi Gerganov [Sun, 14 Sep 2025 19:02:32 +0000 (22:02 +0300)]
metal : remove memory pools (#15966)
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
Adam [Sun, 14 Sep 2025 18:43:54 +0000 (04:43 +1000)]
rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series (#15994)
* rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series
https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os
states the Radeon RX 9000 series is supported from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4.
This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.
Ruben Ortlam [Sun, 14 Sep 2025 14:56:28 +0000 (16:56 +0200)]
Vulkan: Clean up mul_mm shader (#15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions
* improve mul_mm vector load code
* fix debug mode issues and warnings
lcy [Sun, 14 Sep 2025 14:20:35 +0000 (22:20 +0800)]
build: fix the build failures of Windows HIP release job (#15984)
* build: fix the cache keys for Windows HIP release job
Update the cache keys to include the HIP SDK version, preventing the
use of outdated ROCm installation caches.
* build: sync changes from release.yml to build.yml
- Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2
- Update the cache keys to reflect the new versions
* build: remove Windows HIP release for gfx1151
since the current stable rocWMMA does not support gfx1151.
Georgi Gerganov [Sun, 14 Sep 2025 12:33:22 +0000 (15:33 +0300)]
metal : fix kernel requirements (#15983)
* metal : fix kernel requirements
ggml-ci
* cont : fix supports_op
* cont : fix supports_op for ARGMAX
Radoslav Gerganov [Sun, 14 Sep 2025 09:28:18 +0000 (12:28 +0300)]
rpc : fix regression when --device is used (#15981)
Fix regression introduced with commit 50f4281a6
Diego Devesa [Sun, 14 Sep 2025 09:21:59 +0000 (02:21 -0700)]
releases : update ROCM, add gfx1200, gfx1201, gfx1151 (#15972)
* releases : update ROCM, add gfx1200, gfx1201, gfx1151
* releases : set target to 13.3 for macos-x64
* add hipblaslt.dll to release
* add hipblaslt/library to release
Radoslav Gerganov [Sun, 14 Sep 2025 09:10:07 +0000 (12:10 +0300)]
doc : update documentation for --tensor-split (#15980)
* doc : update documentation for --tensor-split
* Update tools/main/README.md
Co-authored-by: Johannes Gäßler <redacted>
* Update tools/main/README.md
Co-authored-by: Diego Devesa <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: Diego Devesa <redacted>