Jeff Bolz [Sun, 8 Mar 2026 11:33:48 +0000 (06:33 -0500)]
vulkan: Fix data races in coopmat1 mul_mat(_id) (#20084)
* vulkan: Fix data races in coopmat1 mul_mat(_id)
Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.
shalinib-ibm [Fri, 6 Mar 2026 15:22:39 +0000 (20:52 +0530)]
ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130)
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Andreas Kieslinger [Thu, 5 Mar 2026 11:53:21 +0000 (12:53 +0100)]
CUDA: Improve performance via less synchronizations between token (#17795)
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)
Co-authored-by: Georgi Gerganov <redacted>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2
Co-authored-by: Georgi Gerganov <redacted>
---------
Marcel Petrick [Thu, 5 Mar 2026 07:50:21 +0000 (08:50 +0100)]
chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review
Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.
Abhijit Ramesh [Tue, 3 Mar 2026 03:35:11 +0000 (19:35 -0800)]
ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965)
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes
WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceedeing this limi would fail.
* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions
* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
dispatch
* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
from 2D dispatch
* update mul_mat.wgsl to compute global index from 2D workgroup
coordinates
* refactor all three mul_mat dispatch paths to use the shared helper
* ggml-webgpu: add bounds checking for over-dispatched workgroups
2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.
* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory
* fixes test failures with large batch sizes (eg., bs=[128, 1024])
* ggml-webgpu: add back TODO for spliting large sizes into batches
* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch
- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
crossover on MI300X @ d32768 with power-of-2 GQA models:
hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel
* Update ggml/src/ggml-cuda/fattn.cu
* use ggml_cuda_info().devices warp_size instead of hardcoded check
Daniel Bevenius [Fri, 27 Feb 2026 13:42:24 +0000 (14:42 +0100)]
ci : use ubuntu-latest for gguf-publish workflow (#19951)
This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, which was updated in Commit 142cbe2ac68978e5dec3a2e19c1b64ef1c5740b1 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").
The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem not available in the
ubuntu-slim runner. This is currently causing an error in the workflow
and preventing the gguf-publish workflow from running successfully.
Today was the the first time since the original change (I think) that
publish task has been run which may be why the issue was not noticed
before.
Pascal [Fri, 27 Feb 2026 06:05:23 +0000 (07:05 +0100)]
server : support multiple model aliases via comma-separated --alias (#19926)
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
Jan Patrick Lehr [Fri, 27 Feb 2026 04:37:54 +0000 (05:37 +0100)]
tests : enable test-chat out of tree build (#19558)
The binary relies on model files that it tries to find. However, when
configuring the build directory to be parallel to the source tree those
heuristics fail.
This sets the working directory for the test executable to be the
source-tree which resolves this issue.
Vishal Singh [Fri, 27 Feb 2026 00:43:41 +0000 (06:13 +0530)]
ggml-zendnn: update code for latest ZenDNN API (#19923)
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN‑2026‑WW08)
- add static lib support in CMake
Kevin Pouget [Thu, 26 Feb 2026 12:00:57 +0000 (13:00 +0100)]
ggml-virtgpu: improve the reliability of the code (#19846)
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend sychronize` and `butf
get_max_size` would have segfaulted the backend if not implemented.
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limed to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes