Gaurav Garg [Tue, 3 Feb 2026 06:41:02 +0000 (12:11 +0530)]
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227)
Hangs were reported on Jetson AGX Orin when CUDA_SCALE_LAUNCH_QUEUES=4x is set. Revert the previous PR (#19042) and update the documentation to suggest that users consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for higher throughput on multi-GPU systems.
Neo Zhang [Mon, 2 Feb 2026 13:06:21 +0000 (21:06 +0800)]
Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for Nvidia & AMD GPUs is unavailable: the download/installation channels no longer work. (llama/19246)
Users can no longer build the software for Nvidia & AMD GPUs.
Remove oneMath, since it is only used in the NV and AMD code paths.
syclcompat/math.hpp is no longer used. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
Oleksandr Kuvshynov [Wed, 28 Jan 2026 11:35:54 +0000 (06:35 -0500)]
vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058)
Deduplication here relied on Vulkan returning a unique UUID for each
physical GPU, which at the moment is not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (4 GPUs total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683.
The right fix belongs in MoltenVK, but until that happens,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.
The deduplication logic here is changed to only filter out a GPU when its
UUID matches another device's but the driver differs.
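A minimal sketch of the adjusted rule, with illustrative field names rather than the actual llama.cpp identifiers:

```cpp
// Two enumerated devices are treated as the same physical GPU only when their
// UUIDs match AND the reporting driver differs (e.g. one card exposed by two
// Vulkan drivers). Same UUID + same driver is kept, since MoltenVK may report
// identical UUIDs for distinct GPUs.
#include <array>
#include <cstdint>

struct device_info {
    std::array<uint8_t, 16> uuid;    // VkPhysicalDeviceIDProperties::deviceUUID
    uint32_t                driver;  // VkPhysicalDeviceDriverProperties::driverID
};

static bool is_duplicate(const device_info & a, const device_info & b) {
    return a.uuid == b.uuid && a.driver != b.driver;
}
```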
The XIELU operation was incorrectly using static_cast to convert
float parameters to uint32_t, which converted numeric values instead
of preserving IEEE 754 bit patterns. This caused incorrect values
to be interpreted by the GPU shader.
* Use reinterpret_cast to preserve float bit patterns when passing
through uint32_t params buffer
* Update WGSL shader parameter types from u32 to f32
* Re-enable XIELU support (was disabled due to numerical issues)
Fixes NMSE test failures for XIELU operation on WebGPU backend.
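A standalone illustration of the difference (not the backend's actual code; the commit uses reinterpret_cast, while the memcpy/std::bit_cast shown here is the portable equivalent):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Pack a float into a u32 parameter slot without changing its bits.
// (std::bit_cast does the same in C++20; memcpy is the portable pre-C++20 form.)
static uint32_t pack_f32_bits(float v) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

int main() {
    float alpha = 0.8f;
    uint32_t wrong = static_cast<uint32_t>(alpha); // -> 0 (numeric value truncated)
    uint32_t right = pack_f32_bits(alpha);         // -> 0x3F4CCCCD (bit pattern kept)
    std::printf("static_cast: %u  bit copy: 0x%08X\n", wrong, right);
}
```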
neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet, though.
Co-authored-by: James Contini <redacted>
Co-authored-by: Neha Abbas <redacted>
Co-authored-by: Abhijit Ramesh <redacted>
* Remove extra code and format
* Add ops documentation (finally)
* ggml webgpu: add SOFTPLUS unary operator
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow (a
stable-softplus sketch follows the operator list below).
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
* ggml webgpu: add EXPM1 unary operator
Implements EXPM1 (exp(x) - 1) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add FLOOR unary operator
Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add CEIL unary operator
Implements CEIL (rounds up to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add ROUND unary operator
Implements ROUND (rounds to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add TRUNC unary operator
Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
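For reference, a minimal plain-C++ sketch of the stable SOFTPLUS evaluation mentioned above (the WGSL shader follows the same idea of doing the math in f32; the cutoff constant here is an assumption, not the shader's actual value):

```cpp
#include <cmath>

// Numerically stable softplus: for large x, exp(x) overflows f16 (and eventually
// f32), but softplus(x) ~= x there, so a threshold avoids forming exp(x) at all;
// for small x, log1p keeps precision when exp(x) is close to zero.
static float softplus_stable(float x) {
    const float threshold = 20.0f;       // assumed cutoff, illustrative only
    if (x > threshold) {
        return x;                        // log(1 + e^x) ~= x for large x
    }
    return std::log1p(std::exp(x));      // log(1 + e^x) with good precision for small e^x
}
```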
* Updates to webgpu get_memory
* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context
* Small cleanup
* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state (see the state-split sketch after this list).
* Cleanups
* More cleanup
* Move staging_buf mutex to global context
* Resolve merge
* Resolve merge
* Resolve merge
* Clean up merge errors, delete forward declaration, and run clang-format
* Rename device_init to backend_init
* Move webgpu_context to backend_context
* Move buffer context members into global context and refactor function calls
* Run clang-format
* Remove comments
* Move parameter buffers to per-thread, add single memset_tensor param buf
* Fix CI compilation issue
* Fix builds for emscripten not supporting subgroups
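A hedged sketch of the resulting state split; the struct and member names are illustrative, not the real ggml-webgpu types:

```cpp
#include <webgpu/webgpu_cpp.h>
#include <vector>
#include <mutex>

// Objects that are safe to share across backend threads live once per process;
// objects tied to command submission live per thread, avoiding cross-thread locks.
struct webgpu_global_ctx {
    wgpu::Instance instance;       // created once at backend registration
    wgpu::Adapter  adapter;
    wgpu::Device   device;
    wgpu::Limits   limits;         // capabilities queried once
    std::mutex     staging_mutex;  // staging-buffer lock lives in the global context
};

struct webgpu_thread_ctx {
    wgpu::Queue                        queue;       // per-thread submission state
    std::vector<wgpu::ComputePipeline> pipelines;   // pipelines held per thread
    std::vector<wgpu::Buffer>          param_bufs;  // per-thread parameter buffers
};
```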
Gaurav Garg [Tue, 27 Jan 2026 06:52:44 +0000 (06:52 +0000)]
Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. As a result, not enough work is submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.
* Set the env variable in the CUDA backend registry allocation
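A minimal sketch of applying such an override at backend-registration time (illustrative only; per the first entry above, the override was later reverted). The variable must be set before the CUDA runtime is first initialized:

```cpp
#include <cstdlib>

// Set CUDA_SCALE_LAUNCH_QUEUES before any CUDA runtime call so the larger
// command buffer is picked up. Do not overwrite a user-provided value.
static void ggml_cuda_set_launch_queue_scale() {
#if !defined(_WIN32)
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", /*overwrite=*/0);
#else
    // _putenv_s overwrites unconditionally, so check first on Windows.
    if (std::getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
        _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
    }
#endif
}
```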
Aman Gupta [Sun, 25 Jan 2026 15:25:58 +0000 (23:25 +0800)]
ggml-cpu: Use tiled FA for prompt-processing (llama/19012)
* ggml-cpu: Use tiled FA for prompt-processing
FA performance on CPU is poor at long contexts because it essentially uses a vector kernel. This PR adds a tiled FA path for prompt processing. Tile sizes were tuned on a single-socket 64-core AMD EPYC machine.
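A hedged, single-threaded C++ sketch of the tiled approach (tile sizes and names are illustrative, not the actual ggml-cpu kernel): each K/V tile is loaded once and reused across a whole tile of query rows, with an online softmax keeping the result exact:

```cpp
#include <cmath>
#include <vector>
#include <algorithm>

// One head, row-major Q[n_q x d], K[n_kv x d], V[n_kv x d] -> O[n_q x d].
void attn_tiled(const float * Q, const float * K, const float * V, float * O,
                int n_q, int n_kv, int d, int TQ = 8, int TK = 64) {
    const float scale = 1.0f / std::sqrt((float) d);
    for (int q0 = 0; q0 < n_q; q0 += TQ) {
        const int nq = std::min(TQ, n_q - q0);
        std::vector<float> m(nq, -INFINITY), l(nq, 0.0f), acc(nq * d, 0.0f);
        for (int k0 = 0; k0 < n_kv; k0 += TK) {
            const int nk = std::min(TK, n_kv - k0);
            // this K/V tile stays hot in cache while all nq query rows use it,
            // which is the reuse a pure vector kernel cannot get
            for (int i = 0; i < nq; ++i) {
                std::vector<float> s(nk);
                float m_new = m[i];
                for (int j = 0; j < nk; ++j) {
                    float dot = 0.0f;
                    for (int t = 0; t < d; ++t) {
                        dot += Q[(q0 + i)*d + t] * K[(k0 + j)*d + t];
                    }
                    s[j]  = dot * scale;
                    m_new = std::max(m_new, s[j]);
                }
                // online softmax: rescale the previous accumulator to the new max
                const float corr = std::exp(m[i] - m_new);
                l[i] *= corr;
                for (int t = 0; t < d; ++t) acc[i*d + t] *= corr;
                for (int j = 0; j < nk; ++j) {
                    const float p = std::exp(s[j] - m_new);
                    l[i] += p;
                    for (int t = 0; t < d; ++t) {
                        acc[i*d + t] += p * V[(k0 + j)*d + t];
                    }
                }
                m[i] = m_new;
            }
        }
        for (int i = 0; i < nq; ++i) {
            for (int t = 0; t < d; ++t) O[(q0 + i)*d + t] = acc[i*d + t] / l[i];
        }
    }
}
```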
Jeff Bolz [Wed, 21 Jan 2026 17:01:40 +0000 (11:01 -0600)]
vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
Jeff Bolz [Wed, 21 Jan 2026 15:22:02 +0000 (09:22 -0600)]
vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
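A hedged sketch of the dispatch idea, with an assumed cutoff (the real backend's threshold and entry points differ):

```cpp
#include <cstdint>
#include <functional>

// When n (columns of the activation) is small, issue the mat-vec kernel once
// per column so the cost grows linearly in n, instead of falling off a cliff
// as soon as n > 1 on the mat-mat path.
void mul_mat_id_dispatch(int64_t n,
                         const std::function<void(int64_t col)> & run_mat_vec,
                         const std::function<void()> & run_mat_mat) {
    const int64_t mmv_max_n = 8;   // assumed cutoff, tuned in the real backend
    if (n <= mmv_max_n) {
        for (int64_t col = 0; col < n; ++col) {
            run_mat_vec(col);      // per-column offsets come from get_offsets
        }
    } else {
        run_mat_mat();
    }
}
```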
Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator
Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5).
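A hedged sketch of such a guard; the GGML_CUDA_HAVE_STRIDED_ITERATOR macro is hypothetical, and the version macros come from CCCL's `<cuda/version>` header (exact condition in the real patch may differ):

```cpp
#include <cuda/version>   // provides CCCL_MAJOR_VERSION / CCCL_MINOR_VERSION

#if CCCL_MAJOR_VERSION > 3 || (CCCL_MAJOR_VERSION == 3 && CCCL_MINOR_VERSION >= 1)
#   define GGML_CUDA_HAVE_STRIDED_ITERATOR 1
#else
#   define GGML_CUDA_HAVE_STRIDED_ITERATOR 0
#endif

#if GGML_CUDA_HAVE_STRIDED_ITERATOR
#include <cuda/iterator>  // cuda::strided_iterator is available in CCCL >= 3.1
#endif
```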
hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]
CANN: support gated linear attn (llama/18653)
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased
before merging.
Oliver Simons [Thu, 15 Jan 2026 02:44:54 +0000 (03:44 +0100)]
CUDA: Factor out and re-use `block_reduce` function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when the shared
memory is not accounted for once per call to two_stage_warp_reduce). A hedged
block-reduce sketch follows this list.
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta <redacted>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
Also integrate it into norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
Also address other requests by @am17an, such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
Unit-tests passed locally, let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
This is more type-safe than plain enum
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:
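For reference, a minimal sketch of the two-stage block reduction described above (stage 1: intra-warp shuffles; stage 2: the first warp reduces per-warp partials held in shared memory supplied by the `__global__` caller). Names and exact shape differ from the real `block_reduce` in common.cuh:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

__device__ __forceinline__ float warp_reduce_sum(float x) {
    #pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xFFFFFFFF, x, offset);
    }
    return x;
}

// smem must hold at least blockDim.x / WARP_SIZE floats and is owned by the
// __global__ caller, so it can be reused explicitly between reductions.
__device__ float block_reduce_sum(float x, float * smem) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp    = threadIdx.x / WARP_SIZE;
    const int n_warps = (blockDim.x + WARP_SIZE - 1) / WARP_SIZE;

    x = warp_reduce_sum(x);             // stage 1: intra-warp reduction
    if (lane == 0) {
        smem[warp] = x;                 // one partial per warp
    }
    __syncthreads();

    x = (threadIdx.x < n_warps) ? smem[threadIdx.x] : 0.0f;
    if (warp == 0) {
        x = warp_reduce_sum(x);         // stage 2: reduce the per-warp partials
    }
    return x;                           // valid in thread 0 (broadcast if needed)
}
```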
Jeff Bolz [Mon, 12 Jan 2026 11:32:13 +0000 (05:32 -0600)]
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
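A minimal sketch of that selection; the function name is illustrative and the real check lives in the Vulkan backend's pipeline selection:

```cpp
#include <cstdint>

// Fall back to the 64-bit-indexing pipeline variant only when the A matrix
// cannot be addressed by the default 32-bit path, since 64-bit indexing costs
// roughly 3-5% on MoE models.
static bool need_64bit_indexing(uint64_t a_size_bytes, uint64_t max_storage_buffer_range) {
    // max_storage_buffer_range comes from VkPhysicalDeviceLimits::maxStorageBufferRange
    return a_size_bytes > max_storage_buffer_range;
}
```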