git.djapps.eu Git - pkg/ggml/sources/ggml/log

commit | commitdiff | tree

Mathieu Baudier [Mon, 16 Feb 2026 11:27:27 +0000 (12:27 +0100)]

Build CUDA only on amd64

commit | commitdiff | tree

Mathieu Baudier [Mon, 16 Feb 2026 11:21:08 +0000 (12:21 +0100)]

Fix arm64 build

commit | commitdiff | tree

Mathieu Baudier [Mon, 16 Feb 2026 05:45:22 +0000 (06:45 +0100)]

Better align with official Debian packages

commit | commitdiff | tree

Mathieu Baudier [Mon, 16 Feb 2026 05:15:33 +0000 (06:15 +0100)]

Upstream release

commit | commitdiff | tree

Mathieu Baudier [Mon, 16 Feb 2026 05:13:41 +0000 (06:13 +0100)]

Merge tag 'upstream/0.9.7' into debian/latest

Upstream release

commit | commitdiff | tree

Georgi Gerganov [Sun, 15 Feb 2026 20:21:04 +0000 (22:21 +0200)]

ggml : bump version to 0.9.7 (#1425)

commit | commitdiff | tree

Georgi Gerganov [Sun, 15 Feb 2026 20:19:16 +0000 (22:19 +0200)]

sync : whisper.cpp

commit | commitdiff | tree

Georgi Gerganov [Sat, 14 Feb 2026 15:54:06 +0000 (17:54 +0200)]

sync : llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Sat, 14 Feb 2026 10:57:36 +0000 (12:57 +0200)]

models : optimize qwen3next graph (llama/19375)

* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs

commit | commitdiff | tree

Adrien Gallouët [Sat, 14 Feb 2026 10:22:57 +0000 (11:22 +0100)]

ggml : fix GGML_DEBUG with OpenMP (llama/19599)

last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 14 Feb 2026 07:54:03 +0000 (09:54 +0200)]

metal : fix ACC op (llama/19427)

commit | commitdiff | tree

Jeff Bolz [Sat, 14 Feb 2026 05:42:04 +0000 (21:42 -0800)]

vulkan: support L2_NORM with contiguous rows (llama/19604)

commit | commitdiff | tree

Jeff Bolz [Sat, 14 Feb 2026 05:36:38 +0000 (21:36 -0800)]

vulkan: support GGML_OP_SET (llama/19584)

commit | commitdiff | tree

Sophon [Sat, 14 Feb 2026 05:29:17 +0000 (13:29 +0800)]

vulkan: Add vendor id for Qualcomm drivers (llama/19569)

This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.

commit | commitdiff | tree

Max Krasnyansky [Sat, 14 Feb 2026 00:27:30 +0000 (16:27 -0800)]

hexagon: further optimizations and refactoring for flash attention (llama/19583)

* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <redacted>

commit | commitdiff | tree

Jeff Bolz [Fri, 13 Feb 2026 19:35:29 +0000 (11:35 -0800)]

vulkan: restore -inf check in FA shaders (llama/19582)

commit | commitdiff | tree

Alberto Cabrera Pérez [Fri, 13 Feb 2026 12:32:14 +0000 (12:32 +0000)]

Fix wrong memcpy length for block_interleave == 4 (llama/19575)

commit | commitdiff | tree

ymcki [Fri, 13 Feb 2026 12:31:37 +0000 (20:31 +0800)]

fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)

* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Aman Gupta [Fri, 13 Feb 2026 11:31:40 +0000 (17:01 +0530)]

CUDA: loop over ne2*ne3 in case it overflows (llama/19538)

* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
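
The two bullets above can be illustrated with a small host-side sketch. It assumes that "fastdiv" refers to the usual multiply-and-shift replacement for integer division by a divisor known before the kernel launch, and that the flattened `ne2*ne3` index has to be split back into its two components inside the loop. The names `FastDiv`, `fastdiv_init`, `fastdiv`, and `unflatten` are hypothetical and are not taken from the ggml sources.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; a sketch of multiply-and-shift division,
// not the code from the commit.
struct FastDiv {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

// Precompute the constants for a divisor d >= 1 (done once, outside the hot loop).
inline FastDiv fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t(1) << l) < d) {
        ++l;
    }
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << l) - d)) / d + 1;
    return { uint32_t(mp), l };
}

// Compute n / d using one multiplication, one addition and one shift.
inline uint32_t fastdiv(uint32_t n, FastDiv fd) {
    const uint64_t hi = (uint64_t(n) * fd.mp) >> 32; // high 32 bits of the product
    return uint32_t((hi + n) >> fd.l);               // 64-bit sum avoids overflow
}

// Split a flattened index over ne2*ne3 back into (i2, i3) without a hardware divide.
inline void unflatten(uint32_t idx, uint32_t ne2, FastDiv ne2_fd, uint32_t & i2, uint32_t & i3) {
    i3 = fastdiv(idx, ne2_fd);
    i2 = idx - i3 * ne2;
}
```

In a kernel, each block would step through the flattened index in a loop and call something like `unflatten` per iteration, so the launch no longer needs one grid slot per `(i2, i3)` pair.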

commit | commitdiff | tree

Oliver Simons [Fri, 13 Feb 2026 09:37:55 +0000 (10:37 +0100)]

CUDA: Do not mutate cgraph for fused ADDs (llama/19566)

* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <redacted>

---------

Co-authored-by: Aman Gupta <redacted>

commit | commitdiff | tree

Georgi Gerganov [Fri, 13 Feb 2026 05:35:57 +0000 (07:35 +0200)]

metal : improve concurrency (llama/19555)

commit | commitdiff | tree

Georgi Gerganov [Fri, 13 Feb 2026 05:34:52 +0000 (07:34 +0200)]

metal : support GGML_OP_SET (llama/19548)

commit | commitdiff | tree

Shupei Fan [Thu, 12 Feb 2026 23:07:49 +0000 (07:07 +0800)]

hexagon: fix typo in vtcm_needs_release (llama/19545)

commit | commitdiff | tree

lhez [Thu, 12 Feb 2026 22:52:37 +0000 (14:52 -0800)]

opencl: add basic support for q4_1 (llama/19534)

* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm

commit | commitdiff | tree

Georgi Gerganov [Thu, 12 Feb 2026 09:35:28 +0000 (11:35 +0200)]

metal : update sum_rows kernel to support float4 (llama/19524)

commit | commitdiff | tree

Mario Limonciello [Thu, 12 Feb 2026 08:38:35 +0000 (02:38 -0600)]

Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)

There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
2549 | using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398
Link: https://github.com/ggml-org/llama.cpp/issues/19269
Signed-off-by: Mario Limonciello <redacted>

commit | commitdiff | tree

Max Krasnyansky [Thu, 12 Feb 2026 07:04:27 +0000 (23:04 -0800)]

hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)

* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <redacted>

commit | commitdiff | tree

lhez [Wed, 11 Feb 2026 18:33:13 +0000 (10:33 -0800)]

opencl: add general Q6_K mm and Q4_K mv (llama/19347)

* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace

commit | commitdiff | tree

Georgi Gerganov [Wed, 11 Feb 2026 16:58:43 +0000 (18:58 +0200)]

ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)

* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU

commit | commitdiff | tree

Georgi Gerganov [Wed, 11 Feb 2026 12:53:19 +0000 (14:53 +0200)]

metal : extend l2_norm support for non-cont src0 (llama/19502)

commit | commitdiff | tree

Max Krasnyansky [Wed, 11 Feb 2026 07:21:12 +0000 (23:21 -0800)]

hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)

* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <redacted>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <redacted>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <redacted>

commit | commitdiff | tree

Georgi Gerganov [Wed, 11 Feb 2026 05:52:00 +0000 (07:52 +0200)]

ggml : extend bin bcast for permuted src1 (llama/19484)

* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify

commit | commitdiff | tree

Georgi Gerganov [Wed, 11 Feb 2026 05:51:12 +0000 (07:51 +0200)]

metal : consolidate unary ops (llama/19490)

commit | commitdiff | tree

Oliver Simons [Tue, 10 Feb 2026 21:31:19 +0000 (22:31 +0100)]

CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)

CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from RC to final
released version.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0

commit | commitdiff | tree

Nikhil Jain [Tue, 10 Feb 2026 16:04:00 +0000 (08:04 -0800)]

Plug memory leaks and free resources on shutdown (llama/19315)

* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused unused

* Update binary ops

---------

Co-authored-by: Reese Levine <redacted>

commit | commitdiff | tree

Xuan-Son Nguyen [Tue, 10 Feb 2026 13:37:50 +0000 (14:37 +0100)]

test: fix IMROPE perf test case (llama/19465)

commit | commitdiff | tree

Alberto Cabrera Pérez [Tue, 10 Feb 2026 10:47:45 +0000 (10:47 +0000)]

ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)

* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention

commit | commitdiff | tree

k4ss4n [Tue, 10 Feb 2026 09:57:48 +0000 (10:57 +0100)]

ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)

Using the noexcept overload of std::filesystem::directory_entry::is_regular_file
prevents abnormal termination when an error would otherwise be thrown
(as caused by symlinks to non-existent folders on Linux).

Resolves: #18560
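
A minimal sketch of the pattern this message describes, assuming a directory scan that should skip problematic entries instead of throwing. The helper `list_regular_files` is hypothetical and is not the backend registration code; only the `std::error_code` overloads of the standard library calls are relied on.

```cpp
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper (not ggml's code): collect the regular files in `dir`
// using only the non-throwing std::error_code overloads.
inline std::vector<fs::path> list_regular_files(const fs::path & dir) {
    std::vector<fs::path> out;
    std::error_code ec;
    fs::directory_iterator it(dir, fs::directory_options::skip_permission_denied, ec);
    if (ec) {
        return out; // directory missing or unreadable: report nothing, do not throw
    }
    for (const fs::directory_iterator end; it != end; it.increment(ec)) {
        if (ec) {
            break;
        }
        std::error_code entry_ec;
        // noexcept overload: a failing status query yields false plus an error code
        if (it->is_regular_file(entry_ec) && !entry_ec) {
            out.push_back(it->path());
        }
    }
    return out;
}
```

With the throwing overload, an entry whose status cannot be determined raises `std::filesystem::filesystem_error`, and an uncaught exception ends the process; here such an entry is simply skipped.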

commit | commitdiff | tree

Raul Torres [Tue, 10 Feb 2026 06:19:30 +0000 (06:19 +0000)]

CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (llama/18968)

commit | commitdiff | tree

hipudding [Tue, 10 Feb 2026 06:18:59 +0000 (14:18 +0800)]

CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)

Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).

commit | commitdiff | tree

Georgi Gerganov [Tue, 10 Feb 2026 06:07:16 +0000 (08:07 +0200)]

cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)

* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad

commit | commitdiff | tree

Oliver Simons [Sun, 8 Feb 2026 13:12:51 +0000 (14:12 +0100)]

CUDA: Fix non-contig rope (llama/19338)

* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 08:36:51 +0000 (10:36 +0200)]

sync : llama.cpp

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 08:35:56 +0000 (10:35 +0200)]

metal : consolidate bin kernels (llama/19390)

* metal : refactor bin kernels

* cont

* cont : fix cv

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 05:37:15 +0000 (07:37 +0200)]

metal : fix event synchronization in cpy_tensor_async (llama/19402)

commit | commitdiff | tree

Abhijit Ramesh [Fri, 6 Feb 2026 18:33:30 +0000 (10:33 -0800)]

ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)

* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
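
The three remaining cases named above (INPLACE, OVERLAP, DEFAULT) can be sketched as a small selection function. This is only an illustration under the assumption that aliasing is detected by comparing operand addresses; the actual backend presumably compares buffer bindings and ranges. `BinaryVariant` and `select_binary_variant` are hypothetical names.

```cpp
// Hypothetical names; illustrates the three variants kept by the commit.
enum class BinaryVariant {
    Default, // dst aliases neither input
    Inplace, // src0 == dst
    Overlap, // src1 == dst
};

// Assumption: aliasing is detected by pointer equality. If both inputs alias dst
// (a case the commit says never occurs in practice), Inplace is returned.
inline BinaryVariant select_binary_variant(const void * src0, const void * src1, const void * dst) {
    if (src0 == dst) {
        return BinaryVariant::Inplace;
    }
    if (src1 == dst) {
        return BinaryVariant::Overlap;
    }
    return BinaryVariant::Default;
}
```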

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 05:38:05 +0000 (07:38 +0200)]

sync : llama.cpp

commit | commitdiff | tree

Nechama Krashinski [Fri, 6 Feb 2026 15:13:44 +0000 (17:13 +0200)]

sycl: add F16 support for GGML_OP_CEIL (llama/19306)

* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL

commit | commitdiff | tree

Jeff Bolz [Fri, 6 Feb 2026 14:50:30 +0000 (08:50 -0600)]

tests: reduce number of FA test permutations (llama/19381)

Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).

commit | commitdiff | tree

Jeff Bolz [Fri, 6 Feb 2026 08:15:13 +0000 (02:15 -0600)]

vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)

The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.

commit | commitdiff | tree

Jeff Bolz [Fri, 6 Feb 2026 07:49:58 +0000 (01:49 -0600)]

vulkan: make FA mask/softcap enables spec constants (llama/19309)

* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit

commit | commitdiff | tree

Georgi Gerganov [Fri, 6 Feb 2026 07:25:11 +0000 (09:25 +0200)]

metal : skip loading all-zero mask (llama/19337)

* metal : skip loading all-zero mask

* cont : minor

commit | commitdiff | tree

Georgi Gerganov [Fri, 6 Feb 2026 05:55:06 +0000 (07:55 +0200)]

cuda : cuda graphs now compare all node params (llama/19383)

commit | commitdiff | tree

Georgi Gerganov [Thu, 5 Feb 2026 17:07:22 +0000 (19:07 +0200)]

metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369)

commit | commitdiff | tree

Jeff Bolz [Thu, 5 Feb 2026 15:26:38 +0000 (09:26 -0600)]

vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)

Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
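
A CPU-side sketch of the idea, assuming the mask is split into fixed-size blocks and that each block receives one 2-bit code. The concrete code values, the block shape, and the names `MaskCode`, `classify_block`, `pack_mask_codes`, and `get_mask_code` are assumptions made for illustration; the commit does not specify them, and the real preprocessing runs in a shader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical encoding; the commit does not say which 2-bit values are used.
enum MaskCode : uint8_t {
    MASK_MIXED       = 0, // block must be loaded and applied as usual
    MASK_ALL_NEG_INF = 1, // every entry is -inf
    MASK_ALL_ZERO    = 2, // every entry is 0
};

// Classify one block of n mask values.
inline MaskCode classify_block(const float * m, size_t n) {
    assert(n > 0);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    bool all_neg_inf = true;
    bool all_zero    = true;
    for (size_t i = 0; i < n; ++i) {
        all_neg_inf = all_neg_inf && (m[i] == neg_inf);
        all_zero    = all_zero    && (m[i] == 0.0f);
    }
    return all_neg_inf ? MASK_ALL_NEG_INF : all_zero ? MASK_ALL_ZERO : MASK_MIXED;
}

// Write one 2-bit code per block, four codes per byte.
inline std::vector<uint8_t> pack_mask_codes(const std::vector<float> & mask, size_t block) {
    assert(block > 0 && mask.size() % block == 0);
    const size_t n_blocks = mask.size() / block;
    std::vector<uint8_t> packed((n_blocks + 3) / 4, 0);
    for (size_t b = 0; b < n_blocks; ++b) {
        const MaskCode code = classify_block(mask.data() + b * block, block);
        packed[b / 4] |= uint8_t(code << (2 * (b % 4)));
    }
    return packed;
}

// Read back the code of block b; a consumer would skip the mask load unless it is MASK_MIXED.
inline MaskCode get_mask_code(const std::vector<uint8_t> & packed, size_t b) {
    return MaskCode((packed[b / 4] >> (2 * (b % 4))) & 0x3);
}
```

The extra pass only pays off when the mask is large, which matches the note above that the optimization is applied during prompt processing.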

commit | commitdiff | tree

Georgi Gerganov [Thu, 5 Feb 2026 08:08:45 +0000 (10:08 +0200)]

metal : add diag (llama/19330)

commit | commitdiff | tree

Oleksandr Kuvshynov [Thu, 5 Feb 2026 08:06:59 +0000 (03:06 -0500)]

vulkan: fix GPU deduplication logic. (llama/19222)

* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.

Verified that MacOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are moltenVk

commit | commitdiff | tree

Jeff Bolz [Thu, 5 Feb 2026 07:48:33 +0000 (01:48 -0600)]

vulkan: Set k_load_shmem to false when K is too large (llama/19301)

commit | commitdiff | tree

Jeff Bolz [Thu, 5 Feb 2026 07:38:59 +0000 (01:38 -0600)]

vulkan: fix non-contig rope (llama/19299)

commit | commitdiff | tree

will-lms [Thu, 5 Feb 2026 06:05:09 +0000 (01:05 -0500)]

metal : add missing includes (llama/19348)

commit | commitdiff | tree

Georgi Gerganov [Wed, 4 Feb 2026 10:45:21 +0000 (12:45 +0200)]

tests : add non-cont, inplace rope tests (llama/19296)

* tests : add non-cont, inplace rope tests

* cont : exercise dim 3

Co-authored-by: Jeff Bolz <redacted>

* cont : more dim3 exercises

---------

Co-authored-by: Jeff Bolz <redacted>

commit | commitdiff | tree

Kevin Pouget [Wed, 4 Feb 2026 02:46:18 +0000 (03:46 +0100)]

ggml-virtgpu: make the code thread safe (llama/19204)

* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes

commit | commitdiff | tree

Aman Gupta [Wed, 4 Feb 2026 01:43:29 +0000 (09:43 +0800)]

ggml-cpu: use LUT for converting e8->f32 scales on x86 (llama/19288)

* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
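
A scalar sketch of a lookup table for this conversion, assuming that an e8 (E8M0) scale byte `e` encodes the power of two `2^(e - 127)`. The function names are hypothetical, and the commit's x86 implementation and its macro-based dispatch are not reproduced here. Note that in this bit-level formulation the byte 255 maps to +inf, whereas a format specification may define it as NaN.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Assumption: scale byte e encodes 2^(e - 127). For e >= 1 this is an f32 whose
// exponent field is e and whose mantissa is 0; e == 0 needs the subnormal 2^-127.
inline float e8m0_to_f32(uint8_t e) {
    const uint32_t bits = (e == 0) ? 0x00400000u : (uint32_t(e) << 23);
    float f;
    std::memcpy(&f, &bits, sizeof(f)); // well-defined bit cast
    return f;
}

// Hypothetical 256-entry table, built once on first use.
inline const std::array<float, 256> & e8m0_lut() {
    static const std::array<float, 256> lut = [] {
        std::array<float, 256> t{};
        for (int i = 0; i < 256; ++i) {
            t[i] = e8m0_to_f32(uint8_t(i));
        }
        return t;
    }();
    return lut;
}
```

A table lookup such as `e8m0_lut()[e]` then replaces the branch and the shift by a single indexed load.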

commit | commitdiff | tree

Georgi Gerganov [Tue, 3 Feb 2026 21:43:14 +0000 (23:43 +0200)]

metal : add solve_tri (llama/19302)

commit | commitdiff | tree

Ruben Ortlam [Tue, 3 Feb 2026 16:37:32 +0000 (17:37 +0100)]

vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290)

commit | commitdiff | tree

Aman Gupta [Tue, 3 Feb 2026 15:31:23 +0000 (23:31 +0800)]

CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958)

* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

* templatize multi_token_path

commit | commitdiff | tree

Georgi Gerganov [Tue, 3 Feb 2026 11:43:29 +0000 (13:43 +0200)]

metal : minor cleanup (llama/19251)

commit | commitdiff | tree

Oliver Simons [Tue, 3 Feb 2026 10:33:14 +0000 (11:33 +0100)]

CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (llama/19053)

By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816)
on BW (Blackwell). In the measurements below this improves prefill (pp) throughput
on BW by up to about 4%, while not affecting other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |

commit | commitdiff | tree

George [Tue, 3 Feb 2026 06:43:39 +0000 (08:43 +0200)]

ggml: added cleanups in ggml_quantize_free (llama/19278)

Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.

commit | commitdiff | tree

Gaurav Garg [Tue, 3 Feb 2026 06:41:02 +0000 (12:11 +0530)]

cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227)

Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.

commit | commitdiff | tree

lhez [Mon, 2 Feb 2026 23:54:43 +0000 (15:54 -0800)]

opencl: refactor some ops, concat, repeat, tanh and scale (llama/19226)

* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables

commit | commitdiff | tree

Aman Gupta [Mon, 2 Feb 2026 17:19:55 +0000 (01:19 +0800)]

ggml-cpu: FA split across kv for faster TG (llama/19209)

* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl

commit | commitdiff | tree

Neo Zhang [Mon, 2 Feb 2026 13:06:21 +0000 (21:06 +0800)]

Remove support for Nvidia & AMD GPUs because the oneAPI plugins for Nvidia & AMD GPUs are unavailable: their download/installation channels no longer work. (llama/19246)

Users can no longer build the software for Nvidia & AMD GPUs.
Remove oneMath as well, since it is only used in the Nvidia and AMD code paths.

commit | commitdiff | tree

Tamar [Mon, 2 Feb 2026 13:05:51 +0000 (15:05 +0200)]

sycl: implement GGML_OP_TOP_K (llama/19242)

commit | commitdiff | tree

Georgi Gerganov [Mon, 2 Feb 2026 12:29:44 +0000 (14:29 +0200)]

metal : support virtual devices (llama/18919)

* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async

commit | commitdiff | tree

Johannes Gäßler [Mon, 2 Feb 2026 09:00:05 +0000 (10:00 +0100)]

ggml-backend: fix async set/get fallback sync (llama/19179)

commit | commitdiff | tree

Christian Kastner [Mon, 2 Feb 2026 06:38:55 +0000 (07:38 +0100)]

docs : Minor cleanups (llama/19252)

* Update old URLs to github.com/ggml-org/

* Bump copyrights

commit | commitdiff | tree

Nikhil Jain [Mon, 2 Feb 2026 02:47:29 +0000 (18:47 -0800)]

Remove pipeline cache mutexes (llama/19195)

* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format

commit | commitdiff | tree

Max Krasnyansky [Sun, 1 Feb 2026 22:13:38 +0000 (14:13 -0800)]

Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188)

* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists

commit | commitdiff | tree

nullname [Sat, 31 Jan 2026 05:14:20 +0000 (13:14 +0800)]

ggml-hexagon: flash-attention and reduce-sum optimizations (llama/19141)

* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <redacted>

commit | commitdiff | tree

shaofeiqi [Fri, 30 Jan 2026 18:19:27 +0000 (10:19 -0800)]

opencl: add optimized q8_0 mm kernel for adreno (llama/18871)

* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <redacted>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <redacted>
Co-authored-by: Li He <redacted>

commit | commitdiff | tree

Simon Redman [Fri, 30 Jan 2026 16:27:16 +0000 (11:27 -0500)]

Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194)

commit | commitdiff | tree

Georgi Gerganov [Fri, 30 Jan 2026 11:52:57 +0000 (13:52 +0200)]

tests : add GQA=20 FA test (llama/19095)

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 08:33:58 +0000 (10:33 +0200)]

ci : remove "Release" word from the title of the release

commit | commitdiff | tree

Georgi Gerganov [Sat, 7 Feb 2026 07:58:02 +0000 (09:58 +0200)]

ggml : bump version to 0.9.6 (#1423)

commit | commitdiff | tree

Georgi Gerganov [Fri, 30 Jan 2026 14:29:51 +0000 (16:29 +0200)]

cmake : remove unused file (#1419)

commit | commitdiff | tree

Georgi Gerganov [Fri, 30 Jan 2026 14:25:41 +0000 (16:25 +0200)]

sync : whisper.cpp

commit | commitdiff | tree

Georgi Gerganov [Fri, 30 Jan 2026 13:56:15 +0000 (15:56 +0200)]

cuda : fix compile warnings (whisper/0)

commit | commitdiff | tree

Georgi Gerganov [Fri, 30 Jan 2026 08:35:15 +0000 (10:35 +0200)]

sync : llama.cpp

commit | commitdiff | tree

bssrdf [Fri, 30 Jan 2026 04:57:52 +0000 (23:57 -0500)]

add tensor type checking as part of cuda graph properties (llama/19186)

commit | commitdiff | tree

s8322 [Fri, 30 Jan 2026 04:01:38 +0000 (06:01 +0200)]

sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114)

* sycl: add softplus unary op implementation

* sycl: add softplus unary op implementation

* docs(ops): mark SYCL SOFTPLUS as supported

* docs: update SYCL status for SOFTPLUS

commit | commitdiff | tree

RachelMantel [Fri, 30 Jan 2026 04:00:49 +0000 (06:00 +0200)]

sycl: implement GGML_OP_TRI (llama/19089)

* sycl: implement GGML_OP_TRI

* docs: update ops.md for SYCL TRI

* docs: regenerate ops.md

* docs: update SYCL support for GGML_OP_TRI

commit | commitdiff | tree

Zheyuan Chen [Thu, 29 Jan 2026 22:05:30 +0000 (14:05 -0800)]

ggml-webgpu: improve FlashAttention performance by software pipelining (llama/19151)

* webgpu : pipeline flash_attn Q/K loads in WGSL

* ggml-webgpu: unroll Q*K accumulation inner loop

* ggml-webgpu: vectorization

* ggml-webgpu: unrolling

* ggml-webgpu: remove redundant unrolling

* ggml-webgpu: restore the config

* ggml-webgpu: remove redundant comments

* ggml-webgpu: formatting

* ggml-webgpu: formatting and remove vectorization

* ggml-webgpu: remove unnecessary constants

* ggml-webgpu: change QKV buffer to read_write to pass validation

* ggml-webgpu: add explanation for the additional bracket around Q K accumulate

* Indentation and for -> if for tail

* Kick off CI on wgsl only commits

---------

Co-authored-by: Reese Levine <redacted>

commit | commitdiff | tree

Todor Boinovski [Thu, 29 Jan 2026 20:33:21 +0000 (12:33 -0800)]

hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150)

* hexagon: updates to enable offloading to HTP on WoS

* Update windows.md

* Update windows.md

* hexagon: enable -O3 optimizations

* hexagon: move all _WINDOWS conditional compilation to _WIN32

* hexagon: updates to enable offloading to HTP on WoS

* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface

* refactor htp-drv

* hexagon: add run-bench.ps1 script

* hexagon: htdrv refactor

* hexagon: unify Android and Windows build readmes

* hexagon: update README.md

* hexagon: refactor htpdrv

* hexagon: drv refactor

* hexagon: more drv refactor

* hexagon: fixes for android builds

* hexagon: factor out dl into ggml-backend-dl

* hexagon: add run-tool.ps1 script

* hexagon: merge htp-utils in htp-drv and remove unused code

* wos: no need for getopt_custom.h

* wos: add missing CR in htpdrv

* hexagon: ndev enforcement applies only to the Android devices

* hexagon: add support for generating and signing .cat file

* hexagon: add .inf file

* hexagon: working auto-signing and improved windows builds

* hexagon: further improve skel build

* hexagon: add rough WoS guide

* hexagon: updated windows guide

* hexagon: improve cmake handling of certs and logging

* hexagon: improve windows setup/build doc

* hexagon: more windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* Update windows.md

* Update windows.md

* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon

Also added a power shell script to simplify build env setup.

* hexagon: remove trailing whitespace and move cmake requirement to user-presets

* hexagon: fix CMakeUserPresets path in workflow yaml

* hexagon: introduce local version of libdl.h

* hexagon: fix src1 reuse logic

gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.

---------

Co-authored-by: Max Krasnyansky <redacted>

commit | commitdiff | tree

Georgi Gerganov [Thu, 29 Jan 2026 16:45:30 +0000 (18:45 +0200)]

cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165)

* cuda : fix nkvo

* cont : more robust cuda graph node property matching

* cont : restore pre-leafs implementation

* cont : comments + static_assert

commit | commitdiff | tree

yulo [Thu, 29 Jan 2026 10:10:53 +0000 (18:10 +0800)]

HIP: add mmf for CDNA (llama/18896)

* refactor mmf rows_per_block

* speed up compile

* pass cdna compile

* fix cuda error

* clean up mmf

* f32 mmf

* clean float mma

* fix mmf error

* faster mmf

* extend tile k

* fix compile error

* Revert "extend tile k"

This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7.

* fix smem overflow

* speed up compiling mmf

* speed up compile for hip

* 512 block for cdna

* config pad size

* fix as comment

* update select logic

* move some code to cuh

* fix as comment

* correct cdna3 config

---------

Co-authored-by: zhang hui <redacted>

commit | commitdiff | tree

Vishal Singh [Thu, 29 Jan 2026 04:28:57 +0000 (09:58 +0530)]

ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159)

commit | commitdiff | tree

Aman Gupta [Thu, 29 Jan 2026 02:31:28 +0000 (10:31 +0800)]

CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126)

commit | commitdiff | tree

Neo Zhang [Thu, 29 Jan 2026 01:20:22 +0000 (09:20 +0800)]

sycl: fix norm kernels (l2_norm, group_norm, rms_norm) by removing asserts to support more cases (llama/19154)

Co-authored-by: Neo Zhang Jianyu <redacted>

commit | commitdiff | tree

Ruben Ortlam [Wed, 28 Jan 2026 17:52:45 +0000 (18:52 +0100)]

Vulkan Flash Attention Coopmat1 Refactor (llama/19075)

* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives

Packaging of ggml-org/ggml
