git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
6 weeks agoUpdate upstream based on f08c4c0 - 0.0.6199
Mathieu Baudier [Tue, 19 Aug 2025 08:21:23 +0000 (10:21 +0200)]
Update upstream based on f08c4c0 - 0.0.6199

6 weeks agoMerge tag 'upstream/0.0.6199' into debian/latest
Mathieu Baudier [Tue, 19 Aug 2025 08:21:23 +0000 (10:21 +0200)]
Merge tag 'upstream/0.0.6199' into debian/latest

Pinned upstream commit

6 weeks agomtmd : clean up clip_n_output_tokens (#15391) upstream/0.0.6199
Xuan-Son Nguyen [Mon, 18 Aug 2025 20:53:52 +0000 (22:53 +0200)]
mtmd : clean up clip_n_output_tokens (#15391)

6 weeks agocodeowners : remove mmv.*
Georgi Gerganov [Mon, 18 Aug 2025 19:02:50 +0000 (22:02 +0300)]
codeowners : remove mmv.*

6 weeks agosync : ggml
Georgi Gerganov [Mon, 18 Aug 2025 19:02:11 +0000 (22:02 +0300)]
sync : ggml

6 weeks agoscripts : update sync scripts
Georgi Gerganov [Mon, 18 Aug 2025 17:35:47 +0000 (20:35 +0300)]
scripts : update sync scripts

6 weeks agollama : merge conts and reshapes and remove unnecessary cont (#15380)
Sigbjørn Skjæret [Mon, 18 Aug 2025 17:30:17 +0000 (19:30 +0200)]
llama : merge conts and reshapes and remove unnecessary cont (#15380)

* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes

6 weeks agoreadme : update hot topics (#15397)
Georgi Gerganov [Mon, 18 Aug 2025 15:11:44 +0000 (18:11 +0300)]
readme : update hot topics (#15397)

6 weeks agoserver : fix incoming tasks not processed in order (#15395)
davidef [Mon, 18 Aug 2025 14:51:42 +0000 (16:51 +0200)]
server : fix incoming tasks not processed in order (#15395)

6 weeks agoFix multiarch
Mathieu Baudier [Mon, 18 Aug 2025 12:15:03 +0000 (14:15 +0200)]
Fix multiarch

6 weeks agoFix broken build: require updated pip to support --break-system-packages (#15357)
Dobri Danchev [Mon, 18 Aug 2025 10:50:48 +0000 (05:50 -0500)]
Fix broken build: require updated pip to support --break-system-packages (#15357)

* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912cfd2ee15c5a4a0074813232134892f6.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)

6 weeks agoOptimize more aggressively
Mathieu Baudier [Mon, 18 Aug 2025 10:36:09 +0000 (12:36 +0200)]
Optimize more aggressively

6 weeks agoggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
compilade [Mon, 18 Aug 2025 07:23:56 +0000 (03:23 -0400)]
ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)

* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants

6 weeks agovulkan: disable spirv-opt for bfloat16 shaders (#15352)
Jeff Bolz [Mon, 18 Aug 2025 05:56:29 +0000 (00:56 -0500)]
vulkan: disable spirv-opt for bfloat16 shaders (#15352)

6 weeks agoserver : export max observed n_past value (#15361)
Oleksandr Kuvshynov [Sun, 17 Aug 2025 22:28:58 +0000 (18:28 -0400)]
server : export max observed n_past value (#15361)

Add tracking for high-watermark cache usage and make it available in the /metrics endpoint.

Use case: tracking the largest cache usage needed under a realistic workload,
to better understand memory requirements and be able to adjust the
cache size/quantization for the model accordingly.
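
A high watermark like this is a small, self-contained pattern; a hedged C++ sketch (names are hypothetical, not the server's actual metrics code):

```cpp
#include <atomic>

struct kv_metrics {
    std::atomic<int> n_past_max{0}; // largest n_past observed so far

    void observe(int n_past) {
        int cur = n_past_max.load(std::memory_order_relaxed);
        // lock-free max: retry until our value is installed or beaten
        while (n_past > cur &&
               !n_past_max.compare_exchange_weak(cur, n_past, std::memory_order_relaxed)) {
        }
    }
};
```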

6 weeks agovulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
Jeff Bolz [Sun, 17 Aug 2025 16:08:57 +0000 (11:08 -0500)]
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <redacted>
---------

Co-authored-by: 0cc4m <redacted>
6 weeks agovulkan: support sqrt (#15370)
Dong Won Kim [Sun, 17 Aug 2025 14:03:09 +0000 (23:03 +0900)]
vulkan: support sqrt (#15370)

6 weeks agoconvert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
Sigbjørn Skjæret [Sun, 17 Aug 2025 12:47:42 +0000 (14:47 +0200)]
convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)

* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead

6 weeks agoci : fix hang in windows-hip build/release (#15365)
Sigbjørn Skjæret [Sun, 17 Aug 2025 11:30:23 +0000 (13:30 +0200)]
ci : fix hang in windows-hip build/release (#15365)

* fix hang in windows-latest-cmake-hip

* apply fix to release as well

6 weeks agovulkan: Optimize argsort (#15354)
Jeff Bolz [Sun, 17 Aug 2025 08:41:45 +0000 (03:41 -0500)]
vulkan: Optimize argsort (#15354)

- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
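
The "next larger power of two" sizing from the first point can be sketched host-side; an illustrative helper only, since the real logic lives in the Vulkan backend's dispatch code:

```cpp
#include <cstdint>

// Round a row length up to the next power of two, so e.g. 24 elements are
// sorted by 32 invocations; out-of-range lanes are handled either by the
// bounds-checking specialization or by sentinel keys.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}
```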

6 weeks agomodel : support vision LiquidAI LFM2-VL family (#15347)
Tarek Dakhran [Sat, 16 Aug 2025 21:33:54 +0000 (23:33 +0200)]
model : support vision LiquidAI LFM2-VL family (#15347)

* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
6 weeks agovulkan: fuse adds (#15252)
Jeff Bolz [Sat, 16 Aug 2025 16:48:22 +0000 (11:48 -0500)]
vulkan: fuse adds (#15252)

* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
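
A hedged sketch of how such a fusable chain of same-shape adds could be detected in a ggml graph (the helper name is made up and the backend's real fusion logic differs; only the ggml calls are real API):

```cpp
#include "ggml.h"

// Count consecutive same-shape GGML_OP_ADD nodes starting at node i, where
// each add consumes the previous one, capped at max_fuse (6 in the text above).
static int count_fusable_adds(struct ggml_cgraph * gf, int i, int max_fuse) {
    int n = 0;
    struct ggml_tensor * prev = NULL;
    for (int j = i; j < ggml_graph_n_nodes(gf) && n < max_fuse; ++j) {
        struct ggml_tensor * node = ggml_graph_node(gf, j);
        if (node->op != GGML_OP_ADD)                          break;
        if (!ggml_are_same_shape(node->src[0], node->src[1])) break;
        if (prev != NULL && node->src[0] != prev)             break;
        prev = node;
        n++;
    }
    return n;
}
```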

6 weeks agovulkan: Support mul_mat_id with f32 accumulators (#15337)
Jeff Bolz [Sat, 16 Aug 2025 09:18:31 +0000 (04:18 -0500)]
vulkan: Support mul_mat_id with f32 accumulators (#15337)

* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).

6 weeks agovulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334)
Jeff Bolz [Sat, 16 Aug 2025 08:58:38 +0000 (03:58 -0500)]
vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (#15334)

6 weeks agoOpenCL: add initial FA support (#14987)
rmatif [Sat, 16 Aug 2025 08:05:55 +0000 (10:05 +0200)]
OpenCL: add initial FA support (#14987)

* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops

7 weeks agocommon : fix double bos, use common_chat_templates for add_bos and add_eos (#15326)
Daniel Bevenius [Fri, 15 Aug 2025 17:50:52 +0000 (19:50 +0200)]
common : fix double bos, use common_chat_templates for add_bos and add_eos (#15326)

This commit updates common_chat_templates_apply_jinja to use the
add_bos and add_eos parameters from the chat template instead of
the inputs.

The motivation for this is that if the `add_bos` and `add_eos` from the
input parameters are used, there can be a mismatch between the model and
the chat template. That mismatch can prevent the duplicate BOS/EOS token
removal in chat.cpp `apply` from happening, leading to two BOS tokens
being added to the template.
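
In essence the fix keys BOS handling off the template rather than the caller; a hedged sketch using the public vocab API (the helper itself is hypothetical):

```cpp
#include "llama.h"

// Only let the tokenizer prepend BOS when the model wants one AND the
// rendered chat template has not already emitted it.
static bool should_tokenizer_add_bos(const struct llama_vocab * vocab,
                                     bool template_emits_bos) {
    return llama_vocab_get_add_bos(vocab) && !template_emits_bos;
}
```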

7 weeks agoopencl: add initial mxfp4 support via mv (#15270)
lhez [Fri, 15 Aug 2025 16:52:14 +0000 (00:52 +0800)]
opencl: add initial mxfp4 support via mv (#15270)

* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 transpose fix for Adreno

---------

Co-authored-by: shawngu-quic <redacted>
7 weeks agoAdapt to changes in ggml
Mathieu Baudier [Fri, 15 Aug 2025 16:00:08 +0000 (18:00 +0200)]
Adapt to changes in ggml

7 weeks agoPackage multimodal tools separately
Mathieu Baudier [Fri, 15 Aug 2025 14:32:13 +0000 (16:32 +0200)]
Package multimodal tools separately

7 weeks agoMake Debian files easier to diff with Debian official packages
Mathieu Baudier [Fri, 15 Aug 2025 14:02:18 +0000 (16:02 +0200)]
Make Debian files easier to diff with Debian official packages

7 weeks agovulkan : fix out-of-bounds access in argmax kernel (#15342)
Georgi Gerganov [Fri, 15 Aug 2025 14:16:36 +0000 (17:16 +0300)]
vulkan : fix out-of-bounds access in argmax kernel (#15342)

ggml-ci

7 weeks agovulkan : fix compile warnings on macos (#15340)
Georgi Gerganov [Fri, 15 Aug 2025 13:28:28 +0000 (16:28 +0300)]
vulkan : fix compile warnings on macos (#15340)

ggml-ci

7 weeks agoggml: initial IBM zDNN backend (#14975)
Aaron Teo [Fri, 15 Aug 2025 13:11:22 +0000 (21:11 +0800)]
ggml: initial IBM zDNN backend (#14975)

* ggml-zdnn: initial backend impl

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: temp change z17 to arch15

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: fix build bugs

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: tensor->extra logging check

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: add layout name mapping, ztensor information

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: separate logging into its own line

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: add shape comparison

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: add ggml_tensor shape log

Signed-off-by: Aaron Teo <redacted>
ggml-zdnn: fix incorrect shape logging

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add output buffer check

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: run compute and store into tensor->extra

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add set_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add more loggers

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update set_tensor logging to check only for matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: last working matmul version

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add comments to prevent accidentally deleting lines

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: support op out_prod

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update op out_prod to use tensor->extra

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rewrite the backend implementation

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: bugfix new impl

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix compiler warnings and bugfixes

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: test ztensor finding in init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: implement at least 1 op to test

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: assign tensor->extra to buffer

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add check for view tensors to prevent init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rework init_tensor to create new buffers

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: switch to std vector instead of array

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: switch buffers back and set to arbitrary number

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: impl init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update supports_op matmul matrix

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: impl matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix compiler error missing type

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix missing data transform call

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add bias init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: tighten memory usage, change string allocation

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add bias ztensor and data free

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add bias data transform

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add more debug info for extra buffer transform

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add logger to check if mat mul ops go through set_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: activate bias transform in matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: move weights transform into mulmat

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add more safeguards in matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix sequencing of transforms

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: bugfix transform ztensor vs origtensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: figure out why sigtrap is happening

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix sigsegv

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: move everything back to local declaration

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: move bias data to local also

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: bring back working matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: rewrite into mre

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix missing vector import

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix missing vector import in header

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt to fix sigsegv

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix missing load tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix invalid ztensor buffer release

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add logging to debug free buffer

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: remove free_buffer debug info

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add parmblkformat detections

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add nnpa installed detection

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add zdnn_init call for static libs

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt at fixing invalid buffer

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: switch to using deque to fix pointer deref problem

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add weights logging to check

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt to use unique ptr

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add tensor to pre_tfm_desc logging

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add inputs logging

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: disable op_none initialisation for testing

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix missing return from init_tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: load ztensors in cgraph exec

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: work on moving output ztensor as well

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: disable logging and breakpoints for full test

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt at manually changing the layout

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt at using default nwhc format instead

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: disable global load ztensor for now

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix erroneous output load tensor

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: add guards to prevent loading ztensor if transformed

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: bring load ztensor back to init routine

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix ztensor deallocation abort

stabilise ggml <-> zdnn api

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: clean up matmul selection

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: clean up project structure

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: update documentation, prepare for upstream

Signed-off-by: Aaron Teo <redacted>
* chore: add codeowners

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: disable batched matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: attempt at fixing tensor views during matmul

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: deny all view tensors directly

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix pr comments

Signed-off-by: Aaron Teo <redacted>
* docs: update ops docs for zdnn

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: redo test-backend-ops for ops.md

Signed-off-by: Aaron Teo <redacted>
* ggml-zdnn: fix typo in build-s390x.md

Signed-off-by: Aaron Teo <redacted>
* codeowners: remove taronaeo for now

Signed-off-by: Aaron Teo <redacted>
* Revert "codeowners: remove taronaeo for now"

This reverts commit 411ea4ed78d08778967bd0bd33a6538cfcbe082f.

* ggml-zdnn: remove unused ggml_zdnn macro

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
7 weeks agoUpdate upstream
Mathieu Baudier [Fri, 15 Aug 2025 13:00:03 +0000 (15:00 +0200)]
Update upstream

7 weeks agoMerge tag 'upstream/0.0.6164' into debian/latest
Mathieu Baudier [Fri, 15 Aug 2025 12:58:03 +0000 (14:58 +0200)]
Merge tag 'upstream/0.0.6164' into debian/latest

Upstream release

7 weeks agoci : fix ios-xcode-build (#15324)
Sigbjørn Skjæret [Fri, 15 Aug 2025 12:02:39 +0000 (14:02 +0200)]
ci : fix ios-xcode-build (#15324)

* fix ios-xcode-build

* use xcode-select with fixed version

* switch to macos-15 to get xcode 16.4

7 weeks agoci : move ccache action to ggml-org fork (#15328)
Diego Devesa [Fri, 15 Aug 2025 10:27:02 +0000 (03:27 -0700)]
ci : move ccache action to ggml-org fork (#15328)

7 weeks agotest-opt: fix backend support check (#15317)
Johannes Gäßler [Fri, 15 Aug 2025 09:23:17 +0000 (11:23 +0200)]
test-opt: fix backend support check (#15317)

* test-opt: fix backend support check

* Update tests/test-opt.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
7 weeks agoCUDA: fix negative KV_max values in FA (#15321)
Johannes Gäßler [Thu, 14 Aug 2025 21:21:24 +0000 (23:21 +0200)]
CUDA: fix negative KV_max values in FA (#15321)

7 weeks agoeval-callback : stop on first NaN (#15320)
Georgi Gerganov [Thu, 14 Aug 2025 19:10:51 +0000 (22:10 +0300)]
eval-callback : stop on first NaN (#15320)

* eval-callback : stop on first NaN

* cont : log error

7 weeks agochat : include kwargs in template example (#15309)
Diego Devesa [Thu, 14 Aug 2025 17:28:29 +0000 (10:28 -0700)]
chat : include kwargs in template example (#15309)

7 weeks agollama : add 18-layer model type for Gemma 3-270m (#15319)
Daniel Bevenius [Thu, 14 Aug 2025 15:56:26 +0000 (17:56 +0200)]
llama : add 18-layer model type for Gemma 3-270m (#15319)

This commit adds support for the 18-layer model type in the Gemma3
series, which is the size of the Gemma3-270m model.

The motivation for this commit is that this was the only change required
for Gemma3-270m to be converted to GGUF format and used with llama.cpp.

Once the model has been converted and uploaded to Huggingface it can be
used like this:
```console
$ ./build/bin/llama-cli -hf ggml-org/gemma-3-270m-GGUF:Q8_0
```
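
The change itself amounts to one new case in the per-architecture layer-count switch; a hedged sketch (the exact enum names in the actual commit may differ):

```cpp
// Map layer count to model type for Gemma3 (illustrative excerpt).
switch (hparams.n_layer) {
    case 18: type = LLM_TYPE_270M; break; // new: Gemma3-270m
    case 26: type = LLM_TYPE_1B;   break;
    // ... larger Gemma3 variants ...
    default: type = LLM_TYPE_UNKNOWN;
}
```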

7 weeks agodevops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)
simevo [Thu, 14 Aug 2025 15:45:27 +0000 (17:45 +0200)]
devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)

fixes #15004

Co-authored-by: Paolo Greppi <redacted>
7 weeks agoHIP: Cleanup hipification header (#15285)
uvos [Thu, 14 Aug 2025 14:23:56 +0000 (16:23 +0200)]
HIP: Cleanup hipification header (#15285)

Add explicit conversion operator to support older versions of ROCm
Switch over to hip_bf16 from the legacy hip_bfloat16
Simplify the RDNA3 define
Lower the switchover to the new hipBLAS API to ROCm 6.5, as this version is used for ROCm 7.0 previews

---------

Co-authored-by: Johannes Gäßler <redacted>
7 weeks agogpt-oss: implement harmony parsing (#15181) upstream/0.0.6164
Aldehir Rojas [Thu, 14 Aug 2025 14:23:11 +0000 (09:23 -0500)]
gpt-oss: implement harmony parsing (#15181)

* model : add harmony parser for gpt-oss

* gpt-oss : fix grammar trigger from causing empty stack

* gpt-oss: tweak the grammar trigger again

* gpt-oss : add support for recipient in role header

* gpt-oss : fix ungrouped tool calls in grammar

* gpt-oss : loosen function name matching during parse

* gpt-oss : clean up workarounds

* gpt-oss : add template tests

* gpt-oss : simulate thinking and tool call tags

* gpt-oss : undo think tags when reasoning_format is none

* gpt-oss : set special tokens back to user defined

* gpt-oss : update openai-gpt-oss template

* server : filter out harmony thought messages

* gpt-oss : simplify parsing

7 weeks agodocker : Enable GGML_CPU_ALL_VARIANTS for ARM (#15267)
Christian Kastner [Thu, 14 Aug 2025 14:22:58 +0000 (16:22 +0200)]
docker : Enable GGML_CPU_ALL_VARIANTS for ARM (#15267)

7 weeks agoreadme : update hot topics (#15315)
Georgi Gerganov [Thu, 14 Aug 2025 14:16:03 +0000 (17:16 +0300)]
readme : update hot topics (#15315)

7 weeks agovulkan: perf_logger improvements (#15246)
Jeff Bolz [Thu, 14 Aug 2025 13:38:10 +0000 (08:38 -0500)]
vulkan: perf_logger improvements (#15246)

* vulkan: perf_logger improvements

- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.

* use <=mul_mat_vec_max_cols rather than ==1

7 weeks agoserver : add SWA checkpoints (#15293)
Georgi Gerganov [Thu, 14 Aug 2025 11:59:50 +0000 (14:59 +0300)]
server : add SWA checkpoints (#15293)

* server : add SWA checkpoints

ggml-ci

* cont : server clean-up

* server : handle state restore fails

* llama : add extended llama_state_seq_ API

* server : do not make checkpoints if --swa-full

ggml-ci

* llama : remove flags value for NONE

* server : configure number of SWA checkpoints with CLI arg

ggml-ci

* args : fix scope of new argument
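
A minimal sketch of what a per-sequence checkpoint looks like with the llama_state_seq_ API referenced above (base API shown; the commit adds an extended variant on top of it):

```cpp
#include "llama.h"
#include <cstdint>
#include <vector>

// Save the state of one sequence so it can be restored later.
static std::vector<uint8_t> seq_checkpoint_save(llama_context * ctx, llama_seq_id seq) {
    std::vector<uint8_t> buf(llama_state_seq_get_size(ctx, seq));
    llama_state_seq_get_data(ctx, buf.data(), buf.size(), seq);
    return buf;
}

// Returns false on failure - hence the "handle state restore fails" item above.
static bool seq_checkpoint_restore(llama_context * ctx, llama_seq_id seq,
                                   const std::vector<uint8_t> & buf) {
    return llama_state_seq_set_data(ctx, buf.data(), buf.size(), seq) != 0;
}
```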

7 weeks agosync : ggml
Georgi Gerganov [Thu, 14 Aug 2025 11:19:23 +0000 (14:19 +0300)]
sync : ggml

ggml-ci

7 weeks agoggml: fix ggml_conv_1d_dw bug (ggml/1323)
Jason Ni [Thu, 14 Aug 2025 11:17:51 +0000 (19:17 +0800)]
ggml: fix ggml_conv_1d_dw bug (ggml/1323)

* ggml: fix ggml_conv_1d_dw bug

* Fixed conv1d_dw weight tensor dimension.
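
For reference, a hedged usage sketch of the op in question, with ggml's stride/padding/dilation parameters spelled out (tensor shapes and the wrapper name are illustrative):

```cpp
#include "ggml.h"

// Depthwise 1-D convolution: each channel is convolved with its own kernel.
static struct ggml_tensor * conv1d_dw_example(struct ggml_context * ctx,
                                              struct ggml_tensor * kernel,
                                              struct ggml_tensor * data) {
    return ggml_conv_1d_dw(ctx, kernel, data, /*s0=*/1, /*p0=*/0, /*d0=*/1);
}
```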

7 weeks agotests : remove unused includes (ggml/0)
Georgi Gerganov [Thu, 14 Aug 2025 10:41:03 +0000 (13:41 +0300)]
tests : remove unused includes (ggml/0)

7 weeks agoperplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304)
kallewoof [Thu, 14 Aug 2025 11:03:30 +0000 (20:03 +0900)]
perplexity : provide a helpful hint for has_cpl case in split_equal error. (#15304)

When attempting to run llama-perplexity on certain tasks which have coupled sequences, there is a cryptic error that does not tell you what to do, which is to set the -kvu flag. This adds a hint about that fact.

7 weeks agocuda : fix GGML_CUDA_GRAPHS=OFF (#15300)
Sigbjørn Skjæret [Thu, 14 Aug 2025 10:22:07 +0000 (12:22 +0200)]
cuda : fix GGML_CUDA_GRAPHS=OFF (#15300)

* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead

7 weeks agofinetune: SGD optimizer, more CLI args (#13873)
Jonathan Graehl [Thu, 14 Aug 2025 10:03:57 +0000 (03:03 -0700)]
finetune: SGD optimizer, more CLI args (#13873)

* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
The terminal -lr is for now set by -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <redacted>
Co-authored-by: Johannes Gäßler <redacted>
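
The SGD update described above, with the weight decay stored directly (rather than as 1-alpha*wd), reduces to a few lines; a hedged scalar sketch, not ggml's actual kernel:

```cpp
#include <cstdint>

// p <- p - lr * (g + wd * p), i.e. weight decay folded into the step.
static void sgd_step(float * p, const float * g, int64_t n, float lr, float wd) {
    for (int64_t i = 0; i < n; ++i) {
        p[i] -= lr * (g[i] + wd * p[i]);
    }
}
```

Unlike AdamW, no first/second-moment (m, v) tensors are needed, which is where the memory saving quoted above comes from.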
7 weeks agoperplexity: give more information about constraints on failure (#15303)
kallewoof [Thu, 14 Aug 2025 06:16:32 +0000 (15:16 +0900)]
perplexity: give more information about constraints on failure (#15303)

* perplexity: give more information about constraints on failure

This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each.

* log formatting

* log error and return instead of storing max_seq_exceeded int

* check if s0 is zero for -np check

7 weeks agoHIP: bump requirement to rocm 6.1 (#15296)
uvos [Wed, 13 Aug 2025 18:44:30 +0000 (20:44 +0200)]
HIP: bump requirement to rocm 6.1 (#15296)

7 weeks agofix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295)
Bas Nijholt [Wed, 13 Aug 2025 18:21:31 +0000 (11:21 -0700)]
fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295)

The flake.nix included references to llama-cpp.cachix.org cache with a comment
claiming it's 'Populated by the CI in ggml-org/llama.cpp', but:

1. No visible CI workflow populates this cache
2. The cache is empty for recent builds (tested b6150, etc.)
3. This misleads users into expecting pre-built binaries that don't exist

This change removes the non-functional cache references entirely, leaving only
the working cuda-maintainers cache that actually provides CUDA dependencies.

Users can still manually add the llama-cpp cache if it becomes functional in the future.

7 weeks agoserver : enable -td and -tbd parameters (#15172)
Sigbjørn Skjæret [Wed, 13 Aug 2025 13:43:00 +0000 (15:43 +0200)]
server : enable -td and -tbd parameters (#15172)

7 weeks agoggml : update `ggml_rope_multi` (#12665)
Judd [Wed, 13 Aug 2025 10:45:15 +0000 (18:45 +0800)]
ggml : update `ggml_rope_multi` (#12665)

* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
2. use `GGML_MROPE_SECTIONS` instead of 4.

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
7 weeks agocommon : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)
Copilot [Wed, 13 Aug 2025 10:44:40 +0000 (12:44 +0200)]
common : add --override-tensor-draft, --cpu-moe-draft and --n-cpu-moe-draft parameters (#15191)

* Checkpoint from VS Code for coding agent session

* Initial plan

* Fix typo in --override-tensor-draft flag implementation

* Add null termination for speculative tensor buffer overrides

* Apply suggestions from code review

* Apply suggestions from code review

* Extract tensor override parsing logic to common function (addresses @slaren's feedback)

* Apply suggestions from code review

* Apply suggestions

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Diego Devesa <redacted>
7 weeks agoserver : filter out harmony thought messages (#15278)
Aldehir Rojas [Wed, 13 Aug 2025 10:28:21 +0000 (05:28 -0500)]
server : filter out harmony thought messages (#15278)

7 weeks agoci : Added CI with RISC-V RVV1.0 Hardware (#14439)
Ali Tariq [Wed, 13 Aug 2025 10:14:44 +0000 (15:14 +0500)]
ci : Added CI with RISC-V RVV1.0 Hardware (#14439)

* Changed the CI file to hw

* Changed the CI file to hw

* Added to sudoers for apt

* Removed the clone command and used checkout

* Added libcurl

* Added gcc-14

* Checking gcc --version

* added gcc-14 symlink

* added CC and C++ variables

* Added the gguf weight

* Changed the weights path

* Added system specification

* Removed white spaces

* ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow

Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions.

* removed trailing whitespaces

---------

Co-authored-by: Akif Ejaz <redacted>
7 weeks agoci : add more python requirements to copilot-setup-steps (#15289)
Sigbjørn Skjæret [Wed, 13 Aug 2025 09:30:45 +0000 (11:30 +0200)]
ci : add more python requirements to copilot-setup-steps (#15289)

* ci : add flake8 and pyright to copilot-setup-steps.yml

* add tools/server/tests/requirements.txt

7 weeks agoggml : repack block_iq4_nlx8 (#14904)
Georgi Gerganov [Wed, 13 Aug 2025 08:09:39 +0000 (11:09 +0300)]
ggml : repack block_iq4_nlx8 (#14904)

ggml-ci

7 weeks agoCUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (#15132)
Oliver Simons [Wed, 13 Aug 2025 08:04:46 +0000 (10:04 +0200)]
CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (#15132)

* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect the bigger threadblock
4. Improve default block_dims, increase support for more block_dims

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

Break-even point was the minimum of the following multiples.

| GPU Model                    | Nrow SM Count Multiple |
| ---------------------------- | ---------------------- |
| RTX 4000 SFF ADA             | 2.0x                   |
| RTX 6000 ADA                 | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q | 3.04x                  |
| RTX PRO 4500 Blackwell       | 3.15x                  |

* Ensure perf gains also for small ncols and large nrows

Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled per default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
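
The SM-count heuristic from earlier in this change can be sketched host-side; a hedged illustration with a rounded threshold taken from the break-even table above (names, the exact cutoff, and the direction of the toggle are assumptions, not the kernel's actual code):

```cpp
#include <cstdint>

// Pick the reduce_rows block size: assume the bigger threadblock pays off
// when few rows would otherwise leave the SMs underutilized.
static int pick_block_size(int64_t nrows, int sm_count) {
    return nrows < 3 * (int64_t) sm_count ? 512 : 128;
}
```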

7 weeks agoci : add copilot-setup-steps.yml (#15214)
Sigbjørn Skjæret [Wed, 13 Aug 2025 07:07:13 +0000 (09:07 +0200)]
ci : add copilot-setup-steps.yml (#15214)

7 weeks agoggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (#15188)
Tak-RS [Wed, 13 Aug 2025 05:54:30 +0000 (14:54 +0900)]
ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (#15188)

* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055

* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv

* rpc: drop n==0 special case in send_data(); retry in loop per review

* rpc: remove trailing whitespace in send_data()

---------

Co-authored-by: Shinnosuke Takagi <redacted>
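
The chunking itself is a standard loop; a hedged POSIX sketch (the chunk constant and error handling here are illustrative, not the rpc backend's exact code):

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Send n bytes in bounded chunks so a single send() never exceeds what the
// OS accepts (very large sizes can fail with EINVAL on macOS).
static bool send_all(int fd, const uint8_t * buf, size_t n) {
    const size_t MAX_CHUNK_SIZE = 1u << 23; // 8 MiB, illustrative cap
    size_t sent = 0;
    while (sent < n) { // n == 0 falls straight through, per the review notes
        const size_t  chunk = std::min(n - sent, MAX_CHUNK_SIZE);
        const ssize_t ret   = send(fd, buf + sent, chunk, 0);
        if (ret <= 0) {
            return false; // the real code retries in a loop per review
        }
        sent += (size_t) ret;
    }
    return true;
}
```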
7 weeks agoHIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (#15273)
uvos [Tue, 12 Aug 2025 20:15:12 +0000 (22:15 +0200)]
HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (#15273)

7 weeks agoUpdate upstream
Mathieu Baudier [Tue, 12 Aug 2025 12:20:00 +0000 (14:20 +0200)]
Update upstream

7 weeks agoMerge tag 'upstream/0.0.6073' into debian/latest
Mathieu Baudier [Tue, 12 Aug 2025 12:18:02 +0000 (14:18 +0200)]
Merge tag 'upstream/0.0.6073' into debian/latest

Upstream release

7 weeks agosycl: Fix and disable more configurations of mul_mat (#15151)
Romain Biessy [Tue, 12 Aug 2025 11:58:22 +0000 (13:58 +0200)]
sycl: Fix and disable more configurations of mul_mat (#15151)

* sycl: Fix and disable more configurations of mul_mat

* Disable more configurations

7 weeks agoopencl: allow mixed f16/f32 `add` (#15140)
rmatif [Tue, 12 Aug 2025 09:42:41 +0000 (11:42 +0200)]
opencl: allow mixed f16/f32 `add` (#15140)

7 weeks agoCUDA cmake: add `-lineinfo` for easier debug (#15260)
Aman Gupta [Tue, 12 Aug 2025 09:21:45 +0000 (17:21 +0800)]
CUDA cmake: add `-lineinfo` for easier debug (#15260)

7 weeks agoCANN: GGML_OP_CPY optimization (#15070)
Chenguang Li [Tue, 12 Aug 2025 08:12:13 +0000 (16:12 +0800)]
CANN: GGML_OP_CPY optimization (#15070)

Signed-off-by: noemotiovon <redacted>
7 weeks agomusa: fix failures in test-backend-ops for mul_mat_id op (#15236)
R0CKSTAR [Tue, 12 Aug 2025 02:02:51 +0000 (10:02 +0800)]
musa: fix failures in test-backend-ops for mul_mat_id op (#15236)

* musa: fix failures in test-backend-ops for mul_mat_id op

Signed-off-by: Xiaodong Ye <redacted>
* Address review comments

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
7 weeks agoCANN: Add broadcast for softmax and FA (#15208)
hipudding [Mon, 11 Aug 2025 14:50:31 +0000 (22:50 +0800)]
CANN: Add broadcast for softmax and FA (#15208)

* refactor softmax

* fix fa

* fix mask shape

* format

* add comments

* Remove whitespace

7 weeks agomtmd : Fix MinicpmV model converter and clip to avoid using hardcoded values. (#14750)
rainred [Mon, 11 Aug 2025 14:12:12 +0000 (22:12 +0800)]
mtmd : Fix MinicpmV model converter and clip to avoid using hardcoded values. (#14750)

* Fix MinicpmV model converter and clip to avoid using hardcoded values.

* Code update for pr/14750

* Remove unused field, update script path in docs.

* Add version 5 for fallback code.

---------

Co-authored-by: lzhang <redacted>
7 weeks agochat : hotfix gpt-oss jinja raising an exception (#15243)
Xuan-Son Nguyen [Mon, 11 Aug 2025 13:31:35 +0000 (15:31 +0200)]
chat : hotfix gpt-oss jinja raising an exception (#15243)

* chat : hotfix gpt-oss jinja raising an exception

* fix

7 weeks agoserver : allow specifying reasoning_format in HTTP request (#15238)
Xuan-Son Nguyen [Mon, 11 Aug 2025 12:48:41 +0000 (14:48 +0200)]
server : allow specifying reasoning_format in HTTP request (#15238)

7 weeks agoreadme : update infra list (#15234)
Zagaj [Mon, 11 Aug 2025 12:27:54 +0000 (14:27 +0200)]
readme : update infra list (#15234)

7 weeks agokv-cache : fix seq_rm with seq_id == -1 (#15226)
Georgi Gerganov [Mon, 11 Aug 2025 10:58:24 +0000 (13:58 +0300)]
kv-cache : fix seq_rm with seq_id == -1 (#15226)

* kv-cache : fix seq_rm with seq_id == -1

ggml-ci

* cont : iterate over streams

ggml-ci

7 weeks agokv-cache : log (debug) all streams in find_slot (#15176)
Daniel Bevenius [Mon, 11 Aug 2025 09:21:19 +0000 (11:21 +0200)]
kv-cache : log (debug) all streams in find_slot (#15176)

This commit updates `llama_kv_cache_unified::find_slot` to log
information for all streams when debug is enabled.

The motivation for this change is that currently, if a non-unified
kv-cache is used, only one stream will be logged, because the code
currently uses `seq_to_stream[1]`.
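
A hedged sketch of the resulting loop (names hypothetical; the real code lives in `llama_kv_cache_unified::find_slot`):

```cpp
// Log slot info for every stream instead of hard-coding one of them.
for (uint32_t s = 0; s < n_stream; ++s) {
    LLAMA_LOG_DEBUG("stream[%u]: head = %u, n_used = %u\n",
                    s, streams[s].head, streams[s].n_used);
}
```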

7 weeks agoconvert : fix merge conflicts (#15229)
Sigbjørn Skjæret [Mon, 11 Aug 2025 09:15:44 +0000 (11:15 +0200)]
convert : fix merge conflicts (#15229)

7 weeks agoperplexity : update comments/error msg to use decode [no ci] (#15227)
Daniel Bevenius [Mon, 11 Aug 2025 08:21:24 +0000 (10:21 +0200)]
perplexity : update comments/error msg to use decode [no ci] (#15227)

This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.

The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.

7 weeks agoconvert : improve Mistral models integration (#14737)
Julien Denize [Mon, 11 Aug 2025 08:07:49 +0000 (10:07 +0200)]
convert : improve Mistral models integration (#14737)

* Improve Mistral models integration with llama.cpp

* Revert changes and fix gguf

* Revert change

* refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py

* Revert collateral

* Rename model name

* refactor

* revert

* remove duplicate

* Remove duplication code

* Fixes

* Fix flake issues

* Apply comments

* Apply comments

* Apply comments

* Fix remote

* add default chat template

* Revert

* nit

7 weeks agokleidiai: fix unsigned overflow bug (#15150)
Charles Xu [Mon, 11 Aug 2025 07:59:26 +0000 (09:59 +0200)]
kleidiai: fix unsigned overflow bug (#15150)

* kleidiai: fix unsigned overflow bug

* address review comments

7 weeks agocuda: refactored ssm_scan and use CUB (#13291)
David Zhao [Sat, 9 Aug 2025 18:29:43 +0000 (13:29 -0500)]
cuda: refactored ssm_scan and use CUB (#13291)

* cuda: refactored ssm_scan to use CUB

* fixed compilation error when when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning

7 weeks agoCUDA: add attention sinks for tile and wmma (#15178)
Aman Gupta [Sat, 9 Aug 2025 12:00:24 +0000 (20:00 +0800)]
CUDA: add attention sinks for tile and wmma (#15178)

* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma

8 weeks agogguf-py : add Numpy MXFP4 de/quantization support (#15111)
compilade [Fri, 8 Aug 2025 21:48:26 +0000 (17:48 -0400)]
gguf-py : add Numpy MXFP4 de/quantization support (#15111)

* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4

8 weeks agoserver-bench: external OAI servers, sqlite (#15179)
Johannes Gäßler [Fri, 8 Aug 2025 21:04:36 +0000 (23:04 +0200)]
server-bench: external OAI servers, sqlite (#15179)

* server-bench: external OAI servers, sqlite

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* raise_for_status

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
8 weeks agoggml : fix field name when creating a new ggml_backend (#14944)
AN Long [Fri, 8 Aug 2025 12:37:22 +0000 (21:37 +0900)]
ggml : fix field name when creating a new ggml_backend (#14944)

8 weeks agovendor: sync minja (#15161)
Olivier Chafik [Fri, 8 Aug 2025 09:45:18 +0000 (10:45 +0100)]
vendor: sync minja (#15161)

* vendor: sync minja

* Update minja.hpp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
8 weeks agoCUDA: attention sinks for mma FlashAttention (#15157)
Johannes Gäßler [Fri, 8 Aug 2025 06:19:58 +0000 (08:19 +0200)]
CUDA: attention sinks for mma FlashAttention (#15157)

8 weeks agoopencl: support sink in `soft_max` (attn sinks) (#15152)
lhez [Fri, 8 Aug 2025 04:47:03 +0000 (13:47 +0900)]
opencl: support sink in `soft_max` (attn sinks) (#15152)

8 weeks agoconvert : support non-mxfp4 HF model (#15153)
Xuan-Son Nguyen [Thu, 7 Aug 2025 21:26:03 +0000 (23:26 +0200)]
convert : support non-mxfp4 HF model (#15153)

* convert : support non-mxfp4 HF model

* rm redundant check

* disable debug check

8 weeks agovulkan: support fattn sinks (#15126)
Jeff Bolz [Thu, 7 Aug 2025 20:44:20 +0000 (15:44 -0500)]
vulkan: support fattn sinks (#15126)

8 weeks agovulkan: Add env var to disable host visible vidmem (#15109)
Jeff Bolz [Thu, 7 Aug 2025 20:07:11 +0000 (15:07 -0500)]
vulkan: Add env var to disable host visible vidmem (#15109)

8 weeks agollama : Support intern-s1 (#14875)
RunningLeon [Thu, 7 Aug 2025 16:20:40 +0000 (00:20 +0800)]
llama : Support intern-s1 (#14875)

* support internvl

* support interns1

* resolve comments

* put interns1 in tensor mapping

* resolve comment

* move tokenizer changes to sub class

7 weeks agoHIP: add cmake option to enable compiler output of kernel resource usage metrics (#15103)
uvos [Thu, 7 Aug 2025 14:44:14 +0000 (16:44 +0200)]
HIP: add cmake option to enable compiler output of kernel resource usage metrics (#15103)