git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Xuan-Son Nguyen [Wed, 9 Jul 2025 06:26:13 +0000 (08:26 +0200)]
convert : fix smollm3 jinja template (#14586)
Jeff Bolz [Tue, 8 Jul 2025 18:11:42 +0000 (13:11 -0500)]
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
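For reference, a serial C sketch of the merge this reduction performs, with illustrative names: each split i contributes a running max m[i], an exp-sum l[i], and partial outputs o[i*hsv + d] over the HSV dimension.
```c
#include <math.h>

// Serial reference for merging k_num flash-attention splits (sketch only,
// not the shader). The shader parallelizes the M/L reduction across the
// workgroup and the final loop across one thread per HSV element.
void split_k_reduce_ref(const float * m, const float * l, const float * o,
                        float * out, int k_num, int hsv) {
    float m_all = -INFINITY;
    for (int i = 0; i < k_num; i++) m_all = fmaxf(m_all, m[i]);

    float l_all = 0.0f;
    for (int i = 0; i < k_num; i++) l_all += expf(m[i] - m_all) * l[i];

    for (int d = 0; d < hsv; d++) {
        float acc = 0.0f;
        for (int i = 0; i < k_num; i++) {
            acc += expf(m[i] - m_all) * o[i*hsv + d];
        }
        out[d] = acc / l_all;
    }
}
```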
stevenkuang [Tue, 8 Jul 2025 16:29:29 +0000 (00:29 +0800)]
model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang <redacted>
Xuan-Son Nguyen [Tue, 8 Jul 2025 16:07:01 +0000 (18:07 +0200)]
model : add SmolLM3 (#14581)
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code
---------
Co-authored-by: Vaibhavs10 <redacted>
compilade [Tue, 8 Jul 2025 15:37:47 +0000 (11:37 -0400)]
memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
Jeff Bolz [Tue, 8 Jul 2025 13:21:21 +0000 (08:21 -0500)]
vulkan : fix rope with partial rotation and non-cont src (#14582)
Alawode Oluwandabira [Tue, 8 Jul 2025 08:47:33 +0000 (11:47 +0300)]
server: Add ability to mount server at prefix (#14544)
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
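A hedged usage example of the new flag (the endpoint layout under the prefix is assumed):
```console
# mount the HTTP API under a path prefix
llama-server -m model.gguf --api-prefix /llm
# endpoints are then served under the prefix, e.g. /llm/health
```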
Xuan-Son Nguyen [Tue, 8 Jul 2025 08:24:06 +0000 (10:24 +0200)]
model : add hunyuan moe (#14425)
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <redacted>
Jeff Bolz [Tue, 8 Jul 2025 07:38:31 +0000 (02:38 -0500)]
vulkan: increase timeout for CI (#14574)
Georgi Gerganov [Tue, 8 Jul 2025 07:15:21 +0000 (10:15 +0300)]
cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont
ggml-ci
* cont : fix multi-rope + add test
ggml-ci
* sycl : try fix
ggml-ci
* cont : fix sycl + clean-up cuda
ggml-ci
Aman Gupta [Tue, 8 Jul 2025 02:11:18 +0000 (10:11 +0800)]
CUDA: add bilinear interpolation for upscale (#14563)
R0CKSTAR [Mon, 7 Jul 2025 23:58:30 +0000 (07:58 +0800)]
musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye <redacted>
Sigbjørn Skjæret [Mon, 7 Jul 2025 21:35:35 +0000 (23:35 +0200)]
llama : fix incorrect minicpm3 v_states shape (#14571)
Sigbjørn Skjæret [Mon, 7 Jul 2025 19:35:08 +0000 (21:35 +0200)]
llama : remove ggml_cont where possible (#14568)
Aman Gupta [Mon, 7 Jul 2025 13:45:43 +0000 (21:45 +0800)]
CUDA: add bf16 and i32 to getrows (#14529)
Eve [Sun, 6 Jul 2025 10:29:36 +0000 (10:29 +0000)]
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng <redacted>
Jeff Bolz [Sun, 6 Jul 2025 08:08:16 +0000 (03:08 -0500)]
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
Jeff Bolz [Sat, 5 Jul 2025 07:26:04 +0000 (02:26 -0500)]
vulkan: Handle updated FA dim2/3 definition (#14518)
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
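A minimal sketch of the dword packing mentioned above, assuming a 16-bit field for n_head_log2 and a single mask bit; the actual layout may differ.
```c
#include <stdbool.h>
#include <stdint.h>

// Pack a boolean and a small integer into one 32-bit push-constant word.
static inline uint32_t pack_fa_params(uint32_t n_head_log2, bool has_mask) {
    return (n_head_log2 & 0xFFFFu) | (has_mask ? (1u << 16) : 0u);
}
// Shader-side unpack mirrors the layout:
//   n_head_log2 = packed & 0xFFFF;  has_mask = (packed >> 16) & 1;
```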
Sigbjørn Skjæret [Sat, 5 Jul 2025 07:17:14 +0000 (09:17 +0200)]
server : fix assistant prefilling when content is an array (#14360)
Sigbjørn Skjæret [Sat, 5 Jul 2025 06:24:56 +0000 (08:24 +0200)]
opencl: add GELU_ERF (#14476)
Georgi Gerganov [Sat, 5 Jul 2025 04:18:09 +0000 (07:18 +0300)]
eval-callback : check for empty input (#14539)
R0CKSTAR [Sat, 5 Jul 2025 04:10:53 +0000 (12:10 +0800)]
test-backend-ops: add support for specifying output format (#14368)
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* refactor
Signed-off-by: Xiaodong Ye <redacted>
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <redacted>
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: slaren <redacted>
Georgi Gerganov [Fri, 4 Jul 2025 16:19:09 +0000 (19:19 +0300)]
metal : disable fast math in all quantize kernels (#14528)
ggml-ci
Georgi Gerganov [Fri, 4 Jul 2025 06:08:59 +0000 (09:08 +0300)]
batch : add optional for sequential equal split (#14511)
ggml-ci
Georgi Gerganov [Fri, 4 Jul 2025 06:05:36 +0000 (09:05 +0300)]
graph : prepare for 4D mask (#14515)
ggml-ci
Georgi Gerganov [Fri, 4 Jul 2025 06:04:59 +0000 (09:04 +0300)]
batch : add n_used count (#14512)
ggml-ci
luyhcsu [Fri, 4 Jul 2025 03:50:07 +0000 (11:50 +0800)]
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong <redacted>
Sigbjørn Skjæret [Thu, 3 Jul 2025 21:07:22 +0000 (23:07 +0200)]
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
lhez [Thu, 3 Jul 2025 18:22:24 +0000 (11:22 -0700)]
opencl : broadcast for soft_max (#14510)
Jeff Bolz [Thu, 3 Jul 2025 18:21:14 +0000 (13:21 -0500)]
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
Johannes Gäßler [Thu, 3 Jul 2025 15:05:18 +0000 (17:05 +0200)]
ggml: backward pass for split swiglu (#14483)
Nicolò Scipione [Thu, 3 Jul 2025 09:00:03 +0000 (11:00 +0200)]
Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione <redacted>
Xuan-Son Nguyen [Thu, 3 Jul 2025 08:03:06 +0000 (10:03 +0200)]
convert : correct gemma 3n conversion (#14450)
* convert : correct gemma 3n conversion
* rm redundant code
Georgi Gerganov [Thu, 3 Jul 2025 07:53:35 +0000 (10:53 +0300)]
kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
Georgi Gerganov [Thu, 3 Jul 2025 07:46:57 +0000 (10:46 +0300)]
ggml : fix FA mask dim 2 and 3 (#14505)
* ggml : fix FA mask dim 2 and 3
ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
* vulkan : disable FA for mask->ne[2] != 1
Georgi Gerganov [Thu, 3 Jul 2025 04:48:32 +0000 (07:48 +0300)]
ggml : remove kompute backend (#14501)
ggml-ci
Aman Gupta [Wed, 2 Jul 2025 23:45:11 +0000 (07:45 +0800)]
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
Sigbjørn Skjæret [Wed, 2 Jul 2025 19:02:35 +0000 (21:02 +0200)]
gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files
* remove gemma3n hack
compilade [Wed, 2 Jul 2025 17:10:24 +0000 (13:10 -0400)]
llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.
* cuda : graceful fallback for Mamba-1 models with weird embd size
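For orientation, a hedged scalar reference of the selective-scan recurrence behind GGML_OP_SSM_SCAN, for a single channel with batching and heads omitted; per the commit notes above, the D*x skip term is applied outside the op.
```c
#include <math.h>

// h[s] <- exp(dt[t] * A[s]) * h[s] + dt[t] * B[t][s] * x[t]
// y[t]  = sum_s C[t][s] * h[s]
void ssm_scan_ref(float * h, const float * A, const float * B,
                  const float * C, const float * x, const float * dt,
                  float * y, int n_tok, int d_state) {
    for (int t = 0; t < n_tok; t++) {
        float acc = 0.0f;
        for (int s = 0; s < d_state; s++) {
            h[s] = expf(dt[t] * A[s]) * h[s] + dt[t] * B[t*d_state + s] * x[t];
            acc += C[t*d_state + s] * h[s];
        }
        y[t] = acc;
    }
}
```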
Georgi Gerganov [Wed, 2 Jul 2025 16:35:47 +0000 (19:35 +0300)]
sync : ggml
ggml-ci
Daniel Bevenius [Wed, 2 Jul 2025 11:55:32 +0000 (13:55 +0200)]
ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version
This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.
Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```
* ggml : add ggml_commit()
---------
Co-authored-by: Georgi Gerganov <redacted>
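The companion ggml_commit() added in the same commit can presumably be used the same way, assuming it also returns a string:
```c
printf("GGML commit: %s\n", ggml_commit());
```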
Rotem Dan [Wed, 2 Jul 2025 16:37:16 +0000 (19:37 +0300)]
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
Aman Gupta [Wed, 2 Jul 2025 12:34:24 +0000 (20:34 +0800)]
CUDA: add softmax broadcast (#14475)
* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for non-contiguous input/output
Johannes Gäßler [Wed, 2 Jul 2025 11:42:12 +0000 (13:42 +0200)]
CUDA: broadcasting for FlashAttention mask (#14500)
Jeff Bolz [Tue, 1 Jul 2025 08:32:56 +0000 (03:32 -0500)]
vulkan: support softmax/FA batch and broadcast (#14449)
Georgi Gerganov [Fri, 27 Jun 2025 18:50:57 +0000 (21:50 +0300)]
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
zhouwg [Wed, 2 Jul 2025 12:38:10 +0000 (20:38 +0800)]
opencl : fix possible buffer overflow in dump_tensor (#14490)
Georgi Gerganov [Wed, 2 Jul 2025 11:12:07 +0000 (14:12 +0300)]
simple-chat : fix context-exceeded condition (#14494)
* simple-chat : fix context-exceeded condition
ggml-ci
* cont : fix n_ctx_used computation
ggml-ci
Eric Zhang [Wed, 2 Jul 2025 11:00:04 +0000 (19:00 +0800)]
opencl : skip empty nodes on cgraph compute (#14491)
lhez [Wed, 2 Jul 2025 07:07:42 +0000 (00:07 -0700)]
opencl : update upscale to support align corners (#14488)
Sigbjørn Skjæret [Wed, 2 Jul 2025 07:02:51 +0000 (09:02 +0200)]
ci : add OpenCL to labeler workflow (#14496)
Eric Zhang [Wed, 2 Jul 2025 05:41:35 +0000 (13:41 +0800)]
github : add OpenCL backend to issue templates (#14492)
Björn Ganster [Wed, 2 Jul 2025 05:19:31 +0000 (07:19 +0200)]
ggml : Callback before abort (#14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
* Return previous callback to allow callback chaining
* style fixes
---------
Co-authored-by: Diego Devesa <redacted>
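A self-contained sketch of the chaining pattern with illustrative names, not the actual ggml API:
```c
#include <stddef.h>

typedef void (*abort_callback_t)(const char * msg, void * user_data);

static abort_callback_t g_abort_cb = NULL;
static void *           g_abort_ud = NULL;

// Install a new handler and return the previous one so callers can chain.
abort_callback_t set_abort_callback(abort_callback_t cb, void * user_data) {
    abort_callback_t prev = g_abort_cb;
    g_abort_cb = cb;
    g_abort_ud = user_data;
    return prev;
}

// Invoked just before aborting.
void on_abort(const char * msg) {
    if (g_abort_cb) g_abort_cb(msg, g_abort_ud);
}
```
An app saves the returned pointer and calls it from its own handler after showing its message and saving data.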
Georgi Gerganov [Tue, 1 Jul 2025 15:04:08 +0000 (18:04 +0300)]
ci : disable fast-math for Metal GHA CI (#14478)
* ci : disable fast-math for Metal GHA CI
ggml-ci
* cont : remove -g flag
ggml-ci
Grzegorz Grasza [Tue, 1 Jul 2025 13:44:11 +0000 (15:44 +0200)]
Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
Chenguang Li [Tue, 1 Jul 2025 08:47:30 +0000 (16:47 +0800)]
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
* [CANN]update to aclnnGroupedMatmulV2
Signed-off-by: noemotiovon <redacted>
* Support MUL_MAT_ID on 310p
Signed-off-by: noemotiovon <redacted>
* fix editorconfig
Signed-off-by: noemotiovon <redacted>
---------
Signed-off-by: noemotiovon <redacted>
Jeff Bolz [Tue, 1 Jul 2025 08:43:08 +0000 (03:43 -0500)]
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
Sigbjørn Skjæret [Tue, 1 Jul 2025 08:14:21 +0000 (10:14 +0200)]
add GELU_ERF (#14455)
Georgi Gerganov [Tue, 1 Jul 2025 08:05:48 +0000 (11:05 +0300)]
ggml : remove trailing whitespace (#0)
Georgi Gerganov [Tue, 1 Jul 2025 07:27:52 +0000 (10:27 +0300)]
sync : ggml
ggml-ci
Acly [Tue, 1 Jul 2025 07:11:00 +0000 (09:11 +0200)]
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
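The standard source-coordinate formulas behind the two modes, as a sketch rather than the ggml implementation:
```c
// align_corners: corner samples of src and dst coincide.
// default:       pixel centers are mapped.
float src_coord(int dst_i, int src_n, int dst_n, int align_corners) {
    if (align_corners) {
        return dst_n > 1 ? (float) dst_i * (src_n - 1) / (dst_n - 1) : 0.0f;
    }
    return ((float) dst_i + 0.5f) * src_n / dst_n - 0.5f;
}
```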
Daniel Bevenius [Tue, 24 Jun 2025 04:10:16 +0000 (06:10 +0200)]
ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable `best_mad` to `best_error` in the
`make_qkx2_quants` function.
The motivation for this is that the name `best_mad` can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
lhez [Tue, 1 Jul 2025 07:19:16 +0000 (00:19 -0700)]
opencl : add GEGLU, REGLU, SWIGLU (#14456)
Aman Gupta [Mon, 30 Jun 2025 15:57:04 +0000 (23:57 +0800)]
Add Conv2d for CPU (#14388)
* Conv2D: Add CPU version
* Half decent
* Tiled approach for F32
* remove file
* Fix tests
* Support F16 operations
* add assert about size
* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
Georgi Gerganov [Mon, 30 Jun 2025 15:03:03 +0000 (18:03 +0300)]
memory : correctly handle failure in apply() (#14438)
ggml-ci
Georgi Gerganov [Mon, 30 Jun 2025 14:04:05 +0000 (17:04 +0300)]
metal : disable fast-math for some cpy kernels (#14460)
* metal : disable fast-math for some cpy kernels
ggml-ci
* cont : disable for q4_1
ggml-ci
* cont : disable for iq4_nl
ggml-ci
Romain Biessy [Mon, 30 Jun 2025 12:52:02 +0000 (14:52 +0200)]
ggml-cpu: sycl: Re-enable exp f16 (#14462)
Diego Devesa [Mon, 30 Jun 2025 10:43:15 +0000 (03:43 -0700)]
test-backend-ops : disable llama test (#14461)
xiaobing318 [Mon, 30 Jun 2025 09:48:24 +0000 (17:48 +0800)]
cmake : Remove redundant include path in CMakeLists.txt (#14452)
* Update docker.yml
Modify docker.yml so that the workflow no longer runs on a schedule; if you want to run the workflow, it can be started manually.
* Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
* Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
Vedran Miletić [Mon, 30 Jun 2025 08:17:18 +0000 (10:17 +0200)]
scripts : make the shell scripts cross-platform (#14341)
matteo [Sun, 29 Jun 2025 18:02:53 +0000 (20:02 +0200)]
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <redacted>
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov <redacted>
* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Olivier Chafik <redacted>
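Hedged usage examples, assuming the flag and field names described in the PR:
```console
# server side: set default template kwargs on the command line
llama-server -m model.gguf --chat-template-kwargs '{"enable_thinking":false}'
# client side: pass "chat_template_kwargs": {"enable_thinking": false}
# per request in the /v1/chat/completions JSON body
```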
Renat [Sun, 29 Jun 2025 17:29:57 +0000 (19:29 +0200)]
server : fix appearance of the chats list context menu for Safari (#14322)
Akarshan Biswas [Sun, 29 Jun 2025 15:37:58 +0000 (21:07 +0530)]
SYCL: disable faulty fp16 exp kernel (#14395)
* SYCL: disable faulty fp16 CPU exponent for now
* Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
* SYCL: disable faulty fp16 CPU exponent for now
* Fix logic of disabling exponent kernel
Sigbjørn Skjæret [Sun, 29 Jun 2025 12:38:10 +0000 (14:38 +0200)]
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
Sigbjørn Skjæret [Sun, 29 Jun 2025 09:04:10 +0000 (11:04 +0200)]
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <redacted>
Co-authored-by: Akarshan <redacted>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: 0cc4m <redacted>
Co-authored-by: Akarshan <redacted>
Co-authored-by: Jeff Bolz <redacted>
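A plain-C reference of the gated semantics these ops share, for one row already split into halves a and b; a sketch, not the ggml kernels:
```c
#include <math.h>

static float silu_f(float x) { return x / (1.0f + expf(-x)); }

// Gated ops compute activation(a) * b elementwise over the two halves.
void swiglu_row(const float * a, const float * b, float * dst, int n) {
    for (int i = 0; i < n; i++) dst[i] = silu_f(a[i]) * b[i];
}
// REGLU uses fmaxf(a[i], 0.0f) and GEGLU uses GELU(a[i]) in place of silu_f.
```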
Jeff Bolz [Sun, 29 Jun 2025 07:43:36 +0000 (02:43 -0500)]
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
* vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.
* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups
style fixes
* last node doesn't need single use.
fix type.
handle mul operands being swapped.
* remove redundant parameter
---------
Co-authored-by: slaren <redacted>
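The fusion precondition described above, sketched with hypothetical types for illustration:
```c
#include <stdbool.h>

typedef enum { OP_RMS_NORM, OP_MUL /* ... */ } op_t;
typedef struct node { op_t op; const struct node * src[2]; int use_count; } node_t;

// An RMS_NORM node can fuse with a following MUL only if the MUL is the
// norm's sole consumer and one of the MUL's operands is the norm's output.
static bool can_fuse_rms_norm_mul(const node_t * norm, const node_t * mul) {
    return norm->op == OP_RMS_NORM
        && mul->op  == OP_MUL
        && (mul->src[0] == norm || mul->src[1] == norm) // operands may be swapped
        && norm->use_count == 1;                        // no other consumers
}
```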
Aman Gupta [Sat, 28 Jun 2025 17:30:53 +0000 (01:30 +0800)]
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched
* Review: add type traits and make function more generic
* Review: make check more explicit, add back comments, and fix formatting
* Review: fix formatting, remove useless type conversion, fix naming for bools
Jeff Bolz [Sat, 28 Jun 2025 15:36:40 +0000 (10:36 -0500)]
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
Jeff Bolz [Sat, 28 Jun 2025 15:17:09 +0000 (10:17 -0500)]
vulkan: lock accesses of pinned_memory vector (#14333)
Weizhao Ouyang [Sat, 28 Jun 2025 14:08:21 +0000 (22:08 +0800)]
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang <redacted>
Xinpeng Dou [Sat, 28 Jun 2025 09:35:41 +0000 (17:35 +0800)]
fix async_mode bug (#14432)
Sigbjørn Skjæret [Sat, 28 Jun 2025 07:57:07 +0000 (09:57 +0200)]
ci : fix windows build and release (#14431)
Jeff Bolz [Sat, 28 Jun 2025 03:35:30 +0000 (22:35 -0500)]
vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
Georgi Gerganov [Fri, 27 Jun 2025 18:42:02 +0000 (21:42 +0300)]
graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
Georgi Gerganov [Fri, 27 Jun 2025 14:55:45 +0000 (17:55 +0300)]
recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
Radoslav Gerganov [Fri, 27 Jun 2025 13:41:40 +0000 (16:41 +0300)]
ggml : add ggml_set_rows (#14274)
* ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.
ref: #8366
* use I64 for indices
* ggml : add repeat impl for i64
* ggml : add ggml_is_contiguous_rows
* ggml : ggml_set_rows support broadcast
* ggml : ggml_set_rows support quantized dst
ggml-ci
* ggml : support GGML_TYPE_F32 ".from_float" trait
* ggml : ggml_set_rows update comment + better index name
* tests : add ggml_set_rows
* metal : add ggml_set_rows implementation
ggml-ci
* ggml : simplify forward_dup_f32
* ggml : fix supports_op
* tests : add comment to set_rows
* ggml : leave the repeat_i64 for a separate PR
ggml-ci
* ggml : set_rows use std::min instead of MIN
* ggml : better error message for set_rows unsupported type
* metal : perform op->type check only once
* tests : more consistent implementation + more tests
ggml-ci
---------
Co-authored-by: Georgi Gerganov <redacted>
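A plain-C sketch of the semantics described above, over float rows; the real op also supports broadcast and quantized destinations:
```c
#include <stdint.h>
#include <string.h>

// set_rows(a, b, idx): copy row i of b into row idx[i] of a.
void set_rows_ref(float * a, const float * b, const int64_t * idx,
                  int64_t n_rows_b, int64_t row_size) {
    for (int64_t i = 0; i < n_rows_b; i++) {
        memcpy(a + idx[i] * row_size, b + i * row_size,
               row_size * sizeof(float));
    }
}
```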
Sigbjørn Skjæret [Fri, 27 Jun 2025 08:42:19 +0000 (10:42 +0200)]
convert : fix broken sentencepiece vocab (#14416)
Xuan-Son Nguyen [Thu, 26 Jun 2025 17:34:02 +0000 (19:34 +0200)]
model : gemma3n text-only (#14400)
* gemma3n
* add llm_graph_input_one
bandoti [Thu, 26 Jun 2025 16:46:53 +0000 (13:46 -0300)]
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
* Add shaders-gen sources as target deps
Sigbjørn Skjæret [Thu, 26 Jun 2025 13:01:14 +0000 (15:01 +0200)]
llama : return mistral-v7-tekken as default template only (#14390)
Georgi Gerganov [Thu, 26 Jun 2025 12:51:19 +0000 (15:51 +0300)]
metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
Georgi Gerganov [Thu, 26 Jun 2025 12:50:15 +0000 (15:50 +0300)]
metal : batch rows copy in a single threadgroup (#14384)
* metal : batch rows copy in a single threadgroup
ggml-ci
* metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
Aaron Teo [Thu, 26 Jun 2025 10:41:41 +0000 (18:41 +0800)]
docs: update s390x documentation + add faq (#14389)
* docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo <redacted>
* docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
R0CKSTAR [Thu, 26 Jun 2025 04:11:59 +0000 (12:11 +0800)]
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
* musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
* musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Aaron Teo [Wed, 25 Jun 2025 21:49:04 +0000 (05:49 +0800)]
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
* ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
* ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
* ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
* ggml-cpu: better variable names
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
* docs: update s390x docs
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
* ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove sigint from fp16 store
For some reason, the function is not getting a hit when debugged with
gdb. We will need to investigate further.
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: rework noop
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: disable fp32->fp16 nnpa conversions for now
There are some conversion failures in NNPA that require the eyes of an
IBM STSM. Will create a separate PR to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix typo
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add missing __func__
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: diagnose why __NNPA__ macro is not being defined
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: update macro tests
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: test more macros
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add debug prints
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 157f856c34589566151630e294563a420702db39)
* ggml-cpu: move things around
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo <redacted>
* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo <redacted>
* docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo <redacted>
* ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo <redacted>
* ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: remove mistaken fallback macro
Fallback logic was already implemented but I was too sleepy to realise.
Signed-off-by: Aaron Teo <redacted>
* ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo <redacted>
* ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo <redacted>
(cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
* ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo <redacted>
* ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo <redacted>
* Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo <redacted>
* fix ggml time initialization
* fix f32_f16 table init
* remove extra line
---------
Signed-off-by: Aaron Teo <redacted>
Co-authored-by: slaren <redacted>
Sigbjørn Skjæret [Wed, 25 Jun 2025 21:26:51 +0000 (23:26 +0200)]
ggml : do not output unprintable characters on GGUF load failure (#14381)
Anton Mitkov [Wed, 25 Jun 2025 16:09:55 +0000 (17:09 +0100)]
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
lhez [Tue, 24 Jun 2025 18:46:25 +0000 (11:46 -0700)]
opencl: ref count `ggml_backend_opencl_context` and refactor profiling (#14254)
* Move profiling info into `ggml_backend_opencl_context`
* Add `enqueue_ndrange_kernel` to launch kernel
Georgi Gerganov [Tue, 24 Jun 2025 15:26:30 +0000 (18:26 +0300)]
batch : fix check for empty sequences in memory (#14364)
* batch : fix check for empty sequences in memory
ggml-ci
* cont : reuse the var
ggml-ci
Mathieu Baudier [Tue, 24 Jun 2025 13:05:31 +0000 (15:05 +0200)]
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362)