Daniel Bevenius [Mon, 21 Jul 2025 13:53:12 +0000 (15:53 +0200)]
ggml-cpu : remove stdlib include from repack.cpp (#1276)
This commit removes the inclusion of `<cstdlib>`.
The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
Oliver Simons [Fri, 18 Jul 2025 11:35:32 +0000 (13:35 +0200)]
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs
Gemma3n uses Matrix-Matrix addition as part of their input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size
of 1 is used.
* Exclude `project_per_layer_input` by matching node names
This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.
Remove un-necessary templates from class definition and packing functions
Reduce deeply nested conditionals, if-else switching in mnapck function
Replace repetitive code with inline functions in Packing functions
2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model
Jeff Bolz [Sat, 12 Jul 2025 10:12:26 +0000 (05:12 -0500)]
vulkan: support SET_ROWS (llama/14587)
* vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.
* vulkan: optimize set_rows
Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
model : support LiquidAI LFM2 hybrid family (llama/14620)
**Important**
LFM2 was [merged ](https://github.com/huggingface/transformers/pull/39340)into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (llama/14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <redacted> Co-authored-by: Akarshan <redacted>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (llama/14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <redacted> Co-authored-by: 0cc4m <redacted> Co-authored-by: Akarshan <redacted> Co-authored-by: Jeff Bolz <redacted>
Jeff Bolz [Sun, 29 Jun 2025 07:43:36 +0000 (02:43 -0500)]
vulkan: Add fusion support for RMS_NORM+MUL (llama/14366)
* vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.
* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups
style fixes
* last node doesn't need single use.
fix type.
handle mul operands being swapped.