shalinib-ibm [Thu, 19 Feb 2026 06:28:53 +0000 (11:58 +0530)]
llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (#19709)
Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.
Performance Impact:
1.5 ~ 2x improvement in PP_Speed for Q4 and Q8 Models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)
llama-bench Q8 Model Results:
model size params backend threads test Base t/s Patch t/s
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp8 64.48 ± 4.72 73.99 ± 0.27
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp16 80.11 ± 0.32 112.53 ± 0.40
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp32 89.10 ± 0.27 152.95 ± 0.68
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp64 93.65 ± 0.25 187.83 ± 0.83
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp128 99.93 ± 0.02 201.32 ± 0.11
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp256 102.32 ± 0.40 208.32 ± 0.41
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 pp512 103.42 ± 0.40 209.98 ± 0.14
granitehybrid 3B Q8_0 3.16 GiB 3.19 B CPU 10 tg128 20.35 ± 0.01 19.57 ± 0.01
llama-bench Q4 Model Results:
model size params backend threads test Base t/s Patch t/s
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp8 34.77 ± 0.10 41.23 ± 0.08
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp16 40.81 ± 0.04 64.55 ± 0.15
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp32 44.65 ± 0.05 90.84 ± 0.22
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp64 47.49 ± 0.03 114.39 ± 0.11
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp128 49.29 ± 0.24 120.13 ± 0.19
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp256 49.77 ± 0.23 121.51 ± 0.11
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 pp512 49.89 ± 0.23 117.52 ± 0.10
llama 8B Q4_0 4.33 GiB 8.03 B CPU 10 tg128 13.40 ± 0.01 13.37 ± 0.00
Llama perplexity Results:
Model Base Final PPL Estimate Patch Final PPL Estimate
granite-4.0-h-micro-Q8_0 1.3862 +/- 0.04424 1.3868 +/- 0.04432
Meta-Llama3-8b Q4 1.3801 +/- 0.04116 1.3803 +/- 0.04116
Jeff Bolz [Wed, 18 Feb 2026 09:47:10 +0000 (01:47 -0800)]
vulkan: split mul_mat into multiple dispatches to avoid overflow (#19509)
* vulkan: split mul_mat into multiple dispatches to avoid overflow
The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.
Fall back for the less common p021 and nc variants.
When LTO enabled in build environments it forces all builds to have LTO
in place. But feature detection logic is fragile, and causing Illegal
instruction errors with lto. This disables LTO for the feature
detection code to prevent cross-module optimization from inlining
architecture-specific instructions into the score function. Without this,
LTO can cause SIGILL when loading backends on older CPUs (e.g., loading
power10 backend on power9 crashes before feature check runs).
Daniel Bevenius [Tue, 17 Feb 2026 09:46:53 +0000 (10:46 +0100)]
model-conversion : make printing of config values optional (#19681)
* model-conversion : make printing of config values optional
This commit updates run-org-model.py to make the printing of model
configuration values optional.
The motivation for this change is that not all models have these
configuration values defined and those that do not will error when
running this script. With these changes we only print the values if they
exist or a default value.
We could optionally just remove them but it can be useful to see these
values when running the original model.
Mario Limonciello [Mon, 16 Feb 2026 13:46:08 +0000 (07:46 -0600)]
Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (#19591)
Avoids issues with ROCm 6.4.4.
Closes: https://github.com/ggml-org/llama.cpp/issues/19580 Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)") Signed-off-by: Mario Limonciello (AMD) <redacted>
- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask
Adrien Gallouët [Sun, 15 Feb 2026 14:38:50 +0000 (15:38 +0100)]
build : remove LLAMA_HTTPLIB option (#19623)
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.
Daniel Bevenius [Sun, 15 Feb 2026 12:59:38 +0000 (13:59 +0100)]
cmake : check if KleidiAI API has been fetched (#19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit 3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).
I missed this during my review and I should not have commited without
verifying the CI failure, sorry about that.
SamareshSingh [Sun, 15 Feb 2026 05:22:53 +0000 (23:22 -0600)]
cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL
Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.
The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.
George [Sat, 14 Feb 2026 08:05:12 +0000 (10:05 +0200)]
mmap: Fix Windows handle lifetime (#19598)
* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.
Oliver Simons [Fri, 13 Feb 2026 09:37:55 +0000 (10:37 +0100)]
CUDA: Do not mutate cgraph for fused ADDs (#19566)
* Do not mutate cgraph for fused ADDs
1. We should try to minimize in-place changes to the incoming
ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
step as we store the properties before modifying the graph in-place
in the cuda-backend
Mario Limonciello [Thu, 12 Feb 2026 08:38:35 +0000 (02:38 -0600)]
Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).
The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
2549 | using ARegsT = typename Impl::ARegsT;
```
Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.
Daniel Bevenius [Wed, 11 Feb 2026 16:41:35 +0000 (17:41 +0100)]
common : remove unused token util functions (#19506)
This commit removes two unused functions `common_lcp` and `common_lcs`.
The last usage of these functions was removed in
Commit 33eff4024084d1f0c8441b79f7208a52fad79858 ("server : vision support
via libmtmd") and are no longer used anywhere in the codebase.
AesSedai [Wed, 11 Feb 2026 15:47:30 +0000 (07:47 -0800)]
model: Add Kimi-K2.5 support (#19170)
* Move dequant_model to after the text_config merge
Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_M_IMP_NORM for new mm_projector.pre_norm key
* Fix a couple of oversights
* Add image support for Kimi-K2.5
* Revert changes to KimiVLForConditionalGeneration
* Fix an assert crash
* Fix permute swapping w / h on accident
* Kimi-K2.5: Use merged QKV for vision
* Kimi-K2.5: pre-convert vision QK to use build_rope_2d
* Kimi-K2.5: support non-interleaved rope for vision