Matthieu Coudron [Wed, 21 Jan 2026 06:52:46 +0000 (07:52 +0100)]
gguf: display strerrno when cant load a model (#18884)
I've had issues loading models with llama-server:
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf'
and I was sure it could access the file. Seems like --models-dir and
--models-presets dont interact like I thought they would but I salvaged
this snippet that helps troubleshooting
[44039] E gguf_init_from_file: failed to open GGUF file 'mistral-7b-v0.1.Q8_0.gguf' (errno No such file or directory)
Oliver Simons [Wed, 21 Jan 2026 01:34:29 +0000 (02:34 +0100)]
CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator
Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)
Daniel Bevenius [Tue, 20 Jan 2026 05:55:24 +0000 (06:55 +0100)]
convert : use n_groups instead of hardcoded values in reshape (#18929)
* convert : use n_groups instead of hardcoded values in reshape
This commit modifies the conversion script for NemotronHModel to use
the 'n_groups' hyperparameter, and allow Python to calculate the the
last dimension, using -1, when reshaping the 'mixer.norm.weight' tensor.
* use self.n_group instead of self.hparams["n_groups"]
Daniel Bevenius [Mon, 19 Jan 2026 12:12:38 +0000 (13:12 +0100)]
model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)
This commit adds a BUILD_DIR variable to the scripts used for running
converted models.
The motivation for this is that currently the `build` directory is
hardcoded and it can be useful to specify a different build directory,
with builds for different configurations.
hipudding [Fri, 16 Jan 2026 08:18:49 +0000 (16:18 +0800)]
CANN: support gated linear attn (#18653)
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`
this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but its more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add to adaptive-p to `common_sampler_types_from_chars`
Max Krasnyansky [Thu, 15 Jan 2026 05:46:12 +0000 (21:46 -0800)]
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)
* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars
* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32
Factore out all hvx_copy functions into hvx-copy.h header and reduced code duplication.
Update HTP ops infra to support OP_CPY
* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs
hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc
hvx-utils library got a nice round of cleanup and refactoring to reduce duplication
use hvx_vec_store_a where possible
* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h
Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.
* hexagon: factor out hvx-sqrt.h
* hexagon: mintor update to hvx-utils.h
* hexagon: remove spurios log
* hexagon: factor out and optimize hvx_add/sub/mul
* hexagon: remove _opt variants of add/sub/mul as they simply fully aligned versions
* hexagon: refactor reduction functions to hvx-reduce.h
Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.
* hexagon: refactor the rest of arithmetic functions to hvx-arith.h
Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.
Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h
Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.
* hexagon: refactor hvx_sum_of_squares_f32
- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.
* hexagon: use hvx_splat instead of memset
* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML
* hexagon: fix hvx_copy_f16_f32 on v75 and older
* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL
* scripts: update snapdragon/adb scripts to enable host param
Oliver Simons [Thu, 15 Jan 2026 02:44:54 +0000 (03:44 +0100)]
CUDA: Factor out and re-use `block_reduce` function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta <redacted>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
Also integrate it into norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
Also adresss other requests by @am17an such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
Unit-tests passed locally, let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
This is more type-safe than plain enum
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:
Daniel Benjaminsson [Wed, 14 Jan 2026 07:11:05 +0000 (08:11 +0100)]
mmap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)
Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.
Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
Adrien Gallouët [Wed, 14 Jan 2026 06:46:27 +0000 (07:46 +0100)]
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake
Gabe Goodhart [Tue, 13 Jan 2026 08:43:51 +0000 (01:43 -0700)]
graph : clean up t5 input builders (#18795)
* fix: Remove unnecessary `h` loops where `h` was only ever 0
Branch: CleanUpT5InputBuilders
Signed-off-by: Gabe Goodhart <redacted>
* fix: Remove unnecessary padding loop that is never hit anymore
The upper bound used to use GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but was
removed in https://github.com/ggml-org/llama.cpp/pull/17910 leaving the
loop dead.
Daniel Bevenius [Mon, 12 Jan 2026 12:47:58 +0000 (13:47 +0100)]
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. This flag
is currently specified in the README.md as required, but is currently
not available as a command line option for the batched example.
The motivation for this is that specifying this flag as the README
instructs, will lead to an error about the flag not being recognized,
and without this option the example fail with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
Jeff Bolz [Mon, 12 Jan 2026 11:32:13 +0000 (05:32 -0600)]
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
Daniel Bevenius [Sun, 11 Jan 2026 15:34:41 +0000 (16:34 +0100)]
debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check
This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.
* debug : normalize both pooled and token embeddings
This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.
The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
Xuan-Son Nguyen [Fri, 9 Jan 2026 23:00:41 +0000 (00:00 +0100)]
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing
* fix infinite loop on empty batch
* cont : init child samplers + modify child logic
* cont : cleanup
* cont : improve n_cmpl logic
- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
Simranjeet Singh [Fri, 9 Jan 2026 22:42:38 +0000 (22:42 +0000)]
mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
* Add mobilenetv5 impl
* Fix comments, remove unused vars
* Fix permute and remove transpose of projection weights
* Fix comments, remove debugging prints from hf_to_gguf
* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2.Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
* Remove obsolete comments
* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology
* Fix stem conv bias name
* Remove explicit handling of bias term for stem conv
* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable