git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Jeff Bolz [Fri, 26 Dec 2025 17:15:50 +0000 (11:15 -0600)]
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
Jeff Bolz [Fri, 26 Dec 2025 17:15:02 +0000 (11:15 -0600)]
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)
Eve [Fri, 26 Dec 2025 17:12:11 +0000 (17:12 +0000)]
vulkan: small dequantization improvements (#18380)
* iq4_xs
* quants
Jeff Bolz [Fri, 26 Dec 2025 16:00:57 +0000 (10:00 -0600)]
vulkan: Support UPSCALE w/antialias (#18327)
Jeff Bolz [Fri, 26 Dec 2025 15:53:46 +0000 (09:53 -0600)]
vulkan: handle rope with large number of rows (#18306)
o7si [Fri, 26 Dec 2025 15:35:29 +0000 (23:35 +0800)]
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models
* server : add allow_processing param to clear_slot
Francisco Herrera [Fri, 26 Dec 2025 02:34:30 +0000 (21:34 -0500)]
docs: added note for pre SYCL Intel hardware (#18016)
Specify that it's for pre sycl hardware
0Marble [Fri, 26 Dec 2025 01:12:04 +0000 (09:12 +0800)]
CANN: implement the SSM_CONV operator (#17737)
* CANN: implement SSM_CONV operator
Co-authored-by: Aleksei Lobanov, <redacted>
Co-authored-by: Sujin Kang, <redacted>
* CANN: remove custom error limit for SSM_CONV
* CANN: merge SSM_CONV tensor shape/strides into one line
---------
Co-authored-by: Sujin Kang, <redacted>
Aman Gupta [Thu, 25 Dec 2025 17:35:14 +0000 (01:35 +0800)]
ggml-cuda: fix regex for arch list (#18371)
* ggml-cuda: fix regex for arch list
* make regex exact
Aman Gupta [Thu, 25 Dec 2025 15:55:38 +0000 (23:55 +0800)]
cuda: optimize cumsum cub path (#18362)
* cuda: optimize cumsum cub path
* remove heavy perf test
Aman Gupta [Thu, 25 Dec 2025 14:12:11 +0000 (22:12 +0800)]
ggml-cuda: fix blackwell native builds (#18361)
* ggml-cuda: fix blackwell native builds
Replace 12x in native architectures by 12xa
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
Penglin Cai [Thu, 25 Dec 2025 08:46:09 +0000 (16:46 +0800)]
CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934)
* CONV_TRANSPOSE_1D kernel_size>255
* remove condition check
* fix the bug of type conversion
* removing trailing whitespaces
* fix: return true in the switch case
Aadeshveer Singh [Thu, 25 Dec 2025 04:11:13 +0000 (09:41 +0530)]
ggml : optimize cuda cumsum fallback kernel (#18343)
Xuan-Son Nguyen [Wed, 24 Dec 2025 22:47:49 +0000 (23:47 +0100)]
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option
* also allow stop while loading
* add docs
* unload_lru: also wait for unload to complete
Xuan-Son Nguyen [Wed, 24 Dec 2025 22:07:08 +0000 (23:07 +0100)]
model: support MiMo-V2-Flash (#18328)
* mimov2: convert ok
* rename mimov2 --> mimo2
* fix conversion
* runnable not incorrect
* use sink
* add_sliding_window_pattern
* add swa and per-layer n_head_kv
* correct params
* somewhat working
* correct gating func
* nits
* mimo2: wire RMS eps + MoE bias + converter guards
* add co-author
Co-authored-by: Aaryan-Kapoor <redacted>
* use add_rope_freq_base_swa
---------
Co-authored-by: Aaryan Kapoor <redacted>
Co-authored-by: Aaryan-Kapoor <redacted>
Aadeshveer Singh [Wed, 24 Dec 2025 14:57:38 +0000 (20:27 +0530)]
fit-params : fix race condition in fit-params output (#18276)
Aman Gupta [Wed, 24 Dec 2025 14:28:26 +0000 (22:28 +0800)]
CUDA: experimental native mxfp4 support for blackwell (#17906)
* CUDA: experimental native mxfp4 support for blackwell
* optimize load_tiles
* optimize quantize_mxfp4
* cleanup
* first pass review: formatting
* use interleaved layout for mma
* mmq: add assert for size
* use __nv_fp4x4_e2m1
* use iter_k as 512, cleanup
* Use 1200 as blackwell instead of 1000
* address review comments
* mmq: fix stride
* quantize.cu: use reference impl of e8m0 scale
* address review comments
* add 120f-virtual + minor fixes
---------
Co-authored-by: Aman Gupta <aman>
Saba Fallah [Wed, 24 Dec 2025 13:02:36 +0000 (14:02 +0100)]
model : support for LlamaBidirectionalModel architecture (#18220)
* model: llama-embed-nemotron
* minor: python lint
* changed arch-name
* templated llm_build_llama to be used for both llama and llama-embed arch
Jeff Bolz [Wed, 24 Dec 2025 11:36:34 +0000 (05:36 -0600)]
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302)
Wang Weixuan [Wed, 24 Dec 2025 09:50:24 +0000 (17:50 +0800)]
CANN : refactor ACL graph cache (#17752)
Move the graph property checking code into methods of LRU cache.
Signed-off-by: Wang Weixuan <redacted>
Jesse Ikonen [Wed, 24 Dec 2025 09:19:47 +0000 (11:19 +0200)]
docs: Fix typos in SYCL documentation (#18269)
Ruben Ortlam [Wed, 24 Dec 2025 07:59:14 +0000 (08:59 +0100)]
vulkan: use fewer FA rows for small cache runs (#18280)
TianHao324 [Wed, 24 Dec 2025 06:55:33 +0000 (14:55 +0800)]
CANN: Uses yarn_ramp cache in ROPE (#17725)
ddh0 [Wed, 24 Dec 2025 06:19:12 +0000 (00:19 -0600)]
common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg (#18267)
Xuan-Son Nguyen [Tue, 23 Dec 2025 20:49:05 +0000 (21:49 +0100)]
server: return_progress to also report 0% processing state (#18305)
Pascal [Tue, 23 Dec 2025 14:48:03 +0000 (15:48 +0100)]
webui: apply webui_settings on first load (#18223)
* webui: apply webui_settings on first load
The webui_settings from /props were not applied on initial load
when default_generation_settings.params was null
Now syncs whenever serverProps is available, regardless of params,
works for both single-model and router modes
* chore: update webui build output
Xuan-Son Nguyen [Tue, 23 Dec 2025 13:39:36 +0000 (14:39 +0100)]
server: fix crash with model not having BOS/EOS (#18321)
Daniel Bevenius [Tue, 23 Dec 2025 13:07:25 +0000 (14:07 +0100)]
model-conversion : add device option to run-org-model.py (#18318)
* model-conversion : add device option to run-org-model.py
This commit refactors the `run-org-model.py` script to include a
`--device` argument, to allow users to specify the device on which to
run the model (e.g., cpu, cuda, mps, auto).
It also extracts a few common functions to prepare for future changes
that will remove code duplication currently present in the embedding
scripts.
The Makefile has also been updated to pass the device argument, for
example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```
* fix error handling and remove parser reference
This commit fixes the error handling which previously referenced an
undefined 'parser' variable.
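A minimal sketch of how a `--device` option like the one described above could be wired up with `argparse`. The helper names `build_parser` and `parse_device`, and the `auto` default, are assumptions for illustration and are not taken from `run-org-model.py`.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --device option mirrors the commit text.
    parser = argparse.ArgumentParser(description="illustrative sketch")
    parser.add_argument(
        "--device",
        choices=["cpu", "cuda", "mps", "auto"],  # values listed in the commit message
        default="auto",                          # assumed default, not confirmed by the source
        help="device on which to run the model",
    )
    return parser


def parse_device(argv=None):
    # Returning the value from a function that owns the parser avoids
    # referring to a 'parser' name that is not defined in the caller's scope.
    return build_parser().parse_args(argv).device
```

With this shape, `parse_device(["--device", "cpu"])` yields `"cpu"`, and an unlisted value makes `argparse` exit with a usage error.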
Chris Rohlf [Tue, 23 Dec 2025 09:56:49 +0000 (04:56 -0500)]
rpc : add check for rpc buffer type (#18242)
nullname [Tue, 23 Dec 2025 07:13:24 +0000 (15:13 +0800)]
ggml-hexagon: create generalized functions for cpu side op (#17500)
* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility
* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility
* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity
* add comment
* refactor: remove redundant buffer checks in hexagon supported operations
* wip
* add missing include to fix weak symbol warning
* add ggml_hexagon_op_generic
* refactor: simplify tensor operation initialization and buffer management in hexagon implementation
* refactor: streamline hexagon operation initialization and buffer management
* refactor: update function signatures and streamline request handling in hexagon operations
* wip
* ggml-hexagon: clean up code formatting and improve unary operation handling
* wip
* rename
* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations
* hexagon: fix merge conflicts
* hexagon: minor cleanup for buffer support checks
* hexagon: factor out op_desc and the overall op logging
* hexagon: further simplify and cleanup op dispatch logic
* snapdragon: update adb scripts to use llama-cli and llama-completion
* fix pipeline failure
---------
Co-authored-by: Max Krasnyansky <redacted>
Daniel Bevenius [Tue, 23 Dec 2025 06:27:37 +0000 (07:27 +0100)]
model-conversion : add trust_remote_code for embedding scripts (#18288)
This commit adds the trust_remote_code=True parameter when loading
models and configurations in the embedding model conversion scripts.
It also adds a cast to float for models that might use a data type that
is not supported by python, for example bfloat16.
The motivation for this is that some models may require custom code to
be executed during loading, and setting trust_remote_code to True avoids
getting prompted for confirmation.
Future work will consolidate the embedding conversion scripts with the
causal conversion scripts to avoid code duplication. But in the
meantime it would be nice to have this fix in place.
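The cast to float mentioned above is needed because Python has no native bfloat16 type. The sketch below is not the conversion scripts' code (they presumably rely on the framework's own cast). It only illustrates, with the standard library, why widening is lossless: a bfloat16 value is the upper 16 bits of an IEEE-754 float32. The function name `bfloat16_bits_to_float` is hypothetical.

```python
import struct


def bfloat16_bits_to_float(bits):
    """Widen a raw 16-bit bfloat16 pattern to a Python float."""
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("bfloat16 pattern must fit in 16 bits")
    # Shift the pattern into the top half of a float32 and zero the low half,
    # then reinterpret those 32 bits as a float.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]
```

For example, the pattern `0x3F80` widens to `1.0` and `0xC000` widens to `-2.0`.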
Neo Zhang [Tue, 23 Dec 2025 04:59:12 +0000 (12:59 +0800)]
[SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290)
* replace llama-cli by llama-completion to rm the impact to test script
* Update examples/sycl/run-llama2.sh
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update examples/sycl/run-llama3.sh
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update examples/sycl/run-llama3.sh
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update examples/sycl/win-run-llama2.bat
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update examples/sycl/win-run-llama3.bat
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Neo Zhang Jianyu <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Alessandro98-git [Tue, 23 Dec 2025 02:04:57 +0000 (03:04 +0100)]
model : fix div-by-zero for Nemotron V2 (#18309)
* llama-model : fix Nemotron V2 crash by moving MoE parameters calculation
* remove whitespace
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Ryan Mangeno [Mon, 22 Dec 2025 23:28:19 +0000 (18:28 -0500)]
model : Granite Embedding support (#15641)
This is ModernBERT without `head.norm`, so converting and running any other ModernBERT models will currently fail; PRs adding `head.norm` support are welcome!
* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only
* conversion now working, hf -> gguf
* working on support, now working on building graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on building graph
* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more
* added cls token per previous modern bert attempt, still working on checking out the rest
* fixed pre tokenizer and still working through previous pr
* working through previous attempt, implemented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer
* fixed pre tokenizer
* working on swa with local and global alternating attention
* some cleanup and now fails on build attn
* starting to work, and some cleanup, currently failing on last layer construction in graph build
* alternating rope implemented and modern bert graph build succeeds
* fixed assert for equal ubatch seq
* cleanup
* added mask check in vocab
* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values
* reuse variable
* removed repeat
* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL
* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...
* more modular hparam setting
* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* removed redundant hparam set
* enums for model sizes
* conversion for modern-bert model supported rather than just granite-small
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <redacted>
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <redacted>
* fixed ordering of enum for freq_base_swa
* fixed where I added residual, now gives much much better embeddings~
* readded cacheless logic
* removing whitespace
* conversion now working for swa pattern - dense every n layers
* modern bert put into separate src file
* removing whitespace
* fixed whitespace and newline errors in editorconfig job
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* better naming convention, n_swa_pattern -> swa_period
* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support
* fixing pyright type-check fail
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-saver.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* added descriptions in llama-model
* fixed tensor mappings for conversion
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* mapping name for size
* nits
* unused
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Gabe Goodhart <redacted>
compilade [Mon, 22 Dec 2025 19:25:16 +0000 (14:25 -0500)]
gguf-py : do not align the data start offset (#18291)
The safetensors format doesn't require alignment.
Shouyu [Mon, 22 Dec 2025 18:56:52 +0000 (13:56 -0500)]
ggml-hexagon: gelu optimization (#18151)
* feat: working gelu with src0 put on vtcm
* feat: gelu ping-pong for both in and out
* fix: fix compile error
* break: distinguish dma ddr->vtcm and vtcm->ddr operation
* fix: fix dma queue size
* break: update dma api to either pop src or dst ptr
* fix: fix activation vtcm allocation issue for src1 when swapped
* refactor: ping-pong gelu logic to avoid unnecessary if else
* dma: improved queue interface and prefetch handling
* gelu: fix N+2 block prefetch
---------
Co-authored-by: Max Krasnyansky <redacted>
Xuan-Son Nguyen [Mon, 22 Dec 2025 18:30:19 +0000 (19:30 +0100)]
gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file
* also strip whitespace
* do not add extra newline
* update TOC
Taimur Ahmad [Mon, 22 Dec 2025 18:20:23 +0000 (23:20 +0500)]
llamafile: add rvv support for sgemm kernels (#18199)
Co-authored-by: Rehan Qasim <redacted>
lhez [Mon, 22 Dec 2025 18:19:01 +0000 (10:19 -0800)]
opencl: unpack q4_0 for adreno in get_tensor (#18278)
Jeff Bolz [Mon, 22 Dec 2025 17:03:13 +0000 (11:03 -0600)]
vulkan: Extend rope fusions to allow mrope (#18264)
Extend the test-backend-ops tests as well.
Xuan-Son Nguyen [Mon, 22 Dec 2025 13:23:34 +0000 (14:23 +0100)]
server: prevent data race from HTTP threads (#18263)
* server: prevent data race from HTTP threads
* fix params
* fix default_generation_settings
* nits: make handle_completions_impl look less strange
* stricter const
* fix GGML_ASSERT(idx < states.size())
* move index to be managed by server_response_reader
* http: make sure req & res lifecycle are tied together
* fix compile
* fix buggy index handling
* fix data race for lora endpoint
* nits: fix shadow variable
* nits: revert redundant changes
* nits: correct naming for json_webui_settings
Xuan-Son Nguyen [Mon, 22 Dec 2025 12:21:43 +0000 (13:21 +0100)]
server: fix data race in to_json_anthropic (#18283)
Mattt [Mon, 22 Dec 2025 12:11:46 +0000 (04:11 -0800)]
release: update release workflow to store XCFramework as Zip file (#18284)
* Update release workflow to store XCFramework as Zip file
* Add comments to document Zip file requirement for XCFramework
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Co-authored-by: Sigbjørn Skjæret <redacted>
Aaron Teo [Mon, 22 Dec 2025 12:03:49 +0000 (20:03 +0800)]
convert: rework ftype heuristics (#18214)
* convert: rework ftype heuristics
Signed-off-by: Aaron Teo <redacted>
convert: fix type-check
Signed-off-by: Aaron Teo <redacted>
convert: bring back heuristics comment
Signed-off-by: Aaron Teo <redacted>
* convert: revert to using first tensor
Signed-off-by: Aaron Teo <redacted>
* convert: rework heuristics logic
Signed-off-by: Aaron Teo <redacted>
* convert: rm redundant float32 check
Co-authored-by: Sigbjørn Skjæret <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Xuan-Son Nguyen [Mon, 22 Dec 2025 11:22:01 +0000 (12:22 +0100)]
server: (docs) remove mention about extra_args (#18262)
Johannes Gäßler [Mon, 22 Dec 2025 10:00:37 +0000 (11:00 +0100)]
tool/ex/tests: consistently free ctx, then model (#18168)
Jeff Bolz [Sun, 21 Dec 2025 20:52:09 +0000 (14:52 -0600)]
vulkan: Implement set_tensor_async and the event interfaces (#18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
Johannes Gäßler [Sun, 21 Dec 2025 18:33:08 +0000 (19:33 +0100)]
llama: fix RPC for -fit on (#18233)
Xuan-Son Nguyen [Sun, 21 Dec 2025 18:09:21 +0000 (19:09 +0100)]
move copilot instructions to AGENTS.md (#18259)
* move copilot --> agents.md
* agents: add disclose AI usage
* refine
Jeff Bolz [Sun, 21 Dec 2025 09:32:58 +0000 (03:32 -0600)]
vulkan: fix im2col overflowing maxworkgroupcount (#18180)
Jeff Bolz [Sun, 21 Dec 2025 09:27:34 +0000 (03:27 -0600)]
vulkan/cuda: fix topk_moe with exp_probs_b (#18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
Jeff Bolz [Sun, 21 Dec 2025 09:17:58 +0000 (03:17 -0600)]
vulkan: support GGML_UNARY_OP_XIELU (#18062)
Jeff Bolz [Sun, 21 Dec 2025 09:05:08 +0000 (03:05 -0600)]
vulkan: in graph_optimize, try to group ADD operations (#18060)
I saw the adds not staying together in the new nemotron 3 nano model.
lovedheart [Sun, 21 Dec 2025 08:59:52 +0000 (09:59 +0100)]
Vulkan: some improvement on mul_mat_iq2_xs (#18031)
* Some improvement on mul_mat_iq2_xs
Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
* Fix trailing whitespace
Daniel Bevenius [Sun, 21 Dec 2025 08:35:40 +0000 (09:35 +0100)]
docs : fix links in parsing.md (#18245)
This commit corrects the links in the parsing.md which currently result
in 404 errors.
Aldehir Rojas [Sun, 21 Dec 2025 03:43:21 +0000 (21:43 -0600)]
common : reorganize includes to prioritize vendored deps (#18222)
Xuan-Son Nguyen [Sun, 21 Dec 2025 01:24:42 +0000 (02:24 +0100)]
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
Jeff Bolz [Sat, 20 Dec 2025 19:46:46 +0000 (13:46 -0600)]
tests: Avoid floating point precision false positives in SUM (#17471)
* tests: Avoid floating point precision false positives in SUM
* also apply to test_mean
Jeff Bolz [Sat, 20 Dec 2025 19:45:45 +0000 (13:45 -0600)]
test-backend-ops: improve msvc build time (#18209)
Aadeshveer Singh [Sat, 20 Dec 2025 11:28:57 +0000 (16:58 +0530)]
Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (#18212)
Oleksandr Kuvshynov [Sat, 20 Dec 2025 09:57:40 +0000 (04:57 -0500)]
server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before.
To reproduce:
Run llama-server with devstral-2 as main model and devstral-2-small as
md, and verbose logging:
```
% ./build/bin/llama-server -v \
-m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
-md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
-c 8192 2> /tmp/llama.cpp.debug
```
Check the log:
```
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new
n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new
n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new
n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new
n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new
n_tokens = 816
```
After the fix, get correct per round logging:
```
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new
n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new
n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new
n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new
n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new
n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new
n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new
n_tokens = 736
```
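The bug described above (the draft list is cleared before its size is logged) can be illustrated with a small sketch. This is not the server's code: the function names and the list standing in for `slot.drafted` are hypothetical, and the real fix may be structured differently.

```python
def round_log_buggy(drafted, n_accepted):
    drafted = list(drafted)
    drafted.clear()  # cleared first, so the length logged below is always 0
    return f"accepted {n_accepted}/{len(drafted)} draft tokens"


def round_log_fixed(drafted, n_accepted):
    drafted = list(drafted)
    n_draft = len(drafted)  # capture the per-round draft size before clearing
    drafted.clear()
    return f"accepted {n_accepted}/{n_draft} draft tokens"
```

Called with eight drafted tokens and seven accepted, the first variant reports `7/0` and the second reports `7/8`, matching the shape of the two log excerpts.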
Xuan-Son Nguyen [Sat, 20 Dec 2025 08:25:27 +0000 (09:25 +0100)]
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options
* add docs
* load-on-startup
* fix
* Update common/arg.cpp
Co-authored-by: Pascal <redacted>
---------
Co-authored-by: Pascal <redacted>
Sigbjørn Skjæret [Fri, 19 Dec 2025 21:29:46 +0000 (22:29 +0100)]
ci : remove non-windows zip artifacts (#18201)
* remove non-windows zip artifacts
* add cuda dll links
Sigbjørn Skjæret [Fri, 19 Dec 2025 21:29:37 +0000 (22:29 +0100)]
ci : only save ccache on master (#18207)
Alfred [Fri, 19 Dec 2025 17:42:28 +0000 (12:42 -0500)]
ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977)
* feat: implement real Q8_0
* feat: adding cmake option for configuring FP32 quantize group size
* typo: set() shall be used
---------
Co-authored-by: ngdxzy <redacted>
Pascal [Fri, 19 Dec 2025 17:01:56 +0000 (18:01 +0100)]
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
* arg: update doc
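The simplified ordering check noted above compares only the first and last names of an argument set. A rough Python rendering of that idea follows; the actual check lives in the C++ argument parser tests, and the function name `short_form_first` and the sample flag lists are illustrative assumptions.

```python
def short_form_first(names):
    """Return True when the first alias is no longer than the last one."""
    if not names:
        return True
    # Only the two ends are compared; middle entries are not inspected.
    return len(names[0]) <= len(names[-1])
```

Because only the ends are compared, a three-name set with a misplaced middle entry still passes, which is the limitation the commit message calls out.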
Julius Tischbein [Fri, 19 Dec 2025 14:42:46 +0000 (15:42 +0100)]
llama : Changing off_t to size_t for Windows (#18204)
Aman Gupta [Fri, 19 Dec 2025 11:10:00 +0000 (19:10 +0800)]
server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input
This PR adds formatted strings to the server's send_error function
* llama-server: use string_format inline
* fix test
Xuan-Son Nguyen [Fri, 19 Dec 2025 11:08:20 +0000 (12:08 +0100)]
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources
* update docs
* fix neg arg handling
* fix empty mmproj
* also filter out server-controlled args before to_ini()
* skip loading custom_models if not specified
* fix unset_reserved_args
* fix crash on windows
Aleksander Grygier [Fri, 19 Dec 2025 10:14:07 +0000 (11:14 +0100)]
webui: Add editing attachments in user messages (#18147)
* feat: Enable editing attachments in user messages
* feat: Improvements for data handling & UI
* docs: Update Architecture diagrams
* chore: update webui build output
* refactor: Exports
* chore: update webui build output
* feat: Add handling paste for Chat Message Edit Form
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
Daniel Bevenius [Fri, 19 Dec 2025 07:43:16 +0000 (08:43 +0100)]
model-conversion : add verbose flag in run-org-model.py (#18194)
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.
The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.
The script will also be further cleaned/refactored in follow-up commits.
Naco Siren [Fri, 19 Dec 2025 07:32:04 +0000 (23:32 -0800)]
android: fix missing screenshots for Android.md (#18156)
* Android basic sample app layout polish
* Add missing screenshots and polish android README doc
* Replace file blobs with URLs served by GitHub pages service.
Jeff Bolz [Fri, 19 Dec 2025 05:36:46 +0000 (23:36 -0600)]
vulkan: Add perf logger mode with concurrency (#17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
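The way the two perf logger variables combine can be sketched as a small decision function. This is an assumption-laden illustration rather than the backend's logic: it treats a variable as "set" whenever it is present in the environment mapping, and the function name and the mode labels `off`, `per-op`, and `concurrent` are invented here.

```python
def perf_logger_mode(env):
    """Pick a perf logger mode from an environment-like mapping."""
    if "GGML_VK_PERF_LOGGER" not in env:
        # The concurrent switch has no effect unless the logger itself is on.
        return "off"
    if "GGML_VK_PERF_LOGGER_CONCURRENT" in env:
        return "concurrent"
    return "per-op"
```

Passing `os.environ` as `env` would apply the same rule to the current process environment.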
Xuan-Son Nguyen [Thu, 18 Dec 2025 23:18:01 +0000 (00:18 +0100)]
model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B
* Set rope_theta
* Fix comment
* Remove rope_theta setting
* Address PR feedback
* rename functions to conformer
* remove some redundant ggml_cont
* fix missing tensor
* add prefix "a." for conv tensors
* remove redundant reshape
* clean up
* add test model
---------
Co-authored-by: Tarek Dakhran <redacted>
Pascal [Thu, 18 Dec 2025 16:55:03 +0000 (17:55 +0100)]
webui: display prompt processing stats (#18146)
* webui: display prompt processing stats
* feat: Improve UI of Chat Message Statistics
* chore: update webui build output
* refactor: Post-review improvements
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <redacted>
Taimur Ahmad [Thu, 18 Dec 2025 14:02:09 +0000 (19:02 +0500)]
ggml-cpu: extend support for RVV floating-point kernels (#17318)
* cmake: add BF16 RVV flag for ggml-cpu
* ggml-cpu: add floating-point conversion kernels
* ggml: add floating-point kernels
Co-authored-by: Rehan Qasim <redacted>
* ggml-cpu: fix lmul in vec_dot_bf16
* ggml-cpu: change redsum to lmul 4, fix leftover
---------
Co-authored-by: Rehan Qasim <redacted>
Xuan-Son Nguyen [Thu, 18 Dec 2025 13:30:32 +0000 (14:30 +0100)]
arg: fix ASAN error on sampler_type_names empty (#18167)
Sigbjørn Skjæret [Thu, 18 Dec 2025 12:45:38 +0000 (13:45 +0100)]
gguf-py : use copy-on-write mode for localtensor (#18162)
yulo [Thu, 18 Dec 2025 11:50:56 +0000 (19:50 +0800)]
remove i_major_dual (#18157)
Co-authored-by: zhang hui <redacted>
Aleksander Grygier [Thu, 18 Dec 2025 10:13:52 +0000 (11:13 +0100)]
webui: Fix selecting generated output issues during active streaming (#18091)
* draft: incremental markdown rendering with stable blocks
* refactor: Logic improvements
* refactor: DRY Markdown post-processing logic
* refactor: ID generation improvements
* fix: Remove runes
* refactor: Clean up & add JSDocs
* chore: update webui static output
* fix: Add tick to prevent race conditions for rendering Markdown blocks
Suggestion from @ServeurpersoCom
Co-authored-by: Pascal <redacted>
* chore: Run `npm audit fix`
* chore: update webui static output
* feat: Improve performance using global counter & id instead of UUID
* refactor: Enhance Markdown rendering with link and code features
* chore: update webui static output
* fix: Code block content extraction
* chore: update webui static output
* chore: update webui static output
---------
Co-authored-by: Pascal <redacted>
Kim S. [Thu, 18 Dec 2025 10:08:42 +0000 (11:08 +0100)]
webui: fix chat screen shadow width (#18010)
* webui: fix chat screen shadow width
* chore: add index.html.gz
Johannes Gäßler [Thu, 18 Dec 2025 07:12:18 +0000 (08:12 +0100)]
llama: offload output layer to GPU first (#18148)
Sigbjørn Skjæret [Thu, 18 Dec 2025 06:54:54 +0000 (07:54 +0100)]
convert : sort and use file parts from model index if present (#18043)
* keep file part order from model index
* treat index as authoritative
* sort index parts
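The idea of treating the index as authoritative can be sketched as follows, assuming the Hugging Face style index file (for example `model.safetensors.index.json`) with a `weight_map` object that maps tensor names to part file names. The function name is hypothetical and this is not the converter's actual code.

```python
import json


def parts_from_index(index_text):
    """Return the sorted, de-duplicated list of part files named by a model index."""
    index = json.loads(index_text)
    # Assumption: "weight_map" maps each tensor name to the file that stores it.
    weight_map = index["weight_map"]
    return sorted(set(weight_map.values()))
```

Files that sit in the model directory but are not listed in the index are simply never returned, which is one way to read "treat index as authoritative".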
Julius Tischbein [Thu, 18 Dec 2025 06:27:19 +0000 (07:27 +0100)]
llama : Async DirectIO model loading on Linux (#18012)
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp
* Adding maybe unused keyword for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
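Two of the steps listed above, the O_DIRECT fallback and the aligned seek, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code in llama-mmap.cpp: the function names are hypothetical, and the alignment value would come from the file system in practice.

```python
import os


def aligned_range(offset, size, alignment):
    """Widen [offset, offset + size) to alignment boundaries.

    Returns (aligned_start, aligned_length, skip), where skip is the number of
    leading bytes to discard from the aligned read to reach the requested data.
    """
    start = offset - (offset % alignment)
    end = offset + size
    aligned_end = -(-end // alignment) * alignment  # round up to a multiple
    return start, aligned_end - start, offset - start


def open_uncached(path):
    """Try to open with O_DIRECT; fall back to a normal open if unsupported."""
    direct = getattr(os, "O_DIRECT", 0)  # not defined on macOS or Windows
    if direct:
        try:
            return os.open(path, os.O_RDONLY | direct), True
        except OSError:
            pass  # e.g. the file system rejects O_DIRECT
    return os.open(path, os.O_RDONLY), False
```

Direct I/O generally also requires an aligned destination buffer, which this sketch does not cover.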
Shouyu [Wed, 17 Dec 2025 21:38:21 +0000 (16:38 -0500)]
ggml-hexagon: swiglu_oai operation (#18114)
* snapshot: debug ggml-hexagon swiglu-oai
* fix: fix hvx_min_scalar_f32
* feat: working swiglu-oai
* chore: fix formatting issue
Sigbjørn Skjæret [Wed, 17 Dec 2025 21:15:53 +0000 (22:15 +0100)]
convert : force patch_merger tensors to f16/f32 (#18124)
Pascal [Wed, 17 Dec 2025 20:45:45 +0000 (21:45 +0100)]
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
Addresses feedback from @ngxson and @ggerganov
* server: address review feedback from ngxson
* server: regenerate README with llama-gen-docs
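The "parse once, cache, expose in /props" flow described in the backend changes can be sketched as below. The class and method names mirror the commit message loosely but are hypothetical Python stand-ins for the C++ server, and the precedence of inline JSON over the file path is an assumption.

```python
import json


class ServerContext:
    def __init__(self):
        self.webui_settings = {}  # cached result, reused by every /props call

    def load_model(self, webui_config=None, webui_config_file=None):
        """Parse the WebUI config once at load time (inline JSON or file path)."""
        raw = webui_config
        if raw is None and webui_config_file is not None:
            with open(webui_config_file, encoding="utf-8") as fh:
                raw = fh.read()
        if raw is None:
            return
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as err:
            raise ValueError(f"invalid WebUI config: {err}") from err
        if not isinstance(parsed, dict):
            raise ValueError("WebUI config must be a JSON object")
        self.webui_settings = parsed

    def props(self):
        # No parsing here: the handler only returns the cached dictionary.
        return {"webui_settings": self.webui_settings}
```

Because parsing happens in `load_model`, a malformed config fails early instead of on the first request.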
Xuan-Son Nguyen [Wed, 17 Dec 2025 20:39:08 +0000 (21:39 +0100)]
server: (router) disable SSL on child process (#18141)
Johannes Gäßler [Wed, 17 Dec 2025 20:10:03 +0000 (21:10 +0100)]
llama-fit-params: fix memory print (#18136)
Kim S. [Wed, 17 Dec 2025 19:05:45 +0000 (20:05 +0100)]
webui: fix chat header width when sidebar is closed (#17981)
* webui: fix chat header width when sidebar is closed
* chore: add index.html.gz
Shouyu [Wed, 17 Dec 2025 18:39:32 +0000 (13:39 -0500)]
ggml-hexagon: gelu operation (#17921)
* feat: initial support for gelu using sigmoid approximation
* snapshot: faster gelu using polynomial approximation
* test: disable l2-block prefetch in polynomial approximation
* Revert "test: disable l2-block prefetch in polynomial approximation"
This reverts commit 72339994d45b2bed887e79994403c378d90b62b5.
* Revert "snapshot: faster gelu using polynomial approximation"
This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace.
* debug: temporarily disable unnecessary log message for debug purpose
* Feat: optimized unaligned sigmoid_f32
* Feat: larger l2prefetch block
* feat: apply unaligned-load optimization on mul and mul_scalar
* Revert "debug: temporarily disable unnecessary log message for debug purpose"
This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995.
* refactor: cleanup commented unused code
* chore: reformat code with clang-formatter to pass cli test
* Revert "chore: reformat code with clang-formatter to pass cli test"
This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718.
* fix: fix loop overflow
* chore: fix formatting CI error
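For reference, the sigmoid approximation of GELU mentioned in the first bullet is commonly written as x * sigmoid(1.702 * x). The sketch below compares it with the exact erf-based form. The constant 1.702 is the usual textbook value and is an assumption here, since the log does not state which coefficient the Hexagon kernel uses.

```python
import math


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid approximation: x * sigmoid(1.702 * x) (coefficient assumed)."""
    # Suitable for moderate |x|; math.exp overflows for very large negative x.
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_error(lo=-6.0, hi=6.0, steps=1200):
    """Largest absolute difference between the two forms on a uniform grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
```

The approximation trades a small absolute error for replacing `erf` with a single sigmoid, which is typically cheaper to vectorize.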
Georgi Gerganov [Wed, 17 Dec 2025 17:46:00 +0000 (19:46 +0200)]
common : restore grammar-based rejection sampling (#18137)
* common : restart grammar-based rejection sampling
* sampling : allow null samplers
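Grammar-based rejection sampling, in its generic form, samples from the unconstrained distribution first and only applies the grammar to every candidate when the sampled token is rejected. The sketch below shows that idea under stated assumptions: `is_allowed` is a hypothetical callback standing in for a grammar check, and none of this is the llama.cpp sampler API.

```python
import random


def sample_with_grammar(probs, is_allowed, rng):
    """Sample a token from probs (token -> weight), constrained by is_allowed."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]

    # Fast path: sample without the grammar and validate only the chosen token.
    choice = rng.choices(tokens, weights=weights, k=1)[0]
    if is_allowed(choice):
        return choice

    # Slow path: the token was rejected, so filter all candidates and resample.
    allowed = [(t, w) for t, w in zip(tokens, weights) if is_allowed(t)]
    if not allowed:
        raise ValueError("grammar rejects every candidate token")
    kept_tokens = [t for t, _ in allowed]
    kept_weights = [w for _, w in allowed]
    return rng.choices(kept_tokens, weights=kept_weights, k=1)[0]
```

The benefit is that the grammar is evaluated for a single token in the common case rather than for the whole vocabulary.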
Johannes Gäßler [Wed, 17 Dec 2025 17:44:13 +0000 (18:44 +0100)]
common: clarify instructions for bug reports (#18134)
HonestQiao [Wed, 17 Dec 2025 15:34:35 +0000 (23:34 +0800)]
model: fix GLM-ASR-Nano-2512 load error (#18130) (#18142)
Xuan-Son Nguyen [Wed, 17 Dec 2025 13:54:11 +0000 (14:54 +0100)]
server: (router) allow child process to report status via stdout (#18110)
* server: (router) allow child process to report status via stdout
* apply suggestions
Piotr Wilkin (ilintar) [Wed, 17 Dec 2025 13:21:51 +0000 (14:21 +0100)]
Extend run-org-model.py: add (a) batching, (b) loading the prompt from a file, (c) multimodal capability (#18034)
Johannes Gäßler [Wed, 17 Dec 2025 12:46:48 +0000 (13:46 +0100)]
Github: ask for -v logs for params_fit [no ci] (#18128)
Alberto Cabrera Pérez [Wed, 17 Dec 2025 11:39:13 +0000 (11:39 +0000)]
ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (#18096)
* wip: skeleton for q8_0 repack
* q8_0 repack GEMV implementations
* GEMM implementations
* Formatting
* Fixed format consistency of repack gemm and gemv declarations
* gemv and gemm generic location consistent with declarations
* Removed incorrect unused-variable statements
* Cleanup, consistent style
* Missing generic fallbacks for x86 and powerpc
Tarek Dakhran [Wed, 17 Dec 2025 11:17:11 +0000 (12:17 +0100)]
model: fix LFM2_MOE missing tensors (#18132)
Sigbjørn Skjæret [Wed, 17 Dec 2025 09:45:40 +0000 (10:45 +0100)]
ci : clean up webui jobs (#18116)
* clean up webui jobs
* refined step control
* forgot dependencies
* apparently always() is needed