git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
3 weeks ago common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)
Aldehir Rojas [Sat, 3 Jan 2026 22:02:43 +0000 (16:02 -0600)]
common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)

* grammar : add support for std::regex_search() with trigger patterns

* common : update hermes2 pro trigger to search instead of match

* common : use regex_search with anchoring for partial matching

* common : adjust regex partial tests to use new pattern

* grammar : check pattern directly instead of adding a type

* common : adjust existing patterns to match new semantics
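
A minimal illustration of the semantic shift (a standalone sketch, not the actual llama.cpp trigger code):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text(100000, 'a');
    text += "<tool_call>";

    // Old style (illustrative): emulate "match anywhere" by prefixing the
    // trigger with [\s\S]*? and using regex_match over the whole input.
    // The lazy prefix forces the engine to retry the trigger at every
    // offset with full backtracking state, which is slow on long inputs.
    //   std::regex slow(R"([\s\S]*?<tool_call>[\s\S]*)");
    //   bool found = std::regex_match(text, slow);

    // New style: search for the trigger pattern directly; regex_search
    // finds the first occurrence without the synthetic catch-all prefix.
    std::regex trigger(R"(<tool_call>)");
    std::smatch m;
    if (std::regex_search(text, m, trigger)) {
        std::cout << "trigger found at offset " << m.position(0) << "\n";
    }
    return 0;
}
```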

3 weeks ago graph : fix graph reuse logic when `n_pos_per_embd > 1` (#18566)
Georgi Gerganov [Sat, 3 Jan 2026 21:59:06 +0000 (23:59 +0200)]
graph : fix graph reuse logic when `n_pos_per_embd > 1` (#18566)

3 weeks ago ggml-cuda: fixes for concurrent streams (#18496)
Aman Gupta [Sat, 3 Jan 2026 15:15:01 +0000 (23:15 +0800)]
ggml-cuda: fixes for concurrent streams (#18496)

3 weeks ago context : fix reserve token padding to n_seqs (#18536)
Georgi Gerganov [Sat, 3 Jan 2026 13:45:34 +0000 (15:45 +0200)]
context : fix reserve token padding to n_seqs (#18536)

3 weeks ago CUDA: only allocate FA tmp buffer if needed (#18564)
Johannes Gäßler [Sat, 3 Jan 2026 12:55:53 +0000 (13:55 +0100)]
CUDA: only allocate FA tmp buffer if needed (#18564)

3 weeks ago Update upstream debian/0.0.7599-1
Mathieu Baudier [Sat, 3 Jan 2026 10:50:21 +0000 (11:50 +0100)]
Update upstream

3 weeks ago Merge tag 'upstream/0.0.7599' into debian/latest
Mathieu Baudier [Sat, 3 Jan 2026 10:43:17 +0000 (11:43 +0100)]
Merge tag 'upstream/0.0.7599' into debian/latest

Upstream release

3 weeks ago (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
pl752 [Sat, 3 Jan 2026 10:13:40 +0000 (15:13 +0500)]
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)

* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly cast some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <redacted>
3 weeks ago ggml-hexagon: optimize activation function (#18393)
Shouyu [Sat, 3 Jan 2026 05:24:24 +0000 (00:24 -0500)]
ggml-hexagon: optimize activation function (#18393)

* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue

3 weeks ago vulkan: Optimize GGML_OP_CUMSUM (#18417)
Jeff Bolz [Fri, 2 Jan 2026 21:32:30 +0000 (15:32 -0600)]
vulkan: Optimize GGML_OP_CUMSUM (#18417)

* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
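
For reference, a CPU sketch of the multipass scheme described above (the real implementation is a Vulkan compute shader; names here are hypothetical): pass 1 writes an inclusive scan inside each block and records the block total; pass 2 adds the running total of all preceding blocks.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

static std::vector<float> cumsum_two_pass(const std::vector<float> & row, size_t block) {
    std::vector<float> out(row.size());
    std::vector<float> block_sums;

    // pass 1: per-block inclusive scan + per-block totals
    for (size_t b = 0; b < row.size(); b += block) {
        float acc = 0.0f;
        for (size_t i = b; i < std::min(b + block, row.size()); ++i) {
            acc += row[i];
            out[i] = acc;
        }
        block_sums.push_back(acc);
    }

    // pass 2: add the exclusive prefix of the block totals
    float offset = 0.0f;
    size_t idx = 0;
    for (size_t b = 0; b < row.size(); b += block, ++idx) {
        for (size_t i = b; i < std::min(b + block, row.size()); ++i) {
            out[i] += offset;
        }
        offset += block_sums[idx];
    }
    return out;
}
```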

3 weeks ago vulkan: Implement mmvq for iq1_s/iq1_m (#18450)
Jeff Bolz [Fri, 2 Jan 2026 19:19:04 +0000 (13:19 -0600)]
vulkan: Implement mmvq for iq1_s/iq1_m (#18450)

3 weeks ago model : Maincoder-1B support (#18534)
Prabod [Fri, 2 Jan 2026 19:11:59 +0000 (06:11 +1100)]
model : Maincoder-1B support (#18534)

* Add Maincoder model support

* Removed SPM model vocabulary setting and MoE-related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8

3 weeks ago metal : adjust extra size for FA buffer to avoid reallocations (#18545)
Georgi Gerganov [Fri, 2 Jan 2026 17:02:18 +0000 (19:02 +0200)]
metal : adjust extra size for FA buffer to avoid reallocations (#18545)

3 weeks ago graph : reduce topology branching (#18548)
Georgi Gerganov [Fri, 2 Jan 2026 17:01:56 +0000 (19:01 +0200)]
graph : reduce topology branching (#18548)

3 weeks ago vocab : reduce debug logs about non-EOG control tokens (#18541)
Georgi Gerganov [Fri, 2 Jan 2026 14:17:33 +0000 (16:17 +0200)]
vocab : reduce debug logs about non-EOG control tokens (#18541)

* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment

3 weeks ago rpc : use unordered_map::reserve and emplace (#18513)
Chris Rohlf [Fri, 2 Jan 2026 10:09:36 +0000 (05:09 -0500)]
rpc : use unordered_map::reserve and emplace (#18513)
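
A generic illustration of the pattern (key and value types in the RPC backend differ): reserve() sizes the hash table once up front so bulk insertion does not rehash, and emplace() constructs the value in place instead of default-constructing through operator[] and then assigning.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

static void fill(std::unordered_map<uint64_t, std::string> & table, size_t n) {
    table.reserve(n);                  // one allocation, no rehash churn
    for (uint64_t id = 0; id < n; ++id) {
        table.emplace(id, "value");    // construct the entry in place
    }
}
```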

3 weeks ago cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
MeeMin [Thu, 1 Jan 2026 23:24:20 +0000 (04:54 +0530)]
cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)

* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
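
The underlying overflow hazard, in a hedged sketch (the actual copy parameters differ): multiplying 32-bit extents overflows int well before the byte count reaches INT_MAX, so each factor is widened to int64_t before the product is formed.

```cpp
#include <cstdint>

// Widen before multiplying: the cast on ne0 promotes the whole product
// to 64-bit arithmetic, keeping it exact for very large tensors.
static int64_t tensor_nbytes(int ne0, int ne1, int ne2, int ne3, int type_size) {
    return (int64_t) ne0 * ne1 * ne2 * ne3 * type_size;
}
```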

3 weeks ago model : remove modern-bert iswa template (#18529)
Sigbjørn Skjæret [Thu, 1 Jan 2026 23:06:42 +0000 (00:06 +0100)]
model : remove modern-bert iswa template (#18529)

* remove modern-bert iswa template

* forgotten

3 weeks ago model: support youtu-vl model (#18479)
tt [Thu, 1 Jan 2026 18:25:54 +0000 (02:25 +0800)]
model: support youtu-vl model (#18479)

* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
3 weeks ago Add conversion support for IQuestCoderForCausalLM (#18524)
Piotr Wilkin (ilintar) [Thu, 1 Jan 2026 17:45:55 +0000 (18:45 +0100)]
Add conversion support for IQuestCoderForCausalLM (#18524)

3 weeks ago model : add support for JinaBertModel with non-gated ffn (#18475)
o7si [Thu, 1 Jan 2026 17:38:51 +0000 (01:38 +0800)]
model : add support for JinaBertModel with non-gated ffn (#18475)

* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
3 weeks ago convert : fix encoding of WPM vocab for BERT models (#18500)
o7si [Thu, 1 Jan 2026 17:27:07 +0000 (01:27 +0800)]
convert : fix encoding of WPM vocab for BERT models (#18500)

* convert: avoid token collision when stripping ## prefix

* convert: use token types for BERT special tokens check

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
3 weeks ago model: add Solar Open model (#18511)
HelloKS [Thu, 1 Jan 2026 17:01:43 +0000 (02:01 +0900)]
model: add Solar Open model (#18511)

* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: suggested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template

3 weeks ago webui: fix code copy stripping XML/HTML tags (#18518)
Anri Lombard [Thu, 1 Jan 2026 12:44:11 +0000 (14:44 +0200)]
webui: fix code copy stripping XML/HTML tags (#18518)

* webui: fix code copy stripping XML/HTML tags

* webui: update static build

3 weeks ago ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502)
Aman Gupta [Thu, 1 Jan 2026 11:18:43 +0000 (19:18 +0800)]
ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502)

3 weeks ago vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
Jeff Bolz [Thu, 1 Jan 2026 07:58:27 +0000 (01:58 -0600)]
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)

* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for MoltenVK

3 weeks ago llama: handle short reads in direct I/O path (#18504) upstream/0.0.7599
triplenom [Thu, 1 Jan 2026 02:24:43 +0000 (21:24 -0500)]
llama: handle short reads in direct I/O path (#18504)

3 weeks ago chat: make tool description and parameters optional per OpenAI spec (#18478)
Anri Lombard [Wed, 31 Dec 2025 23:21:37 +0000 (01:21 +0200)]
chat: make tool description and parameters optional per OpenAI spec (#18478)

* chat: make tool description and parameters optional per OpenAI spec

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

* refactor: use value() for cleaner optional field access
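
A sketch of the lookup pattern with nlohmann::json, which llama.cpp uses for request parsing (the helper and field handling here are simplified): only "name" stays required, while "description" and "parameters" fall back to benign defaults instead of throwing when absent.

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

static void parse_tool_function(const json & fn) {
    const std::string name        = fn.at("name");                          // required
    const std::string description = fn.value("description", "");            // optional
    const json        parameters  = fn.value("parameters", json::object()); // optional
    (void) name; (void) description; (void) parameters;
}
```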

3 weeks ago sync : ggml
Georgi Gerganov [Wed, 31 Dec 2025 16:27:54 +0000 (18:27 +0200)]
sync : ggml

3 weeks ago ggml : bump version to 0.9.5 (ggml/1410)
Georgi Gerganov [Wed, 31 Dec 2025 16:24:07 +0000 (18:24 +0200)]
ggml : bump version to 0.9.5 (ggml/1410)

3 weeks ago quantize: prevent input/output file collision (#18451)
Anri Lombard [Wed, 31 Dec 2025 15:29:03 +0000 (17:29 +0200)]
quantize: prevent input/output file collision (#18451)

Check if input and output files are the same before quantizing to prevent
file corruption when mmap reads from a file being written to.

Fixes #12753
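
A minimal sketch of such a guard using std::filesystem (assumed helper, not the exact llama-quantize code): refuse to run when input and output resolve to the same file, since the mmap'd input would be overwritten while still being read.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>

static void check_file_collision(const std::string & fname_inp, const std::string & fname_out) {
    std::error_code ec;
    // equivalent() follows symlinks and hardlinks, so it also catches
    // two different paths pointing at the same underlying file.
    if (std::filesystem::equivalent(fname_inp, fname_out, ec) && !ec) {
        throw std::runtime_error("input and output files must differ: " + fname_out);
    }
}
```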

3 weeks ago convert : lint fix (#18507)
Sigbjørn Skjæret [Wed, 31 Dec 2025 13:28:21 +0000 (14:28 +0100)]
convert : lint fix (#18507)

3 weeks ago mtmd : Adding support for Nvidia Music Flamingo Model (#18470)
Henry147147 [Wed, 31 Dec 2025 11:13:23 +0000 (06:13 -0500)]
mtmd : Adding support for Nvidia Music Flamingo Model (#18470)

* Initial commit, debugging q5_k_s quant

* Made hf_to_gguf extend whisper to reduce code duplication

* addressed convert_hf_to_gguf pull request issue

---------

Co-authored-by: Henry D <redacted>
3 weeks ago metal : add count_equal op (#18314)
gatbontonpc [Wed, 31 Dec 2025 08:39:48 +0000 (00:39 -0800)]
metal : add count_equal op (#18314)

* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <redacted>
3 weeks ago CUDA: fix KQ max calculation (#18487)
Johannes Gäßler [Wed, 31 Dec 2025 08:37:00 +0000 (09:37 +0100)]
CUDA: fix KQ max calculation (#18487)

3 weeks ago metal : remove BF16 x F16 kernels (#18456)
Georgi Gerganov [Wed, 31 Dec 2025 07:53:48 +0000 (09:53 +0200)]
metal : remove BF16 x F16 kernels (#18456)

3 weeks ago sycl: add newline at the end of CMakeLists.txt (#18503)
Aman Gupta [Wed, 31 Dec 2025 06:23:44 +0000 (14:23 +0800)]
sycl: add newline at the end of CMakeLists.txt (#18503)

3 weeks ago Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (#18345)
Rahul Sathe [Wed, 31 Dec 2025 01:08:44 +0000 (06:38 +0530)]
Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (#18345)

* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

3 weeks ago docker : add CUDA 13.1 image build (#18441)
Sigbjørn Skjæret [Tue, 30 Dec 2025 21:28:53 +0000 (22:28 +0100)]
docker : add CUDA 13.1 image build (#18441)

* add updated cuda-new.Dockerfile for Ubuntu 24.04 compatibility

* add cuda13 build

3 weeks ago docs : document that JSON Schema is not available to the model when using response_format (#18492)
Bart Louwers [Tue, 30 Dec 2025 21:13:49 +0000 (22:13 +0100)]
docs : document that JSON Schema is not available to the model when using response_format (#18492)

* Document unsupported JSON Schema annotations

Add note about unsupported JSON Schema annotations.

* Update README.md

* Update README.md

* Update README.md

3 weeks ago common : default content to an empty string (#18485)
Aldehir Rojas [Tue, 30 Dec 2025 18:00:57 +0000 (12:00 -0600)]
common : default content to an empty string (#18485)

* common : default content to an empty string

* common : fix tests that break when content != null

3 weeks ago llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489)
Daniel Bevenius [Tue, 30 Dec 2025 16:20:14 +0000 (17:20 +0100)]
llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489)

3 weeks ago Remove separate server package
Mathieu Baudier [Tue, 30 Dec 2025 15:35:53 +0000 (16:35 +0100)]
Remove separate server package

3 weeks ago lora: count lora nodes in graph_max_nodes (#18469)
Xuan-Son Nguyen [Tue, 30 Dec 2025 14:53:12 +0000 (15:53 +0100)]
lora: count lora nodes in graph_max_nodes (#18469)

* lora: count lora nodes in graph_max_nodes

* 3 nodes per weight

* 4 nodes

* keep track n_lora_nodes from llama_model

* fix assert

* rm redundant header

* common: load adapters before context creation

* use 6 nodes
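
The resulting accounting, roughly (a hypothetical sketch; the exact expression in llama.cpp may differ): each adapted weight contributes a fixed number of extra graph nodes, so the node budget must grow with the adapter instead of staying model-only.

```cpp
#include <cstdint>

static uint32_t graph_max_nodes(uint32_t n_model_nodes, uint32_t n_lora_weights) {
    const uint32_t nodes_per_lora_weight = 6; // the value the PR settled on
    return n_model_nodes + nodes_per_lora_weight * n_lora_weights;
}
```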

3 weeks ago sampling: reuse token data buffer in llama_sampler_sample (#18365)
Jay Zenith [Tue, 30 Dec 2025 14:27:49 +0000 (06:27 -0800)]
sampling: reuse token data buffer in llama_sampler_sample (#18365)

* sampling: reuse token data buffer in llama_sampler_sample

* move cur buffer before timing section, after samplers

* minor : fix build

---------

Co-authored-by: Georgi Gerganov <redacted>
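
The reuse pattern in a self-contained sketch (types simplified; the real code uses llama_token_data from llama.h): keep the candidates buffer as long-lived state so each call reuses its capacity instead of allocating roughly n_vocab entries per sampled token.

```cpp
#include <vector>

struct token_data { int id; float logit; float p; }; // simplified llama_token_data

struct sampler_state {
    std::vector<token_data> cur; // persists across sampling calls

    void prepare(const float * logits, int n_vocab) {
        cur.resize(n_vocab); // after the first call this no longer allocates
        for (int i = 0; i < n_vocab; ++i) {
            cur[i] = token_data{ i, logits[i], 0.0f };
        }
    }
};
```
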
3 weeks ago server: fix files built redundantly (#18474)
Jeff Bolz [Tue, 30 Dec 2025 12:11:13 +0000 (06:11 -0600)]
server: fix files built redundantly (#18474)

3 weeks ago kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458)
Charles Xu [Tue, 30 Dec 2025 12:04:53 +0000 (13:04 +0100)]
kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458)

* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments

3 weeks ago CUDA: add log line when mxfp4 acceleration is used (#18483)
Aman Gupta [Tue, 30 Dec 2025 09:40:46 +0000 (17:40 +0800)]
CUDA: add log line when mxfp4 acceleration is used (#18483)

* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features

3 weeks ago model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)
Daniel Bevenius [Tue, 30 Dec 2025 09:13:12 +0000 (10:13 +0100)]
model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)

This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.

The motivation for this is that, currently, if the converted model file
name differs from the original model directory/name, the verification
script will look for the wrong .bin file that was generated when running
the converted model.

This is similar to the change made for the embeddings models script in
Commit db81d5ec4b0a9cb19e98c4533731c9554eb025db ("model-conversion :
use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"),
but here we also verify the embeddings for causal models.

3 weeks ago webui: fix prompt progress ETA calculation (#18468)
Xuan-Son Nguyen [Mon, 29 Dec 2025 20:42:11 +0000 (21:42 +0100)]
webui: fix prompt progress ETA calculation (#18468)

* webui: fix prompt progress ETA calculation

* handle case done === 0

3 weeks ago Webui/prompt processing progress (#18300)
Pascal [Mon, 29 Dec 2025 18:32:21 +0000 (19:32 +0100)]
Webui/prompt processing progress (#18300)

* webui: display prompt preprocessing progress

* webui: add percentage/ETA and exclude cached tokens from progress

Address review feedback from ngxson

* webui: add minutes and first chunk (0%) case

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <redacted>
* webui: address review feedback from allozaur

* chore: update webui build output

* webui: address review feedback from allozaur

* nit

* chore: update webui build output

* feat: Enhance chat processing state

* feat: Improve chat processing statistics UI

* chore: update webui build output

* feat: Add live generation statistics to processing state hook

* feat: Persist prompt processing stats in hook for better UX

* refactor: Enhance ChatMessageStatistics for live stream display

* feat: Implement enhanced live chat statistics into assistant message

* chore: update webui build output

* fix: Proper tab for each stage of prompt processing/generation

* chore: update webui build output

* fix: Improved ETA calculation & display logic

* chore: update webui build output

* feat: Simplify logic & remove ETA from prompt progress

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>
3 weeks ago Improve arm64 optimization
Mathieu Baudier [Mon, 29 Dec 2025 17:49:57 +0000 (18:49 +0100)]
Improve arm64 optimization

3 weeks ago Disable hardening
Mathieu Baudier [Mon, 29 Dec 2025 17:30:13 +0000 (18:30 +0100)]
Disable hardening

3 weeks ago Add llama-completion tool
Mathieu Baudier [Mon, 29 Dec 2025 17:22:45 +0000 (18:22 +0100)]
Add llama-completion tool

3 weeks ago Re-enable optimizations on arm64
Mathieu Baudier [Mon, 29 Dec 2025 16:59:30 +0000 (17:59 +0100)]
Re-enable optimizations on arm64

3 weeks ago CUDA: fix replacement of bad archs in CMake (#18457)
Johannes Gäßler [Mon, 29 Dec 2025 16:58:20 +0000 (17:58 +0100)]
CUDA: fix replacement of bad archs in CMake (#18457)

3 weeks ago server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
wbtek [Mon, 29 Dec 2025 16:12:48 +0000 (01:12 +0900)]
server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)

* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <redacted>
3 weeks ago Reduce optimizations on arm64
Mathieu Baudier [Mon, 29 Dec 2025 15:58:30 +0000 (16:58 +0100)]
Reduce optimizations on arm64

3 weeks ago Reduce optimizations on arm64
Mathieu Baudier [Mon, 29 Dec 2025 15:49:13 +0000 (16:49 +0100)]
Reduce optimizations on arm64

3 weeks ago Remove separate multimodal package
Mathieu Baudier [Mon, 29 Dec 2025 15:46:58 +0000 (16:46 +0100)]
Remove separate multimodal package

3 weeks ago contributing: tighten AI usage policy (#18388)
Xuan-Son Nguyen [Mon, 29 Dec 2025 15:01:32 +0000 (16:01 +0100)]
contributing: tighten AI usage policy (#18388)

* contributing: tighten AI usage policy

* refactor AGENTS.md

* proofreading

* update contributing

* add claude.md

* add trailing newline

* add note about dishonest practices

* rm point about dishonest

* rm requirement watermarking

* add .gemini/settings.json

* allow initially AI-generated content

* revise

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <redacted>
* improve

* trailing space

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>
* update

---------

Co-authored-by: Johannes Gäßler <redacted>
3 weeks ago android: routine maintenance - Dec 2025 (#18338)
Naco Siren [Mon, 29 Dec 2025 13:51:13 +0000 (05:51 -0800)]
android: routine maintenance - Dec 2025 (#18338)

* Fix `msg` typo

* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.

* UI polish: stack new message change from below; fix GGUF margin not in view port

* Bug fixes: rare race condition when the main thread updates the view while the default thread updates messages at the same time; user input not disabled during generation.

* Bump dependencies' versions; deprecated outdated DSL usage.

3 weeks ago server : handle closed connection for tasks (#18459)
Georgi Gerganov [Mon, 29 Dec 2025 13:34:41 +0000 (15:34 +0200)]
server : handle closed connection for tasks (#18459)

3 weeks ago model-conversion : add device option to embd run orig model (#18386)
Daniel Bevenius [Mon, 29 Dec 2025 12:37:02 +0000 (13:37 +0100)]
model-conversion : add device option to embd run orig model (#18386)

This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.

3 weeks ago Use GGML_NATIVE=ON on Ubuntu arm64
Mathieu Baudier [Mon, 29 Dec 2025 12:06:46 +0000 (13:06 +0100)]
Use GGML_NATIVE=ON on Ubuntu arm64

3 weeks ago retrieval : use at most n_seq_max chunks (#18400)
Héctor Estrada Moreno [Mon, 29 Dec 2025 11:21:13 +0000 (05:21 -0600)]
retrieval : use at most n_seq_max chunks (#18400)

3 weeks ago Introduce llama-tools-server package
Mathieu Baudier [Mon, 29 Dec 2025 10:57:50 +0000 (11:57 +0100)]
Introduce llama-tools-server package

3 weeks ago Improve OS detection
Mathieu Baudier [Mon, 29 Dec 2025 10:33:36 +0000 (11:33 +0100)]
Improve OS detection

3 weeks ago Improve ARM flags
Mathieu Baudier [Mon, 29 Dec 2025 10:27:08 +0000 (11:27 +0100)]
Improve ARM flags

3 weeks ago Fix ARM flags
Mathieu Baudier [Mon, 29 Dec 2025 10:23:25 +0000 (11:23 +0100)]
Fix ARM flags

3 weeks ago Optimize ARM build
Mathieu Baudier [Mon, 29 Dec 2025 10:17:51 +0000 (11:17 +0100)]
Optimize ARM build

3 weeks ago common: fix return value check for setpriority (#18412)
o7si [Mon, 29 Dec 2025 09:07:49 +0000 (17:07 +0800)]
common: fix return value check for setpriority (#18412)

* common: fix return value check for setpriority

* tools: add logging for process priority setting
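
For context, setpriority() returns 0 on success and -1 on failure (with errno set); the return value is not the new priority, so only a comparison against 0 is a valid success test. A sketch of the corrected check (helper name hypothetical):

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/resource.h>

static bool set_process_priority(int prio) {
    if (setpriority(PRIO_PROCESS, 0, prio) != 0) {
        fprintf(stderr, "warn: failed to set process priority %d: %s\n",
                prio, strerror(errno));
        return false;
    }
    return true;
}
```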

3 weeks ago CUDA: Blackwell features for non-native builds (#18436)
Johannes Gäßler [Mon, 29 Dec 2025 08:35:42 +0000 (09:35 +0100)]
CUDA: Blackwell features for non-native builds (#18436)

3 weeks ago cuda: fix race condition in cumsum (#18448)
Aman Gupta [Mon, 29 Dec 2025 06:07:17 +0000 (14:07 +0800)]
cuda: fix race condition in cumsum (#18448)

* ggml-cuda: fix race condition in cumsum

* remove unnecessary sync_threads

3 weeks ago ci : re-enable rocm build on amd64 (#18439)
Tim Neumann [Sun, 28 Dec 2025 23:29:23 +0000 (00:29 +0100)]
ci : re-enable rocm build on amd64 (#18439)

This was disabled in #9340 due to a compiler crash, but it seems to build now, as confirmed by the latest comments in #11913.

I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).

A quick attempt to build an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.

The `runs_on` option was added to match the other entries.

3 weeks ago HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)
uvos [Sun, 28 Dec 2025 19:12:55 +0000 (20:12 +0100)]
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)

4 weeks ago model : Plamo3 support (#17304)
momonga [Sun, 28 Dec 2025 16:28:31 +0000 (01:28 +0900)]
model : Plamo3 support (#17304)

* plamo3

* fix plamo3

* clean code

* clean up the code

* fix diff

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* add chat_template if exist

* clean up the code

* fix cpu-backend

* chore: whitespace trim fix + typo fix

* Fix: address review feedback

* restore `FREQ_BASE_SWA` constant

* Fix: address review feedback2

* Fix:typecheck

* Fix: address review feedback3

* final cleanup

---------

Co-authored-by: mmngays <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks ago Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426)
Aman Gupta [Sun, 28 Dec 2025 12:53:36 +0000 (20:53 +0800)]
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426)

4 weeks ago rpc: fix segfault on invalid endpoint format (#18387)
o7si [Sun, 28 Dec 2025 10:34:41 +0000 (18:34 +0800)]
rpc: fix segfault on invalid endpoint format (#18387)

* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
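
An illustrative host:port validation (not the exact rpc code) showing the kind of bounds checking that prevents the crash:

```cpp
#include <cstdlib>
#include <string>

// Validate the "host:port" shape before slicing, so a malformed endpoint
// (no colon, empty host, or empty port) cannot trigger out-of-range access.
static bool parse_endpoint(const std::string & ep, std::string & host, int & port) {
    const size_t pos = ep.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= ep.size()) {
        return false; // missing host or missing port
    }
    host = ep.substr(0, pos);
    port = std::atoi(ep.c_str() + pos + 1);
    return port > 0 && port <= 65535;
}
```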

4 weeks ago llama-fit-params: fix step size for last device (#18415)
Johannes Gäßler [Sun, 28 Dec 2025 09:52:09 +0000 (10:52 +0100)]
llama-fit-params: fix step size for last device (#18415)

4 weeks ago github: update issue templates [no ci] (#18410)
Johannes Gäßler [Sun, 28 Dec 2025 09:50:56 +0000 (10:50 +0100)]
github: update issue templates [no ci] (#18410)

* github: update issue templates [no ci]

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>
4 weeks ago mtmd: clarify that we no longer accept AI-generated PRs (#18406)
Xuan-Son Nguyen [Sun, 28 Dec 2025 08:57:04 +0000 (09:57 +0100)]
mtmd: clarify that we no longer accept AI-generated PRs (#18406)

4 weeks ago cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186)
Boian Berberov [Sun, 28 Dec 2025 07:33:29 +0000 (07:33 +0000)]
cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186)

* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966

4 weeks ago ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)
QDelta [Sun, 28 Dec 2025 01:33:14 +0000 (20:33 -0500)]
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)

4 weeks ago opencl: allow resizing transpose buffers (#18384)
lhez [Sat, 27 Dec 2025 23:51:14 +0000 (15:51 -0800)]
opencl: allow resizing transpose buffers (#18384)

* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code

4 weeks ago llama-fit-params: fix overflow check (#18354)
Johannes Gäßler [Sat, 27 Dec 2025 19:20:45 +0000 (20:20 +0100)]
llama-fit-params: fix overflow check (#18354)

4 weeks ago llama: fix magic number of 999 for GPU layers (#18266)
Johannes Gäßler [Sat, 27 Dec 2025 19:18:35 +0000 (20:18 +0100)]
llama: fix magic number of 999 for GPU layers (#18266)

* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* encapsulate n_gpu_layers, split_mode

4 weeks ago ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)
Aman Gupta [Sat, 27 Dec 2025 11:56:27 +0000 (19:56 +0800)]
ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)

4 weeks ago Update upstream
Mathieu Baudier [Sat, 27 Dec 2025 11:02:13 +0000 (12:02 +0100)]
Update upstream

4 weeks ago Merge tag 'upstream/0.0.7446' into debian/latest
Mathieu Baudier [Sat, 27 Dec 2025 11:00:44 +0000 (12:00 +0100)]
Merge tag 'upstream/0.0.7446' into debian/latest

Upstream release

4 weeks ago llama_fit_params: return enum for fail vs. error (#18374)
Johannes Gäßler [Sat, 27 Dec 2025 08:59:19 +0000 (09:59 +0100)]
llama_fit_params: return enum for fail vs. error (#18374)

4 weeks ago llama-fit-params: fix Gemma 3 calculation (#18372)
Johannes Gäßler [Sat, 27 Dec 2025 08:56:04 +0000 (09:56 +0100)]
llama-fit-params: fix Gemma 3 calculation (#18372)

4 weeks ago vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Jeff Bolz [Fri, 26 Dec 2025 22:12:58 +0000 (16:12 -0600)]
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)

Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.

4 weeks ago vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
Jeff Bolz [Fri, 26 Dec 2025 17:15:50 +0000 (11:15 -0600)]
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)

* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.

4 weeks ago vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)
Jeff Bolz [Fri, 26 Dec 2025 17:15:02 +0000 (11:15 -0600)]
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)

4 weeks ago vulkan: small dequantization improvements (#18380)
Eve [Fri, 26 Dec 2025 17:12:11 +0000 (17:12 +0000)]
vulkan: small dequantization improvements (#18380)

* iq4_xs

* quants

4 weeks ago vulkan: Support UPSCALE w/antialias (#18327)
Jeff Bolz [Fri, 26 Dec 2025 16:00:57 +0000 (10:00 -0600)]
vulkan: Support UPSCALE w/antialias (#18327)

4 weeks ago vulkan: handle rope with large number of rows (#18306)
Jeff Bolz [Fri, 26 Dec 2025 15:53:46 +0000 (09:53 -0600)]
vulkan: handle rope with large number of rows (#18306)

4 weeks ago server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
o7si [Fri, 26 Dec 2025 15:35:29 +0000 (23:35 +0800)]
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)

* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot
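
A sketch of the recovery pattern using the public llama_memory API (the helper name and exact fallback are assumptions): recurrent/hybrid memory cannot remove a partial token range, so the removal call may return false instead of truncating, and the server should degrade gracefully rather than crash.

```cpp
#include "llama.h"

static bool truncate_or_clear(llama_memory_t mem, llama_seq_id seq, llama_pos p0) {
    if (!llama_memory_seq_rm(mem, seq, p0, -1)) {
        llama_memory_seq_rm(mem, seq, -1, -1); // clear the entire sequence instead
        return false; // caller must re-evaluate the full prompt
    }
    return true;
}
```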

4 weeks ago docs: added note for pre-SYCL Intel hardware (#18016)
Francisco Herrera [Fri, 26 Dec 2025 02:34:30 +0000 (21:34 -0500)]
docs: added note for pre-SYCL Intel hardware (#18016)

Specify that it's for pre-SYCL hardware.