git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

Anri Lombard [Thu, 1 Jan 2026 12:44:11 +0000 (14:44 +0200)]

webui: fix code copy stripping XML/HTML tags (#18518)

* webui: fix code copy stripping XML/HTML tags

* webui: update static build

Aman Gupta [Thu, 1 Jan 2026 11:18:43 +0000 (19:18 +0800)]

ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502)

Jeff Bolz [Thu, 1 Jan 2026 07:58:27 +0000 (01:58 -0600)]

vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)

* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk

triplenom [Thu, 1 Jan 2026 02:24:43 +0000 (21:24 -0500)]

llama: handle short reads in direct I/O path (#18504)

Anri Lombard [Wed, 31 Dec 2025 23:21:37 +0000 (01:21 +0200)]

chat: make tool description and parameters optional per OpenAI spec (#18478)

* chat: make tool description and parameters optional per OpenAI spec

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

* refactor: use value() for cleaner optional field access
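
The behaviour described in this commit can be sketched roughly as follows. This is a minimal Python illustration, not the C++ parser touched by the commit; the function name `parse_tool` and the chosen defaults (an empty description and an empty parameters object) are assumptions made for the example.

```python
from typing import Any


def parse_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Extract a tool's function definition, tolerating missing optional fields.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    fn = tool["function"]  # assumed to be present in this sketch
    return {
        "name": fn["name"],  # still required here
        # Optional fields fall back to defaults instead of raising.
        "description": fn.get("description", ""),
        "parameters": fn.get("parameters", {}),
    }
```

With this shape, a tool definition that only carries a `name` is accepted rather than rejected with an exception.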

Georgi Gerganov [Wed, 31 Dec 2025 16:27:54 +0000 (18:27 +0200)]

sync : ggml

Georgi Gerganov [Wed, 31 Dec 2025 16:24:07 +0000 (18:24 +0200)]

ggml : bump version to 0.9.5 (ggml/1410)

Anri Lombard [Wed, 31 Dec 2025 15:29:03 +0000 (17:29 +0200)]

quantize: prevent input/output file collision (#18451)

Check if input and output files are the same before quantizing to prevent
file corruption when mmap reads from a file being written to.

Fixes #12753
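
A check of this kind might look like the sketch below. It is written in Python for brevity and is not the code from the commit; the helper name `same_file` is hypothetical, and the assumption is that comparing file identity (rather than path strings) is what matters, so that two different spellings of one path are still detected.

```python
import os


def same_file(input_path: str, output_path: str) -> bool:
    """Return True if both paths refer to the same existing file.

    Hypothetical helper illustrating an input/output collision check.
    """
    if not os.path.exists(output_path):
        # An output file that does not exist yet cannot collide with the input.
        return False
    # Compares the underlying file identity, not the path strings.
    return os.path.samefile(input_path, output_path)
```

A caller would run this before opening the output for writing and abort with an error message when it returns `True`.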

Sigbjørn Skjæret [Wed, 31 Dec 2025 13:28:21 +0000 (14:28 +0100)]

convert : lint fix (#18507)

Henry147147 [Wed, 31 Dec 2025 11:13:23 +0000 (06:13 -0500)]

mtmd : Adding support for Nvidia Music Flamingo Model (#18470)

* Initial commit, debugging q5_k_s quant

* Made hf_to_gguf extend whisper to reduce code duplication

* addressed convert_hf_to_gguf pull request issue

---------

Co-authored-by: Henry D <redacted>

gatbontonpc [Wed, 31 Dec 2025 08:39:48 +0000 (00:39 -0800)]

metal : add count_equal op (#18314)

* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <redacted>

Johannes Gäßler [Wed, 31 Dec 2025 08:37:00 +0000 (09:37 +0100)]

CUDA: fix KQ max calculation (#18487)

Georgi Gerganov [Wed, 31 Dec 2025 07:53:48 +0000 (09:53 +0200)]

metal : remove BF16 x F16 kernels (#18456)

Aman Gupta [Wed, 31 Dec 2025 06:23:44 +0000 (14:23 +0800)]

sycl: add newline at the end of CMakeLists.txt (#18503)

Rahul Sathe [Wed, 31 Dec 2025 01:08:44 +0000 (06:38 +0530)]

Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (#18345)

* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

Sigbjørn Skjæret [Tue, 30 Dec 2025 21:28:53 +0000 (22:28 +0100)]

docker : add CUDA 13.1 image build (#18441)

* add updated cuda-new.Dockerfile for Ubuntu 24.04 compatibility

* add cuda13 build

Bart Louwers [Tue, 30 Dec 2025 21:13:49 +0000 (22:13 +0100)]

docs : document that JSON Schema is not available to model when using response_format (#18492)

* Document unsupported JSON Schema annotations

Add note about unsupported JSON Schema annotations.

* Update README.md

* Update README.md

* Update README.md

Aldehir Rojas [Tue, 30 Dec 2025 18:00:57 +0000 (12:00 -0600)]

common : default content to an empty string (#18485)

* common : default content to an empty string

* common : fix tests that break when content != null

Daniel Bevenius [Tue, 30 Dec 2025 16:20:14 +0000 (17:20 +0100)]

llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489)

Xuan-Son Nguyen [Tue, 30 Dec 2025 14:53:12 +0000 (15:53 +0100)]

lora: count lora nodes in graph_max_nodes (#18469)

* lora: count lora nodes in graph_max_nodes

* 3 nodes per weight

* 4 nodes

* keep track n_lora_nodes from llama_model

* fix assert

* rm redundant header

* common: load adapters before context creation

* use 6 nodes

Jay Zenith [Tue, 30 Dec 2025 14:27:49 +0000 (06:27 -0800)]

sampling: reuse token data buffer in llama_sampler_sample (#18365)

* sampling: reuse token data buffer in llama_sampler_sample

* move cur buffer before timing section, after samplers

* minor : fix build

---------

Co-authored-by: Georgi Gerganov <redacted>

Jeff Bolz [Tue, 30 Dec 2025 12:11:13 +0000 (06:11 -0600)]

server: fix files built redundantly (#18474)

Charles Xu [Tue, 30 Dec 2025 12:04:53 +0000 (13:04 +0100)]

kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458)

* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments

Aman Gupta [Tue, 30 Dec 2025 09:40:46 +0000 (17:40 +0800)]

CUDA: add log line when mxfp4 acceleration is used (#18483)

* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features

Daniel Bevenius [Tue, 30 Dec 2025 09:13:12 +0000 (10:13 +0100)]

model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)

This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.

The motivation for this is that, currently, if the converted model file
name differs from the original model directory/name, the verification
script will look for the wrong .bin file instead of the one that was
generated when running the converted model.

This is similar to the change made for the embedding models script in
commit db81d5ec4b0a9cb19e98c4533731c9554eb025db ("model-conversion :
use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"),
but we also verify the embeddings for causal models.

Xuan-Son Nguyen [Mon, 29 Dec 2025 20:42:11 +0000 (21:42 +0100)]

webui: fix prompt progress ETA calculation (#18468)

* webui: fix prompt progress ETA calculation

* handle case done === 0

Pascal [Mon, 29 Dec 2025 18:32:21 +0000 (19:32 +0100)]

Webui/prompt processing progress (#18300)

* webui: display prompt preprocessing progress

* webui: add percentage/ETA and exclude cached tokens from progress

Address review feedback from ngxson

* webui: add minutes and first chunk (0%) case

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <redacted>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <redacted>

* webui: address review feedback from allozaur

* chore: update webui build output

* webui: address review feedback from allozaur

* nit

* chore: update webui build output

* feat: Enhance chat processing state

* feat: Improve chat processing statistics UI

* chore: update webui build output

* feat: Add live generation statistics to processing state hook

* feat: Persist prompt processing stats in hook for better UX

* refactor: Enhance ChatMessageStatistics for live stream display

* feat: Implement enhanced live chat statistics into assistant message

* chore: update webui build output

* fix: Proper tab for each stage of prompt processing/generation

* chore: update webui build output

* fix: Improved ETA calculation & display logic

* chore: update webui build output

* feat: Simplify logic & remove ETA from prompt progress

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>

Johannes Gäßler [Mon, 29 Dec 2025 16:58:20 +0000 (17:58 +0100)]

CUDA: fix replacement of bad archs in CMake (#18457)

wbtek [Mon, 29 Dec 2025 16:12:48 +0000 (01:12 +0900)]

server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)

* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <redacted>

Xuan-Son Nguyen [Mon, 29 Dec 2025 15:01:32 +0000 (16:01 +0100)]

contributing: tighten AI usage policy (#18388)

* contributing: tighten AI usage policy

* refactor AGENTS.md

* proofreading

* update contributing

* add claude.md

* add trailing newline

* add note about dishonest practices

* rm point about dishonest

* rm requirement watermarking

* add .gemini/settings.json

* allow initially AI-generated content

* revise

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <redacted>

* improve

* trailing space

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>

* update

---------

Co-authored-by: Johannes Gäßler <redacted>

Naco Siren [Mon, 29 Dec 2025 13:51:13 +0000 (05:51 -0800)]

android: routine maintenance - Dec 2025 (#18338)

* Fix `msg` typo

* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.

* UI polish: stack new message change from below; fix GGUF margin not in view port

* Bug fixes: rare race condition when the main thread updates the view and the default thread updates messages at the same time; user input not disabled during generation.

* Bump dependencies' versions; Deprecated outdated dsl usage.

Georgi Gerganov [Mon, 29 Dec 2025 13:34:41 +0000 (15:34 +0200)]

server : handle closed connection for tasks (#18459)

Daniel Bevenius [Mon, 29 Dec 2025 12:37:02 +0000 (13:37 +0100)]

model-conversion : add device option to embd run orig model (#18386)

This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.
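
A device option of this kind could be declared as in the sketch below. This is a hedged illustration only: the flag name `--device` is taken from the related run-org-model.py commit in this log, while the `auto` default, the help text, and the `build_parser` helper are assumptions and may not match the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a device selection option (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Run the original model")
    parser.add_argument(
        "--device",
        choices=["cpu", "cuda", "mps", "auto"],  # values listed in the commit message
        default="auto",  # assumed default
        help="device to run the model on",
    )
    return parser
```

Restricting the values with `choices` lets argparse reject an unsupported device name before any model is loaded.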

Héctor Estrada Moreno [Mon, 29 Dec 2025 11:21:13 +0000 (05:21 -0600)]

retrieval : use at most n_seq_max chunks (#18400)

o7si [Mon, 29 Dec 2025 09:07:49 +0000 (17:07 +0800)]

common: fix return value check for setpriority (#18412)

* common: fix return value check for setpriority

* tools: add logging for process priority setting

Johannes Gäßler [Mon, 29 Dec 2025 08:35:42 +0000 (09:35 +0100)]

CUDA: Blackwell features for non-native builds (#18436)

Aman Gupta [Mon, 29 Dec 2025 06:07:17 +0000 (14:07 +0800)]

cuda: fix race condition in cumsum (#18448)

* ggml-cuda: fix race condition in cumsum

* remove unnecessary sync_threads

Tim Neumann [Sun, 28 Dec 2025 23:29:23 +0000 (00:29 +0100)]

ci : re-enable rocm build on amd64 (#18439)

This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913.

I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).

A quick attempt at building an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.

The `runs_on` option was added to match the other entries.

uvos [Sun, 28 Dec 2025 19:12:55 +0000 (20:12 +0100)]

HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)

momonga [Sun, 28 Dec 2025 16:28:31 +0000 (01:28 +0900)]

model : Plamo3 support (#17304)

* plamo3

* fix plamo3

* clean code

* clean up the code

* fix diff

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* add chat_template if exist

* clean up the code

* fix cpu-backend

* chore: whitespace trim fix + typo fix

* Fix: address review feedback

* restore `FREQ_BASE_SWA` constant

* Fix: address review feedback2

* Fix:typecheck

* Fix: address review feedback3

* final cleanup

---------

Co-authored-by: mmngays <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

Aman Gupta [Sun, 28 Dec 2025 12:53:36 +0000 (20:53 +0800)]

Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426)

o7si [Sun, 28 Dec 2025 10:34:41 +0000 (18:34 +0800)]

rpc: fix segfault on invalid endpoint format (#18387)

* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection

Johannes Gäßler [Sun, 28 Dec 2025 09:52:09 +0000 (10:52 +0100)]

llama-fit-params: fix step size for last device (#18415)

Johannes Gäßler [Sun, 28 Dec 2025 09:50:56 +0000 (10:50 +0100)]

github: update issue templates [no ci] (#18410)

* github: update issue templates [no ci]

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Xuan-Son Nguyen [Sun, 28 Dec 2025 08:57:04 +0000 (09:57 +0100)]

mtmd: clarify that we no longer accept AI-generated PRs (#18406)

Boian Berberov [Sun, 28 Dec 2025 07:33:29 +0000 (07:33 +0000)]

cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (#18186)

* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966

QDelta [Sun, 28 Dec 2025 01:33:14 +0000 (20:33 -0500)]

ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)

lhez [Sat, 27 Dec 2025 23:51:14 +0000 (15:51 -0800)]

opencl: allow resizing transpose buffers (#18384)

* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code

Johannes Gäßler [Sat, 27 Dec 2025 19:20:45 +0000 (20:20 +0100)]

llama-fit-params: fix overflow check (#18354)

Johannes Gäßler [Sat, 27 Dec 2025 19:18:35 +0000 (20:18 +0100)]

llama: fix magic number of 999 for GPU layers (#18266)

* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* encapsulate n_gpu_layers, split_mode

Aman Gupta [Sat, 27 Dec 2025 11:56:27 +0000 (19:56 +0800)]

ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)

Johannes Gäßler [Sat, 27 Dec 2025 08:59:19 +0000 (09:59 +0100)]

llama_fit_params: return enum for fail vs. error (#18374)

Johannes Gäßler [Sat, 27 Dec 2025 08:56:04 +0000 (09:56 +0100)]

llama-fit-params: fix Gemma 3 calculation (#18372)

Jeff Bolz [Fri, 26 Dec 2025 22:12:58 +0000 (16:12 -0600)]

vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)

Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
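
The idea can be illustrated on the CPU with the sketch below. It is a simplified Python model and not the Vulkan shader code: the function names, the `rows_per_workgroup` tiling parameter, and the assumption that each workgroup handles a fixed number of rows for one expert are all hypothetical.

```python
from collections import Counter


def count_expert_usage(expert_ids: list[list[int]], n_experts: int) -> list[int]:
    """Count how many times each expert is selected across all tokens."""
    counts = Counter(e for row in expert_ids for e in row)
    return [counts.get(e, 0) for e in range(n_experts)]


def needed_workgroups(counts: list[int], rows_per_workgroup: int) -> list[int]:
    """Workgroups each expert needs; 0 means all of its workgroups can be discarded."""
    # Ceiling division without floating point.
    return [-(-c // rows_per_workgroup) for c in counts]
```

In this model, a workgroup whose index is at or beyond the value computed for its expert has no rows to process and can exit immediately.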

Jeff Bolz [Fri, 26 Dec 2025 17:15:50 +0000 (11:15 -0600)]

vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)

* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.

Jeff Bolz [Fri, 26 Dec 2025 17:15:02 +0000 (11:15 -0600)]

vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)

Eve [Fri, 26 Dec 2025 17:12:11 +0000 (17:12 +0000)]

vulkan: small dequantization improvements (#18380)

* iq4_xs

* quants

Jeff Bolz [Fri, 26 Dec 2025 16:00:57 +0000 (10:00 -0600)]

vulkan: Support UPSCALE w/antialias (#18327)

Jeff Bolz [Fri, 26 Dec 2025 15:53:46 +0000 (09:53 -0600)]

vulkan: handle rope with large number of rows (#18306)

o7si [Fri, 26 Dec 2025 15:35:29 +0000 (23:35 +0800)]

server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)

* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot

Francisco Herrera [Fri, 26 Dec 2025 02:34:30 +0000 (21:34 -0500)]

docs: added note for pre SYCL Intel hardware (#18016)

Specify that it's for pre sycl hardware

0Marble [Fri, 26 Dec 2025 01:12:04 +0000 (09:12 +0800)]

CANN: implement the SSM_CONV operator (#17737)

* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <redacted>
Co-authored-by: Sujin Kang, <redacted>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <redacted>

Aman Gupta [Thu, 25 Dec 2025 17:35:14 +0000 (01:35 +0800)]

ggml-cuda: fix regex for arch list (#18371)

* ggml-cuda: fix regex for arch list

* make regex exact

Aman Gupta [Thu, 25 Dec 2025 15:55:38 +0000 (23:55 +0800)]

cuda: optimize cumsum cub path (#18362)

* cuda: optimize cumsum cub path

* remove heavy perf test

Aman Gupta [Thu, 25 Dec 2025 14:12:11 +0000 (22:12 +0800)]

ggml-cuda: fix blackwell native builds (#18361)

* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures by 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>

Penglin Cai [Thu, 25 Dec 2025 08:46:09 +0000 (16:46 +0800)]

CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934)

* CONV_TRANSPOSE_1D kernel_size>255

* remove condition check

* fix the bug of type conversion

* removing trailing whitespaces

* fix: return true in the switch case

Aadeshveer Singh [Thu, 25 Dec 2025 04:11:13 +0000 (09:41 +0530)]

ggml : optimize cuda cumsum fallback kernel (#18343)

Xuan-Son Nguyen [Wed, 24 Dec 2025 22:47:49 +0000 (23:47 +0100)]

server: (router) add stop-timeout option (#18350)

* server: (router) add stop-timeout option

* also allow stop while loading

* add docs

* unload_lru: also wait for unload to complete

Xuan-Son Nguyen [Wed, 24 Dec 2025 22:07:08 +0000 (23:07 +0100)]

model: support MiMo-V2-Flash (#18328)

* mimov2: convert ok

* rename mimov2 --> mimo2

* fix conversion

* runnable not incorrect

* use sink

* add_sliding_window_pattern

* add swa and per-layer n_head_kv

* correct params

* somewhat working

* correct gating func

* nits

* mimo2: wire RMS eps + MoE bias + converter guards

* add co-author

Co-authored-by: Aaryan-Kapoor <redacted>

* use add_rope_freq_base_swa

---------

Co-authored-by: Aaryan Kapoor <redacted>
Co-authored-by: Aaryan-Kapoor <redacted>

Aadeshveer Singh [Wed, 24 Dec 2025 14:57:38 +0000 (20:27 +0530)]

fit-params : fix race condition in fit-params output (#18276)

Aman Gupta [Wed, 24 Dec 2025 14:28:26 +0000 (22:28 +0800)]

CUDA: experimental native mxfp4 support for blackwell (#17906)

* CUDA: experimental native mxfp4 support for blackwell

* optimize load_tiles

* optimize quantize_mxfp4

* cleanup

* first pass review: formatting

* use interleaved layout for mma

* mmq: add assert for size

* use __nv_fp4x4_e2m1

* use iter_k as 512, cleanup

* Use 1200 as blackwell instead of 1000

* address review comments

* mmq: fix stride

* quantize.cu: use reference impl of e8m0 scale

* address review comments

* add 120f-virtual + minor fixes

---------

Co-authored-by: Aman Gupta <aman>

Saba Fallah [Wed, 24 Dec 2025 13:02:36 +0000 (14:02 +0100)]

model : support for LlamaBidirectionalModel architecture (#18220)

* model: llama-embed-nemotron

* minor: python lint

* changed arch-name

* templated llm_build_llama to be used for both llama and llama-embed arch

Jeff Bolz [Wed, 24 Dec 2025 11:36:34 +0000 (05:36 -0600)]

vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302)

Wang Weixuan [Wed, 24 Dec 2025 09:50:24 +0000 (17:50 +0800)]

CANN : refactor ACL graph cache (#17752)

Move the graph property checking code into methods of LRU cache.

Signed-off-by: Wang Weixuan <redacted>

Jesse Ikonen [Wed, 24 Dec 2025 09:19:47 +0000 (11:19 +0200)]

docs: Fix typos in SYCL documentation (#18269)

Ruben Ortlam [Wed, 24 Dec 2025 07:59:14 +0000 (08:59 +0100)]

vulkan: use fewer FA rows for small cache runs (#18280)

TianHao324 [Wed, 24 Dec 2025 06:55:33 +0000 (14:55 +0800)]

CANN: Uses yarn_ramp cache in ROPE (#17725)

ddh0 [Wed, 24 Dec 2025 06:19:12 +0000 (00:19 -0600)]

common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg (#18267)

Xuan-Son Nguyen [Tue, 23 Dec 2025 20:49:05 +0000 (21:49 +0100)]

server: return_progress to also report 0% processing state (#18305)

Pascal [Tue, 23 Dec 2025 14:48:03 +0000 (15:48 +0100)]

webui: apply webui_settings on first load (#18223)

* webui: apply webui_settings on first load

The webui_settings from /props were not applied on initial load
when default_generation_settings.params was null

Now syncs whenever serverProps is available, regardless of params,
works for both single-model and router modes

* chore: update webui build output

Xuan-Son Nguyen [Tue, 23 Dec 2025 13:39:36 +0000 (14:39 +0100)]

server: fix crash with model not having BOS/EOS (#18321)

Daniel Bevenius [Tue, 23 Dec 2025 13:07:25 +0000 (14:07 +0100)]

model-conversion : add device option to run-org-model.py (#18318)

* model-conversion : add device option to run-org-model.py

This commit refactors the `run-org-model.py` script to include a
`--device` argument, to allow users to specify the device on which to
run the model (e.g., cpu, cuda, mps, auto).
It also extracts a few common functions to prepare for future changes
that will remove some of the code duplication that currently exists in
the embedding scripts.

The Makefile has also been updated to pass the device argument, for
example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```

* fix error handling and remove parser reference

This commit fixes the error handling which previously referenced an
undefined 'parser' variable.

Chris Rohlf [Tue, 23 Dec 2025 09:56:49 +0000 (04:56 -0500)]

rpc : add check for rpc buffer type (#18242)

nullname [Tue, 23 Dec 2025 07:13:24 +0000 (15:13 +0800)]

ggml-hexagon: create generalized functions for cpu side op (#17500)

* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

* add comment

* refactor: remove redundant buffer checks in hexagon supported operations

* wip

* add missing include to fix weak symbol warning

* add ggml_hexagon_op_generic

* refactor: simplify tensor operation initialization and buffer management in hexagon implementation

* refactor: streamline hexagon operation initialization and buffer management

* refactor: update function signatures and streamline request handling in hexagon operations

* wip

* ggml-hexagon: clean up code formatting and improve unary operation handling

* wip

* rename

* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

refactor: remove redundant buffer checks in hexagon supported operations

add missing include to fix weak symbol warning

add ggml_hexagon_op_generic

refactor: simplify tensor operation initialization and buffer management in hexagon implementation

refactor: streamline hexagon operation initialization and buffer management

refactor: update function signatures and streamline request handling in hexagon operations

ggml-hexagon: clean up code formatting and improve unary operation handling

fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

# Conflicts:
# ggml/src/ggml-hexagon/ggml-hexagon.cpp

* hexagon: fix merge conflicts

* hexagon: minor cleanup for buffer support checks

* hexagon: factor out op_desc and the overall op logging

* hexagon: further simplify and cleanup op dispatch logic

* snapdragon: update adb scripts to use llama-cli and llama-completion

* fix pipeline failure

---------

Co-authored-by: Max Krasnyansky <redacted>

Daniel Bevenius [Tue, 23 Dec 2025 06:27:37 +0000 (07:27 +0100)]

model-conversion : add trust_remote_code for embedding scripts (#18288)

This commit adds the trust_remote_code=True parameter when loading
models and configurations in the embedding model conversion scripts.
It also adds a cast to float for models that might use a data type that
is not supported by python, for example bfloat16.

The motivation for this is that some models may require custom code to
be executed during loading, and setting trust_remote_code to True avoids
getting prompted for confirmation.

Future work will consolidate the embedding conversion scripts with the
causal conversion scripts to avoid code duplication. But in the
meantime it would be nice to have this fix in place.

Neo Zhang [Tue, 23 Dec 2025 04:59:12 +0000 (12:59 +0800)]

[SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290)

* replace llama-cli by llama-completion to rm the impact to test script

* Update examples/sycl/run-llama2.sh

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update examples/sycl/run-llama3.sh

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update examples/sycl/run-llama3.sh

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update examples/sycl/win-run-llama2.bat

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update examples/sycl/win-run-llama3.bat

Co-authored-by: Sigbjørn Skjæret <redacted>

---------

Co-authored-by: Neo Zhang Jianyu <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

Alessandro98-git [Tue, 23 Dec 2025 02:04:57 +0000 (03:04 +0100)]

model : fix div-by-zero for Nemotron V2 (#18309)

* llama-model : fix Nemotron V2 crash by moving MoE parameters calculation

* remove whitespace

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

Ryan Mangeno [Mon, 22 Dec 2025 23:28:19 +0000 (18:28 -0500)]

model : Granite Embedding support (#15641)

ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!

* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only

* conversion now working, hf -> gguf

* working on support, now working on building graph

* some cleanup

* cleanup

* continuing

* correct tensor shape for qkv

* fixed tensor mappings and working on building graph

* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this

* cleanup

* cleanup

* cleanup

* more cleanup

* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more

* added cls token per previous modern bert attempt, still working on checking out the rest

* fixed pre tokenizer and still working through previous pr

* working through previous attempt, implemented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer

* fixed pre tokenizer

* working on swa with local and global alternating attention

* some cleanup and now fails on build attn

* starting to work, and some cleanup, currently failing on last layer construction in graph build

* alternating rope implemented and modern bert graph build succeeds

* fixed assert for equal ubatch seq

* cleanup

* added mask check in vocab

* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values

* reuse variable

* removed repeat

* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL

* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...

* more modular hparam setting

* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>

* removed redundant hparam set

* enums for model sizes

* conversion for modern-bert model supported rather than just granite-small

* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <redacted>
* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <redacted>
* fixed ordering of enum for freq_base_swa

* fixed where I added residual, now gives much much better embeddings~

* readded cacheless logic

* removing whitespace

* conversion now working for swa pattern - dense every n layers

* modern bert put into separate src file

* removing whitespace

* fixed whitespace and newline errors in editorconfig job

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* better naming convention, n_swa_pattern -> swa_period

* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support

* fixing pyright type-check fail

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-saver.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* added descriptions in llama-model

* fixed tensor mappings for conversion

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
* mapping name for size

* nits

* unused

---------

Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Gabe Goodhart <redacted>
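
The commit message above tracks progress by the cosine similarity between the HF embeddings and the llama.cpp embeddings. A minimal sketch of how such a comparison might be computed follows. The function name and the sample vectors are made up for illustration and are not taken from the PR.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical values standing in for a reference embedding and the
    # embedding produced by the implementation under test.
    reference = [0.10, 0.30, -0.20, 0.70]
    candidate = [0.10, 0.29, -0.21, 0.69]
    print(f"cosine similarity: {cosine_similarity(reference, candidate):.4f}")
```

A value close to 1.0 indicates the two embeddings point in nearly the same direction, which is why low values such as the 0.05 and 0.24 mentioned above signal a graph that is still wrong.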

commit | commitdiff | tree

compilade [Mon, 22 Dec 2025 19:25:16 +0000 (14:25 -0500)]

gguf-py : do not align the data start offset (#18291)

The safetensors format doesn't require alignment.
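
For context, a safetensors file is an 8-byte little-endian header length, a JSON header of that length, and then the raw tensor bytes, so the data region starts at 8 plus the header length with no rounding. The sketch below illustrates that layout on an in-memory blob. The helper names are hypothetical and this is not the gguf-py code touched by the commit.

```python
import json
import struct


def build_blob(header: dict, data: bytes) -> bytes:
    """Assemble a safetensors-style blob: u64 length, JSON header, raw data."""
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def data_start_offset(blob: bytes) -> int:
    """Return where tensor data begins: right after the header, unaligned."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    return 8 + header_len


def read_tensor_bytes(blob: bytes, name: str) -> bytes:
    """Slice one tensor's bytes using its data_offsets from the header."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    begin, end = header[name]["data_offsets"]  # relative to the data region
    start = 8 + header_len
    return blob[start + begin:start + end]


if __name__ == "__main__":
    # Hypothetical one-tensor file used only to exercise the helpers.
    header = {"w": {"dtype": "U8", "shape": [3], "data_offsets": [0, 3]}}
    blob = build_blob(header, b"\x01\x02\x03")
    print(data_start_offset(blob), read_tensor_bytes(blob, "w"))
```

Rounding the start offset up to an alignment boundary would shift every slice whenever the header length is not already a multiple of that boundary.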

commit | commitdiff | tree

Shouyu [Mon, 22 Dec 2025 18:56:52 +0000 (13:56 -0500)]

ggml-hexagon: gelu optimization (#18151)

* feat: working gelu with src0 put on vtcm

* feat: gelu ping-pong for both in and out

* fix: fix compile error

* break: distinguish dma ddr->vtcm and vtcm->ddr operation

* fix: fix dma queue size

* break: update dma api to either pop src or dst ptr

* fix: fix activation vtcm allocation issue for src1 when swapped

* refactor: ping-pong gelu logic to avoid unnecessary if else

* dma: improved queue interface and prefetch handling

* gelu: fix N+2 block prefetch

---------

Co-authored-by: Max Krasnyansky <redacted>

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 22 Dec 2025 18:30:19 +0000 (19:30 +0100)]

gen-docs: automatically update markdown file (#18294)

* gen-docs: automatically update markdown file

* also strip whitespace

* do not add extra newline

* update TOC

commit | commitdiff | tree

Taimur Ahmad [Mon, 22 Dec 2025 18:20:23 +0000 (23:20 +0500)]

llamafile: add rvv support for sgemm kernels (#18199)

Co-authored-by: Rehan Qasim <redacted>

commit | commitdiff | tree

lhez [Mon, 22 Dec 2025 18:19:01 +0000 (10:19 -0800)]

opencl: unpack q4_0 for adreno in get_tensor (#18278)

commit | commitdiff | tree

Jeff Bolz [Mon, 22 Dec 2025 17:03:13 +0000 (11:03 -0600)]

vulkan: Extend rope fusions to allow mrope (#18264)

Extend the test-backend-ops tests as well.

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 22 Dec 2025 13:23:34 +0000 (14:23 +0100)]

server: prevent data race from HTTP threads (#18263)

* server: prevent data race from HTTP threads

* fix params

* fix default_generation_settings

* nits: make handle_completions_impl look less strange

* stricter const

* fix GGML_ASSERT(idx < states.size())

* move index to be managed by server_response_reader

* http: make sure req & res lifecycle are tied together

* fix compile

* fix buggy index handling

* fix data race for lora endpoint

* nits: fix shadow variable

* nits: revert redundant changes

* nits: correct naming for json_webui_settings

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 22 Dec 2025 12:21:43 +0000 (13:21 +0100)]

server: fix data race in to_json_anthropic (#18283)

commit | commitdiff | tree

Mattt [Mon, 22 Dec 2025 12:11:46 +0000 (04:11 -0800)]

release: update release workflow to store XCFramework as Zip file (#18284)

* Update release workflow to store XCFramework as Zip file

* Add comments to document Zip file requirement for XCFramework

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Aaron Teo [Mon, 22 Dec 2025 12:03:49 +0000 (20:03 +0800)]

convert: rework ftype heuristics (#18214)

* convert: rework ftype heuristics

Signed-off-by: Aaron Teo <redacted>
convert: fix type-check

Signed-off-by: Aaron Teo <redacted>
convert: bring back heuristics comment

Signed-off-by: Aaron Teo <redacted>
* convert: revert to using first tensor

Signed-off-by: Aaron Teo <redacted>
* convert: rework heuristics logic

Signed-off-by: Aaron Teo <redacted>
* convert: rm redundant float32 check

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Xuan-Son Nguyen [Mon, 22 Dec 2025 11:22:01 +0000 (12:22 +0100)]

server: (docs) remove mention about extra_args (#18262)

commit | commitdiff | tree

Johannes Gäßler [Mon, 22 Dec 2025 10:00:37 +0000 (11:00 +0100)]

tool/ex/tests: consistently free ctx, then model (#18168)

Packaging of ggml-org/llama.cpp
