git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log



ren [Fri, 27 Mar 2026 07:05:21 +0000 (00:05 -0700)]

metal : Fix dimension constraint violation in matmul2d descriptor (#21048)

Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
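The constraint quoted above can be written as a small predicate. This is a minimal sketch in Python, not the Metal probe itself; the function name and the idea of passing the descriptor dimensions as a tuple are assumptions made for illustration.

```python
def has_multiple_of_16(dims):
    """Return True if at least one dimension is a positive multiple of 16."""
    return any(d > 0 and d % 16 == 0 for d in dims)


# Hypothetical descriptor shapes, for illustration only.
print(has_multiple_of_16((8, 8, 8)))   # no dimension qualifies
print(has_multiple_of_16((8, 16, 8)))  # the middle dimension qualifies
```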


KokerZhou [Fri, 27 Mar 2026 00:53:00 +0000 (08:53 +0800)]

CANN: update docker images to 8.5.0 and improve CANN.md (#20801)

* cann: update docker images to 8.5.0

- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0

Move to newer stable releases.

* cann: update CANN.md

* Update CANN.md to include BF16 support

Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.

* Fix formatting issues in CANN.md

Fix 234: Trailing whitespace


Saba Fallah [Thu, 26 Mar 2026 23:07:55 +0000 (00:07 +0100)]

mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027)

* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr

* Update src/llama-quant.cpp

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>


uvos [Thu, 26 Mar 2026 22:06:33 +0000 (23:06 +0100)]

hip: use fnuz fp8 for conversion on CDNA3 (#21040)


Xuan-Son Nguyen [Thu, 26 Mar 2026 19:44:00 +0000 (20:44 +0100)]

ci: pin external actions to exact commit SHA (#21033)
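Pinning means that the ref after `@` in a workflow `uses:` value is a full 40-character commit SHA rather than a tag or branch. A minimal sketch of such a check, assuming the `owner/repo@ref` form; the helper name `is_pinned` and the placeholder SHA are hypothetical and not part of the CI scripts.

```python
import re

# A full Git SHA-1 commit id is 40 hexadecimal characters.
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def is_pinned(uses):
    """Return True if a `uses:` value like 'owner/repo@ref' has a full-SHA ref."""
    _, sep, ref = uses.rpartition("@")
    return bool(sep) and _FULL_SHA.match(ref) is not None


placeholder_sha = "0" * 40  # placeholder, not a real commit
print(is_pinned("owner/action@v4"))
print(is_pinned(f"owner/action@{placeholder_sha}"))
```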


Adrien Gallouët [Thu, 26 Mar 2026 19:34:23 +0000 (20:34 +0100)]

common : add getpwuid fallback for HF cache when HOME is not set (#21035)

Signed-off-by: Adrien Gallouët <redacted>
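The fallback order described in the title can be sketched as follows. This is an illustrative Python version for POSIX systems, not the C++ code from the commit; the function name `home_dir` and the injectable `lookup` parameter are assumptions added to make the sketch easy to test.

```python
import os
import pwd  # POSIX only


def home_dir(env=None, lookup=None):
    """Return $HOME if set and non-empty, else the passwd entry's home directory."""
    env = os.environ if env is None else env
    home = env.get("HOME")
    if home:
        return home
    if lookup is None:
        # getpwuid() reads the user database entry for the current uid.
        lookup = lambda: pwd.getpwuid(os.getuid()).pw_dir
    return lookup()


print(home_dir({"HOME": "/home/example"}))
```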


Xuan-Son Nguyen [Thu, 26 Mar 2026 18:49:20 +0000 (19:49 +0100)]

mtmd: refactor image preprocessing (#21031)

* mtmd: refactor image pre-processing

* correct some places

* correct lfm2

* fix deepseek-ocr on server

* add comment to clarify about mtmd_image_preprocessor_dyn_size


lhez [Thu, 26 Mar 2026 15:52:21 +0000 (08:52 -0700)]

opencl: allow large buffer for adreno (#20997)


Michael Wand [Thu, 26 Mar 2026 15:52:06 +0000 (08:52 -0700)]

convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505)

* convert : fix Qwen3.5 NVFP4 conversion

* Updated copilot concerns and rebased

* move into _LinearAttentionVReorderBase and simplify

* --flake

* new_name not needed

* Added input_scale to gguf

* Fixed input_scale addition as tensor

* Added input scale to loader and named _in_s

* Update convert_hf_to_gguf.py

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>


Pavel Zloi [Thu, 26 Mar 2026 15:49:09 +0000 (18:49 +0300)]

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011)

* Support of ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* chkhsh for ruGPT3XL model added

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel

---------

Co-authored-by: Sigbjørn Skjæret <redacted>


Adrien Gallouët [Thu, 26 Mar 2026 14:37:18 +0000 (15:37 +0100)]

common : filter out imatrix when finding models (#21023)

Signed-off-by: Adrien Gallouët <redacted>


ihb2032 [Thu, 26 Mar 2026 11:08:41 +0000 (19:08 +0800)]

fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (#20888)

Signed-off-by: ihb2032 <redacted>
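In a RISC-V ISA string, single-letter extensions follow a fixed canonical order and multi-letter extensions come afterwards, separated by underscores. A minimal sketch of building such a string, assuming only a partial order (`i m a f d q c v`); the helper name is hypothetical, other letters are deliberately rejected, and multi-letter extensions are kept in the order given. It does not reproduce the CMake logic touched by the commit.

```python
# Partial canonical order of single-letter extensions; letters not listed
# here are deliberately left out of this sketch.
_ORDER = "imafdqcv"


def canonical_isa(base, letters, multi=()):
    """Build an ISA string such as 'rv64imcv_zfh' from unordered inputs."""
    unknown = set(letters) - set(_ORDER)
    if unknown:
        raise ValueError(f"unhandled extensions: {sorted(unknown)}")
    ordered = "".join(ch for ch in _ORDER if ch in letters)
    return "_".join([base + ordered, *multi])


print(canonical_isa("rv64", "vcim", ["zfh"]))
```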


Adrien Gallouët [Thu, 26 Mar 2026 11:04:57 +0000 (12:04 +0100)]

common : make LLAMA_CACHE the one cache for everything (#21009)

Signed-off-by: Adrien Gallouët <redacted>


Adrien Gallouët [Thu, 26 Mar 2026 11:04:37 +0000 (12:04 +0100)]

common : fix split model migration (#21019)

Sadly, the manifest does not list all required files; I honestly thought
it was the case.

Without the files listed we don't have the sha256, so if the first file
is valid, and all others have the correct size, then we can assume we
are good and do the migration...

Here is my test:

    $ find /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ build/bin/llama-server
    ================================================================================
    WARNING: Migrating cache to HuggingFace cache directory
      Old cache: /home/angt/.cache/llama.cpp/
      New cache: /home/angt/.cache/huggingface/hub
    This one-time migration moves models previously downloaded with -hf
    from the legacy llama.cpp cache to the standard HuggingFace cache.
    Models downloaded with --model-url are not affected.
    ================================================================================
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    migrate_old_cache_to_hf_cache: migration complete, deleting manifest: /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ find /home/angt/.cache/llama.cpp /home/angt/.cache/huggingface
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/huggingface
    /home/angt/.cache/huggingface/hub
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/50d019817c2626eb9e8a41f361ff5bfa538757e6f708a3076cd3356354a75694
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/7b273e1dbfab11dc67dce479deb5923fef27c39cbf56a20b3a928a47b77dab3c
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs/main
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf

Signed-off-by: Adrien Gallouët <redacted>
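The heuristic described in this commit message (hash the first file, compare only sizes for the rest) can be sketched as below. This is an illustration, not the migration code: the function names are hypothetical, and where the expected sizes come from is an assumption, so here they are simply passed in by the caller.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large models are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_looks_complete(first_path, first_sha256, other_sizes):
    """Check the first split by hash and the remaining splits by size only.

    other_sizes maps each remaining file path to its expected size in bytes.
    """
    if sha256_of(first_path) != first_sha256:
        return False
    return all(
        Path(p).is_file() and Path(p).stat().st_size == size
        for p, size in other_sizes.items()
    )
```

A size match is weaker than a hash match, which is why the commit message calls it an assumption rather than a verification.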


Michael Wand [Thu, 26 Mar 2026 08:54:03 +0000 (01:54 -0700)]

ggml-cuda: Add NVFP4 dp4a kernel (#20644)

Added check for dst_t to cuda_cast template for float
Restored ggml_cuda_ue4m3_to_fp32, changed vecdot ints to int32ts
Added CUDART/HIP Check and HIP/fp8 include
Added NVFP4 to Test-backend-ops
Added hip_fp8_e4m3 to __nv_fp8_e4m3 typedef

---------

Co-authored-by: Johannes Gäßler <redacted>


SamareshSingh [Thu, 26 Mar 2026 07:14:36 +0000 (02:14 -0500)]

imatrix : fix crash when using --show-statistics with zero counts (#19532)

* imatrix: fix crash when using --show-statistics with zero counts

Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.

Fix for the bug #19190

* imatrix: lower log level for zero-count skip message to DBG
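The kind of guard described above can be sketched in a few lines. This assumes the statistic is a per-entry sum divided by a per-entry count, which the commit message does not spell out; the function names are hypothetical and the code is not taken from the imatrix tool.

```python
def mean_per_entry(sums, counts):
    """Return sums[i] / counts[i] for every entry whose count is non-zero."""
    if len(sums) != len(counts):
        raise ValueError("sums and counts must have the same length")
    means = []
    for total, count in zip(sums, counts):
        if count == 0:
            continue  # skip zero counts instead of dividing by zero
        means.append(total / count)
    return means


def average(values):
    """Average of a possibly empty vector; an empty vector yields 0.0."""
    return sum(values) / len(values) if values else 0.0


print(mean_per_entry([4.0, 1.0, 9.0], [2, 0, 3]))
```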


Yihao Wang [Thu, 26 Mar 2026 02:19:14 +0000 (19:19 -0700)]

CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094)

* Refactor CUDA 2D transpose implementation to support multiple kernel types and improve parameter handling

- Introduced a `conv2d_transpose_params` struct for better parameter management.
- Updated `conv2d_transpose_kernel` to be templated for different kernel types (float and half).
- Modified `ggml_cuda_conv_2d_transpose_p0` to handle both F16 and F32 kernel types.
- Enhanced test cases to validate functionality for both kernel types.

* Refactor test cases for 2D convolution transpose to support dynamic kernel types

- Updated `test_conv_transpose_2d` structure to improve parameter handling by reordering constructor arguments.
- Enhanced test case generation to iterate over kernel types, allowing for flexible testing of different configurations.
- Removed hardcoded kernel type instances in favor of a loop for better maintainability and scalability.

* Refactor ggml_compute_forward_conv_transpose_2d to support both F16 and F32 tensor types.

* Refactor conv2d transpose kernel to use a template for kernel type, enhancing flexibility for different data types.
Update test cases to include both F16 and F32 tensor types for comprehensive coverage.

* Update ggml/src/ggml-cuda/conv2d-transpose.cu

Co-authored-by: Aman Gupta <redacted>
* Update ggml/src/ggml-cpu/ggml-cpu.c

Co-authored-by: Aman Gupta <redacted>
* Refactor conv2d transpose implementation by removing the conv2d_transpose_params struct and dispatching with direct kernel launch.

* Enhance cpu conv2d transpose implementation by introducing a templated kernel type for improved flexibility with F16 and F32 data types.

---------

Co-authored-by: Aman Gupta <redacted>


Adrien Gallouët [Wed, 25 Mar 2026 21:28:04 +0000 (22:28 +0100)]

common : do not delete old files from the old cache when updating (#21000)

Signed-off-by: Adrien Gallouët <redacted>


Saba Fallah [Wed, 25 Mar 2026 18:57:40 +0000 (19:57 +0100)]

mtmd: Add DeepSeekOCR Support (#17400)

* mtmd: llama.cpp DeepSeekOCR support
init commit

* loading sam tensors

* mtmd: fix vision model processing

* deepseek-ocr clip-vit model impl

* mtmd: add DeepSeek-OCR LM support with standard attention

* mtmd: successfully runs DeepSeek-OCR LM in llama-cli

* mtmd: Fix RoPE type for DeepSeek-OCR LM.

* loading LM
testing Vision model loading

* sam warmup working

* sam erroneous return corrected

* clip-vit:  corrected cls_embd concat

* clip-vit: model convert  qkv_proj split

* corrected combining of image encoders' results

* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model

* concat image_newline and image_seperator tokens

* visual_model warmup (technically) works

* window partitioning using standard ggml ops

* sam implementation without using CPU only ops

* clip: fixed warnings

* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr

* mtmd: fix get_rel_pos

* mtmd: fixed the wrong scaler for get_rel_pos

* image encoding technically works but the output can't be checked since image decoding fails

* mtmd: minor changes

* mtmd: add native resolution support

* - image encoding debugged
- issues fixed, mainly related to wrong config like n_patches etc.
- configs need to be corrected in the converter

* mtmd: correct token order

* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4

* mtmd: quick fix token order

* mtmd: fix dangling pointer

* mtmd: SAM numerically works

* mtmd: debug CLIP-L (vit_pre_ln)

* mtmd: debug CLIP-L & first working DeepSeek-OCR model

* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work

* mtmd: simplify SAM patch embedding

* mtmd: adapt Pillow image resizing function

* mtmd:  simplify DeepSeek-OCR dynamic resolution preprocessing

* mtmd: remove --dsocr-mode argument

* mtmd: refactor code & remove unused helper functions

* mtmd: fix tensor names for image newlines and view separator

* clean up

* reverting automatically removed spaces

* reverting automatically removed spaces

* mtmd: fixed bad ocr check in Deepseek2 (LM)

* mtmd: support combined QKV projection in build_vit

* using common build_attn in sam

* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option

* mtmd: minor fix

* minor formatting and style

* fixed flake8 lint issues

* minor editorconfig-check fixes

* minor editorconfig-check fixes

* mtmd: simplify get_rel_pos

* mtmd: make sam hparams configurable

* mtmd: add detailed comments for resize_bicubic_pillow

* mtmd: fixed wrong input setting

* mtmd: convert model in FP16

* mtmd: minor fix

* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template

* fix: test-1.jpg OCR issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution

* minor: editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn

* minor: editorconfig-check fix

* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR

* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909

* refactoring, one single builder function and static helpers

* added deepseek-ocr test to tests.sh

* minor formatting fixes

* check with fixed expected results

* minor formatting

* editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042

* minor
- added GLM-4.6V to big tests
- added missing deps for python test

* convert: minor fix

* mtmd: format code

* convert: quick fix

* convert: quick fix

* minor python formatting

* fixed merge build issue

* merge resolved
- fixed issues in convert
- tested several deepseek models

* minor fix

* minor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions

* - cleaning commented out code

* fixing instability issues by reintroducing resize_bicubic_pillow

* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr

* rename fc_w --> mm_fc_w

* add links to OCR discussion

* cleaner loading code

* add missing .weight to some tensors

* add default jinja template (to be used by server)

* move test model to ggml-org

* rolling back upscale change

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: bluebread <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>


Adrien Gallouët [Wed, 25 Mar 2026 18:41:01 +0000 (19:41 +0100)]

common : fix verbosity setup (#20989)

The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files..)

Signed-off-by: Adrien Gallouët <redacted>


Adrien Gallouët [Wed, 25 Mar 2026 18:18:06 +0000 (19:18 +0100)]

common : fix gguf selection in common_list_cached_models (#20996)

Signed-off-by: Adrien Gallouët <redacted>


uvos [Wed, 25 Mar 2026 18:00:37 +0000 (19:00 +0100)]

ci : fix parsing of vgpr counts in hip-quality-check (#20987)

* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interleaved with another from a different process

* Return warning ignore

* obey pep8: double space before inline comments

* add # noqa: NP100 for other prints too

* Add script changes to cause autotrigger


Saba Fallah [Wed, 25 Mar 2026 17:33:42 +0000 (18:33 +0100)]

model: codefuse-ai/F2LLM-v2 support


Dowon [Wed, 25 Mar 2026 17:12:38 +0000 (02:12 +0900)]

model : allow causal_attn and pooling_type on all architectures (#20973)

* models : allow causal_attn and pooling_type on all architectures

* fix: move location


Aparna M P [Wed, 25 Mar 2026 16:43:12 +0000 (22:13 +0530)]

snapdragon: add missing features to WoS scripts to achieve parity with ADB scripts (#20884)

* Add missing features to WoS scripts to achieve parity with ADB scripts

* Fix line-ending in run-mtmd.ps1

Signed-off-by: Max Krasnyansky <redacted>
---------

Signed-off-by: Max Krasnyansky <redacted>
Co-authored-by: Max Krasnyansky <redacted>


Shreya Jain [Wed, 25 Mar 2026 16:36:27 +0000 (09:36 -0700)]

Use docker in build-android.yml (#20928)

* use docker instead of SDK separately

* fix whitespaces

* Update .github/workflows/build-android.yml

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Max Krasnyansky <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>


Aman Gupta [Wed, 25 Mar 2026 13:17:27 +0000 (21:17 +0800)]

llama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984)


Masato Nakasaka [Wed, 25 Mar 2026 13:00:49 +0000 (06:00 -0700)]

ci: Allow ninja to be used during unit test (#20742)

* Remove make dependency

* Added option to specify Ninja generator

* use ninja-build as default for several CI

* Revert "use ninja-build as default for several CI"

This reverts commit f552c4559b85e222aab37f654da764af4283fee7.

* changed to use plain strings rather than arrays

* Enabled ninja build by default for experimentation

* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners

Signed-off-by: Aaron Teo <redacted>
* Enabled ninja build by default on self-hosted envs for experimentation

* ci: revert generator to ninja instead of ninja multi-config

Signed-off-by: Aaron Teo <redacted>
* ci: install ninja-build for self-hosted workflows

Signed-off-by: Aaron Teo <redacted>
* ci: revert ninja from self-hosted runners

Signed-off-by: Aaron Teo <redacted>
* ci: missed one self-hosted step

Signed-off-by: Aaron Teo <redacted>
* ci: fix windows ci errors from an erroneous revert

Signed-off-by: Aaron Teo <redacted>
* Added explicit build types for Ninja

Also reverted some needless changes

* ci: use ninja multi-config for vulkan-x64 build

Signed-off-by: Aaron Teo <redacted>
* added time command to measure build time

* Keeping some configs to use Ninja which show improvement

* minor fix based on review

Co-authored-by: Aaron Teo <redacted>
* ci: rm `time` from custom containers

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Aaron Teo <redacted>
Co-authored-by: Aaron Teo <redacted>


Georgi Gerganov [Wed, 25 Mar 2026 12:46:40 +0000 (14:46 +0200)]

ci : disable self-hosted mac jobs (#20985)


Xuan-Son Nguyen [Wed, 25 Mar 2026 11:22:48 +0000 (12:22 +0100)]

jinja: fix macro with kwargs (#20960)

* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <redacted>
* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <redacted>


Francisco Herrera [Wed, 25 Mar 2026 11:12:50 +0000 (06:12 -0500)]

gguf-split : clarify operation of gguf-split (#19749)

* clarify operation of gguf-split

so that you don't have to find out by trial and error

* formatting


Johannes Gäßler [Wed, 25 Mar 2026 10:53:16 +0000 (11:53 +0100)]

llama: fix llama-model-saver (#20503)

* llama : add fd-based model loading via llama_model_load_from_fd

* llama : address review feedback for fd-based model loading

* llama : use FILE pointer instead of fd in public API

* llama : use FILE pointer consistently, address review feedback

* fixup

* fix tensor names

* fix llama-model-saver

* roundtrip tests

* fixup

* refactor tests

* fix prints

* fix model saving

* fix CI, disable Chameleon

* print seed

---------

Co-authored-by: Siddhesh2377 <redacted>


Aleksander Grygier [Wed, 25 Mar 2026 10:47:33 +0000 (11:47 +0100)]

webui: Fix editing assistant message without branching (#20944)

* fix: Editing assistant response without branching

* chore: update webui build output


Pascal [Wed, 25 Mar 2026 10:02:32 +0000 (11:02 +0100)]

Add SLEEPING status to the WebUI model selector (#20949)

* webui: handle sleeping model status, fix favourite -> favorite

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <redacted>
* webui: fix optional event parameter in sleeping model onclick

* typo

* webui: restore orange sleeping indicator dot with hover unload

* chore: update webui build output

* webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore

* chore: update webui build output

* webui: fix favourite -> favorite (UK -> US spelling) everywhere

Address review feedback from WhyNotHugo

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <redacted>


yikechayedan [Wed, 25 Mar 2026 09:51:26 +0000 (17:51 +0800)]

android : fix-pointer-dangling (#20974)


Neo Zhang [Wed, 25 Mar 2026 09:48:37 +0000 (17:48 +0800)]

sycl : fix wrong variable check by assert (#20903)

* fix wrong variable check by assert

* use GGML api


Sigbjørn Skjæret [Wed, 25 Mar 2026 09:04:59 +0000 (10:04 +0100)]

ci : bump gguf publish python version (#20982)


Sigbjørn Skjæret [Wed, 25 Mar 2026 08:55:37 +0000 (09:55 +0100)]

ci : limit requirements versions (#20980)

* set requests version

* limit versions outside requirements


Dowon [Wed, 25 Mar 2026 08:37:59 +0000 (17:37 +0900)]

convert : register Qwen3Model architecture (#20967)


Ravi Panchumarthy [Wed, 25 Mar 2026 08:33:51 +0000 (01:33 -0700)]

docs : Update OpenVINO backend docs (#20968)

* OpenVINO doc updates

* Update docs/backend/OPENVINO.md

Co-authored-by: Aaron Teo <redacted>
---------

Co-authored-by: Aaron Teo <redacted>


Georgi Gerganov [Tue, 24 Mar 2026 15:00:30 +0000 (17:00 +0200)]

models : move the token embedding norms to the first layer (#20943)

* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing


Aman Gupta [Tue, 24 Mar 2026 12:47:00 +0000 (20:47 +0800)]

ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)


Alessandro de Oliveira Faria (A.K.A.CABELO) [Tue, 24 Mar 2026 12:33:33 +0000 (09:33 -0300)]

vendor : update cpp-httplib to 0.39.0 (#20933)


Adrien Gallouët [Tue, 24 Mar 2026 12:33:14 +0000 (13:33 +0100)]

common : fix get_gguf_split_info (#20946)

Signed-off-by: Adrien Gallouët <redacted>
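Split files elsewhere in this log are named like `stories260K-f32-00001-of-00002.gguf`. A minimal sketch of extracting the prefix, index and count from such a name follows; it assumes the five-digit `-NNNNN-of-NNNNN.gguf` suffix seen in those file names, uses a hypothetical helper name, and is not the implementation of `get_gguf_split_info`.

```python
import re

# Suffix pattern assumed from file names such as
# 'stories260K-f32-00001-of-00002.gguf'.
_SPLIT_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<count>\d{5})\.gguf$")


def parse_split_name(filename):
    """Return (prefix, index, count) for a split file name, or None."""
    m = _SPLIT_RE.match(filename)
    if m is None:
        return None
    index, count = int(m["index"]), int(m["count"])
    if not 1 <= index <= count:
        return None  # reject impossible combinations such as 00003-of-00002
    return m["prefix"], index, count


print(parse_split_name("stories260K-f32-00001-of-00002.gguf"))
print(parse_split_name("model.gguf"))
```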


BlueMöhre [Tue, 24 Mar 2026 12:17:45 +0000 (13:17 +0100)]

WebUI: fix edit msg form textarea height (#20830)

* autoresize textarea on mount

* allow textarea to grow to same height as rendered messages

* add UI build file


Adrien Gallouët [Tue, 24 Mar 2026 09:35:07 +0000 (10:35 +0100)]

readme : clarify MODEL_ENDPOINT usage (#20941)

Signed-off-by: Adrien Gallouët <redacted>


Adrien Gallouët [Tue, 24 Mar 2026 08:24:39 +0000 (09:24 +0100)]

common : add a WARNING for HF cache migration (#20935)

Signed-off-by: Adrien Gallouët <redacted>


nuri [Tue, 24 Mar 2026 08:13:07 +0000 (17:13 +0900)]

metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)

Co-authored-by: nryoo <redacted>


Georgi Gerganov [Tue, 24 Mar 2026 08:03:09 +0000 (10:03 +0200)]

metal : add FA instantiations for HSK=512, HSV=512 (#20902)


Aaron Teo [Tue, 24 Mar 2026 06:41:10 +0000 (14:41 +0800)]

issues: add openvino backends (#20932)

Signed-off-by: Aaron Teo <redacted>


Adrien Gallouët [Tue, 24 Mar 2026 06:30:33 +0000 (07:30 +0100)]

common : add standard Hugging Face cache support (#20775)

* common : add standard Hugging Face cache support

- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup

Signed-off-by: Adrien Gallouët <redacted>
* Check with the quant tag

Signed-off-by: Adrien Gallouët <redacted>
* Cleanup

Signed-off-by: Adrien Gallouët <redacted>
* Improve error handling and report API errors

Signed-off-by: Adrien Gallouët <redacted>
* Restore common_cached_model_info and align mmproj filtering

Signed-off-by: Adrien Gallouët <redacted>
* Prefer main when getting cached ref

Signed-off-by: Adrien Gallouët <redacted>
* Use cached files when HF API fails

Signed-off-by: Adrien Gallouët <redacted>
* Use final_path..

Signed-off-by: Adrien Gallouët <redacted>
* Check all inputs

Signed-off-by: Adrien Gallouët <redacted>
---------

Signed-off-by: Adrien Gallouët <redacted>
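The Hugging Face cache layout that appears in the migration output earlier in this log (`models--<owner>--<repo>/snapshots/<commit>/<file>`) can be reproduced with a small path helper. This is a sketch with a hypothetical function name; it only builds the path and does not create the `blobs` or `refs` entries that also appear in that listing.

```python
from pathlib import PurePosixPath


def hf_snapshot_path(hub_root, repo_id, commit, filename):
    """Build the snapshot path for a file of a model repo in the HF hub cache."""
    repo_dir = "models--" + repo_id.replace("/", "--")
    return PurePosixPath(hub_root) / repo_dir / "snapshots" / commit / filename


# Values taken from the migration test output in this log.
print(hf_snapshot_path(
    "/home/angt/.cache/huggingface/hub",
    "angt/test-split-model-stories260K",
    "68c3ea2061e8c7688455fab07597dde0f4d7f0db",
    "stories260K-f32-00001-of-00002.gguf",
))
```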


Aman Gupta [Tue, 24 Mar 2026 04:57:57 +0000 (12:57 +0800)]

llama-fit: fix regex pattern for gate_up tensors (#20910)

* llama-fit: fix regex pattern for gate_up tensors

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>


Aldehir Rojas [Tue, 24 Mar 2026 03:21:47 +0000 (22:21 -0500)]

common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912)


Max Krasnyansky [Mon, 23 Mar 2026 22:33:49 +0000 (15:33 -0700)]

hexagon: general DMA and Binary Op fixes for large strides (#20918)

* hex-dma: make chained dma the default to handle newer models

This also includes some new instrumentation that we can remove later.

* hexagon: add uint32 dump helper

* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv

ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset
spans page boundaries.

* hexagon: update ssm-conv to make base-addr compute a bit easier to read

* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)

* hex-bin: fix incorrect stride logic

* hexagon: make sure repack buffs are dumped for verbose > 2

* hex-bin: consistently use dma_queue_push even for dummy dst transactions

* hex-dma: start using 2d-wide mode on v75 and up

This removes the need to deal with the 16-bit limitation for the strides.

* hex-bin: cleanup kernel selection logic

* hex-bin: cleanup binary op core and fix transposed tensor handling

* snapdragon: update run-bench to use larger ubatch and fa-on
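Two constraints mentioned in this commit are plain address arithmetic: a gather whose base+offset range crosses a page boundary, and sizes or strides that must fit in a fixed number of bits (16-bit strides, 24-bit sizes in 1d mode). A minimal sketch of both checks follows; the function names are hypothetical, and the 4096-byte page size is only an illustrative value, not the VTCM page size.

```python
def spans_page_boundary(base, nbytes, page_size):
    """True if the byte range [base, base + nbytes) touches more than one page."""
    if nbytes <= 0:
        return False
    return base // page_size != (base + nbytes - 1) // page_size


def fits_unsigned(value, bits):
    """True if value is representable as an unsigned integer of the given width."""
    return 0 <= value < (1 << bits)


PAGE = 4096  # illustrative page size only
print(spans_page_boundary(4090, 10, PAGE))   # range crosses into the next page
print(fits_unsigned(70000, 16))              # too large for a 16-bit stride
print(fits_unsigned(70000, 24))              # fits in a 24-bit size field
```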


Max Krasnyansky [Mon, 23 Mar 2026 21:57:18 +0000 (14:57 -0700)]

Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)

* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon


lhez [Mon, 23 Mar 2026 19:44:18 +0000 (12:44 -0700)]

opencl: add q6_K gemm and gemv kernels for Adreno (#20089)

* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast


las7 [Mon, 23 Mar 2026 17:54:57 +0000 (10:54 -0700)]

rpc : RCE patch (#20908)


Xuan-Son Nguyen [Mon, 23 Mar 2026 15:59:02 +0000 (16:59 +0100)]

contrib: add "Requirements" section to PR template (#20841)

* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <redacted>


Davi Henrique Linhares [Mon, 23 Mar 2026 13:47:34 +0000 (10:47 -0300)]

devops: upgraded default oneAPI version (#20731)


Aleksander Grygier [Mon, 23 Mar 2026 13:30:55 +0000 (14:30 +0100)]

webui: Improve chat form positioning (#20901)


Geo Maciolek [Mon, 23 Mar 2026 13:24:55 +0000 (09:24 -0400)]

docs: Fix typo in reasoning flag documentation (#20780)

Tested to verify - the typo is just in the docs, not the actual flag.


Georgi Gerganov [Mon, 23 Mar 2026 12:08:46 +0000 (14:08 +0200)]

memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887)


Eric Zhang [Mon, 23 Mar 2026 11:33:38 +0000 (19:33 +0800)]

docs : rerun llama-gen-docs to include new CLI args (#20892)


Xuan-Son Nguyen [Mon, 23 Mar 2026 11:22:46 +0000 (12:22 +0100)]

server: use httplib dynamic threads (#20817)

* server: use httplib dynamic threads

* change to n_threads_http + 1024


Georgi Gerganov [Mon, 23 Mar 2026 11:21:41 +0000 (13:21 +0200)]

ai : update gh permissions (#20895)


Pascal [Mon, 23 Mar 2026 10:25:35 +0000 (11:25 +0100)]

webui: fix --webui-config-file settings not applied on load (#20823)

* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output


Rashid Ul Islam [Mon, 23 Mar 2026 07:45:34 +0000 (13:15 +0530)]

metal: add CONV_3D (#19927)

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* metal:add conv_3d backend

Rebased with master and resolved conflicts.

* Resolved issues related to changes in variable names

* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now

---------

Co-authored-by: Georgi Gerganov <redacted>


Jhen-Jie Hong [Mon, 23 Mar 2026 07:35:27 +0000 (15:35 +0800)]

common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859)


Chenguang Li [Mon, 23 Mar 2026 07:24:06 +0000 (15:24 +0800)]

CANN: add RoPE cache preload before ACL graph capture (#20747)

ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
and allocation branches are skipped

commit | commitdiff | tree

Dan Hoffman [Mon, 23 Mar 2026 06:05:37 +0000 (23:05 -0700)]

fix(openvino): explicit memset in buffer_context allocation (#20857)

* fix(openvino): explicit memset in buffer_context allocation

* minor

---------

Co-authored-by: Dan Hoffman <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

shaofeiqi [Mon, 23 Mar 2026 05:45:11 +0000 (22:45 -0700)]

opencl: add flattened Q4_K mv and general Q4_K mm (#20773)

commit | commitdiff | tree

bssrdf [Mon, 23 Mar 2026 00:06:30 +0000 (20:06 -0400)]

mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847)

* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)

* add min/max dynamic patch to gguf meta

* clean up

* simplified handling min/max dynamic patch

* reuse llava_uhd logic for slice images

* provide default values for older models

* flake8

* prevent writing 0 value to gguf

* remove duplicated resolution candidates with a better algorithm

* fix indentation

* format

* add protection from divide by zero

* change to 0 to be safe

---------

Co-authored-by: Xuan Son Nguyen <redacted>

commit | commitdiff | tree

DorianRudolph [Mon, 23 Mar 2026 00:04:14 +0000 (01:04 +0100)]

mtmd : fix LightOnOCR image preprocessing (#20877)

commit | commitdiff | tree

Xuan-Son Nguyen [Sun, 22 Mar 2026 17:33:52 +0000 (18:33 +0100)]

server: allow router to report child instances sleep status (#20849)

* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits

commit | commitdiff | tree

Johannes Gäßler [Sun, 22 Mar 2026 16:53:33 +0000 (17:53 +0100)]

CUDA: fix BF16 FA compilation (#20865)

commit | commitdiff | tree

Sigbjørn Skjæret [Sun, 22 Mar 2026 16:45:10 +0000 (17:45 +0100)]

jinja : refactor token advancement (#20864)

* refactor token advancement

* exercise sub-expressions

commit | commitdiff | tree

Evgeny Kurnevsky [Sun, 22 Mar 2026 14:29:22 +0000 (15:29 +0100)]

server: fix Host header (#20843)

The Host header should include the port when it is not the default.
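
The rule can be illustrated with a minimal Python sketch. This is not the server's C++ code: the helper name `host_header` and the default-port table are assumptions made for illustration.

```python
# Hypothetical helper; llama-server's real implementation is C++ and may differ.
DEFAULT_PORTS = {"http": 80, "https": 443}

def host_header(scheme: str, host: str, port: int) -> str:
    """Return the Host header value, adding the port only when it is non-default."""
    if DEFAULT_PORTS.get(scheme) == port:
        return host
    return f"{host}:{port}"
```

With this sketch, `host_header("http", "localhost", 8811)` yields a value that carries the port, while a request to port 80 yields the bare host name.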

commit | commitdiff | tree

Neo Zhang [Sun, 22 Mar 2026 14:06:27 +0000 (22:06 +0800)]

support bf16 and quantized type (#20803)

commit | commitdiff | tree

Patrick Buckley [Sun, 22 Mar 2026 10:05:51 +0000 (03:05 -0700)]

ggml-cuda: native bf16 flash attention for vec kernel (#20525)

* ggml-cuda: native bf16 flash attention for vec and tile kernels

mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo

* ggml-cuda: address code owner review feedback

reverted tile kernel changes to avoid larger refactor

* fix ci failures on turing and hip

* fix bf16 vec kernel compile on hip v_dot2 platforms

* add comments

---------

Co-authored-by: Johannes Gäßler <redacted>

commit | commitdiff | tree

Gaurav Garg [Sun, 22 Mar 2026 08:49:35 +0000 (14:19 +0530)]

[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635)

* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* tab to space

commit | commitdiff | tree

ddh0 [Sat, 21 Mar 2026 21:00:26 +0000 (16:00 -0500)]

misc : prefer ggml-org models in docs and examples (#20827)

* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file

commit | commitdiff | tree

Andrea Arcangeli [Sat, 21 Mar 2026 17:43:35 +0000 (13:43 -0400)]

common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)

* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
⚫ Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
("~/llama.cpp/tests/test-grammar-integration.cpp")
("~/llama.cpp/grammars/./list.gbnf")
("~/llama.cpp/grammars/./json_arr.gbnf")
("~/llama.cpp/grammars/./json.gbnf")
("~/llama.cpp/grammars/./japanese.gbnf")
("~/llama.cpp/grammars/./english.gbnf")
("~/llama.cpp/grammars/./chess.gbnf")
("~/llama.cpp/grammars/./c.gbnf")
("~/llama.cpp/grammars/./arithmetic.gbnf")
("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
⚫ Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to iterative"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
("~/llama.cpp/tests/test-grammar-integration.cpp")
("~/llama.cpp/grammars/./list.gbnf")
("~/llama.cpp/grammars/./json_arr.gbnf")
("~/llama.cpp/grammars/./json.gbnf")
("~/llama.cpp/grammars/./japanese.gbnf")
("~/llama.cpp/grammars/./english.gbnf")
("~/llama.cpp/grammars/./chess.gbnf")
("~/llama.cpp/grammars/./c.gbnf")
("~/llama.cpp/grammars/./arithmetic.gbnf")
("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.
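
The combination of an explicit worklist and a set of visited stacks can be sketched as follows. This is a simplified Python model, not the code in llama-grammar.cpp: the grammar is a hypothetical dict mapping rule names to lists of alternatives, stacks are tuples with the top at the end, and Python's hash-based `set` stands in for the `std::set` described above.

```python
def advance_stacks(rules, start_stack):
    """Expand non-terminals on top of a stack until every resulting stack
    is empty or has a terminal on top. Iterative, with cycle detection."""
    todo = [tuple(start_stack)]  # explicit worklist replaces recursion
    seen = set()                 # stacks that were already expanded
    result = []
    while todo:
        stack = todo.pop()
        if stack in seen:
            continue             # nullable loop: this stack was handled before
        seen.add(stack)
        if not stack or stack[-1] not in rules:
            result.append(stack)  # empty stack or terminal on top
            continue
        top, rest = stack[-1], stack[:-1]
        for alt in rules[top]:
            # Push the alternative reversed so its first symbol ends up on top.
            todo.append(rest + tuple(reversed(alt)))
    return result

# Toy encoding of `root ::= ( [x]* )*` (assumed for illustration):
#   root  ::= inner root | <empty>
#   inner ::= "x" inner  | <empty>
RULES = {
    "root": [["inner", "root"], []],
    "inner": [["x", "inner"], []],
}
```

In this toy grammar, expanding `inner` to the empty string leads back to a stack holding only `root`, which is where an unguarded expansion would loop forever. The `seen` check ends that derivation the second time the same stack appears.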

Co-developed-by: Piotr Wilkin (ilintar) <redacted>
* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
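
One plausible reading of the check, as a Python sketch: the parser multiplies the number of rules already generated by the upper repetition bound and rejects the grammar once the product passes a limit. The function name, the limit of 2000, and the exact quantities multiplied are assumptions; the commit message does not state them.

```python
MAX_REPETITION_RULES = 2000  # hypothetical limit; the real threshold is not given here

def check_repetition(n_prev_rules: int, max_times: int,
                     limit: int = MAX_REPETITION_RULES) -> int:
    """Return the estimated rule count for a {m,n} repetition,
    or raise if expanding it would exceed the limit."""
    estimated = n_prev_rules * max_times
    if estimated > limit:
        raise ValueError(
            f"repetition would generate about {estimated} rules (limit {limit})"
        )
    return estimated
```

Under these assumptions, the inner `{0,99}` of the reproducer passes with an estimate of 99, and the outer `{0,99}` is rejected because 99 * 99 exceeds the limit.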

commit | commitdiff | tree

Tom Hillbrunner [Sat, 21 Mar 2026 17:35:00 +0000 (18:35 +0100)]

context : use n_embd_out for pooled embedding extraction (#20840)

The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This triggered an out-of-bounds tensor read assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
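
The size mismatch can be shown with a small Python sketch. The helper `read_pooled` and the flat list standing in for the pooled tensor are assumptions; only the two widths (16384 and 4096) come from the description above.

```python
N_EMBD_INP = 16384  # input embedding width quoted above (qwen3vl with deepstack)
N_EMBD_OUT = 4096   # floats per sequence in the pooled embedding tensor

def read_pooled(buf, seq_idx, n_embd):
    """Return the slice for one sequence, refusing to read past the buffer."""
    start = seq_idx * n_embd
    end = start + n_embd
    if end > len(buf):
        raise IndexError(f"read [{start}, {end}) exceeds buffer of {len(buf)} floats")
    return buf[start:end]

# Pooled output for two sequences, N_EMBD_OUT floats each.
pooled = [0.0] * (2 * N_EMBD_OUT)
```

Reading with `N_EMBD_OUT` stays inside the buffer for both sequences, whereas reading with `N_EMBD_INP` already overruns it for the first sequence.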

commit | commitdiff | tree

Xuan-Son Nguyen [Sat, 21 Mar 2026 14:50:16 +0000 (15:50 +0100)]

docs : be explicit about banning accounts that violate policy (#19593)

commit | commitdiff | tree

y198 [Sat, 21 Mar 2026 13:59:43 +0000 (20:59 +0700)]

fix(rpc): prevent division by zero in deserialize_tensor (#20712)

rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server.

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.
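
A minimal Python sketch of the guard, under stated assumptions: the type table, its values, and the dict returned for a valid tensor are illustrative stand-ins, not ggml's actual data structures. Python raises `ZeroDivisionError` where the C++ server would receive SIGFPE.

```python
# Illustrative table: type id -> (block size, bytes per block).
TYPE_TRAITS = {
    0: (1, 4),    # plain type: 1 element per block
    2: (32, 18),  # block-quantized type
    4: (0, 0),    # deprecated type: block size 0
}

def row_size(type_id, n_elements):
    blck_size, type_size = TYPE_TRAITS[type_id]
    # Divides by the block size; fails if blck_size == 0.
    return type_size * n_elements // blck_size

def deserialize_tensor(type_id, n_elements):
    traits = TYPE_TRAITS.get(type_id)
    if traits is None or traits[0] == 0:
        return None  # reject unknown or deprecated types before dividing
    return {"type": type_id, "nbytes": row_size(type_id, n_elements)}
```

The check runs before `row_size` is ever called, so a message carrying a type with block size 0 is rejected instead of reaching the division.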

(Note: This was originally reported via a Security Advisory, and the maintainer suggested submitting a patch here.)

* style: remove trailing whitespace

commit | commitdiff | tree

Michael Wand [Sat, 21 Mar 2026 11:35:21 +0000 (04:35 -0700)]

Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)

* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <redacted>
* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Sigbjørn Skjæret [Sat, 21 Mar 2026 07:54:34 +0000 (08:54 +0100)]

ci : switch from pyright to ty (#20826)

* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule

commit | commitdiff | tree

Matt Corallo [Sat, 21 Mar 2026 04:22:51 +0000 (04:22 +0000)]

Add shader count for Intel Arc Pro B60 (#20818)

commit | commitdiff | tree

Piotr Wilkin (ilintar) [Fri, 20 Mar 2026 23:19:04 +0000 (00:19 +0100)]

common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825)

commit | commitdiff | tree

shalinib-ibm [Fri, 20 Mar 2026 23:11:45 +0000 (04:41 +0530)]

ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)

Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within the kernel's register context, preventing unnecessary
stack spills.

Signed-off-by: Shalini Salomi Bodapati <redacted>

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Mar 2026 18:31:25 +0000 (20:31 +0200)]

ai : limit runtime of the agent (#20816)

commit | commitdiff | tree

James O'Leary [Fri, 20 Mar 2026 17:23:18 +0000 (10:23 -0700)]

common : fix typo in debug log ('extracft' -> 'extract') (#20807)

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Mar 2026 17:06:33 +0000 (19:06 +0200)]

ai : do not run bash commands in the prompt (#20810)

commit | commitdiff | tree

Victor Villar [Fri, 20 Mar 2026 14:16:09 +0000 (15:16 +0100)]

model : fix Granite Hybrid type check for 7B.A1B (#20795)

* Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B

* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <redacted>
---------

Co-authored-by: Sigbjørn Skjæret <redacted>

commit | commitdiff | tree

Xuan-Son Nguyen [Fri, 20 Mar 2026 13:03:50 +0000 (14:03 +0100)]

server: (doc) clarify in-scope and out-scope features (#20794)

* server: (doc) clarify in-scope and out-scope features

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Jeff Bolz [Fri, 20 Mar 2026 11:17:15 +0000 (06:17 -0500)]

vulkan: change gated_delta_net to shard a column across a subgroup (#20662)

* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391. I used an
LLM to port the CUDA code to Vulkan, and guided it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups

commit | commitdiff | tree

Ruikai Peng [Fri, 20 Mar 2026 09:31:34 +0000 (17:31 +0800)]

context: zero output buffer on allocation (#20781)

* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76.

The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.

Zero the buffer after allocation to prevent information disclosure
through the public logits API.

Found-by: Pwno
* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Ruikai Peng [Fri, 20 Mar 2026 09:17:58 +0000 (17:17 +0800)]

model: assert nextn_predict_layers to prevent underflow (#20783)

Address GHSA-645x-v54x-34w8.

When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers
can underflow (unsigned wrap), which corrupts n_layer_kv_from_start.

Assert nextn_predict_layers immediately after parsing the GGUF key.
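
The wrap-around can be reproduced in Python by masking to 32 bits. Both function names are hypothetical, the 32-bit width is an assumption, and the checked variant rejects the `>=` condition named above; the exact form of the real assertion is not shown in this message.

```python
UINT32_MASK = 0xFFFFFFFF  # assumes the counters are 32-bit unsigned

def kv_layers_unchecked(n_layer, nextn_predict_layers):
    # Emulates unsigned subtraction: a negative result wraps around.
    return (n_layer - nextn_predict_layers) & UINT32_MASK

def kv_layers_checked(n_layer, nextn_predict_layers):
    # Validate right after the value is parsed, before any arithmetic uses it.
    if nextn_predict_layers >= n_layer:
        raise ValueError("nextn_predict_layers must be smaller than n_layer")
    return n_layer - nextn_predict_layers
```

For example, `kv_layers_unchecked(4, 5)` wraps to 4294967295, while `kv_layers_checked(4, 5)` raises before the bogus layer count can propagate.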

Found-by: Pwno

commit | commitdiff | tree

Georgi Gerganov [Fri, 20 Mar 2026 09:13:12 +0000 (11:13 +0200)]

server : improve mtmd ctx checkpoints (#20726)

* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold

commit | commitdiff | tree

hipudding [Fri, 20 Mar 2026 09:08:39 +0000 (17:08 +0800)]

CANN: add BF16 support for core operators (#20152)

* CANN: add BF16 support for core operators

Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.

* CANN: skip NZ weight format for BF16 and add 310P compile guards

NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.

Packaging of ggml-org/llama.cpp
