git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
8 weeks ago ci: fix cross-compile sync issues (#12804)
bandoti [Thu, 1 May 2025 22:06:39 +0000 (19:06 -0300)]
ci: fix cross-compile sync issues (#12804)

8 weeks ago rpc : avoid uninitialized memory in serialize_tensor (#13210)
Justin Santa Barbara [Thu, 1 May 2025 21:32:11 +0000 (17:32 -0400)]
rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.
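
The pattern is the standard one for fixed-size structs that travel over the wire: zero the whole struct before filling it, so unused name bytes and alignment padding never carry stack garbage into the serialized message. A minimal sketch, with a hypothetical `wire_tensor` layout standing in for the real `rpc_tensor`:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

struct wire_tensor {
    uint32_t type;
    uint64_t ne[4];
    char     name[64];   // fixed-size name buffer
    char     padding[4]; // alignment padding is serialized too
};

wire_tensor serialize_tensor_sketch(uint32_t type, const uint64_t (&ne)[4],
                                    const std::string & name) {
    wire_tensor out;
    std::memset(&out, 0, sizeof(out)); // zero name + padding up front
    out.type = type;
    std::memcpy(out.ne, ne, sizeof(out.ne));
    std::strncpy(out.name, name.c_str(), sizeof(out.name) - 1); // stays NUL-terminated
    return out;
}
```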

8 weeks ago ggml: Don't assert fail when tensor data changes (#13222)
Jesse Gross [Thu, 1 May 2025 20:46:10 +0000 (13:46 -0700)]
ggml: Don't assert fail when tensor data changes (#13222)

The following scenario will cause an assertion failure in the graph
allocator:
 - Build and allocate a graph containing a tensor with a non-NULL data
   pointer
 - Build and allocate a new graph where that data is NULL

Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should
have been previously allocated based on the current graph but in
reality the previous graph was different. In this situation, we
should do a full reallocation pass.
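
A minimal sketch of the decision described above, with hypothetical names (the real logic lives in `ggml_gallocr_alloc_graph` / `ggml_gallocr_reserve`): during revalidation, a node whose external data pointer disappeared has no buffer assignment in the previous plan, and the correct response is a full reallocation rather than an assertion:

```cpp
#include <vector>

struct node_alloc_info {
    void * data;      // non-NULL if the tensor carries external data
    int    buffer_id; // -1 if the previous plan assigned no buffer
};

// Returns true when the previous allocation plan no longer matches the
// graph and a full reallocation pass is required.
bool graph_needs_realloc(const std::vector<node_alloc_info> & nodes) {
    for (const node_alloc_info & n : nodes) {
        // Previously this case hit GGML_ASSERT(talloc->buffer_id >= 0):
        // the data pointer became NULL, but the old plan never allocated
        // memory for this node because it used to carry its own data.
        if (n.data == nullptr && n.buffer_id < 0) {
            return true;
        }
    }
    return false;
}
```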

8 weeks ago build : fix build info on windows (#13239)
Diego Devesa [Thu, 1 May 2025 19:48:08 +0000 (21:48 +0200)]
build : fix build info on windows (#13239)

* build : fix build info on windows

* fix cuda host compiler msg

8 weeks ago clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size (#13237)
Loïc Carrère [Thu, 1 May 2025 19:32:21 +0000 (21:32 +0200)]
clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size (#13237)

8 weeks ago llama-chat : update GLM4 chat template (#13238)
matteo [Thu, 1 May 2025 19:16:38 +0000 (21:16 +0200)]
llama-chat : update GLM4 chat template (#13238)

* update GLM4 chat template

* Update chat template

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
8 weeks ago vulkan: Add bfloat16 support (#12554)
Jeff Bolz [Thu, 1 May 2025 18:49:39 +0000 (13:49 -0500)]
vulkan: Add bfloat16 support (#12554)

* vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
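
The "trivially promotes" remark works because bfloat16 is simply the high 16 bits of an IEEE-754 binary32 value, so widening is a single shift. A CPU-side sketch of the same promotion the scalar shader variant performs:

```cpp
#include <cstdint>
#include <cstring>

float bf16_to_fp32(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16; // bf16 = top half of fp32
    float f;
    std::memcpy(&f, &bits, sizeof(f)); // bit-exact reinterpretation
    return f;
}
```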

8 weeks ago vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191)
Jeff Bolz [Thu, 1 May 2025 18:19:31 +0000 (13:19 -0500)]
vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191)

* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader

8 weeks ago test: non-cont. b in test-backend-ops -o MUL_MAT (#13187)
Johannes Gäßler [Thu, 1 May 2025 18:18:56 +0000 (20:18 +0200)]
test: non-cont. b in test-backend-ops -o MUL_MAT (#13187)

8 weeks ago sync : ggml
Georgi Gerganov [Thu, 1 May 2025 14:07:13 +0000 (17:07 +0300)]
sync : ggml

ggml-ci

8 weeks ago whisper : add check that target name exists (whisper/3103)
Daniel Bevenius [Thu, 1 May 2025 08:05:24 +0000 (10:05 +0200)]
whisper : add check that target name exists (whisper/3103)

This commit adds a check to make sure that the target exists before
trying to add compile options to ignore warnings when using MSVC.

The motivation for this is that the build currently breaks depending on
the cmake options provided. With this fix it should be possible to build
even if the targets are not actually available.

Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104

8 weeks ago ggml : suppress Windows compiler warnings (whisper/3075)
Daniel Bevenius [Tue, 29 Apr 2025 13:47:55 +0000 (15:47 +0200)]
ggml : suppress Windows compiler warnings (whisper/3075)

* whisper: suppress Windows compiler warnings

This commit disables compiler warnings on Windows when using MSVC.

The motivation for these changes is that some compilers, for example
Windows MSVC, generate warnings for these conversions, and there are
quite a few of them. This makes it a little difficult to spot new
warnings that may be introduced, and it is also a problem for
users/embedders of ggml, who cannot easily separate these warnings
from their own.

* squash! whisper: suppress Windows compiler warnings

Move ggml related warnings into ggml. This commit also fixes the
indentation and adds a missing whitespace to the if statement.

8 weeks ago mtmd : add **vision** support for Mistral Small 3.1 (#13231)
Xuan-Son Nguyen [Thu, 1 May 2025 15:05:42 +0000 (17:05 +0200)]
mtmd : add **vision** support for Mistral Small 3.1 (#13231)

* convert ok

* load ok, missing patch merger

* ah sheet it works

* update llava/readme

* add test

* fix test

8 weeks ago arg : remove CURLINFO_EFFECTIVE_METHOD (#13228)
Xuan-Son Nguyen [Thu, 1 May 2025 08:23:25 +0000 (10:23 +0200)]
arg : remove CURLINFO_EFFECTIVE_METHOD (#13228)

8 weeks ago llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223)
Jared Van Bortel [Thu, 1 May 2025 07:09:41 +0000 (03:09 -0400)]
llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223)

8 weeks ago sync : ggml
Georgi Gerganov [Thu, 1 May 2025 06:59:02 +0000 (09:59 +0300)]
sync : ggml

8 weeks ago ggml : fix ggml_gallocr_ptr type (ggml/1205)
Diego Devesa [Wed, 30 Apr 2025 13:20:40 +0000 (15:20 +0200)]
ggml : fix ggml_gallocr_ptr type (ggml/1205)

8 weeks ago cuda : fix unused variable compile warning (whisper/0)
Georgi Gerganov [Thu, 24 Apr 2025 15:59:06 +0000 (18:59 +0300)]
cuda : fix unused variable compile warning (whisper/0)

ggml-ci

8 weeks ago CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199)
Johannes Gäßler [Wed, 30 Apr 2025 21:12:59 +0000 (23:12 +0200)]
CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199)

8 weeks ago arg : -hf do not fail if url mismatch (#13219)
Xuan-Son Nguyen [Wed, 30 Apr 2025 20:29:15 +0000 (22:29 +0200)]
arg : -hf do not fail if url mismatch (#13219)

* arg : -hf do not fail if url mismatch

* do not return if cannot parse metadata json

8 weeks ago fix typo: `n_ctx_pre_seq` -> `n_ctx_per_seq` (#13221)
ddh0 [Wed, 30 Apr 2025 20:28:43 +0000 (15:28 -0500)]
fix typo: `n_ctx_pre_seq` -> `n_ctx_per_seq` (#13221)

8 weeks ago convert : improve model arch handling (#13122)
Xuan-Son Nguyen [Wed, 30 Apr 2025 14:56:24 +0000 (16:56 +0200)]
convert : improve model arch handling (#13122)

* convert : improve model arch handling

* use AutoConfig

* rm trust_remote_code

* Update convert_hf_to_gguf.py

* fix self.block_count for vision

* fix NomicBertModel

8 weeks ago llava : remove duplicate include (#13207)
Tatsuya Tanaka [Wed, 30 Apr 2025 13:25:20 +0000 (22:25 +0900)]
llava : remove duplicate include (#13207)

8 weeks ago common : add -jf / --json-schema-file flag (#12011)
Olivier Chafik [Wed, 30 Apr 2025 12:52:35 +0000 (13:52 +0100)]
common : add -jf / --json-schema-file flag (#12011)

8 weeks ago vulkan: use uint array index to avoid glslang bug (#13193)
Jeff Bolz [Wed, 30 Apr 2025 12:38:37 +0000 (07:38 -0500)]
vulkan: use uint array index to avoid glslang bug (#13193)

8 weeks ago ggml : fix ppc64le build (#13176)
shalinib-ibm [Wed, 30 Apr 2025 11:17:08 +0000 (16:47 +0530)]
ggml : fix ppc64le build (#13176)

The build fails with a compilation error on PowerPC. This patch fixes it.

Tested with unit tests run via
 cmake --build <build_dir> && cd <build_dir> && make test

Signed-off-by: Shalini Salomi Bodapati <redacted>
8 weeks ago convert : correct typo image_mean --> image_std (#13208)
Xuan-Son Nguyen [Wed, 30 Apr 2025 11:06:15 +0000 (13:06 +0200)]
convert : correct typo image_mean --> image_std (#13208)

8 weeks ago feat(ggml-cpu): enable z17 compile (#13182)
Aaron Teo [Wed, 30 Apr 2025 09:47:35 +0000 (17:47 +0800)]
feat(ggml-cpu): enable z17 compile (#13182)

z17 compilation requires GCC 15.1.0 or later

Signed-off-by: Aaron Teo <redacted>
8 weeks ago arg : allow using -hf offline (#13202)
Xuan-Son Nguyen [Wed, 30 Apr 2025 08:46:32 +0000 (10:46 +0200)]
arg : allow using -hf offline (#13202)

* arg : allow using -hf offline

* add more comments in code [no ci]

8 weeks ago docker : do not build tests (#13204)
Xuan-Son Nguyen [Wed, 30 Apr 2025 08:44:07 +0000 (10:44 +0200)]
docker : do not build tests (#13204)

* docker : do not build tests

* include "ggml-cpu.h"

8 weeks ago rpc : fix cache directory initialization (#13188)
xiaofei [Wed, 30 Apr 2025 06:29:22 +0000 (14:29 +0800)]
rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <redacted>
8 weeks ago scripts: n_depth for compare-llama-bench [no ci] (#13201)
Johannes Gäßler [Tue, 29 Apr 2025 21:32:04 +0000 (23:32 +0200)]
scripts: n_depth for compare-llama-bench [no ci] (#13201)

8 weeks ago server : Prefilling assistant message in openai compatible API (#13174)
matteo [Tue, 29 Apr 2025 18:33:10 +0000 (20:33 +0200)]
server : Prefilling assistant message in openai compatible API (#13174)

* Prefilling assistant message in openai compatible API

* fixed indentation

* fixed code convention

* simplify method usage

* no more than one assistant message at end of messages

* merge checks into prefill code

* Update examples/server/utils.hpp

---------

Co-authored-by: matteo <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
8 weeks ago sampling : when top-k <= 0 -> noop (#13173)
Georgi Gerganov [Tue, 29 Apr 2025 17:22:57 +0000 (20:22 +0300)]
sampling : when top-k <= 0 -> noop (#13173)

ggml-ci
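
A sketch of the contract in the title (illustrative, not the actual llama.cpp sampler code): a non-positive k leaves the candidate list untouched instead of truncating it.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

void top_k(std::vector<float> & logits, int k) {
    if (k <= 0 || (std::size_t) k >= logits.size()) {
        return; // noop: keep all candidates
    }
    // keep only the k largest logits
    std::partial_sort(logits.begin(), logits.begin() + k, logits.end(),
                      std::greater<float>());
    logits.resize(k);
}
```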

8 weeks ago llama-bench: fixed size of fields to correctly map to values (#13183)
Alberto Cabrera Pérez [Tue, 29 Apr 2025 15:24:36 +0000 (16:24 +0100)]
llama-bench: fixed size of fields to correctly map to values (#13183)

8 weeks ago CUDA: fix non-cont. inputs for batched mat mul (#13155)
Johannes Gäßler [Tue, 29 Apr 2025 14:00:27 +0000 (16:00 +0200)]
CUDA: fix non-cont. inputs for batched mat mul (#13155)

8 weeks ago llama : llm_type order by size (#13177)
Sigbjørn Skjæret [Tue, 29 Apr 2025 11:25:53 +0000 (13:25 +0200)]
llama : llm_type order by size (#13177)

8 weeks ago mtmd : add qwen2vl and qwen2.5vl (#13141)
Xuan-Son Nguyen [Tue, 29 Apr 2025 09:47:04 +0000 (11:47 +0200)]
mtmd : add qwen2vl and qwen2.5vl (#13141)

* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics

8 weeks ago llama : set qwen3 model type sizes (#13175)
Sigbjørn Skjæret [Tue, 29 Apr 2025 09:00:31 +0000 (11:00 +0200)]
llama : set qwen3 model type sizes (#13175)

8 weeks ago llama-graph : fix text position for mrope (#13159)
Xuan-Son Nguyen [Tue, 29 Apr 2025 06:45:49 +0000 (08:45 +0200)]
llama-graph : fix text position for mrope (#13159)

* llama-graph : fix text position for mrope

* fix typo

* explicitly set 4th dim in the loop

2 months ago model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)
AT [Mon, 28 Apr 2025 19:52:15 +0000 (15:52 -0400)]
model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)

* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture

- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2

Co-authored-by: Jared Van Bortel <redacted>
* fix tokenizer

* don't rename this tensor

---------

Co-authored-by: Jared Van Bortel <redacted>
2 months ago clip : fix model size display (#13153)
Xuan-Son Nguyen [Mon, 28 Apr 2025 19:23:19 +0000 (21:23 +0200)]
clip : fix model size display (#13153)

2 months ago fix(rpc): Improve input validation and error handling (#13069)
Ville Vesilehto [Mon, 28 Apr 2025 18:00:20 +0000 (21:00 +0300)]
fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <redacted>
* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <redacted>
* refactor(rpc): remove redundant check for tensor->type

The check breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <redacted>
---------

Signed-off-by: Ville Vesilehto <redacted>
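
An illustrative distillation of the validation pattern described in this entry (hypothetical names, not the actual rpc-server code): reject out-of-range types before constructing anything, and check client-reported counts with overflow-safe arithmetic instead of aborting.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t TYPE_COUNT = 39; // stand-in for GGML_TYPE_COUNT

// Type validation: refuse to build a tensor from an out-of-range type.
bool valid_type(uint32_t type) {
    return type < TYPE_COUNT;
}

// Size check: returns false (instead of aborting) when the client-provided
// counts cannot possibly fit in the received message. Division is used so
// the multiplications below cannot overflow.
bool check_graph_msg_size(size_t msg_size, uint32_t n_nodes, uint32_t n_tensors,
                          size_t node_sz, size_t tensor_sz, size_t header_sz) {
    if (n_nodes != 0 && node_sz > (SIZE_MAX - header_sz) / n_nodes) {
        return false;
    }
    size_t need = header_sz + (size_t) n_nodes * node_sz;
    if (n_tensors != 0 && tensor_sz > (SIZE_MAX - need) / n_tensors) {
        return false;
    }
    need += (size_t) n_tensors * tensor_sz;
    return need <= msg_size;
}
```
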
2 months ago llama-bench: add `-d` depth arg (#13096)
Vishal Agarwal [Mon, 28 Apr 2025 14:50:39 +0000 (20:20 +0530)]
llama-bench: add `-d` depth arg (#13096)

* add depth param

* update llama-bench README and add depth param

* llama-bench: default params for depth arg for faster execution

* Update examples/llama-bench/README.md

Co-authored-by: Johannes Gäßler <redacted>
* fix buffer print ub

* use user provided args

* remove extra whitespaces

---------

Co-authored-by: Johannes Gäßler <redacted>
2 months ago mtmd : fix glm-edge redundant token count (#13139)
Xuan-Son Nguyen [Mon, 28 Apr 2025 14:12:56 +0000 (16:12 +0200)]
mtmd : fix glm-edge redundant token count (#13139)

* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl

2 months ago context : do not clear output buffer on reserve (#13152)
pockers21 [Mon, 28 Apr 2025 13:45:40 +0000 (06:45 -0700)]
context : do not clear output buffer on reserve (#13152)

Co-authored-by: pockers21 <redacted>
2 months ago llama : (mrope) allow using normal 1D position for text token (#13138)
Xuan-Son Nguyen [Mon, 28 Apr 2025 12:20:56 +0000 (14:20 +0200)]
llama : (mrope) allow using normal 1D position for text token (#13138)

* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp

2 months ago clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
Xuan-Son Nguyen [Mon, 28 Apr 2025 10:18:59 +0000 (12:18 +0200)]
clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)

* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix

2 months ago SYCL: Add all missing unary kernels (#13074)
Akarshan Biswas [Mon, 28 Apr 2025 09:33:25 +0000 (15:03 +0530)]
SYCL: Add all missing unary kernels (#13074)

* SYCL: Add all missing unary kernels

ggml-ci

* decouple kernel launch range from data size using strided loop

* use ciel_div helper for num_blocks
ggml-ci

* clean auto imported header files

2 months ago readme : update hot topics (#13150)
Georgi Gerganov [Mon, 28 Apr 2025 09:10:18 +0000 (12:10 +0300)]
readme : update hot topics (#13150)

2 months ago common : fix noreturn compile warning (#13151)
Georgi Gerganov [Mon, 28 Apr 2025 08:57:19 +0000 (11:57 +0300)]
common : fix noreturn compile warning (#13151)

ggml-ci

2 months ago llama-chat : fix typo GML --> GLM (#13143)
Xuan-Son Nguyen [Mon, 28 Apr 2025 08:11:58 +0000 (10:11 +0200)]
llama-chat : fix typo GML --> GLM (#13143)

2 months ago musa: fix typo in cc control (#13144)
R0CKSTAR [Mon, 28 Apr 2025 07:33:28 +0000 (15:33 +0800)]
musa: fix typo in cc control (#13144)

Signed-off-by: Xiaodong Ye <redacted>
2 months ago CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137)
Johannes Gäßler [Mon, 28 Apr 2025 07:29:26 +0000 (09:29 +0200)]
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137)

2 months ago arg : fix unused variable (#13142)
Xuan-Son Nguyen [Mon, 28 Apr 2025 05:16:59 +0000 (07:16 +0200)]
arg : fix unused variable (#13142)

2 months ago llama-bench : Add `--override-tensors` arg (#12922)
4onen [Sun, 27 Apr 2025 21:48:26 +0000 (14:48 -0700)]
llama-bench : Add `--override-tensors` arg (#12922)

* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)

2 months ago llama-chat : fix wrong template in GLM4-0414 (#13140)
matteo [Sun, 27 Apr 2025 19:57:32 +0000 (21:57 +0200)]
llama-chat : fix wrong template in GLM4-0414 (#13140)

* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check

2 months ago musa: fix build warning (#13129)
R0CKSTAR [Sun, 27 Apr 2025 11:22:49 +0000 (19:22 +0800)]
musa: fix build warning (#13129)

Signed-off-by: Xiaodong Ye <redacted>
2 months ago Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402, as the has_qwen2vl_merger migration was incomplete (#13133)
LostRuins Concedo [Sun, 27 Apr 2025 10:43:37 +0000 (18:43 +0800)]
Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402, as the has_qwen2vl_merger migration was incomplete (#13133)

2 months ago clip : Add Qwen2.5VL support (#12402)
HimariO [Sun, 27 Apr 2025 08:10:34 +0000 (16:10 +0800)]
clip : Add Qwen2.5VL support (#12402)

* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix a few incorrect tensor memory layouts

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <redacted>
2 months ago common : add common_remote_get_content (#13123)
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:58:12 +0000 (22:58 +0200)]
common : add common_remote_get_content (#13123)

* common : add common_remote_get_content

* support max size and timeout

* add tests

2 months ago clip : improve projector naming (#13118)
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:39:47 +0000 (22:39 +0200)]
clip : improve projector naming (#13118)

* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused

2 months ago ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
SXX [Sat, 26 Apr 2025 14:05:31 +0000 (22:05 +0800)]
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32

2 months ago grammar : handle maxItems == 0 in JSON schema (#13117)
frob [Sat, 26 Apr 2025 08:10:20 +0000 (10:10 +0200)]
grammar : handle maxItems == 0 in JSON schema (#13117)

Co-authored-by: Richard Lyons <redacted>
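
The edge case: `maxItems == 0` means the only valid value is an empty array, so the emitted grammar rule must collapse to just the brackets. A hypothetical sketch of that branch (not the actual json-schema-to-grammar converter):

```cpp
#include <string>

// Returns a GBNF-style rule for an array whose items match item_rule.
std::string array_rule(int max_items, const std::string & item_rule) {
    if (max_items == 0) {
        return "\"[\" space \"]\""; // empty array is the only valid value
    }
    // simplified general case: one or more comma-separated items
    return "\"[\" space " + item_rule + " (\",\" space " + item_rule + ")* \"]\"";
}
```
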
2 months ago llama : fix K-shift with quantized K and BLAS backend (#13113)
Diego Devesa [Fri, 25 Apr 2025 17:40:11 +0000 (19:40 +0200)]
llama : fix K-shift with quantized K and BLAS backend (#13113)

2 months ago Force FP32 compute in GLM4 FFN Down (#13101)
City [Fri, 25 Apr 2025 12:38:34 +0000 (14:38 +0200)]
Force FP32 compute in GLM4 FFN Down (#13101)

* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872732159ab88ee7b3c1d77ba5ebc83079bd.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
2 months ago clip : fix pixtral on some GPU backends (#13097)
Xuan-Son Nguyen [Fri, 25 Apr 2025 12:31:42 +0000 (14:31 +0200)]
clip : fix pixtral on some GPU backends (#13097)

* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO

2 months ago change the reorder tensor from init to execute OP (#13003)
Neo Zhang Jianyu [Fri, 25 Apr 2025 09:37:51 +0000 (17:37 +0800)]
change the reorder tensor from init to execute OP (#13003)

2 months ago rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
Radoslav Gerganov [Fri, 25 Apr 2025 07:08:08 +0000 (10:08 +0300)]
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
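
A sketch of the protocol change with stub transport helpers (hypothetical, not the actual rpc code): a command whose reply is known to be empty can be fire-and-forget, saving one network round-trip per call.

```cpp
#include <cstdint>
#include <vector>

// Stub transport, standing in for the real socket I/O:
[[maybe_unused]] static bool send_msg(int /*sockfd*/, uint8_t /*cmd*/,
                                      const std::vector<uint8_t> & /*payload*/) { return true; }
[[maybe_unused]] static bool recv_msg(int /*sockfd*/, std::vector<uint8_t> & /*out*/) { return true; }

bool rpc_set_tensor(int sockfd, const std::vector<uint8_t> & payload) {
    // before: send_msg(...) followed by recv_msg(...) for an always-empty reply
    // after:  just send; the server no longer sends a response for this command
    return send_msg(sockfd, /*cmd=*/0x06, payload);
}
```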

2 months ago clip : remove boi/eoi embeddings for GLM-edge model (#13081)
Xuan-Son Nguyen [Thu, 24 Apr 2025 20:17:04 +0000 (22:17 +0200)]
clip : remove boi/eoi embeddings for GLM-edge model (#13081)

2 months ago embeddings : fix batch sizes (#13076)
Georgi Gerganov [Thu, 24 Apr 2025 19:29:22 +0000 (22:29 +0300)]
embeddings : fix batch sizes (#13076)

ggml-ci

2 months ago ggml : fix trailing whitespaces (#0)
Georgi Gerganov [Thu, 24 Apr 2025 14:22:27 +0000 (17:22 +0300)]
ggml : fix trailing whitespaces (#0)

2 months ago sync : ggml
Georgi Gerganov [Thu, 24 Apr 2025 13:47:43 +0000 (16:47 +0300)]
sync : ggml

ggml-ci

2 months ago ggml : Depthwise 2D convolution (ggml/1152)
Acly [Thu, 17 Apr 2025 12:16:45 +0000 (14:16 +0200)]
ggml : Depthwise 2D convolution (ggml/1152)

* ggml-cpu : kernels for faster depthwise 2D convolution

* fix compile: remove static after moving to ops.cpp

* add dilation for depthwise_conv_2d

* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace

* review: rename depthwise_conv_2d -> conv_2d_dw everywhere
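
For reference, depthwise 2D convolution applies one single-channel kernel per channel with no cross-channel mixing, which is what makes dedicated kernels so much faster than the generic im2col path. A naive, illustrative implementation with dilation, stride 1 and no padding (the ggml kernels are optimized versions of this):

```cpp
#include <cstddef>
#include <vector>

// src: [C][H][W], ker: [C][KH][KW], dst: [C][OH][OW]
void conv_2d_dw_ref(const std::vector<float> & src, int C, int H, int W,
                    const std::vector<float> & ker, int KH, int KW,
                    int dil_h, int dil_w, std::vector<float> & dst) {
    const int OH = H - (KH - 1) * dil_h; // output height, no padding
    const int OW = W - (KW - 1) * dil_w; // output width,  no padding
    dst.assign((std::size_t) C * OH * OW, 0.0f);
    for (int c = 0; c < C; c++) {
        for (int oy = 0; oy < OH; oy++) {
            for (int ox = 0; ox < OW; ox++) {
                float sum = 0.0f;
                for (int ky = 0; ky < KH; ky++) {
                    for (int kx = 0; kx < KW; kx++) {
                        const int iy = oy + ky * dil_h;
                        const int ix = ox + kx * dil_w;
                        sum += src[((std::size_t) c * H + iy) * W + ix] *
                               ker[((std::size_t) c * KH + ky) * KW + kx];
                    }
                }
                dst[((std::size_t) c * OH + oy) * OW + ox] = sum;
            }
        }
    }
}
```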

2 months ago CUDA: use switch statements in constexpr functions (#13095)
Johannes Gäßler [Thu, 24 Apr 2025 13:57:10 +0000 (15:57 +0200)]
CUDA: use switch statements in constexpr functions (#13095)
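
The pattern named in the title, as a self-contained example (illustrative values, not the real CUDA tile-size tables): a `constexpr` function that maps a compile-time parameter through a `switch`, so the result can feed template arguments and `static_assert`s.

```cpp
// Hypothetical mapping from a type id to a tile size.
constexpr int mmq_tile_size(int type) {
    switch (type) {
        case 0:  return 64;
        case 1:  return 32;
        default: return 16;
    }
}

static_assert(mmq_tile_size(0) == 64, "evaluated at compile time");
```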

2 months ago cmake : do not include ./src as public for libllama (#13062)
Georgi Gerganov [Thu, 24 Apr 2025 13:00:10 +0000 (16:00 +0300)]
cmake : do not include ./src as public for libllama (#13062)

* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci

2 months ago clang-tidy : disable warning about missing math parenthesis (#13091)
Georgi Gerganov [Thu, 24 Apr 2025 12:44:05 +0000 (15:44 +0300)]
clang-tidy : disable warning about missing math parenthesis (#13091)

2 months ago arg : add --no-mmproj-offload (#13093)
Xuan-Son Nguyen [Thu, 24 Apr 2025 12:04:14 +0000 (14:04 +0200)]
arg : add --no-mmproj-offload (#13093)

* arg : add --no-mmproj-offload

* Update common/arg.cpp

2 months ago arg : clean up handling --mmproj with -hf (#13082)
Xuan-Son Nguyen [Thu, 24 Apr 2025 10:14:13 +0000 (12:14 +0200)]
arg : clean up handling --mmproj with -hf (#13082)

* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb629d66c612f137e75d562f94bb9e6c.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it

2 months ago metal : fix floating-point range of attention scores in FA kernels (#13090)
Georgi Gerganov [Thu, 24 Apr 2025 07:38:30 +0000 (10:38 +0300)]
metal : fix floating-point range of attention scores in FA kernels (#13090)

ggml-ci

2 months ago vulkan: matmul gcn tuning (#13016)
Eve [Thu, 24 Apr 2025 07:18:33 +0000 (07:18 +0000)]
vulkan: matmul gcn tuning (#13016)

* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <redacted>
* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <redacted>
2 months ago llama-mtmd-cli: Sigint rework in mtmd vision example (#13080)
pl752 [Wed, 23 Apr 2025 21:32:35 +0000 (02:32 +0500)]
llama-mtmd-cli: Sigint rework in mtmd vision example (#13080)

* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months ago mtmd : Support Pixtral 12B (#13065)
Xuan-Son Nguyen [Wed, 23 Apr 2025 18:21:59 +0000 (20:21 +0200)]
mtmd : Support Pixtral 12B (#13065)

* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script

2 months ago convert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021)
piDack [Wed, 23 Apr 2025 14:59:14 +0000 (22:59 +0800)]
convert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021)

* append mult-eos,half-rope,bos to GLM4-0414

* remove unset var

2 months ago rpc : add command line option for number of threads for the CPU backend (#13060)
Radoslav Gerganov [Wed, 23 Apr 2025 07:32:49 +0000 (10:32 +0300)]
rpc : add command line option for number of threads for the CPU backend (#13060)

closes #13051

2 months ago CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
Johannes Gäßler [Tue, 22 Apr 2025 19:27:40 +0000 (21:27 +0200)]
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)

* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs

2 months ago mtmd : support SmolVLM (version 1 and 2) (#13050)
Xuan-Son Nguyen [Tue, 22 Apr 2025 14:24:54 +0000 (16:24 +0200)]
mtmd : support SmolVLM (version 1 and 2) (#13050)

* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test

2 months ago security : add note about RPC and server functionality (#13061)
Georgi Gerganov [Tue, 22 Apr 2025 13:16:10 +0000 (16:16 +0300)]
security : add note about RPC and server functionality (#13061)

* security : add note about RPC functionality

* security : add note about llama-server

2 months ago metal : add memory pool for temp allocs (#12850)
Georgi Gerganov [Tue, 22 Apr 2025 13:15:51 +0000 (16:15 +0300)]
metal : add memory pool for temp allocs (#12850)

* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches

2 months ago llava : update documentations (#13055)
Xuan-Son Nguyen [Tue, 22 Apr 2025 08:37:00 +0000 (10:37 +0200)]
llava : update documentations (#13055)

* llava : update documentations

* fix typo

2 months ago ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)
Diego Devesa [Mon, 21 Apr 2025 16:13:51 +0000 (18:13 +0200)]
ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)

* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant

2 months ago SYCL: Add non-contiguous support in ROPE (#12993)
Akarshan Biswas [Mon, 21 Apr 2025 13:43:30 +0000 (19:13 +0530)]
SYCL: Add non-contiguous support in ROPE (#12993)

ggml-ci

2 months ago mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
Xuan-Son Nguyen [Mon, 21 Apr 2025 13:32:58 +0000 (15:32 +0200)]
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
2 months ago convert : experimental support for `--mmproj` flag (#13023)
Xuan-Son Nguyen [Sun, 20 Apr 2025 21:29:36 +0000 (23:29 +0200)]
convert : experimental support for `--mmproj` flag (#13023)

* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <redacted>
* fix Mistral3Model

* fix typo

Co-authored-by: compilade <redacted>
---------

Co-authored-by: compilade <redacted>
2 months ago llava: fix errors in clip.h on certain compilers (#13030)
Jeffrey Morgan [Sun, 20 Apr 2025 10:15:41 +0000 (03:15 -0700)]
llava: fix errors in clip.h on certain compilers (#13030)

2 months ago vulkan: support noncontiguous rms_norm (#13031)
Jeff Bolz [Sun, 20 Apr 2025 08:50:02 +0000 (03:50 -0500)]
vulkan: support noncontiguous rms_norm (#13031)

2 months ago metal: add neg operator (#13029)
Jeffrey Morgan [Sun, 20 Apr 2025 05:28:40 +0000 (22:28 -0700)]
metal: add neg operator (#13029)

2 months ago Disable CI cross-compile builds (#13022)
bandoti [Sat, 19 Apr 2025 16:05:03 +0000 (13:05 -0300)]
Disable CI cross-compile builds (#13022)

2 months ago gguf-py : fix upload python package workflow (#13020)
Sigbjørn Skjæret [Sat, 19 Apr 2025 14:26:38 +0000 (16:26 +0200)]
gguf-py : fix upload python package workflow (#13020)

2 months ago clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)
Xuan-Son Nguyen [Sat, 19 Apr 2025 07:15:45 +0000 (09:15 +0200)]
clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include