git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
2 months agomtmd : fix glm-edge redundant token count (#13139)
Xuan-Son Nguyen [Mon, 28 Apr 2025 14:12:56 +0000 (16:12 +0200)]
mtmd : fix glm-edge redundant token count (#13139)

* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl

2 months agocontext : do not clear output buffer on reserve (#13152)
pockers21 [Mon, 28 Apr 2025 13:45:40 +0000 (06:45 -0700)]
context : do not clear output buffer on reserve (#13152)

Co-authored-by: pockers21 <redacted>
2 months agollama : (mrope) allow using normal 1D position for text token (#13138)
Xuan-Son Nguyen [Mon, 28 Apr 2025 12:20:56 +0000 (14:20 +0200)]
llama : (mrope) allow using normal 1D position for text token (#13138)

* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp

2 months agoclip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
Xuan-Son Nguyen [Mon, 28 Apr 2025 10:18:59 +0000 (12:18 +0200)]
clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)

* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix

2 months agoSYCL: Add all missing unary kernels (#13074)
Akarshan Biswas [Mon, 28 Apr 2025 09:33:25 +0000 (15:03 +0530)]
SYCL: Add all missing unary kernels (#13074)

* SYCL: Add all missing unary kernels

ggml-ci

* decouple kernel launch range from data size using strided loop

* use ciel_div helper for num_blocks
ggml-ci

* clean auto imported header files

2 months agoreadme : update hot topics (#13150)
Georgi Gerganov [Mon, 28 Apr 2025 09:10:18 +0000 (12:10 +0300)]
readme : update hot topics (#13150)

2 months agocommon : fix noreturn compile warning (#13151)
Georgi Gerganov [Mon, 28 Apr 2025 08:57:19 +0000 (11:57 +0300)]
common : fix noreturn compile warning (#13151)

ggml-ci

2 months agollama-chat : fix typo GML --> GLM (#13143)
Xuan-Son Nguyen [Mon, 28 Apr 2025 08:11:58 +0000 (10:11 +0200)]
llama-chat : fix typo GML --> GLM (#13143)

2 months agomusa: fix typo in cc control (#13144)
R0CKSTAR [Mon, 28 Apr 2025 07:33:28 +0000 (15:33 +0800)]
musa: fix typo in cc control (#13144)

Signed-off-by: Xiaodong Ye <redacted>
2 months agoCUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137)
Johannes Gäßler [Mon, 28 Apr 2025 07:29:26 +0000 (09:29 +0200)]
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137)

2 months agoarg : fix unused variable (#13142)
Xuan-Son Nguyen [Mon, 28 Apr 2025 05:16:59 +0000 (07:16 +0200)]
arg : fix unused variable (#13142)

2 months agollama-bench : Add `--override-tensors` arg (#12922)
4onen [Sun, 27 Apr 2025 21:48:26 +0000 (14:48 -0700)]
llama-bench : Add `--override-tensors` arg (#12922)

* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)

2 months agollama-chat : fix wrong template in GLM4-0414 (#13140)
matteo [Sun, 27 Apr 2025 19:57:32 +0000 (21:57 +0200)]
llama-chat : fix wrong template in GLM4-0414 (#13140)

* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check

2 months agomusa: fix build warning (#13129)
R0CKSTAR [Sun, 27 Apr 2025 11:22:49 +0000 (19:22 +0800)]
musa: fix build warning (#13129)

Signed-off-by: Xiaodong Ye <redacted>
2 months agoFixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402 as has_qwen2vl_merger migration was incomplete (#13133)
LostRuins Concedo [Sun, 27 Apr 2025 10:43:37 +0000 (18:43 +0800)]
Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402 as has_qwen2vl_merger migration was incomplete (#13133)

2 months agoclip : Add Qwen2.5VL support (#12402)
HimariO [Sun, 27 Apr 2025 08:10:34 +0000 (16:10 +0800)]
clip : Add Qwen2.5VL support (#12402)

* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix a few incorrect tensor memory layouts

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <redacted>
2 months agocommon : add common_remote_get_content (#13123)
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:58:12 +0000 (22:58 +0200)]
common : add common_remote_get_content (#13123)

* common : add common_remote_get_content

* support max size and timeout

* add tests

2 months agoclip : improve projector naming (#13118)
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:39:47 +0000 (22:39 +0200)]
clip : improve projector naming (#13118)

* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused

2 months agoggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
SXX [Sat, 26 Apr 2025 14:05:31 +0000 (22:05 +0800)]
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32

2 months agogrammar : handle maxItems == 0 in JSON schema (#13117)
frob [Sat, 26 Apr 2025 08:10:20 +0000 (10:10 +0200)]
grammar : handle maxItems == 0 in JSON schema (#13117)

Co-authored-by: Richard Lyons <redacted>
2 months agollama : fix K-shift with quantized K and BLAS backend (#13113)
Diego Devesa [Fri, 25 Apr 2025 17:40:11 +0000 (19:40 +0200)]
llama : fix K-shift with quantized K and BLAS backend (#13113)

2 months agoForce FP32 compute in GLM4 FFN Down (#13101)
City [Fri, 25 Apr 2025 12:38:34 +0000 (14:38 +0200)]
Force FP32 compute in GLM4 FFN Down (#13101)

* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872732159ab88ee7b3c1d77ba5ebc83079bd.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
2 months agoclip : fix pixtral on some GPU backends (#13097)
Xuan-Son Nguyen [Fri, 25 Apr 2025 12:31:42 +0000 (14:31 +0200)]
clip : fix pixtral on some GPU backends (#13097)

* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO

2 months agochange the reorder tensor from init to execute OP (#13003)
Neo Zhang Jianyu [Fri, 25 Apr 2025 09:37:51 +0000 (17:37 +0800)]
change the reorder tensor from init to execute OP (#13003)

2 months agorpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
Radoslav Gerganov [Fri, 25 Apr 2025 07:08:08 +0000 (10:08 +0300)]
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.

2 months agoclip : remove boi/eoi embeddings for GLM-edge model (#13081)
Xuan-Son Nguyen [Thu, 24 Apr 2025 20:17:04 +0000 (22:17 +0200)]
clip : remove boi/eoi embeddings for GLM-edge model (#13081)

2 months agoembeddings : fix batch sizes (#13076) upstream/0.0.5185
Georgi Gerganov [Thu, 24 Apr 2025 19:29:22 +0000 (22:29 +0300)]
embeddings : fix batch sizes (#13076)

ggml-ci

2 months agoggml : fix trailing whitespaces (#0)
Georgi Gerganov [Thu, 24 Apr 2025 14:22:27 +0000 (17:22 +0300)]
ggml : fix trailing whitespaces (#0)

2 months agosync : ggml
Georgi Gerganov [Thu, 24 Apr 2025 13:47:43 +0000 (16:47 +0300)]
sync : ggml

ggml-ci

2 months agoggml : Depthwise 2D convolution (ggml/1152)
Acly [Thu, 17 Apr 2025 12:16:45 +0000 (14:16 +0200)]
ggml : Depthwise 2D convolution (ggml/1152)

* ggml-cpu : kernels for faster depthwise 2D convolution

* fix compile: remove static after moving to ops.cpp

* add dilation for depthwise_conv_2d

* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace

* review: rename depthwise_conv_2d -> conv_2d_dw everywhere

2 months agoCUDA: use switch statements in constexpr functions (#13095)
Johannes Gäßler [Thu, 24 Apr 2025 13:57:10 +0000 (15:57 +0200)]
CUDA: use switch statements in constexpr functions (#13095)

2 months agocmake : do not include ./src as public for libllama (#13062)
Georgi Gerganov [Thu, 24 Apr 2025 13:00:10 +0000 (16:00 +0300)]
cmake : do not include ./src as public for libllama (#13062)

* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci

2 months agoclang-tidy : disable warning about missing math parenthesis (#13091)
Georgi Gerganov [Thu, 24 Apr 2025 12:44:05 +0000 (15:44 +0300)]
clang-tidy : disable warning about missing math parenthesis (#13091)

2 months agoarg : add --no-mmproj-offload (#13093)
Xuan-Son Nguyen [Thu, 24 Apr 2025 12:04:14 +0000 (14:04 +0200)]
arg : add --no-mmproj-offload (#13093)

* arg : add --no-mmproj-offload

* Update common/arg.cpp

2 months agoarg : clean up handling --mmproj with -hf (#13082)
Xuan-Son Nguyen [Thu, 24 Apr 2025 10:14:13 +0000 (12:14 +0200)]
arg : clean up handling --mmproj with -hf (#13082)

* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb629d66c612f137e75d562f94bb9e6c.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it

2 months agometal : fix floating-point range of attention scores in FA kernels (#13090)
Georgi Gerganov [Thu, 24 Apr 2025 07:38:30 +0000 (10:38 +0300)]
metal : fix floating-point range of attention scores in FA kernels (#13090)

ggml-ci

2 months agovulkan: matmul gcn tuning (#13016)
Eve [Thu, 24 Apr 2025 07:18:33 +0000 (07:18 +0000)]
vulkan: matmul gcn tuning (#13016)

* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <redacted>
* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <redacted>
2 months agollama-mtmd-cli: Sigint rework in mtmd vision example (#13080)
pl752 [Wed, 23 Apr 2025 21:32:35 +0000 (02:32 +0500)]
llama-mtmd-cli: Sigint rework in mtmd vision example (#13080)

* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agomtmd : Support Pixtral 12B (#13065)
Xuan-Son Nguyen [Wed, 23 Apr 2025 18:21:59 +0000 (20:21 +0200)]
mtmd : Support Pixtral 12B (#13065)

* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script

2 months agoconvert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021)
piDack [Wed, 23 Apr 2025 14:59:14 +0000 (22:59 +0800)]
convert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021)

* append mult-eos,half-rope,bos to GLM4-0414

* remove unset var

2 months agorpc : add command line option for number of threads for the CPU backend (#13060)
Radoslav Gerganov [Wed, 23 Apr 2025 07:32:49 +0000 (10:32 +0300)]
rpc : add command line option for number of threads for the CPU backend (#13060)

closes #13051

2 months agoCUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
Johannes Gäßler [Tue, 22 Apr 2025 19:27:40 +0000 (21:27 +0200)]
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)

* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs

2 months agomtmd : support SmolVLM (version 1 and 2) (#13050)
Xuan-Son Nguyen [Tue, 22 Apr 2025 14:24:54 +0000 (16:24 +0200)]
mtmd : support SmolVLM (version 1 and 2) (#13050)

* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test

2 months agosecurity : add note about RPC and server functionality (#13061)
Georgi Gerganov [Tue, 22 Apr 2025 13:16:10 +0000 (16:16 +0300)]
security : add note about RPC and server functionality (#13061)

* security : add note about RPC functionality

* security : add note about llama-server

2 months agometal : add memory pool for temp allocs (#12850)
Georgi Gerganov [Tue, 22 Apr 2025 13:15:51 +0000 (16:15 +0300)]
metal : add memory pool for temp allocs (#12850)

* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches

2 months agollava : update documentations (#13055)
Xuan-Son Nguyen [Tue, 22 Apr 2025 08:37:00 +0000 (10:37 +0200)]
llava : update documentations (#13055)

* llava : update documentations

* fix typo

2 months agoggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)
Diego Devesa [Mon, 21 Apr 2025 16:13:51 +0000 (18:13 +0200)]
ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)

* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant

2 months agoSYCL: Add non-contiguous support in ROPE (#12993)
Akarshan Biswas [Mon, 21 Apr 2025 13:43:30 +0000 (19:13 +0530)]
SYCL: Add non-contiguous support in ROPE (#12993)

ggml-ci

2 months agomtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
Xuan-Son Nguyen [Mon, 21 Apr 2025 13:32:58 +0000 (15:32 +0200)]
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
2 months agoconvert : experimental support for `--mmproj` flag (#13023)
Xuan-Son Nguyen [Sun, 20 Apr 2025 21:29:36 +0000 (23:29 +0200)]
convert : experimental support for `--mmproj` flag (#13023)

* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <redacted>
* fix Mistral3Model

* fix typo

Co-authored-by: compilade <redacted>
---------

Co-authored-by: compilade <redacted>
2 months agollava: fix errors in clip.h on certain compilers (#13030)
Jeffrey Morgan [Sun, 20 Apr 2025 10:15:41 +0000 (03:15 -0700)]
llava: fix errors in clip.h on certain compilers (#13030)

2 months agovulkan: support noncontiguous rms_norm (#13031)
Jeff Bolz [Sun, 20 Apr 2025 08:50:02 +0000 (03:50 -0500)]
vulkan: support noncontiguous rms_norm (#13031)

2 months agometal: add neg operator (#13029)
Jeffrey Morgan [Sun, 20 Apr 2025 05:28:40 +0000 (22:28 -0700)]
metal: add neg operator (#13029)

2 months agoDisable CI cross-compile builds (#13022)
bandoti [Sat, 19 Apr 2025 16:05:03 +0000 (13:05 -0300)]
Disable CI cross-compile builds (#13022)

2 months agogguf-py : fix upload python package workflow (#13020) gguf-v0.16.2
Sigbjørn Skjæret [Sat, 19 Apr 2025 14:26:38 +0000 (16:26 +0200)]
gguf-py : fix upload python package workflow (#13020)

2 months agoclip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)
Xuan-Son Nguyen [Sat, 19 Apr 2025 07:15:45 +0000 (09:15 +0200)]
clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include

2 months agomain : Fix Ctrl+D/newline handling (#12951)
Daniel Tang [Fri, 18 Apr 2025 20:02:55 +0000 (16:02 -0400)]
main : Fix Ctrl+D/newline handling (#12951)

This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).

This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes #12949

2 months agogguf-py : GGUF Editor GUI - Python + Qt6 (#12930) gguf-v0.16.1
Chris Thompson [Fri, 18 Apr 2025 18:30:41 +0000 (12:30 -0600)]
gguf-py : GGUF Editor GUI - Python + Qt6 (#12930)

2 months agoserver : use std::move whenever possible (#12936)
Xuan-Son Nguyen [Fri, 18 Apr 2025 17:58:12 +0000 (19:58 +0200)]
server : use std::move whenever possible (#12936)

* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
2 months agoSYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)
Akarshan Biswas [Fri, 18 Apr 2025 13:57:56 +0000 (19:27 +0530)]
SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)

* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo

2 months agomtmd : add methods to access `mtmd_image_tokens` (#12906)
Xuan-Son Nguyen [Fri, 18 Apr 2025 08:04:51 +0000 (10:04 +0200)]
mtmd : add methods to access `mtmd_image_tokens` (#12906)

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* fix prompt_modified

* rm redundant data member

2 months agorpc : add RPC_CMD_HELLO (#12955)
Radoslav Gerganov [Fri, 18 Apr 2025 07:13:42 +0000 (10:13 +0300)]
rpc : add RPC_CMD_HELLO (#12955)

Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. Follow the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465

2 months agograph : make FA compatible with MLA + add initial Metal kernels (#12953)
Georgi Gerganov [Thu, 17 Apr 2025 15:16:36 +0000 (18:16 +0300)]
graph : make FA compatible with MLA + add initial Metal kernels (#12953)

* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci

2 months agoggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)
Alan Gray [Thu, 17 Apr 2025 13:19:42 +0000 (14:19 +0100)]
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)

2 months agoCANN: Add support for async operator submission (#12864)
hipudding [Thu, 17 Apr 2025 12:34:16 +0000 (20:34 +0800)]
CANN: Add support for async operator submission (#12864)

Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.

2 months agollama : recognize IBM Granite 3.3 FIM tokens (#12988)
Mikko Juola [Thu, 17 Apr 2025 08:37:05 +0000 (01:37 -0700)]
llama : recognize IBM Granite 3.3 FIM tokens (#12988)

The Granite's FIM tokens are very similar to Qwen's; it's just that
they use underscore instead of a dash. So <fim_middle> for example
instead of <fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```

2 months agoopencl: fix incorrect local_size index in profiling log (#12868)
kimminsu [Wed, 16 Apr 2025 21:25:57 +0000 (06:25 +0900)]
opencl: fix incorrect local_size index in profiling log (#12868)

2 months agovulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
Jeff Bolz [Wed, 16 Apr 2025 18:37:25 +0000 (13:37 -0500)]
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)

The grouped query attention optimization doesn't require a power-of-two ratio;
the only thing relying on it was the modulo operation written as a bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.

2 months agoCANN: Add 310P operator support check (#12962)
Chenguang Li [Wed, 16 Apr 2025 08:21:05 +0000 (16:21 +0800)]
CANN: Add 310P operator support check (#12962)

2 months agoopencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)
lhez [Tue, 15 Apr 2025 19:26:00 +0000 (12:26 -0700)]
opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)

* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <redacted>
* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <redacted>
2 months agometal : add FA-vec kernels for head size 96 (#12952)
Georgi Gerganov [Tue, 15 Apr 2025 11:45:05 +0000 (14:45 +0300)]
metal : add FA-vec kernels for head size 96 (#12952)

ggml-ci

2 months agoCANN: Add x86 build ci (#12950)
hipudding [Tue, 15 Apr 2025 11:08:55 +0000 (19:08 +0800)]
CANN: Add x86 build ci (#12950)

* CANN: Add x86 build ci

* CANN: fix code format

2 months agoCUDA/HIP: Share the same unified memory allocation logic. (#12934)
David Huang [Tue, 15 Apr 2025 09:20:38 +0000 (17:20 +0800)]
CUDA/HIP: Share the same unified memory allocation logic. (#12934)

Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.

2 months agoSYCL: Add ROPE vision kernel (#12887)
Akarshan Biswas [Tue, 15 Apr 2025 08:37:42 +0000 (14:07 +0530)]
SYCL: Add ROPE vision kernel (#12887)

* SYCL: Add ROPE vision kernel

* Add comment about rope mode

2 months agollama : DeepSeek V2/V3 MLA implementation (#12801)
Juk Armstrong [Tue, 15 Apr 2025 06:49:57 +0000 (07:49 +0100)]
llama : DeepSeek V2/V3 MLA implementation (#12801)

* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as no gains now

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`

2 months agoggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829)
Srihari-mcw [Tue, 15 Apr 2025 06:22:36 +0000 (11:52 +0530)]
ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829)

* Add AVX512 implementation of GEMM - q4kx8

* Update changes to remove unnecessary whitespaces

2 months agoCANN: Opt ROPE optimization (#12865)
Chenguang Li [Tue, 15 Apr 2025 02:09:35 +0000 (10:09 +0800)]
CANN: Opt ROPE optimization (#12865)

* [CANN]Opt ROPE optimization

* [CANN]Codestyle adjustment

* [CANN]Fix the ROPE precision issue

* [CANN]codestyle fix

* [CANN]add rope unsupported case

Signed-off-by: noemotiovon <redacted>
2 months agoCANN: Optimize CANN buffer pool memory management (#12875)
Xinpeng Dou [Tue, 15 Apr 2025 02:04:24 +0000 (10:04 +0800)]
CANN: Optimize CANN buffer pool memory management (#12875)

Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools.
1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL
   is not defined, the VMM pool is selected by default.
2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
   the priority queue-based memory pool is used.
3. If neither condition is met, the default memory pool is used.

2 months agoAdd performance print for gemma3 in example (#12929)
Russyyds [Mon, 14 Apr 2025 17:18:20 +0000 (01:18 +0800)]
Add performance print for gemma3 in example (#12929)

2 months agoSYCL: Fix im2col (#12910)
Akarshan Biswas [Mon, 14 Apr 2025 12:23:53 +0000 (17:53 +0530)]
SYCL: Fix im2col (#12910)

* SYCL: Fix im2col

* restore local workgroup size adjustments for large inputs

* restore format

2 months agorpc : use ggml_context_ptr (#12938)
Radoslav Gerganov [Mon, 14 Apr 2025 10:59:34 +0000 (13:59 +0300)]
rpc : use ggml_context_ptr (#12938)

2 months agodsiable curl lib check, this action is missed by commit bd3f59f81289b920bcc597a208c14...
Neo Zhang Jianyu [Mon, 14 Apr 2025 10:19:07 +0000 (18:19 +0800)]
dsiable curl lib check, this action is missed by commit bd3f59f81289b920bcc597a208c14f55e39ed37e (#12761) (#12937)

2 months agosync : ggml
Georgi Gerganov [Mon, 14 Apr 2025 05:52:10 +0000 (08:52 +0300)]
sync : ggml

ggml-ci

2 months agocpu: fix cpu backend's supports-op for GET_ROWS_BACK. fixes a fatal when running...
cmdr2 [Fri, 11 Apr 2025 06:44:19 +0000 (12:14 +0530)]
cpu: fix cpu backend's supports-op for GET_ROWS_BACK. fixes a fatal when running test-backend-ops with only the CPU backend (ggml/1190)

2 months agoggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result...
SXX [Mon, 14 Apr 2025 05:47:55 +0000 (13:47 +0800)]
ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (#12773)

* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register

* simplifies the codebase by removing redundant functions

2 months agoggml: disable CUDA graphs for unsupported DUP and CONT node types (#12891)
Alan Gray [Sun, 13 Apr 2025 21:12:21 +0000 (22:12 +0100)]
ggml: disable CUDA graphs for unsupported DUP and CONT node types (#12891)

Fixes #12798

2 months agoquantize: Handle user-defined quantization levels for additional tensors (#12511)
Ed Addario [Sun, 13 Apr 2025 18:29:28 +0000 (19:29 +0100)]
quantize: Handle user-defined quantization levels for additional tensors (#12511)

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' coding guidelines

* Update descriptions to match existing style

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' guidelines

* Implement general --tensor-type instead of tensor-specific command option

* Fix implied type bug

* Restore missing #includes

* Add regex capability for tensor selection

* Refactor function name and update ALLOWED_TENSOR_TYPE

* Add missing #include

* Handle edge case when tensor name is cls.output

* Minor logging improvement

2 months agocommon : Define cache directory on AIX (#12915)
Prajwal B Mehendarkar [Sat, 12 Apr 2025 15:33:39 +0000 (21:03 +0530)]
common : Define cache directory on AIX (#12915)

2 months agovulkan: use aligned loads for flash attention mask (#12853)
Jeff Bolz [Sat, 12 Apr 2025 08:44:48 +0000 (03:44 -0500)]
vulkan: use aligned loads for flash attention mask (#12853)

Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.

2 months agollava: Fix cpu-only clip image encoding segfault (#12907)
Matt Clayton [Sat, 12 Apr 2025 05:29:03 +0000 (01:29 -0400)]
llava: Fix cpu-only clip image encoding segfault (#12907)

* llava: Fix cpu-only clip image encoding

* clip : no smart ptr for ggml_backend_t

* Fix for backend_ptr push_back

---------

Co-authored-by: Xuan Son Nguyen <redacted>
2 months agoserver : add VSCode's Github Copilot Chat support (#12896)
Georgi Gerganov [Fri, 11 Apr 2025 20:37:41 +0000 (23:37 +0300)]
server : add VSCode's Github Copilot Chat support (#12896)

* server : add VSCode's Github Copilot Chat support

* cont : update handler name

2 months agorpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903)
yuri@FreeBSD [Fri, 11 Apr 2025 20:04:14 +0000 (13:04 -0700)]
rpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903)

2 months ago`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
Olivier Chafik [Fri, 11 Apr 2025 19:47:52 +0000 (12:47 -0700)]
`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)

* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)

* test all chat formats w/o tools

2 months agocommon : Define cache directory on FreeBSD (#12892)
yuri@FreeBSD [Fri, 11 Apr 2025 19:45:44 +0000 (12:45 -0700)]
common : Define cache directory on FreeBSD (#12892)

2 months agosycl: Support sycl_ext_oneapi_limited_graph (#12873)
Ewan Crawford [Fri, 11 Apr 2025 13:32:14 +0000 (15:32 +0200)]
sycl: Support sycl_ext_oneapi_limited_graph (#12873)

The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limited_graph` devices that
don't support update.

2 months agocontrib: support modelscope community (#12664)
tastelikefeet [Fri, 11 Apr 2025 12:01:56 +0000 (20:01 +0800)]
contrib: support modelscope community (#12664)

* support download from modelscope

* support login

* remove comments

* add arguments

* fix code

* fix win32

* test passed

* fix readme

* revert readme

* change to MODEL_ENDPOINT

* revert tail line

* fix readme

* refactor model endpoint

* remove blank line

* fix header

* fix as comments

* update comment

* update readme

---------

Co-authored-by: tastelikefeet <redacted>
2 months agollama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
Yuxuan Zhang [Fri, 11 Apr 2025 10:10:10 +0000 (18:10 +0800)]
llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)

* GLM-4-0414

* use original one

* Using with tensor map

* fix bug

* change order

* change order

* format with flake8

2 months agoclip : use smart pointer (⚠️ breaking change) (#12869)
Xuan-Son Nguyen [Fri, 11 Apr 2025 10:09:39 +0000 (12:09 +0200)]
clip : use smart pointer (⚠️ breaking change) (#12869)

* clip : use smart pointers

* fix warmup

* add forward declaration

* missing include

* fix include (2)

* composite

* simplify batch ptr

* fix conflict

2 months agoSYCL: Add fp16 type support to unary op kernels (#12788)
Akarshan Biswas [Fri, 11 Apr 2025 08:03:50 +0000 (13:33 +0530)]
SYCL: Add fp16 type support to unary op kernels (#12788)

* SYCL: Add fp16 support to some elementwise OP kernels

* remove comment

ggml-ci

* Use static_cast directly

* remove not needed cast from tanh

* Use static cast and remove unneeded castings

* Adjust device_support_op for unary OPs

* Use cast_data and typed_data struct to deduplicate casting code

2 months agoconvert : Llama4 RoPE fix (#12889)
Daniel Han [Fri, 11 Apr 2025 07:49:09 +0000 (00:49 -0700)]
convert : Llama4 RoPE fix (#12889)