]> git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
pkg/ggml/sources/llama.cpp
2 months agocmake : enable curl by default (#12761)
Xuan-Son Nguyen [Mon, 7 Apr 2025 11:35:19 +0000 (13:35 +0200)]
cmake : enable curl by default (#12761)

* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo

2 months agoCANN: fix typo in ggml-cann (#12733)
zhouwg [Mon, 7 Apr 2025 11:34:14 +0000 (19:34 +0800)]
CANN: fix typo in ggml-cann (#12733)

2 months agoCANN: Refactor to reduce duplicate code (#12731)
hipudding [Mon, 7 Apr 2025 09:10:36 +0000 (17:10 +0800)]
CANN: Refactor to reduce duplicate code (#12731)

* CANN: Refactor to reduce duplicate code

* CANN: fix review comment

2 months agomusa: fix compilation warnings in mp_22/31 (#12780)
R0CKSTAR [Sun, 6 Apr 2025 13:23:54 +0000 (21:23 +0800)]
musa: fix compilation warnings in mp_22/31 (#12780)

Signed-off-by: Xiaodong Ye <redacted>
2 months agovulkan: fix NaN issue in flash attention shader (#12776)
Jeff Bolz [Sun, 6 Apr 2025 09:03:47 +0000 (04:03 -0500)]
vulkan: fix NaN issue in flash attention shader (#12776)

Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.

2 months agovulkan: Use unclamped loads for flash attention mask (#12720)
Jeff Bolz [Sun, 6 Apr 2025 08:47:13 +0000 (03:47 -0500)]
vulkan: Use unclamped loads for flash attention mask (#12720)

nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.

2 months agoVulkan: Tune Vulkan mmq int dot shader for performance (#12767)
0cc4m [Sat, 5 Apr 2025 16:04:03 +0000 (18:04 +0200)]
Vulkan: Tune Vulkan mmq int dot shader for performance (#12767)

2 months agocommon : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
Sergey Fedorov [Sat, 5 Apr 2025 15:46:00 +0000 (23:46 +0800)]
common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)

* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include

2 months agoclip : refactor clip_init, add tests (#12757)
Xuan-Son Nguyen [Sat, 5 Apr 2025 15:17:40 +0000 (17:17 +0200)]
clip : refactor clip_init, add tests (#12757)

* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <redacted>
* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <redacted>
2 months agocommon: custom hf endpoint support (#12769)
エシュナヴァリシア [Sat, 5 Apr 2025 13:31:42 +0000 (21:31 +0800)]
common: custom hf endpoint support (#12769)

* common: custom hf endpoint support

Add support for custom huggingface endpoints via HF_ENDPOINT environment variable

You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.

Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

* Update common/arg.cpp

readability Improvement

Co-authored-by: Xuan-Son Nguyen <redacted>
* Apply suggestions from code review

---------

Co-authored-by: ベアトリーチェ <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agosync: minja (#12739)
Olivier Chafik [Fri, 4 Apr 2025 20:16:39 +0000 (13:16 -0700)]
sync: minja (#12739)

* sync: minja

https://github.com/google/minja/pull/57

* fix json include

2 months agokv-cache : simplify + fix warning for recurrent models (#12756)
Georgi Gerganov [Fri, 4 Apr 2025 18:48:10 +0000 (21:48 +0300)]
kv-cache : simplify + fix warning for recurrent models (#12756)

ggml-ci

2 months agoci: add Linux cross-compile build (#12428)
bandoti [Fri, 4 Apr 2025 17:05:12 +0000 (14:05 -0300)]
ci: add Linux cross-compile build (#12428)

2 months agoserver : webui : Upgrade daisyui, tailwindcss. (#12735)
Nauful Shaikh [Fri, 4 Apr 2025 14:09:52 +0000 (09:09 -0500)]
server : webui : Upgrade daisyui, tailwindcss. (#12735)

* Upgrade daisyui, tailwindcss.

* Switch to all themes.

* Revert a change.

* Update formatting.

* Install packages before npm build.

* Revert "Install packages before npm build."

This reverts commit 336c5147e614e60993162794ba9d9d4629a916f8.

* Add index.html.gz

* run build

---------

Co-authored-by: Xuan Son Nguyen <redacted>
2 months agogguf-split : --merge now respects --dry-run option (#12681)
nick huang [Fri, 4 Apr 2025 14:09:12 +0000 (22:09 +0800)]
gguf-split : --merge now respects --dry-run option (#12681)

* gguf-split now respects dry-run option

* removing trailing space

2 months agosycl: allow ggml-sycl configuration and compilation using Visual Studio project/solut...
Nicolò Scipione [Fri, 4 Apr 2025 14:00:46 +0000 (16:00 +0200)]
sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625)

2 months agocmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
Ronny Brendel [Fri, 4 Apr 2025 13:12:40 +0000 (15:12 +0200)]
cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)

fixes error for compiler paths with spaces

2 months agodocs : add XCFramework section to README.md [no ci] (#12746)
Daniel Bevenius [Fri, 4 Apr 2025 08:24:12 +0000 (10:24 +0200)]
docs : add XCFramework section to README.md [no ci] (#12746)

This commit adds a new section to the README.md file, detailing the
usage of the XCFramework.

The motivation for this is that it might not be immediately clear to
users how to use the XCFramework in their projects and hopefully this
will help.

2 months agovulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
Jeff Bolz [Fri, 4 Apr 2025 05:54:35 +0000 (00:54 -0500)]
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)

There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.

2 months agovulkan: set cmake minimum and project name in vulkan-shaders (#12744)
Jeff Bolz [Fri, 4 Apr 2025 05:53:20 +0000 (00:53 -0500)]
vulkan: set cmake minimum and project name in vulkan-shaders (#12744)

2 months agoopencl: update doc for OpenCL (#12702)
lhez [Fri, 4 Apr 2025 05:18:17 +0000 (22:18 -0700)]
opencl: update doc for OpenCL (#12702)

* opencl: add OpenCL to build.md

* opencl: remove fixed issue/TODO

* opencl: add link to OPENCL.md

* opencl: update doc - refine tools requirement for Windows 11 arm64

2 months agoCUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
Gaurav Garg [Thu, 3 Apr 2025 16:20:29 +0000 (21:50 +0530)]
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)

* Prefer vector flash decoding kernel for Gemma models

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by upto 12% in gen phase throughput for Gemm models.

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <redacted>
---------

Co-authored-by: Johannes Gäßler <redacted>
2 months agovocab : use string_view::find() to avoid unnecessary looking up beyond the fragment...
yumeyao [Thu, 3 Apr 2025 15:32:54 +0000 (23:32 +0800)]
vocab : use string_view::find() to avoid unnecessary looking up beyond the fragment range (#12706)

2 months agovulkan: Fix missing cmake logic for dot product extension (#12721)
Jeff Bolz [Thu, 3 Apr 2025 15:08:26 +0000 (10:08 -0500)]
vulkan: Fix missing cmake logic for dot product extension (#12721)

2 months agoci : add env variable in ggml-ci and document the same in SYCL.md (#12736)
Atharva Dubey [Thu, 3 Apr 2025 12:12:39 +0000 (13:12 +0100)]
ci : add env variable in ggml-ci and document the same in SYCL.md (#12736)

2 months agosync : minja (inclusionAI/Ling) and update tests (#12699)
R0CKSTAR [Thu, 3 Apr 2025 11:51:35 +0000 (19:51 +0800)]
sync : minja (inclusionAI/Ling) and update tests (#12699)

Signed-off-by: Xiaodong Ye <redacted>
2 months agofix MUSA compiler warning (#12704)
a3sh [Thu, 3 Apr 2025 07:32:55 +0000 (15:32 +0800)]
fix MUSA compiler warning (#12704)

* fix MUSA compiler warning

* replace (void) with GGML_UNUSED

2 months agoCANN: Support operator SIN COS ARGMAX (#12709)
Chenguang Li [Thu, 3 Apr 2025 07:18:08 +0000 (15:18 +0800)]
CANN: Support operator SIN COS ARGMAX (#12709)

* [CANN]support sin cos argmax

Signed-off-by: noemotiovon <redacted>
* [CANN]codestyle adjustment

Signed-off-by: noemotiovon <redacted>
* [CANN]Remove redundant code

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
Co-authored-by: noemotiovon <redacted>
2 months agoSimplify and improve CUDA graphs through use of indirect copy pointers (#9017)
Alan Gray [Thu, 3 Apr 2025 01:31:15 +0000 (02:31 +0100)]
Simplify and improve CUDA graphs through use of indirect copy pointers (#9017)

* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <redacted>
2 months agoCANN: Fix failed test cases (#12708)
hipudding [Thu, 3 Apr 2025 00:49:51 +0000 (08:49 +0800)]
CANN: Fix failed test cases (#12708)

* CANN: Fix memory waste in aclnn_tensor

* CANN: fix backend ops fail

* CANN: fix acl_tensor memory alloc.

* CANN: format

* CANN: remove trailing whitespace

2 months agoopencl: use `max_alloc_size` in backend ctx instead of querying again (#12705)
lhez [Thu, 3 Apr 2025 00:01:42 +0000 (17:01 -0700)]
opencl: use `max_alloc_size` in backend ctx instead of querying again (#12705)

2 months agovulkan: Implement split_k for coopmat2 flash attention. (#12627)
Jeff Bolz [Wed, 2 Apr 2025 19:25:08 +0000 (14:25 -0500)]
vulkan: Implement split_k for coopmat2 flash attention. (#12627)

When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.

2 months agocmake: remove caching from vulkan coopmat checks (#12719)
bandoti [Wed, 2 Apr 2025 17:56:26 +0000 (14:56 -0300)]
cmake: remove caching from vulkan coopmat checks (#12719)

2 months agovulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
Jeff Bolz [Wed, 2 Apr 2025 17:40:32 +0000 (12:40 -0500)]
vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)

When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.

2 months agoVulkan: Fix mmq int dot float cache size (#12722)
0cc4m [Wed, 2 Apr 2025 17:12:30 +0000 (19:12 +0200)]
Vulkan: Fix mmq int dot float cache size (#12722)

2 months agomodel : print tensor size during load (#12711)
Georgi Gerganov [Wed, 2 Apr 2025 13:38:54 +0000 (16:38 +0300)]
model : print tensor size during load (#12711)

* model : print tensor size during load

* cont : fix units MB -> MiB

Co-authored-by: Diego Devesa <redacted>
---------

Co-authored-by: Diego Devesa <redacted>
2 months agollama : add option to override model tensor buffers (#11397) upstream/0.0.5028
Diego Devesa [Wed, 2 Apr 2025 12:52:01 +0000 (14:52 +0200)]
llama : add option to override model tensor buffers (#11397)

* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes

2 months agollama : refactor kv cache guard (#12695)
Georgi Gerganov [Wed, 2 Apr 2025 11:32:59 +0000 (14:32 +0300)]
llama : refactor kv cache guard (#12695)

* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning

2 months agovocab : BailingMoE : change possessive quantifiers to greedy (#12677)
Sigbjørn Skjæret [Wed, 2 Apr 2025 09:21:48 +0000 (11:21 +0200)]
vocab : BailingMoE : change possessive quantifiers to greedy (#12677)

2 months agocommon : remove json.hpp from common.cpp (#12697)
Xuan-Son Nguyen [Wed, 2 Apr 2025 07:58:34 +0000 (09:58 +0200)]
common : remove json.hpp from common.cpp (#12697)

* common : remove json.hpp from common.cpp

* fix comment

2 months ago[CANN] get_rows and dup optimization (#12671)
Chenguang Li [Wed, 2 Apr 2025 07:22:13 +0000 (15:22 +0800)]
[CANN] get_rows and dup optimization (#12671)

* [CANN]get_rows and dup optimization.

Co-authored-by: hipudding <redacted>
Signed-off-by: noemotiovon <redacted>
* [CANN]GET_ROWS and CPY/DUP optimization

Co-authored-by: hipudding <redacted>
Signed-off-by: noemotiovon <redacted>
* [CANN]code style adjustment

Signed-off-by: noemotiovon <redacted>
* [CANN]code style adjustment

Signed-off-by: noemotiovon <redacted>
* [CANN]code style adjustment

Signed-off-by: noemotiovon <redacted>
* [CANN]code style adjustment

Signed-off-by: noemotiovon <redacted>
---------

Signed-off-by: noemotiovon <redacted>
Co-authored-by: noemotiovon <redacted>
Co-authored-by: hipudding <redacted>
2 months agocommon : refactor downloading system, handle mmproj with -hf option (#12694)
Xuan-Son Nguyen [Tue, 1 Apr 2025 21:44:05 +0000 (23:44 +0200)]
common : refactor downloading system, handle mmproj with -hf option (#12694)

* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2

2 months agoopencl : fix memory allocation size (#12649)
Junil Kim [Tue, 1 Apr 2025 16:54:34 +0000 (01:54 +0900)]
opencl : fix memory allocation size (#12649)

issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283

This patch fixes the memory allocation size
not exceeding the maximum size of the OpenCL device.

2 months agollama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672)
jklincn [Tue, 1 Apr 2025 12:54:28 +0000 (20:54 +0800)]
llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672)

2 months agoconvert : BailingMoE : fix qkv split when head_dim is 0 (#12687)
Sigbjørn Skjæret [Tue, 1 Apr 2025 12:37:13 +0000 (14:37 +0200)]
convert : BailingMoE : fix qkv split when head_dim is 0 (#12687)

NOTE: Ling-lite-base is broken, see https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2

2 months agometal : use F32 prec in FA kernels (#12688)
Georgi Gerganov [Tue, 1 Apr 2025 11:57:19 +0000 (14:57 +0300)]
metal : use F32 prec in FA kernels (#12688)

* metal : use F32 prec in FA kernels

ggml-ci

* cont : fix FA vec kernel

ggml-ci

2 months agoFix clang warning in gguf_check_reserved_keys (#12686)
R0CKSTAR [Tue, 1 Apr 2025 11:12:53 +0000 (19:12 +0800)]
Fix clang warning in gguf_check_reserved_keys (#12686)

* Fix clang warning in gguf_check_reserved_keys

Signed-off-by: Xiaodong Ye <redacted>
* Fix typo

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
2 months agovulkan: fix build when glslc doesn't support coopmat (#12683)
Wagner Bruna [Tue, 1 Apr 2025 09:38:07 +0000 (06:38 -0300)]
vulkan: fix build when glslc doesn't support coopmat (#12683)

2 months agoSYCL: Rename oneMKL to oneMath (#12192)
Romain Biessy [Tue, 1 Apr 2025 08:24:29 +0000 (10:24 +0200)]
SYCL: Rename oneMKL to oneMath (#12192)

* Rename oneMKL Interface to oneMath

* Use oneMath for Intel vendor

* Rename occurences to mkl

* clang-format

* Silence verbose warnings

* Set oneMath HIP_TARGETS

* Fix silence warnings

* Remove step to build oneMath from build instructions

* Use fixed oneMath version

* Remove INTEL_CPU

* Fold CMake oneDNN conditions

* Use Intel oneMKL for Intel devices

* Improve CMake message

* Link against MKL::MKL_SYCL::BLAS only

* Move oneMath documentation to Nvidia and AMD sections

2 months agoSYCL: switch to SYCL namespace (#12674)
Akarshan Biswas [Tue, 1 Apr 2025 08:11:39 +0000 (13:41 +0530)]
SYCL: switch to SYCL namespace (#12674)

2 months agoconvert : BailingMoE : avoid setting rope_dim to 0 (#12678)
Sigbjørn Skjæret [Mon, 31 Mar 2025 21:09:48 +0000 (23:09 +0200)]
convert : BailingMoE : avoid setting rope_dim to 0 (#12678)

2 months agovocab : add special infill tokens for CodeLlama (#11850)
Daniel Bevenius [Mon, 31 Mar 2025 16:40:56 +0000 (18:40 +0200)]
vocab : add special infill tokens for CodeLlama (#11850)

* vocab : add special infill tokens for CodeLlama

The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`

The motivation for this is that currently the infill example uses
CodeLlama as a suggested model. But when using this model the following
error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted                 (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
  -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
  --in-prefix "def helloworld():\n    print(\"hell" \
  --in-suffix "\n   print(\"goodbye world\")\n    "
```

* squash! vocab : add special infill tokens for CodeLlama

Add _<EOT> as well.

2 months agoggml : faster ssm scan (#10558)
a3sh [Mon, 31 Mar 2025 16:05:13 +0000 (00:05 +0800)]
ggml : faster ssm scan (#10558)

* faster ssm_scan

* delete unused commnet

* clang format

* add space

* modify unnecessary calculations

* faster ssm conv implementatioin

* modify file name with dash

2 months agoconvert : Qwerky : use lora_rank_tokenshift and lora_rank_decay if present (#12667)
Sigbjørn Skjæret [Mon, 31 Mar 2025 14:36:25 +0000 (16:36 +0200)]
convert : Qwerky : use lora_rank_tokenshift and lora_rank_decay if present (#12667)

2 months agoVulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)
0cc4m [Mon, 31 Mar 2025 12:37:01 +0000 (14:37 +0200)]
Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)

* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version

2 months agocmake : fix whitespace (#0)
Georgi Gerganov [Mon, 31 Mar 2025 12:05:30 +0000 (15:05 +0300)]
cmake : fix whitespace (#0)

2 months agosync : ggml
Georgi Gerganov [Mon, 31 Mar 2025 11:59:21 +0000 (14:59 +0300)]
sync : ggml

ggml-ci

2 months agocmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Sandro Hanea [Mon, 31 Mar 2025 10:44:36 +0000 (12:44 +0200)]
cmake: improve Vulkan cooperative matrix support checks (whisper/2966)

Co-authored-by: Sandro Hanea <redacted>
2 months agollava : proper description fix (#12668)
Sigbjørn Skjæret [Mon, 31 Mar 2025 09:28:30 +0000 (11:28 +0200)]
llava : proper description fix (#12668)

2 months agoSYCL: Remove misleading ggml_sycl_op_flatten function (#12387)
Akarshan Biswas [Mon, 31 Mar 2025 09:25:24 +0000 (14:55 +0530)]
SYCL: Remove misleading ggml_sycl_op_flatten function (#12387)

* SYCL: Remove misleading ggml_sycl_op_flatten function

* remove trailing whitespace

* Fix L2 norm from rebase

* remove try catch block from element_wise.cpp

* remove comment from common.hp

* ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward

* norm.cpp: remove try catch exception block

2 months agollava : fix clip loading GGUFs with missing description (#12660)
Sigbjørn Skjæret [Mon, 31 Mar 2025 09:07:07 +0000 (11:07 +0200)]
llava : fix clip loading GGUFs with missing description (#12660)

2 months agotts : remove printfs (#12640)
marcoStocchi [Mon, 31 Mar 2025 08:20:30 +0000 (10:20 +0200)]
tts : remove printfs (#12640)

* tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' have now uniform impact on all output.

2 months agollama : support BailingMoE (Ling) (#12634)
Sigbjørn Skjæret [Sun, 30 Mar 2025 20:21:03 +0000 (22:21 +0200)]
llama : support BailingMoE (Ling) (#12634)

2 months agometal : use constexpr in FA kernels + fix typedef (#12659)
Georgi Gerganov [Sun, 30 Mar 2025 19:04:04 +0000 (22:04 +0300)]
metal : use constexpr in FA kernels + fix typedef (#12659)

* metal : use constexpr in FA kernels

ggml-ci

* cont

ggml-ci

* cont : fix typedef

ggml-ci

2 months agollama : add Trillion 7B model support (#12556)
Juyoung Suk [Sun, 30 Mar 2025 18:38:33 +0000 (03:38 +0900)]
llama : add Trillion 7B model support (#12556)

* Support Trillion 7B

* Update llama.h

* Update llama.h

* Update llama-vocab.cpp for Trillion

* Update llama-vocab.cpp

2 months agollama-chat : Add Yandex instruct model template support (#12621)
Sergei Vorobyov [Sun, 30 Mar 2025 18:12:03 +0000 (21:12 +0300)]
llama-chat : Add Yandex instruct model template support (#12621)

* add yandex template

* update yandex chat template

* fix tests

* adjust chat template

* fix style

* fix tool macro in template

* add clarify comment

---------

Co-authored-by: Sergei Vorobev <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
2 months agomusa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc...
R0CKSTAR [Sun, 30 Mar 2025 08:59:38 +0000 (16:59 +0800)]
musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611)

* musa: fix all warnings

Signed-off-by: Xiaodong Ye <redacted>
* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh

Signed-off-by: Xiaodong Ye <redacted>
* musa: update ci doc (install ccache)

Signed-off-by: Xiaodong Ye <redacted>
* fix Windows build issue

Signed-off-by: Xiaodong Ye <redacted>
* Address review comments

Signed-off-by: Xiaodong Ye <redacted>
* Address review comments

Signed-off-by: Xiaodong Ye <redacted>
---------

Signed-off-by: Xiaodong Ye <redacted>
2 months agosync : ggml
Georgi Gerganov [Sat, 29 Mar 2025 13:37:54 +0000 (15:37 +0200)]
sync : ggml

ggml-ci

2 months agocpu : rm unused variable (ggml/1166)
Xuan-Son Nguyen [Sat, 29 Mar 2025 10:59:56 +0000 (11:59 +0100)]
cpu : rm unused variable (ggml/1166)

2 months agocpu: de-duplicate some of the operators and refactor (ggml/1144)
cmdr2 [Sat, 29 Mar 2025 06:07:13 +0000 (11:37 +0530)]
cpu: de-duplicate some of the operators and refactor (ggml/1144)

* cpu: de-duplicate some of the operators and refactor

* Fix PR comments

* Fix PR comments

2 months agoggml : add logging for native build options/vars (whisper/2935)
Daniel Bevenius [Mon, 24 Mar 2025 08:53:38 +0000 (09:53 +0100)]
ggml : add logging for native build options/vars (whisper/2935)

This commit adds debug level logging for the native build options and
variables to ggml/CMakeLists.txt.

The motivation for this is that it can be useful to see the effective
result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a
cmake build. I've found myself adding similar logging a few times now,
so I thought it might be a good idea to add this.

Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when
running cmake produces the following output:
```console
-- GGML_NATIVE         : OFF
-- GGML_NATIVE_DEFAULT : OFF
-- INS_ENB             : OFF
```

2 months agoexamples : command.wasm updates (whisper/2904)
Daniel Bevenius [Thu, 20 Mar 2025 06:02:18 +0000 (07:02 +0100)]
examples : command.wasm updates (whisper/2904)

This commit updates the command.wasm example by adding a server.py script to make it easy to start a local http server to try out the example, updates the build instructions, and also addresses some of the compiler warnings that were being generated.

* emscripten : fix TOTAL_STACK for wasm

This commit moves the TOTAL_STACK setting from the compile flags to the
linker flags. This is because the TOTAL_STACK setting is a linker
setting.

The motivation for this change is that currently the following warnings
are generated when building:
```console
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
```

* examples : suppress C++17 deprecation warning for std::codecvt_utf8

This commit suppresses the C++17 deprecation warning for
std::codecvt_utf8 similar to what is done in
examples/talk-llama/unicode.cpp.

The motivation for this change is to suppress these warnings:
```console
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
4 warnings generated.
```

* ggml : suppress double-promotion warning in GGML_F16x4_REDUCE

This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro
to suppress a double-promotion warning.

Currently the following warning is generated when compiling the
command.wasm example:
```console
/whisper-work/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1592 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/whisper-work/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1640 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings generated.
```
wasm_f32x4_extract_lane returns a 32-bit float and this is what the
addition is performed on. But there is an implicit conversion from
32-bit float to 64-bit double when the result is assigned to `res`,
which is of type `ggml_float`. My understanding here is that this is
intentional and adding a cast to `ggml_float` should suppress the
warning.

* emscripten : add -Wno-deprecated to for emscripten

This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten
builds.

The motivation for this is that currently there a number of warnings
generated like the following:
```console
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
```

The downside of this is that we might miss other deprecation warnings
in the future so I'm not sure if this is acceptable. But it make the
wasm examples cleaner without the warnings.

* examples : fix tautological-compare warning in stb_vorbis.c [no ci]

This commit applies a fix to address a tautological-compare warning
in stb_vorbis.c.

The motivation for this is that currently the following warning is
generated when compiling the commmand-wasm example:
```console
/Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare]
 1404 |       if (f->stream_start + loc >= f->stream_end || f->stream_start + loc < f->stream_start) {
      |                                                                           ^
1 warning generated.
```

This fix was taken from an open pull request on the stb repository
that addreses this issue:
https://github.com/nothings/stb/pull/1746

* squash! examples : update command.wasm instructions [no ci]

This commit adds a Python script to serve the the wasm examples build
in the `build-em` directory. Initially I thought that it would be enough
to start a simple python server but I did not notice that there was an
error in the browser console when I did that:
```console
command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated.
    at command.js:1:1206224
    at new Promise (<anonymous>)
    at loadWasmModuleToWorker (command.js:1:1204981)
    at Array.map (<anonymous>)
    at Object.loadWasmModuleToAllWorkers (command.js:1:1206428)
    at command.js:1:1204318
    at callRuntimeCallbacks (command.js:1:1202062)
    at preRun (command.js:1:6136)
    at run (command.js:1:1294094)
    at removeRunDependency (command.js:1:7046)
```
We need a few CORS headers to be set and in order hopefully make this
easy for users a Python script is added to the examples directory.
This should be able to server all the wasm examples provided they have
been built. command.wasm's README.md is updated to reflect this change.

* examples : remove unused functions

This commit removed the unused functions convert_to_utf8 and
convert_to_wstring from examples/common.cpp.

* Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]"

This reverts commit 8e3c47d96141c7675c985562ebdc705e839e338a.

We should not make this change here and instead when the upstream PR is
merged we can sync with it.

Refs: https://github.com/ggerganov/whisper.cpp/issues/2784

2 months agollama : fix non-causal mask for gemma 3 (#12615)
Xuan-Son Nguyen [Sat, 29 Mar 2025 23:07:37 +0000 (00:07 +0100)]
llama : fix non-causal mask for gemma 3 (#12615)

3 months agollama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)
Djip007 [Sat, 29 Mar 2025 13:07:37 +0000 (14:07 +0100)]
llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)

this allow to use GPU host when possible over CPU repack.
this have the same effect to resolve this issues (#12498) without
completely disable CPU extra buffer.

Co-authored-by: philou <redacted>
3 months agocmake : fix ccache conflict (#12522)
Jay [Sat, 29 Mar 2025 10:04:58 +0000 (18:04 +0800)]
cmake : fix ccache conflict (#12522)

If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in
cmake again will lead to conflict and compile fail.

Signed-off-by: Jay <redacted>
3 months agoCANN : remove clang-format in ggml-cann (#12607)
hipudding [Sat, 29 Mar 2025 10:03:28 +0000 (18:03 +0800)]
CANN : remove clang-format in ggml-cann (#12607)

3 months agollama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)
Sigbjørn Skjæret [Fri, 28 Mar 2025 21:13:02 +0000 (22:13 +0100)]
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)

3 months agometal : improve FA + improve MoE (#12612)
Georgi Gerganov [Fri, 28 Mar 2025 18:21:59 +0000 (20:21 +0200)]
metal : improve FA + improve MoE (#12612)

* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci

3 months agovulkan: fix coopmat shader generation when cross-compiling (#12272)
Icenowy Zheng [Fri, 28 Mar 2025 17:51:06 +0000 (01:51 +0800)]
vulkan: fix coopmat shader generation when cross-compiling (#12272)

* vulkan: fix coopmat shader generation when cross-compiling

Previously the status of coopmat{,2} support isn't passed to the
vulkan-shaders-gen project building on the host, which leads to build
failure because of the cross-compiling code expecting coopmat{,2}
shaders that didn't get generated.

Fix this by passing the coopmat{,2} support status to vulkan-shaders
subproject.

Signed-off-by: Icenowy Zheng <redacted>
* Only call coop-mat shaders once

* Fix whitespace

---------

Signed-off-by: Icenowy Zheng <redacted>
Co-authored-by: bandoti <redacted>
3 months agollama: fix error on bad grammar (#12628)
Johannes Gäßler [Fri, 28 Mar 2025 17:08:52 +0000 (18:08 +0100)]
llama: fix error on bad grammar (#12628)

3 months agoserver : include speculative decoding stats when timings_per_token is enabled (#12603)
Benson Wong [Fri, 28 Mar 2025 08:05:44 +0000 (01:05 -0700)]
server : include speculative decoding stats when timings_per_token is enabled (#12603)

* Include speculative decoding stats when timings_per_token is true

New fields added to the `timings` object:

  - draft_n           : number of draft tokens generated
  - draft_accepted_n  : number of draft tokens accepted
  - draft_accept_ratio: ratio of accepted/generated

* Remove redundant draft_accept_ratio var

* add draft acceptance rate to server console output

3 months agorpc : update README for cache usage (#12620)
Radoslav Gerganov [Fri, 28 Mar 2025 07:44:13 +0000 (09:44 +0200)]
rpc : update README for cache usage (#12620)

3 months agollamafile : ppc64le GEMV forwarding for FP32. (#12594)
amritahs-ibm [Fri, 28 Mar 2025 07:43:22 +0000 (13:13 +0530)]
llamafile : ppc64le GEMV forwarding for FP32. (#12594)

This patch enables usage of MMA when one of the
dimensions of the matrix(ie either M or N) is 1. This
is useful in case of token generation where N < 2.

The concept of 'GEMV Forwarding' is used where when one
of the matrix has a single row/column, the elements are
broadcasted, instead of using packing routine to prepack
the matrix elements.

This change results in 5% - 15% improvement in total
speed(ie all tokens/total time), across various batch
sizes. This is in comparision with the corresponding
dot product implementation.

The patch is tested with FP32 models of Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine.

Signed-off-by: Amrita H S <redacted>
3 months agorpc : send hash when tensor data is above some fixed threshold (#12496)
Radoslav Gerganov [Fri, 28 Mar 2025 06:18:04 +0000 (08:18 +0200)]
rpc : send hash when tensor data is above some fixed threshold (#12496)

* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency

3 months agoserver : Support listening on a unix socket (#12613)
Piotr [Thu, 27 Mar 2025 22:41:04 +0000 (23:41 +0100)]
server : Support listening on a unix socket (#12613)

* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <redacted>
* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <redacted>
---------

Signed-off-by: Piotr Stankiewicz <redacted>
3 months agomedia : add SVG logo [no ci] (#12616)
Georgi Gerganov [Thu, 27 Mar 2025 21:09:05 +0000 (23:09 +0200)]
media : add SVG logo [no ci] (#12616)

3 months agoopencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)
lhez [Thu, 27 Mar 2025 15:08:08 +0000 (08:08 -0700)]
opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)

* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope

3 months agollama : add PLM GGUF Conversion & Inference Support (#12457)
Si1w [Thu, 27 Mar 2025 10:49:15 +0000 (10:49 +0000)]
llama : add PLM GGUF Conversion & Inference Support (#12457)

* add edgellm model arch[conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refarctor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig erros

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch

3 months agomodel : restore support for T5Encoder (#12590)
HighDoping [Thu, 27 Mar 2025 10:43:33 +0000 (18:43 +0800)]
model : restore support for T5Encoder (#12590)

3 months agoconvert : Support Qwen2_5_VLForConditionalGeneration (#12595)
Csaba Kecskemeti [Thu, 27 Mar 2025 10:11:23 +0000 (03:11 -0700)]
convert : Support Qwen2_5_VLForConditionalGeneration (#12595)

3 months agosync : ggml
Georgi Gerganov [Thu, 27 Mar 2025 07:36:13 +0000 (09:36 +0200)]
sync : ggml

ggml-ci

3 months agoscripts : update sync + fix cmake merge
Georgi Gerganov [Thu, 27 Mar 2025 07:22:30 +0000 (09:22 +0200)]
scripts : update sync + fix cmake merge

ggml-ci

3 months agosync : ggml
Georgi Gerganov [Thu, 27 Mar 2025 07:01:21 +0000 (09:01 +0200)]
sync : ggml

ggml-ci

3 months agocmake : sync/merge PowerPC build commands (#0)
Georgi Gerganov [Thu, 27 Mar 2025 07:00:57 +0000 (09:00 +0200)]
cmake : sync/merge PowerPC build commands (#0)

3 months agollamafile : ppc64le MMA implementation for Q4_0. (#12489)
amritahs-ibm [Thu, 27 Mar 2025 06:51:47 +0000 (12:21 +0530)]
llamafile : ppc64le MMA implementation for Q4_0. (#12489)

This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.

This change results in 5% - 50% improvement
in total speed(ie all tokens/total time), across
various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <redacted>
3 months agoggml : riscv: add 128-bit RVV support (#12530)
xctan [Thu, 27 Mar 2025 06:38:34 +0000 (14:38 +0800)]
ggml : riscv: add 128-bit RVV support (#12530)

* ggml : add 128-bit RVV support

* ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl

* remove trailing whitespaces

* restructure vector length selection code

3 months agollama : make loras compatible with repacking (#12593)
Georgi Gerganov [Thu, 27 Mar 2025 06:24:10 +0000 (08:24 +0200)]
llama : make loras compatible with repacking (#12593)

* llama : make loras compatible with repacking

ggml-ci

* cont : simplify

ggml-ci

* cont : add TODO [no ci]

3 months agoSYCL: implement memset ggml backend buffer interface (#12580)
Akarshan Biswas [Thu, 27 Mar 2025 01:46:00 +0000 (07:16 +0530)]
SYCL: implement memset ggml backend buffer interface (#12580)

* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation

3 months agoHIP: Add support for RDNA4 targets (#12372)
Slobodan Josic [Wed, 26 Mar 2025 22:46:30 +0000 (23:46 +0100)]
HIP: Add support for RDNA4 targets (#12372)

3 months agometal : refactor mat-vec code (#12569)
Georgi Gerganov [Wed, 26 Mar 2025 19:38:38 +0000 (21:38 +0200)]
metal : refactor mat-vec code (#12569)

* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci