git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
3 months agometal : fix default.metallib build (#12224)
Daniel Bevenius [Fri, 7 Mar 2025 05:23:16 +0000 (06:23 +0100)]
metal : fix default.metallib build (#12224)

This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.

The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'.  Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```

With the above change the build could progress, but there was a follow-on
error about not being able to find the ggml-common.h file in
ggml-metal.metal, where it was included using a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
         ^~~~~~~~~~~~~~~~~~
         "ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.

3 months agoopencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (#12217)
lhez [Fri, 7 Mar 2025 00:20:35 +0000 (16:20 -0800)]
opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (#12217)

* opencl: support noncontiguous `norm`

* opencl: support noncontiguous `rms_norm`

* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`

3 months agocmake : fix undefined reference errors for std::filesystem in ggml (#12092) (#12094)
xiaofei [Thu, 6 Mar 2025 22:58:25 +0000 (06:58 +0800)]
cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (#12094)

Signed-off-by: Ray Lee <redacted>
Co-authored-by: Ray Lee <redacted>
3 months agoreadme : update bindings (#12229)
Lucas Moura Belo [Thu, 6 Mar 2025 19:15:13 +0000 (16:15 -0300)]
readme : update bindings (#12229)

3 months agoCUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222)
Johannes Gäßler [Thu, 6 Mar 2025 17:45:09 +0000 (18:45 +0100)]
CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222)

3 months agoHIP: rocWMMA documentation and enabling in workflow builds (#12179)
David Huang [Thu, 6 Mar 2025 13:14:11 +0000 (21:14 +0800)]
HIP: rocWMMA documentation and enabling in workflow builds (#12179)

* Enable rocWMMA for Windows CI build

* Enable for Ubuntu

* GGML_HIP_ROCWMMA_FATTN documentation work

3 months agoupdate function-calling.md w/ template override for functionary-small-v3.2 (#12214)
Olivier Chafik [Thu, 6 Mar 2025 09:03:31 +0000 (09:03 +0000)]
update function-calling.md w/ template override for functionary-small-v3.2 (#12214)

3 months agollava: add big-endian conversion for image encoder (#12218)
Aaron Teo [Thu, 6 Mar 2025 08:33:21 +0000 (16:33 +0800)]
llava: add big-endian conversion for image encoder (#12218)

Signed-off-by: Aaron Teo <redacted>
3 months agoHIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it...
uvos [Thu, 6 Mar 2025 07:20:52 +0000 (08:20 +0100)]
HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it. (#12209)

This avoids conflicts with the internal CUDA/HIP runtimes' memory management behavior.

3 months agoandroid : fix KV cache log message condition (#12212)
Han Yin [Thu, 6 Mar 2025 06:22:49 +0000 (22:22 -0800)]
android : fix KV cache log message condition (#12212)

3 months agoopencl : fix buffer alignment (#12197)
Henry Linjamäki [Thu, 6 Mar 2025 01:33:40 +0000 (03:33 +0200)]
opencl : fix buffer alignment (#12197)

Fix the following error:

```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```

which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).

Also, fix `ggml_backend_opencl_context::alignment` being set to the
`CL_DEVICE_MEM_BASE_ADDR_ALIGN` value as if it were in bytes, although
the value is reported in bits.
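
The two fixes can be illustrated with a small sketch. This is not the backend's code: `bits_to_bytes` and `align_up` are hypothetical helpers, and the 1024-bit figure is only an assumed example of what a device might report for `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

```python
def bits_to_bytes(align_bits: int) -> int:
    """Convert an alignment reported in bits into bytes."""
    return align_bits // 8


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (in bytes)."""
    return (offset + alignment - 1) // alignment * alignment


# Assumed example: a device reporting a 1024-bit base address alignment.
reported_bits = 1024
alignment = bits_to_bytes(reported_bits)  # 128 bytes, not 1024 bytes

# Figures from the error message above: the shortfall equals 0x1000 bytes.
needed, available = 27525120, 27521024
shortfall = needed - available
```

Treating the reported value as bytes would over-align every allocation by a factor of eight, which is one way a buffer sized with a smaller alignment can run out of space.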

3 months agoopencl : fix `ulong` kernel args were set from `int` variables (#12174)
Henry Linjamäki [Thu, 6 Mar 2025 01:31:14 +0000 (03:31 +0200)]
opencl : fix `ulong` kernel args were set from `int` variables (#12174)

... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running PoCL.
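
A minimal sketch of the failure mode, using Python's `struct` module to stand in for the 8-byte kernel argument slot. The `0xAA` fill is an assumed placeholder for whatever stale data the slot held; nothing here is taken from the OpenCL backend itself.

```python
import struct

# An 8-byte argument slot that already holds stale data.
slot = bytearray(b"\xAA" * 8)

# Writing a 4-byte int only overwrites the lower half (little-endian layout).
struct.pack_into("<i", slot, 0, 42)
as_ulong = struct.unpack_from("<Q", slot, 0)[0]

# Widening the value to 64 bits first fills the whole slot.
struct.pack_into("<Q", slot, 0, 42)
fixed = struct.unpack_from("<Q", slot, 0)[0]
```

Here `as_ulong` keeps 42 in its lower 32 bits but carries the stale bytes in its upper half, so a kernel reading it as `ulong` sees a huge value.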

3 months agoopencl : fix profile-related errors (#12095)
simon886212 [Thu, 6 Mar 2025 01:30:05 +0000 (09:30 +0800)]
opencl : fix profile-related errors (#12095)

Co-authored-by: ubuntu <redacted>
3 months agoggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154)
Rémy O [Thu, 6 Mar 2025 01:26:10 +0000 (02:26 +0100)]
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154)

* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC

3 months agoSYCL: Disable f16 Unary OPs as not supported by the kernels (#12201)
Akarshan Biswas [Wed, 5 Mar 2025 15:58:23 +0000 (21:28 +0530)]
SYCL: Disable f16 Unary OPs as not supported by the kernels (#12201)

3 months agoggml : fix GGMLMetalClass ODR (#12200)
Plamen Minev [Wed, 5 Mar 2025 15:16:01 +0000 (17:16 +0200)]
ggml : fix GGMLMetalClass ODR (#12200)

-- it might happen if ggml is loaded from two separate libraries, since each of them will expose the class. This is more of a guard, since we only want to use Metal as an embedded library and don't care about the other case.

3 months agoci : add fetch-depth to xcframework upload (#12195)
Daniel Bevenius [Wed, 5 Mar 2025 13:16:40 +0000 (14:16 +0100)]
ci : add fetch-depth to xcframework upload (#12195)

This commit adds the fetch-depth: 0 option to the checkout action in the
build.yml workflow file (0 meaning that it fetches the complete
history). The default value is 1 when not specified which only fetches
the latest commit.

This is necessary to ensure that `git rev-list --count HEAD` counts the
total number of commits in the history. Currently because the default is
being used the name of the xcframework artifact is always
llama-b1-xcframework.
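
As a rough illustration of why the name came out as `b1`: if the build number is derived from the commit count, a shallow checkout that sees a single commit always yields 1. The helper name `artifact_name` and the full-history count below are hypothetical.

```python
def artifact_name(commit_count: int) -> str:
    """Build the artifact name from a commit count (hypothetical helper)."""
    return f"llama-b{commit_count}-xcframework"


# fetch-depth: 1 (the default), so only one commit is visible to rev-list.
shallow = artifact_name(1)

# fetch-depth: 0, full history; 4832 is an assumed count, not a real one.
full = artifact_name(4832)
```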

3 months ago`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patter...
Olivier Chafik [Wed, 5 Mar 2025 13:05:13 +0000 (13:05 +0000)]
`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)

* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
3 months agoci : fix xcframework artifact tag (#12191)
Daniel Bevenius [Wed, 5 Mar 2025 09:22:29 +0000 (10:22 +0100)]
ci : fix xcframework artifact tag (#12191)

This commit adds the name parameter to the upload-artifact action to
ensure that the artifact is uploaded with the correct name.

The motivation for this is that currently the uploaded xcframework
is named llama-b1-xcframework.zip. With this change the name of this
artifact should contain the build number like the other artifacts.

3 months agoci : remove xframework upload (#12190)
Daniel Bevenius [Wed, 5 Mar 2025 07:34:02 +0000 (08:34 +0100)]
ci : remove xframework upload (#12190)

* ci : remove xframework upload

This commit removes the upload of the xframework zip file as an
artifact.

The motivation for this change is that the xframework zip file is
currently uploaded as part of the strategy, so the upload is attempted
multiple times, which fails the build.

The uploading should be moved to somewhere else in the build to avoid
this.

* ci : add xcframework upload to macos-latest job

3 months agoserver : fix cache reuse logic (#12161)
Clauszy [Wed, 5 Mar 2025 07:25:45 +0000 (15:25 +0800)]
server : fix cache reuse logic (#12161)

The first kv shift offsets the positions of all tokens after head_c.
When using llama_kv_cache_seq_rm next, using head_c will remove the valid tokens because their positions have already been offset.

3 months agollama : add xcframework build script (#11996)
Daniel Bevenius [Wed, 5 Mar 2025 05:30:31 +0000 (06:30 +0100)]
llama : add xcframework build script (#11996)

* llama : add xcframework build script

This commit adds a script to build an XCFramework for the Apple
iOS, macOS, visionOS, and tvOS platforms.

The generated XCFramework can then be added to a project and used in
the same way as a regular framework. The llama.swiftui example project
has been updated to use the XCFramework and can be started using the
following command:
```console
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10747

* examples : remove llama.cpp (source dir ref) from project.pbxproj

This commit removes the reference to llama.cpp from the project.pbxproj
file since Package.swift has been removed.

* ci : updated build.yml to use build-xcframework.sh

* ci : add xcframework build to github releases

This commit adds the ability to create a GitHub release with the
xcframework build artifact.

* scripts : add apple app validation scripts

This commit adds scripts that can validate the iOS, macOS, tvOS, and
visionOS applications. The scripts create a simple test app project,
copy the llama.xcframework to the test project, build and archive the
app, create an IPA from the archive, and validate the IPA using altool.

The motivation for this is to provide some basic validation and
hopefully avoid having to manually validate apps in Xcode.

* llama : remove Package.swift

This commit removes the Package.swift file, as we are now building an
XCFramework for the project.

* llama : remove Sources and spm-headers directories

* llama : use TargetConditionals.h for visionOS/tvOS

3 months agoggml : portability fixes for VS 2017 (#12150)
mgroeber9110 [Tue, 4 Mar 2025 16:53:26 +0000 (17:53 +0100)]
ggml : portability fixes for VS 2017 (#12150)

* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <redacted>
3 months agoreadme : fix roadmap link (#12185)
Georgi Gerganov [Tue, 4 Mar 2025 16:42:44 +0000 (18:42 +0200)]
readme : fix roadmap link (#12185)

3 months agomain: allow preloading conversation with -p and add -st / --single-turn (#12145)
Sigbjørn Skjæret [Tue, 4 Mar 2025 16:19:39 +0000 (17:19 +0100)]
main: allow preloading conversation with -p and add -st / --single-turn (#12145)

* Add chat template formatting to -no-cnv

* only enable prompt formatting if explicitly enabled

* add -st / --single-turn

* add --single-turn and -p in conversation mode

* fix -sys + -p

* reword warning

* small readability change and fix (long) outdated example usage

* only activate single turn in conversation mode

3 months ago`server`: fix deadly typo in response_format.json_schema.schema handling (#12168)
Olivier Chafik [Tue, 4 Mar 2025 06:24:07 +0000 (06:24 +0000)]
`server`: fix deadly typo in response_format.json_schema.schema handling (#12168)

3 months agoHIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032)
David Huang [Mon, 3 Mar 2025 21:10:54 +0000 (05:10 +0800)]
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032)

Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16

---

Signed-off-by: Carl Klemm <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: Ben Jackson <redacted>
3 months agosync : ggml
Georgi Gerganov [Mon, 3 Mar 2025 15:57:38 +0000 (17:57 +0200)]
sync : ggml

ggml-ci

3 months agocuda: unary ops as float + de-duplicate (ggml/1130)
cmdr2 [Mon, 3 Mar 2025 15:21:31 +0000 (20:51 +0530)]
cuda: unary ops as float + de-duplicate (ggml/1130)

3 months agosync : ggml
Georgi Gerganov [Fri, 28 Feb 2025 10:37:35 +0000 (12:37 +0200)]
sync : ggml

ggml-ci

3 months agocuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
cmdr2 [Fri, 28 Feb 2025 10:36:46 +0000 (12:36 +0200)]
cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)

ggml-ci

3 months agosync : ggml
Georgi Gerganov [Fri, 28 Feb 2025 07:09:58 +0000 (09:09 +0200)]
sync : ggml

ggml-ci

3 months agocuda/cpu: Increase support for fp16 unary operations (ggml/1125)
cmdr2 [Fri, 28 Feb 2025 07:04:39 +0000 (12:34 +0530)]
cuda/cpu: Increase support for fp16 unary operations (ggml/1125)

* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests

3 months agowhisper : support GGML_BACKEND_DL (whisper/2843)
Diego Devesa [Thu, 27 Feb 2025 12:35:07 +0000 (13:35 +0100)]
whisper : support GGML_BACKEND_DL (whisper/2843)

* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <redacted>
3 months agocmake : fix compile assumptions for power9/etc (whisper/2777)
midnight [Wed, 5 Feb 2025 12:41:10 +0000 (04:41 -0800)]
cmake : fix compile assumptions for power9/etc (whisper/2777)

* Add small comment re: VSX to readme

Co-authored-by: midnight <redacted>
3 months agoTold cmake to install ggml-cpp.h as a public header file. (ggml/1126)
petterreinholdtsen [Wed, 26 Feb 2025 20:44:00 +0000 (21:44 +0100)]
Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)

It is used by Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <redacted>
3 months agoSupport pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml...
cmdr2 [Tue, 25 Feb 2025 12:36:34 +0000 (18:06 +0530)]
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div

3 months agoscripts : sync-ggml-am.sh fix
Georgi Gerganov [Fri, 28 Feb 2025 07:09:38 +0000 (09:09 +0200)]
scripts : sync-ggml-am.sh fix

3 months agoci : set GITHUB_ACTION env var for server tests (#12162)
Daniel Bevenius [Mon, 3 Mar 2025 15:17:36 +0000 (16:17 +0100)]
ci : set GITHUB_ACTION env var for server tests (#12162)

This commit tries to address/improve an issue with the server tests
which are failing with a timeout. Looking at the logs it seems like
they are timing out after 12 seconds:
```
FAILED unit/test_chat_completion.py::test_completion_with_json_schema[False-json_schema0-6-"42"] - TimeoutError: Server did not start within 12 seconds
```

This is somewhat strange as in utils.py we have the following values:
```python
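import os

DEFAULT_HTTP_TIMEOUT = 12

# Sanitizer and CI runs are slower, so they get a longer timeout.
if "LLAMA_SANITIZE" in os.environ or "GITHUB_ACTION" in os.environ:
    DEFAULT_HTTP_TIMEOUT = 30

class ServerProcess:  # enclosing class in utils.py; other members elided
    def start(self, timeout_seconds: int | None = DEFAULT_HTTP_TIMEOUT) -> None:
        ...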
```
It should be the case that a test running in a github action should have
a timeout of 30 seconds. However, it seems like this is not the case.
Inspecting the logs from the CI job we can see the following environment
variables:
```console
Run cd examples/server/tests
cd examples/server/tests
./tests.sh
shell: /usr/bin/bash -e {0}
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
pythonLocation: /opt/hostedtoolcache/Python/3.11.11/x64
```

This probably does not address the underlying issue, which is that the
servers providing the models to be downloaded occasionally take longer
to respond, but it might improve the situation in some cases.
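
The effect of the missing variable can be sketched as follows. `pick_timeout` is a hypothetical function that mirrors the selection logic quoted from utils.py, and the value `"1"` for `GITHUB_ACTION` is an assumption; only the variable's presence matters.

```python
def pick_timeout(environ: dict[str, str]) -> int:
    """Mirror the timeout selection logic quoted from utils.py."""
    if "LLAMA_SANITIZE" in environ or "GITHUB_ACTION" in environ:
        return 30
    return 12


# Environment as shown in the CI log: neither variable is present.
ci_env = {
    "LLAMA_LOG_COLORS": "1",
    "LLAMA_LOG_PREFIX": "1",
    "LLAMA_LOG_TIMESTAMPS": "1",
    "LLAMA_LOG_VERBOSITY": "10",
}
before = pick_timeout(ci_env)                            # 12
after = pick_timeout({**ci_env, "GITHUB_ACTION": "1"})  # 30
```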

3 months agotts: add speaker file support (#12048)
dm4 [Mon, 3 Mar 2025 13:09:29 +0000 (21:09 +0800)]
tts: add speaker file support (#12048)

* tts: add speaker file support

Signed-off-by: dm4 <redacted>
* tts: handle outetts-0.3

* tts : add new line in error message

---------

Signed-off-by: dm4 <redacted>
Co-authored-by: Georgi Gerganov <redacted>
3 months agotest-backend-ops : add option -p to filter by op params (#12155)
Diego Devesa [Mon, 3 Mar 2025 13:00:46 +0000 (14:00 +0100)]
test-backend-ops : add option -p to filter by op params (#12155)

3 months agoggml : fix kleidiai build (#12159)
ag2s20150909 [Mon, 3 Mar 2025 12:54:08 +0000 (20:54 +0800)]
ggml : fix kleidiai build (#12159)

The libggml API has changed, but this has not been updated.

3 months agoAdding UTF-8 support to llama.cpp (#12111)
Eric Curtin [Mon, 3 Mar 2025 12:44:56 +0000 (12:44 +0000)]
Adding UTF-8 support to llama.cpp (#12111)

For emojis, non-alpha characters, etc.

Signed-off-by: Eric Curtin <redacted>
3 months agowebui : add ?m=... and ?q=... params (#12148)
Xuan-Son Nguyen [Mon, 3 Mar 2025 10:42:45 +0000 (11:42 +0100)]
webui : add ?m=... and ?q=... params (#12148)

* webui : add ?m=... and ?q=... params

* also clear prefilledMessage variable

* better approach

* fix comment

* test: bump timeout on GITHUB_ACTION

3 months agoSYCL: Move CPY kernels to a separate file and add few missing kernels (#12133)
Akarshan Biswas [Mon, 3 Mar 2025 10:07:22 +0000 (15:37 +0530)]
SYCL: Move CPY kernels to a separate file and add few missing kernels (#12133)

* SYCL: refactor and move cpy kernels to a separate file

* Add few missing cpy kernels

* refactor and add debug logs

3 months agoggml-backend : keep paths in native string type when possible (#12144)
Diego Devesa [Sun, 2 Mar 2025 21:11:00 +0000 (22:11 +0100)]
ggml-backend : keep paths in native string type when possible (#12144)

3 months agomain: use jinja chat template system prompt by default (#12118)
Sigbjørn Skjæret [Sun, 2 Mar 2025 13:53:48 +0000 (14:53 +0100)]
main: use jinja chat template system prompt by default (#12118)

* Use jinja chat template system prompt by default

* faster conditional order

* remove nested ternary

---------

Co-authored-by: Xuan Son Nguyen <redacted>
3 months agomain: update outdated system prompt message (followup to #12131) (#12132)
Sigbjørn Skjæret [Sat, 1 Mar 2025 14:22:27 +0000 (15:22 +0100)]
main: update outdated system prompt message (followup to #12131) (#12132)

* Update outdated message

* wording

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
3 months agocommon : add --system-prompt parameter, replace behavior of -p in conversation mode...
Sigbjørn Skjæret [Sat, 1 Mar 2025 12:56:45 +0000 (13:56 +0100)]
common : add --system-prompt parameter, replace behavior of -p in conversation mode (#12131)

* Add --system-prompt parameter

* use user defined system prompt

* clarify

Co-authored-by: Xuan-Son Nguyen <redacted>
* add warning

* clarify

Co-authored-by: Xuan-Son Nguyen <redacted>
---------

Co-authored-by: Xuan-Son Nguyen <redacted>
3 months agoCUDA: compress mode option and default to size (#12029)
Erik Scholz [Sat, 1 Mar 2025 11:57:22 +0000 (12:57 +0100)]
CUDA: compress mode option and default to size (#12029)

CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".

3 months agowebui : minor typo fixes (#12116)
Vivian [Sat, 1 Mar 2025 10:15:09 +0000 (15:45 +0530)]
webui : minor typo fixes (#12116)

* fix typos and improve menu text clarity

* rename variable trimedValue to trimmedValue

* add updated index.html.gz

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <redacted>
3 months agoconvert : fix Norway problem when parsing YAML (#12114)
Xuan-Son Nguyen [Fri, 28 Feb 2025 16:44:46 +0000 (17:44 +0100)]
convert : fix Norway problem when parsing YAML (#12114)

* convert : fix Norway problem when parsing YAML

* Update gguf-py/gguf/metadata.py

* add newline at correct place

3 months agoggml : upgrade init_tensor API to return a ggml_status (#11854)
William Tambellini [Fri, 28 Feb 2025 13:41:47 +0000 (05:41 -0800)]
ggml : upgrade init_tensor API to return a ggml_status (#11854)

* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <redacted>
3 months agollama : add Phi-4-mini support (supersede #12099) (#12108)
Xuan-Son Nguyen [Fri, 28 Feb 2025 11:44:11 +0000 (12:44 +0100)]
llama : add Phi-4-mini support (supersede #12099) (#12108)

* Added Phi-4-mini-instruct support

* Update regex per ngxson

* Change the vocab base to Xenova/gpt-4o

* fix conversion update script

* no need to check longrope

* minor style fix

* fix python style

---------

Co-authored-by: Nicholas Sparks <redacted>
3 months agoUpdate granite vision docs for 3.2 model (#12105)
Alex Brooks [Fri, 28 Feb 2025 11:31:47 +0000 (04:31 -0700)]
Update granite vision docs for 3.2 model (#12105)

Signed-off-by: Alex-Brooks <redacted>
3 months agovulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595)
Rémy O [Fri, 28 Feb 2025 08:42:52 +0000 (09:42 +0100)]
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595)

* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants

3 months agoCUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098)
Johannes Gäßler [Fri, 28 Feb 2025 08:26:43 +0000 (09:26 +0100)]
CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098)

3 months agoggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)
Prashant Vithule [Fri, 28 Feb 2025 07:36:12 +0000 (13:06 +0530)]
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)

* Added SVE Support for Q2_K Quantized Models

* Use 4-space indentation in the switch cases

* removed comment lines

* Remove the loop; retain the curly braces for better understanding of the code

* Remove the comment line added for the q3_k_q8_k kernel

---------

Co-authored-by: vithulep <redacted>
3 months agoCANN: Fix build error with GCC 13 (#11990)
hipudding [Fri, 28 Feb 2025 07:23:47 +0000 (15:23 +0800)]
CANN: Fix build error with GCC 13 (#11990)

Remove unused header file that causes compilation failure on ARM
platform with GCC 13.

3 months agovulkan: matmul dequantization improvements (#12015)
Eve [Fri, 28 Feb 2025 07:20:08 +0000 (07:20 +0000)]
vulkan: matmul dequantization improvements (#12015)

* faster dequant for old quants

* dont use unpack for iq4_nl

* vec2 unpack for q8

3 months agovulkan: improve im2col (#11826)
Daniele [Fri, 28 Feb 2025 06:52:51 +0000 (06:52 +0000)]
vulkan: improve im2col (#11826)

* vulkan: improve im2col performance

3 months agocmake: Fix ggml backend dependencies and installation (#11818)
Vladimir Vuksanovic [Thu, 27 Feb 2025 07:42:48 +0000 (08:42 +0100)]
cmake: Fix ggml backend dependencies and installation (#11818)

* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends

4 months agollava : add struct for FFI bindgen (#12079)
Ting Lou [Wed, 26 Feb 2025 14:26:52 +0000 (22:26 +0800)]
llava : add struct for FFI bindgen (#12079)

* add struct for FFI bindgen

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
4 months agoRefactor gguf scripts to improve metadata handling (#11909) gguf-v0.16.0
Sigbjørn Skjæret [Wed, 26 Feb 2025 13:04:48 +0000 (14:04 +0100)]
Refactor gguf scripts to improve metadata handling (#11909)

* Refactor gguf scripts to improve metadata handling

Added contents method to ReaderField class
Added endianess property to GGUFReader class

* update scripts

* fix import

* remove unused import

* attempt to work around flake and pyright errors

* second attempt

* give up, ignore type

* bump version

* apply newbyteorder fixes

4 months agogguf-py: enable reading non-native endian files (#12081)
Aleksei Nikiforov [Wed, 26 Feb 2025 11:39:27 +0000 (12:39 +0100)]
gguf-py: enable reading non-native endian files (#12081)

Currently self.byte_order is never used.
Actually use it to byteswap read data to
allow reading big endian files on little endian systems
and vice versa.

Now it's possible to convert a little-endian model
into a big-endian model and back
on a little-endian system.
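
A minimal sketch of the idea, using only the standard library rather than the numpy-based code in gguf-py. `read_u32` is a hypothetical helper; the point is that decoding with an explicit byte order gives the same result on any host.

```python
import sys


def read_u32(data: bytes, byte_order: str) -> int:
    """Decode a 4-byte unsigned integer using an explicit byte order."""
    return int.from_bytes(data[:4], byteorder=byte_order, signed=False)


# The value 3 as it would be stored in a big-endian file.
raw = (3).to_bytes(4, "big")

native = read_u32(raw, sys.byteorder)  # wrong on little-endian hosts
swapped = read_u32(raw, "big")         # correct regardless of host
```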

4 months agoreadme : update infra list (#9096)
Kante Yin [Wed, 26 Feb 2025 07:49:36 +0000 (15:49 +0800)]
readme : update infra list (#9096)

Signed-off-by: kerthcet <redacted>
4 months agodocs: add docs/function-calling.md to lighten server/README.md's plight (#12069)
Olivier Chafik [Tue, 25 Feb 2025 18:52:56 +0000 (18:52 +0000)]
docs: add docs/function-calling.md to lighten server/README.md's plight (#12069)

4 months agovulkan: fix assertion when qy_needs_dequant (#12068)
Jeff Bolz [Tue, 25 Feb 2025 15:30:21 +0000 (09:30 -0600)]
vulkan: fix assertion when qy_needs_dequant (#12068)

Looks like a copy/paste bug from qx_needs_dequant.

4 months agoserver: handle echo=false on /v1/completions (#12060)
rhjdvsgsgks [Tue, 25 Feb 2025 11:52:52 +0000 (11:52 +0000)]
server: handle echo=false on /v1/completions (#12060)

4 months agoadd OP sigmoid (#12056)
Judd [Tue, 25 Feb 2025 11:32:20 +0000 (19:32 +0800)]
add OP sigmoid (#12056)

Co-authored-by: Judd <redacted>
4 months agoggml-cpu: Fix build with sve (#12059)
Molly Sophia [Tue, 25 Feb 2025 11:28:22 +0000 (19:28 +0800)]
ggml-cpu: Fix build with sve (#12059)

* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <redacted>
* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <redacted>
---------

Signed-off-by: Molly Sophia <redacted>
4 months agovulkan: implement more backpropagation operators (#11914)
Rémy O [Tue, 25 Feb 2025 11:04:45 +0000 (12:04 +0100)]
vulkan: implement more backpropagation operators (#11914)

* vulkan: implement GGML_OP_ROPE_BACK

* vulkan: implement GGML_OP_RMS_NORM_BACK

* vulkan: implement GGML_OP_SILU_BACK

* vulkan: implement GGML_OP_SOFTMAX_BACK

4 months agoserver: support add_generation_prompt query param (#12062)
Olivier Chafik [Tue, 25 Feb 2025 10:40:22 +0000 (10:40 +0000)]
server: support add_generation_prompt query param (#12062)

4 months agoAdd Doc for Converting Granite Vision -> GGUF (#12006)
Alex Brooks [Tue, 25 Feb 2025 09:46:05 +0000 (02:46 -0700)]
Add Doc for Converting Granite Vision -> GGUF (#12006)

* Add example docs for granite vision

Signed-off-by: Alex-Brooks <redacted>
4 months agollama : expose llama_model_n_head_kv in the API (#11997)
Vitali Lovich [Tue, 25 Feb 2025 09:29:33 +0000 (01:29 -0800)]
llama : expose llama_model_n_head_kv in the API (#11997)

It's useful to be able to have this from the library layer as it's a key
parameter of the model (e.g. to figure out how much KV cache memory is
needed).
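
For example, a rough KV cache estimate could look like the sketch below. The formula is a common approximation, not llama.cpp's exact accounting (it ignores padding, quantized cache types, and models whose K and V dimensions differ), and the shape values are assumed rather than read from a model.

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_head_kv: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one K and one V vector per layer,
    per context position, per KV head."""
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem


# Assumed example shape with a 16-bit cache (2 bytes per element).
size = kv_cache_bytes(n_layer=32, n_ctx=4096, n_head_kv=8, head_dim=128)
size_mib = size / (1024 * 1024)  # 512.0
```

With grouped-query attention `n_head_kv` is smaller than the number of attention heads, which is why the estimate needs this value rather than the head count.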

4 months agometal : copy kernels for quant to F32/F16 conversions (#12017)
Gian-Carlo Pascutto [Tue, 25 Feb 2025 09:27:58 +0000 (10:27 +0100)]
metal : copy kernels for quant to F32/F16 conversions (#12017)

metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <redacted>
4 months agoopencl: fix for small models (#11950)
lhez [Mon, 24 Feb 2025 21:47:07 +0000 (13:47 -0800)]
opencl: fix for small models (#11950)

* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <redacted>
Co-authored-by: Skyler Szot <redacted>
4 months agollava : Add Granite Vision Support (#11794)
Alex Brooks [Mon, 24 Feb 2025 16:09:51 +0000 (09:09 -0700)]
llava : Add Granite Vision Support (#11794)

* Add super wip scripts for multimodal granite gguf

Signed-off-by: Alex-Brooks <redacted>
* Add example for converting mmgranite to gguf

Signed-off-by: Alex-Brooks <redacted>
* remove hardcoded path

Signed-off-by: Alex-Brooks <redacted>
* Add vision feature layer to gguf params

Signed-off-by: Alex-Brooks <redacted>
* Clean up llava surgery and remove name substitution hacks

Signed-off-by: Alex-Brooks <redacted>
* Add transformers llava next tensor name mapping

Signed-off-by: Alex-Brooks <redacted>
* Make siglip / openclip mutually exclusive

Signed-off-by: Alex-Brooks <redacted>
* Fix projector linear substitution

Signed-off-by: Alex-Brooks <redacted>
* Fix linear 2 substitution index

Signed-off-by: Alex-Brooks <redacted>
* Increase max flattened gridpoints to 64

Signed-off-by: Alex-Brooks <redacted>
* Fix hardcoded concat for multiple feature layers

Signed-off-by: Alex-Brooks <redacted>
* Pull vision feature layers out of gguf keys

Signed-off-by: Alex-Brooks <redacted>
* fix num gridpoints and use all layers

Signed-off-by: Alex-Brooks <redacted>
* Avoid dropping last image encoder layer in llava models

Signed-off-by: Alex-Brooks <redacted>
* Use 10 for max number of patches

Signed-off-by: Alex-Brooks <redacted>
* Standardize vision feature layers

Signed-off-by: Alex-Brooks <redacted>
* Cleanup logs

Signed-off-by: Alex-Brooks <redacted>
* Update comment for vision feature layer init

Signed-off-by: Alex-Brooks <redacted>
* Update notes for alternative to legacy llm conversion script

Signed-off-by: Alex-Brooks <redacted>
* Fix notes rendering

Signed-off-by: Alex-Brooks <redacted>
* Add v prefix to vision feature layer log

Signed-off-by: Alex-Brooks <redacted>
* Use current defaults for feature layer

Signed-off-by: Alex-Brooks <redacted>
* Use constant for max gridpoints / feat layers, style fixes

Signed-off-by: Alex-Brooks <redacted>
* clarify non-negative feature layers

Signed-off-by: Alex-Brooks <redacted>
* Remove CLIP_API from func signature

Signed-off-by: Alex-Brooks <redacted>
* USE MAX_IMAGE_FEATURE_LAYERS const in layer calc

Signed-off-by: Alex-Brooks <redacted>
* Clarify feature layers are non negative ints and not uint

Signed-off-by: Alex-Brooks <redacted>
* Fix condition for reading feature layers

Signed-off-by: Alex-Brooks <redacted>
* pop last llava layer when feature layers are unset

Signed-off-by: Alex-Brooks <redacted>
* Fix unset vision layer 0

Signed-off-by: Alex-Brooks <redacted>
* Update examples/llava/clip.cpp

Co-authored-by: Xuan-Son Nguyen <redacted>
* Reenable assertion for out of bounds get_rows

Signed-off-by: Alex-Brooks <redacted>
* Use std vector for gridpoints and feature layers

Signed-off-by: Alex-Brooks <redacted>
* Calculate max feature layer at load time

Signed-off-by: Alex-Brooks <redacted>
* Include base patch for granite vision allocation

Signed-off-by: Alex-Brooks <redacted>
* Fix trailing whitespace

Signed-off-by: Alex-Brooks <redacted>
* Add max num patches = 10 back for minicpmv

Signed-off-by: Alex-Brooks <redacted>
* Use unordered set to store feature layers

Co-authored-by: Xuan-Son Nguyen <redacted>
Signed-off-by: Alex-Brooks <redacted>
* Use max feature layer for postnorm

Signed-off-by: Alex-Brooks <redacted>
* Apply suggestions from code review

---------

Signed-off-by: Alex-Brooks <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
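
Two of the bullets above mention storing the feature layers in an unordered set and calculating the max feature layer at load time. A minimal sketch of that idea follows. The function name `max_feature_layer` and the fallback parameter are hypothetical and are not taken from clip.cpp:

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>

// Hypothetical helper (not the clip.cpp code): return the deepest encoder
// layer that has to be computed. When no feature layers are configured,
// fall back to `default_layer`, which the caller chooses.
int max_feature_layer(const std::unordered_set<int> & feature_layers, int default_layer) {
    assert(default_layer >= 0);
    if (feature_layers.empty()) {
        return default_layer;
    }
    // Feature layers are non-negative ints, so a plain max over the set is enough.
    return *std::max_element(feature_layers.begin(), feature_layers.end());
}
```

Computing this value once at load time means the graph builder only needs a single integer to know how many encoder layers to run.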
4 months ago[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035)
Neo Zhang Jianyu [Mon, 24 Feb 2025 14:33:23 +0000 (22:33 +0800)]
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035)

* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimizing the graph once when computing it, record the opt status in tensor->extra, make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture to replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* move getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <redacted>
4 months agogguf_convert_endian.py: implement byteswapping for q4_k and q6_k (#11349)
Aleksei Nikiforov [Mon, 24 Feb 2025 11:27:01 +0000 (12:27 +0100)]
gguf_convert_endian.py: implement byteswapping for q4_k and q6_k (#11349)
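
For illustration only, here is a sketch of what byteswapping a quantized block can look like. It is written in C++ rather than the script's Python, and it assumes a q4_k-style layout in which two 16-bit fields (`d`, `dmin`) sit at the start of the block and the rest is byte-wide data that needs no swapping. The function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Swap the two bytes of a 16-bit field stored at `offset` inside `block`.
void byteswap_u16_at(uint8_t * block, size_t offset) {
    std::swap(block[offset], block[offset + 1]);
}

// Assumed layout (illustrative): bytes 0-1 hold d, bytes 2-3 hold dmin,
// and everything after that is single-byte data left untouched.
void byteswap_q4_k_like_block(uint8_t * block, size_t block_size) {
    assert(block_size >= 4);
    (void) block_size; // only used by the assertion above
    byteswap_u16_at(block, 0); // d
    byteswap_u16_at(block, 2); // dmin
}
```

Only multi-byte fields are affected by endianness, which is why the byte-wide payload is skipped.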

4 months agoSYCL: Fix GGML_SYCL_DEBUG macro (#11995)
Akarshan Biswas [Mon, 24 Feb 2025 10:18:25 +0000 (15:48 +0530)]
SYCL: Fix GGML_SYCL_DEBUG macro (#11995)

4 months agorun: allow to customize prompt by env var LLAMA_PROMPT_PREFIX (#12041)
Florent BENOIT [Sun, 23 Feb 2025 17:15:51 +0000 (18:15 +0100)]
run: allow to customize prompt by env var LLAMA_PROMPT_PREFIX (#12041)

Signed-off-by: Florent Benoit <redacted>
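
A minimal sketch of reading such an environment variable with a fallback. The helper name `prompt_prefix` and the fallback value passed by the caller are assumptions for illustration, not the llama-run code:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Return the value of LLAMA_PROMPT_PREFIX if it is set, otherwise `fallback`.
std::string prompt_prefix(const std::string & fallback) {
    const char * env = std::getenv("LLAMA_PROMPT_PREFIX");
    return env != nullptr ? std::string(env) : fallback;
}
```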
4 months agoSome llama-run cleanups (#11973)
Eric Curtin [Sun, 23 Feb 2025 13:14:32 +0000 (13:14 +0000)]
Some llama-run cleanups (#11973)

Use consolidated open function call from File class. Change
read_all to to_string(). Remove exclusive locking, the intent for
that lock is to avoid multiple processes writing to the same file,
it's not an issue for readers, although we may want to consider
adding a shared lock. Remove passing nullptr as reference,
references are never supposed to be null. clang-format the code
for consistent styling.

Signed-off-by: Eric Curtin <redacted>
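
The shared lock mentioned above as a possible follow-up could look roughly like this. This is a sketch assuming a POSIX system where `flock()` is available; the helper name `lock_shared` is hypothetical:

```cpp
#include <cassert>
#include <cstdio>
#include <sys/file.h>

// Take a non-blocking shared (reader) lock on an open FILE.
// Several readers can hold LOCK_SH at once; a writer holding LOCK_EX
// on another descriptor would make this call fail instead of blocking.
bool lock_shared(std::FILE * f) {
    assert(f != nullptr);
    return flock(fileno(f), LOCK_SH | LOCK_NB) == 0;
}
```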
4 months agoggml-cpu: Support s390x SIMD Instruction Set (#12019)
Aaron Teo [Sat, 22 Feb 2025 21:39:24 +0000 (05:39 +0800)]
ggml-cpu: Support s390x SIMD Instruction Set (#12019)

* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <redacted>
* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <redacted>
* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <redacted>
* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <redacted>
* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <redacted>
* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <redacted>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <redacted>
* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <redacted>
* ggml: remove test.py

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <redacted>
* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <redacted>
* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <redacted>
* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <redacted>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <redacted>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <redacted>
* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <redacted>
* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <redacted>
* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <redacted>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <redacted>
* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <redacted>
* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <redacted>
* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <redacted>
* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <redacted>
* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <redacted>
* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <redacted>
* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <redacted>
* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <redacted>
* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <redacted>
* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <redacted>
* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <redacted>
* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <redacted>
* ggml : fix LoongArch compile error with 128-bit SIMD (#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <redacted>
* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <redacted>
---------

Signed-off-by: Aaron Teo <redacted>
Co-authored-by: Jinyang He <redacted>
Co-authored-by: junchao-zhao <redacted>
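
One bullet in the commit above changes the `ggml_aligned_malloc` alignment to 256 to match the s390x cache line size. A minimal sketch of a 256-byte-aligned allocation follows, assuming a POSIX system with `posix_memalign()`. The function name `aligned_alloc_sketch` is hypothetical and this is not the ggml implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocate `size` bytes aligned to `alignment`.
// `alignment` must be a power of two and a multiple of sizeof(void *).
// Returns nullptr on failure; release the memory with free().
void * aligned_alloc_sketch(size_t size, size_t alignment) {
    void * ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}
```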
4 months agoCUDA: add option to compile without FlashAttention (#12025)
Johannes Gäßler [Sat, 22 Feb 2025 19:44:34 +0000 (20:44 +0100)]
CUDA: add option to compile without FlashAttention (#12025)

4 months agollava: build clip image from pixels (#11999)
Ting Lou [Sat, 22 Feb 2025 14:28:28 +0000 (22:28 +0800)]
llava: build clip image from pixels (#11999)

* llava: export function `clip_build_img_from_pixels` to build image from pixels decoded by other libraries instead of stb_image.h for better performance

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <redacted>
4 months agoci : fix arm upload artifacts (#12024)
Georgi Gerganov [Sat, 22 Feb 2025 13:03:00 +0000 (15:03 +0200)]
ci : fix arm upload artifacts (#12024)

* ci : fix arm upload artifacts

* cont : fix archive name to use matrix

4 months agoCUDA: optimize FA for GQA + large batches (#12014)
Johannes Gäßler [Sat, 22 Feb 2025 11:20:17 +0000 (12:20 +0100)]
CUDA: optimize FA for GQA + large batches (#12014)

4 months agoci : Build on Github-hosted arm64 runners (#12009)
Rohanjames1997 [Sat, 22 Feb 2025 10:48:57 +0000 (04:48 -0600)]
ci : Build on Github-hosted arm64 runners (#12009)

4 months agoserver : disable Nagle's algorithm (#12020)
Georgi Gerganov [Sat, 22 Feb 2025 10:46:31 +0000 (12:46 +0200)]
server : disable Nagle's algorithm (#12020)
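
Disabling Nagle's algorithm is normally done by setting `TCP_NODELAY` on the socket. A minimal sketch using the POSIX sockets API follows. The helper name `disable_nagle` is hypothetical, and this is not necessarily how the server applies the option:

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Disable Nagle's algorithm on a TCP socket so small writes are sent
// immediately instead of being coalesced. Returns true on success.
bool disable_nagle(int fd) {
    int flag = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}
```

This trades a little bandwidth efficiency for lower latency, which matters when streaming many small responses.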

4 months agocuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000)
Gian-Carlo Pascutto [Sat, 22 Feb 2025 08:43:24 +0000 (09:43 +0100)]
cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000)

4 months agollama.swiftui : add "Done" dismiss button to help view (#11998)
Daniel Bevenius [Sat, 22 Feb 2025 05:33:29 +0000 (06:33 +0100)]
llama.swiftui : add "Done" dismiss button to help view (#11998)

The commit updates the help view in the llama.swiftui example to use a
NavigationView and a Done button to dismiss the help view.

The motivation for this is that without this change there is no way to
dismiss the help view.

4 months agollama : skip loading unused tensors (#12004)
Georgi Gerganov [Fri, 21 Feb 2025 16:33:18 +0000 (18:33 +0200)]
llama : skip loading unused tensors (#12004)

* llama : assign unknown/unused tensors to host buffer type

ggml-ci

* llama : skip unused tensors

ggml-ci

4 months agodoc: update contributing guidelines [no ci] (#11969)
Johannes Gäßler [Fri, 21 Feb 2025 11:51:25 +0000 (12:51 +0100)]
doc: update contributing guidelines [no ci] (#11969)

4 months agoCUDA: correct the lowest Maxwell supported by CUDA 12 (#11984)
PureJourney [Fri, 21 Feb 2025 11:21:05 +0000 (19:21 +0800)]
CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984)

* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <redacted>
4 months agoMUSA: support ARM64 and enable dp4a etc. (#11843)
Bodhi [Fri, 21 Feb 2025 07:46:23 +0000 (15:46 +0800)]
MUSA: support ARM64 and enable dp4a etc. (#11843)

* MUSA: support ARM64 and enable __dp4a etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <redacted>
4 months agoclip : fix visual encoders with no CLS (#11982)
Alex Brooks [Fri, 21 Feb 2025 06:11:03 +0000 (23:11 -0700)]
clip : fix visual encoders with no CLS (#11982)

Signed-off-by: Alex-Brooks <redacted>
4 months agoserver (webui): Fix Premature Submission During IME Conversion (#11971)
momonga [Thu, 20 Feb 2025 18:43:22 +0000 (03:43 +0900)]
server (webui): Fix Premature Submission During IME Conversion (#11971)

* fix skip ime composing

* fix npm rebuild

* fix warn

---------

Co-authored-by: momonga <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
4 months agoggml-cpu: Add CPU backend support for KleidiAI library (#11390)
Charles Xu [Thu, 20 Feb 2025 13:06:51 +0000 (14:06 +0100)]
ggml-cpu: Add CPU backend support for KleidiAI library (#11390)

* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environment variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove appending of GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list

4 months agoggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)
Prashant Vithule [Thu, 20 Feb 2025 10:08:32 +0000 (15:38 +0530)]
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)

* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved formatting of code in ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spacing

---------

Co-authored-by: vithulep <redacted>
Co-authored-by: Georgi Gerganov <redacted>