git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
4 months ago HIP: force max threads per block to be 1024 (#11621)
fxzjshm [Tue, 4 Feb 2025 18:18:38 +0000 (02:18 +0800)]
HIP: force max threads per block to be 1024 (#11621)

Some old/vendor-forked versions of LLVM still use 256. Explicitly set it to 1024 to align with upstream LLVM.

Signed-off-by: fxzjshm <redacted>
4 months ago server : add try..catch to places not covered by set_exception_handler (#11620)
Xuan-Son Nguyen [Tue, 4 Feb 2025 17:25:42 +0000 (18:25 +0100)]
server : add try..catch to places not covered by set_exception_handler (#11620)

* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
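
A minimal sketch of the guard pattern this commit applies, with hypothetical `response`/`handler_t` types standing in for the server's real ones: wrap each route handler so an uncaught exception becomes a 500 response instead of terminating the process.

```cpp
#include <cstdio>
#include <functional>
#include <stdexcept>
#include <string>

struct response { int status = 200; std::string body; };
using handler_t = std::function<void(response &)>;

// wrap a handler so any thrown exception is converted into a 500 error
static handler_t with_exception_guard(handler_t inner) {
    return [inner](response & res) {
        try {
            inner(res);
        } catch (const std::exception & e) {
            res.status = 500;
            res.body   = std::string("{\"error\":\"") + e.what() + "\"}";
        }
    };
}

int main() {
    handler_t h = with_exception_guard([](response &) {
        throw std::runtime_error("boom");
    });
    response res;
    h(res);
    std::printf("%d %s\n", res.status, res.body.c_str()); // 500 {"error":"boom"}
}
```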

4 months ago arg : list RPC devices first when using --list-devices (#11655)
Radoslav Gerganov [Tue, 4 Feb 2025 16:16:20 +0000 (18:16 +0200)]
arg : list RPC devices first when using --list-devices (#11655)

List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
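
The reordering itself amounts to a stable partition. A self-contained sketch (the `device` struct is a made-up stand-in for the backend registry):

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct device { std::string name; bool is_rpc; };

int main() {
    std::vector<device> devs = {
        {"CUDA0", false}, {"RPC[host:50052]", true}, {"CPU", false},
    };
    // keep the relative order, but move RPC devices to the front
    std::stable_partition(devs.begin(), devs.end(),
                          [](const device & d) { return d.is_rpc; });
    for (const auto & d : devs) {
        std::printf("%s\n", d.name.c_str()); // RPC[host:50052], CUDA0, CPU
    }
}
```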

4 months ago `tool-call`: command r7b fix for normal responses (#11608)
Olivier Chafik [Tue, 4 Feb 2025 15:48:53 +0000 (15:48 +0000)]
`tool-call`: command r7b fix for normal responses (#11608)

* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat

4 months ago readme : add llm_client Rust crate to readme bindings (#11628)
Shelby Jenkins [Tue, 4 Feb 2025 11:20:55 +0000 (05:20 -0600)]
readme : add llm_client Rust crate to readme bindings (#11628)

[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured now is a fair time to add it.

It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.

It also integrates model presets and chooses the largest quant that fits the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from Hugging Face.

So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.

4 months ago swift : fix llama-vocab api usage (#11645)
Jhen-Jie Hong [Tue, 4 Feb 2025 11:15:24 +0000 (19:15 +0800)]
swift : fix llama-vocab api usage (#11645)

* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage

4 months ago metal : use residency set for other platforms (#11648)
Jhen-Jie Hong [Tue, 4 Feb 2025 11:07:18 +0000 (19:07 +0800)]
metal : use residency set for other platforms (#11648)

4 months ago authors : update
Georgi Gerganov [Tue, 4 Feb 2025 11:04:10 +0000 (13:04 +0200)]
authors : update

4 months ago sync : ggml upstream/0.0.4631
Georgi Gerganov [Tue, 4 Feb 2025 10:59:21 +0000 (12:59 +0200)]
sync : ggml

4 months ago cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
Christian Kastner [Mon, 3 Feb 2025 23:17:15 +0000 (00:17 +0100)]
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)

This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.

This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring
it out, or even requiring it.
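
On the consuming side, such a build number is just a compile definition with a fallback. A hedged sketch (the default of 0 for git-less builds is an assumption):

```cpp
#include <cstdio>

// normally supplied by the build system, e.g. -DGGML_BUILD_NUMBER=1096
#ifndef GGML_BUILD_NUMBER
#define GGML_BUILD_NUMBER 0 // assumed fallback for tarball/distro builds without git
#endif

int main() {
    std::printf("build number: %d\n", GGML_BUILD_NUMBER);
}
```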

4 months ago ci : do not stale-close roadmap issues
Georgi Gerganov [Tue, 4 Feb 2025 07:30:42 +0000 (09:30 +0200)]
ci : do not stale-close roadmap issues

4 months ago `tool-call`: allow `--chat-template chatml` w/ `--jinja`, default to chatml upon...
Olivier Chafik [Mon, 3 Feb 2025 23:49:27 +0000 (23:49 +0000)]
`tool-call`: allow `--chat-template chatml` w/ `--jinja`, default to chatml upon parsing issue, avoid double bos (#11616)

* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
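
A sketch of the fallback logic described above, assuming template parsing throws on bad input; `parse_template` and the chatml string are illustrative stand-ins for the real minja calls:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// hypothetical stand-in for the real Jinja parser; throws on bad syntax
static void parse_template(const std::string & tmpl) {
    if (tmpl.find("{%") == std::string::npos) {
        throw std::runtime_error("not a usable template");
    }
}

static const char * CHATML_TEMPLATE =
    "{% for m in messages %}<|im_start|>{{ m.role }}\n"
    "{{ m.content }}<|im_end|>\n{% endfor %}";

static std::string resolve_template(const std::string & model_template) {
    try {
        parse_template(model_template);
        return model_template;
    } catch (const std::exception &) {
        return CHATML_TEMPLATE; // default to chatml upon parsing issue
    }
}

int main() {
    std::printf("%s\n", resolve_template("garbage").c_str());
}
```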

4 months ago server : (webui) revert hacky solution from #11626 (#11634)
Xuan-Son Nguyen [Mon, 3 Feb 2025 23:10:52 +0000 (00:10 +0100)]
server : (webui) revert hacky solution from #11626 (#11634)

4 months ago server : (webui) allow typing and submitting during llm response (#11626)
Woof Dog [Mon, 3 Feb 2025 22:16:27 +0000 (22:16 +0000)]
server : (webui) allow typing and submitting during llm response (#11626)

4 months ago server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
Daniel Bevenius [Mon, 3 Feb 2025 15:45:38 +0000 (16:45 +0100)]
server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)

This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was thrown and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set,
cpp-httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
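
With the define removed, the server can lean on cpp-httplib's own exception handler. A minimal sketch (handler signature as in recent cpp-httplib releases; worth verifying against the vendored header):

```cpp
#include <exception>
#include <stdexcept>

#include "httplib.h"

int main() {
    httplib::Server svr;
    svr.Get("/boom", [](const httplib::Request &, httplib::Response &) {
        throw std::runtime_error("unhandled");
    });
    // without CPPHTTPLIB_NO_EXCEPTIONS, httplib catches the throw and calls
    // this handler, so the client gets a 500 instead of a dead process
    svr.set_exception_handler([](const httplib::Request &, httplib::Response & res,
                                 std::exception_ptr ep) {
        try {
            if (ep) std::rethrow_exception(ep);
        } catch (const std::exception & e) {
            res.set_content(e.what(), "text/plain");
        }
        res.status = 500;
    });
    svr.listen("127.0.0.1", 8080);
}
```
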
4 months ago sync : ggml
Georgi Gerganov [Mon, 3 Feb 2025 12:57:08 +0000 (14:57 +0200)]
sync : ggml

4 months ago CUDA: fix Volta FlashAttention logic (#11615)
Johannes Gäßler [Mon, 3 Feb 2025 12:25:56 +0000 (13:25 +0100)]
CUDA: fix Volta FlashAttention logic (#11615)

4 months ago server : (webui) Fix Shift+Enter handling (#11609)
mashdragon [Mon, 3 Feb 2025 09:42:55 +0000 (09:42 +0000)]
server : (webui) Fix Shift+Enter handling (#11609)

* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <redacted>
4 months ago HIP: fix flash_attn_stream_k_fixup warning (#11604)
Johannes Gäßler [Sun, 2 Feb 2025 22:48:29 +0000 (23:48 +0100)]
HIP: fix flash_attn_stream_k_fixup warning (#11604)

4 months ago CUDA/HIP: add support for selectable warp size to mmv (#11519)
uvos [Sun, 2 Feb 2025 21:40:09 +0000 (22:40 +0100)]
CUDA/HIP: add support for selectable warp size to mmv (#11519)

CUDA/HIP: add support for selectable warp size to mmv

4 months ago HIP: add GGML_CUDA_CC_IS_* for AMD families as increasing CC architectures for AMD...
uvos [Sun, 2 Feb 2025 21:08:05 +0000 (22:08 +0100)]
HIP: add GGML_CUDA_CC_IS_* for AMD families as increasing CC architectures for AMD GPUs are not supersets of each other (#11601)

This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly
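
The shape of the fix, as a simplified sketch: per-family range checks replace ordered ">= arch" comparisons. The constants below are illustrative placeholders, not ggml's exact values.

```cpp
#include <cstdio>

#define CC_RDNA1 1010 // gfx101x (illustrative)
#define CC_RDNA2 1030 // gfx103x (illustrative)
#define CC_RDNA3 1100 // gfx110x (illustrative)

// a family is a range, not everything above a threshold
#define CC_IS_RDNA1(cc) ((cc) >= CC_RDNA1 && (cc) < CC_RDNA2)
#define CC_IS_RDNA2(cc) ((cc) >= CC_RDNA2 && (cc) < CC_RDNA3)

int main() {
    // gfx1012 is RDNA1 even though it is not gfx1010
    std::printf("gfx1012 is RDNA1: %d\n", CC_IS_RDNA1(1012));
}
```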

4 months ago nit: more informative crash when grammar sampler fails (#11593)
Olivier Chafik [Sun, 2 Feb 2025 19:58:34 +0000 (19:58 +0000)]
nit: more informative crash when grammar sampler fails (#11593)

4 months ago CUDA: use mma PTX instructions for FlashAttention (#11583)
Johannes Gäßler [Sun, 2 Feb 2025 18:31:09 +0000 (19:31 +0100)]
CUDA: use mma PTX instructions for FlashAttention (#11583)

* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <redacted>
4 months ago Name colors (#11573)
Eric Curtin [Sun, 2 Feb 2025 15:14:48 +0000 (16:14 +0100)]
Name colors (#11573)

It's more descriptive, and using #define's lets us rely on compile-time
concatenation.

Signed-off-by: Eric Curtin <redacted>
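
The compile-time concatenation refers to the compiler merging adjacent string literals, which works for `#define`d strings but not for `const std::string`. A small sketch with assumed color names:

```cpp
#include <cstdio>

#define COLOR_RED   "\x1b[31m"
#define COLOR_GREEN "\x1b[32m"
#define COLOR_RESET "\x1b[0m"

int main() {
    // adjacent literals are merged at compile time into one constant string,
    // so there is no runtime concatenation here
    std::printf(COLOR_GREEN "ok" COLOR_RESET ", " COLOR_RED "fail" COLOR_RESET "\n");
}
```
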
4 months ago `tool-call`: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
Olivier Chafik [Sun, 2 Feb 2025 09:25:38 +0000 (09:25 +0000)]
`tool-call`: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)

* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers

4 months ago Fix exotic ci env that lacks ostringstream::str (#11581)
Olivier Chafik [Sun, 2 Feb 2025 09:10:15 +0000 (09:10 +0000)]
Fix exotic ci env that lacks ostringstream::str (#11581)

4 months ago sampling : support for llguidance grammars (#10224)
Michał Moskal [Sun, 2 Feb 2025 07:55:32 +0000 (23:55 -0800)]
sampling : support for llguidance grammars (#10224)

* initial porting of previous LLG patch

* update for new APIs

* build: integrate llguidance as an external project

* use '%llguidance' as marker to enable llg lark syntax

* add some docs

* clarify docs

* code style fixes

* remove llguidance.h from .gitignore

* fix tests when llg is enabled

* pass vocab not model to llama_sampler_init_llg()

* copy test-grammar-integration.cpp to test-llguidance.cpp

* clang fmt

* fix ref-count bug

* build and run test

* gbnf -> lark syntax

* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag

* rename llguidance test file to test-grammar-llguidance.cpp

* add gh action for llg test

* align tests with LLG grammar syntax and JSON Schema spec

* llama_tokenizer() in fact requires valid utf8

* update llg

* format file

* add $LLGUIDANCE_LOG_LEVEL support

* fix whitespace

* fix warning

* include <cmath> for INFINITY

* add final newline

* fail llama_sampler_init_llg() at runtime

* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes

* simplify #includes

* improve doc string for LLAMA_LLGUIDANCE

* typo in merge

* bump llguidance to 0.6.12
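
One bullet above introduces `%llguidance` as the marker that selects the llguidance engine. A self-contained sketch of that dispatch (the enum and function names are illustrative):

```cpp
#include <cstdio>
#include <string>

enum class grammar_kind { gbnf, llguidance };

// a grammar string starting with the marker takes the llguidance (lark) path
static grammar_kind detect_grammar(const std::string & text) {
    return text.rfind("%llguidance", 0) == 0 ? grammar_kind::llguidance
                                             : grammar_kind::gbnf;
}

int main() {
    const bool llg = detect_grammar("%llguidance\nstart: /[0-9]+/") ==
                     grammar_kind::llguidance;
    std::printf("llguidance: %d\n", llg);
}
```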

4 months ago llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
piDack [Sun, 2 Feb 2025 07:48:46 +0000 (15:48 +0800)]
llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)

* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp to not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <redacted>
* add edge template

* fix chat template

* fix conflict

* fix conflict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <redacted>
Co-authored-by: piDack <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: liyuhang <redacted>
Co-authored-by: Georgi Gerganov <redacted>
4 months ago ci: use sccache on windows HIP jobs (#11553)
Olivier Chafik [Sat, 1 Feb 2025 18:22:38 +0000 (18:22 +0000)]
ci: use sccache on windows HIP jobs (#11553)

4 months ago `sync`: minja (https://github.com/google/minja/commit/418a2364b56dc9be4ed9a1a2b0fb16f...
Olivier Chafik [Sat, 1 Feb 2025 12:24:51 +0000 (12:24 +0000)]
`sync`: minja (https://github.com/google/minja/commit/418a2364b56dc9be4ed9a1a2b0fb16fb53a7a22e) (#11574)

4 months ago Implement s3:// protocol (#11511)
Eric Curtin [Sat, 1 Feb 2025 10:30:54 +0000 (11:30 +0100)]
Implement s3:// protocol (#11511)

For those that want to pull from s3

Signed-off-by: Eric Curtin <redacted>
4 months ago ci: simplify cmake build commands (#11548)
Olivier Chafik [Sat, 1 Feb 2025 00:01:20 +0000 (00:01 +0000)]
ci: simplify cmake build commands (#11548)

4 months ago `ci`: use sccache on windows instead of ccache (#11545)
Olivier Chafik [Fri, 31 Jan 2025 17:12:40 +0000 (17:12 +0000)]
`ci`: use sccache on windows instead of ccache (#11545)

* Use sccache on ci for windows

* Detect sccache in cmake

4 months ago `tool-call`: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package...
Olivier Chafik [Fri, 31 Jan 2025 14:15:25 +0000 (14:15 +0000)]
`tool-call`: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)

* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
4 months ago fix stop regression (#11543)
Olivier Chafik [Fri, 31 Jan 2025 13:48:31 +0000 (13:48 +0000)]
fix stop regression (#11543)

4 months ago Fix chatml fallback for unsupported builtin templates (when --jinja not enabled)...
Olivier Chafik [Fri, 31 Jan 2025 08:24:29 +0000 (08:24 +0000)]
Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533)

4 months ago server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531)
Olivier Chafik [Fri, 31 Jan 2025 08:12:40 +0000 (08:12 +0000)]
server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531)

4 months ago common: Add missing va_end (#11529)
Steve Grubb [Fri, 31 Jan 2025 05:58:55 +0000 (00:58 -0500)]
common: Add missing va_end (#11529)

The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementations, not calling va_end
has no consequences. For others it could leak memory.
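
A minimal sketch of the pairing rule the commit enforces, with an illustrative formatting helper: every va_copy is matched by a va_end before returning.

```cpp
#include <cstdarg>
#include <cstdio>
#include <vector>

static std::vector<char> vformat(const char * fmt, va_list args) {
    va_list args_copy;
    va_copy(args_copy, args); // the sizing pass below consumes `args`
    const int len = vsnprintf(nullptr, 0, fmt, args);
    std::vector<char> buf(len + 1);
    vsnprintf(buf.data(), buf.size(), fmt, args_copy);
    va_end(args_copy);        // required: reverts whatever va_copy did
    return buf;
}

static void log_msg(const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    const auto buf = vformat(fmt, args);
    va_end(args);
    std::printf("%s\n", buf.data());
}

int main() {
    log_msg("n_ctx = %d", 4096);
}
```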

4 months ago server : update help metrics processing/deferred (#11512)
Daniel Bevenius [Fri, 31 Jan 2025 05:04:53 +0000 (06:04 +0100)]
server : update help metrics processing/deferred (#11512)

This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).

4 months ago `ci`: ccache for all github workflows (#11516)
Olivier Chafik [Thu, 30 Jan 2025 22:01:06 +0000 (22:01 +0000)]
`ci`: ccache for all github workflows (#11516)

4 months ago Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunc...
Olivier Chafik [Thu, 30 Jan 2025 19:13:58 +0000 (19:13 +0000)]
Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)

---------

Co-authored-by: Xuan Son Nguyen <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
4 months ago HIP: require at least HIP 5.5
uvos [Wed, 29 Jan 2025 18:36:00 +0000 (19:36 +0100)]
HIP: require at least HIP 5.5

4 months ago HIP: Prepare reduction operators for wave 64
uvos [Wed, 29 Jan 2025 18:12:42 +0000 (19:12 +0100)]
HIP: Prepare reduction operators for wave 64

4 months ago CUDA/HIP: add warp_size to cuda_device_info
uvos [Wed, 29 Jan 2025 16:46:23 +0000 (17:46 +0100)]
CUDA/HIP: add warp_size to cuda_device_info

4 months ago sync: minja (#11499)
Olivier Chafik [Thu, 30 Jan 2025 10:30:27 +0000 (10:30 +0000)]
sync: minja (#11499)

4 months ago vocab : correctly identify LF token for GPT-2 style BPE tokenizer (#11496)
mgroeber9110 [Thu, 30 Jan 2025 10:10:59 +0000 (11:10 +0100)]
vocab : correctly identify LF token for GPT-2 style BPE tokenizer (#11496)

4 months ago server : use lambda instead of std::bind (#11507)
Daniel Bevenius [Thu, 30 Jan 2025 10:05:00 +0000 (11:05 +0100)]
server : use lambda instead of std::bind (#11507)

This commit replaces the two usages of `std::bind` with lambdas for
the callback functions `callback_new_task` and `callback_update_slots`.

The motivation for this change is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers).
Lambdas are also more readable (perhaps this is subjective), and they
are recommended over `std::bind` in modern C++.

Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
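
Side by side, the two styles look like this; `server_queue` and `server_context` are simplified stand-ins for the real server types:

```cpp
#include <cstdio>
#include <functional>

struct server_context {
    void process_new_task(int id) { std::printf("task %d\n", id); }
};

struct server_queue {
    std::function<void(int)> callback_new_task;
};

int main() {
    server_context ctx;
    server_queue queue;

    // before: std::bind with a placeholder
    queue.callback_new_task =
        std::bind(&server_context::process_new_task, &ctx, std::placeholders::_1);

    // after: an equivalent lambda that states the call explicitly
    queue.callback_new_task = [&ctx](int id) { ctx.process_new_task(id); };

    queue.callback_new_task(42);
}
```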

4 months ago server : (docs) added response format for /apply-template [no ci] (#11503)
Isaac McFadyen [Thu, 30 Jan 2025 09:11:53 +0000 (04:11 -0500)]
server : (docs) added response format for /apply-template [no ci] (#11503)

4 months ago readme : reference examples relative links (#11505)
Guspan Tanadi [Thu, 30 Jan 2025 05:58:02 +0000 (12:58 +0700)]
readme : reference examples relative links (#11505)

4 months ago server : update json snippets in README.md [no ci] (#11492)
Daniel Bevenius [Thu, 30 Jan 2025 04:48:14 +0000 (05:48 +0100)]
server : update json snippets in README.md [no ci] (#11492)

This commit updates some of the JSON snippets in the README.md file and
removes the `json` language tag from the code blocks.

The motivation for this change is that invalid JSON in a code snippet
is highlighted in red, which can make it somewhat difficult to read and
can be a little distracting.

4 months ago server : add /apply-template endpoint for additional use cases of Minja functionality...
Nigel Bosch [Wed, 29 Jan 2025 18:45:44 +0000 (12:45 -0600)]
server : add /apply-template endpoint for additional use cases of Minja functionality (#11489)

* add /apply-template endpoint to server

* remove unnecessary line

* add /apply-template documentation

* return only "prompt" field in /apply-template

* use suggested idea instead of my overly verbose way

4 months ago vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360)
Rémy Oudompheng [Wed, 29 Jan 2025 17:29:39 +0000 (18:29 +0100)]
vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360)

* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <redacted>
4 months ago server : update auto gen files comments [no ci] (#11484)
Daniel Bevenius [Wed, 29 Jan 2025 15:34:18 +0000 (16:34 +0100)]
server : update auto gen files comments [no ci] (#11484)

* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.

The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269bca75b2d08119c653512cd20b4ea2ba ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information
can be found in the README.md file.

4 months ago vulkan: Catch pipeline creation failure and print an error message (#11436)
Jeff Bolz [Wed, 29 Jan 2025 15:26:50 +0000 (09:26 -0600)]
vulkan: Catch pipeline creation failure and print an error message (#11436)

* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging

4 months ago Parse https://ollama.com/library/ syntax (#11480)
Eric Curtin [Wed, 29 Jan 2025 11:23:10 +0000 (12:23 +0100)]
Parse https://ollama.com/library/ syntax (#11480)

People search for ollama models using the web UI; this change allows
one to copy the URL from the browser and have it be compatible with
llama-run.

Signed-off-by: Eric Curtin <redacted>
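
A hedged sketch of the normalization this enables; the exact mapping lives in llama-run, and the `ollama://` form below is an assumption about its internal scheme:

```cpp
#include <cstdio>
#include <string>

static std::string parse_ollama_url(std::string url) {
    const std::string prefix = "https://ollama.com/library/";
    if (url.rfind(prefix, 0) == 0) {
        url = url.substr(prefix.size()); // e.g. "smollm:135m"
    }
    return "ollama://" + url;
}

int main() {
    std::printf("%s\n",
        parse_ollama_url("https://ollama.com/library/smollm:135m").c_str());
}
```
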
4 months ago sync : ggml
Georgi Gerganov [Wed, 29 Jan 2025 09:25:29 +0000 (11:25 +0200)]
sync : ggml

4 months ago ggml : add option to not print stack on abort (ggml/1081)
William Tambellini [Thu, 23 Jan 2025 19:59:08 +0000 (11:59 -0800)]
ggml : add option to not print stack on abort (ggml/1081)

* Add option to not print stack on abort

Add option/envvar to disable stack printing on abort.
Also link some unittests with Threads to fix link errors on
ubuntu/g++11.

* Update ggml/src/ggml.c

---------

Co-authored-by: Diego Devesa <redacted>
4 months ago ggml-cpu : fix ggml_graph_compute_thread not terminating on abort (ggml/1065)
issixx [Fri, 17 Jan 2025 12:29:08 +0000 (21:29 +0900)]
ggml-cpu : fix ggml_graph_compute_thread not terminating on abort (ggml/1065)

Some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <redacted>
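
The shape of the fix, sketched with std::thread instead of ggml's actual thread pool: workers poll a shared abort flag so they can leave the compute loop instead of spinning forever.

```cpp
#include <atomic>
#include <thread>
#include <vector>

int main() {
    std::atomic<bool> abort_requested{false};
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&abort_requested] {
            while (!abort_requested.load(std::memory_order_relaxed)) {
                // ... compute one graph node ...
            } // without this check, a thread could keep looping after an abort
        });
    }
    abort_requested.store(true); // set when an abort is requested
    for (auto & w : workers) {
        w.join(); // now guaranteed to return
    }
}
```
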
4 months ago embedding : enable --no-warmup option (#11475)
Daniel Bevenius [Wed, 29 Jan 2025 08:38:54 +0000 (09:38 +0100)]
embedding : enable --no-warmup option (#11475)

This commit enables the `--no-warmup` option for llama-embedding.

The motivation for this change is to allow the user to disable the
warmup when running the program.

4 months ago llama: fix missing k_cache store for rwkv6qwen2 (#11445)
Molly Sophia [Wed, 29 Jan 2025 04:07:21 +0000 (12:07 +0800)]
llama: fix missing k_cache store for rwkv6qwen2 (#11445)

Signed-off-by: Molly Sophia <redacted>
4 months ago cmake: add hints for locating ggml on Windows using Llama find-package (#11466)
Emreerdog [Tue, 28 Jan 2025 23:22:06 +0000 (02:22 +0300)]
cmake: add hints for locating ggml on Windows using Llama find-package (#11466)

4 months ago server : Fixed wrong function name in llamacpp server unit test (#11473)
peidaqi [Tue, 28 Jan 2025 23:03:42 +0000 (16:03 -0700)]
server : Fixed wrong function name in llamacpp server unit test (#11473)

The test_completion_stream_with_openai_library() function was actually running with stream=False by default, and test_completion_with_openai_library() with stream=True.

4 months ago ci : fix build CPU arm64 (#11472)
Xuan-Son Nguyen [Tue, 28 Jan 2025 23:02:56 +0000 (00:02 +0100)]
ci : fix build CPU arm64 (#11472)

* ci : fix build CPU arm64

* failed, trying ubuntu 22

* vulkan: ubuntu 24

* vulkan : jammy --> noble

4 months ago HIP: Suppress transformation warning in softmax.cu
uvos [Tue, 28 Jan 2025 22:06:32 +0000 (23:06 +0100)]
HIP: Suppress transformation warning in softmax.cu

Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM cannot unroll the loops here.
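
A HIP/CUDA-flavored sketch of the ncols_template pattern behind the warning: a nonzero template argument makes the bound constexpr and unrollable, while ncols_template == 0 falls back to a runtime bound.

```cpp
template <int ncols_template>
static float row_sum(const float * x, int ncols_runtime) {
    const int ncols = ncols_template == 0 ? ncols_runtime : ncols_template;
    float sum = 0.0f;
#pragma unroll // honored only when ncols is a compile-time constant
    for (int i = 0; i < ncols; ++i) {
        sum += x[i];
    }
    return sum;
}

int main() {
    const float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return (int) (row_sum<4>(x, 4) - row_sum<0>(x, 4)); // 0
}
```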

4 months ago HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation...
Nikita Sarychev [Tue, 28 Jan 2025 15:42:20 +0000 (07:42 -0800)]
HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (#11080)

This disables the workaround on fixed rocblas versions (>= 4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.

4 months ago Add github protocol pulling and http:// (#11465)
Eric Curtin [Tue, 28 Jan 2025 14:45:41 +0000 (15:45 +0100)]
Add github protocol pulling and http:// (#11465)

These are added as pulling protocols to llama-run.

Signed-off-by: Eric Curtin <redacted>
4 months ago docker: allow installing pip packages system-wide (#11437)
Nuno [Tue, 28 Jan 2025 14:17:25 +0000 (15:17 +0100)]
docker: allow installing pip packages system-wide (#11437)

Signed-off-by: rare-magma <redacted>
4 months ago cmake : don't fail on `GGML_CPU=OFF` (#11457)
someone13574 [Tue, 28 Jan 2025 14:15:34 +0000 (09:15 -0500)]
cmake : don't fail on `GGML_CPU=OFF` (#11457)

4 months ago docker: add perplexity and bench commands to full image (#11438)
Nuno [Tue, 28 Jan 2025 10:42:32 +0000 (11:42 +0100)]
docker: add perplexity and bench commands to full image (#11438)

Signed-off-by: rare-magma <redacted>
4 months ago SYCL : SOFTMAX F16 mask support and other fixes (#11261)
Akarshan Biswas [Tue, 28 Jan 2025 09:56:58 +0000 (15:26 +0530)]
SYCL : SOFTMAX F16 mask support and other fixes (#11261)

Implemented ggml_sycl_op_soft_max() F16 src1 (mask) support, for which a pragma deprecation warning was added during #5021.
To do this, it had to be decoupled from ggml_sycl_op_flatten, which always considered src1 to be of fp32 type (many OP functions depend on it).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases

4 months ago Handle missing model in CLI parameters for llama-run (#11399)
Michael Engel [Tue, 28 Jan 2025 08:32:40 +0000 (09:32 +0100)]
Handle missing model in CLI parameters for llama-run (#11399)

The HTTP client in llama-run only prints an error if the download of
a resource fails. If the model name in the CLI parameter list is missing,
this causes the application to crash.
In order to prevent this, a check for the required model parameter has been
added, and errors for resource downloads get propagated to the caller.

Signed-off-by: Michael Engel <redacted>
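
The model-parameter guard amounts to simple argument validation before any download is attempted; a minimal sketch (the usage text is illustrative, not llama-run's actual help):

```cpp
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "error: missing required model parameter\n"
                             "usage: llama-run MODEL [PROMPT]\n");
        return 1; // fail early instead of crashing later
    }
    std::printf("model: %s\n", argv[1]);
    return 0;
}
```
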
4 months ago Add new hf protocol for ollama (#11449)
Eric Curtin [Mon, 27 Jan 2025 18:36:10 +0000 (19:36 +0100)]
Add new hf protocol for ollama (#11449)

https://huggingface.co/docs/hub/en/ollama

Signed-off-by: Eric Curtin <redacted>
5 months ago AMD: parse the architecture as supplied by gcnArchName (#11244)
Haus1 [Mon, 27 Jan 2025 13:58:17 +0000 (08:58 -0500)]
AMD: parse the architecture as supplied by gcnArchName (#11244)

The value provided by `minor` doesn't include the stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
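
A hedged sketch of pulling an ID out of a gcnArchName-style string such as "gfx1010:sramecc+:xnack-"; gfx names encode the ID in hex-ish digits, and the real parsing in the HIP backend may differ:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

static int parse_gfx_id(const std::string & arch_name) {
    if (arch_name.rfind("gfx", 0) != 0) {
        return -1; // not a gcnArchName-style string
    }
    size_t end = 3;
    while (end < arch_name.size() && std::isxdigit((unsigned char) arch_name[end])) {
        end++;
    }
    if (end == 3) {
        return -1; // "gfx" with no ID digits
    }
    // gfx IDs read naturally as hex: gfx90a -> 0x90a, gfx1010 -> 0x1010
    return (int) std::stol(arch_name.substr(3, end - 3), nullptr, 16);
}

int main() {
    std::printf("0x%x\n", parse_gfx_id("gfx1010:sramecc+:xnack-")); // 0x1010
}
```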

5 months ago llama : minor fixes to speed up llama model loading (#11448)
lexasub [Mon, 27 Jan 2025 13:42:09 +0000 (17:42 +0400)]
llama : minor fixes to speed up llama model loading (#11448)

* impl::load: change the bpe_ranks map to an unordered map, reducing impl::load time by 30%

* llama_model_loader::init_mapping: replace `new llama_mmap` with `std::make_unique<llama_mmap>` for cleaner code, halving the running time of init_mappings

* Update src/llama-vocab.cpp

---------

Co-authored-by: lexasub <redacted>
Co-authored-by: Diego Devesa <redacted>
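
The first bullet's data-structure change, sketched: bpe_ranks lookups move from a sorted map (O(log n), with costly pair comparisons) to a hash map. The pair hash below is an illustrative stand-in for whatever the real code uses.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>

using bpe_pair = std::pair<std::string, std::string>;

// std::unordered_map needs a hash for pair keys; this combiner is illustrative
struct pair_hash {
    size_t operator()(const bpe_pair & p) const {
        return std::hash<std::string>{}(p.first) ^
               (std::hash<std::string>{}(p.second) << 1);
    }
};

// before: std::map<bpe_pair, int> bpe_ranks;           // O(log n) lookups
std::unordered_map<bpe_pair, int, pair_hash> bpe_ranks; // amortized O(1)

int main() {
    bpe_ranks[{"th", "e"}] = 0;
    std::printf("rank: %d\n", bpe_ranks.at({"th", "e"}));
}
```
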
5 months ago llama: refactor llama_decode_impl (#11381)
Johannes Gäßler [Mon, 27 Jan 2025 11:07:12 +0000 (12:07 +0100)]
llama: refactor llama_decode_impl (#11381)

5 months ago metal: Handle null returned from MTLCreateSystemDefaultDevice() (#11441)
Ihar Hrachyshka [Mon, 27 Jan 2025 07:41:59 +0000 (02:41 -0500)]
metal: Handle null returned from MTLCreateSystemDefaultDevice() (#11441)

This fixes a segmentation fault when running tests and no Metal
devices are available (for example, when not linked with the Core
Graphics framework).

5 months ago docker : fix ARM build and Vulkan build (#11434)
Xuan Son Nguyen [Sun, 26 Jan 2025 21:45:32 +0000 (22:45 +0100)]
docker : fix ARM build and Vulkan build (#11434)

* ci : do not fail-fast for docker

* build arm64/amd64 separately

* fix pip

* no fast fail

* vulkan: try jammy

5 months ago metal : use residency sets (#11427)
Georgi Gerganov [Sun, 26 Jan 2025 18:06:16 +0000 (20:06 +0200)]
metal : use residency sets (#11427)

* metal : use residency sets

ggml-ci

* metal : restore commandBufferWithUnretainedReferences calls [no ci]

* metal : release descriptors

ggml-ci

* metal : check env GGML_METAL_NO_RESIDENCY

ggml-ci

* metal : fix build + clean-up

ggml-ci

5 months ago docker: add missing vulkan library to base layer and update to 24.04 (#11422)
Nuno [Sun, 26 Jan 2025 17:22:43 +0000 (18:22 +0100)]
docker: add missing vulkan library to base layer and update to 24.04 (#11422)

Signed-off-by: rare-magma <redacted>
5 months ago cmake: add ggml find package (#11369)
bandoti [Sun, 26 Jan 2025 16:07:48 +0000 (12:07 -0400)]
cmake: add ggml find package (#11369)

* Add initial ggml cmake package

* Add build numbers to ggml find-package

* Expand variables with GGML_ prefix

* Guard against adding to cache variable twice

* Add git to msys2 workflow

* Handle ggml-cpu-* variants

* Link ggml/ggml-base libraries to their targets

* Replace main-cmake-pkg with simple-cmake-pkg

* Interface features require c_std_90

* Fix typo

* Removed unnecessary bracket from status message

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
5 months ago rpc: fix register position (#11424)
Frank Mai [Sun, 26 Jan 2025 15:20:34 +0000 (23:20 +0800)]
rpc: fix register position (#11424)

Signed-off-by: thxCode <redacted>
5 months ago readme : update hot topics
Georgi Gerganov [Sun, 26 Jan 2025 12:30:15 +0000 (14:30 +0200)]
readme : update hot topics

5 months ago build: apply MSVC /bigobj option to c/cpp files only (#11423)
Jeff Bolz [Sun, 26 Jan 2025 02:10:03 +0000 (20:10 -0600)]
build: apply MSVC /bigobj option to c/cpp files only (#11423)

5 months ago vulkan: compile shaders on-demand (#11406)
Jeff Bolz [Sat, 25 Jan 2025 21:29:57 +0000 (15:29 -0600)]
vulkan: compile shaders on-demand (#11406)

Reduce first-run startup time and memory consumption.

Should fix #11339.
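
The on-demand idea in miniature, with a toy `pipeline` type instead of real Vulkan objects: the compile cost is paid on first use and cached afterwards.

```cpp
#include <cstdio>
#include <memory>
#include <string>

struct pipeline { std::string shader; };

struct lazy_pipeline {
    std::string shader;
    std::unique_ptr<pipeline> compiled;

    pipeline & get() {
        if (!compiled) { // first use pays the (expensive) compile cost
            std::printf("compiling %s\n", shader.c_str());
            compiled = std::make_unique<pipeline>(pipeline{shader});
        }
        return *compiled; // later uses hit the cache
    }
};

int main() {
    lazy_pipeline p{"matmul_f16", nullptr};
    p.get(); // compiles here
    p.get(); // cached
}
```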

5 months ago Hip: disable VMM on hip as it seems that it doesn't work in some configurations (...
uvos [Sat, 25 Jan 2025 20:01:12 +0000 (21:01 +0100)]
Hip: disable VMM on hip as it seems that it doesn't work in some configurations (#11420)

5 months ago build: add /bigobj to MSVC build (#11407)
Jeff Bolz [Sat, 25 Jan 2025 17:26:37 +0000 (11:26 -0600)]
build: add /bigobj to MSVC build (#11407)

5 months ago docker : add GGML_CPU_ARM_ARCH arg to select ARM architecture to build for (#11419)
Diego Devesa [Sat, 25 Jan 2025 16:22:41 +0000 (17:22 +0100)]
docker : add GGML_CPU_ARM_ARCH arg to select ARM architecture to build for (#11419)

5 months ago server : fix cleaning up stream task (#11418)
Xuan Son Nguyen [Sat, 25 Jan 2025 15:36:44 +0000 (16:36 +0100)]
server : fix cleaning up stream task (#11418)

* server : fix cleaning up stream task

* one more spot

5 months ago docker : fix CPU ARM build (#11403)
Diego Devesa [Sat, 25 Jan 2025 14:22:29 +0000 (15:22 +0100)]
docker : fix CPU ARM build (#11403)

* docker : fix CPU ARM build

* add CURL to other builds

5 months ago ci : fix line breaks on windows builds (#11409)
Georgi Gerganov [Sat, 25 Jan 2025 11:36:48 +0000 (13:36 +0200)]
ci : fix line breaks on windows builds (#11409)

* ci : fix line breaks on windows builds

* cont : another try

* ci : fix powershell line breaks

5 months ago CANN: Add Ascend CANN build ci (#10217)
jiahao su [Fri, 24 Jan 2025 23:26:01 +0000 (07:26 +0800)]
CANN: Add Ascend CANN build ci (#10217)

* CANN: Add Ascend CANN build ci

* Update build.yml

* Modify cann image version

* Update build.yml

* Change to run on x86 system

* Update build.yml

* Update build.yml

* Modify format error

* Update build.yml

* Add 'Ascend NPU' label restrictions

* Exclude non PR event

Co-authored-by: Yuanhao Ji <redacted>
* Update build.yml

---------

Co-authored-by: Yuanhao Ji <redacted>
5 months ago hip : Add hipGraph and VMM support to ROCM (#11362)
uvos [Fri, 24 Jan 2025 23:02:23 +0000 (00:02 +0100)]
hip : Add hipGraph and VMM support to ROCM (#11362)

* Add hipGraph support

* Enable VMM on rocm

5 months ago CUDA: fix FP16 cuBLAS GEMM (#11396)
Johannes Gäßler [Fri, 24 Jan 2025 20:02:43 +0000 (21:02 +0100)]
CUDA: fix FP16 cuBLAS GEMM (#11396)

5 months ago rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356)
uvos [Fri, 24 Jan 2025 16:50:49 +0000 (17:50 +0100)]
rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356)

5 months ago release : pack /lib in the packages (#11392)
Georgi Gerganov [Fri, 24 Jan 2025 16:41:30 +0000 (18:41 +0200)]
release : pack /lib in the packages (#11392)

* release : pack /lib and /include in the packages

* cmake : put libs in /bin

* TMP : push artifacts

* Revert "TMP : push artifacts"

This reverts commit 4decf2c4dfc5cdf5d96ea44c03c8f9801ab41262.

* ci : fix HIP cmake compiler options to be on first line

* ci : restore the original HIP commands

* ci : change ubuntu build from latest to 20.04

* ci : try to fix macos build rpaths

* ci : remove obsolete MacOS build

* TMP : push artifacts

* ci : change back to ubuntu latest

* ci : macos set build rpath to "@loader_path"

* ci : fix typo

* ci : change ubuntu package to 22.04

* Revert "TMP : push artifacts"

This reverts commit 537b09e70ffc604c414ee78acf3acb4c940ec597.

5 months ago docs : Update readme to build targets for local docker build (#11368)
Jafar Uruç [Fri, 24 Jan 2025 13:30:13 +0000 (13:30 +0000)]
docs : Update readme to build targets for local docker build (#11368)

5 months ago CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380)
Johannes Gäßler [Fri, 24 Jan 2025 11:38:31 +0000 (12:38 +0100)]
CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380)

5 months ago cmake : avoid -march=native when reproducible build is wanted (#11366)
Bernhard M. Wiedemann [Fri, 24 Jan 2025 11:21:35 +0000 (12:21 +0100)]
cmake : avoid -march=native when reproducible build is wanted (#11366)

See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Without this patch, compiling on different machines produced different binaries, which made verification of results difficult.

Fixes: #11317
This patch was done while working on reproducible builds for openSUSE.

5 months ago Update llama-run README.md (#11386)
Eric Curtin [Fri, 24 Jan 2025 09:39:24 +0000 (09:39 +0000)]
Update llama-run README.md (#11386)

For consistency

Signed-off-by: Eric Curtin <redacted>
5 months ago server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364)
stduhpf [Fri, 24 Jan 2025 08:02:38 +0000 (09:02 +0100)]
server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364)

* webui : put DeepSeek R1 CoT in a collapsible <details> element

* webui: refactor split

* webui: don't use regex to split cot and response

* webui: format+qol

* webui: no loading icon if the model isn't generating

* ui fix, add configs

* add jsdoc types

* only filter </think> for assistant msg

* build

* update build

---------

Co-authored-by: Xuan Son Nguyen <redacted>