]>
git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Olivier Chafik [Sun, 25 May 2025 09:45:49 +0000 (10:45 +0100)]
server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <redacted>
Piotr Jasiukajtis [Sun, 25 May 2025 08:29:43 +0000 (10:29 +0200)]
llama : add support for Qwen3 MoE tied word embeddings (#13768)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation
This reverts commit
02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5 .
ggml-ci
Olivier Chafik [Sun, 25 May 2025 00:48:08 +0000 (01:48 +0100)]
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Olivier Chafik <redacted>
Diego Devesa [Sat, 24 May 2025 22:55:16 +0000 (15:55 -0700)]
releases : bundle llvm omp library in windows release (#13763)
Diego Devesa [Sat, 24 May 2025 20:27:03 +0000 (13:27 -0700)]
releases : enable openmp in windows cpu backend build (#13756)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (#13758)
0cc4m [Sat, 24 May 2025 14:49:12 +0000 (16:49 +0200)]
Move GLM4 f32 attention fix to the correct function (#13750)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
Sigbjørn Skjæret [Sat, 24 May 2025 10:29:09 +0000 (12:29 +0200)]
vocab : fix ugm tokenizer precision (#13743)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (#13742)
Diego Devesa [Fri, 23 May 2025 20:14:00 +0000 (13:14 -0700)]
ci : enable winget package updates (#13734)
Diego Devesa [Fri, 23 May 2025 20:09:38 +0000 (13:09 -0700)]
ci : add winget package updater (#13732)
Georgi Gerganov [Fri, 23 May 2025 17:16:13 +0000 (20:16 +0300)]
hparams : initialize arrays (#13728)
ggml-ci
Xuan-Son Nguyen [Fri, 23 May 2025 15:07:04 +0000 (17:07 +0200)]
llama : allow custom list of swa_layers (#13726)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)
* server : support audio input
* add audio support on webui
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <redacted>
* codestyle adjustment
Signed-off-by: noemotiovon <redacted>
---------
Signed-off-by: noemotiovon <redacted>
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)
* release : fix windows hip release
* make single hip release with multiple targets
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
Xuan-Son Nguyen [Thu, 22 May 2025 18:42:48 +0000 (20:42 +0200)]
mtmd : add ultravox audio input (#13623)
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
Aaron Teo [Thu, 22 May 2025 18:31:29 +0000 (02:31 +0800)]
common: Include torch package for s390x (#13699)
* common: update requirements.txt to include pytorch nightly for s390x
Signed-off-by: Aaron Teo <redacted>
* common: fix torch installation via pip for s390x
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Georgi Gerganov [Thu, 22 May 2025 13:33:39 +0000 (16:33 +0300)]
server : pad small embedding batches (#13692)
ggml-ci
Sigbjørn Skjæret [Thu, 22 May 2025 12:25:05 +0000 (14:25 +0200)]
gguf-py : correct charsmap parameter typing (#13701)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (#13702)
* removes the waits in async memcpy functions
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/
39e73ae0d69f882d7e29cecc6dd8f5052fca6731 /ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (#12622)
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (#12795)
* opencl: fix couple crashes
* fix kernel launches failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
Diego Devesa [Wed, 21 May 2025 20:09:57 +0000 (13:09 -0700)]
releases : build CPU backend separately (windows) (#13642)
Georgi Gerganov [Wed, 21 May 2025 17:00:49 +0000 (20:00 +0300)]
hparams : support models for which all layers use SWA (#13682)
ggml-ci
Georgi Gerganov [Wed, 21 May 2025 16:46:56 +0000 (19:46 +0300)]
server : improve error reporting (#13680)
antichristHater [Wed, 21 May 2025 16:40:35 +0000 (19:40 +0300)]
convert : add qwen2vl support for unsloth merges (#13686)
Sigbjørn Skjæret [Wed, 21 May 2025 14:57:38 +0000 (16:57 +0200)]
examples : switch retrieval to llama_encode (#13685)
* switch retrieval to llama_encode
* enable --no-warmup for retrieval
Emmanuel Ferdman [Wed, 21 May 2025 14:33:54 +0000 (17:33 +0300)]
gguf-py : display the invalid gguf type (#13687)
Signed-off-by: Emmanuel Ferdman <redacted>
Xuan-Son Nguyen [Wed, 21 May 2025 14:26:33 +0000 (16:26 +0200)]
ggml : add ggml_gelu_erf() (#13667)
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggesions
* revert naming order
Robin Davidsson [Wed, 21 May 2025 13:15:27 +0000 (15:15 +0200)]
server : Add the endpoints /api/tags and /api/chat (#13659)
* Add the endpoints /api/tags and /api/chat
Add the endpoints /api/tags and /api/chat, and improved the model metadata response
* Remove trailing whitespaces
* Removed code that is not needed for copilot to work.
Dorin-Andrei Geman [Wed, 21 May 2025 13:07:57 +0000 (16:07 +0300)]
server : fix first message identification (#13634)
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
* server : Fix checks for first role message for stream=True
Co-authored-by: Piotr Stankiewicz <redacted>
Signed-off-by: Dorin Geman <redacted>
---------
Signed-off-by: Dorin Geman <redacted>
Co-authored-by: Piotr Stankiewicz <redacted>
Georgi Gerganov [Wed, 21 May 2025 12:11:13 +0000 (15:11 +0300)]
kv-cache : simplify the interface (#13660)
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
Georgi Gerganov [Wed, 21 May 2025 10:09:21 +0000 (13:09 +0300)]
model : disable SWA for Phi models (#13676)
* model : disable SWA for Phi models
ggml-ci
* model : update warning message
* model : print warning only if n_swa > 0
* model : fix typo
R0CKSTAR [Wed, 21 May 2025 01:58:49 +0000 (09:58 +0800)]
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <redacted>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <redacted>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <redacted>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <redacted>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Eve [Tue, 20 May 2025 21:35:16 +0000 (21:35 +0000)]
vulkan: fix warnings (#13626)
* small fixes
* remove ifdef
l3utterfly [Tue, 20 May 2025 16:55:30 +0000 (00:55 +0800)]
mtmd-helper : bug fix to token batching in mtmd (#13650)
* Update mtmd-helper.cpp
* Update tools/mtmd/mtmd-helper.cpp
Co-authored-by: Xuan-Son Nguyen <redacted>
---------
Co-authored-by: Xuan-Son Nguyen <redacted>
Georgi Gerganov [Tue, 20 May 2025 16:21:04 +0000 (19:21 +0300)]
model : fix llama4 graph (#13663)
ggml-ci
Georgi Gerganov [Tue, 20 May 2025 13:13:16 +0000 (16:13 +0300)]
llama : remove llama_kv_cache_view API + remove deprecated (#13653)
ggml-ci
Johannes Gäßler [Tue, 20 May 2025 12:45:07 +0000 (14:45 +0200)]
CUDA: skip fully masked-out KV in FA vec kernel (#13584)
* CUDA: skip fully masked-out KV in FA vec kernel
Sigbjørn Skjæret [Tue, 20 May 2025 10:03:17 +0000 (12:03 +0200)]
tests : avoid github urls due to throttling (#13654)
Svetlozar Georgiev [Tue, 20 May 2025 09:34:15 +0000 (10:34 +0100)]
sycl: disable reorder for sycl mulmat (#13536)
0cc4m [Tue, 20 May 2025 08:11:56 +0000 (10:11 +0200)]
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639)
Georgi Gerganov [Tue, 20 May 2025 07:41:40 +0000 (10:41 +0300)]
metal : fix typo in FA kernel comments (#13651)
Georgi Gerganov [Tue, 20 May 2025 05:05:46 +0000 (08:05 +0300)]
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
Xinpeng Dou [Tue, 20 May 2025 03:43:43 +0000 (11:43 +0800)]
CANN: Update CANN model support (#13162)
* Update CANN model support status
* Update of model support
* update
* update
* update
* fix format of CANN.md
* fix format of CANN.md
* fix format of CANN.md
Nicolò Scipione [Tue, 20 May 2025 00:54:43 +0000 (02:54 +0200)]
sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
psocolovsky [Mon, 19 May 2025 19:17:36 +0000 (21:17 +0200)]
common : add load_progress_callback (#13617)
0cc4m [Mon, 19 May 2025 15:54:08 +0000 (17:54 +0200)]
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)
Alberto Cabrera Pérez [Mon, 19 May 2025 13:38:20 +0000 (14:38 +0100)]
sycl : backend documentation review (#13544)
* sycl: reviewing and updating docs
* Updates Runtime error codes
* Improves OOM troubleshooting entry
* Added a llama 3 sample
* Updated supported models
* Updated releases table
Xuan-Son Nguyen [Mon, 19 May 2025 11:04:14 +0000 (13:04 +0200)]
mtmd : add vision support for llama 4 (#13282)
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still preceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
Alberto Cabrera Pérez [Mon, 19 May 2025 10:46:09 +0000 (11:46 +0100)]
ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532)
Georgi Gerganov [Mon, 19 May 2025 09:50:29 +0000 (12:50 +0300)]
sync : ggml
ggml-ci
Johannes Gäßler [Mon, 19 May 2025 07:33:35 +0000 (09:33 +0200)]
mnist: fix segmentation fault (ggml/1227)
Diego Devesa [Mon, 19 May 2025 01:30:13 +0000 (18:30 -0700)]
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
Daniel Tang [Sat, 17 May 2025 23:06:26 +0000 (19:06 -0400)]
ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
Nick [Mon, 19 May 2025 10:25:41 +0000 (18:25 +0800)]
fix: check model pointer validity before use (#13631)
Chenguang Li [Mon, 19 May 2025 06:21:17 +0000 (14:21 +0800)]
CANN: Support MOE Model MUL_MAT_ID (#13042)
Signed-off-by: noemotiovon <redacted>
Isaac McFadyen [Sat, 17 May 2025 21:59:48 +0000 (17:59 -0400)]
server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
Gilad S. [Sat, 17 May 2025 18:26:43 +0000 (21:26 +0300)]
cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
Georgi Gerganov [Sat, 17 May 2025 09:58:55 +0000 (12:58 +0300)]
parallel : add option for non-shared and larger prompts (#13598)
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
Jeff Bolz [Sat, 17 May 2025 07:14:55 +0000 (16:14 +0900)]
vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
Jeff Bolz [Sat, 17 May 2025 06:35:47 +0000 (15:35 +0900)]
vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)
Z [Fri, 16 May 2025 20:56:28 +0000 (14:56 -0600)]
llguidance : official v0.7.20 release (no actual changes) [noci] (#13594)
Xuan-Son Nguyen [Fri, 16 May 2025 19:50:00 +0000 (21:50 +0200)]
server : do not return error out of context (with ctx shift disabled) (#13577)
Xuan-Son Nguyen [Fri, 16 May 2025 19:49:01 +0000 (21:49 +0200)]
webui : improve accessibility for visually impaired people (#13551)
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
Xuan-Son Nguyen [Fri, 16 May 2025 18:04:18 +0000 (20:04 +0200)]
readme : add list of dependencies and their license (#13591)
Diego Devesa [Fri, 16 May 2025 17:36:51 +0000 (10:36 -0700)]
releases : use arm version of curl for arm releases (#13592)
Georgi Gerganov [Fri, 16 May 2025 17:32:58 +0000 (20:32 +0300)]
metal : add FA-vec kernel for head size 64 (#13583)
ggml-ci
Diego Devesa [Fri, 16 May 2025 14:38:07 +0000 (07:38 -0700)]
llama : print hint when loading a model when no backends are loaded (#13589)
Sigbjørn Skjæret [Fri, 16 May 2025 12:54:23 +0000 (14:54 +0200)]
ci : add ppc64el to build-linux-cross (#13575)
Łukasz Ślusarczyk [Fri, 16 May 2025 10:15:29 +0000 (12:15 +0200)]
sycl : fixed compilation warnings (#13582)
Olivier Chafik [Thu, 15 May 2025 22:29:10 +0000 (23:29 +0100)]
minja: sync (qwen3) (#13573)
* minja: sync https://github.com/google/minja/commit/
f06140fa52fd140fe38e531ec373d8dc9c86aa06
- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58
---------
Co-authored-by: ochafik <redacted>
Diego Devesa [Thu, 15 May 2025 17:13:11 +0000 (10:13 -0700)]
gguf : use ggml log system (#13571)
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
Daniel Tang [Thu, 15 May 2025 16:47:10 +0000 (12:47 -0400)]
gguf-py : fix disconnect-before-connect in editor-gui (#13569)
The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
Xuan-Son Nguyen [Thu, 15 May 2025 15:40:07 +0000 (17:40 +0200)]
convert : fix conversion for llama 4 (#13567)
Atharva Dubey [Thu, 15 May 2025 15:39:52 +0000 (16:39 +0100)]
sycl: simplify bin_bcast_kernel (#13383)
Svetlozar Georgiev [Thu, 15 May 2025 15:35:44 +0000 (16:35 +0100)]
sycl: reordered Q4_K MMVQ (#13109)
Łukasz Ślusarczyk [Thu, 15 May 2025 14:53:41 +0000 (16:53 +0200)]
sycl: use oneDNN for matrices multiplication (#12972)
Diego Devesa [Thu, 15 May 2025 13:46:55 +0000 (06:46 -0700)]
llama-bench : fix -ot with dl backends (#13563)
Xuan-Son Nguyen [Thu, 15 May 2025 12:24:50 +0000 (14:24 +0200)]
webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
Piotr Wilkin (ilintar) [Thu, 15 May 2025 06:40:58 +0000 (08:40 +0200)]
server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540)
Georgi Gerganov [Thu, 15 May 2025 02:57:02 +0000 (05:57 +0300)]
bench : handle decode errors (#13548)
ggml-ci
Olivier Chafik [Thu, 15 May 2025 01:39:51 +0000 (02:39 +0100)]
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <redacted>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Georgi Gerganov [Wed, 14 May 2025 20:15:15 +0000 (23:15 +0300)]
kv-cache : fix out-of-bounds view during reserve graph (#13547)
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
Yibo Cai [Wed, 14 May 2025 19:53:52 +0000 (03:53 +0800)]
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 78.52 | 109.18 | 18.63 | 18.88 |
| 128 | 128 | 2 | 84.62 | 123.94 | 34.54 | 36.92 |
| 128 | 128 | 4 | 84.36 | 122.49 | 52.65 | 61.32 |
| 128 | 128 | 8 | 90.52 | 138.87 | 63.46 | 84.41 |
| 128 | 128 | 16 | 90.11 | 138.56 | 71.04 | 101.33 |
| 128 | 128 | 32 | 89.81 | 137.79 | 75.14 | 110.47 |
---------------------------------------------------------------------
```
Olivier Chafik [Wed, 14 May 2025 18:50:57 +0000 (19:50 +0100)]
`common`: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <redacted>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Sigbjørn Skjæret [Wed, 14 May 2025 18:22:49 +0000 (20:22 +0200)]
editorconfig : fix trailing whitespace from #13542 (#13546)
Gilad S. [Wed, 14 May 2025 16:18:18 +0000 (19:18 +0300)]
fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542)
Johannes Gäßler [Wed, 14 May 2025 14:41:02 +0000 (16:41 +0200)]
CUDA: fix crash on large batch size for quant. MoE (#13537)
Diego Devesa [Wed, 14 May 2025 14:12:36 +0000 (07:12 -0700)]
llama : fix quantize with dl backends (#13539)
Johannes Gäßler [Wed, 14 May 2025 14:08:20 +0000 (16:08 +0200)]
CUDA: faster Deepseek FA, add Turing support (#13435)