]>
git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
Georgi Gerganov [Fri, 30 May 2025 13:25:45 +0000 (16:25 +0300)]
sync : vendor (#13901)
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
Sigbjørn Skjæret [Fri, 30 May 2025 12:50:43 +0000 (14:50 +0200)]
convert : fix rwkv bos/eos token (#13844)
Xuan-Son Nguyen [Fri, 30 May 2025 10:24:37 +0000 (12:24 +0200)]
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
Đinh Trọng Huy [Fri, 30 May 2025 09:56:02 +0000 (18:56 +0900)]
llama : add support for DistilBert (#13907)
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <redacted>
zhangkaihuo [Fri, 30 May 2025 08:31:48 +0000 (16:31 +0800)]
llama : use llm_build_granite for minicpm (#13911)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)
Sigbjørn Skjæret [Thu, 29 May 2025 19:42:31 +0000 (21:42 +0200)]
llama : add support for jina-reranker-v2 (#13900)
Sigbjørn Skjæret [Thu, 29 May 2025 13:36:05 +0000 (15:36 +0200)]
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture
The tests and results match exactly those of ggml/src/CMakeLists.txt
* Switch arch detection over to new function
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 29 May 2025 09:17:16 +0000 (12:17 +0300)]
tests : remove json.hpp from a test (#13880)
ggml-ci
Sigbjørn Skjæret [Thu, 29 May 2025 08:00:57 +0000 (10:00 +0200)]
convert : workaround for AutoConfig dummy labels (#13881)
Sigbjørn Skjæret [Thu, 29 May 2025 06:15:01 +0000 (08:15 +0200)]
llama : add RobertaForSequenceClassification reranker support (#13875)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
Beinsezii [Wed, 28 May 2025 21:50:20 +0000 (14:50 -0700)]
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:31 +0000 (22:35 +0200)]
llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl
* add ref to the PR
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:22 +0000 (22:35 +0200)]
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
bandoti [Wed, 28 May 2025 18:46:47 +0000 (15:46 -0300)]
ci: disable LLAMA_CURL for Linux cross-builds (#13871)
Đinh Trọng Huy [Wed, 28 May 2025 17:01:58 +0000 (02:01 +0900)]
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <redacted>
Đinh Trọng Huy [Wed, 28 May 2025 14:34:18 +0000 (23:34 +0900)]
convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <redacted>
Sky [Wed, 28 May 2025 14:33:54 +0000 (22:33 +0800)]
server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853)
[fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode
Xuan-Son Nguyen [Wed, 28 May 2025 14:12:35 +0000 (16:12 +0200)]
convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion
* fix typo
Alex Fanthome [Wed, 28 May 2025 13:49:28 +0000 (14:49 +0100)]
tests : change umlaut test (#11600)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)
Xuan-Son Nguyen [Wed, 28 May 2025 08:05:54 +0000 (10:05 +0200)]
convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision
* add comment
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (#13837)
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
Georgi Gerganov [Tue, 27 May 2025 16:08:44 +0000 (19:08 +0300)]
cmake : add llama-cparams.cpp to build (#13832)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel
* refactor code
Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print
---------
Co-authored-by: Atharva Dubey <redacted>
Georgi Gerganov [Tue, 27 May 2025 15:04:38 +0000 (18:04 +0300)]
sync : ggml
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (#13824)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage
Xuan-Son Nguyen [Tue, 27 May 2025 12:06:10 +0000 (14:06 +0200)]
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* more strict validate of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
bandoti [Tue, 27 May 2025 11:52:40 +0000 (08:52 -0300)]
docs: remove link for llama-cli function calling (#13810)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (#13811)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (#13814)
Georgi Gerganov [Tue, 27 May 2025 10:49:41 +0000 (13:49 +0300)]
kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
Georgi Gerganov [Tue, 27 May 2025 09:07:52 +0000 (12:07 +0300)]
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
Georgi Gerganov [Tue, 27 May 2025 06:40:59 +0000 (09:40 +0300)]
llama : validate seq id batch input (#13809)
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
Olivier Chafik [Mon, 26 May 2025 21:34:27 +0000 (14:34 -0700)]
server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <redacted>
Georgi Gerganov [Mon, 26 May 2025 19:24:01 +0000 (22:24 +0300)]
scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (#13791)
ggml-ci
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit
43be2d657fec7f7fba54e2cd154106bc0fc45adf .
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
Olivier Chafik [Mon, 26 May 2025 15:03:57 +0000 (08:03 -0700)]
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
standby24x7 [Mon, 26 May 2025 14:55:24 +0000 (23:55 +0900)]
examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.
Signed-off-by: Masanari Iida <redacted>
Olivier Chafik [Mon, 26 May 2025 13:56:49 +0000 (06:56 -0700)]
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
Olivier Chafik [Mon, 26 May 2025 13:16:37 +0000 (06:16 -0700)]
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
Georgi Gerganov [Mon, 26 May 2025 11:03:54 +0000 (14:03 +0300)]
examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
Georgi Gerganov [Mon, 26 May 2025 09:57:50 +0000 (12:57 +0300)]
llama : clarify deprecation message (#13794)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (#13640)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (#13783)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add the basic supports of Flash Attention kernel (#13627)
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constrints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
Olivier Chafik [Sun, 25 May 2025 23:30:51 +0000 (00:30 +0100)]
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Xuan-Son Nguyen [Sun, 25 May 2025 17:02:18 +0000 (19:02 +0200)]
webui : bump max upload file size to 500MB (#13779)
Sigbjørn Skjæret [Sun, 25 May 2025 14:22:29 +0000 (16:22 +0200)]
tests : improve UGM tokenizer test coverage (#13773)
Georgi Gerganov [Sun, 25 May 2025 13:34:36 +0000 (16:34 +0300)]
kv-cache : rework kv_cell (#13706)
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
Percy Piper [Sun, 25 May 2025 12:35:53 +0000 (13:35 +0100)]
rpc : Fix build on OpenBSD (#13541)
Xuan-Son Nguyen [Sun, 25 May 2025 12:06:32 +0000 (14:06 +0200)]
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
ddpasa [Sun, 25 May 2025 12:04:49 +0000 (14:04 +0200)]
docs : add Moondream2 pre-quantized link (#13745)
* Multimodal: Added Moondream2 model and fixed ggml.org link
* Apply suggestions from code review
---------
Co-authored-by: name <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Olivier Chafik [Sun, 25 May 2025 09:45:49 +0000 (10:45 +0100)]
server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <redacted>
Piotr Jasiukajtis [Sun, 25 May 2025 08:29:43 +0000 (10:29 +0200)]
llama : add support for Qwen3 MoE tied word embeddings (#13768)
Akarshan Biswas [Sun, 25 May 2025 07:08:37 +0000 (12:38 +0530)]
SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation
This reverts commit
02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5 .
ggml-ci
Olivier Chafik [Sun, 25 May 2025 00:48:08 +0000 (01:48 +0100)]
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Olivier Chafik <redacted>
Diego Devesa [Sat, 24 May 2025 22:55:16 +0000 (15:55 -0700)]
releases : bundle llvm omp library in windows release (#13763)
Diego Devesa [Sat, 24 May 2025 20:27:03 +0000 (13:27 -0700)]
releases : enable openmp in windows cpu backend build (#13756)
Diego Devesa [Sat, 24 May 2025 20:26:47 +0000 (13:26 -0700)]
ggml-cpu : set openmp wait time if not set (#13758)
0cc4m [Sat, 24 May 2025 14:49:12 +0000 (16:49 +0200)]
Move GLM4 f32 attention fix to the correct function (#13750)
Xuan-Son Nguyen [Sat, 24 May 2025 11:06:47 +0000 (13:06 +0200)]
ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
Sigbjørn Skjæret [Sat, 24 May 2025 10:29:09 +0000 (12:29 +0200)]
vocab : fix ugm tokenizer precision (#13743)
Johannes Gäßler [Sat, 24 May 2025 09:46:19 +0000 (11:46 +0200)]
CUDA: fix race condition in FA vector kernels (#13742)
Diego Devesa [Fri, 23 May 2025 20:14:00 +0000 (13:14 -0700)]
ci : enable winget package updates (#13734)
Diego Devesa [Fri, 23 May 2025 20:09:38 +0000 (13:09 -0700)]
ci : add winget package updater (#13732)
Georgi Gerganov [Fri, 23 May 2025 17:16:13 +0000 (20:16 +0300)]
hparams : initialize arrays (#13728)
ggml-ci
Xuan-Son Nguyen [Fri, 23 May 2025 15:07:04 +0000 (17:07 +0200)]
llama : allow custom list of swa_layers (#13726)
Xuan-Son Nguyen [Fri, 23 May 2025 09:03:47 +0000 (11:03 +0200)]
server : support audio input (#13714)
* server : support audio input
* add audio support on webui
Chenguang Li [Fri, 23 May 2025 08:47:53 +0000 (16:47 +0800)]
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <redacted>
* codestyle adjustment
Signed-off-by: noemotiovon <redacted>
---------
Signed-off-by: noemotiovon <redacted>
Xuan-Son Nguyen [Fri, 23 May 2025 06:12:48 +0000 (08:12 +0200)]
ggml : fix the order of ggml_unary_op (#13718)
Jeff Bolz [Fri, 23 May 2025 04:45:02 +0000 (00:45 -0400)]
vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
Jeff Bolz [Fri, 23 May 2025 04:33:45 +0000 (00:33 -0400)]
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)
Judd [Fri, 23 May 2025 04:33:08 +0000 (12:33 +0800)]
use LOG_WARN to replace `std::cerr` (#13657)
Diego Devesa [Thu, 22 May 2025 22:21:37 +0000 (15:21 -0700)]
release : fix windows hip release (#13707)
* release : fix windows hip release
* make single hip release with multiple targets
Georgi Gerganov [Thu, 22 May 2025 19:21:07 +0000 (22:21 +0300)]
tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
Xuan-Son Nguyen [Thu, 22 May 2025 18:42:48 +0000 (20:42 +0200)]
mtmd : add ultravox audio input (#13623)
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
Aaron Teo [Thu, 22 May 2025 18:31:29 +0000 (02:31 +0800)]
common: Include torch package for s390x (#13699)
* common: update requirements.txt to include pytorch nightly for s390x
Signed-off-by: Aaron Teo <redacted>
* common: fix torch installation via pip for s390x
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
Georgi Gerganov [Thu, 22 May 2025 13:33:39 +0000 (16:33 +0300)]
server : pad small embedding batches (#13692)
ggml-ci
Sigbjørn Skjæret [Thu, 22 May 2025 12:25:05 +0000 (14:25 +0200)]
gguf-py : correct charsmap parameter typing (#13701)
Nicolò Scipione [Thu, 22 May 2025 11:54:43 +0000 (13:54 +0200)]
sycl : Remove waits from function calls (#13702)
* removes the waits in async memcpy functions
Ewan Crawford [Thu, 22 May 2025 08:24:09 +0000 (09:24 +0100)]
SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/
39e73ae0d69f882d7e29cecc6dd8f5052fca6731 /ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
Henry Linjamäki [Wed, 21 May 2025 23:21:45 +0000 (02:21 +0300)]
opencl: Add support for multiple devices (#12622)
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
Henry Linjamäki [Wed, 21 May 2025 20:21:17 +0000 (23:21 +0300)]
opencl: fix couple crashes (#12795)
* opencl: fix couple crashes
* fix kernel launches failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
Diego Devesa [Wed, 21 May 2025 20:09:57 +0000 (13:09 -0700)]
releases : build CPU backend separately (windows) (#13642)
Georgi Gerganov [Wed, 21 May 2025 17:00:49 +0000 (20:00 +0300)]
hparams : support models for which all layers use SWA (#13682)
ggml-ci
Georgi Gerganov [Wed, 21 May 2025 16:46:56 +0000 (19:46 +0300)]
server : improve error reporting (#13680)
antichristHater [Wed, 21 May 2025 16:40:35 +0000 (19:40 +0300)]
convert : add qwen2vl support for unsloth merges (#13686)
Sigbjørn Skjæret [Wed, 21 May 2025 14:57:38 +0000 (16:57 +0200)]
examples : switch retrieval to llama_encode (#13685)
* switch retrieval to llama_encode
* enable --no-warmup for retrieval
Emmanuel Ferdman [Wed, 21 May 2025 14:33:54 +0000 (17:33 +0300)]
gguf-py : display the invalid gguf type (#13687)
Signed-off-by: Emmanuel Ferdman <redacted>