git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Xuan-Son Nguyen [Wed, 4 Jun 2025 08:11:26 +0000 (10:11 +0200)]
llama-graph : use ggml_repeat_4d (#13998)
Johannes Gäßler [Wed, 4 Jun 2025 06:57:05 +0000 (08:57 +0200)]
CUDA: fix FTZ in FA for Gemma 3 (#13991)
Georgi Gerganov [Wed, 4 Jun 2025 06:50:32 +0000 (09:50 +0300)]
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
Jeff Bolz [Tue, 3 Jun 2025 18:30:22 +0000 (13:30 -0500)]
vulkan: fix warnings in perf logger querypool code (#13937)
Xuan-Son Nguyen [Tue, 3 Jun 2025 11:09:36 +0000 (13:09 +0200)]
docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users
* rm flox
* Update README.md
lhez [Mon, 2 Jun 2025 23:54:58 +0000 (16:54 -0700)]
opencl: add `backend_synchronize` (#13939)
* This is not needed by the normal use where the result is read
using `tensor_get`, but it allows perf mode of `test-backend-ops`
to properly measure performance.
rmatif [Mon, 2 Jun 2025 23:53:36 +0000 (23:53 +0000)]
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
Georgi Gerganov [Mon, 2 Jun 2025 18:34:40 +0000 (21:34 +0300)]
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context
ggml-ci
* server : disable speculative decoding for SWA models
Georgi Gerganov [Mon, 2 Jun 2025 18:33:40 +0000 (21:33 +0300)]
metal : use F32 accumulators in FA kernels (#13975)
ggml-ci
Georgi Gerganov [Mon, 2 Jun 2025 17:54:26 +0000 (20:54 +0300)]
gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
Olivier Chafik [Mon, 2 Jun 2025 17:15:44 +0000 (10:15 -0700)]
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
Xuan-Son Nguyen [Mon, 2 Jun 2025 14:29:28 +0000 (16:29 +0200)]
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
shalinib-ibm [Mon, 2 Jun 2025 12:18:36 +0000 (17:48 +0530)]
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".
This patch provides a fix by first converting the string to uppercase before applying the regex.
Signed-off-by: root <redacted>
Co-authored-by: root <redacted>
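Note: the fix described above (uppercase first, then apply the case-sensitive regex) can be sketched as follows. This is a Python stand-in for the CMake logic, not the actual patch; the function name `power_generation` and the exact pattern are assumptions made for illustration.

```python
import re

def power_generation(cpu_string: str):
    """Return the POWER generation number found in cpu_string, or None.

    Hypothetical helper: normalizes the case before matching, so that
    "Power11" and "POWER11" are treated the same way.
    """
    match = re.search(r"POWER([0-9]+)", cpu_string.upper())
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    for name in ("Power11", "POWER10 (architected)", "x86_64"):
        print(name, "->", power_generation(name))
```

Without the `.upper()` call, the same pattern would not match "Power11", which is the failure the commit message describes.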
Atharva Dubey [Mon, 2 Jun 2025 09:12:20 +0000 (10:12 +0100)]
sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder
* wip2: fuse q8 quantization and reorder
* working q8 reorder commit
* restored common.hpp
* remove debug prints
* remove unnecessary headers and remove trailing whitespace
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <redacted>
---------
Co-authored-by: Alberto Cabrera Pérez <redacted>
Johannes Gäßler [Sun, 1 Jun 2025 16:08:05 +0000 (18:08 +0200)]
gguf: fix failure on version == 0 (#13956)
Sigbjørn Skjæret [Sun, 1 Jun 2025 16:07:21 +0000 (18:07 +0200)]
convert : fix nomic-bert-moe mask token (#13757)
Sigbjørn Skjæret [Sun, 1 Jun 2025 15:23:11 +0000 (17:23 +0200)]
convert : fix vocab padding code for bert models (#13954)
Aaron Teo [Sun, 1 Jun 2025 14:53:57 +0000 (22:53 +0800)]
ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded
Signed-off-by: Aaron Teo <redacted>
* gguf: update error message
Signed-off-by: Aaron Teo <redacted>
* gguf: make the non-native endian check more verbose
Signed-off-by: Aaron Teo <redacted>
* ggml: move ggml_assert location
Signed-off-by: Aaron Teo <redacted>
* ggml: reword the endianness check error message
Signed-off-by: Aaron Teo <redacted>
---------
Signed-off-by: Aaron Teo <redacted>
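Note: one plausible way to detect a non-native endian GGUF file is to look at the 32-bit version field that follows the "GGUF" magic. Version numbers are small, so a value whose low 16 bits are all zero was most likely written with the opposite byte order. The sketch below assumes that heuristic and that header layout; the helper name `looks_byte_swapped` is hypothetical and this is not the code from the PR.

```python
import struct

def looks_byte_swapped(header: bytes) -> bool:
    """Guess whether a GGUF header was written with non-native endianness.

    Assumed layout: 4-byte magic "GGUF" followed by a uint32 version.
    The version is read in the host's native byte order.
    """
    magic, version = struct.unpack("=4sI", header[:8])
    if magic != b"GGUF":
        raise ValueError("not a GGUF header")
    # A small version number read with the wrong byte order ends up in the
    # high bytes, leaving the low 16 bits empty.
    return version != 0 and (version & 0xFFFF) == 0

if __name__ == "__main__":
    native = struct.pack("=4sI", b"GGUF", 3)
    swapped = b"GGUF" + struct.pack("=I", 3)[::-1]
    print(looks_byte_swapped(native), looks_byte_swapped(swapped))
```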
Georgi Gerganov [Sun, 1 Jun 2025 09:23:14 +0000 (12:23 +0300)]
sync : ggml
ggml-ci
Kai Pastor [Sat, 31 May 2025 10:49:55 +0000 (12:49 +0200)]
vulkan : Remove unexpected ; (ggml/1253)
Kai Pastor [Sat, 31 May 2025 10:39:19 +0000 (12:39 +0200)]
cmake : Fix broken CMake error messages (ggml/1252)
Radoslav Gerganov [Fri, 30 May 2025 06:11:09 +0000 (09:11 +0300)]
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation was already deleted in commit 9d0762e.
closes: #1235
Georgi Gerganov [Thu, 29 May 2025 10:29:50 +0000 (13:29 +0300)]
sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)
* sync : whisper.cpp
ggml-ci
---------
Co-authored-by: Daniel Tang <redacted>
Radoslav Gerganov [Thu, 29 May 2025 05:34:46 +0000 (08:34 +0300)]
ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends
Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
Daniel Tang [Wed, 28 May 2025 00:58:46 +0000 (20:58 -0400)]
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.
This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
ddh0 [Sun, 1 Jun 2025 08:44:30 +0000 (03:44 -0500)]
readme : update bindings (#13950)
Georgi Gerganov [Sun, 1 Jun 2025 08:42:16 +0000 (11:42 +0300)]
parallel : fix n_junk == 0 (#13952)
Georgi Gerganov [Sun, 1 Jun 2025 08:39:27 +0000 (11:39 +0300)]
kv-cache : split implementation in separate sources (#13920)
ggml-ci
Max Krasnyansky [Sat, 31 May 2025 22:39:19 +0000 (15:39 -0700)]
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.
Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.
Co-authored-by: Diego Devesa <redacted>
* threading: disable SetThreadInfo() calls for older Windows versions
* Update tools/llama-bench/llama-bench.cpp
Co-authored-by: Diego Devesa <redacted>
---------
Co-authored-by: Diego Devesa <redacted>
Jiří Podivín [Sat, 31 May 2025 16:58:35 +0000 (18:58 +0200)]
docs : Note about necessity of having libcurl installed for standard build. (#13945)
Signed-off-by: Jiri Podivin <redacted>
Olivier Chafik [Sat, 31 May 2025 15:26:10 +0000 (08:26 -0700)]
server: allow unclosed thinking tags (#13931)
Georgi Gerganov [Sat, 31 May 2025 12:58:33 +0000 (15:58 +0300)]
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
Georgi Gerganov [Sat, 31 May 2025 12:57:44 +0000 (15:57 +0300)]
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
igardev [Sat, 31 May 2025 09:56:08 +0000 (12:56 +0300)]
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.
* use Modal Provider to simplify the use of confirm and alert modals.
* Increase the z index of the modal dialogs.
* Update index.html.gz
* also add showPrompt
* rebuild
---------
Co-authored-by: igardev <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Georgi Gerganov [Sat, 31 May 2025 09:55:57 +0000 (12:55 +0300)]
llama : auto-batch preparation (#13845)
* llama : auto-batch
ggml-ci
* context : simplify if branching
Xuan-Son Nguyen [Sat, 31 May 2025 08:14:29 +0000 (10:14 +0200)]
mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header
* no object
* apply suggestion from Georgi
* rm mtmd-helper, merge it to mtmd
* missing vendor include dir
Georgi Gerganov [Sat, 31 May 2025 07:24:04 +0000 (10:24 +0300)]
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
Shawn yang [Sat, 31 May 2025 06:48:04 +0000 (14:48 +0800)]
CUDA: add a property to ggml_cuda_device_info to distinguish iGPU from dGPU (#13856) (#13895)
* 1. add "integrated" to ggml_cuda_device_info to distinguish whether the device is an integrated GPU or a discrete GPU
2. Adjust the function "ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted code indentation
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fixed incorrect setting of variable types
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the judgment logic
Co-authored-by: Johannes Gäßler <redacted>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add a defensive security assert
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the support judgment logic.
Co-authored-by: Johannes Gäßler <redacted>
* revert the suggested commit changes because they are not applicable on Jetson devices
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add parentheses to enforce operator precedence
Co-authored-by: Diego Devesa <redacted>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fix CI bug: add a space
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: yangxiao <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Co-authored-by: yangxiao <redacted>
Co-authored-by: Diego Devesa <redacted>
Johannes Gäßler [Fri, 30 May 2025 19:22:03 +0000 (21:22 +0200)]
CUDA: fix typo in FlashAttention code (#13926)
Diego Devesa [Fri, 30 May 2025 16:56:19 +0000 (09:56 -0700)]
sched : avoid changing cur_copy when a graph is already allocated (#13922)
Georgi Gerganov [Fri, 30 May 2025 16:38:07 +0000 (19:38 +0300)]
parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
Diego Devesa [Fri, 30 May 2025 14:37:18 +0000 (07:37 -0700)]
cuda : prevent using split buffers with 3d/4d matrices (#13919)
Akarshan Biswas [Fri, 30 May 2025 14:10:57 +0000 (19:40 +0530)]
SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
Georgi Gerganov [Fri, 30 May 2025 13:25:45 +0000 (16:25 +0300)]
sync : vendor (#13901)
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
Sigbjørn Skjæret [Fri, 30 May 2025 12:50:43 +0000 (14:50 +0200)]
convert : fix rwkv bos/eos token (#13844)
Xuan-Son Nguyen [Fri, 30 May 2025 10:24:37 +0000 (12:24 +0200)]
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
Đinh Trọng Huy [Fri, 30 May 2025 09:56:02 +0000 (18:56 +0900)]
llama : add support for DistilBert (#13907)
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <redacted>
zhangkaihuo [Fri, 30 May 2025 08:31:48 +0000 (16:31 +0800)]
llama : use llm_build_granite for minicpm (#13911)
Christian Kastner [Thu, 29 May 2025 23:28:54 +0000 (01:28 +0200)]
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)
Sigbjørn Skjæret [Thu, 29 May 2025 19:42:31 +0000 (21:42 +0200)]
llama : add support for jina-reranker-v2 (#13900)
Sigbjørn Skjæret [Thu, 29 May 2025 13:36:05 +0000 (15:36 +0200)]
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561)
Yibo Cai [Thu, 29 May 2025 11:39:20 +0000 (19:39 +0800)]
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
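Note: the uplift ranges quoted in the commit message can be re-derived from the table. The short script below uses only the numbers shown above; the helper name `uplift_percent` is made up for this check.

```python
# Batch size -> (S_PP original, S_PP this PR, S_TG original, S_TG this PR), in t/s.
ROWS = {
    1:  (110.12, 147.83,  24.36,  24.28),
    2:  (121.16, 172.42,  46.36,  47.93),
    4:  (120.15, 169.75,  74.68,  84.00),
    8:  (130.97, 196.81,  91.04, 114.74),
    16: (131.01, 196.88, 101.43, 135.79),
    32: (130.85, 196.51, 106.97, 147.29),
}

def uplift_percent(original: float, new: float) -> float:
    """Relative throughput change in percent."""
    return (new / original - 1.0) * 100.0

if __name__ == "__main__":
    for batch, (pp_old, pp_new, tg_old, tg_new) in ROWS.items():
        print(f"B={batch:2d}  S_PP {uplift_percent(pp_old, pp_new):+6.1f}%"
              f"  S_TG {uplift_percent(tg_old, tg_new):+6.1f}%")
```

The prompt-processing uplift stays within roughly 34% to 50% for every batch size, while the token-generation uplift only becomes significant from batch size 4 upward, consistent with the summary in the message.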
Christian Kastner [Thu, 29 May 2025 10:50:25 +0000 (12:50 +0200)]
cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture
The tests and results match exactly those of ggml/src/CMakeLists.txt
* Switch arch detection over to new function
Vineel Abhinav [Thu, 29 May 2025 09:18:43 +0000 (14:48 +0530)]
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 29 May 2025 09:17:16 +0000 (12:17 +0300)]
tests : remove json.hpp from a test (#13880)
ggml-ci
Sigbjørn Skjæret [Thu, 29 May 2025 08:00:57 +0000 (10:00 +0200)]
convert : workaround for AutoConfig dummy labels (#13881)
Sigbjørn Skjæret [Thu, 29 May 2025 06:15:01 +0000 (08:15 +0200)]
llama : add RobertaForSequenceClassification reranker support (#13875)
Vineel Abhinav [Thu, 29 May 2025 06:01:33 +0000 (11:31 +0530)]
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
Beinsezii [Wed, 28 May 2025 21:50:20 +0000 (14:50 -0700)]
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841)
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:31 +0000 (22:35 +0200)]
llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl
* add ref to the PR
Xuan-Son Nguyen [Wed, 28 May 2025 20:35:22 +0000 (22:35 +0200)]
mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
bandoti [Wed, 28 May 2025 18:46:47 +0000 (15:46 -0300)]
ci: disable LLAMA_CURL for Linux cross-builds (#13871)
Đinh Trọng Huy [Wed, 28 May 2025 17:01:58 +0000 (02:01 +0900)]
llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <redacted>
Đinh Trọng Huy [Wed, 28 May 2025 14:34:18 +0000 (23:34 +0900)]
convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <redacted>
Sky [Wed, 28 May 2025 14:33:54 +0000 (22:33 +0800)]
server: fix removal of 'image_url'/'input_audio' JSON objects so they are effectively removed from 'llama_params' in multimodal-model mode (#13853)
[fix]: effectively remove 'image_url'/'input_audio' from 'llama_params' in multimodal-model mode
Xuan-Son Nguyen [Wed, 28 May 2025 14:12:35 +0000 (16:12 +0200)]
convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion
* fix typo
Alex Fanthome [Wed, 28 May 2025 13:49:28 +0000 (14:49 +0100)]
tests : change umlaut test (#11600)
Johannes Gäßler [Wed, 28 May 2025 11:33:37 +0000 (13:33 +0200)]
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)
Xuan-Son Nguyen [Wed, 28 May 2025 08:05:54 +0000 (10:05 +0200)]
convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision
* add comment
leo-pony [Wed, 28 May 2025 03:54:20 +0000 (11:54 +0800)]
CANN: Add SOC TYPE printing in cmake configuration (#13837)
lhez [Tue, 27 May 2025 19:56:08 +0000 (12:56 -0700)]
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787)
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
lhez [Tue, 27 May 2025 19:53:14 +0000 (12:53 -0700)]
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)
Jeff Bolz [Tue, 27 May 2025 16:39:07 +0000 (11:39 -0500)]
vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
Georgi Gerganov [Tue, 27 May 2025 16:08:44 +0000 (19:08 +0300)]
cmake : add llama-cparams.cpp to build (#13832)
Akarshan Biswas [Tue, 27 May 2025 15:22:59 +0000 (20:52 +0530)]
SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel
* refactor code
Co-authored-by: Atharva Dubey <redacted>
* Use scope_op_debug_print
---------
Co-authored-by: Atharva Dubey <redacted>
Georgi Gerganov [Tue, 27 May 2025 15:04:38 +0000 (18:04 +0300)]
sync : ggml
Xuan-Son Nguyen [Tue, 27 May 2025 13:53:55 +0000 (15:53 +0200)]
ggml : add ggml_repeat_4d (#13824)
xctan [Tue, 27 May 2025 13:21:36 +0000 (21:21 +0800)]
ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage
Xuan-Son Nguyen [Tue, 27 May 2025 12:06:10 +0000 (14:06 +0200)]
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* stricter validation of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
bandoti [Tue, 27 May 2025 11:52:40 +0000 (08:52 -0300)]
docs: remove link for llama-cli function calling (#13810)
Christian Kastner [Tue, 27 May 2025 11:18:39 +0000 (13:18 +0200)]
ggml-cpu: x86 feature detection is specific to x86 (#13811)
Diego Devesa [Tue, 27 May 2025 11:05:18 +0000 (04:05 -0700)]
ggml : allow CUDA graphs when using pipeline parallelism (#13814)
Georgi Gerganov [Tue, 27 May 2025 10:49:41 +0000 (13:49 +0300)]
kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
Georgi Gerganov [Tue, 27 May 2025 09:07:52 +0000 (12:07 +0300)]
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
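Note: the guarantee described above (a sampler never returns an empty candidate set, even with min_keep == 0) can be illustrated with a small min-p filter. This is a sketch of the general technique in Python, not the llama.cpp implementation; the function name `min_p_filter` and its signature are assumptions.

```python
def min_p_filter(probs, p, min_keep=0):
    """Return indices of tokens kept by min-p, most probable first.

    A token is kept when its probability is at least p times the
    probability of the most likely token. At least one token is always
    returned, even if min_keep is 0 or p is so large that nothing passes.
    """
    if not probs:
        return []
    threshold = p * max(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    passing = [i for i in ranked if probs[i] >= threshold]
    keep_n = max(len(passing), min_keep, 1)  # the "at least 1" guarantee
    return ranked[:keep_n]

if __name__ == "__main__":
    print(min_p_filter([0.5, 0.3, 0.2], p=0.5))
    print(min_p_filter([0.1, 0.7, 0.2], p=2.0))
```

With p = 2.0 no token can reach the threshold, so the filter falls back to the single most probable token instead of returning nothing.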
Georgi Gerganov [Tue, 27 May 2025 06:40:59 +0000 (09:40 +0300)]
llama : validate seq id batch input (#13809)
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
Olivier Chafik [Mon, 26 May 2025 21:34:27 +0000 (14:34 -0700)]
server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <redacted>
Georgi Gerganov [Mon, 26 May 2025 19:24:01 +0000 (22:24 +0300)]
scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
Georgi Gerganov [Mon, 26 May 2025 19:14:52 +0000 (22:14 +0300)]
cuda : avoid cuGetErrorString (#13791)
ggml-ci
Akarshan Biswas [Mon, 26 May 2025 15:40:36 +0000 (21:10 +0530)]
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
Olivier Chafik [Mon, 26 May 2025 15:03:57 +0000 (08:03 -0700)]
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
standby24x7 [Mon, 26 May 2025 14:55:24 +0000 (23:55 +0900)]
examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.
Signed-off-by: Masanari Iida <redacted>
Olivier Chafik [Mon, 26 May 2025 13:56:49 +0000 (06:56 -0700)]
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
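Note: the four bullet points above describe where each field sits in a streamed tool call delta. The sketch below shows that shape as plain Python dictionaries, assuming the OpenAI-style streaming layout; the exact payload emitted by the server is not shown in this log, and both helper names are hypothetical.

```python
import json

def first_tool_call_delta(index: int, call_id: str, name: str) -> dict:
    """First delta for a tool call: carries id, type, name and empty arguments."""
    return {
        "index": index,
        "id": call_id,            # on the tool call itself, not under "function"
        "type": "function",
        "function": {"name": name, "arguments": ""},
    }

def argument_delta(index: int, fragment: str) -> dict:
    """Later deltas only append a fragment of the JSON-encoded arguments."""
    return {"index": index, "function": {"arguments": fragment}}

if __name__ == "__main__":
    deltas = [
        first_tool_call_delta(0, "call_0", "get_weather"),
        argument_delta(0, '{"city": '),
        argument_delta(0, '"Paris"}'),
    ]
    arguments = "".join(d["function"]["arguments"] for d in deltas)
    print(json.dumps(deltas[0]))
    print(json.loads(arguments))
```

A client reassembles the call by concatenating the `arguments` fragments in order, which is why the first delta starts with an empty string rather than omitting the field.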
Olivier Chafik [Mon, 26 May 2025 13:16:37 +0000 (06:16 -0700)]
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
Georgi Gerganov [Mon, 26 May 2025 11:03:54 +0000 (14:03 +0300)]
examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
Georgi Gerganov [Mon, 26 May 2025 09:57:50 +0000 (12:57 +0300)]
llama : clarify deprecation message (#13794)
Romain Biessy [Mon, 26 May 2025 08:28:53 +0000 (10:28 +0200)]
sycl: Add more debug prints (#13640)
Jeff Bolz [Mon, 26 May 2025 04:02:07 +0000 (23:02 -0500)]
vulkan: mark IM2COL as supporting non-contig (#13783)
Bizhao Shi [Mon, 26 May 2025 02:20:18 +0000 (10:20 +0800)]
CANN: Add basic support for the Flash Attention kernel (#13627)
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
Olivier Chafik [Sun, 25 May 2025 23:30:51 +0000 (00:30 +0100)]
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------
Co-authored-by: ochafik <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Xuan-Son Nguyen [Sun, 25 May 2025 17:02:18 +0000 (19:02 +0200)]
webui : bump max upload file size to 500MB (#13779)