Matthew Michel [Thu, 23 Oct 2025 01:05:15 +0000 (20:05 -0500)]
sycl: use async memory allocation to fix crashes during graph recording (#16644)
* sycl: use async memory allocation to fix graph recording failures
GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
- Host waits are currently unsupported in graph recording mode.
- SYCL malloc / free calls are unsupported in graph recording mode.
The following changes are made to fix SYCL graph functionality:
- When graphs are enabled, use the SYCL async memory extension for temp
buffers which is supported with SYCL graphs.
- For compiler versions that do not support this extension, skip
graphs with the affected op.
- Switch from USM shared to device memory as the async extension
currently just supports device allocations.
* Address reviewer feedback
* Use global async variable to decide path in sycl_ext_[malloc_device|free]
Max Krasnyansky [Wed, 22 Oct 2025 20:47:09 +0000 (13:47 -0700)]
Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)
* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.
Pascal [Wed, 22 Oct 2025 14:58:23 +0000 (16:58 +0200)]
webui: introduce OpenAI-compatible model selector in JSON payload (#16562)
* webui: introduce OpenAI-compatible model selector in JSON payload
* webui: restore OpenAI-Compatible model source of truth and unify metadata capture
This change re-establishes a single, reliable source of truth for the active model:
fully aligned with the OpenAI-Compat API behavior
It introduces a unified metadata flow that captures the model field from both
streaming and non-streaming responses, wiring a new onModel callback through ChatService
The model name is now resolved directly from the API payload rather than relying on
server /props or UI assumptions
ChatStore records and persists the resolved model for each assistant message during
streaming, ensuring consistency across the UI and database
Type definitions for API and settings were also extended to include model metadata
and the onModel callback, completing the alignment with OpenAI-Compat semantics
* webui: address review feedback from allozaur
* webui: move model selector into ChatForm (idea by @allozaur)
* webui: make model selector more subtle and integrated into ChatForm
* webui: replaced the Flowbite selector with a native Svelte dropdown
* webui: add developer setting to toggle the chat model selector
* webui: address review feedback from allozaur
Normalized streamed model names during chat updates
by trimming input and removing directory components before saving
or persisting them, so the conversation UI shows only the filename
Forced model names within the chat form selector dropdown to render as
a single-line, truncated entry with a tooltip revealing the full name
* webui: toggle displayed model source for legacy vs OpenAI-Compat modes
When the selector is disabled, it falls back to the active server model name from /props
When the model selector is enabled, the displayed model comes from the message metadata
(the one explicitly selected and sent in the request)
Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/constants/localstorage-keys.ts
Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <redacted>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <redacted>
* webui: refactor model selector and persistence helpers
- Replace inline portal and event listeners with proper Svelte bindings
- Introduce 'persisted' store helper for localStorage sync without runes
- Extract 'normalizeModelName' utils + Vitest coverage
- Simplify ChatFormModelSelector structure and cleanup logic
Replaced the persisted store helper's use of '$state/$effect' runes with
a plain TS implementation to prevent orphaned effect runtime errors
outside component context
Co-authored-by: Aleksander Grygier <redacted>
* webui: document normalizeModelName usage with inline examples
sirus20x6 [Wed, 22 Oct 2025 10:14:14 +0000 (05:14 -0500)]
ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (#16522)
* Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds.
Aleksander Grygier [Mon, 20 Oct 2025 10:41:13 +0000 (12:41 +0200)]
Enable per-conversation loading states to allow having parallel conversations (#16327)
* feat: Per-conversation loading states and tracking streaming stats
* chore: update webui build output
* refactor: Chat state management
Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.
This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.
Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.
* feat: Adds loading indicator to conversation items
* chore: update webui build output
* fix: Fix aborting chat streaming
Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.
This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.
* refactor: Remove redundant comments
* chore: build webui static output
* refactor: Cleanup
* chore: update webui build output
* chore: update webui build output
* fix: Conversation loading indicator for regenerating messages
* chore: update webui static build
* feat: Improve configuration
* feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI
takuya kodama [Mon, 20 Oct 2025 08:27:09 +0000 (16:27 +0800)]
llama-batch: fix build fails with `-Werror=missing-braces` (#16614)
## Why it failed
When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
126 | std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
| ^
cc1plus: some warnings being treated as errors
```
The issue is that std::array initialization requires double braces.
## How to fix
This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.
This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp
Benefits:
- std::array is a struct containing a C-style array, requiring nested braces
- Enables stricter compiler warnings to catch potential issues
takuya kodama [Mon, 20 Oct 2025 07:44:21 +0000 (15:44 +0800)]
llama-context: only warn on pooling_type when user specified (#16674)
The unexpeced pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.
Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```
This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
Radoslav Gerganov [Fri, 17 Oct 2025 15:02:52 +0000 (18:02 +0300)]
rpc : report actual free memory (#16616)
* rpc : report actual free memory
Start reporting the free memory on every device instead of using
fixed values. Now llama-cli users can get a nice memory breakdown
when using RPC devices.
muggle-stack [Fri, 17 Oct 2025 10:01:23 +0000 (18:01 +0800)]
ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629)
Fix incorrect task-to-batch index calculation in the quantization phase.
The bug caused out-of-bounds access to qnbitgemm_args array when
compute_idx exceeded per_gemm_block_count_m, leading to invalid
pointer dereferences and SIGBUS errors.
Correctly map tasks to batches by dividing compute_idx by
per_gemm_block_count_m instead of block_size_m.
Chenguang Li [Thu, 16 Oct 2025 08:41:11 +0000 (16:41 +0800)]
CANN: format code using .clang-format (#15863)
This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.
takuya kodama [Thu, 16 Oct 2025 05:10:32 +0000 (13:10 +0800)]
ggml-cpu: replace putenv with setenv for const-correctness (#16573)
## Why it failed
When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
3572 | putenv("KMP_BLOCKTIME=200"); // 200ms
| ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
786 | extern int putenv (char *__string) __THROW __nonnull ((1));
| ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```
The issue is that putenv() expects a non-const char * but receives a string literal (const char *).
## How to fix
This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).
Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
Aleksei Nikiforov [Wed, 15 Oct 2025 20:43:08 +0000 (22:43 +0200)]
gguf-py : add support for endian conversion of BF16 data (#16594)
BF16 requires special handling in this script
while it's a 2-bytes data, but view is 1-byte by default.
Switch to correct view before attempting byteswapping.
With this change correctly byteswapping models like
Meta-Llama-3-8B-Instruct-bf16-GGUF
should be possible.
safranowith [Wed, 15 Oct 2025 19:24:51 +0000 (22:24 +0300)]
cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (#16083)
* CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators
- Added the operators to unary op enum
- Implemented API functions
- Implemented forward and unary-op logic in CPU backend
- Updated ggml_get_n_tasks
- Updated operators names array and static_assert
- Updated docs and enabled automatic tests
* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h
* chore: remove trailing whitespace from ggml.h
* Remove unresolved merge markers
* Apply review suggestions: cleanup formatting, enum order and leftover artifacts
Chenguang Li [Mon, 13 Oct 2025 09:01:24 +0000 (17:01 +0800)]
CANN: fix CPU memory leak in CANN backend (#16549)
This commit fixes a CPU-side memory leak issue in the CANN backend,
which occurred when intermediate aclTensorList objects were not properly
released after operator execution. The leak happened during repeated
invocations of CANN ops (e.g., FlashAttention), leading to increasing
host memory usage over time.
Proper resource cleanup (aclDestroyTensorList and related release logic)
has been added to ensure that all temporary tensors are correctly freed.
Pascal [Mon, 13 Oct 2025 08:55:32 +0000 (10:55 +0200)]
fix: add remark plugin to render raw HTML as literal text (#16505)
* fix: add remark plugin to render raw HTML as literal text
Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs
do ensuring consistent and safe Markdown rendering
Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure
Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization
Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required
hipudding [Mon, 13 Oct 2025 00:52:22 +0000 (08:52 +0800)]
CANN: Update several operators to support FP16 data format (#16251)
Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation, and then cast back to FP32 after computation, which
introduces unnecessary cast operations. Moreover, FP16 computation
requires significantly less workload compared to FP32, leading to
noticeable efficiency improvements.
In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.
Pascal [Sun, 12 Oct 2025 16:06:41 +0000 (18:06 +0200)]
webui: remove client-side context pre-check and rely on backend for limits (#16506)
* fix: make SSE client robust to premature [DONE] in agentic proxy chains
* webui: remove client-side context pre-check and rely on backend for limits
Removed the client-side context window pre-check and now simply sends messages
while keeping the dialog imports limited to core components, eliminating the
maximum context alert path
Simplified streaming and non-streaming chat error handling to surface a generic
'No response received from server' error whenever the backend returns no content
Removed the obsolete maxContextError plumbing from the chat store so state
management now focuses on the core message flow without special context-limit cases
Daniel Bevenius [Sun, 12 Oct 2025 05:19:06 +0000 (07:19 +0200)]
hparams : add check for layer index in is_recurrent (#16511)
* hparams : add check for layer index in is_recurrent
This commit adds a check in the is_recurrent method to ensure that the
provided layer index is within the valid range.
The motivation for this change is to prevent potential out-of-bounds
and also be consistent with other methods in the class that perform
similar checks, like is_swa.
sirus20x6 [Sun, 12 Oct 2025 05:15:00 +0000 (00:15 -0500)]
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (#16518)
The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.
Pascal [Sat, 11 Oct 2025 13:50:49 +0000 (15:50 +0200)]
feat: render user content as markdown option (#16358)
* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output
Yann Follet [Sat, 11 Oct 2025 13:39:04 +0000 (21:39 +0800)]
server / ranking : add sorting and management of top_n (#16403)
* server / ranking : add sorting and management of top_n
* Make the retro compatible if no top_n will return
all results
here is a script to make some test
```script
URL=${1:-http://127.0.0.1:8181}
curl "$URL/v1/rerank" -H "Content-Type: application/json" \
-d '{ "model": "M", "query": "What is the recipe to make bread ?",
"return_text" : true,
"texts" : true,
"top_n": 6,
"documents": [
"voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel",
"it is a bear",
"bread recipe : floor, water, yest, salt",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
"here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"recipe to make cookies : floor, eggs, water, chocolat",
"here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"il fait tres beau aujourd hui",
"je n ai pas faim, je ne veux pas manger",
"je suis a paris"
] }' | jq
```
* use resize() instead for(...)
* simplify top_n init since no need to return error