server : handle unsuccessful sink.write in chunked stream provider (#21478)
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
server : fix logging of build + system info (#21460)
This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.
common : fix tool call type detection for nullable and enum schemas (#21327)
* common : fix tool call type detection for nullable and enum schemas
* common, tests : fix grammar delegation for nullable/enum schemas and add tests
Fix enum type inference to scan all enum values (not just index 0) so
schemas like {"enum": [0, "celsius"]} correctly detect string type.
Fix schema_delegates in peg-parser to handle nullable type arrays
(["string", "null"]) and typeless enum schemas in raw mode, allowing
the tagged parser to use raw text instead of JSON-formatted strings.
Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
- nullable string ["string", "null"]
- nullable string with null first ["null", "string"]
- nullable integer ["integer", "null"]
- enum without explicit type key
The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture.
This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.
ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315)
* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
* ggml-zendnn : add braces to sgemm failure condition for consistency
Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865
tests : add unit test coverage for llama_tensor_get_type (#20112)
* Add unit test coverage for llama_tensor_get_type
* Fix merge conflicts, add more schemas
* clang formatter changes
* Trailing whitespace
* Update name
* Start rebase
* Updating files with upstream changes prior to rebase
* Changes needed from rebase
* Update attn_qkv schema, change throw behaviour
* Fix merge conflicts
* White space
* Update with latest changes to state counters
* Revert accidental personal CLAUDE.md changes
* Change quotation mark
* Reuse metadata.name since we have it
* Move test-only stuff out of llama-quant.cpp
* Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns
* cont : inital deslop guidelines
* Cleanup based on review comments
* Continue cleanup
* Small cleanup
* Manually set proper ordering of tensors, mostly applies to gemma
Jesus Talavera [Thu, 2 Apr 2026 09:28:56 +0000 (11:28 +0200)]
chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804)
* chat : add Granite 4.0 chat template with correct tool_call role mapping
Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).
The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.
Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
(preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)
Co-Authored-By: Claude Opus 4.6 <redacted>
* Code review: follow standard format and use common logic in test-chat-template.cpp
* Rename custom_conversation variable for extra_conversation to give it a more meaningful name
memory: respect unified KV cache in hybrid memory for eval tasks (#21224)
The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:
split_equal: sequential split is not supported when there are
coupled sequences in the input batch (you may need to use the
-kvu flag)
PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.
This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.
CUDA/HIP: Fix kernel slection for mmvq mmid kernel to align host selection with device launch bounds (#21238)
The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-nvidia devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This pr fixes the conditionals in get_mmvq_mmid_max_batch.
Abhijit Ramesh [Tue, 31 Mar 2026 22:38:16 +0000 (15:38 -0700)]
ggml-webgpu: port all AOT operators to JIT (#20728)
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:
1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device.
Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.
* CANN: fix L2_NORM ignoring eps parameter
The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching
When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.
Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().
* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison
The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.
Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
so the cache can distinguish tensors with different types but
identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
SCALE/UNARY/GLU/ROPE/POOL_2D
SATISH K C [Tue, 31 Mar 2026 08:52:34 +0000 (03:52 -0500)]
fix: include API key in CORS proxy requests for MCP connections (#21193)
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes #21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics
Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.
- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`
* Handle undefined computed member properties
Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.
* Use default undefined value in member access
Initialize val and then return it when property is undefined.
Co-authored-by: Sigbjørn Skjæret <redacted>
* empty statement parses to blank_expression instead of noop_statement
We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
Fixes #21162
* Reduce nrows in test case to 256, don't need 768
Radoslav Gerganov [Mon, 30 Mar 2026 14:05:11 +0000 (17:05 +0300)]
rpc : fix misleading error log (#21184)
When RPC is running with a remote backend which doesn't have init_tensor
function (like CPU and Metal), the server log gets full with error
messages saying that init_tensor is being called with null buffer which
is incorrect. This patch fixes this.
Gaurav Garg [Sun, 29 Mar 2026 16:35:18 +0000 (22:05 +0530)]
Optimize MOE GEMV kernel for BS > 1. (#20905)
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype