bssrdf [Wed, 5 Nov 2025 20:55:04 +0000 (15:55 -0500)]
improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
* WIP
* added a cpy kernel specific to transposed tensors which uses shared memory (smem) to avoid uncoalesced access; test cases also added showing improved memory bandwidth (see the sketch after this list)
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* transpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for the true transposed case; updated with review suggestions
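For reference, a minimal sketch of the shared-memory tiling technique this change describes; this is not the actual ggml-cuda kernel, and the tile sizes and names (TILE_DIM, BLOCK_ROWS, cpy_transpose_f32) are illustrative assumptions:
```cuda
#include <cuda_runtime.h>

// Illustrative tile geometry; the real kernel may choose differently.
#define TILE_DIM   32
#define BLOCK_ROWS 8

// src is row-major (rows x cols); dst receives the transpose (cols x rows).
// A tile is staged in shared memory so that both the global reads and the
// global writes are coalesced, even though one side of a transpose is strided.
__global__ void cpy_transpose_f32(const float * __restrict__ src,
                                  float * __restrict__ dst,
                                  int rows, int cols) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 pad avoids bank conflicts

    int x = blockIdx.x*TILE_DIM + threadIdx.x; // column in src (coalesced read)
    int y = blockIdx.y*TILE_DIM + threadIdx.y; // row in src

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < cols && y + j < rows) {
            tile[threadIdx.y + j][threadIdx.x] = src[(y + j)*cols + x];
        }
    }
    __syncthreads();

    x = blockIdx.y*TILE_DIM + threadIdx.x; // column in dst (coalesced write)
    y = blockIdx.x*TILE_DIM + threadIdx.y; // row in dst

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < rows && y + j < cols) {
            dst[(y + j)*rows + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}
// Launch with dim3 grid((cols+TILE_DIM-1)/TILE_DIM, (rows+TILE_DIM-1)/TILE_DIM)
// and dim3 block(TILE_DIM, BLOCK_ROWS).
```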
Georgi Gerganov [Tue, 4 Nov 2025 18:40:52 +0000 (20:40 +0200)]
ggml : fix conv2d_dw SVE path (ggml/1380)
* Fix test-conv2d-dw failure on ARM SVE by using runtime vector length
The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines.
Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant.
Co-authored-by: ggerganov <redacted>
* ci : reduce sam score threshold
Noah [Tue, 4 Nov 2025 05:04:59 +0000 (05:04 +0000)]
Fix garbled output with REPACK at high thread counts (#16956)
* Fix garbled output with REPACK at high thread counts
Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (the exact threshold is model-dependent). The issue occurred because, with high thread counts, the code forced the chunk count to equal the thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces a minimum chunk size based on NB_COLS and caps the maximum chunk count so that too many tiny chunks are never created, ensuring proper alignment without overlaps (see the sketch below).
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
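For illustration, a hedged sketch of the chunking scheme described above; the identifier names (nrows, nchunks, NB_COLS) are assumptions, not the exact code in ggml/src/ggml-cpu/repack.cpp. The key property is that every chunk boundary is rounded to a multiple of NB_COLS, so aligned chunks can never overlap:
```cuda
#include <algorithm>
#include <cstdint>

// Cap the chunk count so that each chunk holds at least NB_COLS rows,
// instead of blindly forcing one chunk per thread.
static int64_t compute_nchunks(int64_t nrows, int nthreads, int64_t NB_COLS) {
    int64_t nchunks = std::min<int64_t>(nthreads, nrows / NB_COLS);
    return std::max<int64_t>(nchunks, 1);
}

// Chunk i covers [start, end). Because both bounds are rounded *down* to a
// multiple of NB_COLS, end(i) == start(i+1), so chunks stay disjoint.
static void chunk_bounds(int64_t nrows, int64_t nchunks, int64_t NB_COLS,
                         int64_t i, int64_t * start, int64_t * end) {
    const int64_t per = (nrows + nchunks - 1)/nchunks; // ceil division
    *start = std::min(nrows, ((i * per)/NB_COLS)*NB_COLS);
    *end   = (i == nchunks - 1) ? nrows
           : std::min(nrows, (((i + 1)*per)/NB_COLS)*NB_COLS);
}
```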
Daniel Bevenius [Mon, 3 Nov 2025 17:01:59 +0000 (18:01 +0100)]
model-conversion : pass config to from_pretrained (#16963)
This commit modifies the script `run-org-model.py` to ensure that the
model configuration is explicitly passed to the `from_pretrained` method
when loading the model. It also removes a duplicate configuration
load that was included by mistake.
The motivation for this change is that it enables the config object to be
modified and then passed to the model loading function, which can be
useful when testing new models.
Co-authored-by: Aleksander Grygier <redacted>
* webui: Moved constants to lib/constants.
* webui: package woff2 inside base64 data
* webui: LaTeX-line-break in display formula
* chore: update webui build output
* webui: Bugfix (font embedding)
* webui: Bugfix (font embedding)
* webui: vite embeds assets
* webui: don't suppress 404 (fonts)
* refactor: KaTeX integration with SCSS
Moves KaTeX styling to SCSS for better customization and font embedding.
This change includes:
- Adding `sass` as a dev dependency.
- Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding.
- Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables.
Adrian Lundberg [Sun, 2 Nov 2025 09:28:37 +0000 (10:28 +0100)]
docs: remove llama_sampler_accept reference in sampling sample usage (#16920)
commit 5fb5e24811cb01d48b482c15a974bfbd9f433e1d (llama : minor
sampling refactor (2) (#9386)) moved the llama_sampler_accept call
into llama_sampler_sample, but the sampling sample usage in llama.h
was not updated accordingly.
Pascal [Sat, 1 Nov 2025 18:49:51 +0000 (19:49 +0100)]
webui: auto-refresh /props on inference start to resync model metadata (#16784)
* webui: auto-refresh /props on inference start to resync model metadata
- Add no-cache headers to /props and /slots
- Throttle slot checks to 30s
- Prevent concurrent fetches with promise guard
- Trigger refresh from chat streaming for legacy and ModelSelector
- Show dynamic serverWarning when using cached data
* fix: restore proper legacy behavior in webui by using unified /props refresh
Updated assistant message bubbles to show each message's stored model when available,
falling back to the current server model only when the per-message value is missing.
When the model selector is disabled, the webui now fetches /props and prioritizes that
model name over chunk metadata, then persists it with the streamed message so legacy
mode properly reflects the backend configuration.
* fix: detect first valid SSE chunk and refresh server props once
* fix: removed the slots availability throttle constant and state
Pascal [Sat, 1 Nov 2025 16:14:54 +0000 (17:14 +0100)]
webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757)
* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog
Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group.
Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls
* webui: fullscreen HTML preview dialog using bits-ui
Oliver Simons [Sat, 1 Nov 2025 05:13:26 +0000 (06:13 +0100)]
CUDA: Remove unneeded bias/gate dims in fused mmvq (#16858)
* CUDA: Remove unneeded bias/gate dims in fused mmvq
It was pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target column per thread.
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <redacted>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
Oliver Simons [Thu, 30 Oct 2025 03:34:15 +0000 (04:34 +0100)]
Hide latency of bias and gate-loading (#16847)
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
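As an illustration of the pattern, a hedged sketch of issuing the bias and gate loads before the dot-product loop; all names and shapes here (x, y, bias, gate, ncols) are assumptions, not the real fused mmvq kernel:
```cuda
// Illustrative only; assumes one warp per block (blockDim.x == 32).
__global__ void fused_mmv_bias_gate(const float * __restrict__ x,
                                    const float * __restrict__ y,
                                    const float * __restrict__ bias,
                                    const float * __restrict__ gate,
                                    float * __restrict__ dst, int ncols) {
    const int row = blockIdx.x;

    // Issue these two loads *before* the dot-product loop: they are in
    // flight while the warp scheduler keeps threads busy with the MACs,
    // so their latency is effectively hidden.
    const float b = bias[row];
    const float g = gate[row];

    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum += x[row*ncols + col] * y[col];
    }

    // Warp-level reduction of the partial sums.
    for (int offset = warpSize/2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    if (threadIdx.x == 0) {
        dst[row] = (sum + b) * g; // bias and gate applied after the reduction
    }
}
```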
Jeff Bolz [Wed, 29 Oct 2025 20:13:10 +0000 (15:13 -0500)]
vulkan: Fuse rope+set_rows (#16769)
This pattern appears in a lot of models: the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping the write of the roped K value allows more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Max Krasnyansky [Wed, 29 Oct 2025 13:29:12 +0000 (06:29 -0700)]
Hexagon Op queue & dispatch optimizations (#16820)
* hexagon: remove dspqueue callbacks and do all read processing inplace
* hexagon: there is no need to ref/deref the buffers at this point
We're not going to release the buffers without flushing the session queue.
So there is no need to inc/dec the refcounts for every request.
We also don't need to include those bufs in the response.
* hexagon: bump the thread count in the adb wrapper scripts
We can use more CPU cores now that the dedicated dspqueue polling threads are no longer used (i.e. no contention).
Also enable more aggressive polling for now, since we still map Flash Attention (and a few other kernels) to
the CPU, and those dspqueue threads were keeping the CPU cores at higher clock frequencies.
Aman Gupta [Wed, 29 Oct 2025 07:55:06 +0000 (15:55 +0800)]
CUDA: Fix bug in topk-moe for gpt-oss (#16821)
* CUDA: Fix bug in topk-moe for gpt-oss
When using ggml_can_fuse_subgraph, the output nodes that are passed are wrong. This causes `test-backend-ops` to still fuse the nodes (because the nodes are not used elsewhere in the graph),
but the fusion does not actually happen in the real gpt-oss graph.