Acly [Sat, 4 Oct 2025 09:42:56 +0000 (11:42 +0200)]
vulkan : incremental shader builds (#16341)
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times
* support dep-files so shaders are recompiled if their included files change
* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled
* vulkan : only write embedded shader .hpp/.cpp when they change
* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier
* fix hang in vulkan-shaders-gen when there are compilation errors
* early out did not decrement compile_count
* clean up
* fix glslc integer dot product test
* unconditionally write the embedded shader cpp output
* replace output filepath in generated dep-files to match output in CMakeLists
Jeff Bolz [Fri, 3 Oct 2025 10:50:46 +0000 (05:50 -0500)]
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354)
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
Jeff Bolz [Fri, 3 Oct 2025 09:52:46 +0000 (04:52 -0500)]
vulkan: Fix FA coopmat1 invalid array indexing (#16365)
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
Adrien Gallouët [Wed, 1 Oct 2025 17:22:18 +0000 (19:22 +0200)]
common: introduce http.h for httplib-based client (#16373)
* common: introduce http.h for httplib-based client
This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.
It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.
model : support GLM 4.6 (make a few NextN/MTP tensors not required) (#16359)
* Make a few GLM tensors not required
layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work
* Update llama-model.cpp
layer.nextn.shared_head_norm also not required in case of future models
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.
The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.
common : simplify etag tracking by removing json (#16342)
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.
This legacy code can be removed in a future version.
Daniel Bevenius [Fri, 26 Sep 2025 15:34:42 +0000 (17:34 +0200)]
ggml : remove -dev suffix from release version (ggml/1355)
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
just be formatted as `MAJOR.MINOR.PATCH`.
ggml-backend : add root cause in error message if loading backend library fails (#16172)
This PR adds additional information to an error message when loading backend library via ld_load_library() fails. This helps spotting why backend library did not load (missing library, missing dependency or unresolved symbol etc.).
During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.
Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
Jeff Bolz [Sun, 28 Sep 2025 06:38:37 +0000 (01:38 -0500)]
vulkan: 64-bit im2col (#16135)
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
Jeff Bolz [Sun, 28 Sep 2025 01:36:34 +0000 (20:36 -0500)]
vulkan: handle mat_mul with A matrix > 4GB (#16176)
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
Jeff Bolz [Sat, 27 Sep 2025 20:43:39 +0000 (16:43 -0400)]
vulkan: support arbitrary KV dimension in flash attention (#16160)
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
Jeff Bolz [Sat, 27 Sep 2025 10:36:11 +0000 (06:36 -0400)]
vulkan: support GET_ROWS for k-quants (#16235)
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
model : make minicpm embedding_scale, residual_scale and logit_scale optional with legacy defaults (#16273)
* minicpm: make GGUF scaling keys optional with legacy defaults
Older MiniCPM GGUFs do not include the scaling metadata keys (minicpm.embedding_scale, minicpm.residual_scale, minicpm.logit_scale). The loader currently treats these as required, so quantization fails with:
key not found in model: minicpm.embedding_scale
This change restores backward compatibility by treating these keys as optional in the loader and using the older MiniCPM scaling values:
When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used. Newer GGUFs that already include these keys are unaffected.
Signed-off-by: Aaron Teo <redacted>
* devops: fix test does not exist
Signed-off-by: Aaron Teo <redacted>
* devops: fix model not found llama-eval-callback
Signed-off-by: Aaron Teo <redacted>
* Fix q3_K dot product error in test-quantize-fns on s390x
Array q8bytes had only 4 elements allocated, but 8 elements accessed.
This lead to write out of bounds and later read of overwritten values out of bounds
and incorrect result.
Signed-off-by: Aaron Teo <redacted>
* devops: re-enable ppc64le for testing
Signed-off-by: Aaron Teo <redacted>
* devops: activate test-thread-safety for s390x
Aleksander Grygier [Fri, 26 Sep 2025 16:35:42 +0000 (18:35 +0200)]
Allow viewing conversations even when llama server is down (#16255)
* webui: allow viewing conversations and sending messages even if llama-server is down
- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.
* webui: extend cached props fallback to offline errors
Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.
* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages
e.g., to restart the backend over llama-swap, even while cached /props data is in use
common : use cpp-httplib as a cURL alternative for downloads (#16185)
* vendor : update httplib
Signed-off-by: Adrien Gallouët <redacted>
* common : use cpp-httplib as a cURL alternative for downloads
The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.
Signed-off-by: Adrien Gallouët <redacted>
* ggml : Bump to Windows 10