Aman Gupta [Thu, 25 Sep 2025 14:35:05 +0000 (22:35 +0800)]
CUDA: add a fused top-K MoE kernel (#16130)
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
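The three steps listed above can be illustrated with a CPU reference for a single token. This is only a sketch of the math, not the CUDA kernel: the function name `topk_moe_ref` and its return type are hypothetical, and it assumes the weights written out are the plain softmax probabilities of the selected experts.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU reference for one token: softmax over the expert logits,
// then k rounds of argmax. Returns (expert id, softmax weight) pairs.
std::vector<std::pair<int, float>> topk_moe_ref(const std::vector<float> & logits, int k) {
    assert(!logits.empty() && k > 0);

    // numerically stable softmax: subtract the max before exponentiating
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }

    // repeated argmax: take the current best expert, then mask it out
    std::vector<std::pair<int, float>> selected;
    for (int r = 0; r < k && r < (int) logits.size(); ++r) {
        const auto it = std::max_element(probs.begin(), probs.end());
        selected.emplace_back((int) (it - probs.begin()), *it);
        *it = -1.0f; // probabilities are >= 0, so a masked entry never wins again
    }
    return selected;
}
```

A reference like this is mainly useful for checking a fused kernel's ids and weights against the unfused softmax->top-k->get_rows result.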
Daniel Bevenius [Thu, 25 Sep 2025 10:02:36 +0000 (12:02 +0200)]
model-conversion : add embedding prompt file support (#15871)
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.
The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.
This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
Daniel Bevenius [Thu, 25 Sep 2025 09:36:47 +0000 (11:36 +0200)]
server : add support for external server for tests (#16243)
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.
The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
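A minimal sketch of such a macro, assuming the early check is a plain `getenv` lookup. `log_debug_stub` is a hypothetical stand-in for GGML_LOG_DEBUG() so the example compiles on its own; whether the real code caches the lookup is not stated in the commit message.

```cpp
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for GGML_LOG_DEBUG(); counts calls so the effect is observable.
static int g_debug_lines = 0;

static void log_debug_stub(const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::vfprintf(stderr, fmt, args);
    va_end(args);
    ++g_debug_lines;
}

// Early check: the logger (and its arguments) are skipped unless RPC_DEBUG is set.
#define LOG_DBG(...)                              \
    do {                                          \
        if (std::getenv("RPC_DEBUG") != nullptr) { \
            log_debug_stub(__VA_ARGS__);          \
        }                                         \
    } while (0)
```

Because the check sits inside the macro, the format arguments are not evaluated at all when the variable is unset.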
ggml : split graph allocations according to backend max buffer size (#15815)
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
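The (chunk, offset) idea can be sketched with a simplified placement routine. This is not the gallocr code: `chunk_addr` and `place_in_chunks` are hypothetical names, alignment and freeing are ignored, and the only behaviors modeled are reusing leftover space in earlier chunks and letting a new last chunk grow beyond the maximum size.

```cpp
#include <cstddef>
#include <vector>

struct chunk_addr {
    std::size_t chunk;  // index of the chunk (one backend buffer per chunk)
    std::size_t offset; // byte offset inside that chunk
};

// Hypothetical sketch: place each request in the first chunk with enough room,
// otherwise open a new chunk. chunk_used receives the bytes used per chunk.
std::vector<chunk_addr> place_in_chunks(const std::vector<std::size_t> & sizes,
                                        std::size_t max_chunk_size,
                                        std::vector<std::size_t> & chunk_used) {
    std::vector<chunk_addr> result;
    chunk_used.clear(); // start with 0 chunks and create them as needed
    for (std::size_t size : sizes) {
        bool placed = false;
        // keep using leftover space of earlier chunks before opening a new one
        for (std::size_t c = 0; c < chunk_used.size() && !placed; ++c) {
            if (chunk_used[c] + size <= max_chunk_size) {
                result.push_back({c, chunk_used[c]});
                chunk_used[c] += size;
                placed = true;
            }
        }
        if (!placed) {
            // new last chunk; an oversized request makes it exceed the limit
            result.push_back({chunk_used.size(), 0});
            chunk_used.push_back(size);
        }
    }
    return result;
}
```

With sizes {60, 50, 30, 200} and a limit of 100, the third request lands back in chunk 0 at offset 60, and the fourth gets a chunk of its own.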
Gabe Goodhart [Mon, 22 Sep 2025 18:40:10 +0000 (12:40 -0600)]
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177)
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the c++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.
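The coercion described above can be pictured as a small decision function. Everything here is hypothetical: the function name, the per-layer flags, the expert count parameter, and the "granitehybrid" string used for the unchanged case are assumptions for illustration, not the conversion script's actual code.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: layer_is_attention[i] is true for attention layers and
// false for recurrent layers; n_experts > 0 marks a mixture-of-experts model.
std::string effective_arch(const std::vector<bool> & layer_is_attention, int n_experts) {
    for (bool is_attention : layer_is_attention) {
        if (!is_attention) {
            return "granitehybrid"; // at least one non-attention layer: keep the hybrid arch
        }
    }
    // all layers are attention: fall back to the non-hybrid architecture
    return n_experts > 0 ? "granitemoe" : "granite";
}
```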
Branch: gabe-l-hart/GraniteNonHybridConversion
Signed-off-by: Gabe Goodhart <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
Haiyue Wang [Mon, 22 Sep 2025 17:57:46 +0000 (01:57 +0800)]
clang-tidy : disable warning about performance enum size (#16127)
Disable 'performance-enum-size' checking:
Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
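What the check is pointing at can be shown with two toy enums. The enum names and enumerators below are made up for illustration and are not the real `llama_token_type` definition.

```cpp
#include <cstdint>

// No fixed underlying type: the compiler picks one, commonly 4 bytes.
enum token_type_wide { WIDE_UNDEFINED = 0, WIDE_NORMAL = 1, WIDE_CONTROL = 3 };

// Fixed 1-byte underlying type, as the clang-tidy message suggests.
enum token_type_narrow : std::uint8_t { NARROW_UNDEFINED = 0, NARROW_NORMAL = 1, NARROW_CONTROL = 3 };

static_assert(sizeof(token_type_narrow) == 1, "uint8_t-based enum occupies one byte");
static_assert(sizeof(token_type_wide) >= sizeof(token_type_narrow), "default enum is at least as large");
```

Fixing the underlying type changes the size and layout of any struct that stores the enum, which is one reason a project might prefer to silence the check rather than follow it.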
Daniel Bevenius [Tue, 16 Sep 2025 04:16:52 +0000 (06:16 +0200)]
ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This triggers the
GitHub release workflow, which runs for new tags and publishes a
new release on GitHub.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
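The version arithmetic in the dry-run output can be sketched as follows. This is not the release script: the type and function names are hypothetical, and only the major case (0.9.0-dev to 1.0.0, then 1.1.0-dev) is confirmed by the output above; the minor and patch branches assume plain semantic-versioning increments.

```cpp
#include <cassert>
#include <string>

struct ggml_version {
    int major_v;
    int minor_v;
    int patch_v;
};

// Hypothetical sketch of the bump step for "major", "minor" or "patch".
ggml_version bump(ggml_version v, const std::string & kind) {
    if (kind == "major") {
        return {v.major_v + 1, 0, 0};
    }
    if (kind == "minor") {
        return {v.major_v, v.minor_v + 1, 0}; // assumption, not shown in the dry run
    }
    assert(kind == "patch");
    return {v.major_v, v.minor_v, v.patch_v + 1}; // assumption, not shown in the dry run
}

// After a release, development continues on the next minor version.
ggml_version next_dev(ggml_version released) {
    return {released.major_v, released.minor_v + 1, 0};
}

std::string format_version(const ggml_version & v, bool dev) {
    std::string s = std::to_string(v.major_v) + "." + std::to_string(v.minor_v) + "." +
                    std::to_string(v.patch_v);
    return dev ? s + "-dev" : s;
}
```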
llama-bench: add --devices and --list-devices support (#16039)
* llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience
* fix: field display ordering restored
* fix: integrated the rpc devices
- aimed to mimic the server as much as possible
* cleanup: defaults for list-devices
- handle dup device listing with RPC
* cleanup: remove dup device load calls
* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there
* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
cmake : fix static linking for OpenMP on Unix-like systems (#16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build.
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Eric Curtin [Thu, 18 Sep 2025 15:22:50 +0000 (16:22 +0100)]
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads
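The resume decision in the list above can be sketched as a small planning function. `resume_plan` and `plan_download` are hypothetical names; the real logic lives in common_download_file_single and is not reproduced here. The sketch assumes the caller has already measured the partial file and inspected the Accept-Ranges header.

```cpp
#include <cstdint>
#include <string>

struct resume_plan {
    bool        resume;       // true: continue the partial file
    std::string range_header; // empty when starting from scratch
    std::string fopen_mode;   // "ab" appends, "wb" creates or truncates
};

// Hypothetical sketch: partial_bytes is the size of the .downloadInProgress file (0 if absent),
// server_accepts_ranges reflects an "Accept-Ranges: bytes" response header.
resume_plan plan_download(std::uint64_t partial_bytes, bool server_accepts_ranges) {
    if (partial_bytes > 0 && server_accepts_ranges) {
        // ask only for the bytes that are still missing
        return {true, "Range: bytes=" + std::to_string(partial_bytes) + "-", "ab"};
    }
    // no partial file, or the server cannot resume: download everything again
    return {false, "", "wb"};
}
```

For a 1024-byte partial file on a server that supports ranges, this yields the header `Range: bytes=1024-` and append mode.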
server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them
When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true}} is set in the request when
deciding whether we should send usage statistics.
common : Fix corrupted memory error on json grammar initialization (#16038)
Initializing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads
as can be seen in the asan trace below. This fixes the initialization
to make it thread-safe.
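One common way to get thread-safe one-time initialization in C++11 and later is a function-local static, sketched below. This may differ from the actual fix, and the set contents are placeholders rather than the real reserved-name list.

```cpp
#include <string>
#include <unordered_set>

bool is_reserved_name(const std::string & name) {
    // A function-local static is initialized exactly once, even when several
    // threads reach this line concurrently (guaranteed since C++11).
    static const std::unordered_set<std::string> reserved = [] {
        std::unordered_set<std::string> s;
        s.insert("root");    // placeholder entry
        s.insert("boolean"); // placeholder entry
        return s;
    }();
    // After initialization the set is only read, which is safe from multiple threads.
    return reserved.count(name) > 0;
}
```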
Daniel Bevenius [Wed, 17 Sep 2025 07:34:09 +0000 (09:34 +0200)]
ci : revert back to macos-13 for macOS-latest-cmake-x64 (#16040)
This commit reverts the change of the runs-on parameter for the
macOS-latest-cmake-x64 job back to macos-13 that was made in
commit 51abc96bdc52ba8cd6ad78dcf12ed9a041d7b442 ("ci : update
macos-latest* jobs to use macos-latest (#15938)").
The motivation for this is that using macos-latest will cause an ARM
based runner to be used, and not an x64 based runner.
Chenguang Li [Wed, 17 Sep 2025 06:33:08 +0000 (14:33 +0800)]
CANN: Optimize ggml_cann_set_device (#15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
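The skip-if-unchanged check can be sketched with a per-thread cache. This is an illustration only: `fake_set_device` is a stand-in for aclrtSetDevice so the example runs without the CANN runtime, and whether the real code caches the device or queries it is not stated in the commit message.

```cpp
// Stand-in for aclrtSetDevice(); records calls so redundant switches are visible.
static int g_switch_calls = 0;
static int g_last_device  = -1;

static void fake_set_device(int device) {
    ++g_switch_calls;
    g_last_device = device;
}

// Hypothetical sketch: remember the device per thread and skip redundant switches.
void set_device_cached(int device) {
    thread_local int current_device = -1; // each thread tracks its own current device
    if (device == current_device) {
        return; // already active on this thread, nothing to do
    }
    fake_set_device(device);
    current_device = device;
}
```

Calling `set_device_cached(0)` twice and then `set_device_cached(1)` performs two switches instead of three.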
Daniel Bevenius [Tue, 16 Sep 2025 13:27:52 +0000 (15:27 +0200)]
ci : use macos-latest for arm64 webgpu build (#16029)
This commit updates the runs-on field for the macOS arm64 webgpu build
job to use macos-latest instead of just latest.
The motivation for this is that this job can wait for a runner to pick
up the job for a very long time, sometimes over 7 hours. This is an
attempt to see if this change can help reduce the wait time.
Daniel Bevenius [Tue, 16 Sep 2025 13:25:57 +0000 (15:25 +0200)]
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
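As a reading aid, here is a CPU sketch of a sinusoidal timestep embedding whose output has exactly `dim` values, with the single leftover slot zero-filled when `dim` is odd. The function name is hypothetical, and the layout (cosines in the first half, sines in the second) is an assumption about the kernels rather than something stated in the commit message.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical reference: returns dim values, never dim + 1.
std::vector<float> timestep_embedding_ref(float timestep, int dim, int max_period) {
    assert(dim >= 2 && max_period > 0);
    const int half = dim / 2;
    std::vector<float> out(dim, 0.0f);
    for (int j = 0; j < half; ++j) {
        const float freq = std::exp(-std::log((float) max_period) * j / half);
        const float arg  = timestep * freq;
        out[j]        = std::cos(arg);
        out[j + half] = std::sin(arg);
    }
    // For odd dim, index 2 * half == dim - 1 is never written above,
    // so it keeps its zero initial value instead of holding uninitialized data.
    return out;
}
```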
Daniel Bevenius [Tue, 16 Sep 2025 11:41:38 +0000 (13:41 +0200)]
ci : upload xcframework artifact from ios-xcode-build job (#16010)
This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.
The motivation for this change is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
Bowen Han [Tue, 16 Sep 2025 06:59:19 +0000 (23:59 -0700)]
fix: apply clang-format to CUDA macros (#16017)
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:
template<int D, int ncols, int nwarps, int VKQ_stride,
typename KQ_acc_t, bool use_logit_softcap>
__launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)
This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
Daniel Bevenius [Tue, 16 Sep 2025 03:57:16 +0000 (05:57 +0200)]
ci : update macos-latest* jobs to use macos-latest (#15938)
* ci : update macos-latest* jobs to use macos-latest
This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead of explicit versions.
The motivation for this is that there is currently a mixture of
versions in this workflow and there are jobs that are failing because
they require a newer version.