Daniel Bevenius [Tue, 23 Dec 2025 13:07:25 +0000 (14:07 +0100)]
model-conversion : add device option to run-org-model.py (#18318)
* model-conversion : add device option to run-org-model.py
This commit refactors the `run-org-model.py` script to include a
`--device` argument, to allow users to specify the device on which to
run the model (e.g., cpu, cuda, mps, auto).
It also extracts a few common functions to prepare for future changes
that will remove code duplication which currently exists in the
embedding scripts.
The Makefile has also been updated to pass the device argument, for
example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```
* fix error handling and remove parser reference
This commit fixes the error handling which previously referenced an
undefined 'parser' variable.
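As a rough illustration of the `--device` option described above (not the script's actual code), the argument could be wired up with `argparse` along these lines. The `auto` default and the `parse_args` helper name are assumptions made for this sketch.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Run the original model (sketch)")
    # Assumed default: let the loader pick a device unless told otherwise.
    parser.add_argument(
        "--device",
        default="auto",
        help="device to run the model on, e.g. cpu, cuda, mps, auto",
    )
    return parser.parse_args(argv)


# Demo with an explicit argument list instead of the real command line.
args = parse_args(["--device", "cpu"])
print(f"Using device: {args.device}")
```

Keeping the parser local to one function and returning only the parsed arguments is one way to avoid the kind of stray `parser` reference that the second change fixes.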
Daniel Bevenius [Tue, 23 Dec 2025 06:27:37 +0000 (07:27 +0100)]
model-conversion : add trust_remote_code for embedding scripts (#18288)
This commit adds the trust_remote_code=True parameter when loading
models and configurations in the embedding model conversion scripts.
It also adds a cast to float for models that might use a data type that
is not supported by Python, for example bfloat16.
The motivation for this is that some models may require custom code to
be executed during loading, and setting trust_remote_code to True avoids
getting prompted for confirmation.
Future work will consolidate the embedding conversion scripts with the
causal conversion scripts to avoid code duplication. But in the
meantime it would be nice to have this fix in place.
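To show why a cast to float is needed for bfloat16, the sketch below widens a raw bfloat16 bit pattern to a regular Python float using only the standard library. It relies on the fact that bfloat16 is the upper 16 bits of an IEEE 754 float32. The helper name `bf16_bits_to_float` is hypothetical and is not part of the conversion scripts, which presumably cast at the tensor level instead.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen a 16-bit bfloat16 pattern to a Python float.

    bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa
    bits of a float32, so shifting left by 16 yields a valid float32.
    """
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


print(bf16_bits_to_float(0x3F80))  # bit pattern of 1.0
print(bf16_bits_to_float(0x4049))  # pi truncated to bfloat16 precision
```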
Ryan Mangeno [Mon, 22 Dec 2025 23:28:19 +0000 (18:28 -0500)]
model : Granite Embedding support (#15641)
ModernBERT but without `head.norm`, so converting and running any other ModernBERT models will currently fail. PRs with `head.norm` support welcome!
* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only
* conversion now working, hf -> gguf
* working on support, now working on building graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on building graph
* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more
* added cls token per previous modern bert attempt, still working on checking out the rest
* fixed pre tokenizer and still working through previous pr
* working through previous attempt, implemented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer
* fixed pre tokenizer
* working on swa with local and global alternating attention
* some cleanup and now fails on build attn
* starting to work, and some cleanup, currently failing on last layer construction in graph build
* alternating rope implemented and modern bert graph build succeeds
* fixed assert for equal ubatch seq
* cleanup
* added mask check in vocab
* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values
* reuse variable
* removed repeat
* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL
* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...
* more modular hparam setting
* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion
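The cosine similarity figures quoted in the last entry compare an embedding from the HF reference model with the one llama.cpp produces. A minimal pure-Python sketch of that metric follows. The function name and the sample vectors are illustrative only and do not come from the PR.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for the HF and llama.cpp outputs.
hf_embedding = [0.12, -0.40, 0.33, 0.05]
llama_embedding = [0.10, -0.38, 0.35, 0.07]
print(cosine_similarity(hf_embedding, llama_embedding))
```

A value close to 1.0 means the two implementations agree in direction, so the values reported above (0.05, then 0.24) point to a graph that was still far from matching the reference at that stage.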
Jeff Bolz [Sun, 21 Dec 2025 20:52:09 +0000 (14:52 -0600)]
vulkan: Implement set_tensor_async and the event interfaces (#18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
Jeff Bolz [Sun, 21 Dec 2025 09:27:34 +0000 (03:27 -0600)]
vulkan/cuda: fix topk_moe with exp_probs_b (#18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
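The argsort/get_rows mismatch can be sketched in plain Python. This assumes that the expert bias (`exp_probs_b`) only influences which experts are selected, while the routing weights are gathered from the unbiased probabilities. That reading matches the description above but is an assumption about the graph, and `select_experts` is a hypothetical helper rather than a ggml function.

```python
def select_experts(probs, exp_probs_b, k):
    """Pick the top-k experts by biased score, gather unbiased weights."""
    # Input to the argsort: probabilities plus the per-expert bias.
    biased = [p + b for p, b in zip(probs, exp_probs_b)]
    order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    selected = order[:k]
    # Input to the row gather: the original, unbiased probabilities.
    weights = [probs[i] for i in selected]
    return selected, weights


selected, weights = select_experts(
    probs=[0.5, 0.3, 0.2],
    exp_probs_b=[0.0, 0.0, 0.4],
    k=2,
)
print(selected, weights)
```

A fused kernel that reads the weights from the same tensor it sorted would return the biased values here, which is the kind of wrong assumption the fix addresses.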
Pascal [Fri, 19 Dec 2025 17:01:56 +0000 (18:01 +0100)]
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
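The simplified ordering check from the review feedback can be expressed in a few lines. The sketch below is a Python rendering of the `first.length() <= last.length()` idea, not the actual test code, and the option names are made up for illustration.

```python
def short_form_first(names):
    """Return True when the first name is no longer than the last one.

    Only the first and last entries are compared, so the middle entries
    of a set with three or more names are not verified. Assumes `names`
    is non-empty.
    """
    return len(names[0]) <= len(names[-1])


# Hypothetical option sets, used only to exercise the check.
print(short_form_first(["-x", "--example"]))                # short form first
print(short_form_first(["--example", "-x"]))                # wrong order
print(short_form_first(["-x", "--very-long-name", "--ex"])) # middle unchecked
```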
Daniel Bevenius [Fri, 19 Dec 2025 07:43:16 +0000 (08:43 +0100)]
model-conversion : add verbose flag in run-org-model.py (#18194)
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.
The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.
The script will also be further cleaned/refactored in follow-up commits.
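A minimal sketch of how such a flag might gate debug output is shown below. The `debug_summary` helper is hypothetical and much simpler than the utilities moved to utils/common.py. It only illustrates the on/off behaviour of `--verbose`.

```python
import argparse


def parse_args(argv=None):
    """Parse options; --verbose is off unless given on the command line."""
    parser = argparse.ArgumentParser(description="Verbose flag sketch")
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="print detailed debug output such as per-layer tensors",
    )
    return parser.parse_args(argv)


def debug_summary(name, values, verbose):
    """Return a short summary line, or None when verbose output is off."""
    if not verbose:
        return None
    preview = ", ".join(f"{v:.4f}" for v in values[:3])
    return f"{name}: count={len(values)} first=[{preview}]"


# Demo with an explicit argument list instead of the real command line.
args = parse_args(["--verbose"])
line = debug_summary("layer0.output", [0.1, 0.2, 0.3, 0.4], args.verbose)
if line is not None:
    print(line)
```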
Jeff Bolz [Fri, 19 Dec 2025 05:36:46 +0000 (23:36 -0600)]
vulkan: Add perf logger mode with concurrency (#17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.
GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).
GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
Pascal [Wed, 17 Dec 2025 20:45:45 +0000 (21:45 +0100)]
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support
Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.
Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes
Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls
* nit: extract app name into a constant value; remove unused onBackPressed callbacks
* UI: update AppContent to pass in correct navigation callbacks
* nit: polish ModelLoadingScreen UI
* core: throw Exception instead of returning null if model fails to load
* navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens
* gguf: add GGUF metadata data holder and its corresponding extractor implementation
* core: swap out hardcoded LlamaAndroid library loading
* core: add back OpenMP due to huge perf loss on TG128
* misc: reorg the pkg structure
* misc: rename LlamaAndroid related class to InferenceEngine prefixes
* [WIP] lib: move GgufMetadata into the lib submodule
* lib: expose GgufMetadataReader as interface only
* lib: replace the naive & plain SharedPreferences with DataStore implementation
* lib: hide the internal implementations, only expose a facade and interfaces
* lib: expose Arm features
* di: add a stub TierDetection; provide both actual impl and stub in AppModule
* UI: add visualizer UI for Arm features
* misc: UI polish
* lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier
* UI: support `NONE` Llama Tier in general settings
* lib: optimize engine loader; always perform a fresh detection when cache is null
* remote: add HuggingFaceModelDetails data class
* remote: refine HuggingFaceModel data class
* nit: remove `trendingScore` field from HuggingFace model entities, weird...
* remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource
* remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response
* UI: scaffold Models Management screen and view model
* UI: implement a dialog UI to show fetched HuggingFace models.
* UI: use a broadcast receiver to listen for download complete events and show local import dialog.
* data: handle network exceptions elegantly
* pkg: restructure `data`'s packages
* data: extract local file info, copy and cleanup logics into LocalFileDataSource
* nit: minor UI patch; add missing comments
* bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action.
* UI: improve autoscroll during token generation
* lib: tested on JFrog Artifactory for Maven publishing
* UI: show RAM warning if model too large
* UI: polish model management screen's error dialog
* util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code
* llm: properly propagate error to UI upon failing to load selected model
* UI: avoid duplicated calculation of token metrics
* lib: read & validate the magic number from the picked source file before executing the import
* UI: add "Learn More" hyperlinks to Error dialog upon model import failures
* lib: refactor the GgufMetadataReader to take InputStream instead of absolute path as argument
* lib: fix the `SIMD` typo in Tier description
* core: verify model file path is readable
* lib: add UnsupportedArchitectureException for triaged error message
* util: split FormatUtils into multiple utils for better readability
* UI: change benchmark screen from raw markdown to table view
* bugfix: reset preselection upon running the preselected model
* misc: linter issue
* bugfix: fix the malfunctioning monitoring switch
* UI: update Arm features indicator; fix the broken hyperlinks
* UI: add quick action buttons to benchmark screen's result card
* UI: hide share fab after clearing all benchmark results
* UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen;
* UI: hide the stubbing actions in Conversation screen
* UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder
* UI: add an info button to explain token metrics
* misc: remove the redundant `Companion` added due to refactoring
* UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator
* UI: add info button to System Prompt switch; expand the model card by default
* UI: disable tag & language chips; add section headers to explain what they are
* misc: replace top bar indicator's spacer with padding
* UI: merge the Model Selection and Model Management into a unified Models screen
* UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity
* UI: add model loading in progress view; polish the empty model info view
* UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models
* build: [BREAKING] bump the versions of libraries and plugins
* UI: fix the breaking build
* UI: add Tooltip on Import FAB for user onboarding
* UI: adds AppPreferences to track user onboarding status
* UI: tracks user's first success on importing a model
* data: add hand crafted rules to filter the models fetched from HuggingFace API
* UI: update app name & about; polish top bars' indicators & buttons
* UI: polish Hugging Face download dialog UI
* UX: implement onboarding tooltips for model import and onboarding
* misc: use sentence case for CTA button labels
* [WIP] UI: add Arm color palette from Philip.Watson3
* UI: address Rojin's UX feedback
* UI: address Rojin's UX feedback - part 2
* UI: update Arm color palette from Philip.Watson3
* data: make sure to fetch preselected models in the same order as their IDs
* UI: fix UI issues in the generic settings screen and navigation drawer
* nit: address Rojin's feedback on model import message again
* nit: append `®` to all `Arm` labels
* UI: extract a reusable InfoAlertDialog
* core: support GGML_CPU_ALL_VARIANTS on Android!
* core: restructure Kleidi-Llama library
* core: organizing cmake arguments
* data: sort preselected models according to device's available RAM
* app: update adaptive + themed + legacy icons and app name
* UI: fix the font size auto scaling for ArmFeaturesVisualizer
* core: further improve the performance on native methods
* UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label
* UI: make more room for assistant message bubble's width
* UI: better usage of tertiary colors to highlight model cards but not for warnings
* UI: fix the layout issue on large font sizes
* lib: support x86-64 by dynamically setting Arm-related definitions
* lib: replace the factory pattern for deprecated tiered lib loading with single instance pattern
* llama: update the library name in JNI and CMake project
* llama: update the library's package name and namespace
* llama: update the app's package name and namespace
* app: bump ksp version
* app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge
* app: extract AppContent from MainActivity to a separate file in ui package
* lib: add File version for GGUF Magic number verification
* lib: perform engine state check inclusively instead of exclusively
* lib: change `LlamaTier` to `ArmCpuTier`
* lib: remove kleidi-llama related namings
* cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample
Note: the full Google Play version of the AI Chat app will be open sourced in another repo soon, therefore I didn't go through the trouble of pruning the history using `git filter-repo` here.
* [WIP] doc: update main and Android README docs; add self to code owners
* lib: revert System.load back to System.loadLibrary
* jni: introduce a logging util to filter different logging levels on different build types
* lib: enable app optimization
* doc: replace stub Google Play app URL with the actual link; add screenshots; add my GitHub ID to maintainer list
TrevorS [Wed, 17 Dec 2025 06:33:02 +0000 (22:33 -0800)]
arg: allow -kvu flag for llama-perplexity (#18117)
The -kvu (--kv-unified) flag is required for hellaswag and winogrande
benchmarks which use coupled sequences. Without unified KV cache,
these benchmarks fail with:
split_equal: sequential split is not supported when there are
coupled sequences in the input batch (you may need to use the -kvu flag)
This change adds LLAMA_EXAMPLE_PERPLEXITY to the allowed examples for
the -kvu argument, enabling its use with llama-perplexity.
yifant-code [Tue, 16 Dec 2025 12:27:36 +0000 (07:27 -0500)]
server: fix crash when batch > ubatch with embeddings (#17912)
* server: fix crash when batch > ubatch with embeddings (#12836)
Fixes #12836 where the server crashes with GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.
Root cause: Embeddings use non-causal attention which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing, causing assertion failure.
Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash
This follows the approach suggested by @ggerganov in issue #12836.
Note: This supersedes stalled PR #12940 which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (current location).
Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model
* Update tools/server/server.cpp
---------
Co-authored-by: ytian218 <redacted>
Co-authored-by: Georgi Gerganov <redacted>
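The startup validation described in this commit can be summarised with a small Python sketch. The real change lives in C++ in tools/server/server.cpp. The function name and the warning text below are assumptions made for illustration.

```python
def validate_batch_params(embedding, n_batch, n_ubatch):
    """Clamp n_batch to n_ubatch when embeddings are enabled.

    Returns the (possibly adjusted) n_batch and a list of warnings.
    """
    warnings = []
    if embedding and n_batch > n_ubatch:
        # Non-causal attention needs all tokens in a single ubatch.
        warnings.append(
            f"embeddings enabled with n_batch ({n_batch}) > n_ubatch "
            f"({n_ubatch}); setting n_batch = n_ubatch"
        )
        n_batch = n_ubatch
    return n_batch, warnings


n_batch, warnings = validate_batch_params(embedding=True, n_batch=2048, n_ubatch=512)
for message in warnings:
    print("warning:", message)
print("effective n_batch:", n_batch)
```

Valid combinations pass through unchanged and produce no warnings, matching the "no false positives" item in the testing notes.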
Daniel Bevenius [Tue, 16 Dec 2025 10:17:40 +0000 (11:17 +0100)]
model-conversion : add note about verifying previous models (#18082)
This commit adds a note to the README in the model-conversion
examples, advising developers to verify that previous versions of models
pass logits verification before adding new models from the same family.