git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Dan Johansson [Tue, 25 Mar 2025 09:35:20 +0000 (10:35 +0100)]
docs : add build instructions for KleidiAI (#12563)
Signed-off-by: Dan Johansson <redacted>
R0CKSTAR [Tue, 25 Mar 2025 07:45:08 +0000 (15:45 +0800)]
ci: [MUSA] add CI and update doc (#12562)
Signed-off-by: Xiaodong Ye <redacted>
Georgi Gerganov [Tue, 25 Mar 2025 07:19:23 +0000 (09:19 +0200)]
context : fix worst-case reserve outputs (#12545)
ggml-ci
Akarshan Biswas [Mon, 24 Mar 2025 17:35:38 +0000 (23:05 +0530)]
ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547)
lhez [Mon, 24 Mar 2025 16:20:47 +0000 (09:20 -0700)]
opencl: simplify kernel embedding logic in cmakefile (#12503)
Co-authored-by: Max Krasnyansky <redacted>
Akarshan Biswas [Mon, 24 Mar 2025 12:58:32 +0000 (18:28 +0530)]
CI: fix SYCL build (#12546)
Tei Home [Mon, 24 Mar 2025 11:02:26 +0000 (19:02 +0800)]
docs: update: improve the Fedora CUDA guide (#12536)
* docs: update fedora-cuda guide
- Rename and place into Backend Folder.
- Update Host-Supplied Packages.
- Expand Recommended Users Section.
* docs: improve the flow of CUDA-FEDORA.md
compilade [Mon, 24 Mar 2025 10:47:24 +0000 (06:47 -0400)]
llama-vocab : add SuperBPE pre-tokenizer (#12532)
R0CKSTAR [Mon, 24 Mar 2025 10:28:34 +0000 (18:28 +0800)]
CUDA: Fix clang warnings (#12540)
Signed-off-by: Xiaodong Ye <redacted>
Prajwal B Mehendarkar [Mon, 24 Mar 2025 10:17:10 +0000 (15:47 +0530)]
mmap : skip resource limit checks on AIX (#12541)
Jeff Bolz [Mon, 24 Mar 2025 06:56:17 +0000 (01:56 -0500)]
vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
Marius Gerdes [Sun, 23 Mar 2025 18:30:26 +0000 (19:30 +0100)]
server : Add verbose output to OAI compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
Lars Sonchocky-Helldorf [Sun, 23 Mar 2025 08:21:48 +0000 (09:21 +0100)]
install : add macports (#12518)
MacPorts section added
Xuan-Son Nguyen [Sat, 22 Mar 2025 22:28:19 +0000 (23:28 +0100)]
llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight
* also add to the llm_tensor_names
Georgi Gerganov [Sat, 22 Mar 2025 14:23:26 +0000 (16:23 +0200)]
ggml : fix quantized cpy op (#12310)
* ggml : fix quantized cpy op
ggml-ci
* tests : add cpy tests for all types
ggml-ci
* tests : add BF16 copy tests
ggml-ci
* tests : fix loop for same-type copy
ggml-ci
* tests : add option to permute the dst tensor
ggml-ci
R0CKSTAR [Sat, 22 Mar 2025 09:11:37 +0000 (17:11 +0800)]
musa: refine compute capability (#12493)
* musa: refine compute capability
Signed-off-by: Xiaodong Ye <redacted>
* Address review comments
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Jeff Bolz [Sat, 22 Mar 2025 08:40:11 +0000 (03:40 -0500)]
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
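The load-reuse idea described above can be sketched in plain Python. This is only an illustration of the access pattern, not the shader code: `grouped_mat_vec`, `a_rows`, and `q_group` are hypothetical names, and the real shader works on GPU workgroups rather than Python lists.

```python
def grouped_mat_vec(a_rows, q_group):
    """Multiply one shared A matrix by every query vector in a GQA group.

    a_rows:  rows of the A matrix shared by the whole group (hypothetical layout)
    q_group: query vectors of the heads that share this A matrix
    """
    out = [[0.0] * len(a_rows) for _ in q_group]
    for r, row in enumerate(a_rows):
        for k, a in enumerate(row):
            # Each element of A is read once and reused for every head in
            # the group, instead of being re-read once per head.
            for h, q in enumerate(q_group):
                out[h][r] += a * q[k]
    return out


result = grouped_mat_vec([[1, 2], [3, 4]], [[1, 0], [0, 1]])
```

With a group of G heads, the number of reads from A drops by roughly a factor of G compared with running one mat-vec per head.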
stduhpf [Fri, 21 Mar 2025 19:34:50 +0000 (20:34 +0100)]
Vulkan: RTE rounding for cpy to quant (#12480)
* Vulkan: RTE rounding for cpy to quant
Co-Authored-By: Jeff Bolz <redacted>
* remove trailing whitespace
* avoid duplicating pipeline_cpy_f32_quant
* fix copypasting issue
* remove duplicated code
---------
Co-authored-by: Jeff Bolz <redacted>
Eve [Fri, 21 Mar 2025 19:27:47 +0000 (19:27 +0000)]
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472)
Georgi Gerganov [Fri, 21 Mar 2025 14:14:29 +0000 (16:14 +0200)]
model : do not repack if a GPU device is present (#12498)
ggml-ci
Sigbjørn Skjæret [Fri, 21 Mar 2025 09:21:36 +0000 (10:21 +0100)]
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
marcoStocchi [Fri, 21 Mar 2025 09:12:45 +0000 (10:12 +0100)]
llama-tts : avoid crashes related to bad model file paths (#12482)
蕭澧邦 [Fri, 21 Mar 2025 06:58:47 +0000 (14:58 +0800)]
[SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
* [SYCL] Fix build on Windows when ccache enabled (#9954)
* take effect only on windows and force it to icl
---------
Co-authored-by: Romain Biessy <redacted>
Svetlozar Georgiev [Fri, 21 Mar 2025 02:15:56 +0000 (02:15 +0000)]
sycl: cleanup oneDNN related code (#12097)
Woof Dog [Thu, 20 Mar 2025 14:57:43 +0000 (14:57 +0000)]
webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag
* Update index.html.gz
* use signal-style implementation
* rm console log
* no duplicated savedInitValue set
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Sigbjørn Skjæret [Thu, 20 Mar 2025 11:49:59 +0000 (12:49 +0100)]
llama : make Qwen2MoE QKV bias optional (#12477)
Srihari-mcw [Thu, 20 Mar 2025 11:35:34 +0000 (17:05 +0530)]
ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)
* Add block interleaving support for Q4_K quantization
* Remove whitespaces and fix CI/CD issues
* Update pointer of bsums from int16_t to const int16_t
* Add vector version of quantize_q8_K_4x8 function
* Update code formatting based on review comments
Bartowski [Thu, 20 Mar 2025 06:36:37 +0000 (02:36 -0400)]
convert : avoid calls to tokenizer.added_tokens_decoder (#12473)
tokenizer.added_tokens_decoder returns a fresh dict on every access and is relatively slow (~0.04s on average), which results in massive slowdowns when there is a huge number of added tokens
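A minimal sketch of why repeated access is costly, assuming a stand-in tokenizer class. `SlowTokenizer`, `collect_slow`, and `collect_fast` are hypothetical and not part of the conversion script; the commit itself may avoid the property in a different way.

```python
class SlowTokenizer:
    """Hypothetical stand-in: the property rebuilds its dict on every access."""

    def __init__(self, n_added):
        self._n_added = n_added
        self.decoder_builds = 0  # counts how often the dict is rebuilt

    @property
    def added_tokens_decoder(self):
        self.decoder_builds += 1
        return {i: f"<extra_{i}>" for i in range(self._n_added)}


def collect_slow(tok, token_ids):
    # Touches the property once per token: one dict build per iteration.
    return [tok.added_tokens_decoder[i] for i in token_ids]


def collect_fast(tok, token_ids):
    # Reads the property once and reuses the result: one dict build in total.
    decoder = tok.added_tokens_decoder
    return [decoder[i] for i in token_ids]
```

With N added tokens the slow variant does N dict builds of size N, so the cost grows quadratically, while the hoisted variant stays linear.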
fairydreaming [Wed, 19 Mar 2025 20:01:57 +0000 (21:01 +0100)]
context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <redacted>
Gaurav Garg [Wed, 19 Mar 2025 19:52:06 +0000 (01:22 +0530)]
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Fixes Issue: #12182
---------
Co-authored-by: Johannes Gäßler <redacted>
Jeff Bolz [Wed, 19 Mar 2025 18:56:23 +0000 (13:56 -0500)]
vulkan: optimize iq1 coopmat2 dequant functions (#12427)
Guus Waals [Wed, 19 Mar 2025 10:15:23 +0000 (10:15 +0000)]
Fix visionOS build and add CI (#12415)
* ci: add visionOS build workflow
Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.
* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs
* ci: remove define hacks for u_xxx system types
---------
Co-authored-by: Giovanni Petrantoni <redacted>
Sigbjørn Skjæret [Wed, 19 Mar 2025 08:08:49 +0000 (09:08 +0100)]
llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
* Add support for GPT2, Bloom and CodeShell tied word embeddings
* Deduplicate tied word embeddings weights
* Workaround for incorrect weight map
It appears transformer.wte.weight is in the weight map even though the weights are not there, remove it if output weights are encountered first.
* check++
* fatfingers--
Sigbjørn Skjæret [Wed, 19 Mar 2025 07:58:13 +0000 (08:58 +0100)]
convert : Support chat_template.json (#12460)
Jeff Bolz [Wed, 19 Mar 2025 07:26:26 +0000 (02:26 -0500)]
vulkan: Submit once enough matmul work has been recorded (#12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
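The byte-based heuristic can be sketched roughly as follows. Only the 100MB figure and the idea of ramping up over the first few submits come from the message above; the class name, the linear ramp, and the `ramp_submits` parameter are assumptions made for illustration.

```python
class SubmitHeuristic:
    """Decide when to flush recorded matmul work, based on weight bytes."""

    def __init__(self, full_threshold=100 * 1024 * 1024, ramp_submits=3):
        self.full_threshold = full_threshold  # flush roughly every 100MB
        self.ramp_submits = ramp_submits      # assumed length of the ramp
        self.submit_count = 0
        self.pending_bytes = 0

    def current_threshold(self):
        # Assumed linear ramp: early submits use a smaller threshold so the
        # GPU gets work quickly, then the full threshold applies.
        if self.submit_count >= self.ramp_submits:
            return self.full_threshold
        return self.full_threshold * (self.submit_count + 1) // (self.ramp_submits + 1)

    def record_matmul(self, weight_bytes):
        """Record one matmul; return True if the work should be submitted now."""
        self.pending_bytes += weight_bytes
        if self.pending_bytes >= self.current_threshold():
            self.pending_bytes = 0
            self.submit_count += 1
            return True
        return False
```

Counting weight bytes rather than nodes ties the flush point to the amount of memory traffic recorded, which is what dominates token generation.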
lhez [Tue, 18 Mar 2025 19:54:55 +0000 (12:54 -0700)]
opencl: improve profiling (#12442)
* opencl: more profiling timing
* opencl: generate trace for profiling
* opencl: reduce profiling overhead
* Populate profiling timing info at the end rather than after each
kernel run
* opencl: fix for chrome tracing
Georgi Gerganov [Tue, 18 Mar 2025 19:35:19 +0000 (21:35 +0200)]
graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
R0CKSTAR [Tue, 18 Mar 2025 18:28:26 +0000 (02:28 +0800)]
musa: override warp_size of musa device to 32 (#12445)
Signed-off-by: Xiaodong Ye <redacted>
Xuan-Son Nguyen [Tue, 18 Mar 2025 18:16:19 +0000 (19:16 +0100)]
llama : support converting Mistral Small text-only (#12450)
Georgi Gerganov [Tue, 18 Mar 2025 17:35:11 +0000 (19:35 +0200)]
speculative : fix seg fault in certain cases (#12454)
Xuan-Son Nguyen [Tue, 18 Mar 2025 16:24:33 +0000 (17:24 +0100)]
llama : add support for EXAONE tied word embeddings (#12451)
Georgi Gerganov [Tue, 18 Mar 2025 11:05:49 +0000 (13:05 +0200)]
context : always use non-causal attention for encoder graphs (#12447)
* context : always use non-causal attention for encoder graphs
ggml-ci
* context : move the change to llama_context::encode()
ggml-ci
Łukasz Ślusarczyk [Tue, 18 Mar 2025 10:16:31 +0000 (11:16 +0100)]
SYCL: using graphs is configurable by environment variable and compile option (#12371)
* alberto changes
* enable sycl graphs by env variable
* fixed compilation warnings in ggml-sycl.cpp
* renamed graph variables
* fix markdown in docs/backend/SYCL.md
Co-authored-by: Romain Biessy <redacted>
* fix markdown in docs/backend/SYCL.md again
* compiling graphs by default, renamed graph_enable to graph_disable
---------
Co-authored-by: Romain Biessy <redacted>
Georgi Gerganov [Tue, 18 Mar 2025 10:05:42 +0000 (12:05 +0200)]
server : fix warmup draft cache type (#12446)
ggml-ci
Prajwal B Mehendarkar [Tue, 18 Mar 2025 09:37:33 +0000 (15:07 +0530)]
cmake : fix PowerPC build (#12241)
Closes #12240
fj-y-saito [Tue, 18 Mar 2025 08:14:39 +0000 (17:14 +0900)]
ggml : add SVE support for q6_K_q8_K (#12361)
0cc4m [Tue, 18 Mar 2025 06:21:40 +0000 (07:21 +0100)]
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434)
Łukasz Ślusarczyk [Tue, 18 Mar 2025 00:51:25 +0000 (01:51 +0100)]
fixed compilation warnings in ggml-sycl (#12424)
Molly Sophia [Mon, 17 Mar 2025 23:27:50 +0000 (07:27 +0800)]
llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <redacted>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <redacted>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <redacted>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <redacted>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <redacted>
* Apply code-format changes
Signed-off-by: Molly Sophia <redacted>
* fix MUSA build
Signed-off-by: Molly Sophia <redacted>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <redacted>
---------
Signed-off-by: Molly Sophia <redacted>
Sigbjørn Skjæret [Mon, 17 Mar 2025 20:14:32 +0000 (21:14 +0100)]
docs : bring llama-cli conversation/template docs up-to-date (#12426)
Gaurav Garg [Mon, 17 Mar 2025 18:25:13 +0000 (23:55 +0530)]
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API was changed in 12.x. For this reason CUDA Graph support was disabled on older CUDA toolkits. This change enables CUDA Graph support on CTK versions < 12.x by using the older API there.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
Guus Waals [Mon, 17 Mar 2025 16:35:43 +0000 (00:35 +0800)]
ggml-vulkan: remove unused find_program(glslc) (#12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
Jeff Bolz [Mon, 17 Mar 2025 14:26:18 +0000 (09:26 -0500)]
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312)
Daniele [Mon, 17 Mar 2025 11:42:33 +0000 (12:42 +0100)]
vulkan: subgroup size tuning (#12087)
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map
---------
Co-authored-by: 0cc4m <redacted>
Jeff Bolz [Mon, 17 Mar 2025 09:43:35 +0000 (04:43 -0500)]
vulkan: use fp32 in coopmat2 q4_k dequant function (#12309)
Jeff Bolz [Mon, 17 Mar 2025 09:41:59 +0000 (04:41 -0500)]
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
Jeff Bolz [Mon, 17 Mar 2025 09:35:00 +0000 (04:35 -0500)]
vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258)
Christian Kastner [Mon, 17 Mar 2025 09:05:23 +0000 (10:05 +0100)]
cmake : enable building llama.cpp using system libggml (#12321)
* cmake: Factor out compiler flag function from ggml
llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).
* cmake: Enable building against system ggml
This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
Akarshan Biswas [Mon, 17 Mar 2025 01:45:12 +0000 (07:15 +0530)]
SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
* SYCL: set extras only on GGML_TYPE_Q4_0
* release tensor_extras in reset buffer interface
Sigbjørn Skjæret [Sun, 16 Mar 2025 17:46:36 +0000 (18:46 +0100)]
llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400)
Georgi Gerganov [Sun, 16 Mar 2025 17:29:36 +0000 (19:29 +0200)]
context : fix init of n_outputs (#12397)
ggml-ci
Daniel Bevenius [Sun, 16 Mar 2025 17:22:05 +0000 (18:22 +0100)]
ci : add --symlinks to xcframework zip command (#12409)
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to create symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file, rather than as a symlink in the
zip which causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```
Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
marcoStocchi [Sat, 15 Mar 2025 16:23:11 +0000 (17:23 +0100)]
llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name
* llama-tts returns ENOENT in case of file write error
note : PR #12042 is closed as superseded by this one.
aubreyli [Sat, 15 Mar 2025 14:49:03 +0000 (22:49 +0800)]
SYCL: Delete redundant plus sign and space (#12391)
fairydreaming [Sat, 15 Mar 2025 14:19:30 +0000 (15:19 +0100)]
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
* sycl : support non-contiguous tensors in binary ops
* sycl : silence unused variable warning
---------
Co-authored-by: Stanisław Szymczyk <redacted>
Chenguang Li [Sat, 15 Mar 2025 01:31:08 +0000 (09:31 +0800)]
[CANN]MUL_MAT optimization (#12382)
Eric Curtin [Fri, 14 Mar 2025 16:41:20 +0000 (16:41 +0000)]
Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4; sometimes we want to adjust this manually
Signed-off-by: Eric Curtin <redacted>
Sigbjørn Skjæret [Fri, 14 Mar 2025 15:57:05 +0000 (16:57 +0100)]
main : add -sysf / --system-prompt-file (#12249) (#12250)
* add system_prompt_file
* add -sysf / --system-prompt-file
* remove system_prompt_file
fairydreaming [Fri, 14 Mar 2025 12:47:05 +0000 (13:47 +0100)]
Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup
---------
Co-authored-by: Stanisław Szymczyk <redacted>
Victor [Fri, 14 Mar 2025 10:21:17 +0000 (11:21 +0100)]
server: fix "--grammar-file" parameter (#12285)
Georgi Gerganov [Fri, 14 Mar 2025 08:47:44 +0000 (10:47 +0200)]
graph : simplify attn input build for unified KV cache (#12381)
ggml-ci
Georgi Gerganov [Fri, 14 Mar 2025 07:03:24 +0000 (09:03 +0200)]
hparams : add SWA rope parameters (#12374)
ggml-ci
Georgi Gerganov [Thu, 13 Mar 2025 17:08:07 +0000 (19:08 +0200)]
llama : fix Gemma3 SWA KV cache shift (#12373)
* llama : fix Gemma3 SWA KV cache shift
ggml-ci
* hparams : add comment [no ci]
Xuan-Son Nguyen [Thu, 13 Mar 2025 11:34:54 +0000 (12:34 +0100)]
arg : no n_predict = -2 for examples except for main and infill (#12364)
Georgi Gerganov [Thu, 13 Mar 2025 10:35:44 +0000 (12:35 +0200)]
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
Ishaan Gandhi [Thu, 13 Mar 2025 10:10:05 +0000 (06:10 -0400)]
server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug
* Remove new APIs
* remove extra line
* Remove from API
* Add extra newline
* Update examples/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <redacted>
Oscar Barenys [Wed, 12 Mar 2025 19:06:58 +0000 (20:06 +0100)]
Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (#12301)
Daniel Bevenius [Wed, 12 Mar 2025 12:45:32 +0000 (13:45 +0100)]
llama.swiftui : fix xcframework dir in README [no ci] (#12353)
This commit fixes the path to the xcframework in the README file which I
had forgotten to change after renaming the build directory.
Alberto Cabrera Pérez [Wed, 12 Mar 2025 09:57:32 +0000 (09:57 +0000)]
sycl : variable sg_size support for mmvq kernels (#12336)
uvos [Wed, 12 Mar 2025 09:14:11 +0000 (10:14 +0100)]
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315)
When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were converted to
selectable warp size. However, the fattn-vec kernels don't work with 64-wide warps for now, so we need
to avoid launching them with parameters for warp64
Xuan-Son Nguyen [Wed, 12 Mar 2025 08:30:24 +0000 (09:30 +0100)]
llama : Add Gemma 3 support (+ experimental vision capability) (#12343)
* llama : Add Gemma 3 text-only support
* fix python coding style
* fix compile on ubuntu
* python: fix style
* fix ubuntu compile
* fix build on ubuntu (again)
* fix ubuntu build, finally
* clip : Experimental support for Gemma 3 vision (#12344)
* clip : Experimental support for Gemma 3 vision
* fix build
* PRId64
Jeff Bolz [Wed, 12 Mar 2025 05:59:19 +0000 (00:59 -0500)]
vulkan: fix bug in coopmat1 mul_mat_id (#12316)
* tests: run mul_mat_id with a larger N
* vulkan: fix bug in coopmat1 mul_mat_id
uvos [Tue, 11 Mar 2025 19:16:03 +0000 (20:16 +0100)]
CUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.
---------
Co-authored-by: Johannes Gäßler <redacted>
jklincn [Tue, 11 Mar 2025 13:25:17 +0000 (21:25 +0800)]
ggml-backend : fix backend search path (#12330)
* Fix backend search path
* replace .native() with '/'
* reverted .native()
BB-fat [Tue, 11 Mar 2025 11:45:02 +0000 (19:45 +0800)]
metal : Cache the Metal library at the device context level (#12265)
Xuan-Son Nguyen [Tue, 11 Mar 2025 08:20:16 +0000 (09:20 +0100)]
clip : bring back GPU support (#12322)
* clip : bring back GPU support
* use n_gpu_layers param
* fix double free
* ggml_backend_init_by_type
* clean up
Eve [Mon, 10 Mar 2025 19:28:11 +0000 (19:28 +0000)]
mat vec double buffer (#12188)
R0CKSTAR [Mon, 10 Mar 2025 17:18:25 +0000 (01:18 +0800)]
musa: support new arch mp_31 and update doc (#12296)
Signed-off-by: Xiaodong Ye <redacted>
Henry Linjamäki [Mon, 10 Mar 2025 16:57:00 +0000 (18:57 +0200)]
opencl: use OpenCL C standard supported by the device (#12221)
This patch nudges the llama.cpp a bit to be supported on PoCL which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
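The selection step can be sketched as below, assuming the device's supported OpenCL C versions have already been queried and parsed into `(major, minor)` tuples. `pick_opencl_c_std` is a hypothetical helper, not the function used in the backend; `-cl-std=` is the standard OpenCL compiler option for choosing the language version.

```python
def pick_opencl_c_std(supported_versions):
    """Return a -cl-std build option for the highest supported OpenCL C version.

    supported_versions: iterable of (major, minor) tuples reported by the device.
    """
    versions = list(supported_versions)
    if not versions:
        raise ValueError("device reported no OpenCL C versions")
    # Tuples compare element-wise, so max() picks the highest major.minor pair.
    major, minor = max(versions)
    return f"-cl-std=CL{major}.{minor}"


# A device that only supports up to OpenCL C 1.2 gets a 1.2 build option.
option = pick_opencl_c_std([(1, 0), (1, 1), (1, 2)])
```

Choosing the version at run time avoids hard-coding CL2.0, which fails on devices that only support older or non-contiguous sets of versions.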
John Bean [Mon, 10 Mar 2025 14:13:09 +0000 (22:13 +0800)]
readme: added Sidekick to available UIs (#12311)
Georgi Gerganov [Mon, 10 Mar 2025 12:07:15 +0000 (14:07 +0200)]
tests : fix test-quantize-fns to init the CPU backend (#12306)
ggml-ci
marcoStocchi [Mon, 10 Mar 2025 11:34:13 +0000 (12:34 +0100)]
common : refactor '-o' option (#12278)
As discussed in PR 'llama-tts : add -o option' (#12042):
* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.
* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
Olivier Chafik [Mon, 10 Mar 2025 10:59:03 +0000 (10:59 +0000)]
`server`: extract <think> tags from qwq outputs (#12297)
* extract <think> tags from qwq outputs
* const for all static regexes in chat.cpp
Olivier Chafik [Mon, 10 Mar 2025 09:45:29 +0000 (09:45 +0000)]
`tool-call`: ensure there's always a non-empty tool call id (#12292)
Olivier Chafik [Mon, 10 Mar 2025 09:45:07 +0000 (09:45 +0000)]
allow missing content in message if tool_calls provided (#12293)
Olivier Chafik [Mon, 10 Mar 2025 09:44:42 +0000 (09:44 +0000)]
`sampler`: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
tc-mb [Mon, 10 Mar 2025 08:33:24 +0000 (16:33 +0800)]
llava : fix bug in minicpm-v code (#11513)
* fix bug in minicpm-v code
* update readme of minicpm-v
Georgi Gerganov [Sun, 9 Mar 2025 17:08:20 +0000 (19:08 +0200)]
server : add speculative decoding presets for FIM (#12287)
Georgi Gerganov [Sat, 8 Mar 2025 16:26:00 +0000 (18:26 +0200)]
authors : update (#12271)
Jason C.H [Sat, 8 Mar 2025 16:02:39 +0000 (00:02 +0800)]
ggml-backend : make path_str compatible with C++20 (#12269)