git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Xuan-Son Nguyen [Mon, 28 Apr 2025 14:12:56 +0000 (16:12 +0200)]
mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count
* fix chat template
* temporary disable GLMEdge test chat tmpl
pockers21 [Mon, 28 Apr 2025 13:45:40 +0000 (06:45 -0700)]
context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <redacted>
Xuan-Son Nguyen [Mon, 28 Apr 2025 12:20:56 +0000 (14:20 +0200)]
llama : (mrope) allow using normal 1D position for text token (#13138)
* llama : (mrope) use normal position for text token
* rm n_pos_per_embd from llm_graph_input_attn_temp
Xuan-Son Nguyen [Mon, 28 Apr 2025 10:18:59 +0000 (12:18 +0200)]
clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
* clip : refactor set input for cgraph
* more strict assert
* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere
* split qwen2 and qwen2.5 code blocks
* minor style fix
Akarshan Biswas [Mon, 28 Apr 2025 09:33:25 +0000 (15:03 +0530)]
SYCL: Add all missing unary kernels (#13074)
* SYCL: Add all missing unary kernels
ggml-ci
* decouple kernel launch range from data size using a strided loop (see the sketch after this list)
* use ceil_div helper for num_blocks
ggml-ci
* clean auto imported header files
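For illustration, a minimal sketch of the strided-loop idea under stated assumptions (USM pointers, a hypothetical `launch_strided` helper; not the actual SYCL backend code):
```
#include <sycl/sycl.hpp>

// Grid-stride pattern: launch a fixed-size range and let each work-item
// walk the buffer with that stride, decoupling launch size from data size.
template <typename T, typename F>
void launch_strided(sycl::queue & q, T * data, size_t n, F op,
                    size_t range = 4096 /* illustrative launch size */) {
    q.parallel_for(sycl::range<1>(range), [=](sycl::id<1> id) {
        for (size_t i = id[0]; i < n; i += range) {
            data[i] = op(data[i]); // e.g. a unary kernel such as gelu
        }
    });
}
```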
Georgi Gerganov [Mon, 28 Apr 2025 09:10:18 +0000 (12:10 +0300)]
readme : update hot topics (#13150)
Georgi Gerganov [Mon, 28 Apr 2025 08:57:19 +0000 (11:57 +0300)]
common : fix noreturn compile warning (#13151)
ggml-ci
Xuan-Son Nguyen [Mon, 28 Apr 2025 08:11:58 +0000 (10:11 +0200)]
llama-chat : fix typo GML --> GLM (#13143)
R0CKSTAR [Mon, 28 Apr 2025 07:33:28 +0000 (15:33 +0800)]
musa: fix typo in cc control (#13144)
Signed-off-by: Xiaodong Ye <redacted>
Johannes Gäßler [Mon, 28 Apr 2025 07:29:26 +0000 (09:29 +0200)]
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137)
Xuan-Son Nguyen [Mon, 28 Apr 2025 05:16:59 +0000 (07:16 +0200)]
arg : fix unused variable (#13142)
4onen [Sun, 27 Apr 2025 21:48:26 +0000 (14:48 -0700)]
llama-bench : Add `--override-tensors` arg (#12922)
* Add --override-tensors option to llama-bench
* Correct llama-bench --override-tensors to --override-tensor
* llama-bench: Update --override-tensors parsing to match --tensor-split and appear in the test matrix.
* Make new llama-bench util functions static to fix Ubuntu CI
* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
matteo [Sun, 27 Apr 2025 19:57:32 +0000 (21:57 +0200)]
llama-chat : fix wrong template in GLM4-0414 (#13140)
* fix wrong template in GLM4-0414
* fix spaces
* no bos token since it is already in the template
* moved the chatglm4 check to higher priority
* restored template for old GLM models
* moved the GLM4 template check in the correct place with correct check
R0CKSTAR [Sun, 27 Apr 2025 11:22:49 +0000 (19:22 +0800)]
musa: fix build warning (#13129)
Signed-off-by: Xiaodong Ye <redacted>
LostRuins Concedo [Sun, 27 Apr 2025 10:43:37 +0000 (18:43 +0800)]
Fix Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402, as the has_qwen2vl_merger migration was incomplete (#13133)
HimariO [Sun, 27 Apr 2025 08:10:34 +0000 (16:10 +0800)]
clip : Add Qwen2.5VL support (#12402)
* implement vision model architecture, gguf converter
* handle window attention inputs
* add debug utils
* fix a few incorrect tensor memory layouts
* move position id remap out of ggml to avoid int32 cuda operations
* cleaning up
* ignore transformers Qwen2_5_xxx type check
* remove rarely used `qwen2vl-cli` debug functions
* remove commented-out code blocks
* fix attn weight scaling after rebase
* add `PROJECTOR_TYPE_QWEN2_5_VL`
* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`
* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`
* remove `attn_window_size` from gguf
* fix model conversion
* clean up
* fix merging problem
* add test
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:58:12 +0000 (22:58 +0200)]
common : add common_remote_get_content (#13123)
* common : add common_remote_get_content
* support max size and timeout
* add tests
Xuan-Son Nguyen [Sat, 26 Apr 2025 20:39:47 +0000 (22:39 +0200)]
clip : improve projector naming (#13118)
* clip : improve projector naming
* no more kv has_llava_projector
* rm unused kv
* rm more unused
SXX [Sat, 26 Apr 2025 14:05:31 +0000 (22:05 +0800)]
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion
* move fp converter to ggml-cpu
* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
frob [Sat, 26 Apr 2025 08:10:20 +0000 (10:10 +0200)]
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <redacted>
Diego Devesa [Fri, 25 Apr 2025 17:40:11 +0000 (19:40 +0200)]
llama : fix K-shift with quantized K and BLAS backend (#13113)
City [Fri, 25 Apr 2025 12:38:34 +0000 (14:38 +0200)]
Force FP32 compute in GLM4 FFN Down (#13101)
* Force FP32 compute in cuBLAS GEMM
* Revert "Force FP32 compute in cuBLAS GEMM"
This reverts commit 6efd872732159ab88ee7b3c1d77ba5ebc83079bd.
* Force F32 compute in GLM4 ffn down
* Edit comment to clarify issue
Co-authored-by: Johannes Gäßler <redacted>
---------
Co-authored-by: Johannes Gäßler <redacted>
Xuan-Son Nguyen [Fri, 25 Apr 2025 12:31:42 +0000 (14:31 +0200)]
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
Neo Zhang Jianyu [Fri, 25 Apr 2025 09:37:51 +0000 (17:37 +0800)]
move tensor reordering from init to OP execution (#13003)
Radoslav Gerganov [Fri, 25 Apr 2025 07:08:08 +0000 (10:08 +0300)]
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
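A minimal sketch of the fire-and-forget pattern this enables, with hypothetical helper names (`send_all`, `rpc_set_tensor_nowait`) rather than the actual rpc.cpp API:
```
#include <sys/types.h>
#include <sys/socket.h>
#include <cstdint>
#include <vector>

// Write the whole buffer to the socket, retrying on partial sends.
static bool send_all(int fd, const void * buf, size_t len) {
    const char * p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= (size_t) n;
    }
    return true;
}

// Fire-and-forget: send the SET_TENSOR request and return immediately.
// TCP keeps the byte stream ordered, so the server still processes the
// command before any later request; skipping the empty reply saves one
// network round trip per call (4 per generated token).
static bool rpc_set_tensor_nowait(int fd, const std::vector<uint8_t> & msg) {
    return send_all(fd, msg.data(), msg.size());
}
```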
Xuan-Son Nguyen [Thu, 24 Apr 2025 20:17:04 +0000 (22:17 +0200)]
clip : remove boi/eoi embeddings for GLM-edge model (#13081)
Georgi Gerganov [Thu, 24 Apr 2025 19:29:22 +0000 (22:29 +0300)]
embeddings : fix batch sizes (#13076)
ggml-ci
Georgi Gerganov [Thu, 24 Apr 2025 14:22:27 +0000 (17:22 +0300)]
ggml : fix trailing whitespaces (#0)
Georgi Gerganov [Thu, 24 Apr 2025 13:47:43 +0000 (16:47 +0300)]
sync : ggml
ggml-ci
Acly [Thu, 17 Apr 2025 12:16:45 +0000 (14:16 +0200)]
ggml : Depthwise 2D convolution (ggml/1152)
* ggml-cpu : kernels for faster depthwise 2D convolution
* fix compile: remove static after moving to ops.cpp
* add dilation for depthwise_conv_2d
* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace
* review: rename depthwise_conv_2d -> conv_2d_dw everywhere
Johannes Gäßler [Thu, 24 Apr 2025 13:57:10 +0000 (15:57 +0200)]
CUDA: use switch statements in constexpr functions (#13095)
Georgi Gerganov [Thu, 24 Apr 2025 13:00:10 +0000 (16:00 +0300)]
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama
ggml-ci
* cmake : rework tests
ggml-ci
* llguidance : remove unicode include
ggml-ci
* cmake : make c++17 private
ggml-ci
Georgi Gerganov [Thu, 24 Apr 2025 12:44:05 +0000 (15:44 +0300)]
clang-tidy : disable warning about missing math parenthesis (#13091)
Xuan-Son Nguyen [Thu, 24 Apr 2025 12:04:14 +0000 (14:04 +0200)]
arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload
* Update common/arg.cpp
Xuan-Son Nguyen [Thu, 24 Apr 2025 10:14:13 +0000 (12:14 +0200)]
arg : clean up handling --mmproj with -hf (#13082)
* arg : clean up handling --mmproj with -hf
* rm change about no_mmproj
* Revert "rm change about no_mmproj"
This reverts commit 2cac8e0efb629d66c612f137e75d562f94bb9e6c.
* handle no_mmproj explicitly
* skip download mmproj on examples not using it
Georgi Gerganov [Thu, 24 Apr 2025 07:38:30 +0000 (10:38 +0300)]
metal : fix floating-point range of attention scores in FA kernels (#13090)
ggml-ci
Eve [Thu, 24 Apr 2025 07:18:33 +0000 (07:18 +0000)]
vulkan: matmul gcn tuning (#13016)
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m <redacted>
* disable this tune for the proprietary driver
---------
Co-authored-by: 0cc4m <redacted>
pl752 [Wed, 23 Apr 2025 21:32:35 +0000 (02:32 +0500)]
llama-mtmd-cli: SIGINT rework in mtmd vision example (#13080)
* SIGINT rework in mtmd vision example
* Applied suggestions on mtmd-cli PR
* Forgot to invert one of the conditions
* Update examples/llava/mtmd-cli.cpp
* Removed redundant exit check
---------
Co-authored-by: pl752 <redacted>
Co-authored-by: Xuan-Son Nguyen <redacted>
Xuan-Son Nguyen [Wed, 23 Apr 2025 18:21:59 +0000 (20:21 +0200)]
mtmd : Support Pixtral 12B (#13065)
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
piDack [Wed, 23 Apr 2025 14:59:14 +0000 (22:59 +0800)]
convert : append multi-eos, half-rope, bos to GLM4-0414 and Z (#13021)
* append multi-eos, half-rope, bos to GLM4-0414
* remove unset var
Radoslav Gerganov [Wed, 23 Apr 2025 07:32:49 +0000 (10:32 +0300)]
rpc : add command line option for number of threads for the CPU backend (#13060)
closes #13051
Johannes Gäßler [Tue, 22 Apr 2025 19:27:40 +0000 (21:27 +0200)]
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
Xuan-Son Nguyen [Tue, 22 Apr 2025 14:24:54 +0000 (16:24 +0200)]
mtmd : support SmolVLM (version 1 and 2) (#13050)
* mtmd : support SmolVLM (version 1 and 2)
* correct chat template
* fix n_patches
* scale_factor is an int
* add more models to test
Georgi Gerganov [Tue, 22 Apr 2025 13:16:10 +0000 (16:16 +0300)]
security : add note about RPC and server functionality (#13061)
* security : add note about RPC functionality
* security : add note about llama-server
Georgi Gerganov [Tue, 22 Apr 2025 13:15:51 +0000 (16:15 +0300)]
metal : add memory pool for temp allocs (#12850)
* metal : add memory pool for temp allocs (wip) [no ci]
* cont : free buffers from the heap
* cont : resize heap [no ci]
* cont : refactor heap [no ci]
* cont : heap for each cmd buffer [no ci]
* cont : fix free
* wip
* cont : fix alignment [no ci]
* cont : not working .. [no ci]
* cont : heap allocation now works [no ci]
* cont : use MTLHeapTypePlacement
ggml-ci
* metal : use dynamic MTLHeap allocations
ggml-ci
* metal : add comments
* metal : disable softmax use of mem_pool
ggml-ci
* metal : final touches
Xuan-Son Nguyen [Tue, 22 Apr 2025 08:37:00 +0000 (10:37 +0200)]
llava : update documentations (#13055)
* llava : update documentations
* fix typo
Diego Devesa [Mon, 21 Apr 2025 16:13:51 +0000 (18:13 +0200)]
ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)
* ggml : add SSE 4.2 variant for CPUs without AVX
* ggml : add x64 base ABI variant
Akarshan Biswas [Mon, 21 Apr 2025 13:43:30 +0000 (19:13 +0530)]
SYCL: Add non-contiguous support in ROPE (#12993)
ggml-ci
Xuan-Son Nguyen [Mon, 21 Apr 2025 13:32:58 +0000 (15:32 +0200)]
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add not supported msg for qwen2vl
* Update examples/llava/mtmd.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Xuan-Son Nguyen [Sun, 20 Apr 2025 21:29:36 +0000 (23:29 +0200)]
convert : experimental support for `--mmproj` flag (#13023)
* convert : experimental support for `--mmproj` flag
* fix bad ctrl+f replace
* fix style
* split into subclasses TextModel and VisionModel
* rename Mode --> ModelBase
* small fix
* correct CLIP_VISION arch name (because existing GGUF already use it)
* Apply suggestions from code review
Co-authored-by: compilade <redacted>
* fix Mistral3Model
* fix typo
Co-authored-by: compilade <redacted>
---------
Co-authored-by: compilade <redacted>
Jeffrey Morgan [Sun, 20 Apr 2025 10:15:41 +0000 (03:15 -0700)]
llava: fix errors in clip.h on certain compilers (#13030)
Jeff Bolz [Sun, 20 Apr 2025 08:50:02 +0000 (03:50 -0500)]
vulkan: support noncontiguous rms_norm (#13031)
Jeffrey Morgan [Sun, 20 Apr 2025 05:28:40 +0000 (22:28 -0700)]
metal: add neg operator (#13029)
bandoti [Sat, 19 Apr 2025 16:05:03 +0000 (13:05 -0300)]
Disable CI cross-compile builds (#13022)
Sigbjørn Skjæret [Sat, 19 Apr 2025 14:26:38 +0000 (16:26 +0200)]
gguf-py : fix upload python package workflow (#13020)
Xuan-Son Nguyen [Sat, 19 Apr 2025 07:15:45 +0000 (09:15 +0200)]
clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`
* refactor llava-1.6 preprocessing
* simplify logic for llava-1.5
* missing include
Daniel Tang [Fri, 18 Apr 2025 20:02:55 +0000 (16:02 -0400)]
main : Fix Ctrl+D/newline handling (#12951)
This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).
This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."
Fixes #12949
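A rough sketch of the described continuation rule, assuming a simple line-based reader (hypothetical helper, not the actual main.cpp input loop):
```
#include <iostream>
#include <string>

// A line ending in '\' continues onto the next line; the backslash is
// replaced by an explicit newline, and control returns to the program
// only when the user presses return on a line without one.
static std::string read_input(std::istream & in) {
    std::string result, line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\\') {
            line.pop_back();
            result += line;
            result += '\n'; // user explicitly asked for a trailing newline
            continue;
        }
        result += line;
        break;
    }
    return result;
}
```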
Chris Thompson [Fri, 18 Apr 2025 18:30:41 +0000 (12:30 -0600)]
gguf-py : GGUF Editor GUI - Python + Qt6 (#12930)
Xuan-Son Nguyen [Fri, 18 Apr 2025 17:58:12 +0000 (19:58 +0200)]
server : use std::move whenever possible (#12936)
* server : use std::move whenever possible
* use r-value ref
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <redacted>
* make task creation scoped
* restore std::move
* fix task_id not set correctly
* apply changes from suggestion
Co-authored-by: ggerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Akarshan Biswas [Fri, 18 Apr 2025 13:57:56 +0000 (19:27 +0530)]
SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)
* SYCL: refactor move to a separate file
* Fix binbcast
* Remove duplicates
* fix include formatting
* fix typo
Xuan-Son Nguyen [Fri, 18 Apr 2025 08:04:51 +0000 (10:04 +0200)]
mtmd : add methods to access `mtmd_image_tokens` (#12906)
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* fix prompt_modified
* rm redundant data member
Radoslav Gerganov [Fri, 18 Apr 2025 07:13:42 +0000 (10:13 +0300)]
rpc : add RPC_CMD_HELLO (#12955)
Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server, following the semantic versioning rules at https://semver.org.
Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465.
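A sketch of a semver-style compatibility check for such a handshake (field names are illustrative, not the actual wire format):
```
#include <cstdint>
#include <cstdio>

struct rpc_hello { uint8_t major, minor, patch; };

// Per https://semver.org, only a major version bump signals a breaking
// change, so client and server need only agree on the major number.
static bool rpc_version_compatible(const rpc_hello & c, const rpc_hello & s) {
    if (c.major != s.major) {
        fprintf(stderr, "RPC protocol mismatch: client %u.%u.%u vs server %u.%u.%u\n",
                (unsigned) c.major, (unsigned) c.minor, (unsigned) c.patch,
                (unsigned) s.major, (unsigned) s.minor, (unsigned) s.patch);
        return false;
    }
    return true;
}
```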
Georgi Gerganov [Thu, 17 Apr 2025 15:16:36 +0000 (18:16 +0300)]
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
Alan Gray [Thu, 17 Apr 2025 13:19:42 +0000 (14:19 +0100)]
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)
hipudding [Thu, 17 Apr 2025 12:34:16 +0000 (20:34 +0800)]
CANN: Add support for async operator submission (#12864)
Submit operators using asynchronous threads to improve performance.
Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.
Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
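A simplified sketch of an env-controlled async toggle; the real backend uses its own task queue rather than a detached thread per operator:
```
#include <cstdlib>
#include <functional>
#include <thread>

// GGML_CANN_ASYNC_MODE unset or empty -> synchronous (the default).
static bool cann_async_enabled() {
    const char * v = std::getenv("GGML_CANN_ASYNC_MODE");
    return v != nullptr && v[0] != '\0';
}

static void submit_op(const std::function<void()> & op) {
    if (cann_async_enabled()) {
        std::thread(op).detach(); // hand the operator to a worker thread
    } else {
        op();                     // submit synchronously
    }
}
```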
Mikko Juola [Thu, 17 Apr 2025 08:37:05 +0000 (01:37 -0700)]
llama : recognize IBM Granite 3.3 FIM tokens (#12988)
Granite's FIM tokens are very similar to Qwen's; they just use an
underscore instead of a dash, e.g. <fim_middle> instead of <fim-middle>.
Opening tokenizer_config.json in ibm-granite/granite-3.3-8b-base shows:
```
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
...
"<reponame>",
```
kimminsu [Wed, 16 Apr 2025 21:25:57 +0000 (06:25 +0900)]
opencl: fix incorrect local_size index in profiling log (#12868)
Jeff Bolz [Wed, 16 Apr 2025 18:37:25 +0000 (13:37 -0500)]
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optimization doesn't require a power-of-two
ratio; the only thing relying on it was the modulo operation written as a
bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x
coordinate, and multiple workgroups in the X dimension (pre-split) indicate
a larger FA operation that wouldn't need splitting.
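The point about the bitwise & is easy to verify: `x & (r - 1)` equals `x % r` only when `r` is a power of two. A minimal check (in C++; GLSL has the same operators):
```
#include <cassert>

static unsigned mod_pow2(unsigned x, unsigned r) { return x & (r - 1); }
static unsigned mod_any (unsigned x, unsigned r) { return x % r; }

int main() {
    assert(mod_pow2(13, 4) == mod_any(13, 4)); // r = 4 (power of two): both 1
    assert(mod_pow2(13, 3) != mod_any(13, 3)); // r = 3: bitwise trick breaks
    return 0;
}
```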
Chenguang Li [Wed, 16 Apr 2025 08:21:05 +0000 (16:21 +0800)]
CANN: Add 310P operator support check (#12962)
lhez [Tue, 15 Apr 2025 19:26:00 +0000 (12:26 -0700)]
opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)
* opencl: refactor - split the kernel files
---------
Co-authored-by: Shangqing Gu <redacted>
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`
---------
Co-authored-by: Shangqing Gu <redacted>
Georgi Gerganov [Tue, 15 Apr 2025 11:45:05 +0000 (14:45 +0300)]
metal : add FA-vec kernels for head size 96 (#12952)
ggml-ci
hipudding [Tue, 15 Apr 2025 11:08:55 +0000 (19:08 +0800)]
CANN: Add x86 build ci (#12950)
* CANN: Add x86 build ci
* CANN: fix code format
David Huang [Tue, 15 Apr 2025 09:20:38 +0000 (17:20 +0800)]
CUDA/HIP: Share the same unified memory allocation logic. (#12934)
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
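A simplified sketch of that runtime switch (the actual allocator in ggml-cuda also handles fallbacks and logging):
```
#include <cstdlib>
#include <cuda_runtime.h>

// One binary, behavior picked at runtime instead of the old compile-time
// GGML_HIP_UMA flag; the same code path works on NVIDIA and AMD (via HIP).
static cudaError_t ggml_cuda_alloc(void ** ptr, size_t size) {
    const char * env = std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY");
    if (env != nullptr && env[0] == '1') {
        return cudaMallocManaged(ptr, size); // pageable host/device memory,
                                             // useful for integrated GPUs
    }
    return cudaMalloc(ptr, size);            // plain VRAM for dedicated GPUs
}
```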
Akarshan Biswas [Tue, 15 Apr 2025 08:37:42 +0000 (14:07 +0530)]
SYCL: Add ROPE vision kernel (#12887)
* SYCL: Add ROPE vision kernel
* Add comment about rope mode
Juk Armstrong [Tue, 15 Apr 2025 06:49:57 +0000 (07:49 +0100)]
llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
Srihari-mcw [Tue, 15 Apr 2025 06:22:36 +0000 (11:52 +0530)]
ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829)
* Add AVX512 implementation of GEMM - q4kx8
* Update changes to remove unnecessary whitespaces
Chenguang Li [Tue, 15 Apr 2025 02:09:35 +0000 (10:09 +0800)]
CANN: Opt ROPE optimization (#12865)
* [CANN]Opt ROPE optimization
* [CANN]Codestyle adjustment
* [CANN]Fix the ROPE precision issue
* [CANN]codestyle fix
* [CANN]add rope unsupport case
Signed-off-by: noemotiovon <redacted>
Xinpeng Dou [Tue, 15 Apr 2025 02:04:24 +0000 (10:04 +0800)]
CANN: Optimize CANN buffer pool memory management (#12875)
Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools; the selection logic
(sketched after this list) is:
1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL is not
defined, the VMM pool is selected by default.
2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined, the priority
queue-based memory pool is used.
3. If neither condition is met, the default memory pool is used.
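A sketch of that selection order; the macro names follow the commit message, the surrounding code is illustrative:
```
enum class cann_pool { VMM, BUF_PRIO, DEFAULT };

static cann_pool select_cann_pool(bool vmm_available) {
    (void) vmm_available;
#ifndef GGML_CANN_DISABLE_VMM_POOL
    if (vmm_available) {
        return cann_pool::VMM;      // 1. VMM pool by default
    }
#endif
#ifdef GGML_CANN_ENABLE_BUF_PRIO_POOL
    return cann_pool::BUF_PRIO;     // 2. priority queue-based pool
#else
    return cann_pool::DEFAULT;      // 3. traditional default pool
#endif
}
```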
Russyyds [Mon, 14 Apr 2025 17:18:20 +0000 (01:18 +0800)]
Add performance print for gemma3 in example (#12929)
Akarshan Biswas [Mon, 14 Apr 2025 12:23:53 +0000 (17:53 +0530)]
SYCL: Fix im2col (#12910)
* SYCL: Fix im2col
* restore local workgroup size adjustments for large inputs
* restore format
Radoslav Gerganov [Mon, 14 Apr 2025 10:59:34 +0000 (13:59 +0300)]
rpc : use ggml_context_ptr (#12938)
Neo Zhang Jianyu [Mon, 14 Apr 2025 10:19:07 +0000 (18:19 +0800)]
disable curl lib check; this action was missed by commit
bd3f59f81289b920bcc597a208c14f55e39ed37e (#12761) (#12937)
Georgi Gerganov [Mon, 14 Apr 2025 05:52:10 +0000 (08:52 +0300)]
sync : ggml
ggml-ci
cmdr2 [Fri, 11 Apr 2025 06:44:19 +0000 (12:14 +0530)]
cpu: fix the CPU backend's supports_op for GET_ROWS_BACK; fixes a fatal error when running test-backend-ops with only the CPU backend (ggml/1190)
SXX [Mon, 14 Apr 2025 05:47:55 +0000 (13:47 +0800)]
ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (#12773)
* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register
* simplifies the codebase by removing redundant functions
Alan Gray [Sun, 13 Apr 2025 21:12:21 +0000 (22:12 +0100)]
ggml: disable CUDA graphs for unsupported DUP and CONT node types (#12891)
Fixes #12798
Ed Addario [Sun, 13 Apr 2025 18:29:28 +0000 (19:29 +0100)]
quantize: Handle user-defined quantization levels for additional tensors (#12511)
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
* Implement general --tensor-type instead of tensor-specific command option
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
Prajwal B Mehendarkar [Sat, 12 Apr 2025 15:33:39 +0000 (21:03 +0530)]
common : Define cache directory on AIX (#12915)
Jeff Bolz [Sat, 12 Apr 2025 08:44:48 +0000 (03:44 -0500)]
vulkan: use aligned loads for flash attention mask (#12853)
Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.
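The usual round-up computation behind such a change, for illustration (`align` must be a power of two):
```
#include <cstddef>

// Round a row stride up to a multiple of `align` so every row of the mask
// starts on an aligned boundary, making wide/vectorized loads legal.
static size_t align_up(size_t stride, size_t align) {
    return (stride + align - 1) & ~(align - 1);
}
// e.g. align_up(100, 16) == 112
```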
Matt Clayton [Sat, 12 Apr 2025 05:29:03 +0000 (01:29 -0400)]
llava: Fix cpu-only clip image encoding segfault (#12907)
* llava: Fix cpu-only clip image encoding
* clip : no smart ptr for ggml_backend_t
* Fix for backend_ptr push_back
---------
Co-authored-by: Xuan Son Nguyen <redacted>
Georgi Gerganov [Fri, 11 Apr 2025 20:37:41 +0000 (23:37 +0300)]
server : add VSCode's GitHub Copilot Chat support (#12896)
* server : add VSCode's GitHub Copilot Chat support
* cont : update handler name
yuri@FreeBSD [Fri, 11 Apr 2025 20:04:14 +0000 (13:04 -0700)]
rpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903)
Olivier Chafik [Fri, 11 Apr 2025 19:47:52 +0000 (12:47 -0700)]
`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)
* test all chat formats w/o tools
yuri@FreeBSD [Fri, 11 Apr 2025 19:45:44 +0000 (12:45 -0700)]
common : Define cache directory on FreeBSD (#12892)
Ewan Crawford [Fri, 11 Apr 2025 13:32:14 +0000 (15:32 +0200)]
sycl: Support sycl_ext_oneapi_limited_graph (#12873)
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limited_graph` devices that
don't support graph update.
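A sketch of the widened capability check, assuming the DPC++ aspect enumerators `ext_oneapi_graph` and `ext_oneapi_limited_graph` from the graph extension:
```
#include <sycl/sycl.hpp>

// Accept devices with full graph support, or limited graph support for
// code paths that never update a recorded graph.
static bool device_supports_graphs(const sycl::device & dev) {
    return dev.has(sycl::aspect::ext_oneapi_graph) ||
           dev.has(sycl::aspect::ext_oneapi_limited_graph);
}
```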
tastelikefeet [Fri, 11 Apr 2025 12:01:56 +0000 (20:01 +0800)]
contrib: support modelscope community (#12664)
* support download from modelscope
* support login
* remove comments
* add arguments
* fix code
* fix win32
* test passed
* fix readme
* revert readme
* change to MODEL_ENDPOINT
* revert tail line
* fix readme
* refactor model endpoint
* remove blank line
* fix header
* fix as comments
* update comment
* update readme
---------
Co-authored-by: tastelikefeet <redacted>
Yuxuan Zhang [Fri, 11 Apr 2025 10:10:10 +0000 (18:10 +0800)]
llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
* GLM-4-0414
* use original one
* Using with tensor map
* fix bug
* change order
* change order
* format with flake8
Xuan-Son Nguyen [Fri, 11 Apr 2025 10:09:39 +0000 (12:09 +0200)]
clip : use smart pointer (⚠️ breaking change) (#12869)
* clip : use smart pointers
* fix warmup
* add forward declaration
* missing include
* fix include (2)
* composite
* simplify batch ptr
* fix conflict
Akarshan Biswas [Fri, 11 Apr 2025 08:03:50 +0000 (13:33 +0530)]
SYCL: Add fp16 type support to unary op kernels (#12788)
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static cast and remove unneeded castings
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
Daniel Han [Fri, 11 Apr 2025 07:49:09 +0000 (00:49 -0700)]
convert : Llama4 RoPE fix (#12889)