* restore ci/run.sh, rename struct definition, fix bug in ggml_sycl_op_mul_mat_sycl
* fix format issue
* llama : fix segfault from unknown model arch name (llama/5820)
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`,
which does not exist for const maps (see the sketch after this entry).
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <redacted>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <redacted>
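For reference, a minimal sketch of why the const-map change forces `std::map::at`: `operator[]` is a non-const member (it may insert a default-constructed value for a missing key), so it cannot be called on a const map, while `at` works but throws for missing keys. The map contents below are illustrative, not the actual llama.cpp definitions.
```cpp
#include <iostream>
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_UNKNOWN };

// Illustrative const map; llama.cpp's LLM_ARCH_NAMES is more elaborate.
static const std::map<llm_arch, std::string> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA,   "llama"     },
    { LLM_ARCH_UNKNOWN, "(unknown)" }, // named so at() does not throw for unknown archs
};

int main() {
    // LLM_ARCH_NAMES[LLM_ARCH_LLAMA];  // does not compile: operator[] is non-const
    std::cout << LLM_ARCH_NAMES.at(LLM_ARCH_LLAMA)   << "\n"; // "llama"
    std::cout << LLM_ARCH_NAMES.at(LLM_ARCH_UNKNOWN) << "\n"; // "(unknown)"
    return 0;
}
```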
* llama : refactor internal quantization functions (llama/5830)
* scripts : add pod-llama.sh
* ggml : IQ3_S improvements (llama/5829)
* iq3_s: somewhat faster AVX2 dot product
On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2 (see the sketch after this entry).
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096, LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <redacted>
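For context on the "xor - sub trick" mentioned above: these dot products apply stored sign bits to int8 quants either by XOR-ing with an all-ones/zero mask and then subtracting the mask (the AVX2-friendly form) or by multiplying with ±1 (which is what `vmulq_s8` does, 16 lanes at a time, on NEON). A scalar sketch of the equivalence, with made-up values:
```cpp
#include <cassert>
#include <cstdint>

int main() {
    const int8_t q[4]    = { 3, -5, 7, 2 };   // quantized values (made-up)
    const int8_t mask[4] = { 0, -1, 0, -1 };  // 0 = keep sign, -1 (0xFF) = negate

    for (int i = 0; i < 4; ++i) {
        // "xor - sub" trick: (q ^ 0xFF) - 0xFF == -q, and (q ^ 0) - 0 == q
        int8_t a = (int8_t)((q[i] ^ mask[i]) - mask[i]);
        // multiply variant: the same result via q * (+1 or -1)
        int8_t b = (int8_t)(q[i] * (mask[i] ? -1 : 1));
        assert(a == b);
    }
    return 0;
}
```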
* convert-hf : make model class definitions self-contained (llama/5825)
* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (llama/5821)
Co-authored-by: github-actions[bot] <redacted>
* server : init http requests thread pool with --parallel if set (llama/5836)
* ci : schedule slow server tests only on Release or on demand (llama/5839)
* llama : fix llama_copy_state_data with fragmented KV cache (llama/5840)
The row size of the saved states was based on kv_self.head, while
it should be based on llama_kv_cache_cell_max (see the sketch after this entry).
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
* gguf-dump : support i-quants (llama/5841)
Co-authored-by: Black_Fox <redacted>
* llama : allow for user specified embedding pooling type (llama/5849)
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <redacted>
* readme : add API changes section
* cuda : fix data race in soft max (llama/5853)
* main : support special tokens as reverse/anti prompt (llama/5847)
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once (see the sketch after this entry).
* main : minor
---------
Co-authored-by: Georgi Gerganov <redacted>
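A sketch of the "tokenize antiprompts only once" idea, with a stand-in tokenizer so it is self-contained (the real example calls the llama.cpp tokenizer with special-token parsing enabled): each reverse/anti prompt is tokenized up front, and generation then only compares the tail of the output tokens against the cached sequences.
```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t;

// Stand-in tokenizer so the sketch compiles on its own.
static std::vector<llama_token> tokenize(const std::string & text) {
    return std::vector<llama_token>(text.begin(), text.end());
}

int main() {
    const std::vector<std::string> antiprompts = { "User:", "<|im_end|>" };

    // Tokenize every antiprompt once, up front, instead of on every step.
    std::vector<std::vector<llama_token>> antiprompt_ids;
    for (const auto & s : antiprompts) {
        antiprompt_ids.push_back(tokenize(s));
    }

    // During generation, only the tail of the output tokens is compared
    // against the cached antiprompt token sequences.
    std::vector<llama_token> output = tokenize("Assistant: hi\nUser:");
    for (const auto & ids : antiprompt_ids) {
        if (!ids.empty() && ids.size() <= output.size() &&
            std::equal(ids.begin(), ids.end(), output.end() - ids.size())) {
            return 0; // antiprompt hit -> stop generating
        }
    }
    return 0;
}
```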
* common : use LLAMA_DEFAULT_SEED (llama/5855)
* add some new ops, fix some operators, and add batch operations to certain operators (ggml/747)
* cuda: fix group_norm
* cuda: add batch inference support for ggml_pad/ggml_upscale
Kawrakow [Tue, 27 Feb 2024 14:34:24 +0000 (16:34 +0200)]
IQ4_XS: a 4.25 bpw quantization (llama/5747)
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 (the bpw arithmetic is sketched after this entry)
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
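The 4.25 bpw figure is consistent with the layout described above, assuming one fp16 scale per 256-weight super-block on top of the 6-bit block scales (that super-block scale is an assumption, not stated in the entry):
```latex
\frac{\underbrace{256 \cdot 4}_{\text{4-bit weights}}
    + \underbrace{8 \cdot 6}_{\text{6-bit scales per 32-weight block}}
    + \underbrace{16}_{\text{fp16 super-block scale}}}{256}
  = \frac{1024 + 48 + 16}{256} = 4.25 \ \text{bpw}
```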
Kawrakow [Sat, 24 Feb 2024 14:23:52 +0000 (16:23 +0200)]
IQ3_S: a much better alternative to Q3_K (llama/5676)
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iq3_xs: a 3.4375 bpw variant (see the bit-count sketch after this entry)
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasant surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
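For the 3.4375 bpw variant mentioned above, one per-super-block (QK_K = 256) bit budget consistent with that figure is sketched below; the exact field split is an assumption, not something stated in the entry:
```latex
\frac{\underbrace{16}_{\text{fp16 scale}}
    + \underbrace{512}_{\text{low quant bits}}
    + \underbrace{64}_{\text{high quant bits}}
    + \underbrace{256}_{\text{sign bits}}
    + \underbrace{32}_{\text{block scales}}}{256}
  = \frac{880}{256} = 3.4375 \ \text{bpw}
```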
```
options:
  -q, --quick           skip checking the required library

action:
  TEXTFILE              read the text file (default: stdin)
  -l, --list            show the list of voices and exit
  -h, --help            show this help and exit

voice selection:
  -n NAME, --name NAME  get a voice object by name (default: Arnold)
  -v NUMBER, --voice NUMBER
                        get a voice object by number (see --list)
  -f KEY=VAL, --filter KEY=VAL
                        filter voices by labels (default: "use case=narration")
                        this option can be used multiple times
                        filtering will be disabled if the first -f has no "=" (e.g. -f "any")

output:
  -s FILE, --save FILE  save the TTS to a file (default: audio.mp3)
  -p, --play            play the TTS with ffplay
```
Fix issue: Conversion from Whisper to OpenVINO failed #1870
convert-whisper-to-openvino.py stopped working with OpenVINO version 2023.0.0-10926-b4452d56304-releases/2023/0.
Error was: TypeError: load(): incompatible function arguments. The following argument types are supported:
1. (self: openvino._pyopenvino.FrontEnd, path: object) -> ov::frontend::InputModel
Davidson Francis [Thu, 22 Feb 2024 13:01:08 +0000 (10:01 -0300)]
main : fix file existence check in main.cpp (#1889)
In commit dda4b0e of PR #1872, I introduced a check for the
existence of files before loading the model. However, I had not
considered the case where whisper.cpp might read from stdin as well,
and in such cases the checks should ignore the "-" argument, as it
does not represent a regular file.
Additionally, this commit removes the usage of stat() in favor of
the recently introduced is_file_exist() function in common.cpp from
PR #1871 (a sketch of the check follows this entry).
Apologies for the bug introduced in the previous PR and any
inconvenience it may have caused.
Co-authored-by: Jared Van Bortel <redacted>
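A sketch of the resulting check, with hypothetical code rather than the actual common.cpp/main.cpp: "-" is treated as stdin and skipped, and every other argument must exist as a readable file before the model is loaded.
```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the is_file_exist() helper mentioned above.
static bool is_file_exist(const std::string & path) {
    std::ifstream f(path);
    return f.good();
}

// Validate inputs *before* loading the model, so a typo in a filename
// does not cost a full (potentially slow) model load.
static bool check_input_files(const std::vector<std::string> & fnames) {
    for (const auto & fname : fnames) {
        if (fname == "-") {
            continue; // "-" means stdin, not a regular file
        }
        if (!is_file_exist(fname)) {
            fprintf(stderr, "error: input file not found '%s'\n", fname.c_str());
            return false;
        }
    }
    return true;
}
```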
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split NUMA init out from llama_backend_init and created llama_numa_init; updated all code paths and samples (see the usage sketch after this entry)
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <redacted>
* Update ggml.c
align parameters
Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <redacted>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <redacted>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <redacted>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
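A minimal usage sketch of the split described above, assuming the post-change API shape (a parameterless `llama_backend_init()` plus a separate `llama_numa_init()` taking a `ggml_numa_strategy`); the enum value shown is the DISTRIBUTE strategy that replaced the old INTERLEAVE name:
```cpp
#include "llama.h"

int main() {
    // Backend init no longer takes a NUMA flag...
    llama_backend_init();

    // ...NUMA setup is a separate, explicit step.
    llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

    // ... load model, create context, run inference ...

    llama_backend_free();
    return 0;
}
```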
Davidson Francis [Mon, 19 Feb 2024 08:51:26 +0000 (05:51 -0300)]
main : check if input files exist before proceeding (#1872)
Until the most recent commit (3d42463), the main.cpp example did not
check whether the input files exist. Consequently, the model was
loaded first, and only afterwards was a failure reported when
processing a file. In environments with HDDs, this can take about
50 seconds or more, depending on the loaded model.
This commit addresses this issue by checking in advance whether the
input files exist or not.