git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

]> git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log

overview / pkg / ggml / sources / llama.cpp / log

AidanBeltonS [Mon, 26 Feb 2024 14:02:11 +0000 (14:02 +0000)]

[SYCL] Add support for soft_max ALiBi (#5639)

* Add support for bias

* Update pre-processor

* rm commented code

* fix format

* fix CI

---------

Co-authored-by: Abhilash Majumder <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 26 Feb 2024 12:02:12 +0000 (14:02 +0200)]

unicode : reuse iterator (#5726)

commit | commitdiff | tree

Pierrick Hymbert [Mon, 26 Feb 2024 10:41:34 +0000 (11:41 +0100)]

server: CI fix trailing space (#5728)

commit | commitdiff | tree

Pierrick Hymbert [Mon, 26 Feb 2024 08:56:10 +0000 (09:56 +0100)]

server: CI tests reduce build matrix (#5725)

commit | commitdiff | tree

Georgi Gerganov [Mon, 26 Feb 2024 06:30:17 +0000 (08:30 +0200)]

llama : fix Gemma rope type (#5691)

commit | commitdiff | tree

github-actions[bot] [Sun, 25 Feb 2024 00:17:11 +0000 (00:17 +0000)]

flake.lock: Update

Flake lock file updates:

• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
→ 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)

commit | commitdiff | tree

Pierrick Hymbert [Sun, 25 Feb 2024 21:48:33 +0000 (22:48 +0100)]

server: tests - slow inference causes timeout on the CI (#5715)

* server: tests - longer inference timeout for CI

commit | commitdiff | tree

Pierrick Hymbert [Sun, 25 Feb 2024 20:46:29 +0000 (21:46 +0100)]

server: docs - refresh and tease a little bit more the http server (#5718)

* server: docs - refresh and tease a little bit more the http server

* Rephrase README.md server doc

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update examples/server/README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update README.md

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Sun, 25 Feb 2024 20:12:24 +0000 (22:12 +0200)]

llama : refactor k-shift implementation + KV defragmentation (#5691)

* llama : refactor k-shift implementation

ggml-ci

* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add

* llama : cont k-shift refactoring + normalize type names

ggml-ci

* minor : fix MPI builds

* llama : reuse n_rot from the build context

ggml-ci

* llama : revert enum name changes from this PR

ggml-ci

* llama : update llama_rope_type

* llama : add comment about rope values

* llama : fix build

* passkey : apply kv cache updates explicitly

ggml-ci

* llama : change name to llama_kv_cache_update()

* llama : add llama_kv_cache_seq_pos_max()

* passkey : fix llama_kv_cache_seq_pos_max() usage

* llama : some llama_kv_cell simplifications

* llama : add llama_kv_cache_compress (EXPERIMENTAL)

* llama : add alternative KV cache merging (EXPERIMENTAL)

* llama : add llama_kv_cache_defrag

* llama : comments

* llama : remove llama_kv_cache_compress

will add in a separate PR

ggml-ci

* llama : defragment via non-overlapping moves

* llama : ggml_graph based defrag implementation

ggml-ci

* llama : switch the loop order in build_defrag

* llama : add comments

commit | commitdiff | tree

compilade [Sun, 25 Feb 2024 18:43:50 +0000 (13:43 -0500)]

server : fix crash when system prompt is bigger than batch size (#5714)

The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache

The tokens right after the matching part would otherwise skip a pos value.

commit | commitdiff | tree

Radosław Gryta [Sun, 25 Feb 2024 18:43:00 +0000 (19:43 +0100)]

ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711)

* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility

vqtbl1q_u8 is not part of arm v7 neon library

* [android-example] Remove abi filter after arm v7a fix

* [github-workflows] Do not skip Android armeabi-v7a build

commit | commitdiff | tree

kwin1412 [Sun, 25 Feb 2024 16:46:49 +0000 (00:46 +0800)]

make : fix nvcc version is empty (#5713)

fix nvcc version is empty

commit | commitdiff | tree

Ashok Gelal [Sun, 25 Feb 2024 15:57:34 +0000 (10:57 -0500)]

readme : add Msty to UI list (#5618)

commit | commitdiff | tree

Pierrick Hymbert [Sun, 25 Feb 2024 12:50:32 +0000 (13:50 +0100)]

server: logs - unified format and --log-format option (#5700)

* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs switch init logs to server logs macro

* server: logs ensure value json value does not raised error

* server: logs reduce level VERBOSE to VERB to max 4 chars

* server: logs lower case as other log messages

* server: logs avoid static in general

Co-authored-by: Georgi Gerganov <redacted>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Pierrick Hymbert [Sun, 25 Feb 2024 12:49:43 +0000 (13:49 +0100)]

server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)

* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified

* server: metrics - move to a dedicated struct

commit | commitdiff | tree

Radosław Gryta [Sun, 25 Feb 2024 10:53:11 +0000 (11:53 +0100)]

cmake : fix compilation for Android armeabi-v7a (#5702)

commit | commitdiff | tree

Georgi Gerganov [Sun, 25 Feb 2024 10:09:09 +0000 (12:09 +0200)]

code : normalize enum names (#5697)

* coda : normalize enum names

ggml-ci

* code : cont

* code : cont

commit | commitdiff | tree

Anas Ahouzi [Sun, 25 Feb 2024 09:54:04 +0000 (10:54 +0100)]

py : fix StableLM conversion after config.json changes (#5703)

* Fix issues during StableLM models conversion

* Fix hard coded layer_norm_eps

* Support layer_norm_eps for LlavaStableLM

Co-authored-by: Jared Van Bortel <redacted>
* Add missing parenthesis

Co-authored-by: Jared Van Bortel <redacted>
* Support rotary_factor for LlavaStableLM

Co-authored-by: Jared Van Bortel <redacted>
* fix typo

* Add StableLMEpochForCausalLM for safety

Co-authored-by: compilade <redacted>
* Add StableLMEpochForCausalLM for safety 2

Co-authored-by: compilade <redacted>
---------

Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: compilade <redacted>

commit | commitdiff | tree

Pierrick Hymbert [Sat, 24 Feb 2024 18:16:04 +0000 (19:16 +0100)]

server: continue to update other slots on embedding concurrent request (#5699)

* server: #5655 - continue to update other slots on embedding concurrent request.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs

commit | commitdiff | tree

Kawrakow [Sat, 24 Feb 2024 14:23:52 +0000 (16:23 +0200)]

IQ3_S: a much better alternative to Q3_K (#5676)

* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iiq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

Pierrick Hymbert [Sat, 24 Feb 2024 11:28:55 +0000 (12:28 +0100)]

server: init functional tests (#5566)

* server: tests: init scenarios
- health and slots endpoints
- completion endpoint
- OAI compatible chat completion requests w/ and without streaming
- completion multi users scenario
- multi users scenario on OAI compatible endpoint with streaming
- multi users with total number of tokens to predict exceeds the KV Cache size
- server wrong usage scenario, like in Infinite loop of "context shift" #3969
- slots shifting
- continuous batching
- embeddings endpoint
- multi users embedding endpoint: Segmentation fault #5655
- OpenAI-compatible embeddings API
- tokenize endpoint
- CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

AlpinDale [Fri, 23 Feb 2024 19:31:54 +0000 (19:31 +0000)]

server : add KV cache quantization options (#5684)

commit | commitdiff | tree

Jared Van Bortel [Fri, 23 Feb 2024 18:39:14 +0000 (13:39 -0500)]

convert : fix missing ftype for gemma (#5690)

commit | commitdiff | tree

Jared Van Bortel [Thu, 22 Feb 2024 22:05:23 +0000 (17:05 -0500)]

mpt : do not duplicate token_embd.weight on disk (#5670)

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 21:23:46 +0000 (23:23 +0200)]

gemma : use more bits for the token_embd.weight tensor (#5650)

* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 21:22:48 +0000 (23:22 +0200)]

py : add Gemma conversion from HF models (#5647)

* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <redacted>
* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <redacted>
---------

Co-authored-by: Aarni Koskela <redacted>
Co-authored-by: Jared Van Bortel <redacted>

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 21:21:39 +0000 (23:21 +0200)]

ggml : always define ggml_fp16_t as uint16_t (#5666)

* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 21:21:05 +0000 (23:21 +0200)]

sync : ggml

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 16:31:40 +0000 (18:31 +0200)]

ggml : 32-bit arm compat (whisper/1891)

* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont

commit | commitdiff | tree

Someone [Thu, 22 Feb 2024 19:44:10 +0000 (19:44 +0000)]

nix: init singularity and docker images (#5056)

Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 18:13:25 +0000 (20:13 +0200)]

py : minor fixes (#5668)

commit | commitdiff | tree

Xuan Son Nguyen [Thu, 22 Feb 2024 18:10:21 +0000 (19:10 +0100)]

Add Gemma chat template (#5665)

* add gemma chat template

* gemma: only apply system_prompt on non-model message

commit | commitdiff | tree

Someone [Thu, 22 Feb 2024 16:32:09 +0000 (16:32 +0000)]

workflows: nix: hardcode cachix ids, build unconditionally (#5663)

GitHub does not expose environment and repository variables to PRs coming from forks implies that we've been disabling the Nix CI actions for most PRs.

The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 11:54:03 +0000 (13:54 +0200)]

minor : fix trailing whitespace (#5638)

commit | commitdiff | tree

Georgi Gerganov [Thu, 22 Feb 2024 08:35:54 +0000 (10:35 +0200)]

readme : update hot topics

commit | commitdiff | tree

Xuan Son Nguyen [Thu, 22 Feb 2024 08:33:24 +0000 (09:33 +0100)]

server : fallback to chatml, add AlphaMonarch chat template (#5628)

* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO

commit | commitdiff | tree

Alexey Parfenov [Thu, 22 Feb 2024 08:27:32 +0000 (08:27 +0000)]

server : clarify some params in the docs (#5640)

commit | commitdiff | tree

Dat Quoc Nguyen [Thu, 22 Feb 2024 08:15:13 +0000 (18:15 +1000)]

mpt : add optional bias tensors (#5638)

Update for MPT with optional bias parameters: to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.

commit | commitdiff | tree

slaren [Wed, 21 Feb 2024 23:42:09 +0000 (00:42 +0100)]

llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci

commit | commitdiff | tree

Xuan Son Nguyen [Wed, 21 Feb 2024 23:31:00 +0000 (00:31 +0100)]

Add docs for llama_chat_apply_template (#5645)

* add docs for llama_chat_apply_template

* fix typo

commit | commitdiff | tree

slaren [Wed, 21 Feb 2024 21:52:39 +0000 (22:52 +0100)]

llama : fix session save/load with quantized KV (#5649)

commit | commitdiff | tree

slaren [Wed, 21 Feb 2024 21:18:23 +0000 (22:18 +0100)]

gemma : allow offloading the output tensor (#5646)

commit | commitdiff | tree

Jared Van Bortel [Wed, 21 Feb 2024 15:33:54 +0000 (10:33 -0500)]

examples : do not assume BOS when shifting context (#5622)

commit | commitdiff | tree

Georgi Gerganov [Wed, 21 Feb 2024 14:52:39 +0000 (16:52 +0200)]

sync : ggml

commit | commitdiff | tree

Pierrick Hymbert [Wed, 21 Feb 2024 14:47:48 +0000 (15:47 +0100)]

server: health: fix race condition on slots data using tasks queue (#5634)

* server: health: fix race condition on slots data using tasks queue

* server: health:
* include_slots only if slots_endpoint
* fix compile warning task.target_id not initialized.

commit | commitdiff | tree

Ettore Di Giacinto [Wed, 21 Feb 2024 14:39:10 +0000 (15:39 +0100)]

readme : add LocalAI to the availables UI (#5629)

commit | commitdiff | tree

Georgi Gerganov [Wed, 21 Feb 2024 14:17:10 +0000 (16:17 +0200)]

sync : ggml (#5633)

* ggml : fix conv_2d batch mode (ggml/737)

Co-authored-by: bssrdf <redacted>
* ggml : compute forward no longer pass src tensors (ggml/729)

* sync : ggml

ggml-ci

---------

Co-authored-by: bssrdf <redacted>
Co-authored-by: bssrdf <redacted>

commit | commitdiff | tree

Georgi Gerganov [Wed, 21 Feb 2024 13:39:54 +0000 (15:39 +0200)]

readme : update hot topics

commit | commitdiff | tree

Daniel Bevenius [Wed, 21 Feb 2024 13:36:57 +0000 (14:36 +0100)]

llava : add --skip-unknown to 1.6 convert.py (#5632)

This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints to avoid updating
possibly checked out files.

The motivation for this change is that this was done for 1.5
in Commit fc0c8d286a533363a9a663510b62af85ffad58b3 ("llava :
update surgery script to not remove tensors") and makes the examples
more consistent.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

postmasters [Wed, 21 Feb 2024 13:08:22 +0000 (05:08 -0800)]

llama : add `gemma` model (#5631)

There are couple things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.

commit | commitdiff | tree

Meng, Hengyu [Wed, 21 Feb 2024 09:52:06 +0000 (17:52 +0800)]

[SYCL] conext add name (#5624)

* [SYCL] conext add name

* name should start with SYCL*

commit | commitdiff | tree

Kawrakow [Wed, 21 Feb 2024 09:39:52 +0000 (11:39 +0200)]

IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)

* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq thing missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <redacted>

commit | commitdiff | tree

CJ Pais [Tue, 20 Feb 2024 19:07:22 +0000 (11:07 -0800)]

server : support llava 1.6 (#5553)

* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build

commit | commitdiff | tree

slaren [Tue, 20 Feb 2024 19:06:17 +0000 (20:06 +0100)]

make : fix debug build with CUDA (#5616)

commit | commitdiff | tree

Daniel Bevenius [Tue, 20 Feb 2024 17:30:27 +0000 (18:30 +0100)]

llava : add explicit instructions for llava-1.6 (#5611)

This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Tue, 20 Feb 2024 14:58:27 +0000 (15:58 +0100)]

Server: use llama_chat_apply_template (#5593)

* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <redacted>
* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Dane Madsen [Tue, 20 Feb 2024 10:00:23 +0000 (21:00 +1100)]

readme : update UI list (#5605)

* Add maid to ui list

* Specify licence

commit | commitdiff | tree

Haoxiang Fei [Tue, 20 Feb 2024 09:58:36 +0000 (22:58 -1100)]

metal : add build system support for embedded metal library (#5604)

* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <redacted>
Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Pierrick Hymbert [Tue, 20 Feb 2024 07:48:19 +0000 (08:48 +0100)]

server : health endpoint configurable failure on no slot (#5594)

commit | commitdiff | tree

AidanBeltonS [Tue, 20 Feb 2024 07:01:25 +0000 (07:01 +0000)]

Update ggml_sycl_op_mul_mat_vec_q (#5502)

* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <redacted>
* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <redacted>

commit | commitdiff | tree

Mathijs de Bruin [Tue, 13 Feb 2024 20:28:02 +0000 (20:28 +0000)]

nix: now that we can do so, allow MacOS to build Vulkan binaries

Author: Philip Taron <redacted>
Date: Tue Feb 13 20:28:02 2024 +0000

commit | commitdiff | tree

0cc4m [Sat, 10 Feb 2024 21:18:33 +0000 (22:18 +0100)]

Enable Vulkan MacOS CI

commit | commitdiff | tree

0cc4m [Wed, 14 Feb 2024 19:57:17 +0000 (20:57 +0100)]

Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init()

commit | commitdiff | tree

0cc4m [Sat, 10 Feb 2024 21:14:52 +0000 (22:14 +0100)]

Add check for VK_KHR_portability_enumeration for MoltenVK support

commit | commitdiff | tree

Mathijs de Bruin [Tue, 6 Feb 2024 14:39:22 +0000 (14:39 +0000)]

Add preprocessor checks for Apple devices.

Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files

commit | commitdiff | tree

Mathijs de Bruin [Sat, 3 Feb 2024 18:00:11 +0000 (18:00 +0000)]

Resolve ErrorIncompatibleDriver with Vulkan on MacOS.

Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476

commit | commitdiff | tree

Mathijs de Bruin [Sat, 3 Feb 2024 17:56:46 +0000 (17:56 +0000)]

Allow for Vulkan build with Accelerate.

Closes #5304

commit | commitdiff | tree

slaren [Mon, 19 Feb 2024 22:40:26 +0000 (23:40 +0100)]

cuda : ignore peer access already enabled errors (#5597)

* cuda : ignore peer access already enabled errors

* fix hip

commit | commitdiff | tree

Jared Van Bortel [Mon, 19 Feb 2024 20:54:12 +0000 (15:54 -0500)]

make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598)

commit | commitdiff | tree

nopperl [Mon, 19 Feb 2024 14:14:07 +0000 (14:14 +0000)]

examples : support minItems/maxItems in JSON grammar converter (#5039)

* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 13:23:17 +0000 (15:23 +0200)]

llava : remove extra cont (#5587)

commit | commitdiff | tree

slaren [Mon, 19 Feb 2024 13:02:36 +0000 (14:02 +0100)]

llava : replace ggml_cpy with ggml_cont

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:54:21 +0000 (14:54 +0200)]

sync : ggml

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:53:48 +0000 (14:53 +0200)]

ggml-alloc : apply ggml/731

commit | commitdiff | tree

Didzis Gosko [Sun, 11 Feb 2024 14:41:41 +0000 (16:41 +0200)]

metal : option to embed MSL source into compiled binary (whisper/1842)

* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on-fly during build process

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 12:45:41 +0000 (14:45 +0200)]

ci : enable -Werror for CUDA builds (#5579)

* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 11:41:51 +0000 (13:41 +0200)]

make : fix CUDA build (#5580)

commit | commitdiff | tree

valiray [Mon, 19 Feb 2024 10:37:10 +0000 (02:37 -0800)]

readme : fix typo in README-sycl.md (#5353)

commit | commitdiff | tree

Abhilash Majumder [Mon, 19 Feb 2024 09:15:18 +0000 (14:45 +0530)]

cmake : remove obsolete sycl compile flags (#5581)

* rm unwanted sycl compile options

* fix bug

* fix bug

* format fix

commit | commitdiff | tree

Georgi Gerganov [Mon, 19 Feb 2024 08:34:10 +0000 (10:34 +0200)]

minor : fix trailing whitespace (#5538)

commit | commitdiff | tree

Daniel Bevenius [Mon, 19 Feb 2024 08:31:59 +0000 (09:31 +0100)]

llava : avoid changing the original BakLLaVA model (#5577)

This is a follup of Commit fc0c8d286a533363a9a663510b62af85ffad58b3
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA specific part of the surgery script.

I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.

Signed-off-by: Daniel Bevenius <redacted>

commit | commitdiff | tree

NawafAlansari [Mon, 19 Feb 2024 08:25:38 +0000 (03:25 -0500)]

baby-llama : allocate graphs in ggml_context (#5573)

* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Xuan Son Nguyen [Mon, 19 Feb 2024 08:23:37 +0000 (09:23 +0100)]

llama : add llama_chat_apply_template() (#5538)

* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"

commit | commitdiff | tree

slaren [Mon, 19 Feb 2024 08:04:45 +0000 (09:04 +0100)]

cuda, metal : fix nans in soft_max (#5574)

* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <redacted>

commit | commitdiff | tree

Mirko185 [Mon, 19 Feb 2024 07:39:31 +0000 (08:39 +0100)]

readme : update (#5572)

Added 1.5-bit on README.md

commit | commitdiff | tree

bmwl [Mon, 19 Feb 2024 07:38:32 +0000 (23:38 -0800)]

ggml : android and old glibc NUMA incompatibility bugfixes (#5557)

* #ifdef out some code NUMA blocks for Android due to lack of support

* added in some __ANDROID__ if def gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper

* Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc

* harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways

---------

Co-authored-by: root <redacted>

commit | commitdiff | tree

Jared Van Bortel [Sun, 18 Feb 2024 21:21:52 +0000 (16:21 -0500)]

build : pass all warning flags to nvcc via -Xcompiler (#5570)

* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 20:58:57 +0000 (22:58 +0200)]

ggml : restore vec dot stride arg names (#5453)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 20:39:30 +0000 (22:39 +0200)]

ci : fix wikitext url + compile warnings (#5569)

ggml-ci

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 19:39:58 +0000 (21:39 +0200)]

metal : fix unused warnings (#0)

commit | commitdiff | tree

Robey Holderith [Sun, 18 Feb 2024 19:11:16 +0000 (11:11 -0800)]

common, server : surface min_keep as its own parameter (#5567)

* Feature - surface min_keep as its own parameter

* Updated README with min_keep param

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 17:39:57 +0000 (18:39 +0100)]

server : slots monitoring endpoint (#5550)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 17:38:06 +0000 (19:38 +0200)]

sampling : do not set min_keep to n_probs (#5564)

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 17:17:00 +0000 (19:17 +0200)]

cmake : fix GGML_USE_SYCL typo (#5555)

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 16:31:28 +0000 (17:31 +0100)]

server : enhanced health endpoint (#5548)

* server: enrich health endpoint with available slots, return 503 if not slots are available

* server: document new status no slot available in the README.md

commit | commitdiff | tree

Pierrick Hymbert [Sun, 18 Feb 2024 16:30:09 +0000 (17:30 +0100)]

server : --n-predict option document and cap to max value (#5549)

* server: document --n-predict

* server: ensure client request cannot override n_predict if set

* server: fix print usage LF in new --n-predict option

commit | commitdiff | tree

Daniel Hiltgen [Sun, 18 Feb 2024 16:23:16 +0000 (08:23 -0800)]

server : graceful server shutdown (#5244)

This updates the server queue to support graceful shutdown of the server on signals.

commit | commitdiff | tree

Georgi Gerganov [Sun, 18 Feb 2024 16:21:52 +0000 (18:21 +0200)]

common : fix ub (#5530)

commit | commitdiff | tree

Herman Semenov [Sun, 18 Feb 2024 16:20:12 +0000 (16:20 +0000)]

ggml, common, examples, tests : fixed type arguments in printf (#5528)

commit | commitdiff | tree

Daniel Bevenius [Sun, 18 Feb 2024 16:19:23 +0000 (17:19 +0100)]

llava : update surgery script to not remove tensors (#5536)

This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.

The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and have to be checked out again to reset this effect. If this can be
avoided I think it would be preferable.

I did not perform this change for BakLLaVA models as I am not sure
how that part works.

Packaging of ggml-org/llama.cpp

RSS Atom