git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
server : add --override-kv parameter (#4710)
minarchist [Tue, 2 Jan 2024 10:38:15 +0000 (04:38 -0600)]
server : add --override-kv parameter (#4710)

* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <redacted>
Co-authored-by: Someone Serge <redacted>
py : re-enable mmap in convert hf (#4732)
Nam D. Tran [Tue, 2 Jan 2024 09:23:38 +0000 (16:23 +0700)]
py : re-enable mmap in convert hf (#4732)

* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

* fix: update torch version

---------

Co-authored-by: Trần Đức Nam <redacted>
Co-authored-by: Le Hoang Anh <redacted>
Co-authored-by: Georgi Gerganov <redacted>
finetune: fix typo in README.md (#4733)
Daniel Bevenius [Tue, 2 Jan 2024 09:16:55 +0000 (10:16 +0100)]
finetune: fix typo in README.md (#4733)

Signed-off-by: Daniel Bevenius <redacted>
metal : enable shader debugging (cmake option) (#4705)
Georgi Gerganov [Tue, 2 Jan 2024 08:57:44 +0000 (10:57 +0200)]
metal : enable shader debugging (cmake option) (#4705)

* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci

flake.lock: update
Someone Serge [Sun, 31 Dec 2023 17:42:22 +0000 (17:42 +0000)]
flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

flake.nix: suggest the binary caches
Someone Serge [Sat, 30 Dec 2023 18:25:25 +0000 (18:25 +0000)]
flake.nix: suggest the binary caches

workflows: nix-ci: add a qemu job for jetsons
Someone Serge [Sat, 30 Dec 2023 18:01:07 +0000 (18:01 +0000)]
workflows: nix-ci: add a qemu job for jetsons

workflows: nix-flakestry: drop tag filters
Someone Serge [Sat, 30 Dec 2023 17:36:08 +0000 (17:36 +0000)]
workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

workflows: weekly `nix flake update`
Someone Serge [Sat, 30 Dec 2023 16:38:36 +0000 (16:38 +0000)]
workflows: weekly `nix flake update`

workflows: nix-ci: add a job for eval
Someone Serge [Sat, 30 Dec 2023 17:19:11 +0000 (17:19 +0000)]
workflows: nix-ci: add a job for eval

workflows: nix-ci: init; build flake outputs
Someone Serge [Tue, 26 Dec 2023 19:17:26 +0000 (19:17 +0000)]
workflows: nix-ci: init; build flake outputs

flake.nix: expose checks
Someone Serge [Fri, 29 Dec 2023 16:21:50 +0000 (16:21 +0000)]
flake.nix: expose checks

flake.nix: rocm not yet supported on aarch64, so hide the output
Someone Serge [Tue, 26 Dec 2023 23:34:40 +0000 (23:34 +0000)]
flake.nix: rocm not yet supported on aarch64, so hide the output

flake.nix: expose full scope in legacyPackages
Someone Serge [Fri, 29 Dec 2023 16:15:37 +0000 (16:15 +0000)]
flake.nix: expose full scope in legacyPackages

ggml : add ggml_vdotq_s32 alias (#4715)
Georgi Gerganov [Sun, 31 Dec 2023 09:43:31 +0000 (11:43 +0200)]
ggml : add ggml_vdotq_s32 alias (#4715)

ggml-ci

clip : refactor + bug fixes (#4696)
Georgi Gerganov [Sat, 30 Dec 2023 21:24:42 +0000 (23:24 +0200)]
clip : refactor + bug fixes (#4696)

* clip : refactor + bug fixes

ggml-ci

* server : add log message

CUDA: fixed tensor cores not being used on RDNA3 (#4697)
Johannes Gäßler [Sat, 30 Dec 2023 12:52:01 +0000 (13:52 +0100)]
CUDA: fixed tensor cores not being used on RDNA3 (#4697)

ggml : add ggml_cpu_has_avx_vnni() (#4589)
automaticcat [Sat, 30 Dec 2023 08:07:48 +0000 (15:07 +0700)]
ggml : add ggml_cpu_has_avx_vnni() (#4589)

* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation update

Co-authored-by: Georgi Gerganov <redacted>
---------

Co-authored-by: Georgi Gerganov <redacted>
CUDA: fix tensor core logic for Pascal and HIP (#4682)
Johannes Gäßler [Fri, 29 Dec 2023 22:12:53 +0000 (23:12 +0100)]
CUDA: fix tensor core logic for Pascal and HIP (#4682)

clip : use ggml_backend_buffer_is_host (#4205)
Georgi Gerganov [Fri, 29 Dec 2023 16:53:34 +0000 (18:53 +0200)]
clip : use ggml_backend_buffer_is_host (#4205)

clip : enable gpu backend (#4205)
Steward Garcia [Fri, 29 Dec 2023 16:52:15 +0000 (11:52 -0500)]
clip : enable gpu backend (#4205)

* clip: enable CUDA backend

* add missing kernels

* add enough padding for alignment

* remove ggml_repeat of clip.cpp

* add metal backend

* llava : fixes

- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars

---------

Co-authored-by: Georgi Gerganov <redacted>
cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687)
hydai [Fri, 29 Dec 2023 16:31:19 +0000 (00:31 +0800)]
cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687)

Signed-off-by: hydai <redacted>
python : add check-requirements.sh and GitHub workflow (#4585)
crasm [Fri, 29 Dec 2023 14:50:29 +0000 (09:50 -0500)]
python : add check-requirements.sh and GitHub workflow (#4585)

* python: add check-requirements.sh and GitHub workflow

This script and workflow force package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.

* Move requirements into ./requirements

* Fail on "==" being used for package requirements (but can be suppressed)

* Enforce "compatible release" syntax instead of ==

* Update workflow

* Add upper version bound for transformers and protobuf

* improve check-requirements.sh

* small syntax change

* don't remove venvs if nocleanup is passed

* See if this fixes docker workflow

* Move check-requirements.sh into ./scripts/

---------

Co-authored-by: Jared Van Bortel <redacted>
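
The "compatible release" syntax enforced above is PEP 440's `~=` operator, which allows patch-level updates while pinning the minor version. A hypothetical requirements fragment in that style (package names and versions here are illustrative, not the repository's actual pins):

```
# ~=4.35.2 means >=4.35.2,<4.36 — patch updates allowed, minor version pinned
transformers~=4.35.2
protobuf~=4.21.0
```

Unlike `==`, this lets compatible bugfix releases install without editing the file, which is why the check fails on hard `==` pins unless suppressed.
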
flake.nix : rewrite (#4605)
Philip Taron [Fri, 29 Dec 2023 14:42:26 +0000 (06:42 -0800)]
flake.nix : rewrite (#4605)

* flake.lock: update to hotfix CUDA::cuda_driver

Required to support https://github.com/ggerganov/llama.cpp/pull/4606

* flake.nix: rewrite

1. Split into separate files per output.

2. Added overlays, so that this flake can be integrated into others.
   The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
   `llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
   broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).

3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
   rather than `with pkgs;` so that there's dependency injection rather
   than dependency lookup.

4. Add a description and meta information for each package.
   The description includes a bit about what each one is trying to accelerate.

5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.

6. Format with `serokell/nixfmt` for a consistent style.

7. Update `flake.lock` with the latest goods.

* flake.nix: use finalPackage instead of passing it manually

* nix: unclutter darwin support

* nix: pass most darwin frameworks unconditionally

...for simplicity

* *.nix: nixfmt

nix shell github:piegamesde/nixfmt/rfc101-style --command \
    nixfmt flake.nix .devops/nix/*.nix

* flake.nix: add maintainers

* nix: move meta down to follow Nixpkgs style more closely

* nix: add missing meta attributes

nix: clarify the interpretation of meta.maintainers

nix: clarify the meaning of "broken" and "badPlatforms"

nix: passthru: expose the use* flags for inspection

E.g.:

```
❯ nix eval .#cuda.useCuda
true
```

* flake.nix: avoid re-evaluating nixpkgs too many times

* flake.nix: use flake-parts

* nix: migrate to pname+version

* flake.nix: overlay: expose both the namespace and the default attribute

* ci: add the (Nix) flakestry workflow

* nix: cmakeFlags: explicit OFF bools

* nix: cuda: reduce runtime closure

* nix: fewer rebuilds

* nix: respect config.cudaCapabilities

* nix: add the impure driver's location to the DT_RUNPATHs

* nix: clean sources more thoroughly

...this way outPaths change less frequently,
and so there are fewer rebuilds

* nix: explicit mpi support

* nix: explicit jetson support

* flake.nix: darwin: only expose the default

---------

Co-authored-by: Someone Serge <redacted>
cmake : fix ld warning duplicate libraries libllama.a (#4671)
Cuong Trinh Manh [Fri, 29 Dec 2023 14:39:15 +0000 (21:39 +0700)]
cmake : fix ld warning duplicate libraries libllama.a (#4671)

* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"

* fix warning in example.

llava-cli : refactor to use sampling library (#4669)
Justine Tunney [Fri, 29 Dec 2023 14:38:38 +0000 (06:38 -0800)]
llava-cli : refactor to use sampling library (#4669)

This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup, deleting a
long-standing TODO comment.

This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.

See Mozilla-Ocho/llamafile@1cd334f

server : replace sleep with condition variables (#4673)
Justine Tunney [Fri, 29 Dec 2023 14:24:12 +0000 (06:24 -0800)]
server : replace sleep with condition variables (#4673)

The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available CPU. Having explicit notify() / wait() code
also aids the readability of the server code.

See mozilla-Ocho/llamafile@711344b

server : fix OpenAI server sampling w.r.t. penalty. (#4675)
SakuraUmi [Fri, 29 Dec 2023 14:22:44 +0000 (22:22 +0800)]
server : fix OpenAI server sampling w.r.t. penalty. (#4675)

server : allow to generate multimodal embeddings (#4681)
Karthik Sethuraman [Fri, 29 Dec 2023 14:22:10 +0000 (06:22 -0800)]
server : allow to generate multimodal embeddings (#4681)

main-cmake-pkg : fix build issue (#4665)
andrijdavid [Fri, 29 Dec 2023 14:18:20 +0000 (15:18 +0100)]
main-cmake-pkg : fix build issue (#4665)

* Fix main-cmake-pkg compilation

* Use glob to load common files

* cmake : fix trailing whitespace

---------

Co-authored-by: Georgi Gerganov <redacted>
llama.swiftui : fix infinite loop, output timings, buff UI (#4674)
Peter Sugihara [Fri, 29 Dec 2023 13:58:56 +0000 (05:58 -0800)]
llama.swiftui : fix infinite loop, output timings, buff UI (#4674)

* fix infinite loop

* slight UI simplification, clearer UX

* clearer UI text, add timings to completion log

scripts : print list of sync commits
Georgi Gerganov [Fri, 29 Dec 2023 13:12:35 +0000 (15:12 +0200)]
scripts : print list of sync commits

ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576)
Tamotsu Takahashi [Fri, 29 Dec 2023 10:23:27 +0000 (19:23 +0900)]
ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576)

* Build with CLBlast

* Declare GGML_API

After rebasing, examples/talk-llama failed:

"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
  D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]

sync : ggml
Georgi Gerganov [Fri, 29 Dec 2023 12:56:41 +0000 (14:56 +0200)]
sync : ggml

ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669)
bssrdf [Fri, 29 Dec 2023 08:32:31 +0000 (03:32 -0500)]
ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669)

* fixed mul-mat error for old GPUs

* style fixes

* add mul mat src1 f16 test cases, fix more cases

ggml-ci

---------

Co-authored-by: bssrdf <redacted>
Co-authored-by: slaren <redacted>
scripts : do not sync commits from this repo
Georgi Gerganov [Fri, 29 Dec 2023 12:41:36 +0000 (14:41 +0200)]
scripts : do not sync commits from this repo

Fix OpenAI server sampling w.r.t. temp and seed (#4668)
Justine Tunney [Thu, 28 Dec 2023 19:20:00 +0000 (11:20 -0800)]
Fix OpenAI server sampling w.r.t. temp and seed (#4668)

The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29

gpt2 : Add gpt2 architecture integration (#4555)
manikbhandari [Thu, 28 Dec 2023 14:03:57 +0000 (09:03 -0500)]
gpt2 : Add gpt2 architecture integration (#4555)

llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
Nam D. Tran [Wed, 27 Dec 2023 15:39:45 +0000 (22:39 +0700)]
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)

* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

---------

Co-authored-by: Trần Đức Nam <redacted>
Co-authored-by: Le Hoang Anh <redacted>
Co-authored-by: Georgi Gerganov <redacted>
finetune : fix output formatting in print_params (#4653)
Daniel Bevenius [Wed, 27 Dec 2023 14:16:55 +0000 (15:16 +0100)]
finetune : fix output formatting in print_params (#4653)

This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab:   32000
print_params: n_ctx:     128
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```
With this commit the output will look like this:
```console
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 4096
print_params: n_ff                  : 11008
print_params: n_head                : 32
print_params: n_head_kv             : 32
print_params: n_layer               : 32
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
```

Signed-off-by: Daniel Bevenius <redacted>
scripts : add sync-ggml-am.sh
Georgi Gerganov [Wed, 27 Dec 2023 09:15:31 +0000 (11:15 +0200)]
scripts : add sync-ggml-am.sh

ggml : fix dot product for ARM (#4630)
Georgi Gerganov [Wed, 27 Dec 2023 09:02:13 +0000 (11:02 +0200)]
ggml : fix dot product for ARM (#4630)

ggml-ci

Add byte token type when tokenizer.model does not exist (#4641)
wonjun Jang [Wed, 27 Dec 2023 08:37:25 +0000 (17:37 +0900)]
Add byte token type when tokenizer.model does not exist (#4641)

* Add byte token type to hf format

* remove unused variable

cuda : fix vmm pool with multi GPU (#4620)
slaren [Tue, 26 Dec 2023 20:23:59 +0000 (21:23 +0100)]
cuda : fix vmm pool with multi GPU (#4620)

* cuda : fix vmm pool with multi GPU

* hip

* use recommended granularity instead of minimum

* better error checking

* fix mixtral

* use cudaMemcpy3DPeerAsync

* use cuda_pool_alloc in ggml_cuda_op_mul_mat

* consolidate error checking in ggml_cuda_set_device

* remove unnecessary inlines

ggml-ci

* style fixes

* only use vmm for the main device

* fix scratch buffer size, re-enable vmm pool for all devices

* remove unnecessary check id != g_main_device

Update comment for AdamW implementation reference. (#4604)
WillCorticesAI [Tue, 26 Dec 2023 10:42:08 +0000 (05:42 -0500)]
Update comment for AdamW implementation reference. (#4604)

Co-authored-by: Will Findley <redacted>
Fix new CUDA10 compilation errors (#4635)
FantasyGmm [Tue, 26 Dec 2023 10:38:36 +0000 (18:38 +0800)]
Fix new CUDA10 compilation errors (#4635)

Adding Emeltal reference to UI list (#4629)
Paul Tsochantaris [Mon, 25 Dec 2023 16:09:53 +0000 (16:09 +0000)]
Adding Emeltal reference to UI list (#4629)

simplify bug issue template (#4623)
slaren [Sun, 24 Dec 2023 20:01:12 +0000 (21:01 +0100)]
simplify bug issue template (#4623)

llama : add PLaMo model (#3557)
Shintarou Okada [Sun, 24 Dec 2023 13:35:49 +0000 (22:35 +0900)]
llama : add PLaMo model (#3557)

* add plamo mock

* add tensor loading

* plamo convert

* update norm

* able to compile

* fix norm_rms_eps hparam

* runnable

* use inp_pos

* seems ok

* update kqv code

* remove develop code

* update README

* shuffle attn_q.weight and attn_output.weight for broadcasting

* remove plamo_llm_build_kqv and use llm_build_kqv

* fix style

* update

* llama : remove obsolete KQ_scale

* plamo : fix tensor names for correct GPU offload

---------

Co-authored-by: Georgi Gerganov <redacted>
cuda : improve cuda pool efficiency using virtual memory (#4606)
slaren [Sun, 24 Dec 2023 13:34:22 +0000 (14:34 +0100)]
cuda : improve cuda pool efficiency using virtual memory (#4606)

* cuda : improve cuda pool efficiency using virtual memory

* fix mixtral

* fix cmake build

* check for vmm support, disable for hip

ggml-ci

* fix hip build

* clarify granularity

* move all caps to g_device_caps

* refactor error checking

* add cuda_pool_alloc, refactor most pool allocations

ggml-ci

* fix hip build

* CUBLAS_TF32_TENSOR_OP_MATH is not a macro

* more hip crap

* llama : fix msvc warnings

* ggml : fix msvc warnings

* minor

* minor

* cuda : fallback to CPU on host buffer alloc fail

* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <redacted>
* Update ggml-cuda.cu

Co-authored-by: Johannes Gäßler <redacted>
* ensure allocations are always aligned

* act_size -> actual_size

---------

Co-authored-by: Johannes Gäßler <redacted>
fallback to CPU buffer if host buffer alloc fails (#4610)
slaren [Sat, 23 Dec 2023 15:10:51 +0000 (16:10 +0100)]
fallback to CPU buffer if host buffer alloc fails (#4610)

ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)
Samuel Maynard [Sat, 23 Dec 2023 09:35:55 +0000 (11:35 +0200)]
ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)

server : allow to specify custom prompt for penalty calculation (#3727)
Alexey Parfenov [Sat, 23 Dec 2023 09:31:49 +0000 (09:31 +0000)]
server : allow to specify custom prompt for penalty calculation (#3727)

grammar : check the full vocab only if necessary (opt) (#4306)
kalomaze [Sat, 23 Dec 2023 09:27:07 +0000 (03:27 -0600)]
grammar : check the full vocab only if necessary (opt) (#4306)

* Check the full vocab for grammar only if necessary

* Fix missing logit restoration step (?)

Does this matter, actually?

* Fix whitespace / formatting

* Adjust comment

* Didn't mean to push test gbnf

* Split sampling into the helper function (?)

And also revert the changes made to the header

* common : fix final newline

---------

Co-authored-by: Georgi Gerganov <redacted>
CUDA: fixed row rounding for 0 tensor splits (#4594)
Johannes Gäßler [Sat, 23 Dec 2023 08:16:33 +0000 (09:16 +0100)]
CUDA: fixed row rounding for 0 tensor splits (#4594)

lookup : add prompt lookup decoding example (#4484)
LeonEricsson [Fri, 22 Dec 2023 16:05:56 +0000 (17:05 +0100)]
lookup : add prompt lookup decoding example (#4484)

* initial commit, going through initializations

* main loop finished, starting to debug

* BUG: generates gibberish/repeating tokens after a while

* kv_cache management

* Added colors to distinguish drafted tokens (--color). Updated README

* lookup : fix token positions in the draft batch

* lookup : use n_draft from CLI params

* lookup : final touches

---------

Co-authored-by: Leon Ericsson <redacted>
Co-authored-by: Georgi Gerganov <redacted>
sync : ggml (fix im2col) (#4591)
Georgi Gerganov [Fri, 22 Dec 2023 15:53:43 +0000 (17:53 +0200)]
sync : ggml (fix im2col) (#4591)

* cuda : fix im2col_f32_f16 (ggml/#658)

ggml-ci

* ggml-alloc : fix ggml_tallocr_is_own

---------

Co-authored-by: leejet <redacted>
cuda : fix jetson compile error (#4560)
FantasyGmm [Fri, 22 Dec 2023 15:11:12 +0000 (23:11 +0800)]
cuda : fix jetson compile error (#4560)

* fix old jetson compile error

* Update Makefile

* update jetson detect and cuda version detect

* update cuda marco define

* update makefile and cuda, fix some issues

* Update README.md

Co-authored-by: Georgi Gerganov <redacted>
* Update Makefile

* Update README.md

---------

Co-authored-by: Georgi Gerganov <redacted>
Fix CudaMemcpy direction (#4599)
Henrik Forstén [Fri, 22 Dec 2023 13:34:05 +0000 (15:34 +0200)]
Fix CudaMemcpy direction (#4599)

llama : fix platforms without mmap (#4578)
slaren [Fri, 22 Dec 2023 11:12:53 +0000 (12:12 +0100)]
llama : fix platforms without mmap (#4578)

* llama : fix platforms without mmap

* win32 : limit prefetch size to the file size

* fix win32 error clobber, unnecessary std::string in std::runtime_error

ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)
Herman Semenov [Fri, 22 Dec 2023 09:26:49 +0000 (09:26 +0000)]
ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)

make : add LLAMA_HIP_UMA option (#4587)
Michael Kesper [Fri, 22 Dec 2023 08:03:25 +0000 (09:03 +0100)]
make : add LLAMA_HIP_UMA option (#4587)

NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA

ci : tag docker image with build number (#4584)
rhuddleston [Fri, 22 Dec 2023 06:56:34 +0000 (23:56 -0700)]
ci : tag docker image with build number (#4584)

readme : add zig bindings (#4581)
Deins [Fri, 22 Dec 2023 06:49:54 +0000 (08:49 +0200)]
readme : add zig bindings (#4581)

ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579)
bobqianic [Fri, 22 Dec 2023 06:47:01 +0000 (06:47 +0000)]
ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579)

llama : add ability to cancel model loading (#4462)
crasm [Fri, 22 Dec 2023 06:19:36 +0000 (01:19 -0500)]
llama : add ability to cancel model loading (#4462)

* llama : Add ability to cancel model load

Updated llama_progress_callback so that if it returns false, the model
loading is aborted.

* llama : Add test for model load cancellation

* Fix bool return in llama_model_load, remove std::ignore use

* Update llama.cpp

Co-authored-by: Jared Van Bortel <redacted>
* Fail test if model file is missing

* Revert "Fail test if model file is missing"

This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d.

* Add test-model-load-cancel to Makefile

* Revert "Revert "Fail test if model file is missing""

This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b.

* Simplify .gitignore for tests, clang-tidy fixes

* Label all ctest tests

* ci : ctest uses -L main

* Attempt at writing ctest_with_model

* ci : get ci/run.sh working with test-model-load-cancel

* ci : restrict .github/workflows/build.yml ctest to -L main

* update requirements.txt

* Disable test-model-load-cancel in make

* Remove venv before creation

* Restructure requirements.txt

Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.

* Make per-python-script requirements work alone

This doesn't break the main requirements.txt.

* Add comment

* Add convert-persimmon-to-gguf.py to new requirements.txt scheme

* Add check-requirements.sh script and GitHub workflow

* Remove shellcheck installation step from workflow

* Add nocleanup special arg

* Fix merge

see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573

* reset to upstream/master

* Redo changes for cancelling model load

---------

Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
ggml : change ggml_scale to take a float instead of tensor (#4573)
Georgi Gerganov [Thu, 21 Dec 2023 21:20:49 +0000 (23:20 +0200)]
ggml : change ggml_scale to take a float instead of tensor (#4573)

* ggml : change ggml_scale to take a float instead of tensor

* ggml : fix CPU implementation

* tests : fix test-grad0

ggml-ci

gguf-py : fix broken link
Georgi Gerganov [Thu, 21 Dec 2023 21:20:36 +0000 (23:20 +0200)]
gguf-py : fix broken link

gguf : simplify example dependencies
Georgi Gerganov [Thu, 21 Dec 2023 21:07:58 +0000 (23:07 +0200)]
gguf : simplify example dependencies

ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)
Samuel Maynard [Thu, 21 Dec 2023 20:36:26 +0000 (22:36 +0200)]
ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)

* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo

* [github][workflows][docker]: adds `jlumbroso/free-disk-space`

llama : initial ggml-backend integration (#4520)
slaren [Thu, 21 Dec 2023 20:07:46 +0000 (21:07 +0100)]
llama : initial ggml-backend integration (#4520)

* llama : initial ggml-backend integration

* add ggml-metal

* cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set

* add ggml_backend_buffer_clear
zero-init KV cache buffer

* add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data

* disable gpu backends with ngl 0

* more accurate mlock

* unmap offloaded part of the model

* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap

* update quantize and lora

* update session copy/set to use ggml-backend

ggml-ci

* use posix_fadvise instead of posix_fadvise64

* ggml_backend_alloc_ctx_tensors_from_buft : remove old print

* llama_mmap::align_offset : use pointers instead of references for out parameters

* restore progress_callback behavior

* move final progress_callback call to load_all_data

* cuda : fix fprintf format string (minor)

* do not offload scales

* llama_mmap : avoid unmapping the same fragments again in the destructor

* remove unnecessary unmap

* metal : add default log function that prints to stderr, cleanup code

ggml-ci

---------

Co-authored-by: Georgi Gerganov <redacted>
llama : allow getting n_batch from llama_context in c api (#4540)
Marcus Dunn [Thu, 21 Dec 2023 19:57:48 +0000 (11:57 -0800)]
llama : allow getting n_batch from llama_context in c api (#4540)

* allowed getting n_batch from llama_context in c api

* changed to use `uint32_t` instead of `int`

* changed to use `uint32_t` instead of `int` in `llama_n_ctx`

* Update llama.h

---------

Co-authored-by: Georgi Gerganov <redacted>
metal : fix `ggml_metal_log` vargs (#4373)
Finn Voorhees [Thu, 21 Dec 2023 19:55:02 +0000 (14:55 -0500)]
metal : fix `ggml_metal_log` vargs (#4373)

cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)
Erik Garrison [Thu, 21 Dec 2023 19:45:32 +0000 (13:45 -0600)]
cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)

* AMD ROCm: handle UMA memory VRAM expansions

This resolves #2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.

Without this, AMD ROCm users with shared CPU/GPU memory usually are
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.

Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.

* clarify build process for ROCm on linux with cmake

* avoid using deprecated ROCm hipMallocHost

* keep simplifying the change required for UMA

* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON

ggml-cuda: Fix HIP build by adding define for __trap (#4569)
arlo-phoenix [Thu, 21 Dec 2023 19:13:25 +0000 (20:13 +0100)]
ggml-cuda: Fix HIP build by adding define for __trap (#4569)

Regression of 139882392258671ffe5acdfcadc0bc08572d6eef:
HIP doesn't have `__trap`, only abort.

18 months ago common : remove incorrect --model-draft default (#4568)
Jared Van Bortel [Thu, 21 Dec 2023 17:55:34 +0000 (12:55 -0500)]
common : remove incorrect --model-draft default (#4568)

18 months ago CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)
Johannes Gäßler [Thu, 21 Dec 2023 17:42:59 +0000 (18:42 +0100)]
CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)

18 months ago readme : update coding guidelines
Georgi Gerganov [Thu, 21 Dec 2023 17:27:14 +0000 (19:27 +0200)]
readme : update coding guidelines

18 months ago py : open merges file as 'utf-8' (#4566)
howlger [Thu, 21 Dec 2023 17:07:34 +0000 (18:07 +0100)]
py : open merges file as 'utf-8' (#4566)

Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:

```
Traceback (most recent call last):
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
    model_instance.set_vocab()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
    self._set_vocab_gpt2()
  File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
    special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
    self._load(Path(path))
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
    self._try_load_merges_txt(path)
  File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
    for line in fp:
  File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
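A minimal sketch of the fix (the helper name and file contents below are illustrative, not the actual convert-script code): open the merges file with an explicit `encoding="utf-8"` instead of the platform default, which on many Windows setups is cp1252 and cannot decode multi-byte UTF-8 sequences such as the 0x81 byte above.

```python
from pathlib import Path

def load_merges(path: Path) -> list[str]:
    # Pass encoding explicitly: without it, open() uses the locale's default
    # codec (cp1252 on many Windows setups), which raises UnicodeDecodeError
    # on the multi-byte UTF-8 sequences found in BPE merges files.
    with open(path, "r", encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in fp]
```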

18 months ago cuda : better error message for ggml_get_rows (#4561)
bobqianic [Thu, 21 Dec 2023 17:06:44 +0000 (17:06 +0000)]
cuda : better error message for ggml_get_rows (#4561)

* Update ggml-cuda.cu

* Update ggml-cuda.cu

* Update ggml-cuda.cu

---------

Co-authored-by: Georgi Gerganov <redacted>
18 months ago cuda : replace asserts in wrong architecture checks with __trap (#4556)
slaren [Thu, 21 Dec 2023 17:02:30 +0000 (18:02 +0100)]
cuda : replace asserts in wrong architecture checks with __trap (#4556)

* cuda : replace asserts in wrong architecture checks with __trap

* make bad_arch noreturn, remove returns

18 months ago llama : disable per-tensor info prints on model load (#4562)
Johannes Gäßler [Thu, 21 Dec 2023 16:34:17 +0000 (17:34 +0100)]
llama : disable per-tensor info prints on model load (#4562)

18 months ago Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)
LoganDark [Thu, 21 Dec 2023 09:59:27 +0000 (01:59 -0800)]
Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)

18 months ago CUDA: Faster Mixtral prompt processing (#4538)
Johannes Gäßler [Wed, 20 Dec 2023 14:41:22 +0000 (15:41 +0100)]
CUDA: Faster Mixtral prompt processing (#4538)

* CUDA: make MoE tensors contiguous for batch size>1

* Update ggml-cuda.cu

Co-authored-by: slaren <redacted>
---------

Co-authored-by: slaren <redacted>
18 months ago ggml : fixed check for _MSC_VER (#4535)
Eric Sommerlade [Tue, 19 Dec 2023 16:17:01 +0000 (16:17 +0000)]
ggml : fixed check for _MSC_VER (#4535)

Co-authored-by: Eric Sommerlade <redacted>
18 months ago ggml-cuda: Fix HIP build (#4528)
arlo-phoenix [Mon, 18 Dec 2023 21:33:45 +0000 (22:33 +0100)]
ggml-cuda: Fix HIP build (#4528)

Regression of #4490.
Adds defines for two new datatypes:
cublasComputeType_t, cudaDataType_t.

Currently using the deprecated hipblasDatatype_t, since the newer ones are very recent.

18 months ago llama.swiftui : add tinyllama 1.1B F16
Georgi Gerganov [Mon, 18 Dec 2023 18:17:43 +0000 (20:17 +0200)]
llama.swiftui : add tinyllama 1.1B F16

18 months ago llama.swiftui : add more models
Georgi Gerganov [Mon, 18 Dec 2023 18:05:12 +0000 (20:05 +0200)]
llama.swiftui : add more models

18 months ago llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)
Ebey Abraham [Mon, 18 Dec 2023 17:27:47 +0000 (17:27 +0000)]
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)

* phi2 implementation

* fix breaking change

* phi-2 : various fixes

* phi-2 : use layer norm eps

* py : whitespaces

* llama : fix meta KV override bug

* convert : phi don't add BOS token

* convert : revert "added_tokens_decoder" change

* phi-2 : scale Q instead of KQ for better precision

* ggml : fix NeoX rope to rotate just first n_dims

* cuda : less diff in the rope_neox kernel

* ggml : add ggml_mul_mat_set_prec

ggml-ci

* Update ggml-cuda.cu

Co-authored-by: slaren <redacted>
* Update ggml-cuda.cu

Co-authored-by: slaren <redacted>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision

* cuda : remove obsolete comment

---------

Co-authored-by: Ebey Abraham <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
18 months ago llama : fix try_override for bool_value which always return true (#4519)
hankcs [Mon, 18 Dec 2023 13:14:58 +0000 (05:14 -0800)]
llama : fix try_override for bool_value which always return true (#4519)

18 months ago decode : fix logits_valid for legacy API (#4516)
Jared Van Bortel [Mon, 18 Dec 2023 00:39:02 +0000 (19:39 -0500)]
decode : fix logits_valid for legacy API (#4516)

18 months ago readme : update hot topics
Georgi Gerganov [Sun, 17 Dec 2023 18:16:23 +0000 (20:16 +0200)]
readme : update hot topics

18 months ago llama.swiftui : add bench functionality (#4483)
Georgi Gerganov [Sun, 17 Dec 2023 17:38:41 +0000 (19:38 +0200)]
llama.swiftui : add bench functionality (#4483)

* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <redacted>
18 months ago gguf-py : fail fast on nonsensical special token IDs (#4489)
Jared Van Bortel [Sun, 17 Dec 2023 15:45:46 +0000 (10:45 -0500)]
gguf-py : fail fast on nonsensical special token IDs (#4489)

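A hedged sketch of what "fail fast" validation of a special token ID might look like (the function name and signature are illustrative, not the actual gguf-py code):

```python
def check_token_id(name: str, token_id: int, n_vocab: int) -> None:
    # Reject special token IDs outside the vocabulary range up front,
    # instead of silently writing a nonsensical value into the GGUF metadata.
    if not 0 <= token_id < n_vocab:
        raise ValueError(
            f"special token {name!r} has out-of-range id {token_id} "
            f"(vocab size {n_vocab})"
        )
```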
18 months ago build : Check the ROCm installation location (#4485)
Matheus Gabriel Alves Silva [Sun, 17 Dec 2023 15:23:33 +0000 (12:23 -0300)]
build : Check the ROCm installation location (#4485)

* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace

18 months ago finetune : keep allocs alive until all allocations are done (#4486)
slaren [Sun, 17 Dec 2023 15:05:56 +0000 (16:05 +0100)]
finetune : keep allocs alive until all allocations are done (#4486)

18 months ago server : disable llm logs if SERVER_VERBOSE is off (#3792)
olexiyb [Sun, 17 Dec 2023 15:02:16 +0000 (17:02 +0200)]
server : disable llm logs if SERVER_VERBOSE is off (#3792)

18 months ago server : fix grammar being ignored (#4494)
AdithyanI [Sun, 17 Dec 2023 14:57:56 +0000 (15:57 +0100)]
server : fix grammar being ignored (#4494)

Fix bug in identifying the grammar.

18 months ago server : fix possible ambiguity in content type charset (#4501)
Alexey Parfenov [Sun, 17 Dec 2023 14:56:09 +0000 (14:56 +0000)]
server : fix possible ambiguity in content type charset (#4501)

18 months ago server : allow requests larger than 8K (#4500)
mzcu [Sun, 17 Dec 2023 14:54:37 +0000 (15:54 +0100)]
server : allow requests larger than 8K (#4500)