git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Someone Serge [Tue, 26 Dec 2023 23:34:40 +0000 (23:34 +0000)]
flake.nix: rocm not yet supported on aarch64, so hide the output
Someone Serge [Fri, 29 Dec 2023 16:15:37 +0000 (16:15 +0000)]
flake.nix: expose full scope in legacyPackages
Georgi Gerganov [Sun, 31 Dec 2023 09:43:31 +0000 (11:43 +0200)]
ggml : add ggml_vdotq_s32 alias (#4715)
ggml-ci
Georgi Gerganov [Sat, 30 Dec 2023 21:24:42 +0000 (23:24 +0200)]
clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes
ggml-ci
* server : add log message
Johannes Gäßler [Sat, 30 Dec 2023 12:52:01 +0000 (13:52 +0100)]
CUDA: fixed tensor cores not being used on RDNA3 (#4697)
automaticcat [Sat, 30 Dec 2023 08:07:48 +0000 (15:07 +0700)]
ggml : add ggml_cpu_has_avx_vnni() (#4589)
* feat: add avx_vnni based on intel documents
* ggml: add avx vnni based on intel document
* llama: add avx vnni information display
* docs: add more details about using oneMKL and oneAPI for intel processors
* Update ggml.c
Fix indentation update
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
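For context, ggml exposes CPU feature probes as plain C functions returning 0 or 1; a minimal sketch of checking the new flag (the printing wrapper here is illustrative, only ggml_cpu_has_avx_vnni() comes from the PR):
```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    // ggml feature probes return 1 if the binary was built with the
    // corresponding instruction-set support, 0 otherwise.
    std::printf("AVX      : %d\n", ggml_cpu_has_avx());
    std::printf("AVX2     : %d\n", ggml_cpu_has_avx2());
    std::printf("AVX-VNNI : %d\n", ggml_cpu_has_avx_vnni()); // added in #4589
    return 0;
}
```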
Johannes Gäßler [Fri, 29 Dec 2023 22:12:53 +0000 (23:12 +0100)]
CUDA: fix tensor core logic for Pascal and HIP (#4682)
Georgi Gerganov [Fri, 29 Dec 2023 16:53:34 +0000 (18:53 +0200)]
clip : use ggml_backend_buffer_is_host (#4205)
Steward Garcia [Fri, 29 Dec 2023 16:52:15 +0000 (11:52 -0500)]
clip : enable gpu backend (#4205)
* clip: enable CUDA backend
* add missing kernels
* add enough padding for alignment
* remove ggml_repeat of clip.cpp
* add metal backend
* llava : fixes
- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars
---------
Co-authored-by: Georgi Gerganov <redacted>
hydai [Fri, 29 Dec 2023 16:31:19 +0000 (00:31 +0800)]
cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687)
Signed-off-by: hydai <redacted>
crasm [Fri, 29 Dec 2023 14:50:29 +0000 (09:50 -0500)]
python : add check-requirements.sh and GitHub workflow (#4585)
* python: add check-requirements.sh and GitHub workflow
This script and workflow force package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.
* Move requirements into ./requirements
* Fail on "==" being used for package requirements (but can be suppressed)
* Enforce "compatible release" syntax instead of ==
* Update workflow
* Add upper version bound for transformers and protobuf
* improve check-requirements.sh
* small syntax change
* don't remove venvs if nocleanup is passed
* See if this fixes docker workflow
* Move check-requirements.sh into ./scripts/
---------
Co-authored-by: Jared Van Bortel <redacted>
Philip Taron [Fri, 29 Dec 2023 14:42:26 +0000 (06:42 -0800)]
flake.nix : rewrite (#4605)
* flake.lock: update to hotfix CUDA::cuda_driver
Required to support https://github.com/ggerganov/llama.cpp/pull/4606
* flake.nix: rewrite
1. Split into separate files per output.
2. Added overlays, so that this flake can be integrated into others.
The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
`llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).
3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
rather than `with pkgs;` so that there's dependency injection rather
than dependency lookup.
4. Add a description and meta information for each package.
The description includes a bit about what each package is trying to accelerate.
5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.
6. Format with `serokell/nixfmt` for a consistent style.
7. Update `flake.lock` with the latest goods.
* flake.nix: use finalPackage instead of passing it manually
* nix: unclutter darwin support
* nix: pass most darwin frameworks unconditionally
...for simplicity
* *.nix: nixfmt
nix shell github:piegamesde/nixfmt/rfc101-style --command \
nixfmt flake.nix .devops/nix/*.nix
* flake.nix: add maintainers
* nix: move meta down to follow Nixpkgs style more closely
* nix: add missing meta attributes
nix: clarify the interpretation of meta.maintainers
nix: clarify the meaning of "broken" and "badPlatforms"
nix: passthru: expose the use* flags for inspection
E.g.:
```
❯ nix eval .#cuda.useCuda
true
```
* flake.nix: avoid re-evaluating nixpkgs too many times
* flake.nix: use flake-parts
* nix: migrate to pname+version
* flake.nix: overlay: expose both the namespace and the default attribute
* ci: add the (Nix) flakestry workflow
* nix: cmakeFlags: explicit OFF bools
* nix: cuda: reduce runtime closure
* nix: fewer rebuilds
* nix: respect config.cudaCapabilities
* nix: add the impure driver's location to the DT_RUNPATHs
* nix: clean sources more thoroughly
...this way outPaths change less frequently,
and so there are fewer rebuilds
* nix: explicit mpi support
* nix: explicit jetson support
* flake.nix: darwin: only expose the default
---------
Co-authored-by: Someone Serge <redacted>
Cuong Trinh Manh [Fri, 29 Dec 2023 14:39:15 +0000 (21:39 +0700)]
cmake : fix ld warning duplicate libraries libllama.a (#4671)
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"
* fix warning in example.
Justine Tunney [Fri, 29 Dec 2023 14:38:38 +0000 (06:38 -0800)]
llava-cli : refactor to use sampling library (#4669)
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup deleting a long
standing TODO comment.
This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.
See Mozilla-Ocho/llamafile@1cd334f
Justine Tunney [Fri, 29 Dec 2023 14:24:12 +0000 (06:24 -0800)]
server : replace sleep with condition variables (#4673)
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep
implementations spin for smaller time intervals, which results in the
server's busy loop consuming all available CPU. Having explicit
notify() / wait() code also aids the readability of the server code.
See mozilla-Ocho/llamafile@711344b
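As a conceptual illustration of the pattern described above (a minimal sketch, not the server's actual code), a task queue built on a condition variable wakes the worker on notify() instead of polling with sleep():
```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Minimal sketch: the worker blocks on a condition variable instead of
// polling with sleep(5ms), so new tasks are picked up immediately and the
// idle loop burns no CPU.
struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;

    void post(std::function<void()> fn) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push_back(std::move(fn));
        }
        cv.notify_one(); // wake the worker without waiting for a poll tick
    }

    std::function<void()> wait_next() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !tasks.empty(); });
        auto fn = std::move(tasks.front());
        tasks.pop_front();
        return fn;
    }
};
```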
SakuraUmi [Fri, 29 Dec 2023 14:22:44 +0000 (22:22 +0800)]
server : fix OpenAI server sampling w.r.t. penalty. (#4675)
Karthik Sethuraman [Fri, 29 Dec 2023 14:22:10 +0000 (06:22 -0800)]
server : allow to generate multimodal embeddings (#4681)
andrijdavid [Fri, 29 Dec 2023 14:18:20 +0000 (15:18 +0100)]
main-cmake-pkg : fix build issue (#4665)
* Fix main-cmake-pkg compilation
* Use glob to load common files
* cmake : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <redacted>
Peter Sugihara [Fri, 29 Dec 2023 13:58:56 +0000 (05:58 -0800)]
llama.swiftui : fix infinite loop, output timings, buff UI (#4674)
* fix infinite loop
* slight UI simplification, clearer UX
* clearer UI text, add timings to completion log
Georgi Gerganov [Fri, 29 Dec 2023 13:12:35 +0000 (15:12 +0200)]
scripts : print list of sync commits
Tamotsu Takahashi [Fri, 29 Dec 2023 10:23:27 +0000 (19:23 +0900)]
ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576)
* Build with CLBlast
* Declare GGML_API
After rebasing, examples/talk-llama failed:
"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
Georgi Gerganov [Fri, 29 Dec 2023 12:56:41 +0000 (14:56 +0200)]
sync : ggml
bssrdf [Fri, 29 Dec 2023 08:32:31 +0000 (03:32 -0500)]
ggml : fix some mul mat cases + add tests for src1 F16 (ggml/669)
* fixed mul-mat error for old GPUs
* style fixes
* add mul mat src1 f16 test cases, fix more cases
ggml-ci
---------
Co-authored-by: bssrdf <redacted>
Co-authored-by: slaren <redacted>
Georgi Gerganov [Fri, 29 Dec 2023 12:41:36 +0000 (14:41 +0200)]
scripts : do not sync commits from this repo
Justine Tunney [Thu, 28 Dec 2023 19:20:00 +0000 (11:20 -0800)]
Fix OpenAI server sampling w.r.t. temp and seed (#4668)
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
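To make the effect concrete: tfs_z and typical_p act as filters only when set below 1.0, so a value of 0 prunes the candidate list down to a single token. A hedged sketch of sensible defaults (field names mirror llama.cpp's common sampling parameters; the struct itself is illustrative):
```cpp
// Illustrative defaults: values of 1.0f disable tail-free sampling and
// locally typical sampling, so the full candidate distribution survives.
struct sampling_defaults {
    float temperature = 0.8f; // > 0 gives stochastic output
    float top_p       = 0.95f;
    float tfs_z       = 1.0f; // 1.0 = disabled (0.0 would prune to one token)
    float typical_p   = 1.0f; // 1.0 = disabled
    int   seed        = -1;   // -1 = random; a fixed seed makes output deterministic
};
```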
manikbhandari [Thu, 28 Dec 2023 14:03:57 +0000 (09:03 -0500)]
gpt2 : Add gpt2 architecture integration (#4555)
Nam D. Tran [Wed, 27 Dec 2023 15:39:45 +0000 (22:39 +0700)]
llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)
* update: awq support llama-7b model
* update: change order
* update: benchmark results for llama2-7b
* update: mistral 7b v1 benchmark
* update: support 4 models
* fix: Readme
* update: ready for PR
* update: readme
* fix: readme
* update: change order import
* black
* format code
* update: work for both mpt and awqmpt
* update: readme
* Rename to llm_build_ffn_mpt_awq
* Formatted other files
* Fixed params count
* fix: remove code
* update: more detail for mpt
* fix: readme
* fix: readme
* update: change folder architecture
* fix: common.cpp
* fix: readme
* fix: remove ggml_repeat
* update: cicd
* update: cicd
* update: remove use_awq arg
* update: readme
* llama : adapt plamo to new ffn
ggml-ci
---------
Co-authored-by: Trần Đức Nam <redacted>
Co-authored-by: Le Hoang Anh <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Daniel Bevenius [Wed, 27 Dec 2023 14:16:55 +0000 (15:16 +0100)]
finetune : fix output formatting in print_params (#4653)
This commit fixes the output formatting in the print_params function
which currently looks like this:
```console
print_params: n_vocab: 32000
print_params: n_ctx: 128
print_params: n_embd: 4096
print_params: n_ff: 11008
print_params: n_head: 32
print_params: n_head_kv: 32
print_params: n_layer: 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
With this commit the output will look like this:
```console
print_params: n_vocab : 32000
print_params: n_ctx : 128
print_params: n_embd : 4096
print_params: n_ff : 11008
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 32
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
```
Signed-off-by: Daniel Bevenius <redacted>
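The aligned output above can be produced with a left-justified, fixed-width field for the parameter name; a minimal sketch (the exact field width is an assumption):
```cpp
#include <cstdio>

int main() {
    // Left-justified, fixed-width parameter names keep the values in one
    // column, as in the "after" output above (width chosen for illustration).
    int   n_vocab      = 32000;
    float norm_rms_eps = 0.000010f;
    std::printf("print_params: %-16s: %d\n", "n_vocab",      n_vocab);
    std::printf("print_params: %-16s: %f\n", "norm_rms_eps", norm_rms_eps);
    return 0;
}
```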
Georgi Gerganov [Wed, 27 Dec 2023 09:15:31 +0000 (11:15 +0200)]
scripts : add sync-ggml-am.sh
Georgi Gerganov [Wed, 27 Dec 2023 09:02:13 +0000 (11:02 +0200)]
ggml : fix dot product for ARM (#4630)
ggml-ci
wonjun Jang [Wed, 27 Dec 2023 08:37:25 +0000 (17:37 +0900)]
Add byte token type when tokenizer.model does not exist (#4641)
* Add byte token type to hf format
* remove unused variable
slaren [Tue, 26 Dec 2023 20:23:59 +0000 (21:23 +0100)]
cuda : fix vmm pool with multi GPU (#4620)
* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
WillCorticesAI [Tue, 26 Dec 2023 10:42:08 +0000 (05:42 -0500)]
Update comment for AdamW implementation reference. (#4604)
Co-authored-by: Will Findley <redacted>
FantasyGmm [Tue, 26 Dec 2023 10:38:36 +0000 (18:38 +0800)]
Fix new CUDA10 compilation errors (#4635)
Paul Tsochantaris [Mon, 25 Dec 2023 16:09:53 +0000 (16:09 +0000)]
Adding Emeltal reference to UI list (#4629)
slaren [Sun, 24 Dec 2023 20:01:12 +0000 (21:01 +0100)]
simplify bug issue template (#4623)
Shintarou Okada [Sun, 24 Dec 2023 13:35:49 +0000 (22:35 +0900)]
llama : add PLaMo model (#3557)
* add plamo mock
* add tensor loading
* plamo convert
* update norm
* able to compile
* fix norm_rms_eps hparam
* runnable
* use inp_pos
* seems ok
* update kqv code
* remove develop code
* update README
* shuffle attn_q.weight and attn_output.weight for broadcasting
* remove plamo_llm_build_kqv and use llm_build_kqv
* fix style
* update
* llama : remove obsolete KQ_scale
* plamo : fix tensor names for correct GPU offload
---------
Co-authored-by: Georgi Gerganov <redacted>
slaren [Sun, 24 Dec 2023 13:34:22 +0000 (14:34 +0100)]
cuda : improve cuda pool efficiency using virtual memory (#4606)
* cuda : improve cuda pool efficiency using virtual memory
* fix mixtral
* fix cmake build
* check for vmm support, disable for hip
ggml-ci
* fix hip build
* clarify granularity
* move all caps to g_device_caps
* refactor error checking
* add cuda_pool_alloc, refactor most pool allocations
ggml-ci
* fix hip build
* CUBLAS_TF32_TENSOR_OP_MATH is not a macro
* more hip crap
* llama : fix msvc warnings
* ggml : fix msvc warnings
* minor
* minor
* cuda : fallback to CPU on host buffer alloc fail
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <redacted>
* ensure allocations are always aligned
* act_size -> actual_size
---------
Co-authored-by: Johannes Gäßler <redacted>
slaren [Sat, 23 Dec 2023 15:10:51 +0000 (16:10 +0100)]
fallback to CPU buffer if host buffer alloc fails (#4610)
Samuel Maynard [Sat, 23 Dec 2023 09:35:55 +0000 (11:35 +0200)]
ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)
Alexey Parfenov [Sat, 23 Dec 2023 09:31:49 +0000 (09:31 +0000)]
server : allow to specify custom prompt for penalty calculation (#3727)
kalomaze [Sat, 23 Dec 2023 09:27:07 +0000 (03:27 -0600)]
grammar : check the full vocab only if necessary (opt) (#4306)
* Check the full vocab for grammar only if necessary
* Fix missing logit restoration step (?)
Does this matter, actually?
* Fix whitespace / formatting
* Adjust comment
* Didn't mean to push test gbnf
* Split sampling into the helper function (?)
And also revert the changes made to the header
* common : fix final newline
---------
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Sat, 23 Dec 2023 08:16:33 +0000 (09:16 +0100)]
CUDA: fixed row rounding for 0 tensor splits (#4594)
LeonEricsson [Fri, 22 Dec 2023 16:05:56 +0000 (17:05 +0100)]
lookup : add prompt lookup decoding example (#4484)
* initial commit, going through initializations
* main loop finished, starting to debug
* BUG: generates gibberish/repeating tokens after a while
* kv_cache management
* Added colors to distinguish drafted tokens (--color). Updated README
* lookup : fix token positions in the draft batch
* lookup : use n_draft from CLI params
* lookup : final touches
---------
Co-authored-by: Leon Ericsson <redacted>
Co-authored-by: Georgi Gerganov <redacted>
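The idea behind prompt lookup decoding, sketched below under simplifying assumptions (single fixed n-gram size, no scoring): find an earlier occurrence of the last few tokens in the context and propose whatever followed it as the draft.
```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Sketch of prompt-lookup drafting: search the context for a previous
// occurrence of the last `ngram` tokens and return the tokens that
// followed it, up to `n_draft` of them. Real implementations try several
// n-gram sizes and add sanity checks; this is the bare idea only.
static std::vector<llama_token> draft_from_prompt(
        const std::vector<llama_token> & ctx_tokens,
        size_t ngram, size_t n_draft) {
    std::vector<llama_token> draft;
    if (ctx_tokens.size() < ngram + 1) {
        return draft;
    }
    const size_t end = ctx_tokens.size() - ngram; // start of the trailing n-gram
    for (size_t i = end; i-- > 0; ) {             // scan backwards for a match
        bool match = true;
        for (size_t j = 0; j < ngram; ++j) {
            if (ctx_tokens[i + j] != ctx_tokens[end + j]) { match = false; break; }
        }
        if (match) {
            for (size_t k = i + ngram; k < ctx_tokens.size() && draft.size() < n_draft; ++k) {
                draft.push_back(ctx_tokens[k]);
            }
            break;
        }
    }
    return draft;
}
```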
Georgi Gerganov [Fri, 22 Dec 2023 15:53:43 +0000 (17:53 +0200)]
sync : ggml (fix im2col) (#4591)
* cuda : fix im2col_f32_f16 (ggml/#658)
ggml-ci
* ggml-alloc : fix ggml_tallocr_is_own
---------
Co-authored-by: leejet <redacted>
FantasyGmm [Fri, 22 Dec 2023 15:11:12 +0000 (23:11 +0800)]
cuda : fix jetson compile error (#4560)
* fix old jetson compile error
* Update Makefile
* update jetson detect and cuda version detect
* update cuda marco define
* update makefile and cuda, fix some issues
* Update README.md
Co-authored-by: Georgi Gerganov <redacted>
* Update Makefile
* Update README.md
---------
Co-authored-by: Georgi Gerganov <redacted>
Henrik Forstén [Fri, 22 Dec 2023 13:34:05 +0000 (15:34 +0200)]
Fix CudaMemcpy direction (#4599)
slaren [Fri, 22 Dec 2023 11:12:53 +0000 (12:12 +0100)]
llama : fix platforms without mmap (#4578)
* llama : fix platforms without mmap
* win32 : limit prefetch size to the file size
* fix win32 error clobber, unnecessary std::string in std::runtime_error
Herman Semenov [Fri, 22 Dec 2023 09:26:49 +0000 (09:26 +0000)]
ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)
Michael Kesper [Fri, 22 Dec 2023 08:03:25 +0000 (09:03 +0100)]
make : add LLAMA_HIP_UMA option (#4587)
NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
rhuddleston [Fri, 22 Dec 2023 06:56:34 +0000 (23:56 -0700)]
ci : tag docker image with build number (#4584)
Deins [Fri, 22 Dec 2023 06:49:54 +0000 (08:49 +0200)]
readme : add zig bindings (#4581)
bobqianic [Fri, 22 Dec 2023 06:47:01 +0000 (06:47 +0000)]
ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579)
crasm [Fri, 22 Dec 2023 06:19:36 +0000 (01:19 -0500)]
llama : add ability to cancel model loading (#4462)
* llama : Add ability to cancel model load
Updated llama_progress_callback so that if it returns false, the model
loading is aborted.
* llama : Add test for model load cancellation
* Fix bool return in llama_model_load, remove std::ignore use
* Update llama.cpp
Co-authored-by: Jared Van Bortel <redacted>
* Fail test if model file is missing
* Revert "Fail test if model file is missing"
This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d.
* Add test-model-load-cancel to Makefile
* Revert "Revert "Fail test if model file is missing""
This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b.
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* ci : get ci/run.sh working with test-model-load-cancel
* ci : restrict .github/workflows/build.yml ctest to -L main
* update requirements.txt
* Disable test-model-load-cancel in make
* Remove venv before creation
* Restructure requirements.txt
Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.
* Make per-python-script requirements work alone
This doesn't break the main requirements.txt.
* Add comment
* Add convert-persimmon-to-gguf.py to new requirements.txt scheme
* Add check-requirements.sh script and GitHub workflow
* Remove shellcheck installation step from workflow
* Add nocleanup special arg
* Fix merge
see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573
* reset to upstream/master
* Redo changes for cancelling model load
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Georgi Gerganov [Thu, 21 Dec 2023 21:20:49 +0000 (23:20 +0200)]
ggml : change ggml_scale to take a float instead of tensor (#4573)
* ggml : change ggml_scale to take a float instead of tensor
* ggml : fix CPU implementation
* tests : fix test-grad0
ggml-ci
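A hedged usage sketch of the changed API: the scale factor is now passed directly as a float rather than as a separate 1-element tensor (the helper name is illustrative):
```cpp
#include "ggml.h"

// Sketch: scale a tensor by a constant. After #4573 the factor is passed
// as a plain float instead of a separate ggml tensor.
static struct ggml_tensor * scale_by_half(struct ggml_context * ctx,
                                          struct ggml_tensor  * x) {
    return ggml_scale(ctx, x, 0.5f);
}
```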
Georgi Gerganov [Thu, 21 Dec 2023 21:20:36 +0000 (23:20 +0200)]
gguf-py : fix broken link
Georgi Gerganov [Thu, 21 Dec 2023 21:07:58 +0000 (23:07 +0200)]
gguf : simplify example dependencies
Samuel Maynard [Thu, 21 Dec 2023 20:36:26 +0000 (22:36 +0200)]
ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)
* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo
* [github][workflows][docker]: adds `jlumbroso/free-disk-space`
slaren [Thu, 21 Dec 2023 20:07:46 +0000 (21:07 +0100)]
llama : initial ggml-backend integration (#4520)
* llama : initial ggml-backend integration
* add ggml-metal
* cuda backend can be used through ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set
* add ggml_backend_buffer_clear
zero-init KV cache buffer
* add ggml_backend_buffer_is_host, used to avoid copies if possible when accessing tensor data
* disable gpu backends with ngl 0
* more accurate mlock
* unmap offloaded part of the model
* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap
* update quantize and lora
* update session copy/set to use ggml-backend
ggml-ci
* use posix_fadvise instead of posix_fadvise64
* ggml_backend_alloc_ctx_tensors_from_buft : remove old print
* llama_mmap::align_offset : use pointers instead of references for out parameters
* restore progress_callback behavior
* move final progress_callback call to load_all_data
* cuda : fix fprintf format string (minor)
* do not offload scales
* llama_mmap : avoid unmapping the same fragments again in the destructor
* remove unnecessary unmap
* metal : add default log function that prints to stderr, cleanup code
ggml-ci
---------
Co-authored-by: Georgi Gerganov <redacted>
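As a hedged illustration of the access pattern mentioned above: with ggml-backend, tensor data may live in device memory, so it is copied out with ggml_backend_tensor_get() instead of dereferencing tensor->data directly (the helper is illustrative and assumes an F32 tensor):
```cpp
#include <vector>
#include "ggml-backend.h"

// Sketch: read a whole F32 tensor back into host memory through the
// backend interface, regardless of whether it lives on CPU or GPU.
static std::vector<float> read_tensor_f32(const struct ggml_tensor * t) {
    std::vector<float> out(ggml_nelements(t));
    ggml_backend_tensor_get(t, out.data(), 0, ggml_nbytes(t));
    return out;
}
```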
Marcus Dunn [Thu, 21 Dec 2023 19:57:48 +0000 (11:57 -0800)]
llama : allow getting n_batch from llama_context in c api (#4540)
* allowed getting n_batch from llama_context in c api
* changed to use `uint32_t` instead of `int`
* changed to use `uint32_t` instead of `int` in `llama_n_ctx`
* Update llama.h
---------
Co-authored-by: Georgi Gerganov <redacted>
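A hedged usage sketch of the new accessor (the wrapper is illustrative; see llama.h for the exact declaration):
```cpp
#include <cstdio>
#include "llama.h"

// Sketch: query the batch size a context was created with, via the C API
// accessor added in #4540 (per the commit notes it returns uint32_t).
static void print_batch_size(const struct llama_context * ctx) {
    const uint32_t n_batch = llama_n_batch(ctx);
    std::printf("n_batch = %u\n", n_batch);
}
```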
Finn Voorhees [Thu, 21 Dec 2023 19:55:02 +0000 (14:55 -0500)]
metal : fix `ggml_metal_log` vargs (#4373)
Erik Garrison [Thu, 21 Dec 2023 19:45:32 +0000 (13:45 -0600)]
cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)
* AMD ROCm: handle UMA memory VRAM expansions
This resolves #2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.
Without this, AMD ROCm users with shared CPU/GPU memory usually are
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.
Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.
* clarify build process for ROCm on linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
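Conceptually, the UMA path swaps the dedicated-VRAM allocation for HIP managed memory so the GPU can transparently draw on system RAM. A rough sketch of that idea, assuming hipMallocManaged as the managed allocation call; this is not the PR's actual diff:
```cpp
#include <hip/hip_runtime.h>

// Rough sketch of the UMA idea: when GGML_HIP_UMA is enabled, allocate
// managed memory that can spill into system RAM instead of dedicated VRAM.
// This mirrors the concept described above, not the exact code in the PR.
static hipError_t alloc_device_buffer(void ** ptr, size_t size) {
#if defined(GGML_HIP_UMA)
    return hipMallocManaged(ptr, size);
#else
    return hipMalloc(ptr, size);
#endif
}
```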
arlo-phoenix [Thu, 21 Dec 2023 19:13:25 +0000 (20:13 +0100)]
ggml-cuda: Fix HIP build by adding define for __trap (#4569)
Regression of 139882392258671ffe5acdfcadc0bc08572d6eef
HIP doesn't have trap, only abort
Jared Van Bortel [Thu, 21 Dec 2023 17:55:34 +0000 (12:55 -0500)]
common : remove incorrect --model-draft default (#4568)
Johannes Gäßler [Thu, 21 Dec 2023 17:42:59 +0000 (18:42 +0100)]
CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)
Georgi Gerganov [Thu, 21 Dec 2023 17:27:14 +0000 (19:27 +0200)]
readme : update coding guidelines
howlger [Thu, 21 Dec 2023 17:07:34 +0000 (18:07 +0100)]
py : open merges file as 'utf-8' (#4566)
Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:
```
Traceback (most recent call last):
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
model_instance.set_vocab()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
self._set_vocab_gpt2()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
self._load(Path(path))
File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
self._try_load_merges_txt(path)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
for line in fp:
File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
bobqianic [Thu, 21 Dec 2023 17:06:44 +0000 (17:06 +0000)]
cuda : better error message for ggml_get_rows (#4561)
* Update ggml-cuda.cu
* Update ggml-cuda.cu
* Update ggml-cuda.cu
---------
Co-authored-by: Georgi Gerganov <redacted>
slaren [Thu, 21 Dec 2023 17:02:30 +0000 (18:02 +0100)]
cuda : replace asserts in wrong architecture checks with __trap (#4556)
* cuda : replace asserts in wrong architecture checks with __trap
* make bad_arch noreturn, remove returns
Johannes Gäßler [Thu, 21 Dec 2023 16:34:17 +0000 (17:34 +0100)]
llama : disable per-tensor info prints on model load (#4562)
LoganDark [Thu, 21 Dec 2023 09:59:27 +0000 (01:59 -0800)]
Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)
Johannes Gäßler [Wed, 20 Dec 2023 14:41:22 +0000 (15:41 +0100)]
CUDA: Faster Mixtral prompt processing (#4538)
* CUDA: make MoE tensors contiguous for batch size>1
* Update ggml-cuda.cu
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Eric Sommerlade [Tue, 19 Dec 2023 16:17:01 +0000 (16:17 +0000)]
ggml : fixed check for _MSC_VER (#4535)
Co-authored-by: Eric Sommerlade <redacted>
arlo-phoenix [Mon, 18 Dec 2023 21:33:45 +0000 (22:33 +0100)]
ggml-cuda: Fix HIP build (#4528)
Regression of #4490.
Adds defines for two new datatypes: cublasComputeType_t and cudaDataType_t.
Currently using the deprecated hipblasDatatype_t since the newer ones are very recent.
Georgi Gerganov [Mon, 18 Dec 2023 18:17:43 +0000 (20:17 +0200)]
llama.swiftui : add tinyllama 1.1B F16
Georgi Gerganov [Mon, 18 Dec 2023 18:05:12 +0000 (20:05 +0200)]
llama.swiftui : add more models
Ebey Abraham [Mon, 18 Dec 2023 17:27:47 +0000 (17:27 +0000)]
llama : add phi-2 + fix NeoX rope + ggml_mul_mat_set_prec (#4490)
* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec
ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <redacted>
* Update ggml-cuda.cu
Co-authored-by: slaren <redacted>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove obsolete comment
---------
Co-authored-by: Ebey Abraham <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: slaren <redacted>
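A hedged sketch of how the new ggml_mul_mat_set_prec hint might be applied to a matmul result (the helper name is illustrative):
```cpp
#include "ggml.h"

// Sketch: request F32 accumulation for a particular matrix multiplication,
// e.g. a precision-sensitive attention product, using the hint added here.
static struct ggml_tensor * mul_mat_f32_prec(struct ggml_context * ctx,
                                             struct ggml_tensor  * a,
                                             struct ggml_tensor  * b) {
    struct ggml_tensor * cur = ggml_mul_mat(ctx, a, b);
    ggml_mul_mat_set_prec(cur, GGML_PREC_F32); // hint: use full F32 precision
    return cur;
}
```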
hankcs [Mon, 18 Dec 2023 13:14:58 +0000 (05:14 -0800)]
llama : fix try_override for bool_value which always returns true (#4519)
Jared Van Bortel [Mon, 18 Dec 2023 00:39:02 +0000 (19:39 -0500)]
decode : fix logits_valid for legacy API (#4516)
Georgi Gerganov [Sun, 17 Dec 2023 18:16:23 +0000 (20:16 +0200)]
readme : update hot topics
Georgi Gerganov [Sun, 17 Dec 2023 17:38:41 +0000 (19:38 +0200)]
llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button
* llama.swiftui : initial bench functionality
* force to use n_gpu_layers on simulator
* add download buttons & expose llamaState.loadModel
* update project.pbxproj
* comment #Preview & fix editorconfig check
* gitignore : xcode stuff
* llama.swiftui : UX improvements
* llama.swiftui : avoid data copy via "downloadTask"
* llama.swiftui : remove model from project
* llama : remove "mostly" from model infos
* llama.swiftui : improve bench
---------
Co-authored-by: jhen <redacted>
Jared Van Bortel [Sun, 17 Dec 2023 15:45:46 +0000 (10:45 -0500)]
gguf-py : fail fast on nonsensical special token IDs (#4489)
Matheus Gabriel Alves Silva [Sun, 17 Dec 2023 15:23:33 +0000 (12:23 -0300)]
build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
slaren [Sun, 17 Dec 2023 15:05:56 +0000 (16:05 +0100)]
finetune : keep allocs alive until all allocations are done (#4486)
olexiyb [Sun, 17 Dec 2023 15:02:16 +0000 (17:02 +0200)]
server : disable llm logs if SERVER_VERBOSE is off (#3792)
AdithyanI [Sun, 17 Dec 2023 14:57:56 +0000 (15:57 +0100)]
server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
Alexey Parfenov [Sun, 17 Dec 2023 14:56:09 +0000 (14:56 +0000)]
server : fix possible ambiguity in content type charset (#4501)
mzcu [Sun, 17 Dec 2023 14:54:37 +0000 (15:54 +0100)]
server : allow requests larger than 8K (#4500)
Bach Le [Sun, 17 Dec 2023 10:57:33 +0000 (18:57 +0800)]
Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506)
slaren [Sat, 16 Dec 2023 17:58:46 +0000 (18:58 +0100)]
lora : add support for non-llama models (#3333)
* lora : add support for non-llama models
ggml-ci
* avoid leaking ggml_context on failure
cleanup
ggml-ci
* lora : allow 1d tensors
* lora : include embd and output layers in size calculation
* fix style
Jared Van Bortel [Sat, 16 Dec 2023 03:16:15 +0000 (22:16 -0500)]
llama : sanity checks for access to logits (#4274)
Co-authored-by: Georgi Gerganov <redacted>
ShadovvBeast [Fri, 15 Dec 2023 11:49:01 +0000 (13:49 +0200)]
server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <redacted>
slaren [Fri, 15 Dec 2023 11:45:50 +0000 (12:45 +0100)]
ggml : group mul_mat_id rows by matrix (cpu only) (#4480)
* ggml : group mul_mat_id rows by matrix (cpu only)
* remove mmid parameters from mm forward
* store row groups in wdata and calculate only once in GGML_TASK_INIT
ggml-ci
slaren [Thu, 14 Dec 2023 19:05:21 +0000 (20:05 +0100)]
ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible
ggml-ci
* ggml : move ggml_nbytes_split to ggml-cuda.cu
slaren [Thu, 14 Dec 2023 15:52:08 +0000 (16:52 +0100)]
ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
wonjun Jang [Thu, 14 Dec 2023 12:44:49 +0000 (21:44 +0900)]
py : add protobuf dependency (#4466)
LostRuins [Thu, 14 Dec 2023 12:13:33 +0000 (20:13 +0800)]
ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <redacted>
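The overflow-safe way to size a buffer is to compute whole rows in integer arithmetic instead of multiplying by a fractional per-element size; a hedged sketch (the helper name is illustrative):
```cpp
#include "ggml.h"

// Sketch: compute the bytes needed for a 2-D (possibly quantized) tensor
// using ggml_row_size(), which works in whole rows and avoids the float
// rounding that ggml_type_sizef() could introduce.
static size_t tensor_bytes_2d(enum ggml_type type, int64_t ne0, int64_t ne1) {
    return ggml_row_size(type, ne0) * ne1;
}
```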
Georgi Gerganov [Thu, 14 Dec 2023 08:35:29 +0000 (10:35 +0200)]
ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453)
wonjun Jang [Thu, 14 Dec 2023 08:09:34 +0000 (17:09 +0900)]
convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py
* Update convert.py
* Update convert.py
* add bytes_to_unicode function
* change add_meta_vocab function
* remove debug code
* remove byte_encoder
* Add newline between classes
* Check tokenizer.json when tokenizer.model does not exist.
* Move transformers dependency to local code
* Add error context with 'raise from'
* Add fast tokenizer option to BpeVocab
* Update convert.py
* Add VocabLoader and remove *Vocab class
* Add transformers dependency
* remove added tokens and check newline token to decide spm or bpe
* Update convert.py
* Add special token type
* Update convert.py
* Update convert.py
* Update convert.py
* Fix typo in convert.py
* Fix when params.n_vocab < tokenizer vocab size
* update vocab class
* change function name
* Remove unused variable/functions, add types to class variable and methods, delete blank lines
* fix flake8 warnings
* code style cleanup
* make mypy happy
* change exception
---------
Co-authored-by: Jared Van Bortel <redacted>
BarfingLemurs [Thu, 14 Dec 2023 07:38:49 +0000 (02:38 -0500)]
readme : update supported model list (#4457)