git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
afrideva [Tue, 14 Nov 2023 01:03:40 +0000 (17:03 -0800)]
convert.py: also look for plain model.safetensors (#4043)
* add safetensors to convert.py help message
* Check for single-file safetensors model
* Update convert.py "model" option help message
* revert convert.py help message change
M. Yusuf Sarıgöz [Mon, 13 Nov 2023 15:20:52 +0000 (18:20 +0300)]
llava : fix regression for square images in #3613 (#4056)
Georgi Gerganov [Mon, 13 Nov 2023 14:55:52 +0000 (16:55 +0200)]
ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060)
ggml-ci
Georgi Gerganov [Mon, 13 Nov 2023 12:18:08 +0000 (14:18 +0200)]
readme : update hot topics
Georgi Gerganov [Mon, 13 Nov 2023 12:16:23 +0000 (14:16 +0200)]
sync : ggml (backend v2) (#3912)
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
Kerfuffle [Mon, 13 Nov 2023 08:58:15 +0000 (01:58 -0700)]
Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041)
* Add ReLU and SQR CUDA ops to fix Persimmon offloading
* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers
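For context, element-wise ops like these are small CUDA kernels. Below is a minimal sketch of what ReLU and SQR kernels look like; the names and launch configuration are illustrative, not the actual ggml-cuda symbols.
```cuda
// Illustrative element-wise kernels in the spirit of the ReLU/SQR ops added here.
#include <cuda_runtime.h>

static __global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) return;
    dst[i] = fmaxf(x[i], 0.0f);          // ReLU: clamp negatives to zero
}

static __global__ void sqr_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) return;
    dst[i] = x[i] * x[i];                // SQR: element-wise square
}

static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int block = 256;
    const int grid  = (k + block - 1) / block;
    relu_f32<<<grid, block, 0, stream>>>(x, dst, k);
}
```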
Kerfuffle [Sun, 12 Nov 2023 23:39:37 +0000 (16:39 -0700)]
gguf-py: gguf_writer: Use bytearray to build metadata (#4051)
* gguf-py: gguf_writer: Use BytesIO to build metadata
* Use bytearray instead
Bump gguf-py package version
Richard Kiss [Sun, 12 Nov 2023 06:04:58 +0000 (22:04 -0800)]
Fix some documentation typos/grammar mistakes (#4032)
* typos
* Update examples/parallel/README.md
Co-authored-by: Kerfuffle <redacted>
---------
Co-authored-by: Kerfuffle <redacted>
M. Yusuf Sarıgöz [Sat, 11 Nov 2023 15:35:31 +0000 (18:35 +0300)]
Fix gguf-convert-endian script (#4037)
* Fix gguf-convert-endian script
* Bump version and update description
Alexey Parfenov [Sat, 11 Nov 2023 05:48:21 +0000 (05:48 +0000)]
server : fix crash when prompt exceeds context size (#3996)
Kerfuffle [Sat, 11 Nov 2023 05:04:50 +0000 (22:04 -0700)]
gguf-py: Refactor and allow reading/modifying existing GGUF files (#3981)
* gguf-py: Refactor and add file reading support
* Replay changes from #3871
Credit to @cebtenzzre for that pull
* Various type annotation fixes.
* sort imports with isort (again)
* Fix missing return statement in add_tensor
* style cleanup with flake8
* fix NamedTuple and Enum usage
* Fix an issue with state init in GGUFReader
Move examples to an examples/ directory
Clean up examples
Add an example of modifying keys in a GGUF file
Update documentation with info on examples
Try to support people importing gguf/gguf.py directly
* Damagage is not a word.
* Clean up gguf-py/examples/modify_gguf.py whitespace
Co-authored-by: Jared Van Bortel <redacted>
* Update gguf-py/examples/modify_gguf.py formatting
Co-authored-by: Jared Van Bortel <redacted>
* Update gguf-py/gguf/gguf_reader.py type hint
Co-authored-by: Jared Van Bortel <redacted>
* Make examples executable, formatting changes
* Add more information to GGUFReader and examples comments
* Include a gguf Python package version bump
* Add convert-gguf-endian.py script
* cleanup
* gguf-py : bump minor version
* Reorganize scripts
* Make GGUFReader endian detection less arbitrary
* Add JSON dumping support to gguf-dump.py
Which I kind of regret now
* A few gguf-dump.py cleanups
* Murder accidental tuple in gguf-py/scripts/gguf-dump.py
Co-authored-by: Jared Van Bortel <redacted>
* cleanup
* constants : remove unneeded type annotations
* fix python 3.8 compat
* Set up gguf- scripts in pyproject.toml
* And include scripts/__init__.py, derp
* convert.py: We can't currently support Q8_0 on big endian.
* gguf-py: SpecialVocab: Always try available sources for special token ids
gguf-py: SpecialVocab: Try to load merges from merges.txt if not in tokenizer.json
gguf-py: SpecialVocab: Add 'add_bos_token' type bools to GGUF metadata
* cleanup
* Promote add_X_token to GGUF metadata for BOS and EOS
---------
Co-authored-by: Jared Van Bortel <redacted>
Co-authored-by: Jared Van Bortel <redacted>
Jhen-Jie Hong [Fri, 10 Nov 2023 22:49:33 +0000 (06:49 +0800)]
server : allow continue edit on completion mode (#3950)
* server : allow continue edit on completion mode
* server : handle abort case in runCompletion
* server : style improvement
Galunid [Fri, 10 Nov 2023 13:24:54 +0000 (14:24 +0100)]
Unbreak persimmon after #3837 (#4010)
Galunid [Thu, 9 Nov 2023 10:09:29 +0000 (11:09 +0100)]
scripts: Generalize convert scripts (#3838)
* Replace convert-*-hf-to-gguf.py files with convert-hf-to-gguf.py
Mihai [Thu, 9 Nov 2023 02:00:34 +0000 (04:00 +0200)]
server : add min_p param (#3877)
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841
* Use spaces instead of tabs
* Update index.html.hpp after running deps.sh
* Fix test - fix line ending
slaren [Wed, 8 Nov 2023 12:15:14 +0000 (13:15 +0100)]
ggml-alloc : fix backend assignments of views (#3982)
Jared Van Bortel [Tue, 7 Nov 2023 17:43:04 +0000 (12:43 -0500)]
gguf : track writer state, free unneeded tensors, cleanup (#3871)
Georgi Gerganov [Tue, 7 Nov 2023 17:25:32 +0000 (19:25 +0200)]
make : do not add linker flags when compiling static llava lib (#3977)
xaedes [Tue, 7 Nov 2023 08:04:51 +0000 (09:04 +0100)]
ggml : fix backward rope after YaRN (#3974)
* fix backward process of rope
The RoPE backward process was broken after the YaRN RoPE (#2268) implementation, due to missing changes in the backward functions.
The code for the backward process is nearly identical to the forward process: the only difference is the sign of the sin values.
To avoid future regressions, the near-duplicate backward functions are removed and the forward code is reused: for this, a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
The sin values are negated when `forward` is false (see the sketch after this list).
* fix finetune rope call to use correct default attn_factor of 1.0f
* remove unused `ggml_rope_xpos_back`
It is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.
* fix comments explaining the sine sign in ggml_forward_rope
* add missing function arguments in declaration
* fix function argument type in declaration
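The sketch below illustrates the point made in the first bullet: negating the sin term rotates in the opposite direction, which is exactly the backward pass, so the forward code can be reused. It is a simplified illustration, not the actual ggml rope code (which also handles strides, frequency scaling, and YaRN corrections).
```c++
// Hedged sketch: one 2-D rotation of the kind RoPE applies to each value pair.
#include <cmath>

static void rope_pair(float * x0, float * x1, float theta, bool forward) {
    const float cos_theta = cosf(theta);
    // negating sin inverts the rotation, which is the backward pass
    const float sin_theta = forward ? sinf(theta) : -sinf(theta);

    const float v0 = *x0;
    const float v1 = *x1;

    *x0 = v0*cos_theta - v1*sin_theta;
    *x1 = v0*sin_theta + v1*cos_theta;
}
```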
Matthew Tejo [Tue, 7 Nov 2023 07:43:59 +0000 (23:43 -0800)]
Use params when loading models in llava-cli (#3976)
llava-cli was loading models with default params and ignoring settings
from the cli. This switches to a generic function to load the params
from the cli options.
Meng Zhang [Tue, 7 Nov 2023 06:49:08 +0000 (22:49 -0800)]
cuda : supports running on CPU for GGML_USE_CUBLAS=ON build (#3946)
* prototype support for running on the CPU with a GGML_USE_CUBLAS=on build
* doc: add comments to ggml_cublas_loaded()
* fix defined(...)
Damian Stewart [Mon, 6 Nov 2023 21:36:23 +0000 (22:36 +0100)]
llava : expose as a shared library for downstream projects (#3613)
* wip llava python bindings compatibility
* add external llava API
* add base64 in-prompt image support
* wip refactor image loading
* refactor image load out of llava init
* cleanup
* further cleanup; move llava-cli into its own file and rename
* move base64.hpp into common/
* collapse clip and llava libraries
* move llava into its own subdir
* wip
* fix bug where base64 string was not removed from the prompt
* get libllava to output in the right place
* expose llava methods in libllama.dylib
* cleanup memory usage around clip_image_*
* cleanup and refactor *again*
* update headerdoc
* build with cmake, not tested (WIP)
* Editorconfig
* Editorconfig
* Build with make
* Build with make
* Fix cyclical depts on Windows
* attempt to fix build on Windows
* attempt to fix build on Windows
* Upd TODOs
* attempt to fix build on Windows+CUDA
* Revert changes in cmake
* Fix according to review comments
* Support building as a shared library
* address review comments
---------
Co-authored-by: M. Yusuf Sarıgöz <redacted>
Co-authored-by: Jared Van Bortel <redacted>
slaren [Sun, 5 Nov 2023 17:45:16 +0000 (18:45 +0100)]
ggml-cuda : fix f16 mul mat (#3961)
* ggml-cuda : fix f16 mul mat
ggml-ci
* silence common.cpp warning (bonus)
Kerfuffle [Sun, 5 Nov 2023 17:06:06 +0000 (10:06 -0700)]
Allow common process_escapes to handle \x sequences (#3928)
* Allow common process_escapes to handle \x sequences
* Fix edge case when second hex digit is NUL
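A rough sketch of the kind of parsing this entry refers to, with the bounds checks that the NUL edge case calls for; the names are illustrative and this is not the exact common.cpp process_escapes code.
```c++
// Hedged sketch of handling "\xHH" escapes without reading past the end of the string.
#include <string>

static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

static std::string unescape_hex(const std::string & in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'x') {
            // only read the hex digits if they actually exist
            const int hi = (i + 2 < in.size()) ? hex_val(in[i + 2]) : -1;
            const int lo = (i + 3 < in.size()) ? hex_val(in[i + 3]) : -1;
            if (hi >= 0 && lo >= 0) {
                out += (char) (hi * 16 + lo);
                i += 3;                     // skip the consumed "\xHH"
                continue;
            }
        }
        out += in[i];                       // not a valid escape: copy verbatim
    }
    return out;
}
```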
Thái Hoàng Tâm [Sun, 5 Nov 2023 16:15:27 +0000 (23:15 +0700)]
server : fix typo for --alias shortcut from -m to -a (#3958)
Jared Van Bortel [Sun, 5 Nov 2023 15:08:57 +0000 (10:08 -0500)]
cuda : fix disabling device with --tensor-split 1,0 (#3951)
Co-authored-by: slaren <redacted>
Meng Zhang [Sun, 5 Nov 2023 12:40:08 +0000 (04:40 -0800)]
llama : mark LLM_ARCH_STARCODER as full offload supported (#3945)
as done in https://github.com/ggerganov/llama.cpp/pull/3827
Eve [Sun, 5 Nov 2023 08:03:09 +0000 (08:03 +0000)]
cmake : MSVC instruction detection (fixed up #809) (#3923)
* Add detection code for avx
* Only check hardware when option is ON
* Modify per code review suggestions
* Building locally will detect the CPU
* Fixes CMake style to use lowercase like everywhere else
* cleanup
* fix merge
* linux/gcc version for testing
* msvc combines avx2 and fma into /arch:AVX2 so check for both
* cleanup
* msvc only version
* style
* Update FindSIMD.cmake
---------
Co-authored-by: Howard Su <redacted>
Co-authored-by: Jeremy Dunn <redacted>
Eve [Sun, 5 Nov 2023 07:46:44 +0000 (07:46 +0000)]
ci : use intel sde when ci cpu doesn't support avx512 (#3949)
slaren [Sun, 5 Nov 2023 07:12:13 +0000 (08:12 +0100)]
cuda : revert CUDA pool stuff (#3944)
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918)"
This reverts commit 629f917cd6b96ba1274c49a8aab163b1b189229d.
* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)"
This reverts commit d6069051de7165a4e06662c89257f5d2905bb156.
ggml-ci
Kerfuffle [Sat, 4 Nov 2023 22:20:34 +0000 (16:20 -0600)]
gguf-py: Support 01.AI Yi models (#3943)
Peter Sugihara [Fri, 3 Nov 2023 19:18:18 +0000 (12:18 -0700)]
metal : round up to 16 to fix MTLDebugComputeCommandEncoder assertion (#3938)
Xiao-Yong Jin [Fri, 3 Nov 2023 18:00:31 +0000 (13:00 -0500)]
ggml-metal: fix yarn rope (#3937)
slaren [Fri, 3 Nov 2023 11:13:09 +0000 (12:13 +0100)]
ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921)
Georgi Gerganov [Fri, 3 Nov 2023 07:41:17 +0000 (09:41 +0200)]
speculative : change default p_accept to 0.5 + CLI args (#3919)
ggml-ci
Georgi Gerganov [Fri, 3 Nov 2023 07:24:00 +0000 (09:24 +0200)]
common : YAYF (yet another YARN fix) (#3925)
ggml-ci
cebtenzzre [Fri, 3 Nov 2023 06:31:58 +0000 (02:31 -0400)]
llama : change yarn_ext_factor placeholder to -1 (#3922)
Kerfuffle [Thu, 2 Nov 2023 19:58:22 +0000 (13:58 -0600)]
cuda : add ROCM aliases for CUDA pool stuff (#3918)
Andrei [Thu, 2 Nov 2023 19:40:31 +0000 (15:40 -0400)]
cmake : fix relative path to git submodule index (#3915)
Georgi Gerganov [Thu, 2 Nov 2023 18:44:12 +0000 (20:44 +0200)]
readme : add notice about #3912
Georgi Gerganov [Thu, 2 Nov 2023 18:32:11 +0000 (20:32 +0200)]
cuda : fix const ptrs warning causing ROCm build issues (#3913)
Oleksii Maryshchenko [Thu, 2 Nov 2023 17:10:39 +0000 (18:10 +0100)]
cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)
* Use CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, fall back to the old implementation (a minimal sketch of this approach follows below).
* Removed redundant cublasSetStream
---------
Co-authored-by: Oleksii Maryshchenko <redacted>
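A minimal sketch of the approach described above, assuming the standard CUDA runtime API; it is not the actual ggml-cuda code (and this change was later reverted in #3944, listed above).
```c++
// Hedged sketch: use the async, pool-backed allocator when the device supports it.
#include <cuda_runtime.h>

static void * device_alloc(size_t size, int device, cudaStream_t stream) {
    int pools_supported = 0;
    cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);

    void * ptr = nullptr;
    if (pools_supported) {
        cudaMallocAsync(&ptr, size, stream);   // served from the device's default memory pool
    } else {
        cudaMalloc(&ptr, size);                // old implementation
    }
    return ptr;
}

static void device_free(void * ptr, int device, cudaStream_t stream) {
    int pools_supported = 0;
    cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);
    if (pools_supported) {
        cudaFreeAsync(ptr, stream);
    } else {
        cudaFree(ptr);
    }
}
```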
Georgi Gerganov [Thu, 2 Nov 2023 14:22:30 +0000 (16:22 +0200)]
gguf : print error for GGUFv1 files (#3908)
slaren [Thu, 2 Nov 2023 12:10:33 +0000 (13:10 +0100)]
cmake : disable LLAMA_NATIVE by default (#3906)
Georgi Gerganov [Thu, 2 Nov 2023 09:20:21 +0000 (11:20 +0200)]
gguf : remove special-case code for GGUFv1 (#3901)
ggml-ci
Georgi Gerganov [Thu, 2 Nov 2023 07:54:18 +0000 (09:54 +0200)]
llm : prevent 1-D tensors from being GPU split (#3697)
cebtenzzre [Thu, 2 Nov 2023 06:50:16 +0000 (02:50 -0400)]
build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <redacted>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <redacted>
Georgi Gerganov [Thu, 2 Nov 2023 06:35:10 +0000 (08:35 +0200)]
cuda : check if this fixes Pascal card regression (#3882)
Georgi Gerganov [Thu, 2 Nov 2023 06:33:37 +0000 (08:33 +0200)]
metal : fix build errors and kernel sig after #2268 (#3898)
cebtenzzre [Thu, 2 Nov 2023 05:49:44 +0000 (01:49 -0400)]
cuda : fix RoPE after #2268 (#3897)
cebtenzzre [Wed, 1 Nov 2023 23:29:14 +0000 (19:29 -0400)]
llama : fix llama_context_default_params after #2268 (#3893)
slaren [Wed, 1 Nov 2023 22:10:09 +0000 (23:10 +0100)]
ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891)
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
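A hedged illustration of the technique named in this entry: the per-batch pointer arrays that cublasGemmBatchedEx consumes are filled by a small kernel on the device instead of being assembled on the host and copied over. The strides and types here are assumptions for the sketch, not the actual ggml-cuda code.
```cuda
// Fill the A/B/C pointer arrays for a batched GEMM directly on the GPU.
#include <cuda_runtime.h>

static __global__ void set_batch_ptrs(
        const char * src0, const char * src1, char * dst,
        const void ** ptrs_src0, const void ** ptrs_src1, void ** ptrs_dst,
        size_t nb0, size_t nb1, size_t nbd, int n_batch) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_batch) return;
    ptrs_src0[i] = src0 + (size_t) i * nb0;   // i-th matrix of the first operand
    ptrs_src1[i] = src1 + (size_t) i * nb1;   // i-th matrix of the second operand
    ptrs_dst [i] = dst  + (size_t) i * nbd;   // i-th output matrix
}
```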
cebtenzzre [Wed, 1 Nov 2023 22:04:33 +0000 (18:04 -0400)]
llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <redacted>
Co-authored-by: Jeffrey Quesnelle <redacted>
Georgi Gerganov [Wed, 1 Nov 2023 21:08:30 +0000 (23:08 +0200)]
llm : fix llm_build_kqv taking unused tensor (benign, #3837)
Georgi Gerganov [Wed, 1 Nov 2023 21:00:50 +0000 (23:00 +0200)]
llm : fix falcon norm after refactoring (#3837)
Georgi Gerganov [Wed, 1 Nov 2023 19:25:00 +0000 (21:25 +0200)]
metal : multi-simd softmax (#3710)
ggml-ci
Georgi Gerganov [Wed, 1 Nov 2023 19:15:55 +0000 (21:15 +0200)]
common : minor (#3715)
Georgi Gerganov [Wed, 1 Nov 2023 18:11:02 +0000 (20:11 +0200)]
llm : add llm_build_context (#3881)
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
bandoti [Wed, 1 Nov 2023 17:42:01 +0000 (14:42 -0300)]
common : allow caller to handle help/argument exceptions (#3715)
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <redacted>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
staviq [Wed, 1 Nov 2023 14:18:27 +0000 (15:18 +0100)]
log : make generating separate log files optional (#3787)
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <redacted>
* Update common/log.h
Co-authored-by: cebtenzzre <redacted>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <redacted>
---------
Co-authored-by: cebtenzzre <redacted>
l3utterfly [Wed, 1 Nov 2023 13:40:43 +0000 (21:40 +0800)]
sampling : null grammar field after reset (#3885)
Georgi Gerganov [Wed, 1 Nov 2023 11:50:45 +0000 (13:50 +0200)]
ggml : fix UNUSED macro (#3762)
Andrew Godfrey [Wed, 1 Nov 2023 11:49:04 +0000 (04:49 -0700)]
finetune : add -ngl parameter (#3762)
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: cebtenzzre <redacted>
Georgi Gerganov [Wed, 1 Nov 2023 09:29:07 +0000 (11:29 +0200)]
scripts : add server-llm.sh (#3868)
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
Adrian Hesketh [Wed, 1 Nov 2023 09:28:28 +0000 (09:28 +0000)]
server : re-enable completion and embedded at the same time (#3876)
Georgi Gerganov [Wed, 1 Nov 2023 06:04:02 +0000 (08:04 +0200)]
llama : refactor graph build code (#3837)
* llama : factor out ggml-alloc from graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
kalomaze [Tue, 31 Oct 2023 19:44:49 +0000 (14:44 -0500)]
samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841)
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: cebtenzzre <redacted>
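A minimal sketch of the Min-P rule described above, applied to a plain probability vector; the real sampler operates on llama_token_data arrays and respects a min_keep count.
```c++
// Keep only tokens whose probability is at least p times the top probability.
#include <algorithm>
#include <vector>

static std::vector<float> min_p_filter(std::vector<float> probs, float p /* e.g. 0.05f */) {
    const float p_max  = *std::max_element(probs.begin(), probs.end());
    const float thresh = p * p_max;          // tokens below this are discarded
    for (float & x : probs) {
        if (x < thresh) x = 0.0f;
    }
    // renormalize the surviving tokens
    float sum = 0.0f;
    for (float x : probs) sum += x;
    if (sum > 0.0f) {
        for (float & x : probs) x /= sum;
    }
    return probs;
}
```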
Tungsten842 [Tue, 31 Oct 2023 17:24:03 +0000 (18:24 +0100)]
flake.nix: fix for rocm 5.7 (#3853)
Georgi Gerganov [Mon, 30 Oct 2023 17:19:15 +0000 (19:19 +0200)]
ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861)
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
Kerfuffle [Sun, 29 Oct 2023 17:31:40 +0000 (11:31 -0600)]
Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
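A hedged usage sketch of the extended call, as I understand the API after this change: a negative seq_id matches any sequence, and negative p0/p1 leave that side of the position range open.
```c++
#include "llama.h"

static void clear_and_trim(llama_context * ctx) {
    // remove everything from every sequence (what llama_kv_cache_clear boils down to)
    llama_kv_cache_seq_rm(ctx, -1, -1, -1);

    // remove positions [32, 64) from sequence 0 only
    llama_kv_cache_seq_rm(ctx, 0, 32, 64);
}
```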
cebtenzzre [Sun, 29 Oct 2023 16:33:47 +0000 (12:33 -0400)]
make : remove unnecessary dependency on build-info.h (#3842)
Georgi Gerganov [Sun, 29 Oct 2023 16:32:51 +0000 (18:32 +0200)]
llama : fix kv shift bug (#3835)
ggml-ci
Georgi Gerganov [Sun, 29 Oct 2023 16:32:28 +0000 (18:32 +0200)]
ggml : quantization refactoring (#3833)
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <redacted>
Erik Scholz [Sat, 28 Oct 2023 14:41:07 +0000 (16:41 +0200)]
flake : update flake.lock for newer transformers version + provide extra dev shell (#3797)
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)
Aarni Koskela [Sat, 28 Oct 2023 12:43:01 +0000 (15:43 +0300)]
metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793)
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Sat, 28 Oct 2023 12:25:33 +0000 (15:25 +0300)]
issues : change label from bug to bug-unconfirmed (#3748)
Georgi Gerganov [Sat, 28 Oct 2023 12:25:15 +0000 (15:25 +0300)]
convert : ignore tokens if their IDs are within [0, vocab_size) (#3831)
Kerfuffle [Sat, 28 Oct 2023 11:54:24 +0000 (05:54 -0600)]
llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
Georgi Gerganov [Sat, 28 Oct 2023 11:23:11 +0000 (14:23 +0300)]
llama : add option for greedy sampling with probs (#3813)
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
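A hedged sketch of the rule in the last bullet, using the llama.h sampling calls of that period; treat the exact signatures as assumptions.
```c++
#include "llama.h"

static llama_token sample_greedy(llama_context * ctx, llama_token_data_array * cands, float temp) {
    if (temp < 0.0f) {
        // fill cands[i].p so callers can read the probabilities of the candidates
        llama_sample_softmax(ctx, cands);
    }
    // temp == 0.0f: plain argmax, no probabilities computed
    return llama_sample_token_greedy(ctx, cands);
}
```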
Henk Poley [Sat, 28 Oct 2023 10:16:33 +0000 (12:16 +0200)]
common : print that one line of the syntax help *also* to standard output (#3823)
Georgi Gerganov [Sat, 28 Oct 2023 09:06:08 +0000 (12:06 +0300)]
starcoder : add GPU offloading (#3827)
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
Kerfuffle [Fri, 27 Oct 2023 21:40:07 +0000 (15:40 -0600)]
speculative : ensure draft and target model vocab matches (#3812)
* speculative: Ensure draft and target model vocab matches
* Tolerate small differences when checking dft vs tgt vocab
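A hedged sketch of the kind of check this entry describes, written over plain string vocabularies rather than the actual llama.cpp model API: the sizes may differ slightly, but the overlapping token texts must agree.
```c++
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

static bool vocab_matches(const std::vector<std::string> & tgt,
                          const std::vector<std::string> & dft,
                          int size_tolerance = 100) {
    // tolerate a small difference in vocab size between target and draft
    if (std::abs((int) tgt.size() - (int) dft.size()) > size_tolerance) {
        return false;
    }
    // the tokens both models share must have identical text
    const size_t n = std::min(tgt.size(), dft.size());
    for (size_t i = 0; i < n; ++i) {
        if (tgt[i] != dft[i]) {
            return false;
        }
    }
    return true;
}
```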
cebtenzzre [Fri, 27 Oct 2023 21:33:53 +0000 (17:33 -0400)]
llama : correctly report GGUFv3 format (#3818)
Thibault Terrasson [Fri, 27 Oct 2023 14:37:41 +0000 (16:37 +0200)]
simple : fix batch handling (#3803)
Georgi Gerganov [Fri, 27 Oct 2023 14:01:23 +0000 (17:01 +0300)]
cuda : improve text-generation and batched decoding performance (#3776)
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
Georgi Gerganov [Thu, 26 Oct 2023 19:53:37 +0000 (22:53 +0300)]
server : do not release slot on image input (#3798)
Georgi Gerganov [Wed, 25 Oct 2023 07:26:27 +0000 (10:26 +0300)]
batched-bench : print params at start
Georgi Gerganov [Wed, 25 Oct 2023 07:09:16 +0000 (10:09 +0300)]
log : disable pid in log filenames
cebtenzzre [Tue, 24 Oct 2023 20:10:43 +0000 (16:10 -0400)]
server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <redacted>
Co-authored-by: Michael Coppola <redacted>
Georgi Gerganov [Tue, 24 Oct 2023 20:08:20 +0000 (23:08 +0300)]
server : do not block system prompt update (#3767)
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
Georgi Gerganov [Tue, 24 Oct 2023 18:51:20 +0000 (21:51 +0300)]
sync : ggml (conv ops + cuda MSVC fixes) (#3765)
ggml-ci
John Smith [Tue, 24 Oct 2023 17:48:45 +0000 (01:48 +0800)]
cmake : add missed dependencies (#3763)
Georgi Gerganov [Tue, 24 Oct 2023 13:48:37 +0000 (16:48 +0300)]
cuda : add batched cuBLAS GEMM for faster attention (#3749)
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I'm trying to create a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <redacted>
Galunid [Tue, 24 Oct 2023 07:17:17 +0000 (09:17 +0200)]
Add more tokenizer tests (#3742)
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
Georgi Gerganov [Tue, 24 Oct 2023 06:46:50 +0000 (09:46 +0300)]
metal : handle ggml_scale for n%4 != 0 (close #3754)
ggml-ci
Georgi Gerganov [Mon, 23 Oct 2023 20:46:05 +0000 (23:46 +0300)]
Revert "make : add optional CUDA_NATIVE_ARCH (#2482)"
This reverts commit 96981f37b1e3f450d9e63e571514217bf60f0a7f.
See: https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
M. Yusuf Sarıgöz [Mon, 23 Oct 2023 19:57:16 +0000 (22:57 +0300)]
issues : separate bug and enhancement template + no default title (#3748)
Galunid [Mon, 23 Oct 2023 19:46:00 +0000 (21:46 +0200)]
Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746)
We still have the heads up in `README.md` regarding `bpe` tokenizers and this patch is needed for
- a couple of tokenizer tests
- some more `special` and `non-special` added tokens handling (as far as I understand it)
* Update special token handling
* Add mpt
Marcus Dunn [Mon, 23 Oct 2023 19:40:03 +0000 (12:40 -0700)]
llama : remove token functions with `context` args in favor of `model` (#3720)
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <redacted>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <redacted>
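A hedged before/after sketch of the API change in this entry; the signatures are as I understand them after this pull and should be treated as assumptions.
```c++
// Token accessors now take the model instead of the context.
#include <cstdio>
#include "llama.h"

static void show_special_tokens(const llama_model * model) {
    // previously these helpers required a llama_context; the model is now enough
    const llama_token bos = llama_token_bos(model);
    const llama_token eos = llama_token_eos(model);

    printf("bos: %d '%s'\n", bos, llama_token_get_text(model, bos));
    printf("eos: %d '%s'\n", eos, llama_token_get_text(model, eos));
}
```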
Galunid [Mon, 23 Oct 2023 15:47:03 +0000 (17:47 +0200)]
Fix baichuan convert script not detecting model (#3739)
It seems nobody objects.