git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
Work on the BPE tokenizer (#3252)
author goerch <redacted>
Tue, 3 Oct 2023 07:16:26 +0000 (09:16 +0200)
committer GitHub <redacted>
Tue, 3 Oct 2023 07:16:26 +0000 (09:16 +0200)
commit ff5a3f0c09dfa0a8e0bf76d1748df5c6dee0e8ff
tree 356ce471234d1f82db452e6274a951ac0b72cb9f
parent 1c84003c08027f5d3a4cb876f51d6b6224a34d0e

* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicode code points

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code paths

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <redacted>

* Update unicode.h

Co-authored-by: Georgi Gerganov <redacted>

* Update unicode.h

Co-authored-by: Georgi Gerganov <redacted>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <redacted>
15 files changed:
.gitignore
Makefile
common/common.cpp
convert-falcon-hf-to-gguf.py
convert-gptneox-hf-to-gguf.py
convert-starcoder-hf-to-gguf.py
convert.py
llama.cpp
models/ggml-vocab-aquila.gguf [new file with mode: 0644]
models/ggml-vocab-falcon.gguf [new file with mode: 0644]
tests/CMakeLists.txt
tests/test-tokenizer-0-falcon.cpp
tests/test-tokenizer-1-bpe.cpp [new file with mode: 0644]
tests/test-tokenizer-1-llama.cpp
unicode.h [new file with mode: 0644]