llama : more tokenizer fixes (#2810)
author Georgi Gerganov <redacted>
Sun, 27 Aug 2023 11:19:19 +0000 (14:19 +0300)
committer GitHub <redacted>
Sun, 27 Aug 2023 11:19:19 +0000 (14:19 +0300)
commit edd4c1481708fcd788b0e423268304fd26e2b125
tree 2e7db62ea4816dc18f2518a08c36b6ea480eff05
parent 1591e2e590762011b43b10a9b6e04f13f98f2aa5

* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <redacted>
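
The whitespace-prefix and piece-versus-text items above can be illustrated with a minimal sketch. It is not the code from this commit: the helper names (`escape_for_spm`, `piece_to_text`, `detokenize`) are hypothetical, and it assumes the common SentencePiece convention in which a space is stored inside vocabulary pieces as U+2581. Under that assumption, tokenization prepends one space to the input, and detokenization maps pieces back to text and drops that one leading space again.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; these are NOT llama.cpp functions.
// Assumption: pieces encode a space as U+2581 (UTF-8 bytes E2 96 81).
static const std::string kSpmSpace = "\xE2\x96\x81";

// Prefix the input with one space, then replace every ' ' with U+2581.
// The result is the escaped string a SentencePiece-style tokenizer would split.
std::string escape_for_spm(const std::string & text) {
    std::string out;
    for (char c : " " + text) {
        if (c == ' ') {
            out += kSpmSpace;
        } else {
            out += c;
        }
    }
    return out;
}

// Convert a single vocabulary piece into decoded text (U+2581 -> ' ').
std::string piece_to_text(const std::string & piece) {
    std::string out;
    std::size_t pos = 0;
    while (pos < piece.size()) {
        if (piece.compare(pos, kSpmSpace.size(), kSpmSpace) == 0) {
            out += ' ';
            pos += kSpmSpace.size();
        } else {
            out += piece[pos++];
        }
    }
    return out;
}

// Join decoded pieces and remove the single space added before tokenization.
std::string detokenize(const std::vector<std::string> & pieces) {
    std::string out;
    for (const std::string & piece : pieces) {
        out += piece_to_text(piece);
    }
    if (!out.empty() && out[0] == ' ') {
        out.erase(0, 1);
    }
    return out;
}
```

With these helpers, escaping `Hello world` yields two word-initial U+2581 markers, and detokenizing the corresponding pieces gives back the original string without the added leading space. Keeping the piece form and the decoded form as separate steps is the point of the sketch: callers no longer add or strip the leading space by hand.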
20 files changed:
common/common.cpp
common/common.h
examples/beam_search/beam_search.cpp
examples/embd-input/embd-input-lib.cpp
examples/embedding/embedding.cpp
examples/main/main.cpp
examples/perplexity/perplexity.cpp
examples/save-load-state/save-load-state.cpp
examples/server/server.cpp
examples/simple/simple.cpp
examples/train-text-from-scratch/train-text-from-scratch.cpp
llama.cpp
llama.h
tests/CMakeLists.txt
tests/test-tokenizer-0-falcon.cpp [new file with mode: 0644]
tests/test-tokenizer-0-falcon.py [new file with mode: 0644]
tests/test-tokenizer-0-llama.cpp [new file with mode: 0644]
tests/test-tokenizer-0-llama.py [new file with mode: 0644]
tests/test-tokenizer-0.cpp [deleted file]
tests/test-tokenizer-1.cpp