git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
author Georgi Gerganov <redacted>
Sat, 4 May 2024 05:32:32 +0000 (08:32 +0300)
committer GitHub <redacted>
Sat, 4 May 2024 05:32:32 +0000 (08:32 +0300)
commit 92139b90af4841d7fd060b526bdd443b621770ff
tree 9679c3de1b39970ca73b5bd988c63ddac0359ca6
parent a2ac89d6efb41b535778bfeaecaae8fe295b6ed3

* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
41 files changed:
.flake8
Makefile
convert-hf-to-gguf-update.py
convert-hf-to-gguf.py
llama.cpp
llama.h
models/ggml-vocab-bert-bge.gguf.inp
models/ggml-vocab-bert-bge.gguf.out
models/ggml-vocab-deepseek-coder.gguf.inp
models/ggml-vocab-deepseek-coder.gguf.out
models/ggml-vocab-deepseek-llm.gguf.inp
models/ggml-vocab-deepseek-llm.gguf.out
models/ggml-vocab-falcon.gguf.inp
models/ggml-vocab-falcon.gguf.out
models/ggml-vocab-gpt-2.gguf.inp
models/ggml-vocab-gpt-2.gguf.out
models/ggml-vocab-llama-bpe.gguf.inp
models/ggml-vocab-llama-bpe.gguf.out
models/ggml-vocab-llama-spm.gguf.inp
models/ggml-vocab-llama-spm.gguf.out
models/ggml-vocab-mpt.gguf.inp
models/ggml-vocab-mpt.gguf.out
models/ggml-vocab-phi-3.gguf
models/ggml-vocab-phi-3.gguf.inp
models/ggml-vocab-phi-3.gguf.out
models/ggml-vocab-refact.gguf
models/ggml-vocab-refact.gguf.inp [new file with mode: 0644]
models/ggml-vocab-refact.gguf.out [new file with mode: 0644]
models/ggml-vocab-starcoder.gguf.inp
models/ggml-vocab-starcoder.gguf.out
scripts/gen-unicode-data.py [new file with mode: 0644]
tests/CMakeLists.txt
tests/test-tokenizer-0-bpe.py [deleted file]
tests/test-tokenizer-0-spm.py [deleted file]
tests/test-tokenizer-0.cpp
tests/test-tokenizer-0.py [new file with mode: 0644]
tests/test-tokenizer-0.sh [new file with mode: 0755]
unicode-data.cpp
unicode-data.h
unicode.cpp
unicode.h
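
The "unicode : add all unicode number ranges" and "unicode : digit -> number" items in the commit message suggest a move from decimal digits only to the wider Unicode number categories. The following is a minimal sketch of that distinction, assuming "number" means the general categories Nd, Nl and No. It uses Python's standard unicodedata module and is not the code from scripts/gen-unicode-data.py; the helper names is_number, is_decimal_digit and number_ranges are hypothetical.

```python
import unicodedata


def is_number(ch: str) -> bool:
    """True for any code point whose general category is Nd, Nl or No."""
    return unicodedata.category(ch).startswith("N")


def is_decimal_digit(ch: str) -> bool:
    """True only for decimal digits (general category Nd)."""
    return unicodedata.category(ch) == "Nd"


def number_ranges(limit: int = 0x110000) -> list[tuple[int, int]]:
    """Collapse all number code points below `limit` into inclusive ranges."""
    ranges: list[tuple[int, int]] = []
    start = None
    for cp in range(limit):
        if is_number(chr(cp)):
            if start is None:
                start = cp  # open a new range
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges
```

Under this assumption, a character such as SUPERSCRIPT TWO (U+00B2, category No) counts as a number but not as a decimal digit, so a pre-tokenizer pattern built on the narrower class would treat it differently.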