git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
Tokenizer WPM fixes (#7500)
author     jaime-m-p <redacted>
           Tue, 28 May 2024 19:46:34 +0000 (21:46 +0200)
committer  GitHub <redacted>
           Tue, 28 May 2024 19:46:34 +0000 (21:46 +0200)
commit     02c1ecad07f0e2d2febe8196271bcc64bdc9c006
tree       2208298e9ac6bd0743787d02f35b527f7db47d0b
parent     6bd12ce409f949012935b7d1b15a21ffa473a565

* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing.
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
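
The last two items can be illustrated with a minimal Python sketch of generic WordPiece (WPM) tokenization. It is not the code from this commit: the function names (`preprocess`, `wordpiece`, `tokenize`), the `##` continuation prefix, the `[UNK]` token, and the punctuation rule are all assumptions borrowed from the common BERT convention. The sketch splits on whitespace and punctuation in one pass over the text, then applies greedy longest-match per word, and drops every piece already collected for a word as soon as one position has no match.

```python
import unicodedata


def preprocess(text):
    """Split text into words in a single pass (hypothetical helper).

    Whitespace ends the current word; each punctuation character
    becomes a word of its own.
    """
    words, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                words.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocab pieces.

    Assumption: non-initial pieces carry a "##" prefix, as in BERT vocabularies.
    If any position has no match, all pieces found so far are discarded
    and the whole word maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # discard all pieces collected for this word
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Preprocess the text, then tokenize each word independently."""
    return [piece for word in preprocess(text) for piece in wordpiece(word, vocab)]
```

With a toy vocabulary `{"un", "##aff", "##able", "hello", ","}`, the word `unaffxble` matches `un` and `##aff` but nothing after that, so both pieces are dropped and the word becomes a single `[UNK]` rather than a partial split.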
llama.cpp
tests/test-tokenizer-random.py