git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit
Detokenizer fixes (#8039)
author jaime-m-p <redacted>
Fri, 5 Jul 2024 17:01:35 +0000 (19:01 +0200)
committer GitHub <redacted>
Fri, 5 Jul 2024 17:01:35 +0000 (19:01 +0200)
commit 213701b51a17175d0d326b566efc03f30ec7fbe6
tree 0ac8bc65ea9d6be2ec201471dc682a1629fc577f
parent be20e7f49d9e5c6d9e8d9b4871eeba3df7a1639d

* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()
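
The clean_up_tokenization_spaces flag can be illustrated with a minimal Python sketch. It assumes the flag follows the replacement list used by Hugging Face transformers for its option of the same name; the C++ code in this commit may differ in detail, and the function below is illustrative only.

```python
# Illustrative sketch, not the C++ implementation from this commit.
# Assumption: the rule list is modeled on the Hugging Face transformers
# clean-up step, which removes the space before common punctuation and
# English contractions.
CLEANUP_RULES = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
    (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
    (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_up_tokenization_spaces(text: str) -> str:
    """Apply each replacement rule in order to the detokenized text."""
    for old, new in CLEANUP_RULES:
        text = text.replace(old, new)
    return text


print(clean_up_tokenization_spaces("Hello , world !"))  # Hello, world!
```

The rules are plain substring replacements applied in a fixed order, so the clean-up is cheap but English-specific.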

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Report an unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If you don't know the vocab type in advance
    - Differentiate it from other loading errors
  - Skip unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint
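
Skipping surrogates and undefined codepoints can be sketched in Python with the standard unicodedata module. This is a hedged illustration, not the test code from this commit; the helper names are hypothetical, and which codepoints count as undefined depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_testable_codepoint(cp: int) -> bool:
    """Return False for surrogates (Cs) and unassigned codepoints (Cn)."""
    category = unicodedata.category(chr(cp))
    return category not in ("Cs", "Cn")


def testable_codepoints(start: int = 0, stop: int = 0x110000):
    """Yield every codepoint in [start, stop) that a test should exercise."""
    for cp in range(start, stop):
        if is_testable_codepoint(cp):
            yield cp


print(is_testable_codepoint(0x41))    # True: LATIN CAPITAL LETTER A
print(is_testable_codepoint(0xD800))  # False: surrogate
```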

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check
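
Replacing errors with '\uFFFD' when detokenizing to 'utf-8' corresponds to Python's built-in "replace" error handler. A minimal sketch, with a hypothetical helper name, of how raw detokenizer bytes could be turned into comparable text:

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode detokenizer output, mapping invalid UTF-8 to U+FFFD."""
    # errors="replace" substitutes the replacement character instead of
    # raising UnicodeDecodeError on malformed input.
    return raw.decode("utf-8", errors="replace")


print(repr(bytes_to_text(b"ab\xffcd")))  # 'ab\ufffdcd'
```

Decoding both sides of a comparison this way keeps a random test running when a tokenizer emits a partial byte sequence, while still making the mismatch visible in the result.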

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined unicode codepoints
* 'viking' detokenizer: clean up spaces
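
The interaction between add_space_prefix and leading space removal can be sketched for an SPM-style vocab, where the marker U+2581 stands for a space. This is a simplified illustration under stated assumptions: join_spm_pieces is a hypothetical helper, special tokens are not modeled, and the actual logic lives in src/llama.cpp.

```python
SPM_SPACE = "\u2581"  # SentencePiece word-boundary marker


def join_spm_pieces(pieces: list[str], add_space_prefix: bool) -> str:
    """Join token pieces and undo the space prefix added at tokenization."""
    text = "".join(pieces).replace(SPM_SPACE, " ")
    # If the tokenizer prepended a space, drop exactly one leading space
    # so that detokenize(tokenize(s)) can return s unchanged.
    if add_space_prefix and text.startswith(" "):
        text = text[1:]
    return text


pieces = ["\u2581Hello", "\u2581world"]
print(repr(join_spm_pieces(pieces, add_space_prefix=True)))   # 'Hello world'
print(repr(join_spm_pieces(pieces, add_space_prefix=False)))  # ' Hello world'
```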
common/common.cpp
common/common.h
examples/batched.swift/Sources/main.swift
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
include/llama.h
src/llama.cpp
src/unicode.cpp
tests/test-tokenizer-0.cpp
tests/test-tokenizer-1-bpe.cpp
tests/test-tokenizer-1-spm.cpp
tests/test-tokenizer-random.py