author	compilade <redacted>
	Sun, 14 Jul 2024 03:35:10 +0000 (23:35 -0400)
committer	GitHub <redacted>
	Sun, 14 Jul 2024 03:35:10 +0000 (23:35 -0400)
commit	fa79495bb4897953a75607addd9d2cdd2ec63222
tree	a6ff568ea390e3c98e2a433077db1021ecf68fd6
parent	17eb6aa8a992cda37ee65cf848d9289bd6cad860

llama : fix pre-tokenization of non-special added tokens (#8228)

* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

The HF tokenizer for Gemma seems to behave oddly: it prefers
the 16-space token over longer space tokens,
whereas the SentencePiece tokenizer does not.
(The implementation in llama.cpp behaves the same way as SentencePiece.)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with #8379

* test-tokenizer-random : add a failing edge case for falcon
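
For illustration only, the idea of pre-tokenizing user-defined tokens first
can be sketched as below. This is not the code from src/llama.cpp; the
function name partition_on_added_tokens and its return shape are hypothetical.
The sketch assumes added tokens are matched literally, longest first, and that
only the remaining fragments would then go through the regex pre-tokenizer.

```python
import re


def partition_on_added_tokens(text, added_tokens):
    """Split text into (fragment, is_added_token) pairs.

    Hypothetical helper: added tokens are matched literally before any
    regex-based pre-tokenization is applied to the other fragments.
    """
    tokens = [t for t in added_tokens if t]  # ignore empty strings
    if not tokens:
        return [(text, False)] if text else []
    # Try longer tokens first so that, where candidates overlap,
    # the longest literal match at a given position wins.
    tokens.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))

    parts = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            # Plain text between added tokens; a real tokenizer would
            # pass this fragment on to its pre-tokenizer regex.
            parts.append((text[pos:m.start()], False))
        parts.append((m.group(0), True))
        pos = m.end()
    if pos < len(text):
        parts.append((text[pos:], False))
    return parts
```

With an HTML tag or a run of spaces registered as an added token, the tag or
run is kept as a single fragment instead of being split by the pre-tokenizer
regex, which is the kind of input the Gemma note above refers to.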

convert_hf_to_gguf.py
src/llama.cpp
tests/test-tokenizer-0.cpp
tests/test-tokenizer-random.py