From: Ronsor
Date: Wed, 15 Mar 2023 19:37:50 +0000 (-0700)
Subject: Use `tokenizer.vocab_size()` instead of hardcoding 32000 in convert-pth-to-ggml.py...
X-Git-Tag: gguf-v0.4.0~1241
X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=956dfda8ad8cea7961e22e0384bbc315bf79aed2;p=pkg%2Fggml%2Fsources%2Fllama.cpp

Use `tokenizer.vocab_size()` instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)

There are ways that special tokens or other new tokens could be added to the
tokenizer; therefore it's probably best not to assume the vocabulary is only
32000 tokens.
---

diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index d2557500..5c36e9c0 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -99,7 +99,7 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
+    for i in range(tokenizer.vocab_size()):
         if tokenizer.is_unknown(i):
             # "<unk>" token (translated as ??)
             text = " \u2047 ".encode("utf-8")
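
A minimal standalone sketch (not part of the commit) of how the converter's
token loop behaves after this change, assuming the sentencepiece Python
package and a LLaMA tokenizer model at models/tokenizer.model (the path and
the exact per-token handling here are illustrative assumptions, loosely
following the script's loop):

    from sentencepiece import SentencePieceProcessor

    tokenizer = SentencePieceProcessor(model_file="models/tokenizer.model")

    # The loop bound now comes from the model itself, so any special or
    # user-defined tokens beyond the base 32000 are included automatically.
    print("vocab size:", tokenizer.vocab_size())

    for i in range(tokenizer.vocab_size()):
        if tokenizer.is_unknown(i):
            # "<unk>" token, rendered as the "??" (U+2047) glyph
            text = " \u2047 ".encode("utf-8")
        elif tokenizer.is_control(i):
            # control tokens such as <s>/</s> carry no surface text
            text = b""
        else:
            # ordinary pieces: map SentencePiece's U+2581 word marker to a space
            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")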