examples : add tokenization tests and refactor code (#186)
author    jaeminSon <redacted>
          Sat, 27 May 2023 08:47:34 +0000 (17:47 +0900)
committer GitHub <redacted>
          Sat, 27 May 2023 08:47:34 +0000 (11:47 +0300)
commit    765c9bce37a53c36da4723e9f014874c27b69aa1
tree      c2b9b0c43c42f4381c2dec9bc5c4d43c4aa78ff0
parent    9c285d7fcd67760f51630511fae8ac6967cb33d5
examples : add tokenization tests and refactor code (#186)

* examples : [refactor] remove unnecessary lines and segments

* examples : [feature] add tokenization test for gpt-neox

* examples : [feature] handle multibyte character sets (see the UTF-8 sketch after this list)

* examples : [refactor] find the longest token for each word (see the longest-match sketch after this list)

* examples : [refactor] move test_tokenizer to common.cpp since the function applies to the other models as well

* add 'test_tokenizer' function after loading the model

* examples : [feature] add test cases for checking tokenization

* examples : [feature] tokenize with huggingface tokenizers for currently supported models

* examples : add tokenization test cases for each model

* revert conversion from string to utf-8 encoded byte strings

* [refactor] make util functions for testing tokenizers available

* [bug fix] test replit using functions and variables (e.g. tokenizer struct, tokenization method) defined in its main.cpp

* [refactor] modify function name test_tokenizer -> test_gpt_tokenizer

* [refactor] put braces around single-line for-loops and if-statements

* [refactor] drop <filesystem> and use <iostream> and <dirent.h> instead

* [refactor] remove the 'find_test_file' function and set the test file path directly in 'test_gpt_tokenizer'

* call the tokenizer test function with the filename specified

* revert the tokenizer test in replit (replit uses separate methods for tokenization and decoding)

* compare vectors of token ids to check whether two tokenizations are identical (see the comparison sketch after this list).

* write token ids instead of strings.

* [refactor] use --token_test rather than --test for token-test argument

* add English test cases

* update test cases with more English prompts

* examples : tokenizer testing fixes
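
A note on the multibyte handling mentioned above: the gist is to never split a UTF-8 sequence in the middle, so that Chinese, Japanese, or Korean prompts tokenize correctly. A minimal sketch of the idea (not the actual code in examples/common.cpp):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Byte length of the UTF-8 code point starting at lead byte `c`
    // (1 for ASCII, 2-4 for multibyte sequences, 1 as a fallback).
    static size_t utf8_len(uint8_t c) {
        if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
        return 1;                         // invalid lead byte: treat as a single byte
    }

    // Split text into whole UTF-8 code points so multibyte characters are never cut in half.
    static std::vector<std::string> split_utf8(const std::string & text) {
        std::vector<std::string> out;
        for (size_t i = 0; i < text.size(); ) {
            const size_t n = std::min(utf8_len((uint8_t) text[i]), text.size() - i);
            out.push_back(text.substr(i, n));
            i += n;
        }
        return out;
    }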
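
The "longest token for each word" refactor is essentially greedy longest-prefix matching against the vocabulary. A minimal sketch of that idea, assuming a plain map from token string to id (the actual gpt_tokenize in examples/common.cpp may differ in details such as unknown-token handling):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Greedily take the longest vocabulary entry that prefixes the remaining
    // word; skip a character if nothing matches (a real tokenizer would map
    // it to a byte-level or unknown token instead).
    static std::vector<int> tokenize_word_longest_match(
            const std::map<std::string, int> & token_to_id,
            const std::string & word) {
        std::vector<int> ids;
        size_t pos = 0;
        while (pos < word.size()) {
            size_t best_len = 0;
            int    best_id  = -1;
            for (size_t len = word.size() - pos; len > 0; --len) {
                const auto it = token_to_id.find(word.substr(pos, len));
                if (it != token_to_id.end()) {
                    best_len = len;
                    best_id  = it->second;
                    break; // longest match found
                }
            }
            if (best_len == 0) { ++pos; continue; }
            ids.push_back(best_id);
            pos += best_len;
        }
        return ids;
    }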
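
Finally, the tokenization check itself reduces to comparing two id sequences: the ids produced by the C++ tokenizer and the ids produced by the reference huggingface tokenizer for the same prompt. A sketch of that comparison (the actual test_gpt_tokenizer signature and the --token_test file format are not reproduced here):

    #include <cstdio>
    #include <vector>

    // Return true if the C++ tokenizer reproduced the reference ids exactly.
    static bool check_tokenization(const std::vector<int> & got,
                                   const std::vector<int> & expected) {
        if (got == expected) {
            return true;
        }
        fprintf(stderr, "tokenization mismatch: got %zu ids, expected %zu ids\n",
                got.size(), expected.size());
        return false;
    }

With the new --token_test argument, a test run would be invoked with something like --token_test ../examples/prompts/gpt-2.txt (the exact path depends on the model being tested).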

---------

Co-authored-by: Georgi Gerganov <redacted>
22 files changed:
examples/common.cpp
examples/common.h
examples/dolly-v2/main.cpp
examples/gpt-2/main.cpp
examples/gpt-j/main.cpp
examples/gpt-neox/convert-h5-to-ggml.py
examples/gpt-neox/main.cpp
examples/mpt/main.cpp
examples/prompts/dolly-v2.txt [new file with mode: 0644]
examples/prompts/gpt-2-chinese.txt [new file with mode: 0644]
examples/prompts/gpt-2.txt [new file with mode: 0644]
examples/prompts/gpt-j.txt [new file with mode: 0644]
examples/prompts/gpt-neox-japanese.txt [new file with mode: 0644]
examples/prompts/gpt-neox.txt [new file with mode: 0644]
examples/prompts/polyglot-ko.txt [new file with mode: 0644]
examples/prompts/replit.txt [new file with mode: 0644]
examples/prompts/starcoder.txt [new file with mode: 0644]
examples/prompts/test-cases.txt [new file with mode: 0644]
examples/prompts/tokenize_huggingface.py [new file with mode: 0644]
examples/prompts/whisper.txt [new file with mode: 0644]
examples/replit/main.cpp
examples/starcoder/main.cpp