examples : add tokenization tests and refactor code (#186)
author    jaeminSon <redacted>
          Sat, 27 May 2023 08:47:34 +0000 (17:47 +0900)
committer GitHub <redacted>
          Sat, 27 May 2023 08:47:34 +0000 (11:47 +0300)
commit    765c9bce37a53c36da4723e9f014874c27b69aa1
tree      c2b9b0c43c42f4381c2dec9bc5c4d43c4aa78ff0
parent    9c285d7fcd67760f51630511fae8ac6967cb33d5
examples : add tokenization tests and refactor code (#186)

* examples : [refactor] remove unnecessary lines and segments

* examples : [feature] add tokenization test for gpt-neox

* examples : [feature] handle multibyte character sets (see the UTF-8 sketch after this list)

* examples : [refactor] find the longest token for each word (see the longest-match sketch after this list)

* examples : [refactor] move test_tokenizer to common.cpp since the function applies to the other models as well

* add 'test_tokenizer' function after loading the model

* examples : [feature] add test cases for checking tokenization

* examples : [feature] tokenize with huggingface tokenizers for currently supported models

* examples : add tokenization test cases for each model

* revert conversion from string to utf-8 encoded byte strings

* [refactor] make util functions for testing tokenizers available

* [bug fix] test replit using functions and variables (e.g. tokenizer struct, tokenization method) defined in its main.cpp

* [refactor] modify function name test_tokenizer -> test_gpt_tokenizer

* [refactor] put braces around single-line for-loops and if-statements

* [refactor] drop <filesystem> and use <iostream> and <dirent.h> instead

* [refactor] remove the 'find_test_file' function and set the test file path directly in 'test_gpt_tokenizer'

* call the tokenizer test function with the filename specified

* revert the tokenizer test in replit (replit uses separate methods for tokenization and decoding)

* compare vectors of token ids to check whether two tokenizations are identical (see the comparison sketch after this list).

* write token ids instead of strings.

* [refactor] use --token_test rather than --test for token-test argument

* add English test cases

* update test cases with more English prompts

* examples : tokenizer testing fixes
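
A note on the multibyte handling mentioned above: the gist is to never split a UTF-8 sequence in the middle, so that Chinese, Japanese, or Korean prompts tokenize correctly. A minimal sketch of the idea (not the actual code in examples/common.cpp):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Byte length of the UTF-8 code point starting at lead byte `c`
    // (1 for ASCII, 2-4 for multibyte sequences, 1 as a fallback).
    static size_t utf8_len(uint8_t c) {
        if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
        return 1;                         // invalid lead byte: treat as a single byte
    }

    // Split text into whole UTF-8 code points so multibyte characters are never cut in half.
    static std::vector<std::string> split_utf8(const std::string & text) {
        std::vector<std::string> out;
        for (size_t i = 0; i < text.size(); ) {
            const size_t n = std::min(utf8_len((uint8_t) text[i]), text.size() - i);
            out.push_back(text.substr(i, n));
            i += n;
        }
        return out;
    }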
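
The "longest token for each word" refactor is essentially greedy longest-prefix matching against the vocabulary. A minimal sketch of that idea, assuming a plain map from token string to id (the actual gpt_tokenize in examples/common.cpp may differ in details such as unknown-token handling):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Greedily take the longest vocabulary entry that prefixes the remaining
    // word; skip a character if nothing matches (a real tokenizer would map
    // it to a byte-level or unknown token instead).
    static std::vector<int> tokenize_word_longest_match(
            const std::map<std::string, int> & token_to_id,
            const std::string & word) {
        std::vector<int> ids;
        size_t pos = 0;
        while (pos < word.size()) {
            size_t best_len = 0;
            int    best_id  = -1;
            for (size_t len = word.size() - pos; len > 0; --len) {
                const auto it = token_to_id.find(word.substr(pos, len));
                if (it != token_to_id.end()) {
                    best_len = len;
                    best_id  = it->second;
                    break; // longest match found
                }
            }
            if (best_len == 0) { ++pos; continue; }
            ids.push_back(best_id);
            pos += best_len;
        }
        return ids;
    }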
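
Finally, the tokenization check itself reduces to comparing two id sequences: the ids produced by the C++ tokenizer and the ids produced by the reference huggingface tokenizer for the same prompt. A sketch of that comparison (the actual test_gpt_tokenizer signature and the --token_test file format are not reproduced here):

    #include <cstdio>
    #include <vector>

    // Return true if the C++ tokenizer reproduced the reference ids exactly.
    static bool check_tokenization(const std::vector<int> & got,
                                   const std::vector<int> & expected) {
        if (got == expected) {
            return true;
        }
        fprintf(stderr, "tokenization mismatch: got %zu ids, expected %zu ids\n",
                got.size(), expected.size());
        return false;
    }

With the new --token_test argument, a test run would be invoked with something like --token_test ../examples/prompts/gpt-2.txt (the exact path depends on the model being tested).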

---------

Co-authored-by: Georgi Gerganov <redacted>
22 files changed:
examples/common.cpp
examples/common.h
examples/dolly-v2/main.cpp
examples/gpt-2/main.cpp
examples/gpt-j/main.cpp
examples/gpt-neox/convert-h5-to-ggml.py
examples/gpt-neox/main.cpp
examples/mpt/main.cpp
examples/prompts/dolly-v2.txt [new file with mode: 0644]
examples/prompts/gpt-2-chinese.txt [new file with mode: 0644]
examples/prompts/gpt-2.txt [new file with mode: 0644]
examples/prompts/gpt-j.txt [new file with mode: 0644]
examples/prompts/gpt-neox-japanese.txt [new file with mode: 0644]
examples/prompts/gpt-neox.txt [new file with mode: 0644]
examples/prompts/polyglot-ko.txt [new file with mode: 0644]
examples/prompts/replit.txt [new file with mode: 0644]
examples/prompts/starcoder.txt [new file with mode: 0644]
examples/prompts/test-cases.txt [new file with mode: 0644]
examples/prompts/tokenize_huggingface.py [new file with mode: 0644]
examples/prompts/whisper.txt [new file with mode: 0644]
examples/replit/main.cpp
examples/starcoder/main.cpp