git.djapps.eu Git - pkg/ggml/sources/llama.cpp/commit

author	Kawrakow <redacted>
	Sun, 21 Jan 2024 12:42:44 +0000 (14:42 +0200)
committer	GitHub <redacted>
	Sun, 21 Jan 2024 12:42:44 +0000 (14:42 +0200)
commit	7dcbe39d36b76389f6c5cd3b151928472b7e22ff
tree	d0b13b66cdd5046d5767b4791d183bed3e97c61c
parent	726c0fa9a2da976e9c5d5c51e185d9dd453fc9e5

Add ability to evaluate multiple choice tasks (#5047)

* TruthfulQA: 1st attempt, does not look like it is working

The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess the questions are tricky, and the way I combine
question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
a random-chance result of around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leaderboard results for these two models are
42.2% and 68.3%, respectively.
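
To make the scoring and the random-chance baseline above concrete, here is a
minimal sketch. It is not the code from this commit: the rule "pick the answer
with the highest average token log-probability" is an assumption (one common
way to score such tasks), and the names AnswerLogprobs, pick_answer and
random_chance are hypothetical. The random-chance figure is the mean of
1/n_choices over all tasks, which is why it need not be a round number when
questions have different numbers of answers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input for one task: the per-token log-probabilities of each
// candidate answer, evaluated after the question prompt.
using AnswerLogprobs = std::vector<std::vector<double>>;

// Assumed scoring rule: return the index of the answer whose tokens have the
// highest average log-probability (averaging avoids favouring short answers).
std::size_t pick_answer(const AnswerLogprobs & answers) {
    assert(!answers.empty());
    std::size_t best = 0;
    double best_score = 0.0;
    for (std::size_t i = 0; i < answers.size(); ++i) {
        assert(!answers[i].empty());
        double sum = 0.0;
        for (double lp : answers[i]) {
            sum += lp;
        }
        const double score = sum / static_cast<double>(answers[i].size());
        if (i == 0 || score > best_score) {
            best       = i;
            best_score = score;
        }
    }
    return best;
}

// Expected accuracy of uniform random guessing: the mean of 1/n over all
// tasks, where n is the number of candidate answers of a task.
double random_chance(const std::vector<std::size_t> & n_choices) {
    assert(!n_choices.empty());
    double sum = 0.0;
    for (std::size_t n : n_choices) {
        assert(n > 0);
        sum += 1.0 / static_cast<double>(n);
    }
    return sum / static_cast<double>(n_choices.size());
}
```

A task is counted as correct when pick_answer() returns the index of the
labelled answer; the final score is the fraction of correct tasks, to be
compared against random_chance() for the same dataset.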

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

I had forgotten that MSVC does not make constexpr variables
available inside a lambda.
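
For illustration, one portable workaround for this kind of problem is sketched
below. The commit message does not say which fix was applied, so this is an
assumption, and count_chunks and k_chunk are hypothetical names. Giving the
constant static storage duration means the lambda can use it without
capturing anything, which sidesteps the question of whether the compiler
accepts a plain local constexpr inside the lambda body.

```cpp
#include <cstddef>

// Number of chunks of size k_chunk needed to hold n_items elements.
std::size_t count_chunks(std::size_t n_items) {
    // `static` gives the constant static storage duration, so the lambda
    // below may refer to it without any capture. A plain local
    // `constexpr std::size_t k_chunk = 4;` is what can trip up MSVC here.
    static constexpr std::size_t k_chunk = 4;

    // Capture-less lambda that rounds the division up.
    auto round_up = [](std::size_t n) { return (n + k_chunk - 1) / k_chunk; };

    return round_up(n_items);
}
```

Moving the constant to namespace scope or capturing it explicitly are
alternative workarounds with the same effect.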

---------

Co-authored-by: Iwan Kawrakow <redacted>

common/common.cpp
common/common.h
examples/perplexity/perplexity.cpp