From: Georgi Gerganov
Date: Sat, 13 May 2023 10:08:56 +0000 (+0300)
Subject: examples : update readme with new quantization usage + remove bug alert
X-Git-Tag: upstream/0.0.1642~1484
X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=0b1df4dd6a9cbcff65685d0e289fb656e00fa773;p=pkg%2Fggml%2Fsources%2Fggml

examples : update readme with new quantization usage + remove bug alert
---

diff --git a/examples/dolly-v2/README.md b/examples/dolly-v2/README.md
index 14069733..377e816b 100644
--- a/examples/dolly-v2/README.md
+++ b/examples/dolly-v2/README.md
@@ -101,7 +101,7 @@ main: total time = 6187.27 ms
 ```bash
 # quantize the model to 5-bits using Q5_0 quantization
-./bin/dollyv2-quantize ./dolly-v2-3b/ggml-model-f16.bin ./dolly-v2-3b/ggml-model-q5_0.bin 8
+./bin/dollyv2-quantize ./dolly-v2-3b/ggml-model-f16.bin ./dolly-v2-3b/ggml-model-q5_0.bin q5_0
 
 # run the quantized model
 ./bin/dollyv2 -m ./dolly-v2-3b/ggml-model-q5_0.bin -p "State the meaning of life." -t 6 -n 64
@@ -182,36 +182,3 @@ main: total time = 2802.51 ms
 - The tokenizer is currently hacked - probably works only for English
 - Non-parallel residual is not supported
 - Contributions and improvements are welcome
-
-## Note about possible bug
-**There might be some issue with this implementation - not 100% sure.
-The embeddings magnitude increases after each layer which is unexpected.
-To observe this, uncomment the following line:**
-https://github.com/ggerganov/ggml/blob/abea4b7609c14b837015ab625e3ac36c4708dd03/src/ggml.c#L9208
-```
-...
-p[ 0] = 65.5842
-p[ 1] = 61.6951
-p[ 2] = 59.3500
-p[ 3] = 61.2421
-p[ 4] = 65.9653
-p[ 5] = 59.4936
-p[ 6] = 58.4164
-p[ 0] = -209.6351
-p[ 1] = -214.0987
-p[ 2] = -217.0928
-p[ 3] = -215.0267
-p[ 4] = -208.2430
-p[ 5] = -215.3692
-p[ 6] = -214.1981
-p[ 0] = -301.0286
-p[ 1] = -308.6521
-p[ 2] = -310.7513
-p[ 3] = -307.0832
-p[ 4] = -299.9238
-p[ 5] = -306.0667
-p[ 6] = -302.1777
-...
-```
-**Instead, I think the magnitude should remain around `1`.
-See https://github.com/ggerganov/llama.cpp/issues/1063#issuecomment-1527730562 for more analysis**
diff --git a/examples/gpt-neox/README.md b/examples/gpt-neox/README.md
index e95a131c..d80338ab 100644
--- a/examples/gpt-neox/README.md
+++ b/examples/gpt-neox/README.md
@@ -56,17 +56,17 @@ main: predict time = 4474.07 ms / 63.92 ms per token
 main: total time = 6911.26 ms
 ```
 
-## 4-bit integer quantization mode
+## 5-bit integer quantization mode
 
 ```bash
-# quantize the model to 4-bits using Q4_3 quantization
-./bin/gpt_neox-quantize ./stablelm-base-alpha-3b/ggml-model-f16.bin ./stablelm-base-alpha-3b/ggml-model-q4_3.bin 6
+# quantize the model to 5-bits using Q5_0 quantization
+./bin/gpt_neox-quantize ./stablelm-base-alpha-3b/ggml-model-f16.bin ./stablelm-base-alpha-3b/ggml-model-q5_0.bin q5_0
 
 # run the quantized model
-./bin/gpt_neox -m ./stablelm-base-alpha-3b/ggml-model-q4_3.bin -p "I believe the meaning of life is" -t 8 -n 64
+./bin/gpt_neox -m ./stablelm-base-alpha-3b/ggml-model-q5_0.bin -p "I believe the meaning of life is" -t 8 -n 64
 
 main: seed = 1682021489
-gpt_neox_model_load: loading model from 'models/stablelm-base-alpha-3b/ggml-model-q4_3.bin' - please wait ...
+gpt_neox_model_load: loading model from 'models/stablelm-base-alpha-3b/ggml-model-q5_0.bin' - please wait ...
 gpt_neox_model_load: n_vocab = 50688
 gpt_neox_model_load: n_ctx = 4096
 gpt_neox_model_load: n_embd = 4096
@@ -105,40 +105,3 @@ main: total time = 4177.68 ms
 - The tokenizer is currently hacked - probably works only for English
 - Non-parallel residual is not supported
 - Contributions and improvements are welcome
-
-## Note about possible bug
-
-**There might be some issue with this implementation - not 100% sure.
-The embeddings magnitude increases after each layer which is unexpected.
-To observe this, uncomment the following line:**
-
-https://github.com/ggerganov/ggml/blob/abea4b7609c14b837015ab625e3ac36c4708dd03/src/ggml.c#L9208
-
-```
-...
-p[ 0] = 65.5842
-p[ 1] = 61.6951
-p[ 2] = 59.3500
-p[ 3] = 61.2421
-p[ 4] = 65.9653
-p[ 5] = 59.4936
-p[ 6] = 58.4164
-p[ 0] = -209.6351
-p[ 1] = -214.0987
-p[ 2] = -217.0928
-p[ 3] = -215.0267
-p[ 4] = -208.2430
-p[ 5] = -215.3692
-p[ 6] = -214.1981
-p[ 0] = -301.0286
-p[ 1] = -308.6521
-p[ 2] = -310.7513
-p[ 3] = -307.0832
-p[ 4] = -299.9238
-p[ 5] = -306.0667
-p[ 6] = -302.1777
-...
-```
-
-**Instead, I think the magnitude should remain around `1`.
-See https://github.com/ggerganov/llama.cpp/issues/1063#issuecomment-1527730562 for more analysis**
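The diff above replaces the quantize tools' numeric type argument with a named type string (`q5_0`), i.e. 5-bit block quantization. As background, a Q5_0-style scheme maps each block of weights to 5-bit signed integers sharing one per-block scale. The following Python sketch illustrates the numeric mapping only, under stated assumptions: the function names are hypothetical, and it deliberately ignores ggml's actual block size, bit packing, and exact scale convention.

```python
import numpy as np

def quantize_q5_0_like(block: np.ndarray):
    """Sketch of symmetric 5-bit block quantization (Q5_0-style).

    Maps each float in the block to an integer in [-16, 15] using a
    single shared scale. Hypothetical helper, not ggml's real layout.
    """
    amax = float(np.max(np.abs(block)))
    scale = amax / 15.0 if amax > 0 else 1.0  # 5 bits -> range [-16, 15]
    q = np.clip(np.round(block / scale), -16, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Reverse the mapping: each 5-bit integer back to an approximate float
    return q.astype(np.float32) * scale

# Tiny demo block; real Q5_0 operates on fixed-size blocks of weights
block = np.array([0.1, -0.5, 0.9, -1.2], dtype=np.float32)
q, scale = quantize_q5_0_like(block)
approx = dequantize(q, scale)
```

With one shared scale per block, the rounding error of each element is bounded by half the scale, which is the basic accuracy/size trade-off the README's quantization step exercises.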