convert : update for baichuan (#2081)

author Judd <redacted>

Thu, 6 Jul 2023 16:23:49 +0000 (00:23 +0800)

committer GitHub <redacted>

Thu, 6 Jul 2023 16:23:49 +0000 (19:23 +0300)
diff --git a/README.md b/README.md

index 32f17c2d1bcdeb59e42ed3b67851310048e3da93..863aef123ad9a64737463757683e504910cfad63 100644 (file)
--- a/README.md
+++ b/README.md
@@ -86,7 +86,7 @@ as the main playground for developing new features for the [ggml](https://github
  - [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
  - [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
  - [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
-- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
+- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) and its derivations (such as [baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft))
  
  **Bindings:**
  
diff --git a/convert.py b/convert.py

index 14269277627b13b29f735196878a76eb6f418a03..66509b99c8f3e20ae8be38e9fb7cac2f401088ce 100644 (file)
--- a/convert.py
+++ b/convert.py
@@ -154,9 +154,15 @@ class Params:
          # try transformer naming first
          if "model.layers.0.self_attn.q_proj.weight" in model:
              n_layer=next(i for i in itertools.count() if f"model.layers.{i}.self_attn.q_proj.weight" not in model)
+        elif "model.layers.0.self_attn.W_pack.weight" in model:   # next: try baichuan naming
+            n_layer=next(i for i in itertools.count() if f"model.layers.{i}.self_attn.W_pack.weight" not in model)
          else:
              n_layer=next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model)
  
+        if n_layer < 1:
+            raise Exception("failed to guess 'n_layer'. This model is unknown or unsupported.\n"
+                            "Suggestion: provide 'config.json' of the model in the same directory containing model files.")
+
          n_head=n_embd // 128 # guessed
  
          return Params(
diff --git a/examples/embedding/embedding.cpp b/examples/embedding/embedding.cpp

index 2b7eb39c51ff5390913f6f01cf126508c1192b9a..03e801c2a6d4b6a159970555cd11c66fcc3929f0 100644 (file)
--- a/examples/embedding/embedding.cpp
+++ b/examples/embedding/embedding.cpp
@@ -18,7 +18,7 @@ int main(int argc, char ** argv) {
      params.embedding = true;
  
      if (params.n_ctx > 2048) {
-        fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
+        fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
                  "expect poor results\n", __func__, params.n_ctx);
      }
  
diff --git a/examples/main/main.cpp b/examples/main/main.cpp

index 3a171925ba5103aeb0b86116707cc073d2ba744f..0f6391acba45d6bfc566a6cedd3041176b0f392a 100644 (file)
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -85,7 +85,7 @@ int main(int argc, char ** argv) {
      }
  
      if (params.n_ctx > 2048) {
-        fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
+        fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
                  "expect poor results\n", __func__, params.n_ctx);
      } else if (params.n_ctx < 8) {
          fprintf(stderr, "%s: warning: minimum context size is 8, using minimum size.\n", __func__);
diff --git a/examples/perplexity/perplexity.cpp b/examples/perplexity/perplexity.cpp

index dd54ed3c4bd6cd65a62bcaa01156c793ef36a1b4..fd4b03cb261f6849ec6383487e92234c15fcf2b4 100644 (file)
--- a/examples/perplexity/perplexity.cpp
+++ b/examples/perplexity/perplexity.cpp
@@ -130,7 +130,7 @@ int main(int argc, char ** argv) {
      params.n_batch = std::min(params.n_batch, params.n_ctx);
  
      if (params.n_ctx > 2048) {
-        fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
+        fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
                  "expect poor results\n", __func__, params.n_ctx);
      }
  
diff --git a/examples/server/README.md b/examples/server/README.md

index c5139c16bb8277d032b07f17510141197fb18f0a..ad9b6bb08184525c3b9532ba96a04b677de813e5 100644 (file)
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -7,7 +7,7 @@ Command line options:
  -   `--threads N`, `-t N`: Set the number of threads to use during computation.
  -   `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
  -   `-m ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
--   `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+-   `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models, for example, baichuan models were build with a context of 4096.
  -   `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
  -   `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
  -   `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
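
The `n_layer` guess in the `convert.py` hunk above can be read as "probe layer 0 under each known naming scheme, then count upward until a layer is missing". The block below is a minimal standalone sketch of that idea, not the code from `convert.py`. The helper name `guess_n_layer`, the `LAYER_KEY_TEMPLATES` tuple, and the use of `ValueError` are assumptions made for illustration; the three tensor-name patterns are the ones shown in the hunk. Unlike the hunk, which falls through to the original naming and then checks `n_layer < 1`, this sketch raises as soon as no scheme matches, which gives the same outcome for an unknown model.

```python
import itertools

# Tensor-name patterns from the three branches of the hunk, in probe order.
LAYER_KEY_TEMPLATES = (
    "model.layers.{i}.self_attn.q_proj.weight",  # transformers naming
    "model.layers.{i}.self_attn.W_pack.weight",  # baichuan naming
    "layers.{i}.attention.wq.weight",            # original LLaMA naming
)


def guess_n_layer(model):
    """Guess the layer count from tensor names (hypothetical helper).

    `model` is any container that supports `in` on tensor names,
    for example a dict of tensors or a set of names.
    """
    for template in LAYER_KEY_TEMPLATES:
        if template.format(i=0) in model:
            # Count consecutive layers until the first missing index.
            return next(
                i for i in itertools.count()
                if template.format(i=i) not in model
            )
    raise ValueError(
        "failed to guess 'n_layer': no known layer naming scheme found"
    )
```

For example, a set holding `model.layers.0.self_attn.W_pack.weight` through `model.layers.31.self_attn.W_pack.weight` would yield 32, while an empty container raises instead of silently reporting zero layers.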