comex [Tue, 21 Mar 2023 16:42:25 +0000 (09:42 -0700)]
Importer for GPTQ quantized LLaMA models (#301)
* [WIP, broken] Importer for GPTQ quantized LLaMA models
Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
Current status: Something is busted. The output starts out decent, but
quickly degrades into gibberish. This doesn't happen with either the
original GPTQ-for-LLaMa using the same weights, or llama.cpp when using
weights quantized by its own quantizer. Is there a bug in the
conversion script that somehow only comes into play with a large context
size?
I did notice one potential issue. It's clearly not the main cause of
the gibberish, since it doesn't happen when using q4_1 weights quantized
by llama.cpp itself, but it seems concerning. When doing a matrix
multiplication of f16 * f32 => f32 or q4_1 * f32 => f32, at least when
the multiplication is not done with BLAS, the intermediate results are
stored in the smaller format rather than f32. This seems like an
unnecessary waste of precision, especially in the q4_1 case.
I was originally hoping to validate the results by matching the Python
implementation's output exactly, but precision and non-associativity
issues make this very difficult, including when performing matrix
multiplications and, especially, computing norms.
Anyway, design details:
The models being imported store per-layer weights in essentially q4_1
format, although the addend and scale are shared across an entire row
rather than every group of 32 weights. This script duplicates the
addend and scale to match ggml's expectations, at the cost of wasting
some memory.
However, there are two differences which I accommodated changing the
output format (and adding corresponding support to main.cpp) rather than
having the script match the existing one:
- The tok_embeddings and output weights (i.e. the weights that aren't
per-layer) are f16 instead of q4_1. They could be converted to q4_1,
and the impact of the loss of precision would probably be low, but
this would rule out exactly matching the Python implementation's
output for validation.
- There is no sharding, since the input doesn't have it, and for a
CPU-only implementation it seems more useful to avoid having to deal
with multiple files.
The new format is differentiated from existing q4_1 format by changing
the 'f16' header flag to a new value, 4. That said, I think a cleaner
approach would be to change main.cpp to support loading each tensor with
an arbitrary sharding configuration and type rather than hardcoding
specific combinations of types. So far I've wasted too much time
debugging to try implementing this...
Georgi Gerganov [Tue, 21 Mar 2023 15:29:41 +0000 (17:29 +0200)]
Add tokenizer test + revert to C++11 (#355)
* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now
Casey Primozic [Tue, 21 Mar 2023 14:35:42 +0000 (07:35 -0700)]
Add initial AVX512 support for dot product on Linux (#320)
* Update Makefile to detect AVX512 support and add compiler flags if it's available
* Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time
* Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at time instead of 16
* Use built-in AVX512 horizontal reduce add to get sum at the end
* Manual unrolling on inner dot product loop to reduce loop counter overhead
nusu-github [Tue, 21 Mar 2023 00:37:16 +0000 (09:37 +0900)]
Adding missing features of CMakeLists.txt & Refactoring (#131)
* Functionality addition CMakeLists.txt
Refactoring:
1. Simplify more options that are negation of negation.
LLAMA_NO_ACCELERATE -> LLAMA_ACCELERATE
2. Changed to an optional expression instead of forcing to enable AVX2 in MSVC.
3. Make CMAKE_CXX_STANDARD, which is different from Makefile, the same.
4. Use add_compile_options instead of adding options to CMAKE_C_FLAGS.
5. Make utils use target_link_libraries instead of directly referencing code.
Added features:
1. Added some options.
LLAMA_STATIC_LINK,LLAMA_NATIVE,LLAMA_LTO,LLAMA_GPROF,LLAMA_OPENBLAS
Rickey Bowers Jr [Sun, 19 Mar 2023 19:44:30 +0000 (13:44 -0600)]
fix coloring of last `n_batch` of prompt, and refactor line input (#221)
* fix coloring of last `n_batch` of prompt, and refactor line input
* forgot the newline that needs to be sent to the model
* (per #283) try to force flush of color reset in SIGINT handler
Suaj Carrot [Sun, 19 Mar 2023 18:38:44 +0000 (12:38 -0600)]
Improved quantize script (#222)
* Improved quantize script
I improved the quantize script by adding error handling and allowing to select many models for quantization at once in the command line. I also converted it to Python for generalization as well as extensibility.
* Fixes and improvements based on Matt's observations
Fixed and improved many things in the script based on the reviews made by @mattsta. The parallelization suggestion is still to be revised, but code for it was still added (commented).
* Small fixes to the previous commit
* Corrected to use the original glob pattern
The original Bash script uses a glob pattern to match files that have endings such as ...bin.0, ...bin.1, etc. That has been translated correctly to Python now.
* Added support for Windows and updated README to use this script
New code to set the name of the quantize script binary depending on the platform has been added (quantize.exe if working on Windows) and the README.md file has been updated to use this script instead of the Bash one.
* Fixed a typo and removed shell=True in the subprocess.run call
Fixed a typo regarding the new filenames of the quantized models and removed the shell=True parameter in the subprocess.run call as it was conflicting with the list of parameters.
* Corrected previous commit
* Small tweak: changed the name of the program in argparse
This was making the automatic help message to be suggesting the program's usage as being literally "$ Quantization Script [arguments]". It should now be something like "$ python3 quantize.py [arguments]".
Stephan Walter [Fri, 17 Mar 2023 17:47:35 +0000 (17:47 +0000)]
Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
Ronsor [Wed, 15 Mar 2023 19:37:50 +0000 (12:37 -0700)]
Use `tokenizer.vocab_size()` instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)
There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.