ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)
* A faster version for Q4_1 x Q8_0 dot products
The idea nehind being that Q8_0 quantized
values get used many times in the matrix multiplications
where they are involved. In the current implementations,
when we are evaluating the dot products, we need to compute
the sum of the quants in the Q8_0 vector, so the same
operation is repeated many times. Here we pre-compute
the sum during Q8_0 quantization, store it in the
now modified block_q8_0 struct, and then reuse this
result in the subsequent dot products.
In a synthetic benchmark (just compute a bunch of dot
products), this change speeds up the Q4_1 * Q8_0 dot
product by 80%, making the performance identical to
Q4_0 * Q8_0.
In practical application, I see a ~15% gain in speed for
token prediction on M2, and ~5% gain on Ryzen 7950X.
The speed gain in the prompt evaluation is much bigger
(around 50%).
I have only done the change for the scalar version,
ARM_NEON, and AVX2, so we still need an AVX implementation.
Not much gain for simple quantizations, bit it will be important
for quantizations that require more CPU cycles.
* Multi-threading for quantize-stats
It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.
* Reviewer comments
* Avoiding compiler confusion
After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC starting to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.
* Still fighting with lambda captures in MSVC
---------
Co-authored-by: Iwan Kawrakow <redacted> Co-authored-by: Georgi Gerganov <redacted>
Ivan Komarov [Thu, 20 Apr 2023 15:15:18 +0000 (17:15 +0200)]
ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.
This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with a initialization on first use. This prevents an undefined behavior on program run, for example, crash in Release build, works in Debug build
* replaced use of auto with exact type to avoid using -std=c++14
* Made the assessors functions for static maps be static const
convert.py: Fix loading safetensors and ggml format on Windows (#991)
Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset). For safetensors, avoid using the file offset after calling
mmap. For GGML format, explicitly save and restore the offset.
Current status: Working, except for the latest GPTQ-for-LLaMa format
that includes `g_idx`. This turns out to require changes to GGML, so
for now it only works if you use the `--outtype` option to dequantize it
back to f16 (which is pointless except for debugging).
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts
(including the ones that convert from older GGML formats), while also
adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using
`convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order
--groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different
cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find
it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python
GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split
of the original and the inter-tensor split of 'transformers' format
files). Single files are more convenient to work with and more
friendly to future changes to use memory mapping on the C++ side. To
accomplish this without increasing memory requirements, it has some
custom loading code which avoids loading whole input files into memory
at once.
- Because of the custom loading code, it no longer depends in PyTorch,
which might make installing dependencies slightly easier or faster...
although it still depends on NumPy and sentencepiece, so I don't know
if there's any meaningful difference. In any case, I also added a
requirements.txt file to lock the dependency versions in case of any
future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
- The script tries to be forgiving with arguments, e.g. you can
specify either the model file itself or the directory containing
it.
- The script doesn't depend on config.json / params.json, just in
case the user downloaded files individually and doesn't have those
handy. But you still need tokenizer.model and, for Alpaca,
added_tokens.json.
- The script tries to give a helpful error message if
added_tokens.json is missing.