Not much gain for simple quantizations, bit it will be important
for quantizations that require more CPU cycles.
* Multi-threading for quantize-stats
It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.
* Reviewer comments
* Avoiding compiler confusion
After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC starting to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.
* Still fighting with lambda captures in MSVC
---------
Co-authored-by: Iwan Kawrakow <redacted> Co-authored-by: Georgi Gerganov <redacted>
Ivan Komarov [Thu, 20 Apr 2023 15:15:18 +0000 (17:15 +0200)]
ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.
This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with a initialization on first use. This prevents an undefined behavior on program run, for example, crash in Release build, works in Debug build
* replaced use of auto with exact type to avoid using -std=c++14
* Made the assessors functions for static maps be static const
convert.py: Fix loading safetensors and ggml format on Windows (#991)
Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset). For safetensors, avoid using the file offset after calling
mmap. For GGML format, explicitly save and restore the offset.
Current status: Working, except for the latest GPTQ-for-LLaMa format
that includes `g_idx`. This turns out to require changes to GGML, so
for now it only works if you use the `--outtype` option to dequantize it
back to f16 (which is pointless except for debugging).
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts
(including the ones that convert from older GGML formats), while also
adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using
`convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order
--groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different
cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find
it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python
GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split
of the original and the inter-tensor split of 'transformers' format
files). Single files are more convenient to work with and more
friendly to future changes to use memory mapping on the C++ side. To
accomplish this without increasing memory requirements, it has some
custom loading code which avoids loading whole input files into memory
at once.
- Because of the custom loading code, it no longer depends in PyTorch,
which might make installing dependencies slightly easier or faster...
although it still depends on NumPy and sentencepiece, so I don't know
if there's any meaningful difference. In any case, I also added a
requirements.txt file to lock the dependency versions in case of any
future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
- The script tries to be forgiving with arguments, e.g. you can
specify either the model file itself or the directory containing
it.
- The script doesn't depend on config.json / params.json, just in
case the user downloaded files individually and doesn't have those
handy. But you still need tokenizer.model and, for Alpaca,
added_tokens.json.
- The script tries to give a helpful error message if
added_tokens.json is missing.
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds. Isn't Windows fun?
- Define _GNU_SOURCE in more files (it's already used in ggml.c for
Linux's sake).
- Don't use PrefetchVirtualMemory if not building for Windows 8 or later
(mingw64 doesn't by default). But warn the user about this situation
since it's probably not intended.
- Check for NOMINMAX already being defined, which it is on mingw64.
- Actually use the `increment` variable (bug in my `pizza` PR).
- Suppress unused variable warnings in the fake pthread_create and
pthread_join implementations for Windows.
- (not Windows-related) Remove mention of `asprintf` from comment;
`asprintf` is no longer used.
- Support all three formats (ggml, ggmf, ggjt). (However, I didn't
include the hack needed to support GPT4All files without conversion.
Those can still be used after converting them with convert.py from my
other PR.)
- Support both mmap and read (mmap is used by default, but can be
disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the
number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just
relying on the per-tensor type/size fields. This has no immediate
benefit, but makes it easier to experiment with different formats, and
should make it easier to support the new GPTQ-for-LLaMa models in the
future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on
Unix).
- Indicate loading progress when using mmap + mlock. (Which led me
to the interesting observation that on my Linux machine, with a
warm file cache, mlock actually takes some time, whereas mmap
without mlock starts almost instantly...)
- To help implement this, move mlock support from ggml to the
loading code.
- madvise/PrefetchVirtualMemory support (based on #740)
- Switch from ifstream to the `fopen` family of functions to avoid
unnecessary copying and, when mmap is enabled, allow reusing the same
file descriptor for both metadata reads and mmap (whereas the existing
implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file
inputs (not really a feature as much as 'it was easier this way').
Implementation notes:
I tried to factor the code into more discrete pieces than before.
Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:
- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions. I don't even usually use exceptions when writing C++, and
I can remove them if desired... but here they make the loading code
much more succinct while still properly handling a variety of errors,
ranging from API calls failing to integer overflow and allocation
failure. The exceptions are converted to error codes at the
API boundary.)
Co-authored-by: Pavol Rusnak <redacted> (for the bit I copied from #740)
Add quantize-stats command for testing quantization (#728)
Command that calculates some statistics over the errors introduced by
quantization, like mean square error, max error and some percentile errors for layer
weights. Should be useful for testing quantization improvements.
Exposes some internal state from ggml and llama for testing
Georgi Gerganov [Wed, 5 Apr 2023 19:07:33 +0000 (22:07 +0300)]
ggml, llama : avoid heavy V transpose + improvements (#775)
ggml :
- added ggml_view_3d()
- ggml_view_tensor() now inherits the stride too
- reimplement ggml_cpy() to account for dst stride
- no longer require tensor->data to be memory aligned
llama :
- compute RoPE on 32-bit tensors (should be more accurate)
- store RoPE-ed K in the KV cache
- store transposed V in the KV cache (significant speed-up)
- avoid unnecessary Q copy