jaime-m-p [Fri, 17 May 2024 23:09:13 +0000 (01:09 +0200)]
Unicode codepoint flags for custom regexs (#7245)
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM
Gavin Zhao [Fri, 17 May 2024 15:03:03 +0000 (11:03 -0400)]
ROCm: use native CMake HIP support (#5966)
Supercedes #4024 and #4813.
CMake's native HIP support has become the
recommended way to add HIP code into a project (see
[here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)).
This PR makes the following changes:
1. The environment variable `HIPCXX` or CMake option
`CMAKE_HIP_COMPILER` should be used to specify the HIP
compiler. Notably this shouldn't be `hipcc`, but ROCm's clang,
which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously
this was control by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`.
Note that since native CMake HIP support is not yet available on
Windows, on Windows we fall back to the old behavior.
2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the
GPU architectures to build for. Previously this was controled by
`GPU_TARGETS`.
3. Updated the Nix recipe to account for these new changes.
4. The GPU targets to build against in the Nix recipe is now
consistent with the supported GPU targets in nixpkgs.
5. Added CI checks for HIP on both Linux and Windows. On Linux, we test
both the new and old behavior.
The most important part about this PR is the separation of the
HIP compiler and the C/C++ compiler. This allows users to choose
a different C/C++ compiler if desired, compared to the current
situation where when building for ROCm support, everything must be
compiled with ROCm's clang.
~~Makefile is unchanged. Please let me know if we want to be
consistent on variables' naming because Makefile still uses
`GPU_TARGETS` to control architectures to build for, but I feel
like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're
calling `make`.~~ Makefile used `GPU_TARGETS` but the README says
to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of
`GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`.
Thanks to the suggestion of @jin-eld, to maintain backwards
compatibility (and not break too many downstream users' builds), if
`CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using
the original behavior and emit a warning that recommends switching
to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but
`CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS`
to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new
HIP support.
* share requests session for performance
* create directories only when needed, don't skip downloads when empty directory encountered
* be more graceful about errors
Justine Tunney [Fri, 17 May 2024 06:58:52 +0000 (02:58 -0400)]
ggml : rewrite silu and softmax for cpu (#7154)
This change upstreams llamafile's vectorized expf() functions. This lets
us compute softmax and silu more accurately than the short[65536] lookup
table that GGML previously used to make this operation go faster. We can
support aarch64 and sse2+ with the worst case rounding error of 2ulp. It
makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf
go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512
agray3 [Wed, 15 May 2024 13:44:49 +0000 (14:44 +0100)]
Avoid unnecessarily disabling CUDA graphs (#7302)
As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts.
This fixes the issue by avoiding the consective update counter from incrementing unnecessarily
for tokens in which cuda graphs are disabled due to batch size > 1.
Haggai Nuchi [Tue, 14 May 2024 05:25:56 +0000 (22:25 -0700)]
Add left recursion check: quit early instead of going into an infinite loop (#7083)
* Add left recursion check: quit early instead of going into an infinite loop
* Remove custom enum, rename left recursion check and move to "grammar internal" section, add handling for edge case where a leftmost nonterminal may be empty
Ryuei [Tue, 14 May 2024 05:20:47 +0000 (14:20 +0900)]
docs: Fix typo and update description for --embeddings flag (#7026)
- Change '--embedding' to '--embeddings' in the README
- Update the description to match the latest --help output
- Added a caution about defining physical batch size
compilade [Sat, 11 May 2024 15:06:26 +0000 (11:06 -0400)]
convert-hf : support bfloat16 conversion (#7158)
* convert-hf : support bfloat16 conversion
* gguf-py : flake8 fixes
* convert-hf : add missing space after comma
* convert-hf : get bit-exact same output as ./quantize
The quantization version was missing.
* convert-hf : don't round bf16 NANs
* convert-hf : save some memory with np.int16 intermediate bf16 weights
* convert-hf : more closely match llama.cpp with which weights to keep in f32
* convert-hf : add --outtype auto-f16
A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.
* convert-hf : remove a semicolon because flake8 doesn't like it
It's a reflex from when programming in C/C++, I guess.
* convert-hf : support outtype templating in outfile name
* convert-hf : rename --outtype auto-f16 to --outtype auto
Steve Grubb [Sat, 11 May 2024 08:13:02 +0000 (04:13 -0400)]
server : free llama_batch on exit (#7212)
* [server] Cleanup a memory leak on exit
There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.
Justine Tunney [Fri, 10 May 2024 11:01:08 +0000 (07:01 -0400)]
Fix memory bug in grammar parser (#7194)
The llama.cpp grammar parser had a bug where forgetting to add a closing
quotation mark to strings would cause parsing to crash. Anyone running a
server on a public endpoint is advised to upgrade. To reproduce this bug
./llamafile -m foo.gguf -p bar --grammar 'root::="'
Credit for discovering and reporting this issue goes to Eclypsium
Security Researcher Richard Johnson <redacted>.
This commit changes the value assigned to llama_timings.n_p_eval when
ctx->n_p_eval is 0 to be 1 instead of 1 which is the current value.
The motivation for this change is that if session caching is enabled,
for example using the `--prompt-cache main-session.txt` command line
argument for the main example, and if the same prompt is used then on
subsequent runs, the prompt tokens will not actually be passed to
llama_decode, and n_p_eval will not be updated by llama_synchoronize.
But the value of n_p_eval will be set 1 by llama_get_timings because
ctx->n_p_eval will be 0. This could be interpreted as 1 token was
evaluated for the prompt which could be misleading for applications
using this value.
Sigbjørn Skjæret [Thu, 9 May 2024 10:56:00 +0000 (12:56 +0200)]
gguf-py : add special token modification capability (#7166)
* Add special token modification capability
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006
So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁hole|>" --special-token suffix "<|fim▁end|>"
```
* improve help text
* flake--
* fix multiple tokens warning
* make script executable
* switch to namedtuple, no need to dataclass
* typing++
* add progress bar
* Add special token modification capability
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006
So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"
```
(yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled))
or
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
```
etc...
NB: The tokens have to exist already, trying to add non-existent token name/IDs will be ignored (with a warning), while non-existent values will fail (with an error).