In addition, this adds and removes some entries in the KV store, and adds a metadata class with automatic heuristics that can derive some values from the model card content (see the sketch after the list below).
* No Change:
- Internal GGUF Spec
- `general.architecture`
- `general.quantization_version`
- `general.alignment`
- `general.file_type`
- General Model Details
- `general.name`
- `general.author`
- `general.version`
- `general.description`
- Licensing details
- `general.license`
- Typically represents the converted GGUF repo (Unless made from scratch)
- `general.url`
- Model Source during conversion
- `general.source.url`
* Removed:
- Model Source during conversion
- `general.source.huggingface.repository`
* Added:
- General Model Details
- `general.organization`
- `general.finetune`
- `general.basename`
- `general.quantized_by`
- `general.size_label`
- Licensing details
- `general.license.name`
- `general.license.link`
- Typically represents the converted GGUF repo (Unless made from scratch)
- `general.doi`
- `general.uuid`
- `general.repo_url`
- Model Source during conversion
- `general.source.doi`
- `general.source.uuid`
- `general.source.repo_url`
- Base Model Source
- `general.base_model.count`
- `general.base_model.{id}.name`
- `general.base_model.{id}.author`
- `general.base_model.{id}.version`
- `general.base_model.{id}.organization`
- `general.base_model.{id}.url` (Model Website/Paper)
- `general.base_model.{id}.doi`
- `general.base_model.{id}.uuid`
- `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
- Array based KV stores
- `general.tags`
- `general.languages`
- `general.datasets`
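A minimal sketch of writing a few of the newly added keys, assuming gguf-py's `GGUFWriter` and its generic `add_string`/`add_array` setters; the key names come from the list above, while the values, output path, and architecture are placeholders, and the heuristics-based metadata class is not shown:

```python
# Hedged sketch: write some of the new general.* KV entries with gguf-py.
# Values below are placeholders, not real model metadata.
from gguf import GGUFWriter

writer = GGUFWriter("model-out.gguf", arch="llama")
writer.add_string("general.organization", "ExampleOrg")
writer.add_string("general.finetune", "instruct")
writer.add_string("general.basename", "example-model")
writer.add_string("general.size_label", "7B")
writer.add_string("general.license.name", "Apache License 2.0")
writer.add_string("general.license.link", "https://www.apache.org/licenses/LICENSE-2.0")
writer.add_array("general.tags", ["text-generation", "example"])
writer.add_array("general.languages", ["en"])

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()  # no tensors are added in this sketch
writer.close()
```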
---------
Co-authored-by: compilade <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>
Brian [Tue, 16 Jul 2024 07:14:16 +0000 (17:14 +1000)]
gguf-hash : update clib.json to point to original xxhash repo (#8491)
* Update clib.json to point to Cyan4973's original xxhash
Convinced Cyan4973 to add clib.json directly to his repo, so the clib package can now point straight at it. Previously it pointed to my fork, which carried the clib.json package metadata.
https://github.com/Cyan4973/xxHash/pull/954
* gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]
Steve Bonds [Tue, 16 Jul 2024 07:04:45 +0000 (00:04 -0700)]
export-lora : handle help argument (#8497)
The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.
* convert_hf : fix memory leak in lazy MoE conversion
The '_lazy' queue was sometimes self-referential,
which caused reference cycles of objects old enough
to avoid garbage collection until potential memory exhaustion.
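A toy illustration of the failure mode described above, not the actual convert_hf_to_gguf.py code; the class and method names here are made up:

```python
# A deferred step stored on the object's own queue closes over that object,
# so the object and its queue keep each other alive until a cyclic GC pass.
import gc
import weakref
from collections import deque

class LazyTensor:
    """Stand-in for a lazily-converted tensor with a queue of deferred steps."""
    def __init__(self) -> None:
        self._lazy: deque = deque()

    def defer(self, fn) -> None:
        # The closure captures `self` and is stored on self._lazy, so
        # self -> _lazy -> closure -> self forms a reference cycle.
        self._lazy.append(lambda: fn(self))

t = LazyTensor()
t.defer(lambda obj: obj)
alive = weakref.ref(t)
del t
print(alive() is not None)  # True: refcounting alone cannot free the cycle
gc.collect()
print(alive() is None)      # True: reclaimed only by a cyclic GC pass
```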
server: update README.md with llama-server --help output [no ci] (#8472)
The README.md had stale information. In particular, the --ctx-size
"defaults to 512" claim confused me and I had to check the code to confirm
it was false. Since the server is evolving rapidly, it's probably
better to keep the source of truth in a single place (the source code) and
generate the README.md from that.
Did:
make llama-server
./llama-server --help > t.txt
vimdiff t.txt examples/server/README.md
I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow up could be to
automate this process with a script.
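A hypothetical follow-up script along those lines, not part of this commit; the HELP BEGIN/END markers are an assumption and would need to be added to the README first:

```python
# Regenerate the fenced --help block in examples/server/README.md
# from the llama-server binary itself.
import re
import subprocess
from pathlib import Path

README = Path("examples/server/README.md")
help_text = subprocess.run(
    ["./llama-server", "--help"], capture_output=True, text=True, check=True
).stdout

fence = "`" * 3  # build the backtick fence without embedding it literally
block = f"<!-- HELP BEGIN -->\n{fence}\n{help_text.rstrip()}\n{fence}\n<!-- HELP END -->"
README.write_text(re.sub(
    r"<!-- HELP BEGIN -->.*?<!-- HELP END -->",
    lambda m: block,  # lambda avoids backslash-escape issues in re.sub
    README.read_text(),
    flags=re.DOTALL,
))
```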
llama : fix pre-tokenization of non-special added tokens (#8228)
* llama : fix mpt and olmo pre-tokenizer
* llama : pre-tokenize non-special user-defined tokens first (see the sketch after this list)
* llama : fix detection of control-like user-defined tokens
* convert_hf : identify which user-defined tokens are control tokens
Only used in _set_vocab_gpt2() for now.
* convert_hf : identify more added control tokens for SPM tokenizers
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.
There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)
* llama : fix wrong pre-tokenization of byte tokens
* llama : fix Viking pre-tokenizer regex
The order was previously wrong, which caused errors in some tests.
* llama : fix command-r detokenization
* convert_hf : reduce usages of the UNKNOWN token type
* llama : add UNKNOWN tokens in the special tokens cache
* convert_hf : reduce usages of UNKNOWN for InternLM2
This makes the changes from #8321 more consistent
with the other changes made here.
* test-tokenizer-random : reduce potential conflicts with #8379
* test-tokenizer-random : add a failing edge case for falcon
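A simplified sketch of the "pre-tokenize user-defined tokens first" idea from the list above; this is not llama.cpp's implementation, and the regex here is only a stand-in for a real pre-tokenizer pattern:

```python
# Split the raw text on user-defined added tokens first, so they are matched
# verbatim, and only run the regex pre-tokenizer on the remaining fragments.
import re

def pre_tokenize(text: str, user_defined: list[str], regex: re.Pattern) -> list[str]:
    # Longer added tokens take precedence over shorter prefixes.
    pattern = "|".join(re.escape(t) for t in sorted(user_defined, key=len, reverse=True))
    pieces: list[str] = []
    for fragment in re.split(f"({pattern})", text):
        if fragment in user_defined:
            pieces.append(fragment)            # kept as-is, never split further
        elif fragment:
            pieces.extend(regex.findall(fragment))
    return pieces

print(pre_tokenize(
    "<td>hello</td>",
    user_defined=["<td>", "</td>"],
    regex=re.compile(r"\S+|\s+"),              # placeholder pre-tokenizer regex
))  # ['<td>', 'hello', '</td>']
```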
docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This helps users following tutorials and other instruction sets know what to do when the 'main' binary is missing.
* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
Deprecation warning to assist with migration to new binary names (#8283)
* Add a simple program that prints a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.
* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
Kevin Wang [Mon, 8 Jul 2024 07:26:53 +0000 (03:26 -0400)]
common : preallocate sampling token data vector (#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this improves the performance of this block of code from ~500us/op to ~40us/op.
Overall, this slightly improves the sampling performance which has a more substantial impact for the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow
The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.
* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py
Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.
* ci : only show warnings and errors in python type-check
The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
Brian [Sun, 7 Jul 2024 12:58:43 +0000 (22:58 +1000)]
gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect differences at a per-model and per-tensor level
The hash types we support are:
- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256
While most POSIX systems already have hash-checking programs like sha256sum, those
are designed to check entire files. This is not ideal for our purpose if we want
to check the consistency of the tensor data even after the metadata content of the
GGUF KV store has been updated.
This program is designed to hash a GGUF tensor payload on a 'per tensor layer'
basis, in addition to an 'entire tensor model' hash. The intent is that the
whole-model hash can be checked first, and if any inconsistency is detected, the
per-tensor hashes can be used to narrow down the specific tensor layer that differs.
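A rough Python analogue of the xxh64 mode, for illustration only; the actual tool is the C++ llama-gguf-hash binary and its exact output format differs. This assumes gguf-py's `GGUFReader` and the third-party `xxhash` package:

```python
# Hash each tensor payload individually, plus one hash over all payloads,
# ignoring the GGUF KV metadata entirely.
import xxhash
from gguf import GGUFReader

reader = GGUFReader("model.gguf")   # placeholder path
overall = xxhash.xxh64()

for tensor in reader.tensors:
    data = tensor.data.tobytes()    # raw tensor bytes, metadata excluded
    print(f"{xxhash.xxh64(data).hexdigest()}  {tensor.name}")
    overall.update(data)

print(f"{overall.hexdigest()}  (all tensor payloads)")
```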
* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration
* fix chat template bug
* fix codestyle
* fix conflicts
* modified the general name of glm model
* fix conflicts
* remove prefix and suffix
* use normal glm4 chat template & use LLM_FFN_SWIGLU in phi3
* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`
- Fix E302 by adding two blank lines before top-level function definitions
- Replace print statements to fix NP100
- Fix E303 by ensuring only one blank line between lines of code
server: Retrieve prompt template in /props (#8337)
* server: Retrieve prompt template in /props
This PR adds the following:
- Expose the model's Jinja2 prompt template from the model in the /props endpoint.
- Change log-level from Error to Warning for warning about template mismatch.
The front-end stands a better chance of actually executing the Jinja template correctly; the server is currently just guessing it.
Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.
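A hypothetical client-side sketch of reading the template from /props; it assumes a server on localhost:8080 and that the template is returned under a key such as "chat_template" (the actual key name may differ), using the third-party `requests` package:

```python
# Fetch the server properties and print the model's chat template, if any.
import requests

props = requests.get("http://localhost:8080/props", timeout=5).json()
template = props.get("chat_template")   # assumed key name
if template:
    print("Model ships a Jinja2 chat template:")
    print(template)
else:
    print("No chat template reported by the server.")
```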