
llama_model_loader: support multiple split/shard GGUFs (#6187)

* split: support in llama_model_loader

* avoid copying the entire vector

Co-authored-by: slaren <redacted>

* split: move llama_tensor_offset to llama_model_loader

* llama_model_loader: PR feedback:
- use a single gguf_context, for metadata only
- store all ggml_context instances in a vector, like the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts

* avoid copying the entire vector

* Simplify this by making these optional; switch some tensors in layer creation to optional

Co-authored-by: Georgi Gerganov <redacted>

* Handle optional tensors

Co-authored-by: Georgi Gerganov <redacted>

* llama_model_loader: fail if the backend cannot allocate a buffer

* fix mmap buffer management

* llama_model_loader: map the file to a backend buffer only if the allocation succeeds

* llama_model_loader: only map tensors included in the context

* llama_model_loader: minor: use the same variable name for consistency, fix spacing in type casts

* llama_model_loader: fail if any backend buffer cannot be allocated

* spacing

Co-authored-by: slaren <redacted>

* fix loop over pointer

Co-authored-by: slaren <redacted>

* llama_model_loader: if the declared n_tensors does not equal the number of tensors loaded in the split, throw an exception instead of asserting

* llama_model_loader: ensure the mappings vector has the expected size

* llama_model_loader: use at() instead of operator[] where the lookup should never add to the map

* llama_model_loader: immediately add the backend buffer to the model buffers so that it is freed if an error occurs in a later allocation; reserve the expected size

* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer

* llama_model_loader: fix map -> unordered_map

* llama_split_prefix: use a clearer version that takes the destination's maximum length instead of the split path length

Co-authored-by: Xuan Son Nguyen <redacted>

* llama : minor

ggml-ci

* llama : introduce some typedef helpers

* docs: add model sharding to the hot topics

* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated

Co-authored-by: slaren <redacted>

* fix llama_split_prefix
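
The PR-feedback item about storing all weights in a vector along with the source tensor implies one record per tensor that remembers which split file holds it and where its data starts. A minimal C++ sketch of that bookkeeping, not the loader's actual code: `TensorWeight`, `add_weight` and `find_weight` are hypothetical names, and it assumes, as in GGUF, that a tensor's absolute file offset is the start of the file's data section plus the tensor's offset within that section.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: where one tensor's data lives across the split files.
struct TensorWeight {
    std::string name; // tensor name
    uint16_t    idx;  // index of the split file that contains the tensor
    size_t      offs; // absolute byte offset of the tensor data in that file
};

// Record a tensor found in split `idx`. `data_offset` is where that file's
// data section starts; `tensor_offset` is relative to the data section.
inline void add_weight(std::vector<TensorWeight> & weights, const std::string & name,
                       uint16_t idx, size_t data_offset, size_t tensor_offset) {
    weights.push_back({name, idx, data_offset + tensor_offset});
}

// Linear lookup by name; returns nullptr if no split contains the tensor.
inline const TensorWeight * find_weight(const std::vector<TensorWeight> & weights,
                                        const std::string & name) {
    for (const TensorWeight & w : weights) {
        if (w.name == name) {
            return &w;
        }
    }
    return nullptr;
}
```

Keeping the split index next to the offset means reading a tensor needs only one lookup to find both the file (or its mapping) and the position inside it.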
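
On the item about using at() instead of operator[]: for std::map and std::unordered_map, operator[] inserts a value-initialized entry when the key is missing, while at() throws std::out_of_range and leaves the container unchanged. A small illustration with two hypothetical helpers over a name-to-offset map:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical lookup of a tensor's file offset by name.
// at() raises std::out_of_range on a missing name instead of silently
// inserting a zero offset; it also works on a const map, where
// operator[] would not compile.
inline size_t lookup_offset(const std::unordered_map<std::string, size_t> & offsets,
                            const std::string & name) {
    return offsets.at(name);
}

// For contrast: operator[] on a non-const map grows it on a miss.
inline size_t lookup_offset_or_insert(std::unordered_map<std::string, size_t> & offsets,
                                      const std::string & name) {
    return offsets[name]; // inserts {name, 0} if name is absent
}
```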
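
On the items about adding each backend buffer to the model buffers immediately and reserving the expected size first: a sketch of that ownership pattern with a hypothetical `Buffer` type and a simulated allocator (a size of 0 stands in for a failed allocation). None of these names come from llama.cpp. Because the vector that owns the buffers belongs to the caller, buffers allocated before a failure are still released when the caller's vector is destroyed.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a backend buffer; counts live instances so the
// cleanup behaviour is observable.
struct Buffer {
    static int live;
    explicit Buffer(size_t size) : size(size) { ++live; }
    ~Buffer() { --live; }
    Buffer(const Buffer &) = delete;
    Buffer & operator=(const Buffer &) = delete;
    size_t size;
};
int Buffer::live = 0;

// Hypothetical allocator: a size of 0 simulates a failed allocation.
inline std::unique_ptr<Buffer> try_alloc_buffer(size_t size) {
    if (size == 0) {
        return nullptr;
    }
    return std::make_unique<Buffer>(size);
}

// Allocate one buffer per split into the caller-owned `bufs`.
inline void alloc_split_buffers(std::vector<std::unique_ptr<Buffer>> & bufs,
                                const std::vector<size_t> & sizes) {
    // Reserve up front so push_back does not reallocate inside the loop.
    // With raw handles (as a C API returns), a push_back that threw after a
    // successful allocation would leak that handle.
    bufs.reserve(bufs.size() + sizes.size());
    for (size_t size : sizes) {
        std::unique_ptr<Buffer> buf = try_alloc_buffer(size);
        if (!buf) {
            // earlier buffers are already in `bufs`, so the caller frees them
            throw std::runtime_error("failed to allocate backend buffer");
        }
        bufs.push_back(std::move(buf)); // owned by the caller immediately
    }
}
```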
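
The llama_split_prefix change bounds the write by the destination buffer's size rather than by the length of the input path. A hedged sketch of the idea, assuming the `<prefix>-NNNNN-of-NNNNN.gguf` naming for split files and a 1-based split number; `make_split_path` and `split_prefix` are hypothetical helpers, not the llama.h declarations.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: build "<prefix>-NNNNN-of-NNNNN.gguf" (1-based split_no).
inline std::string make_split_path(const std::string & prefix, int split_no, int split_count) {
    char postfix[32];
    std::snprintf(postfix, sizeof(postfix), "-%05d-of-%05d.gguf", split_no, split_count);
    return prefix + postfix;
}

// Hypothetical prefix extraction: if split_path ends with the postfix for
// split_no/split_count, copy the prefix into dest, writing at most maxlen
// bytes including the terminating NUL. Returns the full prefix length, or 0.
inline int split_prefix(char * dest, size_t maxlen, const char * split_path,
                        int split_no, int split_count) {
    char postfix[32];
    std::snprintf(postfix, sizeof(postfix), "-%05d-of-%05d.gguf", split_no, split_count);
    const size_t path_len    = std::strlen(split_path);
    const size_t postfix_len = std::strlen(postfix);
    if (path_len <= postfix_len ||
        std::strcmp(split_path + path_len - postfix_len, postfix) != 0) {
        return 0; // not the expected split file name
    }
    const size_t prefix_len = path_len - postfix_len;
    if (maxlen > 0) {
        // truncate to the destination size; dest is always NUL-terminated
        const size_t n = prefix_len < maxlen - 1 ? prefix_len : maxlen - 1;
        std::memcpy(dest, split_path, n);
        dest[n] = '\0';
    }
    return static_cast<int>(prefix_len);
}
```

With a 64-byte destination, `/models/m-00002-of-00004.gguf` for split 2 of 4 yields `/models/m` and returns 9; a 4-byte destination receives the truncated `/mo` while the full prefix length is still returned, so a caller can detect truncation.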

---------

Co-authored-by: slaren <redacted>
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Xuan Son Nguyen <redacted>

author	Pierrick Hymbert <redacted>
	Fri, 22 Mar 2024 18:00:01 +0000 (19:00 +0100)
committer	GitHub <redacted>
	Fri, 22 Mar 2024 18:00:01 +0000 (19:00 +0100)
commit	dba1af612926cbd4ebe2d876277af1e3305177e0
tree	9cb53aa9fbfaab4525a4fa1ce2afa120c0396491
parent	ee804f6223777019cf921e0d99cc24669313ab98

README.md
examples/gguf-split/gguf-split.cpp
llama.cpp
llama.h