Fix conversion of unnormalized BF16->BF16 weights (#7843)
author    Sigbjørn Skjæret <redacted>
          Fri, 2 Aug 2024 19:11:39 +0000 (21:11 +0200)
committer GitHub <redacted>
          Fri, 2 Aug 2024 19:11:39 +0000 (15:11 -0400)
commit  b72c20b85c1029d135022d39e9a20d4807c11893
tree    e2d966bdfbf0232e89c4348949cfae132e5c74ba
parent  e09a800f9a9b19c73aa78e03b4c4be8ed988f3e6
Fix conversion of unnormalized BF16->BF16 weights (#7843)

* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16
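
The masking fix concerns the bit-level fp32→bf16 path. The sketch below is a minimal NumPy version of the shape of `__compute_fp32_to_bf16` (not the exact code from gguf-py/gguf/quants.py): NaNs are quieted with a full high-half mask before the low 16 bits are dropped, and the rounding add is widened to uint64 so it cannot overflow, matching the "missing cast and additional numpy 2.x fix" note above.

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Round-to-nearest-even fp32 -> bf16, returned as uint16 bit patterns."""
    # reinterpret fp32 values as uint32 to work on the bit patterns directly
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    # quiet any NaN: keep the high half (mask 0xffff0000, not 0xffff) and set a
    # mantissa bit, so truncating the low 16 bits cannot turn NaN into infinity
    is_nan = (u & np.uint32(0x7fffffff)) > np.uint32(0x7f800000)
    u = np.where(is_nan, (u & np.uint32(0xffff0000)) | np.uint32(64 << 16), u)
    # round to nearest, ties to even: bias by 0x7fff plus the lsb of the kept
    # half; widen to uint64 first so the add cannot wrap around in uint32
    u = (u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16
    return u.astype(np.uint16)
```

For example, 1.0 (fp32 bits 0x3f800000) maps to bf16 0x3f80, and the halfway case 1.00390625 (0x3f808000) rounds down to the even pattern 0x3f80.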

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.
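
A small illustrative check of the subnormal behavior (not code from the commit): bf16 shares fp32's 8-bit exponent, so its smallest positive subnormal is 2**-133, whose fp32 bit pattern is 0x00010000. Rounding that down to 16 bits without a flush-to-zero step keeps it nonzero.

```python
import numpy as np

# 2**-133 is the smallest positive bf16 subnormal (exponent 0, mantissa lsb);
# as an fp32 bit pattern it is 0x00010000 (also subnormal in fp32)
tiny = np.array([2.0 ** -133], dtype=np.float32)
u = tiny.view(np.uint32)
assert u[0] == 0x00010000

# round to nearest even down to the high 16 bits, with no flush-to-zero step
bf16 = ((u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16).astype(np.uint16)

# the subnormal survives as a nonzero bf16 pattern instead of becoming 0.0
assert bf16[0] == 0x0001
```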

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.
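
The redundancy can be seen with a sketch (illustrative, NaN handling omitted): any fp32 value whose bit pattern already has zero low 16 bits, i.e. is exactly representable in bf16, round-trips unchanged through round-to-nearest-even, so a separate truncation pass before rounding adds nothing.

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    # round-to-nearest-even fp32 -> bf16 bit patterns (sketch, no NaN quieting)
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    u = (u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16
    return u.astype(np.uint16)

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    # widen bf16 bit patterns back to fp32 by zero-filling the low 16 bits
    return (b.astype(np.uint32) << 16).view(np.float32)

# values already exact in bf16 round-trip unchanged: the low half is 0x0000,
# so the rounding bias (at most 0x8000) never carries into the kept bits
pats = np.arange(0x4000, 0x4010, dtype=np.uint16)
assert np.array_equal(fp32_to_bf16(bf16_to_fp32(pats)), pats)
```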

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <redacted>
convert_hf_to_gguf.py
ggml/include/ggml.h
ggml/src/ggml-impl.h
ggml/src/ggml.c
gguf-py/gguf/quants.py