Fix conversion of unnormalized BF16->BF16 weights (#7843)
author    Sigbjørn Skjæret <redacted>
          Fri, 2 Aug 2024 19:11:39 +0000 (21:11 +0200)
committer GitHub <redacted>
          Fri, 2 Aug 2024 19:11:39 +0000 (15:11 -0400)
commit  b72c20b85c1029d135022d39e9a20d4807c11893
tree    e2d966bdfbf0232e89c4348949cfae132e5c74ba
parent  e09a800f9a9b19c73aa78e03b4c4be8ed988f3e6
Fix conversion of unnormalized BF16->BF16 weights (#7843)

* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16
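
The masking fix concerns the bit-level fp32→bf16 path. The sketch below is a minimal NumPy version of the shape of `__compute_fp32_to_bf16` (not the exact code from gguf-py/gguf/quants.py): NaNs are quieted with a full high-half mask before the low 16 bits are dropped, and the rounding add is widened to uint64 so it cannot overflow, matching the "missing cast and additional numpy 2.x fix" note above.

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Round-to-nearest-even fp32 -> bf16, returned as uint16 bit patterns."""
    # reinterpret fp32 values as uint32 to work on the bit patterns directly
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    # quiet any NaN: keep the high half (mask 0xffff0000, not 0xffff) and set a
    # mantissa bit, so truncating the low 16 bits cannot turn NaN into infinity
    is_nan = (u & np.uint32(0x7fffffff)) > np.uint32(0x7f800000)
    u = np.where(is_nan, (u & np.uint32(0xffff0000)) | np.uint32(64 << 16), u)
    # round to nearest, ties to even: bias by 0x7fff plus the lsb of the kept
    # half; widen to uint64 first so the add cannot wrap around in uint32
    u = (u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16
    return u.astype(np.uint16)
```

For example, 1.0 (fp32 bits 0x3f800000) maps to bf16 0x3f80, and the halfway case 1.00390625 (0x3f808000) rounds down to the even pattern 0x3f80.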

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.
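
A small illustrative check of the subnormal behavior (not code from the commit): bf16 shares fp32's 8-bit exponent, so its smallest positive subnormal is 2**-133, whose fp32 bit pattern is 0x00010000. Rounding that down to 16 bits without a flush-to-zero step keeps it nonzero.

```python
import numpy as np

# 2**-133 is the smallest positive bf16 subnormal (exponent 0, mantissa lsb);
# as an fp32 bit pattern it is 0x00010000 (also subnormal in fp32)
tiny = np.array([2.0 ** -133], dtype=np.float32)
u = tiny.view(np.uint32)
assert u[0] == 0x00010000

# round to nearest even down to the high 16 bits, with no flush-to-zero step
bf16 = ((u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16).astype(np.uint16)

# the subnormal survives as a nonzero bf16 pattern instead of becoming 0.0
assert bf16[0] == 0x0001
```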

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.
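
The redundancy can be seen with a sketch (illustrative, NaN handling omitted): any fp32 value whose bit pattern already has zero low 16 bits, i.e. is exactly representable in bf16, round-trips unchanged through round-to-nearest-even, so a separate truncation pass before rounding adds nothing.

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    # round-to-nearest-even fp32 -> bf16 bit patterns (sketch, no NaN quieting)
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    u = (u.astype(np.uint64) + (0x7fff + ((u >> 16) & 1))) >> 16
    return u.astype(np.uint16)

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    # widen bf16 bit patterns back to fp32 by zero-filling the low 16 bits
    return (b.astype(np.uint32) << 16).view(np.float32)

# values already exact in bf16 round-trip unchanged: the low half is 0x0000,
# so the rounding bias (at most 0x8000) never carries into the kept bits
pats = np.arange(0x4000, 0x4010, dtype=np.uint16)
assert np.array_equal(fp32_to_bf16(bf16_to_fp32(pats)), pats)
```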

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <redacted>
convert_hf_to_gguf.py
ggml/include/ggml.h
ggml/src/ggml-impl.h
ggml/src/ggml.c
gguf-py/gguf/quants.py