git.djapps.eu Git - pkg/ggml/sources/whisper.cpp/commit
whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592)
author Yshtola <redacted>
Fri, 16 Jan 2026 12:16:05 +0000 (20:16 +0800)
committer GitHub <redacted>
Fri, 16 Jan 2026 12:16:05 +0000 (14:16 +0200)
commit f53dc74843e97f19f94a79241357f74ad5b691a6
tree 9c781f7ceb9d4626a891baea2fb8cc7dfff0dfe5
parent 2eeeba56e9edd762b4b38467bab96c2517163158

The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
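
To illustrate the problem described above, here is a minimal sketch of how a byte-based cut position can be moved back to a UTF-8 character boundary. This is not the code from the commit: the helper names `is_utf8_continuation`, `utf8_length`, and `utf8_safe_cut` are hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the commit): true if the byte is a UTF-8
// continuation byte, i.e. it has the bit pattern 10xxxxxx.
static bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper: number of UTF-8 characters (code points) in `text`,
// counted as the number of bytes that are not continuation bytes.
// This differs from strlen(), which counts bytes.
static size_t utf8_length(const std::string & text) {
    size_t n = 0;
    for (unsigned char c : text) {
        if (!is_utf8_continuation(c)) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: largest cut position <= max_bytes that does not fall
// inside a multi-byte character. Assumes `text` is valid UTF-8.
static size_t utf8_safe_cut(const std::string & text, size_t max_bytes) {
    if (max_bytes >= text.size()) {
        return text.size();
    }
    size_t cut = max_bytes;
    // text[cut] would become the first byte of the second half; step back
    // while it is a continuation byte so the character stays in one piece.
    while (cut > 0 && is_utf8_continuation(static_cast<unsigned char>(text[cut]))) {
        --cut;
    }
    return cut;
}
```

For example, `"caf\xC3\xA9s"` ("cafés") is 6 bytes but 5 characters. A naive cut at byte 4 would separate `0xC3` from `0xA9`, while `utf8_safe_cut` moves the cut back to byte 3, before the `é`.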
src/whisper.cpp