* tools/main: llama-cli: prevent spurious assistant token (#13402)
During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended the piece for `common_sampler_last(smpl)` to `assistant_ss` before any new token had been sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with a spurious extra piece.
Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
Fixes #13402.
Signed-off-by: Vinkal Chudgar <redacted>
* Update tools/main/main.cpp
Co-authored-by: Sigbjørn Skjæret <redacted>
* tools/main: remove outdated comment
Signed-off-by: Vinkal Chudgar <redacted>
---------
Signed-off-by: Vinkal Chudgar <redacted>
Co-authored-by: Sigbjørn Skjæret <redacted>
embd.push_back(id);
+ if (params.conversation_mode && !waiting_for_first_input && !llama_vocab_is_eog(vocab, id)) {
+ assistant_ss << common_token_to_piece(ctx, id, false);
+ }
+
// echo this to console
input_echo = true;
}
}
- // if current token is not EOG, we add it to current assistant message
if (params.conversation_mode && !waiting_for_first_input) {
- const auto id = common_sampler_last(smpl);
- assistant_ss << common_token_to_piece(ctx, id, false);
-
if (!prompt.empty()) {
prompt.clear();
is_interacting = false;