From: Vinkal
Date: Mon, 29 Sep 2025 07:03:12 +0000 (+0530)
Subject: llama-cli: prevent spurious assistant token (#16202)
X-Git-Tag: upstream/0.0.6641~17
X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=2f61c0f5bf8a620ca4c3872408803ab38cfb9613;p=pkg%2Fggml%2Fsources%2Fllama.cpp

llama-cli: prevent spurious assistant token (#16202)

* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler
history (for repetition penalties). The conversation-mode path then
appended `common_sampler_last(smpl)` to `assistant_ss` before any new
token was sampled. At that point, "last" was a prompt-side token (e.g.,
an input prefix), so the assistant chat message began with an extra
piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token.

This affects only chat message assembly (`assistant_ss` / `chat_msgs` /
`common_chat_format_single`); terminal stdout is unchanged. Sampling
order/logits are unchanged.

Fixes #13402.
Signed-off-by: Vinkal Chudgar

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar

---------

Signed-off-by: Vinkal Chudgar
Co-authored-by: Sigbjørn Skjæret
---

diff --git a/tools/main/main.cpp b/tools/main/main.cpp
index 083fc0cf..498e00e3 100644
--- a/tools/main/main.cpp
+++ b/tools/main/main.cpp
@@ -707,6 +707,10 @@ int main(int argc, char ** argv) {
 
             embd.push_back(id);
 
+            if (params.conversation_mode && !waiting_for_first_input && !llama_vocab_is_eog(vocab, id)) {
+                assistant_ss << common_token_to_piece(ctx, id, false);
+            }
+
             // echo this to console
             input_echo = true;
 
@@ -824,11 +828,7 @@ int main(int argc, char ** argv) {
 
                 }
             }
-            // if current token is not EOG, we add it to current assistant message
             if (params.conversation_mode && !waiting_for_first_input) {
-                const auto id = common_sampler_last(smpl);
-                assistant_ss << common_token_to_piece(ctx, id, false);
-
                 if (!prompt.empty()) {
                     prompt.clear();
                     is_interacting = false;