
[BUG] LLM replies are sometimes truncated even though max_tokens is set very high. Isn't this parameter supposed to specify the number of tokens the LLM returns? Why does it still output only a small part of the content? Could someone take a look? Thanks! #5313


Open
azhe1234 opened this issue Apr 21, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@azhe1234

Problem Description
LLM replies are sometimes truncated even though max_tokens is set very high. Isn't this parameter supposed to specify the number of tokens the LLM returns? Why does it still output only a small part of the content? Could someone take a look? Thanks!

Steps to Reproduce

  1. Enter a prompt.
  2. Send it to the LLM.
  3. The LLM returns a truncated result.

Expected Result
The full response should be returned.

Actual Result
Only a small fraction of the tokens is returned.

Environment Information

  • Langchain-Chatchat version / commit: 0.3.1
  • Deployment method (pypi / source / docker): source
  • Model inference framework (Xinference / Ollama / OpenAI API, etc.): Xinference
  • LLM model (GLM-4-9B / Qwen2-7B-Instruct, etc.): deepseek-r1-distill-qwen-32b
  • Embedding model (bge-large-zh-v1.5 / m3e-base, etc.): bge-m3
  • Vector store type (faiss / milvus / pg_vector, etc.): faiss
  • Operating system and version: Linux
  • Python version: 3.8
  • Inference hardware (GPU / CPU / MPS / NPU, etc.): GPU
  • Other relevant environment information:

Additional Information

[screenshot]
As shown in the screenshot, I built my own frontend and entered a prompt, but the content the LLM returns is truncated.
The chunks coming back on the server's event stream are also cut off:

[screenshot]
The stream just stops here. This has been bothering me for a long time; does anyone know how to fix it? Thanks!

azhe1234 added the bug (Something isn't working) label on Apr 21, 2025

dosubot bot commented Apr 21, 2025

I found a similar open issue regarding truncated responses in the Langchain-Chatchat repository. You can check the details here [1]. However, no solution has been found in the comments.

Regarding your issue, the max_tokens parameter is indeed intended to limit the number of tokens generated by the language model during response generation. If the responses are still truncated despite setting a large max_tokens, there might be other factors at play, such as network issues or server-side handling errors that could cause incomplete data transmission [2][3].

Additionally, the streaming API handles chunked responses, and truncation could occur due to network connection issues or errors in server response handling. The fetchSSE method includes error handling to manage such scenarios, but specific causes of truncation would depend on the implementation details and network conditions [2][3].

You might want to check the network connection stability and ensure that the server-side implementation correctly handles streaming responses. Also, verify that the max_tokens parameter is correctly configured and passed to the LLM during instantiation [4].
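
A quick way to narrow this down is to call the OpenAI-compatible endpoint that Xinference exposes directly and inspect the finish_reason of the final streamed chunk: "length" means the output was cut by max_tokens or the model's context window, while a stream that ends with no finish_reason at all points at the transport or server-side handling. A minimal diagnostic sketch, assuming a default local Xinference endpoint and the model name mentioned in this issue (adjust both to your deployment):

```python
# Minimal diagnostic sketch (not the Langchain-Chatchat code path):
# stream a completion straight from Xinference's OpenAI-compatible API
# and report how the stream ended. The base_url/port below are an
# assumption about a default local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-needed")

finish_reason = None
stream = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",  # name the model was launched under
    messages=[{"role": "user", "content": "Write a long essay about LLM inference."}],
    max_tokens=8192,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:          # some servers emit usage-only chunks
        continue
    choice = chunk.choices[0]
    if choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason

print("\nfinish_reason:", finish_reason)
# "length" -> cut off by max_tokens or the model's context window
# "stop"   -> the model stopped on its own; truncation happens downstream
# None     -> the stream died before a final chunk (network / server issue)
```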


@tf18536

tf18536 commented Apr 26, 2025

Check the context length configured for the LLM in Xinference.
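
For context on why this matters: max_tokens is only an upper bound. The hard ceiling is the context window the model was launched with in Xinference, minus the prompt tokens (and for R1-style models the <think> reasoning tokens also count toward the completion). A rough back-of-the-envelope sketch with illustrative numbers, not values measured from this deployment:

```python
# Illustrative sketch of the token-budget arithmetic behind the advice above.
# context_length is whatever the model was launched with in Xinference;
# prompt_tokens can be read from the `usage` field of a non-streaming response.
def completion_budget(context_length: int, prompt_tokens: int, max_tokens: int) -> int:
    """Tokens the model can actually generate before hitting the context window."""
    return max(0, min(max_tokens, context_length - prompt_tokens))

# A long RAG prompt leaves almost no room for the answer when the model is
# served with a small window, no matter how large max_tokens is set:
print(completion_budget(context_length=4096, prompt_tokens=3800, max_tokens=32000))   # 296
print(completion_budget(context_length=32768, prompt_tokens=3800, max_tokens=32000))  # 28968
```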

@azhe1234
Author

Check the context length configured for the LLM in Xinference.

The context length was set to 32000. It seems this problem keeps coming up with the deepseek model's training template...
