
[BUG] LLM replies are sometimes truncated even though max_tokens is set very high. Isn't this parameter supposed to specify the number of tokens the LLM returns? Why does it still output only a small part of the content? Could someone take a look? Thanks! #5313


Open
azhe1234 opened this issue Apr 21, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@azhe1234

Problem Description
LLM replies are sometimes truncated even though max_tokens is set very high. Isn't this parameter supposed to specify the number of tokens the LLM returns? Why does it still output only a small part of the content? Could someone take a look? Thanks!

Steps to Reproduce

  1. Enter a prompt.
  2. Send it to the LLM.
  3. The LLM returns a truncated result.

Expected Result
The full response should be returned.

Actual Result
Only a small fraction of the tokens is returned.

Environment Information

  • Langchain-Chatchat version / commit: 0.3.1
  • Deployment method (pypi / source / docker): source
  • Model inference framework (Xinference / Ollama / OpenAI API, etc.): Xinference
  • LLM model (GLM-4-9B / Qwen2-7B-Instruct, etc.): deepseek-r1-distill-qwen-32b
  • Embedding model (bge-large-zh-v1.5 / m3e-base, etc.): bge-m3
  • Vector store type (faiss / milvus / pg_vector, etc.): faiss
  • Operating system and version: Linux
  • Python version: 3.8
  • Inference hardware (GPU / CPU / MPS / NPU, etc.): GPU
  • Other relevant environment information:

Additional Information

[screenshot]
As shown in the screenshot, I built my own frontend and entered a prompt, but the content the LLM returns is truncated.
The chunks coming back on the server's event stream are also cut off:

[screenshot]
The stream just stops here. This has been bothering me for a long time; does anyone know how to fix it? Thanks!

azhe1234 added the bug (Something isn't working) label on Apr 21, 2025

dosubot bot commented Apr 21, 2025

I found a similar open issue regarding truncated responses in the Langchain-Chatchat repository. You can check the details here [1]. However, no solution has been found in the comments.

Regarding your issue, the max_tokens parameter is indeed intended to limit the number of tokens generated by the language model during response generation. If the responses are still truncated despite setting a large max_tokens, there might be other factors at play, such as network issues or server-side handling errors that could cause incomplete data transmission [2][3].

Additionally, the streaming API handles chunked responses, and truncation could occur due to network connection issues or errors in server response handling. The fetchSSE method includes error handling to manage such scenarios, but specific causes of truncation would depend on the implementation details and network conditions [2][3].

You might want to check the network connection stability and ensure that the server-side implementation correctly handles streaming responses. Also, verify that the max_tokens parameter is correctly configured and passed to the LLM during instantiation [4].
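
A quick way to narrow this down is to call the OpenAI-compatible endpoint that Xinference exposes directly and inspect the finish_reason of the final streamed chunk: "length" means the output was cut by max_tokens or the model's context window, while a stream that ends with no finish_reason at all points at the transport or server-side handling. A minimal diagnostic sketch, assuming a default local Xinference endpoint and the model name mentioned in this issue (adjust both to your deployment):

```python
# Minimal diagnostic sketch (not the Langchain-Chatchat code path):
# stream a completion straight from Xinference's OpenAI-compatible API
# and report how the stream ended. The base_url/port below are an
# assumption about a default local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-needed")

finish_reason = None
stream = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",  # name the model was launched under
    messages=[{"role": "user", "content": "Write a long essay about LLM inference."}],
    max_tokens=8192,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:          # some servers emit usage-only chunks
        continue
    choice = chunk.choices[0]
    if choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason

print("\nfinish_reason:", finish_reason)
# "length" -> cut off by max_tokens or the model's context window
# "stop"   -> the model stopped on its own; truncation happens downstream
# None     -> the stream died before a final chunk (network / server issue)
```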


@tf18536

tf18536 commented Apr 26, 2025

Check the context length configured for the LLM in Xinference.
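
For context on why this matters: max_tokens is only an upper bound. The hard ceiling is the context window the model was launched with in Xinference, minus the prompt tokens (and for R1-style models the <think> reasoning tokens also count toward the completion). A rough back-of-the-envelope sketch with illustrative numbers, not values measured from this deployment:

```python
# Illustrative sketch of the token-budget arithmetic behind the advice above.
# context_length is whatever the model was launched with in Xinference;
# prompt_tokens can be read from the `usage` field of a non-streaming response.
def completion_budget(context_length: int, prompt_tokens: int, max_tokens: int) -> int:
    """Tokens the model can actually generate before hitting the context window."""
    return max(0, min(max_tokens, context_length - prompt_tokens))

# A long RAG prompt leaves almost no room for the answer when the model is
# served with a small window, no matter how large max_tokens is set:
print(completion_budget(context_length=4096, prompt_tokens=3800, max_tokens=32000))   # 296
print(completion_budget(context_length=32768, prompt_tokens=3800, max_tokens=32000))  # 28968
```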

@azhe1234
Author

Check the context length configured for the LLM in Xinference.

The context length was set to 32000. It seems this problem keeps coming up with the deepseek model's training template...
