It seems that this is caused by the missing lookahead parameters: max_draft_len is determined by the (W, N, G) parameters. Setting the decoding method to lookahead alone is not enough to make it run; the decoding parameters also need to be set.
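To make the relationship concrete, here is a minimal sketch of how max_draft_len can be derived from (W, N, G). The formula is the one I recall from the TensorRT-LLM lookahead documentation, so treat it as an assumption and check it against the version you build with; the function and argument names are mine.

```python
def max_draft_len(window_size: int, ngram_size: int, verification_set_size: int) -> int:
    """Draft length implied by a lookahead (W, N, G) configuration.

    Assumed formula (verify against your TensorRT-LLM version):
        (0 if N == 1 else N - 2) + (W - 1 + G) * (N - 1)
    """
    if min(window_size, ngram_size, verification_set_size) < 1:
        raise ValueError("W, N and G must all be positive integers")
    extra = 0 if ngram_size == 1 else ngram_size - 2
    return extra + (window_size - 1 + verification_set_size) * (ngram_size - 1)


if __name__ == "__main__":
    # (W, N, G) = (4, 3, 4) gives 15 under this formula.
    print(max_draft_len(4, 3, 4))
```

Under this formula (4, 3, 4) yields 15, which happens to be the --max_draft_len value the issue reports as working, so the runtime may be falling back to such a configuration when (W, N, G) is not supplied. That is an inference, not something confirmed here.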
1. Add the (W, N, G) parameters to config.pbtxt.
2. Modify the source code: add code that reads (W, N, G) and sets the decoding parameters, then recompile.
3. Enjoy the decoding speedup brought by lookahead.
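As a rough illustration of the "read (W, N, G)" step, the sketch below parses a single string value such as "[4, 3, 4]" into three validated integers. The string format and the names `LookaheadConfig` and `parse_lookahead_config` are hypothetical; the actual backend change is in the compiled source and may expose the parameters differently.

```python
from typing import NamedTuple


class LookaheadConfig(NamedTuple):
    """Hypothetical container for the lookahead (W, N, G) triple."""
    window_size: int            # W
    ngram_size: int             # N
    verification_set_size: int  # G


def parse_lookahead_config(value: str) -> LookaheadConfig:
    """Parse a string such as "[4, 3, 4]" or "4,3,4" into (W, N, G).

    The string format is an assumption made for this sketch only.
    """
    parts = [p.strip() for p in value.strip().strip("[]()").split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three values (W, N, G), got {value!r}")
    try:
        w, n, g = (int(p) for p in parts)
    except ValueError as exc:
        raise ValueError(f"W, N and G must be integers, got {value!r}") from exc
    if min(w, n, g) < 1:
        raise ValueError(f"W, N and G must be positive, got {value!r}")
    return LookaheadConfig(w, n, g)


if __name__ == "__main__":
    print(parse_lookahead_config("[4, 3, 4]"))
```

Validating the triple at load time means a bad value fails with a readable message instead of surfacing later as an assertion inside the runtime.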
In our tests, lookahead reduced latency by 57% on the qwen2-7B model. Because lookahead decoding involves no sampling randomness, its output is identical to the top_k=1 result, so it can be considered fully aligned in terms of accuracy.
System Info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Build the model:
Start the container:
Quantize the model:
Build:
Adapt model repo:
Add the following to config.pbtxt:
Run with Tritonserver:
Start the container:
Start Tritonserver:
tritonserver --model-repository=/models
Expected behavior
Tritonserver should start successfully, and model inference should be available.
actual behavior
Tritonserver fails to start with the following assertion error:
additional notes
Changing `--max_draft_len` to 15 allows Tritonserver to start, but this prevents selecting the desired max_draft_len value.