[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 #17504
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a limited subset runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀 …
Signed-off-by: qizixi <[email protected]>
Force-pushed from 5d20cdd to e8b1ab6
LGTM. Thanks for doing this! It'd be nice if @luyuzhe111 can also take a look.
@zixi-qi Oh, can you please fix the format error in the …
@zixi-qi Thanks for the PR! LGTM overall; I only left some minor suggestions. Factoring the …
Signed-off-by: qizixi <[email protected]>
Address comments and apply pre-commit lint suggestions
RE moving the fc layer into eager mode: it is mostly just an aten::mm, so it seems pretty cheap.
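The point that the fc merge amounts to a single matmul can be illustrated with a small sketch (all sizes and names below are made up for illustration, not taken from the PR):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: the EAGLE3 fc layer merges the concatenated
# auxiliary hidden states with one linear projection, i.e. a single
# matmul (aten::mm/addmm), so keeping it in eager mode costs little.
hidden, num_aux, batch = 128, 3, 8
fc = nn.Linear(hidden * num_aux, hidden, bias=False)

aux_hidden_states = [torch.randn(batch, hidden) for _ in range(num_aux)]
merged = fc(torch.cat(aux_hidden_states, dim=-1))

# Equivalent to one matmul against the weight's transpose.
manual = torch.cat(aux_hidden_states, dim=-1) @ fc.weight.t()
assert torch.allclose(merged, manual, atol=1e-5)
print(merged.shape)  # torch.Size([8, 128])
```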
@zixi-qi thanks for incorporating the suggestions! One final thing: were you able to check that the acceptance length with torch.compile & cuda graph matches that with eager mode, as in #16937 (comment)?
This is a great point: there does seem to be a difference in acceptance rate with and without eager mode. The tests below were run with mt_bench plus the patch from #16367.
The Eagle pass does not have the same issue, so perhaps moving the fc layer had a negative impact on numerics?
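A minimal sketch of how such an acceptance-length comparison might be scripted (the helper and the per-step counts are hypothetical, not from the PR):

```python
def mean_acceptance_length(accepted_counts: list[int]) -> float:
    """Average number of accepted draft tokens per speculation step,
    plus the bonus token. Hypothetical helper for comparing an eager
    run against a torch.compile/cudagraph run on the same prompts."""
    return sum(c + 1 for c in accepted_counts) / len(accepted_counts)

# Made-up per-step accepted-token counts from two runs of the same prompts.
eager_run = [3, 2, 3, 1]
compiled_run = [3, 2, 3, 1]

diff = abs(mean_acceptance_length(eager_run) - mean_acceptance_length(compiled_run))
print(diff)  # 0.0 when the two runs accept identically
```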
@zixi-qi My guess is that this can be related to vLLM's compilation cache…
@zixi-qi thanks for sharing the testing results. If you think there is something wrong with the cuda graph implementation for EAGLE3, you can add …
Signed-off-by: Woosuk Kwon <[email protected]>
@luyuzhe111 @zixi-qi I've updated the PR. Thanks for catching the bug! Will merge once it passes the CI.
Thanks so much for catching and fixing the issue! Also verified locally that the acceptance rate issue is fixed
Follow-up to #17211: add support for torch.compile & cudagraph in EAGLE3 by moving the merge of the multiple auxiliary hidden states out of the compiled region.
Test plan:
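The refactor described above can be sketched roughly as follows. Module names and sizes are hypothetical, not the PR's actual code, and `backend="eager"` (a PyTorch debug backend) is used only so the sketch runs without a compiler toolchain:

```python
import torch
import torch.nn as nn

class DraftForward(nn.Module):
    """Stand-in for the draft model's forward pass. After the refactor,
    it only ever sees a single merged hidden-state tensor, so the
    compiled/captured graph has a fixed input signature."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(hidden_states))

def merge_aux(fc: nn.Linear, aux: list[torch.Tensor]) -> torch.Tensor:
    # The auxiliary hidden-state merge runs in eager mode, before
    # entering the compiled region.
    return fc(torch.cat(aux, dim=-1))

hidden = 32
fc = nn.Linear(hidden * 3, hidden)
draft = DraftForward(hidden)
# "eager" debug backend keeps the sketch runnable anywhere; the real
# setup would use the default inductor backend plus cudagraph capture.
compiled_forward = torch.compile(draft, backend="eager")

aux = [torch.randn(4, hidden) for _ in range(3)]
out = compiled_forward(merge_aux(fc, aux))
print(out.shape)  # torch.Size([4, 32])
```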