Description
📚 The doc issue
Use the docker image rocm/vllm-dev:base_aiter_test_main_20250606_tuned_20250609
as the base to build our vLLM docker image.
Upload the per-model results files to the following link:
https://embeddedllm502.sharepoint.com/:f:/s/ExternalSharing/Em4fFI2PF6hEoyW2OlvbJwcB-lOPSao34-Rw0Cv4CSv7CA?e=uqyhMQ
We need to run experiments similar to this blog post (https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html) across multiple facets of the vLLM V1 engine arguments, to find the best way to serve models with the V1 engine.
Test Server Mode
- Number of prompts: 200
- ISL/OSL: 500 ShareGPT dataset, 1000/2000, 5000/1000
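As a concrete reading of the workload spec above, here is a minimal sketch of a hypothetical driver that builds `benchmark_serving.py` invocations for the three workloads. The flag names follow the serving benchmark client used in the blog post and should be verified against the copy shipped in the docker image; the driver itself is illustrative only.

```python
# Hypothetical sweep driver: builds benchmark_serving.py commands for the
# three workloads (ShareGPT, 1000/2000, 5000/1000) with 200 prompts each.
# Verify the flag names against benchmarks/benchmark_serving.py in the image.
NUM_PROMPTS = 200
WORKLOADS = [
    {"dataset": "sharegpt", "isl": None, "osl": None},  # ShareGPT natural lengths
    {"dataset": "random",   "isl": 1000, "osl": 2000},
    {"dataset": "random",   "isl": 5000, "osl": 1000},
]

def benchmark_cmd(model: str, workload: dict, port: int = 8000) -> list[str]:
    """Return the benchmark client command for one workload against a running server."""
    cmd = [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", model,
        "--port", str(port),
        "--num-prompts", str(NUM_PROMPTS),
        "--dataset-name", workload["dataset"],
        "--save-result",
    ]
    if workload["dataset"] == "random":
        cmd += ["--random-input-len", str(workload["isl"]),
                "--random-output-len", str(workload["osl"])]
    # Note: the ShareGPT run additionally needs --dataset-path pointing at the
    # downloaded ShareGPT JSON file.
    return cmd

# Example: print the commands for one model instead of running them.
for w in WORKLOADS:
    print(" ".join(benchmark_cmd("meta-llama/Llama-3.1-70B-Instruct", w)))
```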
The following aspects should be treated as a permutation grid (a sketch mapping them to engine arguments follows the list):
- With/Without AITER
  i. With AITER
  ii. Without AITER
- Chunked Prefill
  i. Without chunked prefill
  ii. Chunk size = default (find out from source code)
  iii. Chunk size = 2048
  iv. Chunk size = 4096
  v. Chunk size = 8192
  vi. Chunk size = 16384
  vii. Chunk size = 32768
- Prefix Caching
  i. Without prefix caching
  ii. With prefix caching
- Models:
  i. Llama4-Maverick-FP8 - TP8
  ii. DeepSeekV3 - TP8
  iii. Llama-3.1-70B-Instruct - TP1
  iv. Qwen3-32B-Instruct - TP1
  v. Llama-3.1-70B-Instruct --quantization ptpc_fp8 - TP1
  vi. Qwen/Qwen3-32B-FP8 - TP1
- If it has been merged ([Attention][V1] Toggle for v1 attention backend vllm-project/vllm#18275)
  - VLLM_V1_USE_PREFILL_DECODE_ATTENTION=True (this will use the prefill-decode attention)
  - VLLM_V1_USE_PREFILL_DECODE_ATTENTION=False (this will use the unified Triton attention)
- block-size
  - 1 (MLA ROCm only)
  - 8
  - 16
  - 32
  - 64
  - 128
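To make the grid concrete, here is a minimal sketch of a single permutation point expressed through vLLM's Python engine arguments; the same knobs are exposed as CLI flags on `vllm serve` (e.g. `--enable-prefix-caching`, `--max-num-batched-tokens`, `--block-size`, `--quantization`, `-tp`). Two assumptions to verify against the docker image: the "chunk size" above is controlled via `max_num_batched_tokens` when chunked prefill is enabled, and the AITER toggle is the `VLLM_ROCM_USE_AITER` environment variable.

```python
# Hypothetical example of one point in the permutation grid, not a prescribed setup.
import os

# Attention-path toggles (set before importing vllm). Names assumed as above;
# confirm against the environment variables documented in the docker image.
os.environ["VLLM_ROCM_USE_AITER"] = "1"                    # with/without AITER
os.environ["VLLM_V1_USE_PREFILL_DECODE_ATTENTION"] = "1"   # prefill-decode vs. unified Triton attention

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=1,
    # Chunked prefill: the chunk size is governed by max_num_batched_tokens.
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
    # Prefix caching on/off.
    enable_prefix_caching=True,
    # KV-cache block size (1 is valid only for MLA models on ROCm).
    block_size=16,
    # Only for the PTPC-FP8 variant of the Llama-3.1-70B run.
    quantization="ptpc_fp8",
)
```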
Suggest a potential alternative/fix
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.