Why do we need triton_max_batch_size and the runtime engine max_batch_size separately? #753
sushilkumar-yadav asked this question in Q&A (unanswered).
I recently tested the batching feature using the Triton Inference Server. Below are the steps I followed.
I'm wondering about the purpose of the max_batch_size parameter in the preprocessing model's config. Changing that value produces no visible difference in the output, but changing --max_batch_size when building the engine (the trtllm-build command below) does change the observed latency.
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir ${ENGINE_DIR} \
    --paged_kv_cache enable \
    --max_batch_size 2048
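The build-time limit is baked into the engine itself. As a quick sanity check (a sketch, assuming the engine directory contains a config.json whose build_config section records max_batch_size, which is the layout recent TensorRT-LLM releases write):

python3 -c "import json; cfg = json.load(open('/engines/llama3.1-8b-instruct/1-gpu/config.json')); print(cfg['build_config']['max_batch_size'])"  # expect 2048 for the build above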
python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/llama3.1-8b-instruct/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-3.1-8B-Instruct-hf --input_text "What is ML?"
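When comparing engines built with different --max_batch_size values, I measure latency by timing the same invocation with the shell's time builtin:

time python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/llama3.1-8b-instruct/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-3.1-8B-Instruct-hf --input_text "What is ML?"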
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
Fill in the model configs (preprocessing, postprocessing, BLS, ensemble, and tensorrt_llm):
TOKENIZER_DIR=/Llama-3.1-8B-Instruct-hf/
TOKENIZER_TYPE=auto
ENGINE_DIR=/engines/llama3.1-8b-instruct/1-gpu/
DECOUPLED_MODE=true
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=16
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=100
TRITON_BACKEND=tensorrtllm
LOGITS_DATATYPE="TYPE_FP32"
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
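To confirm where triton_max_batch_size ends up, I grep the filled configs; the value should appear as each model's max_batch_size field in config.pbtxt (assuming the field name used by the inflight_batcher_llm templates), which caps Triton-side batching independently of the engine's build-time limit:

grep -n "max_batch_size" ${MODEL_FOLDER}/preprocessing/config.pbtxt ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt  # expect max_batch_size: 16 in each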
Thanks