Why do we need triton_max_batch_size and the runtime engine max_batch_size separately? #753
sushilkumar-yadav asked this question in Q&A (unanswered).
I recently tested the batching feature using the Triton Inference Server. Below are the steps I followed.
I'm wondering about the purpose of the max_batch_size parameter in the preprocessing model's config. Changing that value produces no visible difference in the output, but changing --max_batch_size when building the engine (the trtllm-build command below) does change the observed latency.
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir ${ENGINE_DIR} \
    --paged_kv_cache enable \
    --max_batch_size 2048
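The build-time limit is baked into the engine itself. As a quick sanity check (a sketch, assuming the engine directory contains a config.json whose build_config section records max_batch_size, which is the layout recent TensorRT-LLM releases write):

python3 -c "import json; cfg = json.load(open('/engines/llama3.1-8b-instruct/1-gpu/config.json')); print(cfg['build_config']['max_batch_size'])"  # expect 2048 for the build above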
python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/llama3.1-8b-instruct/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-3.1-8B-Instruct-hf --input_text "What is ML?"
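When comparing engines built with different --max_batch_size values, I measure latency by timing the same invocation with the shell's time builtin:

time python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/llama3.1-8b-instruct/1-gpu/ --max_output_len 50 --tokenizer_dir /Llama-3.1-8B-Instruct-hf --input_text "What is ML?"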
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
Fill in the model configs (preprocessing, postprocessing, BLS, ensemble, and tensorrt_llm):
TOKENIZER_DIR=/Llama-3.1-8B-Instruct-hf/
TOKENIZER_TYPE=auto
ENGINE_DIR=/engines/llama3.1-8b-instruct/1-gpu/
DECOUPLED_MODE=true
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=16
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=100
TRITON_BACKEND=tensorrtllm
LOGITS_DATATYPE="TYPE_FP32"
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
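To confirm where triton_max_batch_size ends up, I grep the filled configs; the value should appear as each model's max_batch_size field in config.pbtxt (assuming the field name used by the inflight_batcher_llm templates), which caps Triton-side batching independently of the engine's build-time limit:

grep -n "max_batch_size" ${MODEL_FOLDER}/preprocessing/config.pbtxt ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt  # expect max_batch_size: 16 in each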
Thanks