Triton Server uses incorrect batch size #8141

saarus72 opened this issue Apr 11, 2025 · 0 comments

Description
Triton feeds the model batches whose size matches neither preferred_batch_size nor max_batch_size, even though max_queue_delay_microseconds has not been exceeded.

I am trying to improve the performance of my TensorRT model, which was exported from PyTorch. As is usual for a TensorRT model, all (or some) of the optimal shapes are listed in the preferred_batch_size section. While investigating, I noticed that some model configurations led to slow inference times. Digging deeper, I discovered that under any excessive workload the Triton Inference Server tends to feed the model batches of an incorrect size (max(preferred_batch_size) + 1, specifically), which slows inference in the TensorRT case because these sizes differ from those in the model's Optimization Profiles.

It does not seem to be backend related (I have reproduced it on both LibTorch and TensorRT). It is not frontend related either: I used Locust with HTTP REST requests and system shared memory, but reproduced it with molotov using the Python grpc.aio client without shared memory as well. Increasing or decreasing the number of model instances changes nothing.

Part of an example config is listed below.

backend: "tensorrt"
max_batch_size: 256

# ...

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 5000000
  preferred_batch_size: [ 8, 16 ]
}

When I create enough load to keep the internal Triton queue always above 16 but below 256 requests, the model begins to process batches of size 17. max_queue_delay_microseconds is set to an extremely high value to make sure this is not timeout behavior.
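
A minimal sketch of that load pattern with the Python grpc.aio client is below. The model name matches the logs further down; the input tensor name, shape, and datatype are placeholders, not my model's actual signature.

import asyncio
import numpy as np
from tritonclient.grpc import InferInput
from tritonclient.grpc.aio import InferenceServerClient

MODEL = "model_repr"      # as in the logs below
INPUT_NAME = "INPUT__0"   # placeholder: adjust to the model's real input name
INPUT_SHAPE = [1, 16]     # placeholder shape with batch dimension 1

async def one_request(client):
    inp = InferInput(INPUT_NAME, INPUT_SHAPE, "FP32")
    inp.set_data_from_numpy(np.random.rand(*INPUT_SHAPE).astype(np.float32))
    await client.infer(MODEL, [inp])

async def main():
    client = InferenceServerClient("localhost:8001")
    try:
        # Fire 64 single-item requests at a time so the server queue stays
        # well above max(preferred_batch_size) = 16 and well below
        # max_batch_size = 256.
        for _ in range(1000):
            await asyncio.gather(*(one_request(client) for _ in range(64)))
    finally:
        await client.close()

asyncio.run(main())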

I used --log-verbose=2 --log-info=true --log-file=/logs/log.txt so that I could see all the batch sizes the model processes. The output of grep "executing" log.txt is below.

I0411 10:13:33.617706 1 libtorch.cc:2660] "model model_repr, instance model_repr_0_0, executing 8 requests"
I0411 10:13:34.351349 1 libtorch.cc:2660] "model model_repr, instance model_repr_0_0, executing 17 requests"
I0411 10:13:35.257131 1 libtorch.cc:2660] "model model_repr, instance model_repr_0_0, executing 17 requests"
I0411 10:13:35.994047 1 libtorch.cc:2660] "model model_repr, instance model_repr_0_0, executing 17 requests"
...

I always send requests of size 1, so 17 requests means a batch size of 17. (Using requests of size 2 leads to the same behavior of processing 17 requests.)

The same batch sizes are visible via the rate(nv_inference_request_success[15s])/rate(nv_inference_exec_count[15s]) Prometheus metrics.

If I stop using max_queue_delay_microseconds, the Triton Inference Server seems to work normally. Setting max_batch_size equal to the maximum of preferred_batch_size leads to correct behavior as well. Adding a high enough value (such as max_batch_size - 1) to the preferred_batch_size section also prevents Triton from using an incorrect batch size, since the previous maximum value is no longer the maximum.
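
Applied to the config above, that last workaround would look like this:

dynamic_batching {
  max_queue_delay_microseconds: 5000000
  # 255 = max_batch_size - 1, so 16 is no longer the largest preferred size
  preferred_batch_size: [ 8, 16, 255 ]
}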

Triton Information
I use nvcr.io/nvidia/tritonserver:25.02-py3, but 24.02-py3 shows the same behavior.

To Reproduce

  • Use a model and config that are enough to reproduce; I suppose any model should fit.
  • Ensure proper load is created (fewer queued requests than max_batch_size but more than the maximum of preferred_batch_size).
  • Check the logs with --log-verbose=2 or the Prometheus metrics for the max(preferred_batch_size) + 1 value.

Expected behavior
Tritonserver should not process batches whose size is neither in preferred_batch_size nor equal to max_batch_size unless the max_queue_delay_microseconds time is exceeded.
