
How to set cuda-memory-pool-byte-size and handle the case when we are out of this memory #8177

NacerKaciXXII opened this issue Apr 29, 2025 · 0 comments


Description

Hello,

We are using Triton Inference Server in our video AI analysis application and have recently encountered a problem that we suspect is a bug in the server.

For more context, we deploy an ensemble in our Triton server which consists of a preprocessing Python backend, a detection model in TensorRT format, and a postprocessing Python backend for the detector's output. We send requests to the server with a Python gRPC Triton client via the asynchronous inference method. (Like this example)
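For reference, our client follows the same pattern as that example. Here is a simplified sketch of what we do (the URL, model name, tensor names, shape, and dtype below are placeholders, not our real configuration):

```python
import queue

import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder endpoint and names -- our real ensemble and tensors differ.
client = grpcclient.InferenceServerClient(url="localhost:8001")
responses = queue.Queue()

def callback(result, error):
    # Called by the client library when the asynchronous response arrives.
    responses.put((result, error))

image = np.zeros((1, 3, 720, 1280), dtype=np.uint8)
inputs = [grpcclient.InferInput("INPUT_IMAGE", list(image.shape), "UINT8")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("DETECTIONS")]

client.async_infer(model_name="ensemble_model",
                   inputs=inputs,
                   callback=callback,
                   outputs=outputs)

result, error = responses.get()
if error is None:
    detections = result.as_numpy("DETECTIONS")
```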

We noticed Triton was returning empty output arrays for some of the requests we were sending. Upon inspecting the Triton container logs, we found this error message, which was logged each time an empty output was returned:

W0410 13:51:44.743979 1 memory.cc:212] Failed to allocate CUDA memory with byte size 102236160 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory

E0410 13:51:44.772932 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument

E0410 13:51:45.493633 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument

E0410 13:51:45.620086 1 pb_stub.cc:402] An error occurred while trying to load GPU buffers in the Python backend stub: failed to copy data: invalid argument

Note that this error is not raised in the Python backends and does not make the Triton server crash; the server simply returns empty outputs, as though the detection model had not detected any object in the request, which is very problematic.

After some research, we found a similar issue where it was recommended to raise the value of cuda-memory-pool-byte-size to fix this problem. We tried that, and it did indeed fix it.

We tried to understand what this parameter does. The tritonserver --help says this:

--cuda-memory-pool-byte-size <<integer>:<integer>>
	The total byte size that can be allocated as CUDA memory for
	the GPU device. If GPU support is enabled, the server will allocate
	CUDA memory to minimize data transfer between host and devices
	until it exceeds the specified byte size. This option will not affect
	the allocation conducted by the backend frameworks. The argument
	should be 2 integers separated by colons in the format <GPU device
	ID>:<pool byte size>. This option can be used multiple times, but only
	once per GPU device. Subsequent uses will overwrite previous uses for
	the same GPU device. Default is 64 MB.
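For completeness, this is roughly how we pass the flag when launching the server (the 268435456 bytes = 256 MB on GPU 0 shown here is only an illustration, not a recommendation):

```
tritonserver --model-repository=/models \
             --cuda-memory-pool-byte-size=0:268435456
```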

It is not clear to us from this documentation when the server allocates CUDA memory from this pool. We found this issue reply which says that this pool may be used to store, in GPU memory, the data that is passed from one model to the next in the ensemble. For example, it would hold the preprocessing result that will be read by the TensorRT model. With this explanation, we are able to make sense of why the error is happening (see our description below). If this explanation is wrong, however, please give us the correct explanation of what cuda-memory-pool-byte-size is and when Triton allocates memory from it.

Even though increasing cuda-memory-pool-byte-size made the error go away, we are still opening this issue for the two reasons below, on which we are seeking your assistance:

1. We don't know how to set cuda-memory-pool-byte-size in order to avoid this error in all conditions

We noticed that this problem did not happen on all of our test videos; it only happened on videos with a small resolution. After investigating, we found that because the ensemble's preprocessing Python backend runs faster when the input image resolution is smaller, it builds up a lead over the postprocessing backend in the number of requests processed. We observed this lead by logging, in each backend's execute() function, the ID of the request (batch) currently being processed (a simplified version of this logging is sketched after the image below). As you can see in the image, the preprocessing reaches request number 1351 while the postprocessing has only reached request 1340. This lead of about 10 requests means the preprocessing has emitted 10 preprocessed tensors that are stored somewhere in GPU memory, waiting to be consumed by the rest of the ensemble (TensorRT engine + postprocessing). As we said previously, our understanding is that the cuda-memory-pool-byte-size pool is what holds this data passed from one model to the next in the ensemble. We therefore concluded that it is this lead taken by the preprocessing that fills the pool with intermediate data, and that the preprocessing only takes this lead on lower-resolution videos, which explains why the error did not happen on videos with a larger resolution.

[Image: backend logs showing the preprocessing at request (batch) 1351 while the postprocessing is still at request 1340]
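The logging we used to observe this lead looks roughly like the following in each backend's model.py (heavily simplified; the counter shown here stands in for our real request IDs, and the real execute() of course does actual pre/post-processing):

```python
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.batch_counter = 0

    def execute(self, requests):
        responses = []
        for request in requests:
            self.batch_counter += 1
            # Log which request (batch) this backend is currently processing.
            pb_utils.Logger.log_info(
                f"preprocessing: handling batch {self.batch_counter}")
            # ... real pre/post-processing happens here ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses
```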

From this investigation, we are guessing that the memory size we need to allocate via cuda-memory-pool-byte-size depends on the size of the inputs and outputs of the Triton models and on the number of requests being processed concurrently, since each request needs its own copies of these inputs and outputs. If that is the case, we currently have no clear method that allows us to find a value of cuda-memory-pool-byte-size that is guaranteed to be safe under all conditions: we may be able to estimate the size of the models' inputs and outputs, but we cannot estimate the maximum number of concurrent requests.
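To make this concrete, here is the kind of back-of-the-envelope estimate we have in mind; the tensor shapes, dtypes, and concurrency figure are made up for illustration, and we are not sure this is even the right formula:

```python
import numpy as np

# Hypothetical intermediate tensors exchanged between the ensemble steps
# (shapes and dtypes are illustrative, not our real model I/O).
intermediate_tensors = {
    "preprocessed_image": ((1, 3, 640, 640), np.float32),
    "raw_detections": ((1, 25200, 85), np.float32),
}

bytes_per_request = sum(
    int(np.prod(shape)) * np.dtype(dtype).itemsize
    for shape, dtype in intermediate_tensors.values()
)

# Worst-case number of requests whose intermediate tensors are alive at the
# same time -- this is exactly the number we do not know how to bound.
max_in_flight_requests = 32

pool_bytes = bytes_per_request * max_in_flight_requests
print(f"~{bytes_per_request} bytes per request -> "
      f"--cuda-memory-pool-byte-size=0:{pool_bytes}")
```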

2. When this error happens, we can't catch it

When an error like this one happens and the inference request returns empty results, we expect Triton to either crash or at least raise an error that we can catch and handle, instead of silently returning empty results, which currently tricks our Triton client into believing that the detector did not detect any object in the input.
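From the client's point of view, what we would like is for this failure to surface through the error path of the response callback, so that we could handle it like any other server-side error. A sketch, reusing the placeholder output name from the client example above:

```python
def callback(result, error):
    # Response callback passed to async_infer (see the client sketch above).
    if error is not None:
        # This is where we would expect the CUDA pool allocation failure to
        # show up, like other server-side errors already do.
        print(f"request failed: {error}")
        return
    detections = result.as_numpy("DETECTIONS")
    if detections is None or detections.size == 0:
        # Today this branch is also taken when the pool overflows, and we
        # cannot tell it apart from a genuine "no objects detected" result.
        print("empty detections (no objects, or a silent allocation failure)")
    else:
        print(f"received {len(detections)} detections")
```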

Besides, the error message is preceded by a warning saying that the allocation is falling back to pinned system memory, but this fallback never seems to happen, since the "invalid argument" error follows immediately after. We believe this could be a bug in Triton.

Triton Information

What version of Triton are you using?
2.44.0

Are you using the Triton container or did you build it yourself?

We build a custom Docker image based on nvcr.io/nvidia/tritonserver:24.03-py3, in which we install some Python packages used by the Python backends.

Thank you
