How to decrease model inference time #22

Open · richa-nvidia opened this issue Mar 26, 2025 · 7 comments


@richa-nvidia

Hi Team,

I am trying to use this for my application logs by tweaking the security prompt a bit.

A few observations/questions I have:

- I tried switching to the 32B model and performance was even slower.
- Analysis takes quite a lot of time: with 100 chunks it took almost 5 minutes, and if we increase the file size to 5k the analysis takes almost 10 minutes.
- Is there a way to speed up inference so that the analysis runs more quickly?
- I am running this on 4 A100 GPUs. Is there another machine configuration that would be faster and more accurate?
- Any other suggestions you have to make this run in a faster, more optimized way?

@cpfiffer
Contributor

  1. A 32B-parameter model will naturally be slower, since it is a significantly larger model.
  2. It's not surprising that this takes a long time to run -- language models are expensive. The logs are also difficult to tokenize, so there end up being far more tokens than usual.
  3. For general speed improvements, see the vLLM guide in the outlines docs. For your setup, I would make sure that:
  • tensor_parallel_size is set to the number of GPUs you have
  • gpu_memory_utilization is somewhere around 80-90%, though it defaults to 90% and may not require a change.
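
A minimal sketch of those two settings with vLLM directly (not from this thread; the model name is only a placeholder, use whatever model your pipeline actually loads):

```python
# Sketch: launching vLLM with the settings mentioned above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # placeholder model name
    tensor_parallel_size=4,             # match the number of GPUs you have
    gpu_memory_utilization=0.90,        # default is 0.90; lower toward 0.80 if you hit OOM
)

params = SamplingParams(max_tokens=512, temperature=0.0)
outputs = llm.generate(["Analyze the following log chunk: ..."], params)
print(outputs[0].outputs[0].text)
```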

As to a machine configuration, that's about what you should use. Most configurations work at around the same speed.

@richa-nvidia
Author

Well, I was testing this with 4 A100 GPUs. Now I have 8 H100 GPUs that I can use, but I have seen that the model has 28 attention heads, so all 8 GPUs cannot be used? My concern is how to use it with 8, then.

Also, I have noticed that this works with a specific log format (time and message); if the message contains multiple fields, it won't be able to process that. Any thoughts on this?

@dottxt-ai dottxt-ai deleted a comment from richa-nvidia Apr 2, 2025
@cpfiffer
Contributor

cpfiffer commented Apr 2, 2025

> Now I have 8 H100 GPUs that I can use, but I have seen that the model has 28 attention heads, so all 8 GPUs cannot be used? My concern is how to use it with 8, then.

Same thing. This is the tensor_parallel_size argument in vLLM -- the linked docs above should help here.

> Also, I have noticed that this works with a specific log format (time and message); if the message contains multiple fields, it won't be able to process that. Any thoughts on this?

Can you clarify what you mean by "multiple fields" and "won't be able to process that"? Not sure what you mean exactly.

@richa-nvidia
Author

I tried increasing the parallelism and this is the error I am getting:

```
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1260, in create_engine_config
    config = VllmConfig(
  File "<string>", line 19, in __init__
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/config.py", line 3191, in __post_init__
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/config.py", line 694, in verify_with_parallel_config
    raise ValueError(
ValueError: Total number of attention heads (28) must be divisible by tensor parallel size (8).
```

@cpfiffer
Contributor

cpfiffer commented Apr 7, 2025

As the error states:

> ValueError: Total number of attention heads (28) must be divisible by tensor parallel size (8).

Try 4 or 7 instead. As a side note, this is more of a discussion about using vLLM rather than Outlines, so I may need to refer you to another forum.
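
For reference, here is a quick sketch (not from the thread) of which tensor-parallel sizes that check will accept for a model with 28 attention heads on a machine with up to 8 GPUs:

```python
# Sketch: vLLM requires num_attention_heads % tensor_parallel_size == 0.
num_attention_heads = 28
num_gpus = 8

valid_tp_sizes = [tp for tp in range(1, num_gpus + 1)
                  if num_attention_heads % tp == 0]
print(valid_tp_sizes)  # [1, 2, 4, 7]
```

With tensor_parallel_size=7, the eighth GPU would simply go unused for tensor parallelism.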

@richa-nvidia
Author

Well, I have already tried with 4 and got the results below, which is why I wanted to use 8. Anyway, I'll have to go with 4 then.

[screenshot: timing results with tensor_parallel_size=4]

@cpfiffer
Contributor

cpfiffer commented Apr 7, 2025

Try 7.
