How to decrease model inference time #22

Open · richa-nvidia opened this issue Mar 26, 2025 · 7 comments


@richa-nvidia

Hi Team,

I am trying to use this for my application logs by tweaking the security prompt a bit.

A few observations/questions I have:

- I tried switching to the 32B model and performance was even slower.
- Analysis takes quite a lot of time: with 100 chunks it took almost 5 minutes, and if we increase the file size to 5k the analysis takes almost 10 minutes.
- Is there a way to speed up inference so that the analysis runs more quickly?
- I am running this on 4 A100 GPUs. Is there another machine configuration that would be faster and more accurate?
- Any other suggestions you have to make this run in a faster, more optimized way?

@cpfiffer
Contributor

  1. A 32B-parameter model will naturally be slower, since it is a significantly larger model.
  2. It's not surprising that this takes a long time to run -- language models are expensive. The logs are also difficult to tokenize, so there end up being far more tokens than usual.
  3. For general speed improvements, see the vLLM guide in the outlines docs. For your setup, I would make sure that:
  • tensor_parallel_size is set to the number of GPUs you have
  • gpu_memory_utilization is somewhere around 80-90%, though it defaults to 90% and may not require a change.
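
A minimal sketch of those two settings with vLLM directly (not from this thread; the model name is only a placeholder, use whatever model your pipeline actually loads):

```python
# Sketch: launching vLLM with the settings mentioned above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # placeholder model name
    tensor_parallel_size=4,             # match the number of GPUs you have
    gpu_memory_utilization=0.90,        # default is 0.90; lower toward 0.80 if you hit OOM
)

params = SamplingParams(max_tokens=512, temperature=0.0)
outputs = llm.generate(["Analyze the following log chunk: ..."], params)
print(outputs[0].outputs[0].text)
```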

As to a machine configuration, that's about what you should use. Most configurations work at around the same speed.

@richa-nvidia
Author

Well, I was testing this with 4 A100 GPUs. Now I have 8 H100 GPUs that I can use, but I have seen that the model has 28 attention heads, so all 8 GPUs cannot be used? My concern is how to use it with 8, then.

Also, I have noticed that this works with a specific log format (time and message); if the message contains multiple fields, it won't be able to process that. Any thoughts on this?

@dottxt-ai dottxt-ai deleted a comment from richa-nvidia Apr 2, 2025
@cpfiffer
Contributor

cpfiffer commented Apr 2, 2025

> Now I have 8 H100 GPUs that I can use, but I have seen that the model has 28 attention heads, so all 8 GPUs cannot be used? My concern is how to use it with 8, then.

Same thing. This is the tensor_parallel_size argument in vLLM -- the linked docs above should help here.

> Also, I have noticed that this works with a specific log format (time and message); if the message contains multiple fields, it won't be able to process that. Any thoughts on this?

Can you clarify what you mean by "multiple fields" and "won't be able to process that"? Not sure what you mean exactly.

@richa-nvidia
Author

I tried increasing the parallelism and this is the error I am getting:

```
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1260, in create_engine_config
    config = VllmConfig(
  File "<string>", line 19, in __init__
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/config.py", line 3191, in __post_init__
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/localhome/local-ricsingh/demos/logs/ric/lib/python3.10/site-packages/vllm/config.py", line 694, in verify_with_parallel_config
    raise ValueError(
ValueError: Total number of attention heads (28) must be divisible by tensor parallel size (8).
```

@cpfiffer
Contributor

cpfiffer commented Apr 7, 2025

As the error states:

> ValueError: Total number of attention heads (28) must be divisible by tensor parallel size (8).

Try 4 or 7 instead. As a side note, this is more of a discussion about using vLLM rather than Outlines, so I may need to refer you to another forum.
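
For reference, here is a quick sketch (not from the thread) of which tensor-parallel sizes that check will accept for a model with 28 attention heads on a machine with up to 8 GPUs:

```python
# Sketch: vLLM requires num_attention_heads % tensor_parallel_size == 0.
num_attention_heads = 28
num_gpus = 8

valid_tp_sizes = [tp for tp in range(1, num_gpus + 1)
                  if num_attention_heads % tp == 0]
print(valid_tp_sizes)  # [1, 2, 4, 7]
```

With tensor_parallel_size=7, the eighth GPU would simply go unused for tensor parallelism.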

@richa-nvidia
Author

Well, I have already tried with 4 and got the results below, which is why I wanted to use 8. Anyway, I'll have to go with 4 then.

[screenshot: timing results with tensor_parallel_size=4]

@cpfiffer
Contributor

cpfiffer commented Apr 7, 2025

Try 7.
