Speed regression on GPU inference after upgrading from spaCy 3.3.1 to 3.7.5 #13783
Comments
Thanks for reporting this. I've had to shut down the CI we had that checked for performance regressions like this, so these things can slip through currently. It's not intended or expected; I hope we can find a solution.
Thank you for your reply and for confirming that this behavior isn't intended. I also hope it can be resolved soon; please let me know if there's any additional information I can provide. I'm looking forward to any updates or suggestions you might have. Thanks again for your help.
@honnibal @mingzhu-wu I'm far from an expert here, but I've been investigating how to speed up transformer inference.
Again, I'm not an expert in GPU programming, but it seems like a lot of time is spent allocating new arrays with new_layer = model.ops.alloc2f(doc_alignment.ws_n_pieces, hidden_width), and then for each non-whitespace token the token's piece vectors are copied with

    new_layer[
        alignment.ws_piece_offset : alignment.ws_piece_offset + alignment.n_pieces,
        :,
    ] = layer.dataXd[
        alignment.no_ws_piece_offset : alignment.no_ws_piece_offset + alignment.n_pieces,
        :,
    ]

In the text samples I have, the majority of tokens are not whitespace (> 95%), so I think that if this copy were done as a single bulk operation rather than one small slice assignment per token, it would be much faster. I've tested this and it seems to dramatically increase transformer inference speed, but I haven't yet looked for other bottlenecks.
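To make the idea above concrete, here is a minimal sketch (plain NumPy, with a hypothetical Alignment record standing in for the library's alignment data, and invented function names copy_per_token / copy_bulk) contrasting the per-token slice copies with a single fancy-indexed copy. The real code operates on model.ops arrays on the GPU, so this only illustrates the access pattern, not the actual spacy-curated-transformers implementation.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Alignment:
        # Hypothetical stand-in for the per-token alignment record discussed above.
        ws_piece_offset: int     # offset of this token's pieces in the with-whitespace array
        no_ws_piece_offset: int  # offset of this token's pieces in the no-whitespace array
        n_pieces: int            # number of wordpieces for this token

    def copy_per_token(layer, alignments, n_ws_pieces):
        # Current pattern: one small slice assignment per non-whitespace token.
        # On the GPU each assignment is a separate kernel launch, so per-call
        # overhead dominates when documents contain many tokens.
        new_layer = np.zeros((n_ws_pieces, layer.shape[1]), dtype=layer.dtype)
        for a in alignments:
            new_layer[a.ws_piece_offset : a.ws_piece_offset + a.n_pieces, :] = \
                layer[a.no_ws_piece_offset : a.no_ws_piece_offset + a.n_pieces, :]
        return new_layer

    def copy_bulk(layer, alignments, n_ws_pieces):
        # Proposed pattern: build one index vector on the CPU, then do a single
        # gather-style assignment, i.e. one copy instead of one per token.
        index = np.zeros(n_ws_pieces, dtype=np.intp)
        mask = np.zeros(n_ws_pieces, dtype=bool)
        for a in alignments:
            dst = np.arange(a.ws_piece_offset, a.ws_piece_offset + a.n_pieces)
            index[dst] = np.arange(a.no_ws_piece_offset, a.no_ws_piece_offset + a.n_pieces)
            mask[dst] = True
        new_layer = np.zeros((n_ws_pieces, layer.shape[1]), dtype=layer.dtype)
        new_layer[mask] = layer[index[mask]]
        return new_layer

    # Example with hypothetical shapes: both variants produce the same result.
    # layer = np.random.rand(5, 4)
    # alignments = [Alignment(0, 0, 3), Alignment(4, 3, 2)]
    # assert np.allclose(copy_per_token(layer, alignments, 6), copy_bulk(layer, alignments, 6))

With GPU arrays the same gather would be a single indexed copy rather than one per token, which is roughly where the speed-up described above would come from.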
@honnibal I believe I've found another unintended bottleneck, but again I don't know Cython that well. It looks like there is a bottleneck in the byte-level BPE tokenizer. I think that the intention of https://github.com/explosion/curated-tokenizers/blob/main/curated_tokenizers/_bbpe.pyx#L68 […]; however, looking at some profiles, the calls to https://github.com/explosion/curated-tokenizers/blob/main/curated_tokenizers/_bbpe.pyx#L146 […]. I believe that to avoid this, […].
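For reference, one way to see where the time goes (an assumed workflow, not part of the original report; the texts and batch size are placeholders) is to run the pipeline under the standard-library profiler and sort by cumulative time; the listener and tokenizer calls discussed above should then show up near the top if they are the bottleneck.

    import cProfile
    import pstats

    import spacy

    # Assumes a GPU is available and the en_core_web_trf pipeline from the
    # report is installed.
    spacy.require_gpu()
    nlp = spacy.load("en_core_web_trf")
    texts = ["Apple is looking at buying a U.K. startup for $1 billion."] * 200

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in nlp.pipe(texts, batch_size=32):
        pass
    profiler.disable()

    # Show the 25 entries with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)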
Dear spaCy Team,
I recently upgraded from spaCy v3.3.1 to v3.7.5 and observed a significant inference slowdown on GPU when using the same model and input data.
Upon investigation, the issue appears to be related to the introduction of spacy_curated_transformers starting from v3.7.0. Specifically, the last_transformer_layer_listener_forward function in spacy_curated_transformers/models/listeners.py:382 takes approximately 1 second per call in v3.7.5, whereas the equivalent functionality in spaCy v3.3.1 (spacy_transformers) took only about 0.1 seconds.
This slowdown consistently occurs with transformer-based models such as en_core_web_trf, de_dep_news_trf, and zh_core_web_trf, while non-transformer models like xx_ent_wiki_sm remain unaffected.
Could you please confirm whether this slowdown is an unintended regression or whether this performance difference is expected? Thank you in advance!
Steps to Reproduce
Environment A (spaCy v3.3.1): Install spaCy 3.3.1 with CUDA 11.8 support and its dependencies, and download the “en_core_web_trf” model with python -m spacy download en_core_web_trf. Then install torch==2.0.1+cu118, torchvision==0.15.2+cu118, and torchaudio==2.0.2+cu118 from https://download.pytorch.org/whl/torch_stable.html.
Environment B (spaCy v3.7.5): Install spaCy 3.7.5 and its dependencies, download the same “en_core_web_trf” model, and then install the same torch, torchvision, and torchaudio versions.
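The report does not include the timing script itself; a minimal sketch of the kind of comparison described above (run the same texts through the pipeline in both environments and compare throughput) could look like the following, where the texts and batch size are placeholders.

    import time

    import spacy

    spacy.require_gpu()
    nlp = spacy.load("en_core_web_trf")

    # Placeholder input; use the same texts in both environments.
    texts = ["Apple is looking at buying a U.K. startup for $1 billion."] * 500

    # Warm up so model loading and CUDA initialisation don't skew the numbers.
    for _ in nlp.pipe(texts[:32], batch_size=32):
        pass

    start = time.perf_counter()
    n_docs = sum(1 for _ in nlp.pipe(texts, batch_size=32))
    elapsed = time.perf_counter() - start
    print(f"{n_docs} docs in {elapsed:.2f}s ({n_docs / elapsed:.1f} docs/s)")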
Your Environment
Environment A (spaCy v3.3.1) Packages:
Environment B (spaCy v3.7.5) Packages:
Observed Behavior: