
Speed regression on GPU inference after upgrading from spaCy 3.3.1 to 3.7.5 #13783

Status: Open · mingzhu-wu opened this issue Apr 1, 2025 · 4 comments


mingzhu-wu commented Apr 1, 2025

Dear spaCy Team,

I recently upgraded from spaCy v3.3.1 to v3.7.5 and observed a significant inference slowdown on GPU when using the same model and input data.

Upon investigation, the issue appears to be related to the introduction of spacy_curated_transformers in v3.7.0. Specifically, the last_transformer_layer_listener_forward function in spacy_curated_transformers/models/listeners.py:382 takes approximately 1 second per call in v3.7.5, whereas the equivalent functionality in spaCy v3.3.1 (spacy_transformers) took only about 0.1 seconds.

This slowdown consistently occurs with transformer-based models such as en_core_web_trf, de_dep_news_trf, and zh_core_web_trf, while non-transformer models like xx_ent_wiki_sm remain unaffected.

Could you please confirm whether this slowdown is an unintended regression or if this performance difference is expected? Thank you in advance!

Steps to Reproduce

  1. Create two virtual environments:
  • Environment A (spaCy v3.3.1): install spaCy 3.3.1 with CUDA 11.8 support and its dependencies, download the en_core_web_trf model with python -m spacy download en_core_web_trf, then install torch==2.0.1+cu118, torchvision==0.15.2+cu118, and torchaudio==2.0.2+cu118 from https://download.pytorch.org/whl/torch_stable.html

  • Environment B (spaCy v3.7.5): install spaCy 3.7.5 and its dependencies, download the same en_core_web_trf model, then install the same torch versions.

  2. Run the following code snippet in both environments and compare the inference times:
import time
import spacy
import cProfile
import pstats

def load_model():
    s1 = time.time()
    nlp = spacy.load("en_core_web_trf")
    e1 = time.time()
    print(f"Load model took {(e1 - s1) * 1000:.2f} ms")
    return nlp

def proc(nlp, texts):
    s = time.time()
    processed_docs = list(nlp.pipe(texts, batch_size=5))
    e = time.time()
    print(f"Processing took {(e - s) * 1000:.2f} ms")

text = """While some of spaCy's features work independently, others require trained pipelines to be loaded, which enable spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include. The package you choose always depends on your use case and the texts you're working with. For a general-purpose use case, the small, default packages are always a good start. They typically include the following components: Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
Data files like lemmatization rules and lookup tables.
Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
Configuration options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline."""

spacy.require_gpu(1)
nlp_gpu = load_model()
profiler = cProfile.Profile()
profiler.enable()
proc(nlp_gpu, [text])
profiler.disable()
stats = pstats.Stats(profiler).sort_stats('cumtime')
stats.print_stats(50)

Your Environment

  • Operating System: Ubuntu 20.04.4
  • Python Version Used: 3.9.15
  • GPU: NVIDIA GeForce RTX 3090, CUDA 11.8

Environment A (spaCy v3.3.1) Packages:

Package                  Version
------------------------ ------------
blis                     0.7.11
catalogue                2.0.10
certifi                  2025.1.31
charset-normalizer       3.4.1
click                    8.1.8
cupy-cuda112             10.6.0
cymem                    2.0.11
en-core-web-trf          3.3.0
fastrlock                0.8.3
filelock                 3.18.0
fsspec                   2025.3.0
huggingface-hub          0.29.3
idna                     3.10
Jinja2                   3.1.6
langcodes                3.5.0
language_data            1.3.0
marisa-trie              1.2.1
MarkupSafe               3.0.2
mpmath                   1.3.0
murmurhash               1.0.12
networkx                 3.2.1
numpy                    1.24.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.8.93
nvidia-nvtx-cu12         12.1.105
packaging                24.2
pathlib_abc              0.1.1
pathy                    0.11.0
pillow                   11.1.0
pip                      25.0.1
preshed                  3.0.9
pydantic                 1.8.2
PyYAML                   6.0.2
regex                    2024.11.6
requests                 2.32.3
setuptools               78.1.0
smart-open               6.4.0
spacy                    3.3.1
spacy-alignments         0.9.1
spacy-legacy             3.0.12
spacy-loggers            1.0.5
spacy-transformers       1.1.7
srsly                    2.5.1
sympy                    1.13.3
thinc                    8.0.17
tokenizers               0.12.1
torch                    2.0.1+cu118
torchaudio               2.0.2+cu118
torchvision              0.15.2+cu118
tqdm                     4.67.1
transformers             4.20.1
triton                   2.0.0
typer                    0.4.2
typing_extensions        4.5.0
typing-inspection        0.4.0
urllib3                  2.3.0
wasabi                   0.10.1
wheel                    0.45.1

Environment B (spaCy v3.7.5) packages:

Package                    Version
-------------------------- ------------
annotated-types            0.7.0
blis                       0.7.11
catalogue                  2.0.10
certifi                    2025.1.31
charset-normalizer         3.4.1
click                      8.1.8
cloudpathlib               0.21.0
confection                 0.1.5
cupy-cuda11x               13.4.1
curated-tokenizers         0.0.9
curated-transformers       0.1.1
cymem                      2.0.11
en-core-web-trf            3.7.3
fastrlock                  0.8.3
filelock                   3.13.1
fsspec                     2024.6.1
idna                       3.10
Jinja2                     3.1.6
langcodes                  3.5.0
language_data              1.3.0
marisa-trie                1.2.1
markdown-it-py             3.0.0
MarkupSafe                 3.0.2
mdurl                      0.1.2
mpmath                     1.3.0
murmurhash                 1.0.12
networkx                   3.2.1
numpy                      1.26.4
nvidia-cublas-cu11         11.11.3.6
nvidia-cuda-cupti-cu11     11.8.87
nvidia-cuda-nvrtc-cu11     11.8.89
nvidia-cuda-runtime-cu11   11.8.89
nvidia-cudnn-cu11          8.7.0.84
nvidia-cufft-cu11          10.9.0.58
nvidia-curand-cu11         10.3.0.86
nvidia-cusolver-cu11       11.4.1.48
nvidia-cusparse-cu11       11.7.5.86
nvidia-nccl-cu11           2.20.5
nvidia-nvtx-cu11           11.8.86
packaging                  24.2
pillow                     11.0.0
pip                        25.0.1
preshed                    3.0.9
pydantic                   2.11.1
pydantic_core              2.33.0
Pygments                   2.19.1
regex                      2024.11.6
requests                   2.32.3
rich                       14.0.0
setuptools                 78.1.0
shellingham                1.5.4
smart-open                 7.1.0
spacy                      3.7.5
spacy-curated-transformers 0.2.2
spacy-legacy               3.0.12
spacy-loggers              1.0.5
srsly                      2.5.1
sympy                      1.13.1
thinc                      8.2.5
torch                      2.0.1+cu118
torchaudio                 2.0.2+cu118
torchvision                0.15.2+cu118
tqdm                       4.67.1
triton                     2.0.0
typer                      0.15.2
typing_extensions          4.13.0
typing-inspection          0.4.0
urllib3                    2.3.0
wasabi                     1.1.3
weasel                     0.4.1
wheel                      0.45.1
wrapt                      1.17.2

Observed Behavior:

  • Environment A (v3.3.1): approximately 2300 ms inference time
53596 function calls (52689 primitive calls) in 2.369 seconds

  Ordered by: cumulative time
  List reduced from 558 to 50 due to restriction <50>

  ncalls tottime percall cumtime percall filename:lineno(function)
    1  0.000  0.000  2.369  2.369 /data/debug-spacy-speed/debug.py:16(proc)
    2  0.000  0.000  2.369  1.185 /data/venv-spacy33/lib/python3.9/site-packages/spacy/language.py:1499(pipe)
   12/2  0.008  0.001  2.369  1.184 /data/venv-spacy33/lib/python3.9/site-packages/spacy/util.py:1603(_pipe)
   12/4  0.000  0.000  2.365  0.591 /data/venv-spacy33/lib/python3.9/site-packages/spacy/util.py:1549(minibatch)
   4/2  0.000  0.000  2.365  1.182 spacy/pipeline/pipe.pyx:41(pipe)
    2  0.000  0.000  2.299  1.150 spacy/pipeline/trainable_pipe.pyx:58(pipe)
    4  0.000  0.000  2.282  0.571 /data/venv-spacy33/lib/python3.9/site-packages/thinc/model.py:311(predict)
   23/7  0.000  0.000  2.163  0.309 /data/venv-spacy33/lib/python3.9/site-packages/thinc/model.py:288(__call__)
   6/3  0.000  0.000  2.076  0.692 /data/venv-spacy33/lib/python3.9/site-packages/thinc/layers/chain.py:48(forward)
    1  0.000  0.000  2.075  2.075 spacy/pipeline/tagger.pyx:129(predict)
    1  0.000  0.000  2.063  2.063 /data/venv-spacy33/lib/python3.9/site-packages/thinc/layers/with_array.py:28(forward)
    1  0.000  0.000  2.063  2.063 /data/venv-spacy33/lib/python3.9/site-packages/thinc/layers/with_array.py:68(_list_forward)
    1  0.000  0.000  2.061  2.061 /data/venv-spacy33/lib/python3.9/site-packages/thinc/layers/softmax.py:56(forward)
    1  0.000  0.000  2.061  2.061 /data/venv-spacy33/lib/python3.9/site-packages/thinc/backends/ops.py:220(affine)
    5  0.000  0.000  2.056  0.411 /data/venv-spacy33/lib/python3.9/site-packages/thinc/backends/cupy_ops.py:59(gemm)
    5  0.000  0.000  2.056  0.411 /data/venv-spacy33/lib/python3.9/site-packages/cupy/linalg/_product.py:45(dot)
    5  2.056  0.411  2.056  0.411 {method 'dot' of 'cupy._core.core.ndarray' objects}
    2  0.000  0.000  0.223  0.112 /data/venv-spacy33/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:196(pipe)
    1  0.000  0.000  0.205  0.205 /data/venv-spacy33/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:215(predict)
    1  0.000  0.000  0.203  0.203 /data/venv-spacy33/lib/python3.9/site-packages/spacy_transformers/layers/transformer_model.py:161(forward)
  • Environment B (v3.7.5): approximately 4800 ms inference time on the same text
56357 function calls (55121 primitive calls) in 4.836 seconds

  Ordered by: cumulative time
  List reduced from 619 to 100 due to restriction <100>

  ncalls tottime percall cumtime percall filename:lineno(function)
    1  0.000  0.000  4.836  4.836 /data/debug-spacy-speed/debug.py:16(proc)
    2  0.000  0.000  4.836  2.418 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy/language.py:1534(pipe)
   12/2  0.004  0.000  4.836  2.418 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy/util.py:1693(_pipe)
   12/4  0.000  0.000  4.832  1.208 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy/util.py:1639(minibatch)
   4/2  0.000  0.000  4.832  2.416 spacy/pipeline/pipe.pyx:43(pipe)
    2  0.000  0.000  4.792  2.396 spacy/pipeline/trainable_pipe.pyx:58(pipe)
   7/4  0.000  0.000  4.760  1.190 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/model.py:330(predict)
   22/7  0.000  0.000  4.758  0.680 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/model.py:307(__call__)
    4  0.000  0.000  4.732  1.183 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/layers/chain.py:48(forward)
    1  0.000  0.000  2.591  2.591 spacy/pipeline/tagger.pyx:124(predict)
    3  0.000  0.000  2.589  0.863 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/models/listeners.py:382(last_transformer_layer_listener_forward)
    3  0.000  0.000  2.589  0.863 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/models/pooling.py:91(with_ragged_last_layer_forward)
    14  0.000  0.000  2.585  0.185 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/cupy/cuda/compiler.py:517(_compile_module_with_cache)
    14  0.001  0.000  2.585  0.185 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/cupy/cuda/compiler.py:548(_compile_with_cache_cuda)
    3  0.000  0.000  2.583  0.861 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/layers/reduce_mean.py:17(forward)
    3  0.000  0.000  2.583  0.861 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/backends/cupy_ops.py:286(reduce_mean)
    3  0.001  0.000  2.583  0.861 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/backends/_custom_kernels.py:435(reduce_mean)
    3  0.000  0.000  2.567  0.856 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/backends/_custom_kernels.py:96(__call__)
    3  0.000  0.000  2.567  0.856 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/backends/_custom_kernels.py:100(_compile_kernel)
    1  0.000  0.000  2.566  2.566 {method 'get_function' of 'cupy._core.raw.RawModule' objects}
    1  0.000  0.000  2.562  2.562 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/cupy/cuda/compiler.py:335(compile_using_nvrtc)
    1  0.000  0.000  2.560  2.560 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/cupy/cuda/compiler.py:338(_compile)
    1  0.000  0.000  2.560  2.560 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/cupy/cuda/compiler.py:732(compile)
    1  2.560  2.560  2.560  2.560 {built-in method cupy_backends.cuda.libs.nvrtc.compileProgram}
    2  0.000  0.000  2.200  1.100 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/pipeline/transformer.py:195(pipe)
    1  0.000  0.000  2.165  2.165 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/pipeline/transformer.py:214(predict)
    1  0.000  0.000  2.164  2.164 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/models/architectures.py:648(transformer_model_forward)
    1  0.000  0.000  2.164  2.164 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/models/with_non_ws_tokens.py:67(with_non_ws_tokens_forward)
    1  0.000  0.000  2.031  2.031 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/spacy_curated_transformers/models/with_strided_spans.py:91(with_strided_spans_forward)
    1  0.000  0.000  2.018  2.018 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/layers/pytorchwrapper.py:217(forward)
    1  0.000  0.000  2.014  2.014 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/shims/pytorch.py:93(__call__)
    1  0.000  0.000  2.014  2.014 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/thinc/shims/pytorch.py:107(predict)
  176/1  0.000  0.000  2.006  2.006 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/torch/nn/modules/module.py:1528(_wrapped_call_impl)
  176/1  0.001  0.000  2.006  2.006 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/torch/nn/modules/module.py:1534(_call_impl)
    1  0.000  0.000  2.005  2.005 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/curated_transformers/models/curated_transformer.py:27(forward)
    1  0.000  0.000  2.005  2.005 /data/venv-spacy37-cuda118/lib/python3.9/site-packages/curated_transformers/models/roberta/encoder.py:33(forward)
  • While absolute inference times differ depending on hardware (CPU or GPU), spaCy v3.7.5 consistently shows around twice the inference time of spaCy v3.3.1 across the different hardware setups tested.
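As a side note on measurement: the Environment B profile above attributes roughly 2.56 s to cupy's NVRTC kernel compilation (_compile_module_with_cache), which is normally a one-time cost per process. A warm-up run before timing separates that compile cost from steady-state inference. The following is a minimal sketch; the helper name `timed` and its defaults are my own, not part of spaCy:

```python
import time

def timed(fn, *args, warmup=1, runs=3):
    """Call fn untimed `warmup` times (absorbing one-time costs such as
    GPU kernel compilation), then return the mean wall time in ms over
    `runs` timed calls."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0

# Usage against the repro script (assumes nlp_gpu and text are defined there):
# print(f"steady-state: {timed(lambda: list(nlp_gpu.pipe([text]))):.2f} ms")
```

If the v3.7.5 gap shrinks after a warm-up call, part of the difference is one-time compilation rather than per-document work.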
honnibal (Member) commented
Thanks for reporting this. I've had to shut down the CI we had that checked performance regressions like this, so these things can slip through currently.

It's not intended or expected, I hope we can find a solution.

mingzhu-wu (Author) commented

> Thanks for reporting this. I've had to shut down the CI we had that checked performance regressions like this, so these things can slip through currently.
>
> It's not intended or expected, I hope we can find a solution.

Thank you for your reply and for confirming that this behavior isn't intended. I also hope it can be resolved soon. Please feel free to let me know if there's any additional information I can provide.

I am looking forward to any updates or suggestions you might have!

Thanks again for your help.

znadrich-qf commented Apr 16, 2025

@honnibal @mingzhu-wu I'm far from an expert here, but I've been investigating how to speed up transformer inference for en_core_web_trf and found that a curiously long time is spent in with_non_ws_tokens_forward, specifically, it seems, in _add_whitespace_tokens.

Again, I'm not an expert in GPU programming, but it looks like a lot of time is spent allocating new arrays with

https://github.com/explosion/spacy-curated-transformers/blob/1f79710af24fa1de4045f2f0fe0e692be02ade9a/spacy_curated_transformers/models/with_non_ws_tokens.py#L184

new_layer = model.ops.alloc2f(doc_alignment.ws_n_pieces, hidden_width)

and then for each non-ws token the token's vector is copied with

https://github.com/explosion/spacy-curated-transformers/blob/1f79710af24fa1de4045f2f0fe0e692be02ade9a/spacy_curated_transformers/models/with_non_ws_tokens.py#L189-L197

                    new_layer[
                        alignment.ws_piece_offset : alignment.ws_piece_offset
                        + alignment.n_pieces,
                        :,
                    ] = layer.dataXd[
                        alignment.no_ws_piece_offset : alignment.no_ws_piece_offset
                        + alignment.n_pieces,
                        :,
                    ]

In the text samples I have, the majority of tokens are not whitespace (> 95%), so I think if this copy were done by copying the entire array into new_layer at once and then specifically zeroing out the whitespace tokens, it would result in fewer operations and speed up inference.

I've tested this and it seems to dramatically increase transformer inference speed, but I haven't yet looked for other bottlenecks.
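The proposal above can be sketched with NumPy (a hypothetical standalone illustration, not the actual spacy-curated-transformers code; on GPU the same pattern would apply to cupy arrays):

```python
import numpy as np

def add_whitespace_rows(no_ws_layer, ws_mask):
    """Hypothetical sketch of the proposed optimization. `no_ws_layer` has
    one row per non-whitespace piece; `ws_mask` is a boolean array over the
    full (whitespace-inclusive) output marking which rows are whitespace
    pieces. Rather than one small slice-copy per token, a single
    boolean-mask scatter fills all non-whitespace rows at once, and the
    whitespace rows stay zero."""
    out = np.zeros((ws_mask.shape[0], no_ws_layer.shape[1]),
                   dtype=no_ws_layer.dtype)
    out[~ws_mask] = no_ws_layer  # one bulk copy instead of a Python loop
    return out
```

On GPU, replacing many tiny slice assignments with one masked assignment avoids launching a separate copy per token, which is where the per-call overhead appears to come from.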

znadrich-qf commented Apr 16, 2025

@honnibal I believe I've found another unintended bottleneck, but again I don't know Cython that well.

It looks like there is a bottleneck in byte_bpe_encoder_forward, specifically in ByteBPEProcessor.

I think that the intention of _split_pattern is that the regex is compiled once on object creation

https://github.com/explosion/curated-tokenizers/blob/main/curated_tokenizers/_bbpe.pyx#L68

however, looking at some profiles, the calls to regex.findall will re-compile the pattern each time encode_as_pieces is called

https://github.com/explosion/curated-tokenizers/blob/main/curated_tokenizers/_bbpe.pyx#L146

I believe that calling self._split_pattern.findall instead would avoid this.
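As an illustrative sketch of the suggested fix (using the stdlib re module and a made-up pattern, not the actual split pattern from _bbpe.pyx, which uses the third-party regex module): a pattern compiled once at load time exposes a bound findall, whereas the module-level findall(pattern, text) form must resolve, and potentially recompile, the pattern on every call.

```python
import re

# Hypothetical pattern, not the actual GPT-2-style split pattern.
_SPLIT_PATTERN = re.compile(r"\w+|[^\w\s]")

def encode_as_pieces_sketch(text):
    """Sketch of the proposed fix: call the bound findall of the pattern
    compiled once at module load, instead of re.findall(pattern, text),
    which resolves the pattern per call."""
    return _SPLIT_PATTERN.findall(text)
```

In a hot loop like per-document tokenization, moving that per-call lookup out of the loop is exactly the kind of fixed overhead that shows up in profiles.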
