
Incredibly slow transformer code #1229

Answered by jheek
Numeri asked this question in Q&A

Looking at the code, my first guess is that the problem is the placement of jax.jit: note how you end up calling jit on every iteration.

The second problem is that the jit does not capture the full train step, so you are missing some optimization opportunities.
Please try to rewrite your train code like this:

@jax.jit
def train_step(optimizer, batch):
  print("compiling train step...")  # This should print only once in the entire train script. Otherwise you have a re-compile bug
  def loss_fn(params):
    ...
  loss, grad = jax.value_and_grad(loss_fn)(optimizer.target)
  optimizer = optimizer.apply_gradient(grad)
  return optimizer, loss


def train():
  ...

  for step in range(num_steps):
    ...
    optimizer, loss = train_step(optimizer, batch)
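
For completeness, here is a minimal, self-contained version of the same pattern that you can run as-is. The tiny linear model, hand-rolled SGD update, and synthetic batch are stand-ins chosen for illustration (the actual transformer and optimizer are not shown in this thread); the point is that "compiling train step..." prints only during the first trace.

import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
  print("compiling train step...")  # executes only while tracing, i.e. once
  def loss_fn(p):
    preds = batch["x"] @ p["w"] + p["b"]
    return jnp.mean((preds - batch["y"]) ** 2)
  loss, grads = jax.value_and_grad(loss_fn)(params)
  # plain SGD step; a stand-in for optimizer.apply_gradient
  params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
  return params, loss

params = {"w": jnp.zeros((3, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((8, 3)), "y": jnp.ones((8, 1))}

for step in range(5):
  params, loss = train_step(params, batch)  # prints on the first call only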

This discussion was converted from issue #1218 on April 09, 2021 07:43.