Transformer model acting like a language model #1309

Numeri · 2021-05-07T21:41:27Z

Numeri
May 7, 2021

Hi! I've asked for help here before, as I had a school deadline coming up for a project I was doing that involved implementing a Transformer for MT. Luckily, that deadline is past, and I managed to work around the fact that my implementation isn't working, but now I would like to understand why my model doesn't work.

Essentially, when I train the model (using WMT14, DE->EN), the model converges to output the same "translation" no matter the input to the encoders. In essence, it seems to act like a language model, producing the same semi-coherent sentence for all source sentences (although greedy decoding and beam search produce different sentences, each produces only one translation for all inputs). For example, greedy decoding produces the following sentence:

The Commission has also been able to support the Commission's proposal for a directive on the basis of the European Union's proposal for a directive. The Commission has not been able to support the Commission's proposal for a directive on the European Union

Decoding sentences by providing the decoder stack with the correct translation as previously translated tokens (i.e., teacher-forcing) yields gibberish, but the gibberish is unique to the target translation and occasionally reflects small pieces of the target translation. For example:

Source: Sie sagen es und deuten gleichzeitig darauf.
Target: You will talk and show at the same time.
Output: The will the, to the the time,, and

With poking and prodding, I've found that the model trains the encoder stack to produce nearly identical output for all inputs. Each encoder in the stack moves the input closer and closer to this one output until at the end, the difference is negligible. This would explain why it acts like a language model, as it's just learned to try to produce as fluent English as possible, given that it receives no information from the encoders.

Could anyone look at my model code or my loss function, to see if you know what is going wrong?

I'd really love to understand what I'm misunderstanding, haha! Thanks so much!

marcvanzee · 2021-05-10T09:26:07Z

marcvanzee
May 10, 2021
Maintainer

The codebase is a quite big and the question is very general, so maybe we can try scoping the problem.

First ensure your code doesn't have bugs:

How do you know that your model code is correct? I recommend writing unit tests for all modules to verify they behave correctly (i.e. produce the expected outputs). You can find some examples in linen_linear_test.py.
Did you write the autoregressive decoding code yourself, and if so, did you test it? Autoregressive decoding can be quite tricky for transformers and it is easy to make mistakes. You could again write some tests.

Then investigate why training fails. This is difficult and usually takes time, but some things to try:

Use Tensorboard to report loss and accuracy during training. These curves may give you more insight in what is going wrong. For instance, your learning rate could be too high, or you loss function could be wrong (it does look correct to me on first sight)
Build a "fake dataset" that is very simple (e.g., input X number of "a" and output X number of "b"), but allows you to investigate whether your training setup is correct.
Compare your model code with our WMT example and make sure all hyperparameters are the same.

Good luck!

2 replies

Numeri Jun 20, 2021
Author

@marcvanzee So, I've been working on this in my free time, and clearly haven't had much time to work on it, but between unit testing and careful debugging, I've found out that the feed forward layers in the encoder stack are transforming any inputs into one identical output.

If you look at the variance of the sentence-level vectors after each layer in an encoder, all layers maintain approximately the same level of variance, until it hits the feed forward layers, where it drops. After about two encoders, any input is completely identical. This explains why it would act like a language model – but now I just don't know what to do about that. Do you have any idea what might cause this behavior?

Numeri Jun 20, 2021
Author

Also, I know implementing my own transformer is sort of a waste of time, but it's a learning experience, and a personal challenge. I do feel like I've gotten to understand it better than I did before, but clearly not well enough yet!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Transformer model acting like a language model #1309

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Transformer model acting like a language model #1309

Uh oh!

Uh oh!

Numeri May 7, 2021

Replies: 1 comment · 2 replies

Uh oh!

marcvanzee May 10, 2021 Maintainer

Uh oh!

Numeri Jun 20, 2021 Author

Uh oh!

Numeri Jun 20, 2021 Author

Numeri
May 7, 2021

Replies: 1 comment 2 replies

marcvanzee
May 10, 2021
Maintainer

Numeri Jun 20, 2021
Author

Numeri Jun 20, 2021
Author