-
The codebase is quite big and the question is very general, so maybe we can try scoping the problem. First, ensure your code doesn't have bugs: for example, check that shapes, masks, and the target shift are what you expect, and verify that the model can overfit a single small batch (a sketch of that check is below).
Then investigate why training fails. This is difficult and usually takes time, but some things to try: a lower learning rate with warmup, inspecting gradients and activations layer by layer, and comparing against a reference implementation.
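For the first point, here is a minimal sketch of the overfit-one-batch check. It assumes a typical PyTorch setup where `model(src, tgt_in)` returns per-token logits; `model`, `src`, `tgt`, and `pad_id` are placeholders, not names from your code:

```python
import torch
import torch.nn.functional as F

def overfit_one_batch(model, src, tgt, pad_id, steps=500, lr=3e-4):
    """Train on one fixed batch; the loss should drop to near zero.

    If it doesn't, the problem is almost certainly a bug (masking,
    target shifting, loss computation) rather than optimisation.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        opt.zero_grad()
        # Teacher forcing: feed tgt[:, :-1], predict tgt[:, 1:].
        logits = model(src, tgt[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tgt[:, 1:].reshape(-1),
            ignore_index=pad_id,
        )
        loss.backward()
        opt.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```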
Good luck!
-
Hi! I've asked for help here before, when I had a school deadline coming up for a project that involved implementing a Transformer for MT. Luckily, that deadline has passed and I managed to work around the fact that my implementation wasn't working, but now I would like to understand why my model doesn't work.
Essentially, when I train the model (using WMT14, DE->EN), it converges to producing the same "translation" no matter the input to the encoder. It seems to act like a language model, producing the same semi-coherent sentence for every source sentence (greedy decoding and beam search each settle on a different sentence, but each one produces only that single translation for all inputs).
Decoding by feeding the decoder stack the correct translation as the previously generated tokens (i.e., teacher forcing) yields gibberish, but the gibberish is unique to each target translation and occasionally reflects small pieces of it.
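Concretely, the teacher-forced check looks something like this sketch (the `model(src, tgt_in)` interface returning per-token logits and the `tokenizer` object are stand-ins for my actual code):

```python
import torch

@torch.no_grad()
def teacher_forced_decode(model, tokenizer, src, tgt):
    # Feed the gold target as decoder input and take the argmax at each
    # position. A healthy model should roughly reproduce the target;
    # mine produces gibberish loosely tied to it.
    logits = model(src, tgt[:, :-1])
    pred = logits.argmax(dim=-1)  # (batch, tgt_len - 1)
    return tokenizer.decode(pred[0].tolist())
```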
With some poking and prodding, I've found that the model trains the encoder stack to produce nearly identical output for all inputs. Each encoder layer moves the input closer and closer to this single output until, by the final layer, the difference is negligible. This would explain why the model acts like a language model: it has simply learned to produce the most fluent English it can, given that it receives no usable information from the encoder.
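This is roughly the check I used to see the collapse (a sketch; it assumes the encoder is exposed as an iterable of layers, and `embed`, `encoder_layers`, `src_a`, and `src_b` are placeholders for my actual code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encoder_similarity_per_layer(embed, encoder_layers, src_a, src_b):
    # Push two unrelated source sentences through the encoder stack and
    # report how similar the representations are after each layer.
    xa, xb = embed(src_a), embed(src_b)
    for i, layer in enumerate(encoder_layers):
        xa, xb = layer(xa), layer(xb)
        # Mean-pool over positions, then compare with cosine similarity.
        sim = F.cosine_similarity(xa.mean(dim=1), xb.mean(dim=1))
        print(f"layer {i}: cosine similarity {sim.mean().item():.4f}")
    # Similarity climbing toward 1.0 with depth means the encoder is
    # collapsing every input onto (nearly) the same representation.
```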
Could anyone look at my model code or my loss function to see if you can spot what is going wrong?
I'd really love to understand what I'm misunderstanding, haha! Thanks so much!