I guess that if we always use teacher forcing, the model will learn to predict the next word well, but will fail to learn to carry information in the decoder's hidden state over longer spans. E.g., if word #5 depends on word #1, the model should learn to keep the necessary info around until word #5 to be able to produce it correctly.
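One common middle ground is to teacher-force only part of the time, so the decoder sometimes has to live with its own predictions. Here is a minimal pure-Python sketch of that decoding loop; `step_fn`, `teacher_forcing_ratio`, and the token representation are all illustrative assumptions, not the lesson's actual code.

```python
import random

def decode(step_fn, start_token, target, teacher_forcing_ratio=0.5, seed=0):
    """Decode a sequence, sometimes feeding back the ground-truth token
    (teacher forcing) and sometimes the model's own prediction.

    step_fn(prev_token) -> predicted next token (stands in for one decoder step).
    """
    rng = random.Random(seed)
    outputs, prev = [], start_token
    for t in range(len(target)):
        pred = step_fn(prev)  # one decoder step, conditioned on the previous token
        outputs.append(pred)
        # with probability `teacher_forcing_ratio`, feed the ground truth back in;
        # otherwise the model must cope with its own (possibly wrong) prediction
        prev = target[t] if rng.random() < teacher_forcing_ratio else pred
    return outputs
```

With ratio 1.0 every step is conditioned on the ground truth; with 0.0 the model only ever sees its own outputs, which is what it will face at inference time.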
According to the code, Jeremy is using a weighted average of the encoder outputs, as opposed to a weighted average of the hidden states, as the final hidden state used by the decoder?
Comment / Digression: The German "Er liebte zu essen" does indeed mean "He loved to eat"; however, in spoken German the perfect tense is more often used. Hence, if this were a spoken-language translator, the person would more likely say "Er hat zu essen geliebt" (if memory serves). We would never pick that up from a text corpus.
A similar, but probably more powerful, idea is to use beam search: get the top N predictions for word 1, then try to predict word 2 using each of the N word-1 guesses and take the top N word-2 predictions for each. By now you have N^2 (word 1, word 2) pairs; sort them by probability, reduce to the N best options, predict word 3 for each of those N options, and so on.
This approach doesn't have the drawback of committing to a single, possibly suboptimal, prediction. At the same time, it helps avoid the case where one word in the middle of the sequence is predicted incorrectly and screws up the rest of the sequence.
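The procedure above can be sketched in a few lines of pure Python; `next_probs`, the toy vocabulary, and the fixed `max_len` are assumptions for illustration, and a real decoder would stop beams at an end-of-sequence token.

```python
import math

def beam_search(next_probs, start, beam_width=3, max_len=4):
    """next_probs(prefix) -> dict mapping a candidate next token to its probability.
    Keeps the `beam_width` most probable sequences at every step, instead of
    greedily committing to the single best next word.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # sort all expansions by total log-probability and keep the best beam_width
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

A toy case shows why this beats greedy decoding: if "a" is the likelier first word (0.6 vs 0.4) but everything after "b" is near-certain, the best full sequence starts with "b", and the beam finds it while greedy decoding cannot.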
Why use tanh instead of ReLU in the attention mini-net?
Where is the loss function for the attentional NN?
Could I think of attention as training a classifier for which words in ‘German’ map to which words in ‘English’?
It is learned along with the encoder-decoder. Hence the loss is the decoder loss, I presume.
Loss is where the label is, so probably the same loss that we had before. No special loss.
I guess the attentional net is integrated into the seq2seq model, so it uses the same loss.
Are the attention weights learned per word in the vocabulary or per word position?
The loss fn is for the overall net (end-to-end) - the same reason we don’t have separate loss fns for each layer.
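To make the "same loss, trained end-to-end" point concrete, here is a NumPy sketch of the kind of additive attention mini-net being discussed (tanh scoring, softmax over timesteps). The weight names and shapes are my assumptions, not the lesson's code; the point is that `W_enc`, `W_dec`, and `v` are just more parameters updated by the ordinary decoder loss.

```python
import numpy as np

def additive_attention(enc_outputs, dec_hidden, W_enc, W_dec, v):
    """Additive (Bahdanau-style) attention -- a sketch with assumed shapes.

    enc_outputs: (T, H) encoder outputs; dec_hidden: (H,) decoder state.
    The mini-net (W_enc, W_dec, v) has no loss of its own: it is trained
    along with the rest of the seq2seq model by the overall decoder loss.
    """
    # score each encoder timestep against the current decoder state
    scores = np.tanh(enc_outputs @ W_enc + dec_hidden @ W_dec) @ v   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the T timesteps
    context = weights @ enc_outputs          # (H,) weighted average of outputs
    return context, weights
```

This also answers the softmax-width question in spirit: the softmax here is taken over the T encoder timesteps, so its "width" is the source-sequence length, not a fixed number of classes.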
For the softmax, does the width (number of classes) it outputs determine how many timesteps to weight over? If so, how wide is it?
Why isn't accuracy measured here?
How would you measure accuracy?
I highly recommend this brilliant talk by Stephen Merity on the significance of attention and what it generally does to a network. He also talks about regularization and AWD-LSTM.
 - Stephen Merity, Attention and Memory in Deep Learning Networks, https://www.youtube.com/watch?v=uuPZFWJ-4bE&t=1261s
I guess by comparing the number of correct translations to the total.
Check out the BLEU score; it shows how you can measure translation "quality" without resorting to exact comparison (which is almost always too much to ask).
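For intuition, here is a simplified sentence-level BLEU in pure Python: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. Real BLEU is corpus-level, uses n-grams up to 4, and applies smoothing, so treat this as a sketch of the idea, not a reference implementation.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. candidate/reference are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # clip each n-gram's count by how often it appears in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect translation scores 1.0, a completely wrong one 0.0, and partial n-gram overlap lands in between, which is exactly why it works where exact-match accuracy does not.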
So… there is no line of code which points out the weights; the magic is in Parameter, which tells PyTorch "hey, this is part of the thing you need to optimize with the overall loss function", I think.
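That registration trick is roughly this: assigning an attribute of a special wrapper type gets recorded automatically. Below is a toy pure-Python mimic of the idea (not PyTorch's actual implementation; the real `nn.Module` also requires calling `super().__init__()` and handles far more).

```python
class Parameter:
    """Marker wrapper around a tensor-like value (mimics torch.nn.Parameter)."""
    def __init__(self, data):
        self.data = data

class Module:
    """Toy version of torch.nn.Module's registration trick: assigning a
    Parameter (or a sub-Module) as an attribute records it, so .parameters()
    can hand everything to the optimizer with no explicit bookkeeping."""
    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self.__dict__.setdefault('_params', {})[name] = value
        elif isinstance(value, Module):
            self.__dict__.setdefault('_modules', {})[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self.__dict__.get('_params', {}).values()
        for m in self.__dict__.get('_modules', {}).values():
            yield from m.parameters()

class Attention(Module):
    def __init__(self):
        self.W = Parameter([0.0, 0.0])   # picked up automatically

class Seq2Seq(Module):
    def __init__(self):
        self.attn = Attention()          # sub-module's params are found too
        self.bias = Parameter([1.0])
```

So the attention weights reach the optimizer simply because they were assigned as `Parameter` attributes somewhere inside the model.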
Why do we call this one end-to-end but not the previous one, which differs from it by just a couple of layers?