# Part 2 lesson 11 wiki

(Pavel Surmenok) #94

I guess if we always use teacher forcing, the model will learn how to predict the next word well, but will fail to learn to carry the decoder's hidden state over longer spans. E.g. if word #5 depends on word #1, the model should learn to keep the necessary information in its state all the way to word #5 to be able to produce it correctly.
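A toy sketch of the trade-off (names and the `step` interface are my own, not from the lesson code): with probability `teacher_forcing_ratio` the decoder is fed the ground-truth token, otherwise its own prediction, which forces it to rely on its carried state.

```python
import random

def decode_with_teacher_forcing(step, target, sos=0, teacher_forcing_ratio=0.5, seed=42):
    """Toy decoder loop. `step` maps (prev_token, state) -> (pred_token, state).

    With probability `teacher_forcing_ratio` we feed the ground-truth token
    as the next input; otherwise we feed the model's own prediction, so the
    model has to carry information across timesteps on its own.
    """
    random.seed(seed)
    state, prev, preds = None, sos, []
    for gold in target:
        pred, state = step(prev, state)
        preds.append(pred)
        use_teacher = random.random() < teacher_forcing_ratio
        prev = gold if use_teacher else pred
    return preds
```

Setting `teacher_forcing_ratio=1.0` gives pure teacher forcing (fast to train, but the model never sees its own mistakes); `0.0` gives free-running decoding.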

(Ravi Jain) #95

According to the code, Jeremy is using a weighted average of the encoder outputs, as opposed to a weighted average of the hidden states, as the final hidden state used by the decoder?

(Mike Kunz ) #96

Comment / Digression: The German “Er liebte zu essen” does indeed mean “He loved to eat”; however, in spoken German, the perfect tense is more often used. Hence, if this were a spoken-language translator (if memory serves), the person would be more likely to say “Er hat zu essen geliebt.” We would never pick that up from a text corpus.

(Pavel Surmenok) #97

A similar, but probably more powerful, idea is to use beam search: get the top N predictions for word 1, then try to predict word 2 using each of the N word-1 guesses, and take the top N predictions for each word 2. By now you have N^2 (word 1, word 2) pairs; sort them by probability, keep the N best, predict word 3 for each of those N options, and so on.
This approach doesn’t have the drawback of committing to a single prediction that may not be the best. At the same time, it helps avoid the case where some word in the middle of the sequence is predicted incorrectly and screws up the rest of the sequence.
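The procedure above can be sketched as follows (a minimal illustration with an invented `next_probs` interface, not the lesson's implementation; real systems score with summed log-probabilities, as here):

```python
from math import log

def beam_search(next_probs, beam_width, max_len, sos="<s>"):
    """Keep the `beam_width` most probable partial sequences at each step.

    `next_probs(seq)` returns a dict mapping each candidate next token to
    its probability. Scores are summed log-probabilities.
    """
    beams = [([sos], 0.0)]  # (sequence, log-prob score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + log(p)))
        # prune back down to the N best expansions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

For example, if the first token "a" is individually more likely than "b" but "b" leads to a much better continuation, greedy decoding commits to "a" while a beam of width 2 keeps "b" alive and recovers the higher-probability sequence.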

(Emil) #98

Why use tanh instead of ReLU in the attention mini-net?

(Kaitlin Duck Sherwood) #99

Where is the loss function for the attentional NN?

(Sneha Nagpaul) #100

Could I think of attention as training a classifier for which words in ‘German’ map to which words in ‘English’?

(Ananda Seelan) #101

It is learned along with the encoder-decoder. Hence the loss is the decoder loss I presume.

(Sneha Nagpaul) #102

Loss is where the label is, so probably the same loss that we had before. No special loss.

(Emil) #103

I guess the attentional net is integrated into the seq2seq model, so it uses the same loss.

(Jonathan Mist) #104

Are the attention weights learned per word in the vocabulary or per word position?

(Arvind Nagaraj) #105

The loss fn is for the overall net (end-to-end) - the same reason we don’t have separate loss fns for each layer.

(Ken) #106

For the softmax, does the width (number of classes) it outputs determine how many timesteps to weight over? If so, how wide is it?
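As I understand it, the softmax is taken across the encoder timesteps, so its "width" is the source sequence length rather than a fixed class count. A hand-rolled sketch (illustrative names, not the lesson code):

```python
from math import exp

def softmax(scores):
    # subtract the max for numerical stability
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(enc_outputs, scores):
    """Weight the encoder outputs (one vector per timestep) by the
    softmaxed scores; there is one weight per encoder timestep."""
    weights = softmax(scores)
    dim = len(enc_outputs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, enc_outputs))
            for d in range(dim)]
```

With equal scores the context vector is just the mean of the encoder outputs; learned scores shift the weighting toward the relevant source positions.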

(chunduri) #107

Why is accuracy not measured here?

(Arvind Nagaraj) #108

How would you measure accuracy?

(Ananda Seelan) #109

I highly recommend this brilliant talk[1] by Stephen Merity on the significance of attention and what it generally does to a network. He also talks about regularization and the AWD-LSTM.

[1] - Stephen Merity, Attention and Memory in Deep Learning Networks, https://www.youtube.com/watch?v=uuPZFWJ-4bE&t=1261s

(chunduri) #110

I guess by comparing the number of correct translations against the total.

(Emil) #111

Check out the BLEU score; it addresses how you can measure translation “quality” without resorting to exact comparison (which is almost always too much to ask).
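The core idea is clipped n-gram precision: a candidate gets credit for an n-gram only up to the number of times it appears in the reference. A toy single-reference version (a simplification of real BLEU, which averages over a corpus and typically uses up to 4-grams with smoothing):

```python
from collections import Counter
from math import exp, log

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are capped by
    how often each n-gram appears in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def bleu(candidate, reference, max_n=2):
    """Toy sentence BLEU: geometric mean of n-gram precisions, times a
    brevity penalty for candidates shorter than the reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else \
        exp(1 - len(reference) / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

The clipping is what stops a degenerate output like "the the the" from scoring well just because "the" occurs in the reference.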

(Kaitlin Duck Sherwood) #112

So… there is no explicit line of code that points out the weights; rather, it’s the magic in `Parameter` that tells PyTorch “hey, this is part of the thing you need to optimize with the overall loss function”, I think.
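That matches my understanding: wrapping a tensor in `nn.Parameter` inside a module registers it, so the optimizer picks it up automatically. A minimal sketch (the module and names are invented for illustration, not the lesson's attention code):

```python
import torch
from torch import nn

class MiniAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # nn.Parameter registers this tensor with the module, so
        # model.parameters() (and hence the optimizer) will include it
        self.weight = nn.Parameter(torch.randn(hidden_size))
        # a plain tensor attribute is NOT registered and won't be optimized
        self.not_learned = torch.randn(hidden_size)

    def forward(self, enc_outputs):
        scores = enc_outputs @ self.weight      # one score per timestep
        return torch.softmax(scores, dim=0)    # attention weights sum to 1

m = MiniAttention(4)
print([name for name, _ in m.named_parameters()])  # only 'weight' is registered
```

No separate loss is ever attached to it; gradients from the overall seq2seq loss just flow back into `weight` through the forward pass.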

(Debashish Panigrahi) #113

Why do we call this one end-to-end but not the previous one, which differs by just a couple of layers?