Part 2 lesson 11 wiki

surmenok · April 10, 2018, 3:22am

I guess if we always use teacher forcing, the model will learn to predict the next word well but will fail to learn to keep information in the decoder's hidden state over longer spans. E.g. if word #5 depends on word #1, the model should learn to carry the necessary info up to word #5 to be able to produce it correctly.

ravijain · April 10, 2018, 3:23am

According to the code, Jeremy is using a weighted average of the encoder outputs, rather than a weighted average of the hidden states, as the final hidden state used by the decoder?

fmichaelkunz · April 10, 2018, 3:25am

Comment / Digression: The German "Er liebte zu essen" does indeed mean "He loved to eat"; however, in spoken German the perfect tense is used more often than the simple past. Hence, if this were a spoken-language translator (memory recalls), the person would need to say "Er hat zu essen geliebt." We would never pick that up from a text corpus.

surmenok · April 10, 2018, 3:27am

A similar, but probably more powerful, idea is to use beam search: get the top N predictions for word1, then predict word2 from each of the N word1 guesses and take the top N word2 predictions for each. You now have N^2 (word1, word2) pairs. Sort them by probability, keep the N best, predict word3 for each of those N options, …
This approach doesn't have the drawback of using a prediction that isn't the best one. At the same time, it helps avoid the case where a word in the middle of the sequence is predicted incorrectly and throws off the rest of the sequence.
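
A minimal sketch of that procedure in plain Python, in case it helps. `next_log_probs` is a hypothetical stand-in for the decoder (it takes the prefix so far and returns log-probabilities for the next token), and `TOY` is a made-up distribution, not anything from the lesson notebook. For simplicity it scores every next token for each prefix instead of only the top N per prefix; the N survivors are the same either way.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the `beam_width` most probable prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished, carry over as-is
                continue
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Sort all extensions by total probability and keep the N best.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams

# Made-up next-token probabilities (assumption, for illustration only).
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_next_log_probs(prefix):
    dist = TOY.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

greedy = beam_search(toy_next_log_probs, beam_width=1, max_len=5)
beam = beam_search(toy_next_log_probs, beam_width=2, max_len=5)
```

With `beam_width=1` (greedy) the search commits to "a" and ends with a sequence of probability 0.3; with `beam_width=2` it keeps "b" alive and finds "b z" with probability 0.36.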

emilmelnikov · April 10, 2018, 3:27am

Why use tanh instead of ReLU in the attention mini-net?

Ducky · April 10, 2018, 3:28am

Where is the loss function for the attentional NN?

snagpaul · April 10, 2018, 3:29am

Could I think of attention as training a classifier for which words in ‘German’ map to which words in ‘English’?

ananda_seelan · April 10, 2018, 3:29am

It is learned along with the encoder-decoder. Hence the loss is the decoder loss I presume.

snagpaul · April 10, 2018, 3:29am

Loss is where the label is, so probably the same loss that we had before. No special loss.

emilmelnikov · April 10, 2018, 3:30am

I guess the attentional net is integrated into the seq2seq model, so it uses the same loss.
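
To make "integrated into the model" a bit more concrete, here is a rough sketch of one attention step in plain Python. It assumes the common additive form (a tanh layer, a softmax over encoder timesteps, then a weighted average of the encoder outputs); the names `W_enc`, `W_dec` and `v` and all the numbers are made up for illustration and are not taken from the lesson notebook. There are no labels anywhere in it: it is just a function of the encoder outputs and the decoder state, so its weights can only be trained through whatever loss sits on the decoder output.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def attention_step(enc_outputs, dec_hidden, W_enc, W_dec, v):
    """Return (weights over encoder timesteps, weighted average of encoder outputs)."""
    proj_h = matvec(W_dec, dec_hidden)
    scores = []
    for enc_t in enc_outputs:
        proj_e = matvec(W_enc, enc_t)
        # One score per encoder timestep: v . tanh(W_enc e_t + W_dec h)
        scores.append(sum(vi * math.tanh(e + h)
                          for vi, e, h in zip(v, proj_e, proj_h)))
    # Softmax over timesteps (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the encoder outputs.
    dim = len(enc_outputs[0])
    context = [sum(w * enc_t[i] for w, enc_t in zip(weights, enc_outputs))
               for i in range(dim)]
    return weights, context

# Made-up inputs: 3 encoder timesteps, hidden size 2.
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_hidden = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_step(enc_outputs, dec_hidden, identity, identity, [1.0, 1.0])
```

Under these assumptions the weights are recomputed at every decoder step and there is one weight per encoder timestep.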

jonathanmist · April 10, 2018, 3:30am

Are the attention weights learned per word in the vocabulary or per word position?

narvind2003 · April 10, 2018, 3:30am

The loss fn is for the overall net (end-to-end) - the same reason we don’t have separate loss fns for each layer.

kmatsuda · April 10, 2018, 3:31am

For the softmax, does the width (number of classes) it outputs determine how many timesteps to weight over? If so, how wide is it?

chunduri · April 10, 2018, 3:31am

Why is accuracy not measured here?

narvind2003 · April 10, 2018, 3:31am

How would you measure accuracy?

ananda_seelan · April 10, 2018, 3:32am

I highly recommend this brilliant talk[1] by Stephen Merity on the significance of attention and what it generally does to a network. Also he talks about regularization and AWD LSTM.

[1] - Stephen Merity, Attention and Memory in Deep Learning Networks, https://www.youtube.com/watch?v=uuPZFWJ-4bE&t=1261s

chunduri · April 10, 2018, 3:33am

I guess by comparing the number of correct translations to the total.

emilmelnikov · April 10, 2018, 3:34am

Check out the BLEU score: it is a way to measure translation “quality” without resorting to exact comparison (which is almost always too much to ask).
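
A rough sketch of the core ingredient, the clipped ("modified") n-gram precision, in plain Python. This is deliberately not full BLEU, which combines the precisions for several n-gram sizes with a geometric mean and adds a brevity penalty; the function names and the example sentences are my own.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram's count is clipped to its count in the reference, so
    repeating a correct word many times does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total

reference = "he loved to eat".split()
candidate = "he liked to eat".split()

unigram = modified_precision(candidate, reference, 1)  # 3 of 4 words match
bigram = modified_precision(candidate, reference, 2)   # 1 of 3 bigrams match
```

An exact-match accuracy would give this candidate zero credit, while the n-gram precisions still reward the parts it got right.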

Ducky · April 10, 2018, 3:36am

So… no line of code explicitly points out the weights; it's the magic in Parameter that tells PyTorch “hey, this is part of the thing you need to optimize with the overall loss function”, I think.

Deb · April 10, 2018, 3:36am

Why do we call this end-to-end but not the previous one, when the two differ by just a couple of layers?