Part 2 lesson 11 wiki

(Rudraksh Tuwani) #59

I think it may be for two reasons:

  1. Pretrained French embeddings are not available for AWD-LSTM, and the fastText ones have different dimensions.
  2. Simplicity. Jeremy’s goal was to illustrate the use of seq2seq learning using a simple model so that we are able to grasp the basic idea behind it.

(Arvind Nagaraj) #60

Most sequences will hit the EOS token and break out of the loop.
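To make that concrete, here is a minimal greedy-decoding sketch (not the lesson's code; `step_fn`, `EOS`, and the token indices are illustrative stand-ins for one decoder step and the vocabulary):

```python
EOS = 2  # assumed index of the end-of-sequence token

def greedy_decode(step_fn, sos=1, max_len=50):
    """Feed each prediction back in as the next input, stopping as soon
    as the model emits EOS; max_len is just a safety cap for sequences
    that never produce it."""
    tokens, inp = [], sos
    for _ in range(max_len):
        inp = step_fn(inp)      # one decoder step: next predicted token
        if inp == EOS:
            break               # most sequences exit here, not at max_len
        tokens.append(inp)
    return tokens
```

In practice the loop almost always terminates via the EOS branch, so the `max_len` cap rarely matters.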

(Rudraksh Tuwani) #61

What might be some ways of regularizing these seq2seq models besides dropout and weight decay?

(Rudraksh Tuwani) #62

AWD-LSTM is just your regular RNN with LSTM cells and all kinds of dropout. How do they train faster?

(Ananda Seelan) #63

IMO, attention itself might be acting as a regularizer in a way in seq2seq models.

(Ravi Sekar Vijayakumar) #66

My bad, I meant it gives better results.

(Gerardo Garcia) #67

Other than translation, could you please elaborate on the different applications this model can be used for?

(Emil) #68

Why is making the decoder bidirectional considered cheating?

(Ananda Seelan) #69

A couple of examples are text summarization and grammar correction.

(blake west) #70

Can you elaborate on the intuition behind why doing a language model backwards is a good idea?

(Ananda Seelan) #71

Also, any problem that involves a sequence of inputs leading to a sequence of outputs can be modelled with this family of networks. Another recent example I can think of is work where natural language sentences are translated to SQL queries.

(Arvind Nagaraj) #72

Because of things like:

a man…-…-…-his…
a woman…-…-…-her…

It’s better to know beforehand whether a “his” or “her” appears later in the sentence.

(Mike Kunz ) #73

Think of it as predicting the preceding word rather than the following word.

If we see “beer”, do we see “cold” or “root” before it?

(Alex) #74

I missed why we need the pr_force loop. Why can’t we always give the correct word in teacher forcing?

(Arvind Nagaraj) #75

That would be cheating: you want the student to learn after a while, so you stop giving it clues.

(Sam Lloyd) #76

We want to really help the model out at first by guiding it to the right words, but as it gets better it doesn’t need that as much, so we reduce the forcing.
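A minimal sketch of that schedule (hypothetical names; `step_fn` stands in for one decoder step, and the real lesson code works on batches of tensors, not single ints):

```python
import random

def decode_with_teacher_forcing(step_fn, target, pr_force, sos=0, rng=None):
    """At each step, with probability pr_force feed the ground-truth
    token as the next input instead of the model's own prediction.
    Training starts with pr_force near 1.0 and anneals it toward 0."""
    rng = rng or random.Random(0)
    preds, inp = [], sos
    for gold in target:
        pred = step_fn(inp)      # model predicts the next token from inp
        preds.append(pred)
        # teacher forcing: sometimes hand the model the correct answer
        inp = gold if rng.random() < pr_force else pred
    return preds
```

With `pr_force=1.0` the decoder always sees the gold tokens (full forcing); with `pr_force=0.0` it must live with its own predictions, which is what it faces at inference time.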

(K Sreelakshmi) #77

Could you quickly explain Attention once again @jeremy

(blake west) #78

I’ve often seen that using a temperature parameter when picking the next word in generative text models improves results a lot. Could that help in the decoder stage here? For others: temperature sampling is where you sometimes pick a word other than the maximum prediction, with the probabilities re-weighted by the temperature parameter.
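For anyone unfamiliar, a small numpy sketch of temperature sampling (an illustrative helper, not anything from the lesson's notebook):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits softened by a temperature.
    Temperature -> 0 approaches greedy argmax; higher values flatten
    the distribution and make sampling more diverse."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5]
greedy = sample_with_temperature(logits, temperature=1e-6)  # ≈ argmax
```

A very low temperature collapses onto the highest-scoring word, while a high one samples almost uniformly.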

(Phani Srikanth) #88

Why do we do a matrix multiply once again (here -> a = F.softmax(u @ self.V, 0)) after we obtain the outputs from the last layer and pass them through the linear layer and the activation function?
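For context, here is a numpy sketch of the additive-attention scoring being asked about (shapes and names are illustrative, not the lesson's exact code): `u` has one hidden-size vector per encoder position, so the extra multiply by `V` is needed to collapse each vector to a single scalar score before the softmax can turn them into one weight per position.

```python
import numpy as np

# Toy shapes: seq_len encoder outputs, each of size n_hid.
seq_len, n_hid = 4, 6
rng = np.random.default_rng(0)
enc_outs = rng.standard_normal((seq_len, n_hid))
W = rng.standard_normal((n_hid, n_hid))   # the linear layer's weights
V = rng.standard_normal(n_hid)            # the extra vector, like self.V

u = np.tanh(enc_outs @ W)                 # (seq_len, n_hid) after activation
scores = u @ V                            # (seq_len,): one scalar per position
a = np.exp(scores - scores.max())
a /= a.sum()                              # softmax over the sequence axis
context = a @ enc_outs                    # weighted sum of encoder outputs
```

Without the `u @ V` step there would be a whole vector per timestep and nothing well-defined to softmax into attention weights.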

(Emil) #89

You probably want the translated sentence to be as close as possible to the original, so you always need to pick the word with the maximum probability.