Part 2 lesson 11 wiki

I think it may be for two reasons:

  1. Pretrained embeddings for French aren’t available for AWD-LSTM, and the fastText ones have different dimensions.
  2. Simplicity. Jeremy’s goal was to illustrate seq2seq learning with a simple model so that we can grasp the basic idea behind it.
1 Like

Most sequences will hit the EOS token and break out of the loop.
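Roughly, the decode loop looks like this. This is only a sketch with placeholder names (decoder, bos_idx, eos_idx, max_len are my assumptions, not the lesson’s exact variables); max_len caps the minority of outputs that never emit EOS:

    import torch

    def greedy_decode(decoder, h, bos_idx, eos_idx, max_len=50):
        # Start from the beginning-of-sentence token and keep feeding the
        # decoder its own previous prediction, one step at a time.
        inp = torch.tensor([bos_idx])
        result = []
        for _ in range(max_len):
            out, h = decoder(inp, h)      # logits over the vocabulary
            inp = out.argmax(dim=-1)      # greedy pick of the next word
            if inp.item() == eos_idx:     # stop as soon as EOS is produced
                break
            result.append(inp.item())
        return result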

1 Like

What might be some ways of regularizing these seq2seq models besides dropout and weight decay?

4 Likes

AWD-LSTM is just your regular RNN with LSTM cells and all kinds of dropout. How do they train faster?

IMO, attention itself might, in a way, be acting as a regularizer in seq2seq models.

4 Likes

My bad, I meant it gives better results.

Other than translation, could you please elaborate on the different applications this model can be used for?

1 Like

Why is making the decoder bidirectional considered cheating?

6 Likes

A couple of examples are text summarization and grammar correction.

3 Likes

Can you elaborate on the intuition behind why doing a language model backwards is a good idea?

2 Likes

Also, any problem that involves a sequence of inputs mapping to a sequence of outputs can be modelled with this family of networks. Another recent example I can think of is work where natural language sentences are translated into SQL queries.

4 Likes

Because of things like:

a man … his …
a woman … her …

it’s better to know beforehand that a his/her appears later in the sentence.

3 Likes

Think of it as predicting the preceding word rather than the following word.

If we see “beer”, do we see “cold” or “root” before it?

4 Likes

I missed why we need the pr_force loop. Why can’t we always give the correct word in teacher forcing?

2 Likes

That would be cheating - you want the student to learn on its own after a while, so you stop giving it clues.

5 Likes

We want to really help the model out at first by guiding it to the right words, but as it gets better it doesn’t need as much help, so we reduce the forcing.
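As a rough sketch of that idea (placeholder names, not the lesson’s exact code; pr_force is the probability of forcing and is annealed from 1 towards 0 as training progresses):

    import random
    import torch

    def decode_with_forcing(decoder, h, targets, pr_force):
        # targets: 1-D tensor of ground-truth token ids, starting with BOS
        dec_inp = targets[0:1]            # start from the BOS token
        outputs = []
        for t in range(1, len(targets)):
            out, h = decoder(dec_inp, h)  # one decoding step -> logits
            outputs.append(out)
            pred = out.argmax(dim=-1)
            # With probability pr_force feed the true next word (teacher
            # forcing); otherwise feed the model's own prediction.
            dec_inp = targets[t:t+1] if random.random() < pr_force else pred
        return torch.stack(outputs), h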

2 Likes

Could you quickly explain attention once again, @jeremy?

4 Likes

I’ve often seen that using a temperature parameter when picking the next word for generative text models improves results a lot. Could that help for the decoder stage here? For others: temperature sampling is where you sometimes pick a word that is not the maximum prediction, with the probabilities reshaped by the temperature parameter.
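For concreteness, a minimal sketch of what I mean (assuming logits is a 1-D tensor of unnormalized scores over the vocabulary):

    import torch
    import torch.nn.functional as F

    def sample_with_temperature(logits, temperature=1.0):
        # Dividing the logits by the temperature before the softmax:
        # T < 1 sharpens the distribution (closer to argmax),
        # T > 1 flattens it (more diverse but riskier picks).
        probs = F.softmax(logits / temperature, dim=-1)
        # Sample the next token index instead of always taking the argmax.
        return torch.multinomial(probs, num_samples=1).item()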

3 Likes

Why do we do a matrix multiply once again (here -> a = F.softmax(u @ self.V, 0)) after we obtain the outputs from the last layer and pass them through the linear layer and the activation function?
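For reference, here is a sketch of the additive-attention scoring that line comes from (the shapes and names are my assumptions, not the exact lecture code). Multiplying by V collapses each timestep’s nh-dimensional vector u to a single scalar score, so the softmax over the sequence dimension gives one weight per source token:

    import torch
    import torch.nn.functional as F

    def additive_attention(enc_out, dec_hidden, W1, W2, V):
        # enc_out:    (src_len, batch, nh) encoder outputs, one per source token
        # dec_hidden: (batch, nh)          current decoder hidden state
        # W1, W2:     (nh, nh)             learned projection matrices
        # V:          (nh,)                learned vector that turns each nh-dim
        #                                  vector into a single score per token
        u = torch.tanh(enc_out @ W1 + (dec_hidden @ W2).unsqueeze(0))
        scores = u @ V                       # (src_len, batch): one scalar per token
        a = F.softmax(scores, dim=0)         # attention weights over source positions
        ctx = (a.unsqueeze(2) * enc_out).sum(0)   # (batch, nh) weighted context vector
        return ctx, a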

7 Likes

You probably want the translated sentence to be as close as possible to the original, so you always pick the word with the maximum probability.