Part 2 lesson 11 wiki

I think it may be for two reasons:

  1. Pretrained embeddings for French aren’t available for AWD-LSTM, and the fastText ones have different dimensions.
  2. Simplicity. Jeremy’s goal was to illustrate seq2seq learning with a simple model so that we can grasp the basic idea behind it.
1 Like

Most sequences will hit the EOS token and break out of the loop.
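Roughly, the decode loop looks like this. This is only a sketch with placeholder names (decoder, bos_idx, eos_idx, max_len are my assumptions, not the lesson’s exact variables); max_len caps the minority of outputs that never emit EOS:

    import torch

    def greedy_decode(decoder, h, bos_idx, eos_idx, max_len=50):
        # Start from the beginning-of-sentence token and keep feeding the
        # decoder its own previous prediction, one step at a time.
        inp = torch.tensor([bos_idx])
        result = []
        for _ in range(max_len):
            out, h = decoder(inp, h)      # logits over the vocabulary
            inp = out.argmax(dim=-1)      # greedy pick of the next word
            if inp.item() == eos_idx:     # stop as soon as EOS is produced
                break
            result.append(inp.item())
        return result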

1 Like

What might be some ways of regularizing these seq2seq models besides dropout and weight decay?

4 Likes

AWD-LSTM is just your regular RNN with LSTM cells and all kinds of dropout. How do they train faster?

IMO, attention itself might, in a way, be acting as a regularizer in seq2seq models.

4 Likes

My bad, I meant it gives better results.

Other than translation, could you please elaborate on the different applications this model can be used for?

1 Like

Why is making the decoder bidirectional considered cheating?

6 Likes

A couple of examples are text summarization and grammar correction.

3 Likes

Can you elaborate on the intuition behind why doing a language model backwards is a good idea?

2 Likes

Also, any problem that involves a sequence of inputs mapping to a sequence of outputs can be modelled with this family of networks. Another recent example I can think of is work where natural language sentences are translated into SQL queries.

4 Likes

Because of things like:

a man … his …
a woman … her …

it’s better to know beforehand that a his/her appears later in the sentence.

3 Likes

Think of it as predicting the preceding word rather than the following word.

If we see “beer”, do we see “cold” or “root” before it?

4 Likes

I missed why we need the pr_force loop. Why can’t we always give the correct word in teacher forcing?

2 Likes

That would be cheating - you want the student to learn on its own after a while, so you stop giving it clues.

5 Likes

We want to really help the model out at first by guiding it to the right words, but as it gets better it doesn’t need as much help, so we reduce the forcing.
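As a rough sketch of that idea (placeholder names, not the lesson’s exact code; pr_force is the probability of forcing and is annealed from 1 towards 0 as training progresses):

    import random
    import torch

    def decode_with_forcing(decoder, h, targets, pr_force):
        # targets: 1-D tensor of ground-truth token ids, starting with BOS
        dec_inp = targets[0:1]            # start from the BOS token
        outputs = []
        for t in range(1, len(targets)):
            out, h = decoder(dec_inp, h)  # one decoding step -> logits
            outputs.append(out)
            pred = out.argmax(dim=-1)
            # With probability pr_force feed the true next word (teacher
            # forcing); otherwise feed the model's own prediction.
            dec_inp = targets[t:t+1] if random.random() < pr_force else pred
        return torch.stack(outputs), h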

2 Likes

Could you quickly explain attention once again, @jeremy?

4 Likes

I’ve often seen that using a temperature parameter when picking the next word for generative text models improves results a lot. Could that help for the decoder stage here? For others: temperature sampling is where you sometimes pick a word that is not the maximum prediction, with the probabilities reshaped by the temperature parameter.
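For concreteness, a minimal sketch of what I mean (assuming logits is a 1-D tensor of unnormalized scores over the vocabulary):

    import torch
    import torch.nn.functional as F

    def sample_with_temperature(logits, temperature=1.0):
        # Dividing the logits by the temperature before the softmax:
        # T < 1 sharpens the distribution (closer to argmax),
        # T > 1 flattens it (more diverse but riskier picks).
        probs = F.softmax(logits / temperature, dim=-1)
        # Sample the next token index instead of always taking the argmax.
        return torch.multinomial(probs, num_samples=1).item()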

3 Likes

Why do we do a matrix multiply once again (here -> a = F.softmax(u @ self.V, 0)) after we obtain the outputs from the last layer and pass them through the linear layer and the activation function?
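For reference, here is a sketch of the additive-attention scoring that line comes from (the shapes and names are my assumptions, not the exact lecture code). Multiplying by V collapses each timestep’s nh-dimensional vector u to a single scalar score, so the softmax over the sequence dimension gives one weight per source token:

    import torch
    import torch.nn.functional as F

    def additive_attention(enc_out, dec_hidden, W1, W2, V):
        # enc_out:    (src_len, batch, nh) encoder outputs, one per source token
        # dec_hidden: (batch, nh)          current decoder hidden state
        # W1, W2:     (nh, nh)             learned projection matrices
        # V:          (nh,)                learned vector that turns each nh-dim
        #                                  vector into a single score per token
        u = torch.tanh(enc_out @ W1 + (dec_hidden @ W2).unsqueeze(0))
        scores = u @ V                       # (src_len, batch): one scalar per token
        a = F.softmax(scores, dim=0)         # attention weights over source positions
        ctx = (a.unsqueeze(2) * enc_out).sum(0)   # (batch, nh) weighted context vector
        return ctx, a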

7 Likes

You probably want the translated sentence to be as close as possible to the original, so you always pick the word with the maximum probability.