Part 2 lesson 11 wiki

(Arvind Nagaraj) #259

@jeremy: I’ve been working on the semantic similarity task using the Kaggle Quora duplicate pairs dataset.

It’s a classifier, not seq2seq, but I didn’t want to train from scratch. So I used the embedding layer weights from our LM and I’m training the rest of the model as usual. I understand it’s not the same as using our LM backbone, but it’s still a head start. I am planning to use our AWD-LSTM backbone and try the same exercise soon.
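For anyone wanting to try the same head start, here is a minimal, framework-free sketch of carrying embedding rows over from one vocabulary to another. The function name, the toy vocabularies and the weights are all made up for illustration; the only idea taken from the course is that tokens missing from the pre-trained vocabulary start from the mean of the pre-trained rows.

```python
def remap_embeddings(old_itos, old_weights, new_itos):
    """Build embedding rows for new_itos from rows trained on old_itos.

    Hypothetical helper: tokens present in the old vocabulary keep their
    pre-trained row, unseen tokens start from the mean of all old rows.
    """
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    n_dim = len(old_weights[0])
    # Column-wise mean of all pre-trained rows.
    mean_row = [sum(row[d] for row in old_weights) / len(old_weights)
                for d in range(n_dim)]
    return [list(old_weights[old_stoi[tok]]) if tok in old_stoi else list(mean_row)
            for tok in new_itos]


# Toy data, purely illustrative.
old_itos = ["_unk_", "_pad_", "the", "movie"]
old_weights = [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
new_itos = ["_unk_", "_pad_", "movie", "question"]

new_weights = remap_embeddings(old_itos, old_weights, new_itos)
# "movie" keeps [3.0, 4.0]; "question" is unseen and gets the mean [1.0, 1.5]
```

The resulting rows would then be copied into the new model's embedding layer before training the rest of the classifier.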

Do you think using our own English wiki embeddings would fare better in fr-en translation than the fastText English word embeddings?



I’ve been trying to do that, and have built a French LM for that purpose, to use with the English one. For now I’m still struggling to get them to comprehend each other, but I’m sure there’s a way.


(Arvind Nagaraj) #261

Thanks. My model is also taking too long to train.


(Jeremy Howard (Admin)) #262

Yes I expect a pre-trained LM would be better for basically all seq2seq tasks. You could also use a pre-trained french model of course.


(Kevin Bird) #264

Is there anywhere to read more about the dropout types? I am just trying to get a better intuition on what exactly the different dropouts are doing when building a language model.

EDIT: While digging into the md.get_model method, I found some decent documentation under the get_language_model function that gives a better explanation of what each of these does.

dropouth (float): dropout to apply to the activations going from one LSTM layer to another
dropouti (float): dropout to apply to the input layer.
dropoute (float): dropout to apply to the embedding layer.
wdrop (float): dropout used for an LSTM’s internal (or hidden) recurrent weights.
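To make dropoute a bit more concrete: in the AWD LSTM approach, embedding dropout removes whole words rather than individual activations. Below is a minimal sketch of that idea in PyTorch. The function name and sizes are my own assumptions for illustration, not the fastai implementation.

```python
import torch


def embedding_dropout(weight, p):
    """Zero whole rows (whole words) of an embedding matrix with probability p.

    Illustrative sketch only: surviving rows are rescaled by 1 / (1 - p) so
    the expected value of each weight is unchanged.
    """
    if p == 0:
        return weight
    keep_prob = torch.full((weight.size(0), 1), 1 - p)
    mask = torch.bernoulli(keep_prob) / (1 - p)  # one 0-or-scale value per row
    return weight * mask                          # broadcast across the row


torch.manual_seed(0)
weight = torch.ones(1000, 4)        # toy "embedding matrix" of 1000 words
dropped = embedding_dropout(weight, p=0.5)
row_sums = dropped.sum(dim=1)       # each row is either all 0.0 or all 2.0
```

By contrast, dropouti and dropouth act on activations flowing between layers, and wdrop acts on the hidden-to-hidden weight matrices themselves.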

Here is the paper Jeremy mentions below:


(Jeremy Howard (Admin)) #265

In the AWD LSTM paper.


(Kevin Bird) #266

I’m still digging into get_language_model and I’m wondering what the thinking behind padding_idx on the embedding layers is. If I understand padding_idx correctly, it takes one row of the embedding weights and sets it to zero. So I’m running a small test that looks like this to try to understand it better:

pad_emb_test = nn.Embedding(10, 2, padding_idx=1) #this would be a dictionary with 10 words each of them having 2 weights in the embedding vector

when I check the weights here, I see:

Parameter containing:
-1.3344 -0.3897
 0.0000  0.0000 #<-------------This is what padding_idx=1 does (ties the weights of [1] to 0)
-0.4985  0.4639
 0.3196 -1.9752
 0.1880  0.5855
 0.2163  0.2847
 0.0048  0.0531
 0.0900  0.8619
 1.3623 -0.4543
 0.0975 -0.6223
[torch.FloatTensor of size 10x2]

As you can see, this sets the embedding weights of index 1 to [0, 0].

This is what I’m not understanding. Isn’t this just a super tiny amount of dropout (1 row / number of tokens) applied at a layer lower than dropoute? So in the case I’m using above, it would be 10% dropout, but when I have 10000 tokens, it wouldn’t matter at all to have one row zeroed out.


(Arvind Nagaraj) #267

Pad is a special token we have introduced, and it gets its own place in our itos.
We use this pad token to pad sequences in a batch to the same length (bptt-ish).
Other special tokens we have introduced in fastai are UNK (for missing words when we switch datasets, wiki to imdb in this case), BOS, etc.

For unk tokens we would like to learn the weights during training. So to help the model we start with a good initial weight (the mean of all weights).
For the padding token we don’t really care, and set its weights to zero.
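A small PyTorch sketch of why this differs from dropout: the padding row is fixed rather than random, it is only looked up at pad positions, and it receives no gradient, so it stays at zero during training. The indices and the toy batch below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(10, 2, padding_idx=1)

pad_row_at_init = emb.weight[1].detach().clone()   # zeros from the start

# Toy batch of two sequences; the second one is padded with index 1.
batch = torch.tensor([[4, 5, 6],
                      [7, 1, 1]])
loss = emb(batch).sum()
loss.backward()

pad_grad = emb.weight.grad[1]    # no gradient flows into the padding row
word_grad = emb.weight.grad[4]   # a real token does receive a gradient

# One manual SGD step: the padding row cannot move away from zero.
with torch.no_grad():
    emb.weight -= 0.1 * emb.weight.grad
pad_row_after_step = emb.weight[1].detach().clone()
```

So the zero row is just a neutral vector for positions that carry no information, whatever the vocabulary size.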


(Kaitlin Duck Sherwood) #268

Just for kicks and giggles, I tried running the cats & dogs dataset images through the DeViSE predictor. The dogs were great! The predictor predicted the dog images to be dogs with 99.93% accuracy.

The cats sucked bigtime. The predictor only predicted the cats correctly 41.8% of the time!

Looking into the ImageNet data a little more carefully, it turns out that ImageNet has 118 dog categories and only 5 cat categories. Furthermore, if you take out categories which have an underscore (since fastText doesn’t have underscores), then there is only one cat category (‘tabby’) but sixty dog categories. Given that the training categories all tend to have roughly the same number of images, this means that there are about sixty times as many dog images as cat images.

Data matters!
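The arithmetic behind that, as a quick sketch. The handful of category names is only an illustrative sample, and the helper name is made up. The counts are the ones quoted above.

```python
def usable(categories):
    """Keep only category names without an underscore (the filter described above)."""
    return [c for c in categories if "_" not in c]


# Illustrative sample of names, not the full ImageNet list.
sample = ["tabby", "tiger_cat", "Persian_cat", "beagle", "golden_retriever"]
kept = usable(sample)            # ['tabby', 'beagle']

# Counts quoted in the post.
all_ratio = 118 / 5              # 23.6 dog categories per cat category
filtered_ratio = 60 / 1          # 60.0 after dropping underscore names

# With roughly equal images per category, the image ratio follows the
# category ratio, hence "about sixty times as many dog images".
```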


(Jeremy Howard (Admin)) #269

Nice debugging @Ducky!


(nok) #270

A quick question: is there an intuition behind stacking multiple LSTM/GRU layers? I find stacking layers in a CNN more intuitive, but I am not sure what it means when we stack LSTMs.

Sorry if this question is not directly related to the lesson, thanks.

So I just found this link, which explains how stacking RNNs helps with learning the structure of text. Is there any nice visualization or example that explains this?


(Jeremy Howard (Admin)) #271

The lesson 6 powerpoint shows it, and previously I suggested trying to implement it from scratch as an exercise.
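As a starting point for that exercise, here is a minimal sketch of what stacking means mechanically: the second layer reads the first layer's hidden state at every time step as its input sequence. It checks a hand-stacked pair of single-layer LSTMs against PyTorch's nn.LSTM with num_layers=2. The sizes are arbitrary and this is not the lesson's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_sz, n_hid, seq_len, bs = 4, 6, 5, 3

# Reference: PyTorch's built-in two-layer LSTM (no dropout between layers).
stacked = nn.LSTM(emb_sz, n_hid, num_layers=2)

# The same thing by hand: layer 2 consumes layer 1's output sequence.
layer1 = nn.LSTM(emb_sz, n_hid)
layer2 = nn.LSTM(n_hid, n_hid)

# Copy the reference weights so both versions compute the same function.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(layer1, name + "_l0").copy_(getattr(stacked, name + "_l0"))
        getattr(layer2, name + "_l0").copy_(getattr(stacked, name + "_l1"))

x = torch.randn(seq_len, bs, emb_sz)   # (time, batch, features)
with torch.no_grad():
    ref_out, _ = stacked(x)
    h1, _ = layer1(x)                  # one hidden state per time step
    manual_out, _ = layer2(h1)         # second layer runs over those states

same = torch.allclose(ref_out, manual_out, atol=1e-5)
```

One way to read this: layer 1 turns a sequence of embeddings into a sequence of hidden states, and layer 2 is just another RNN whose "words" are those hidden states.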


(Kyle Maxwell) #272

Know of a good paper that explores how the effective receptive field changes as you stack RNNs?


(Alex) #274

Has anyone had problems loading a previously saved nmslib index?

Kernel dies almost immediately without any reason that I can see for this code:

nmslib.loadIndex(nn_wvs, 'data/fb_word_vectors/all_nn_fb_nms_index')



Where can I find English word vectors in binary format (.bin)?
On the fastText website, the English word vectors are provided in .vec format only. Or has someone figured out how to convert .vec files to .bin?



They can be downloaded from here. Just scroll to your language and click bin+txt!
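On the conversion question: as far as I know, the .vec file is plain text holding only the word vectors, while the .bin file stores the full model including subword information, so a .vec cannot simply be converted into a .bin. If the word vectors alone are enough, the text format is easy to read directly. A minimal sketch, with a made-up function name and a tiny in-memory sample standing in for a real file:

```python
import io


def read_vec(handle):
    """Parse the fastText text (.vec) layout.

    The first line is "<word count> <dimension>", then one word per line
    followed by its space-separated vector components.
    """
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors


# Tiny stand-in for a real .vec file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
n_words, dim, vectors = read_vec(sample)
```

With a real download you would pass an open file handle instead of the StringIO object. Out-of-vocabulary lookups still need the .bin model.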


(Kaitlin Duck Sherwood) #277

I found the English binary format file here:


(WG) #278

Same here.

Did you ever figure out what is going on?


(Alex) #279

Nope. A few hours of research did not help.
No errors or messages in the logs; the kernel just dies.


(WG) #280

Yah it sucks because nmslib is crazy fast … but for now, I have to rebuild the index every time from scratch.

If I find a resolution I’ll make sure to update you here. Please do the same.