Part 2 lesson 11 wiki

narvind2003 · April 20, 2018, 7:21pm

@jeremy: I’ve been working on the semantic similarity task using the Kaggle Quora duplicate pairs dataset.

It’s a classifier, not seq2seq, but I didn’t want to train from scratch. So I used the embedding layer weights from our LM and I’m training the rest of the model as usual. I understand it’s not the same as using our LM backbone but it’s still a head start. I am planning to use our awd-lstm backbone and try the same exercise soon.
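A minimal sketch of reusing LM embedding weights with a different vocabulary, assuming the LM and the new dataset each have their own itos list. The helper name `remap_embeddings`, the tiny vocabularies, and the random weights are made up for illustration; words missing from the LM vocabulary get the mean embedding row as a starting point.

```python
import numpy as np

def remap_embeddings(old_weights, old_itos, new_itos):
    """Build an embedding matrix for new_itos from weights indexed by old_itos."""
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    mean_row = old_weights.mean(axis=0)  # fallback for words the LM never saw
    new_weights = np.empty((len(new_itos), old_weights.shape[1]), dtype=old_weights.dtype)
    for i, tok in enumerate(new_itos):
        j = old_stoi.get(tok, -1)
        new_weights[i] = old_weights[j] if j >= 0 else mean_row
    return new_weights

# Hypothetical vocabularies and random stand-in weights.
rng = np.random.default_rng(0)
old_itos = ["_unk_", "_pad_", "the", "cat", "sat"]
old_weights = rng.normal(size=(len(old_itos), 4)).astype(np.float32)
new_itos = ["_unk_", "_pad_", "cat", "quora", "the"]

new_weights = remap_embeddings(old_weights, old_itos, new_itos)
```

The resulting matrix can then be copied into the embedding layer of the new model before training the rest of it.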

Do you think using our own English wiki embeddings would fare better in the fr-en translation than the fastText English word embeddings?

sgugger · April 20, 2018, 7:32pm

I’ve been trying to do that, and have built a French LM for that purpose, to use with the English one. For now I’m still struggling to get them to understand each other, but I’m sure there’s a way.

narvind2003 · April 20, 2018, 7:37pm

Thanks. My model is also taking too long to train.

jeremy · April 20, 2018, 8:07pm

Yes I expect a pre-trained LM would be better for basically all seq2seq tasks. You could also use a pre-trained french model of course.

KevinB · April 21, 2018, 6:23pm

Is there anywhere to read more about the dropout types? I am just trying to get a better intuition on what exactly the different dropouts are doing when building a language model.

EDIT: while digging into the md.get_model method, I found some decent documentation under the get_language_model function that explains what each of these does.

dropouth (float): dropout to apply to the activations going from one LSTM layer to another
dropouti (float): dropout to apply to the input layer.
dropoute (float): dropout to apply to the embedding layer.
wdrop (float): dropout used for an LSTM’s internal (or hidden) recurrent weights.

Here is the paper Jeremy mentions below: https://arxiv.org/pdf/1708.02182.pdf
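To make the difference between these concrete, here is a rough NumPy sketch of where each mask would be applied. It is not the fastai code: the shapes, probabilities, and the `dropout_mask` helper are made up, and it assumes the input dropout mask is shared across time steps, as in variational dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, p):
    """Inverted-dropout mask: 0 with probability p, 1/(1-p) otherwise."""
    keep = rng.random(shape) >= p
    return keep / (1.0 - p)

vocab, emb_dim, hidden = 8, 4, 4
emb_weight = rng.normal(size=(vocab, emb_dim))   # embedding matrix
w_hh = rng.normal(size=(hidden, hidden))         # hidden-to-hidden LSTM weights
tokens = np.array([1, 5, 5, 2])                  # one short sequence

# dropoute: one mask value per vocabulary row, so whole words are dropped.
emb_dropped = emb_weight * dropout_mask((vocab, 1), p=0.5)

# dropouti: dropout on the looked-up activations; the same mask is reused
# at every time step (assumption: variational dropout).
x = emb_dropped[tokens]
x = x * dropout_mask((1, emb_dim), p=0.25)

# wdrop: individual recurrent weights are zeroed (DropConnect), not activations.
w_hh_dropped = w_hh * dropout_mask(w_hh.shape, p=0.5)

# dropouth would be applied like dropouti, but to the output sequence of one
# LSTM layer before it is fed into the next layer.
```

Note that with dropoute a dropped word is zero everywhere it appears in the sequence, which is different from zeroing individual activations.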

jeremy · April 21, 2018, 7:21pm

In the AWD LSTM paper.

KevinB · April 21, 2018, 9:28pm

I’m still digging into get_language_model and I’m wondering what the reasoning behind padding_idx on the embedding layers is. If I understand padding_idx correctly, it takes one row of the embedding weights and sets it to zero. So I’m running a small test like this to try to understand it better:

pad_emb_test = nn.Embedding(10, 2, padding_idx=1) #this would be a dictionary with 10 words each of them having 2 weights in the embedding vector
pad_emb_test.weight

when I check the weights here, I see:

Parameter containing:
-1.3344 -0.3897
 0.0000  0.0000 #<-------------This is what padding_idx=1 does (ties the weights of [1] to 0)
-0.4985  0.4639
 0.3196 -1.9752
 0.1880  0.5855
 0.2163  0.2847
 0.0048  0.0531
 0.0900  0.8619
 1.3623 -0.4543
 0.0975 -0.6223
[torch.FloatTensor of size 10x2]

As you can see, this sets the embedding weights of index 1 to [0, 0].

This is what I’m not understanding. Isn’t this just a super tiny amount of dropout (1 row / number of tokens) applied at a layer lower than dropoute? In the case I’m using above it would be 10% dropout, but when I have 10000 tokens, it wouldn’t matter at all to have one row zeroed out.

narvind2003 · April 22, 2018, 1:51am

Pad is a special token we have introduced, and it gets its own place in our itos.
We use this pad token to pad sequences in a batch to the same length (bptt-ish).
Other special tokens we have introduced in fastai are UNK (for missing words when we switch datasets, wiki to imdb in this case), BOS… etc.

For unk tokens we would like to learn the weights during training. So to help the model we start with a good initial weight (the mean of all weights).
For the padding token we don’t really care, so we set its weights to zero.
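A small sketch of why this is different from dropout: the padding row is always the same row, it is zero from the start, and it receives no gradient, so it stays zero while the other rows train. The batch, loss, and learning rate here are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(10, 2, padding_idx=1)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

# Two sequences padded to the same length with the pad index (1).
batch = torch.tensor([[2, 3, 1, 1],
                      [4, 5, 6, 1]])

before = emb.weight.detach().clone()

loss = emb(batch).pow(2).sum()   # arbitrary loss, just to get gradients
opt.zero_grad()
loss.backward()
opt.step()

pad_row_is_zero = bool((emb.weight.detach()[1] == 0).all())
pad_grad_is_zero = bool((emb.weight.grad[1] == 0).all())
other_row_changed = not torch.equal(emb.weight.detach()[2], before[2])
```

So padding_idx is not regularization; it just gives the pad token a fixed, neutral vector.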

Ducky · April 22, 2018, 2:10am

Just for kicks and giggles, I tried running the images from the cats & dogs dataset through the DeViSE predictor. The dogs were great! The predictor classified the dog images as dogs with 99.93% accuracy.

The cats sucked bigtime. The predictor only predicted the cats correctly 41.8% of the time!

Looking into the ImageNet data a little more carefully, it turns out that ImageNet has 118 dog categories and only 5 cat categories. Furthermore, if you take out the categories which have an underscore (since fastText doesn’t have underscores), then there is only one cat category (‘tabby’) but sixty dog categories. Given that the training categories all tend to have roughly the same number of images, this means that there are about sixty times as many dog images as cat images.
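The underscore filter is easy to sketch. The class lists below are a small illustrative sample (the dog list in particular is nowhere near the full 118), so the counts it produces are not the real ones.

```python
# Illustrative sample of class names; NOT the full ImageNet lists.
class_names = {
    "cat": ["tabby", "tiger_cat", "Persian_cat", "Siamese_cat", "Egyptian_cat"],
    "dog": ["beagle", "pug", "golden_retriever", "German_shepherd", "Chihuahua"],
}

def usable(names):
    """Keep only the names without an underscore, i.e. single-word names."""
    return [name for name in names if "_" not in name]

usable_counts = {group: len(usable(names)) for group, names in class_names.items()}
ratio = usable_counts["dog"] / usable_counts["cat"]
```

With roughly equal images per category, the ratio of usable categories is also roughly the ratio of training images.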

Data matters!

jeremy · April 22, 2018, 2:32am

Nice debugging @Ducky!

nok · April 22, 2018, 4:41pm

A quick question: is there an intuition for stacking multiple LSTM/GRU layers? I find stacking layers in a CNN more intuitive, but I am not sure what it means when we stack LSTMs.

Sorry if this question is not directly related to the lesson, thanks.

I just found this link, which explains how stacking RNNs helps with learning the structure of text. Is there a nice visualization or example that explains this?
http://qr.ae/TU1ECW

jeremy · April 22, 2018, 4:58pm

The lesson 6 powerpoint shows it, and I previously suggested trying to implement it from scratch as an exercise.
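A bare-bones version of that exercise, using a plain tanh RNN instead of an LSTM to keep it short. The point is only the wiring: the sequence of hidden states from one layer becomes the input sequence of the next. Sizes and weight scales are arbitrary.

```python
import numpy as np

def rnn_layer(inputs, w_ih, w_hh, b):
    """Run a tanh RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(w_hh.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(w_ih @ x_t + w_hh @ h + b)
        outputs.append(h)
    return np.stack(outputs)

def stacked_rnn(inputs, layers):
    """Feed each layer's full output sequence into the next layer."""
    seq = inputs
    for w_ih, w_hh, b in layers:
        seq = rnn_layer(seq, w_ih, w_hh, b)
    return seq

rng = np.random.default_rng(0)
seq_len, n_in, n_hid = 5, 3, 4
x = rng.normal(size=(seq_len, n_in))

layers = [
    # layer 1 reads the raw inputs
    (rng.normal(size=(n_hid, n_in)) * 0.1, rng.normal(size=(n_hid, n_hid)) * 0.1, np.zeros(n_hid)),
    # layer 2 reads layer 1's hidden states
    (rng.normal(size=(n_hid, n_hid)) * 0.1, rng.normal(size=(n_hid, n_hid)) * 0.1, np.zeros(n_hid)),
]

out = stacked_rnn(x, layers)
```

Each layer keeps its own recurrent state over time, so the second layer sees a sequence that has already been summarized by the first.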

fizx · April 22, 2018, 5:42pm

Know of a good paper that explores how the effective receptive field changes as you stack RNNs?

tensoralex · May 4, 2018, 12:37am

Has anyone had problems loading a previously saved nmslib index?

Kernel dies almost immediately without any reason that I can see for this code:

nn_wvs = nmslib.init()
nmslib.loadIndex(nn_wvs, 'data/fb_word_vectors/all_nn_fb_nms_index')

Stoufa · May 17, 2018, 12:14pm

Where can I find English word vectors in binary format (.bin)?
On the fastText website, the English word vectors are provided in .vec format only. Or has someone figured out how to convert .vec files to .bin?

sgugger · May 17, 2018, 1:05pm

They can be downloaded from here. Just scroll to your language and click bin+txt!

Ducky · May 18, 2018, 4:23am

I found the English binary format file here: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
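For reference, the .vec file is plain text: a header line with the word count and the dimension, then one word per line followed by its values. As far as I know it cannot be turned back into a .bin, since the .bin also stores the subword information. A minimal reader sketch (the `read_vec` helper and the inline sample are made up for illustration):

```python
import io

def read_vec(handle):
    """Parse fastText .vec text: 'n_words dim' header, then 'word v1 v2 ...' lines."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors, dim

# Tiny made-up sample standing in for a real .vec file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
vectors, dim = read_vec(sample)
```

For a real file, pass an open file handle instead of the StringIO sample.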

wgpubs · May 19, 2018, 4:31am

Same here.

Did you ever figure out what is going on?

tensoralex · May 19, 2018, 6:28pm

Nope. A few hours of research did not help.
No errors or messages in the logs; the kernel just dies.

wgpubs · May 19, 2018, 7:34pm

Yah it sucks because nmslib is crazy fast … but for now, I have to rebuild the index every time from scratch.

If I find a resolution I’ll make sure to update you here. Please do the same.

-wg