Part 2 lesson 11 wiki

The ImageNet localization challenge has all the classification data too.

nmslib provides faster and more accurate kNN than clustering can. There may be interesting applications of clustering DeVISE outputs, however - feel free to experiment!
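
For anyone who wants to play with this, here is a minimal sketch of using nmslib for approximate kNN over a matrix of embeddings (the random matrix just stands in for DeVISE-style outputs, and the hnsw/cosinesimil settings are reasonable defaults rather than anything prescribed):

import numpy as np
import nmslib

vecs = np.random.randn(10000, 300).astype(np.float32)     # stand-in for DeVISE image/word embeddings

index = nmslib.init(method='hnsw', space='cosinesimil')   # HNSW graph index with cosine similarity
index.addDataPointBatch(vecs)
index.createIndex({'post': 2}, print_progress=True)

idxs, dists = index.knnQuery(vecs[0], k=10)                # indices and distances of the 10 nearest neighbours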

Many thanks to the helpful folks who added links - there’s now lots of helpful stuff in the top post.


I’ve now posted the lesson video to the top post.


In these lines, Jeremy truncates the input sentences to be at most as long as the 99th percentile (English) and 97th percentile (French) of sentence lengths. The sentences that get truncated will not have EOS tokens or padding.

enlen_90 = int(np.percentile([len(o) for o in en_ids], 99))
frlen_90 = int(np.percentile([len(o) for o in fr_ids], 97))
...
en_ids_tr = np.array([o[:enlen_90] for o in en_ids])
fr_ids_tr = np.array([o[:frlen_90] for o in fr_ids])

(He explained in class that there are a very few sentences which are very, very long, and that those end up taking an annoyingly long time, so this truncation is purely for performance reasons.)

So only about 1% of the English sentences and 3% of the French sentences will be truncated; the rest will go through without any problem.
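
A toy example of what np.percentile returns here (made-up lengths, just to show that only the extreme outliers end up above the cutoff):

import numpy as np

lens = [3, 5, 7, 9, 200]                  # hypothetical sentence lengths with one huge outlier
cutoff = int(np.percentile(lens, 99))     # -> 192 here; nearly every sentence is shorter than this
trimmed = [min(l, cutoff) for l in lens]  # only the outlier is actually cut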

The numpy function is percentile (not percentage).

Exactly!

I think I know why. At first, BOS and EOS seemed redundant, because an EOS will usually be followed by a BOS. In the translate notebook we are just translating one question, so there will not be a second BOS. Using EOS instead of BOS gives the decoder a way to tell us when it thinks it has finished.

Actually, no. I’ve gotten further in the video (~1:02:41), and we are using 0 (i.e. _bos_) as the first input. So I think we can try putting BOS as the first token and see if it trains better. Also, we are not really using EOS to terminate the for loop:

if (dec_inp==1).all(): break

So maybe we can try omitting EOS and just relying on the padding character.
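
For context, the decoding loop we’re talking about looks roughly like this - a paraphrase of the notebook’s forward pass written as a standalone function, so the argument names are my own:

import torch

def greedy_decode(h, emb_dec, gru_dec, out_lin, bs, max_len, pad_idx=1):
    # h is the encoder's final hidden state; the first decoder input is index 0,
    # which is where BOS (or whatever token 0 maps to) comes in
    dec_inp = torch.zeros(bs, dtype=torch.long)
    res = []
    for i in range(max_len):
        emb = emb_dec(dec_inp).unsqueeze(0)    # embed the previous prediction
        outp, h = gru_dec(emb, h)              # one decoder step
        outp = out_lin(outp[0])                # project to vocabulary logits
        res.append(outp)
        dec_inp = outp.max(1)[1]               # greedy: feed back the argmax word
        if (dec_inp == pad_idx).all(): break   # the loop only stops on the pad index, not EOS
    return torch.stack(res)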


In case anyone is unsure how to get the full data, the commands I ended up running were the following…

Note that the download (the wget part) took me about 45 minutes on Paperspace.

mkdir ~/fastai/courses/dl2/data/translate
cd ~/fastai/courses/dl2/data/translate
wget http://www.statmt.org/wmt10/training-giga-fren.tar
tar -xvf training-giga-fren.tar
gunzip giga-fren.release2.fixed.en.gz
gunzip giga-fren.release2.fixed.fr.gz

PS: I ended up having to upgrade my disk space on Paperspace to handle the new data, and when you do that, they very unhelpfully fail to mention that you have to manually “expand your disk” after upgrading. It’s very easy, but until you do that, your machine is worthless. The link for how to do that is here. Hope this helps someone!


But your earlier explanation also made sense. It was exactly what I thought.

If it has to depend on the pad character to break, then why did we even introduce the BOS token?

I am looking into the attention model code. Below are the two lines of the decoder seq2seq RNN code, for the normal model and for the model with attention respectively.
There are two inputs to the RNN: the hidden state “h” is common to both; the difference is in the other input, which for the attention model is built from the encoder’s outputs at every time step.

For the attention seq2seq model’s input, we take the decoder’s current state and, with a small two-layer NN ending in a softmax, predict a set of weights saying how much each encoder time step should influence the current decoder word. We then recombine the encoder outputs as a weighted average using those weights and feed the result into the decoder at the current time step.

In the seq2seq model without attention, by contrast, the current decoder state plays no such role: the other input is simply “emb”, the embedding of the previous decoder output, which is fed to the GRU together with “h”.

Model without attention:
outp, h = self.gru_dec(emb, h)

Model with attention:
outp, h = self.gru_dec(wgt_enc.unsqueeze(0), h)
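
To make the difference concrete, here is a sketch of one decoder step with attention, paraphrasing how I read the notebook’s Seq2SeqAttnRNN (the names W1, l2, l3, V and the exact shapes are my assumptions, so treat this as illustration rather than the official code):

import torch
import torch.nn.functional as F

def attn_decoder_step(dec_inp, h, enc_out, emb_dec, W1, l2, l3, V, gru_dec):
    # enc_out: (src_len, bs, nh) - encoder output at every source time step
    # h:       (nl, bs, nh)      - the decoder's current hidden state
    # dec_inp: (bs,)             - the previous decoder prediction
    w1e = enc_out @ W1                          # project every encoder output
    w2h = l2(h[-1])                             # project the current decoder state
    u = torch.tanh(w1e + w2h)                   # the small two-layer NN
    a = F.softmax(u @ V, dim=0)                 # softmax over source positions -> attention weights
    Xa = (a.unsqueeze(2) * enc_out).sum(0)      # weighted average of the encoder outputs
    emb = emb_dec(dec_inp)                      # embedding of the current decoder word
    wgt_enc = l3(torch.cat([emb, Xa], dim=1))   # combine embedding and attended context
    outp, h = gru_dec(wgt_enc.unsqueeze(0), h)  # the line quoted above
    return outp, h, a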

I am writing up my understanding of the attention model. While going through the code I had a lot of questions, but Jeremy’s lecture has cleared most of them up.
Please point out if there is anything wrong with my understanding, or add to my explanation if you can make it easier to understand.

That was needed for the IMDB model (it helps the classifier know how to reset properly). I expect you could safely remove it from the translation model - give it a try and see!

But there’s no harm having it, and I don’t like to make changes if they’re not necessary.


When I try to load the English word vectors found here with the get_vecs function, my kernel dies every time on my P6000. I know the file is heavy (6GB), but I’m supposed to have 30GB of RAM, so it’s a bit weird. Does anyone have the same problem? Any ideas on how to solve this?

On my laptop (with 16GB of RAM) I had no problem loading and converting the lighter version we can find here, but it doesn’t have the same mean and std as the one in the notebook, so I’m not sure it’s the one Jeremy used in his notebook.

Edit: If anyone has the same problem, I solved this issue by rewriting the get_vecs function like this (it even has a cool fastai-style progress bar ^^):

import pickle
import numpy as np
from tqdm import tqdm

def get_vecsb(lang):
    # Stream the .vec file line by line instead of loading it all into memory at once.
    # PATH and is_number are the ones already defined in the notebook.
    vecd = {}
    with open(PATH/f'wiki.{lang}.vec', encoding='utf-8') as infile:
        # the first line of a fastText .vec file is "<vector count> <dimension>"
        length, dim = infile.readline().split()
        for i in tqdm(range(int(length))):
            line = infile.readline()
            if not line: break                      # end of file
            w, *v = line.split()
            if is_number(v[0]) and len(v)==300:
                vecd[w] = np.array(v, dtype=np.float32)
    pickle.dump(vecd, open(PATH/f'wiki.{lang}.pkl','wb'))
    return vecd

I used the binary version with the fastText Python library. Did you have RAM problems with the binary or the text version?

I used the text version since it was lighter to download just that; that’s probably where the problem came from.

A quick look showed me that the get_vecs version used 12GB of RAM for the French file, while the rewritten version stays around 1GB.

The binary version worked fine for me.
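
For reference, loading the binary vectors goes roughly like this with the facebookresearch Python bindings (PATH and the wiki.en.bin filename are just how I have things laid out, so adjust as needed):

import fastText as ft

en_model = ft.load_model(str(PATH/'wiki.en.bin'))   # load_model wants a plain string path
vec = en_model.get_word_vector('house')             # 300-d vector; works even for out-of-vocabulary words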

FYI, on the fasttext.cc page for English, I only saw links for the (text) vectors.

I found the bin+text version on the GitHub fastText page.


Note that there are at least two different “fast[Tt]ext” repos.
https://github.com/salestock/fastText.py.git - fasttext
https://github.com/facebookresearch/fastText - fastText

I found that if I did pip install fastText, I got something which wouldn’t let me import fastText.

If I did pip install git+https://github.com/facebookresearch/fastText, then I was able to import fastText.

If you get the wrong fasttext, as I did at first, you get a strange out of memory error.


Be sure you’ve got the latest notebook - it has all these paths and details for the modules and data in it, FYI.