Part 2 lesson 11 wiki

You could create an issue on their GitHub repo.

Greetings, I have some questions about this lecture :nerd_face:

  • Why do we have to transpose the arrays? (43:32)
    I’m a beginner in PyTorch, but I read somewhere that the batch_size dimension should come first so that computations run faster; is that the reason?
  • Why did we use int(bs*1.6) instead of bs for val_dl?
  • Does multiplying the vectors by 3 have to do with their standard deviation? (torch.from_numpy(vecs[w]*3) in the function create_emb.) So the general rule would be (wordVector - mean) / standardDeviation, right?
  • How do we deal with large datasets? How do we manage time and space complexity?
  • Instead of passing the encoder’s output through a linear layer, why don’t we change the RNN cell to output a vector of the correct shape from the start and feed it directly to the decoder, i.e. replacing nh with em_sz_dec?
    self.gru_enc = nn.GRU(em_sz_enc, em_sz_dec, num_layers=nl, dropout=0.25)
    Is it compulsory to go through a linear layer? Why or why not?
  • In the decoder we used a long tensor; why don’t we use a regular float tensor and round it at the end to get the closest index? Would that give us better results?
  • Can we vectorize the decoder’s for loop? If so, what would the code look like? I think the hardest part is how to update the value of dec_inp (the previously translated word) without a loop.
  • Does fastai provide a way to save the encoder’s weights and the decoder’s weights separately? This would be useful for other language pairs; for example, if we swap them, we could translate from English back to French without any further training.
  • Why is adding bidirectional=True kind of considered cheating? Is it because the attention mechanism is already doing nearly the same thing?
  • Can you please explain exactly what this line does: h = h.view(2,2,bs,-1).permute(0,2,1,3).contiguous().view(2,bs,-1) (it’s in the forward method of the Seq2SeqRNN_Bidir class)?
  • What’s the effect of dividing the random tensor by math.sqrt(sz[0]) in the function rand_t(*sz)?
  • Why use tanh instead of ReLU in the attention mini-net?
  • Has anyone tried using sentencepiece with fastai? How would this change the code?
  • In the following lines of code, why did we use en_trn and en_val instead of trn_ds and val_ds (the whole datasets)? Is it a typo or am I missing something?
    trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
    val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))
  • In seq2seq_loss, I understand that we did input = input[:sl] to make both sentences the same length so that the cross-entropy loss function can be applied afterwards. Is there a chance the model predicts a correct translation that is longer than the target sentence? In that case, shouldn’t we pad both to the length of the longest sentence instead of truncating the prediction? Is this correct?
  • Can someone explain the role of partial(...) in opt_fn = partial(optim.Adam, betas=(0.8, 0.99))? The comments in the source code aren’t very beginner-friendly.

@jeremy @sgugger @Ducky anyone ? :innocent:

Sorry, I have been sick and/or busy. I haven’t had time to dig into all your questions. I’m hoping to do that in a few days.

Thank you for your reply @Ducky, get well soon :slightly_smiling_face:

Hello all, I am wondering about something after loading the pretrained wiki vectors. What if I train the model first on an English->French word dictionary (e.g. the Oxford dictionary), with empty padding so that each entry pretends to be a sentence? Would that increase accuracy before training it on the giga-fren dataset?

I have trouble getting good results on longer sentences (more than ~6 words) even though I made no changes to the seq2seq model code.

In order to get to Google Translate accuracy on English to French, what else are we missing from the lesson?

Hi everyone,

I have trained a neural machine translation model which gives a certain BLEU score on a dataset, but improving the model (achieving lower trn_loss and val_loss) causes a lower BLEU score.
Does this mean that the modified cross-entropy is not a good loss function for training the model?

On the other hand, seq2seq_loss requires two arguments, seq2seq_loss(input, target), but Jeremy did not provide them in this line:

learn.crit = seq2seq_loss

Is it okay, or did I miss something?
Thanks in advance everyone.

I hope I can help answer a few questions here.

Does multiplying the vectors by 3 have to do with their standard deviation? (torch.from_numpy(vecs[w]*3) in the function create_emb.) So the general rule would be (wordVector - mean) / standardDeviation, right?

I’ve tried this generalization before, but it gave me an error.

Does fastai provide a way to save the encoder’s weights and the decoder’s weights separately? This would be useful for other language pairs; for example, if we swap them, we could translate from English back to French without any further training.

In text.py, in the RNN_Learner class, there is save_encoder. I think it would be similar if you want to save the decoder instead.
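
For the decoder there is no built-in equivalent as far as I know, but since the encoder and decoder are just submodules of the seq2seq model, you can also save them with plain PyTorch. A minimal sketch, assuming the model exposes them as gru_enc / gru_dec like the lesson’s Seq2SeqRNN (the attribute and file names here are placeholders; adjust them for your own class):

import torch

def save_enc_dec(model, enc_path='enc_weights.pth', dec_path='dec_weights.pth'):
    # Assumes the model has `gru_enc` and `gru_dec` submodules, as in the
    # lesson's Seq2SeqRNN; the file names are just placeholders.
    torch.save(model.gru_enc.state_dict(), enc_path)
    torch.save(model.gru_dec.state_dict(), dec_path)

def load_enc_dec(model, enc_path='enc_weights.pth', dec_path='dec_weights.pth'):
    # Load the weights back into a freshly constructed model with the same sizes.
    model.gru_enc.load_state_dict(torch.load(enc_path))
    model.gru_dec.load_state_dict(torch.load(dec_path))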

What’s the effect of dividing the random tensor by math.sqrt(sz[0]) in the function rand_t(*sz)?

I think this is to imitate Xavier initialization (a commonly used weight initialization for tanh activations).
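
The function in the notebook is roughly the following; dividing a standard normal tensor by sqrt(sz[0]) (the fan-in) scales its standard deviation down to 1/sqrt(fan_in), which is the same spirit as Xavier/Glorot initialization:

import math
import torch

def rand_t(*sz):
    # Standard normal values scaled by 1/sqrt(fan_in), so the resulting
    # std is 1/sqrt(sz[0]) instead of 1.
    return torch.randn(sz) / math.sqrt(sz[0])

w = rand_t(256, 256)
print(w.std())  # roughly 1/sqrt(256) = 0.0625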

Why use tanh instead of ReLU in the attention mini-net?

LSTMs and GRUs are known to work better with tanh, so we would want to choose tanh as an initial try. I don’t know if there is any update on this, but it is open for experiment if you want to try ReLU. Leaky ReLU and SELU are also worth trying. :slight_smile:
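
For reference, a minimal sketch of an attention “mini-net” that uses tanh; the layer sizes and names here are made up for illustration, not the lesson’s exact code:

import torch
import torch.nn as nn

class AttnMiniNet(nn.Module):
    # Scores each encoder output against the current decoder hidden state;
    # tanh keeps the pre-softmax scores bounded and centred around zero.
    def __init__(self, nh):
        super().__init__()
        self.l1 = nn.Linear(nh, nh)
        self.l2 = nn.Linear(nh, 1)

    def forward(self, enc_outs, dec_hidden):
        # enc_outs: (seq_len, bs, nh), dec_hidden: (bs, nh)
        x = torch.tanh(self.l1(enc_outs) + dec_hidden)  # broadcast over seq_len
        scores = self.l2(x).squeeze(-1)                 # (seq_len, bs)
        return torch.softmax(scores, dim=0)             # attention weights over time

attn = AttnMiniNet(nh=16)
w = attn(torch.randn(7, 3, 16), torch.randn(3, 16))    # weights over 7 time steps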

In the following lines of code, why we used en_trn and en_val instead of trn_ds and val_ds (the whole datasets) ? is it a typo or am I missing something ?
trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))

Note: Seq2SeqDataset and SortSampler are both classes in text.py.
I think the reason is that trn_ds and val_ds have the __main__.Seq2SeqDataset type, while SortSampler needs a list instead.

I’m not an expert in this, but I hope it helps.

It did clarify some of my doubts, thank you :slightly_smiling_face:

Hi @Stoufa,
I have explored a bit more, and I would like to revisit some of your questions :slight_smile:

Why did we use int(bs*1.6) instead of bs for val_dl?

trn_dl uses more memory (because training also includes the gradient calculation), while val_dl only uses memory proportional to the forward pass. Using a larger bs for val_dl will speed up the calculation of val_loss, but we have to be careful that the bs is not too large, to avoid running out of memory. I think from there, we simply take a rough guess that 1.6 is still safe. It may be lower or higher than 1.6 though.
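
A rough plain-PyTorch sketch of the idea (the notebook’s fastai DataLoader takes extra arguments such as the sampler and padding options; the dummy datasets below are only there to make the sketch runnable):

import torch
from torch.utils.data import DataLoader, TensorDataset

bs = 125  # example training batch size

# Dummy datasets standing in for the real translation datasets.
trn_ds = TensorDataset(torch.randn(10000, 8), torch.randint(0, 2, (10000,)))
val_ds = TensorDataset(torch.randn(2000, 8), torch.randint(0, 2, (2000,)))

# Training needs memory for activations *and* gradients; validation only runs
# the forward pass, so a ~1.6x larger batch usually still fits on the GPU and
# makes computing val_loss faster. 1.6 is just a heuristic safety margin.
trn_dl = DataLoader(trn_ds, batch_size=bs, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=int(bs * 1.6))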

Does multiplying the vectors by 3 have to do with their standard deviation? (torch.from_numpy(vecs[w]*3) in the function create_emb.) So the general rule would be (wordVector - mean) / standardDeviation, right?

The idea is that we generate a randomly initialized embedding matrix (whose mean and stdev will be roughly 0 and 1 respectively), then look each word up in the pre-trained vectors (fasttext). If the word exists in the pre-trained model, we prefer the fasttext vector over our randomly initialized one. Something to note is that fasttext word vectors have a mean very close to zero, but a stdev of around 0.3. So that the vectors we take from fasttext have the same scale as the random ones, we tweak them by multiplying them by 3. Dividing by the stdev follows the same idea.
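
An illustrative sketch of that idea (not the exact notebook code; vecs is assumed to be a dict of fasttext numpy vectors and itos the index-to-word list):

import numpy as np
import torch
import torch.nn as nn

def create_emb_sketch(vecs, itos, em_sz):
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)  # random init, std ~ 1
    wgts = emb.weight.data
    for i, w in enumerate(itos):
        try:
            # fasttext vectors have std ~ 0.3, so multiply by 3 (~ 1/0.3) to
            # bring them to the same scale as the randomly initialized rows.
            wgts[i] = torch.from_numpy(vecs[w] * 3)
        except KeyError:
            pass  # word not in fasttext: keep the random row
    return emb

# Toy usage with a fake "pre-trained" vector:
vecs = {'hello': np.random.randn(300).astype(np.float32) * 0.3}
itos = ['<unk>', '<pad>', 'hello', 'world']
emb = create_emb_sketch(vecs, itos, 300)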

How do we deal with large datasets? How do we manage time and space complexity?

Is the data in the large dataset clean? If not, you could try to clean it first; you would have a smaller dataset by then. Then… you may use a lot of it for training, but not too much for validation, because it will take a long time to calculate val_loss if you use it all. I’ve tried it and it didn’t even get past the first epoch. Hahaha.

Instead of passing the encoder’s output through a linear layer, why don’t we change the RNN cell to output a vector of the correct shape from the start and feed it directly to the decoder, i.e. replacing nh with em_sz_dec?

What we do is apply a linear combination of the RNN output (in a nutshell, Ax+b) to form a ‘state’ vector. It’s like wrapping the words up into a sentence. We could pass it directly to the decoder without that extra layer, but the consequence is that your architecture may not be complex enough to model the phenomenon (e.g. translation).

Is it compulsory to go through a linear layer? Why or why not?

I would not say it is compulsory, but let me give an illustration. Suppose you have a very simple problem: you can model it with a single perceptron, and you won’t even need a multi-layer perceptron (neural network). You would want to use a neural network when the problem needs more capacity than a simple perceptron can provide.
By analogy with the previous answer, passing the encoder output directly to the decoder gives a simpler model, and it would work for a simple problem; but if you want to handle a complex problem, you need a more complex model, i.e. passing the encoder output through a linear layer (or a small multilayer perceptron) and then to the decoder. This way you can adjust the complexity of your model. Adding more layers and more neurons increases model complexity, but of course needs more data to train. So it is not compulsory; it depends on the problem you want to model.
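
A sketch of the two alternatives being discussed, with illustrative sizes (variable names mirror the lesson’s, but this is not the full Seq2SeqRNN class):

import torch.nn as nn

em_sz_enc, em_sz_dec, nh, nl = 300, 300, 256, 2  # made-up illustrative sizes

# (a) The lesson's approach: the encoder works in its own hidden size nh, and a
#     linear layer maps its final hidden state into the decoder's embedding size.
gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25)
out_enc = nn.Linear(nh, em_sz_dec, bias=False)
# in forward(): h = out_enc(h)  # applied before handing the state to the decoder

# (b) The question's alternative: make the encoder output em_sz_dec directly,
#     which removes the linear layer but ties the encoder's hidden size to the
#     decoder's embedding size, making the model a little less flexible.
gru_enc_alt = nn.GRU(em_sz_enc, em_sz_dec, num_layers=nl, dropout=0.25)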

In the decoder we used a long tensor; why don’t we use a regular float tensor and round it at the end to get the closest index? Would that give us better results?

I’m not sure, but as far as I know, decimals are simply truncated when stored in a long tensor, and the decoder input is used as an index into the embedding, which needs integer (long) values anyway. It saves us a step compared to using a float tensor and then rounding :wink:
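
A tiny example of why no rounding is needed: the decoder’s raw output is a score per vocabulary word, and taking the argmax already gives integer (long) indices that can feed the embedding at the next step (sizes here are made up):

import torch

vocab_sz, bs = 10, 4
outp = torch.randn(bs, vocab_sz)   # decoder scores for one time step
dec_inp = outp.max(1)[1]           # same as outp.argmax(1)
print(dec_inp.dtype)               # torch.int64, ready for an embedding lookup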

Can we vectorize the decoder’s for loop? If so, what would the code look like? I think the hardest part is how to update the value of dec_inp (the previously translated word) without a loop.

Vectorization only works when the steps are independent of each other, not for an inherently sequential task. In the decoder, each step depends on the previous step’s prediction, so we cannot vectorize the loop.
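
A toy sketch of that sequential dependency (made-up sizes, not the lesson’s decoder): the input at step t is the word predicted at step t-1, so the iterations cannot run in parallel the way the encoder’s single batched call can:

import torch
import torch.nn as nn

vocab_sz, em_sz, nh, bs, max_len = 20, 8, 16, 3, 5
emb_dec = nn.Embedding(vocab_sz, em_sz)
gru_dec = nn.GRU(em_sz, nh)
out_dec = nn.Linear(nh, vocab_sz)

h = torch.zeros(1, bs, nh)                   # decoder state handed over by the encoder
dec_inp = torch.zeros(bs, dtype=torch.long)  # start from the <bos> index (0 here)
for t in range(max_len):
    emb = emb_dec(dec_inp).unsqueeze(0)      # (1, bs, em_sz)
    outp, h = gru_dec(emb, h)
    dec_inp = out_dec(outp[0]).argmax(1)     # depends on the previous iteration's output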

Why is adding bidirectional=True kind of considered cheating? Is it because the attention mechanism is already doing nearly the same thing?

What is discussed in the video is adding bidirectional=True in the decoder, while we know that the decoder should take the previous word to predict the next word, starting from <bos>. Using bidirectional means we would somehow predict the answer from <eos> and move backward. So it’s kind of cheating, right :wink:? How can you know the last word before knowing the first word? It doesn’t seem to make sense, but it may turn out to be a genuinely interesting idea :open_mouth:!!
It may sound weird; we just don’t know yet. It needs some thought, especially on how to loop backward, because it is not as simple as reversing the tensor as we do in the encoder. There is also the problem of reverse padding to think about.

In seq2seq_loss, I understand that we did input = input[:sl] to make both sentences the same length so that the cross-entropy loss function can be applied afterwards. Is there a chance the model predicts a correct translation that is longer than the target sentence? In that case, shouldn’t we pad both to the length of the longest sentence instead of truncating the prediction? Is this correct?

Based on the seq2seq_loss code below:

def seq2seq_loss(input, target):
    sl,bs = target.size()
    sl_in,bs_in,nc = input.size()
    if sl>sl_in: input = F.pad(input, (0,0,0,0,0,sl-sl_in))
    input = input[:sl]
    return F.cross_entropy(input.view(-1,nc), target.view(-1))#, ignore_index=1)

If the input is longer than the target, it is sliced to the length of the target; but the target itself has the length of the longest target in the minibatch, which means the shorter targets have <pad> tokens in them.
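
A tiny worked example of that shape handling (made-up sizes): the prediction is either padded or truncated to the target length, and the target length itself is the longest target in the minibatch:

import torch
import torch.nn.functional as F

sl, bs, nc = 4, 2, 10            # target length, batch size, vocab size
sl_in = 6                        # the model predicted a longer sequence
inp = torch.randn(sl_in, bs, nc)
targ = torch.randint(0, nc, (sl, bs))

if sl > sl_in:                                     # prediction too short: pad it
    inp = F.pad(inp, (0, 0, 0, 0, 0, sl - sl_in))
inp = inp[:sl]                                     # prediction too long: truncate it
loss = F.cross_entropy(inp.view(-1, nc), targ.view(-1))
print(loss)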

I hope it helps. Thanks.

Thanks a lot for your time @WiraDKP, I really appreciate it. :star_struck:
Great explanation. It helped me answer another pile of questions I had. Thank you. :grinning:

Was a solution to this ever found? Currently getting the same error installing on Anaconda.

Not by me, I haven’t explored all the other conversations on the forum though…

May I know what this function is_number is in def get_vecsb(lang)? I have problems loading the en vec binary too.

Probably something that tries to convert to an int/float and returns True if that’s possible, False if it raises an exception. I’m not sure since it has been a while!

def is_number(x):
    # Returns True if x can be converted to an int, False otherwise.
    try:
        _ = int(x)
        return True
    except (ValueError, TypeError):
        return False

Thanks @sgugger

Why is making the decoder bidirectional considered cheating? I can’t figure that out either.

I can’t get anything close to the translated sentences Jeremy had in his notebook. Is there anyone who ran it and attained the same translation quality as Jeremy did?

Same for me.
Please, can someone share with us the 2 files generated after the get_vecs() calls?

  • data/translate/wiki.en.pkl
  • data/translate/wiki.fr.pkl

https://drive.google.com/open?id=1owjxKJCdUkFY1ge6MQPhHYqPcTjK4opW

Thank you a lot, downloading the 3.9 GB zip file.