Lesson 12 (2019) discussion and wiki

Can you elaborate on this?

Using the default tokenizer, we can tokenize our corpus and have the tokens not seen by the original LM (trained on wiki-103) learned in the process of fine-tuning the LM on our corpus. Why doesn't this work when using SentencePiece?

If I’m understanding you correctly, then for every corpus we’d like to actually fine-tune … if we are using SP, we actually have to concatenate it with wiki-103 and train the whole thing?

If you set character_coverage=1.0, then SentencePiece includes in the vocab every character it has seen. SentencePiece can therefore, in most cases, use the character alphabet to tokenize letter sequences it hasn't seen before. This is why piotr writes "Emojis or equivalent": if you have an English text with, say, Chinese characters that weren't in the corpus, then SentencePiece would emit UNK for those new characters.
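To make the fallback behaviour concrete, here's a toy greedy subword tokenizer (my own sketch, not the real SentencePiece API) showing why full character coverage prevents UNKs for unseen English words, while characters that were never in the training corpus still become UNK:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization with character fallback."""
    pieces, i = [], 0
    while i < len(text):
        # try the longest piece in the vocab starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # character not covered by the vocab at all -> UNK
            pieces.append("<unk>")
            i += 1
    return pieces

# vocab learned from an English corpus with full character coverage:
# a few subword pieces plus every individual character ever seen
vocab = {"hell", "o", "h", "e", "l", "w", "r", "d", " "}

print(tokenize("hello world", vocab))  # unseen word "world" -> characters
print(tokenize("hello 你好", vocab))    # characters never seen -> <unk>
```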

I actually use quite a small vocab for English (4K) and it works fine.


Do you have a formula to compare perplexity across different vocab sizes?
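I'm not sure there is a single standard formula, but one common approach is to normalize the total negative log-likelihood by a tokenization-independent unit, such as the number of whitespace words in the same text, so that models with different vocab sizes become comparable. A sketch with hypothetical numbers:

```python
import math

def word_level_perplexity(token_nll_sum, n_words):
    """Perplexity normalized per *word* rather than per token, so models
    with different tokenizations/vocab sizes can be compared directly.
    token_nll_sum: total negative log-likelihood (nats) summed over all tokens.
    n_words: number of whitespace words in the same text."""
    return math.exp(token_nll_sum / n_words)

# Hypothetical numbers: the same text, two different tokenizers.
# Model A emits 1000 tokens at 4.0 nats/token on average,
# Model B emits 1400 tokens at 3.0 nats/token on average.
n_words = 800
ppl_a = word_level_perplexity(1000 * 4.0, n_words)
ppl_b = word_level_perplexity(1400 * 3.0, n_words)
# B has the lower per-token loss but the higher word-level perplexity
print(ppl_a, ppl_b)
```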

Lesson 12 notebooks annotated with video links are available here


Yeah, me too. I started writing my own custom training loop during Part 1 and am now going to refactor it using the Jupyter exporting approach shown in the lectures :smile:

Later I’ll probably try to port it to S4TF also.

Has anyone else run into "AttributeError: 'NBMasterBar' object has no attribute 'update'", and then, after running "!pip install fastprogress -U", seen "AttributeError: 'NBProgressBar' object has no attribute 'fill'"? (11_train_imagenette.ipynb)
I think it's a version problem, but perhaps you have a different experience.

Yes you need to update fastprogress.


Hello! I'm not sure this topic is related to Lesson 12, so my apologies if I've posted in the wrong forum. I'm exploring "depthwise separable convolution" and whether it can be done through fast.ai. From my brief experiment it's quite clear that depthwise separable convolution is the clear winner from a number-of-parameters perspective, but then why haven't our standard models like ResNet and DenseNet adopted it?
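For a rough sense of the savings, here's a back-of-the-envelope parameter count (bias terms ignored; the channel and kernel sizes are arbitrary examples of my own) comparing a standard conv with a depthwise separable one (depthwise conv followed by a 1x1 pointwise conv):

```python
def conv_params(n_in, n_out, k):
    # standard 2d conv: every output channel mixes all inputs with a k x k kernel
    return n_in * n_out * k * k

def dw_separable_params(n_in, n_out, k):
    depthwise = n_in * k * k   # one k x k filter per input channel
    pointwise = n_in * n_out   # 1x1 conv mixing channels
    return depthwise + pointwise

n_in, n_out, k = 128, 256, 3
std = conv_params(n_in, n_out, k)          # 294912
sep = dw_separable_params(n_in, n_out, k)  # 33920
print(std, sep, std / sep)                 # roughly 8.7x fewer parameters
```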


Thank you Jeremy! That seems to have worked on previous lessons, as I do my training on Kaggle. But for Lesson 12, I am able to install apex correctly; however, "!pip install fastprogress -U" doesn't seem to work. So I'll try it later or locally.

Please Jeremy, I have two questions about NLP:
1. Is it possible to use BERT for machine translation (from scratch)?
2. Is there a fastai tool for facilitating the building of language pairs (Source – Reference) for machine translation?

I’m very interested in exploring them more. But I wanted to get xresnet working first. Now that I’ve done that, you should absolutely try replacing some convs with dw separable convs and see if you can get better results!


I can't seem to find the answer in this thread or the video: why do we need to call reset in the SequentialRNN for the AWD_LSTM, and what does reset do?

def reset(self):
    "Reset the hidden states."
    self.hidden = [(self._one_hidden(l), self._one_hidden(l)) for l in range(self.n_layers)]


class SequentialRNN(nn.Sequential):
    "A sequential module that passes the reset call to its children."
    def reset(self):
        for c in self.children():
            if hasattr(c, 'reset'): c.reset()
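The hidden state of the AWD-LSTM is kept between forward passes so that consecutive batches of a long text are treated as one continuous sequence; reset zeroes it so that stale state from a previous text (or a previous epoch) doesn't leak in. A framework-free toy of that behaviour (my own sketch, not fastai code):

```python
class ToyStatefulRNN:
    def __init__(self):
        self.hidden = 0.0  # stands in for the LSTM's (h, c) tensors

    def forward(self, xs):
        for x in xs:
            self.hidden = 0.5 * self.hidden + x  # toy recurrence
        return self.hidden

    def reset(self):
        self.hidden = 0.0  # forget everything before a new sequence

rnn = ToyStatefulRNN()
a = rnn.forward([1.0, 1.0])  # state carries over between calls...
b = rnn.forward([1.0, 1.0])  # ...so the same input gives a different output
rnn.reset()
c = rnn.forward([1.0, 1.0])  # after reset we get the first result again
print(a, b, c)
```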

How do you use label smoothing in multi-class classification problems? :thinking:
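One common recipe for single-label multi-class problems (a sketch of the standard technique, not a specific fastai API) is to replace the one-hot target with a mixture: with K classes and smoothing eps, the true class gets 1 - eps + eps/K and every other class gets eps/K:

```python
def smoothed_targets(label, n_classes, eps=0.1):
    """Label smoothing for a single-label multi-class problem:
    the true class gets 1 - eps + eps/K, every other class gets eps/K."""
    off = eps / n_classes
    t = [off] * n_classes
    t[label] += 1.0 - eps
    return t

def cross_entropy(log_probs, targets):
    # cross-entropy against the (smoothed) target distribution
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# 4 classes, true class is index 2: off-classes share eps, targets sum to 1
t = smoothed_targets(label=2, n_classes=4, eps=0.1)
print(t)
```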

Excellent! I will certainly try this and share my observations. Thanks a lot, Jeremy!

I have a question re: the LSTM explanation in this lesson.
At 1:47:30 of the edited video, Jeremy says that the result of the addition of the hidden state and input is “split into 4 equal size tensors” and that those 4 tensors will then go through the different paths/gates of the LSTM.

The way I understood LSTMs so far (and also if I understand the formulas correctly - which I might not!!), the entire result would be passed into all gates/paths?! Is there something special here in this case?

[EDIT]: After rereading the lesson notebook, this just seems to be a “speedup” trick for making the calculation more efficient (instead of having 4 weight matrices we use one matrix that is 4x the “original” size and then split up the results). But that means we do not multiply input and hidden state by some weight matrix “the usual way”, but rather in a different way. Correct?
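A tiny framework-free check (toy weights of my own) that the fused matmul-then-split gives exactly the same numbers as four separate gate matmuls, so each gate still sees the full input:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

ni, nh = 3, 2          # input size, hidden size
x = [1.0, 2.0, 3.0]

# four separate gate weight matrices (input, forget, cell, output)
Wi = [[1, 0, 0], [0, 1, 0]]
Wf = [[0, 0, 1], [1, 1, 0]]
Wc = [[2, 0, 0], [0, 0, 2]]
Wo = [[1, 1, 1], [0, 2, 0]]

separate = [matvec(W, x) for W in (Wi, Wf, Wc, Wo)]

# one stacked (4*nh x ni) matrix: a single matmul, then a 4-way split
W_big = Wi + Wf + Wc + Wo
big = matvec(W_big, x)
fused = [big[i * nh:(i + 1) * nh] for i in range(4)]

print(separate == fused)  # True: same numbers, but one big matmul on a GPU
```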


Am I right that by adding more regularization the learning rate can be increased, so the model can be trained faster because it takes bigger steps?


Also, what is the idea behind gradient clipping? I understand how it works, but I don't understand why someone would want to use it. If you initialize the parameters well, the gradient will stay within some range, right?


I would like to ask about the fact that deep convolutional nets tend to learn textures rather than shapes. Honestly, I haven't read a lot of resources on how to solve this issue (other than training on style-transferred images), but is there a study, or has someone tried to train an ensemble-like architecture of a network containing sub-networks (at least two) with different depths (numbers of layers) to get around this problem? Since (I think) the shallow layers could capture the general patterns and therefore increase the ability to recognize shapes in addition to textures.
I would think people avoid using multiple networks collaboratively because of the amount of computation and memory exhaustion, but is that the only reason, or was it found to be ineffective?


The picture in this article explains nicely why you would want to use it: https://hackernoon.com/gradient-clipping-57f04f0adae
(TL;DR: so you don't jump off a cliff too far in the wrong direction in the optimization landscape while looking for a global minimum.)
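Gradient clipping by global norm can be sketched in a few lines (my own minimal version, not a specific library API): whenever the gradient's L2 norm exceeds a threshold, rescale it, preserving direction but limiting step size:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.
    Direction is preserved; only the step size is limited."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

# An exploding gradient (e.g. from a steep "cliff" in an RNN's loss surface)
print(clip_by_global_norm([30.0, 40.0], max_norm=5.0))  # rescaled so the norm is 5
print(clip_by_global_norm([0.3, 0.4], max_norm=5.0))    # small gradient: unchanged
```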

I guess initialization for RNNs is not as straightforward as for a normal feedforward NN.
(But maybe someone can clarify this point in detail?)


That's something I never thought about. When Jeremy explained this, I thought it was the same for every type of network. Does it work only with feedforward networks, or also with others like convnets?