Lesson 12 (2019) discussion and wiki

Can you elaborate on this?

Using the default tokenizer, we can tokenize our corpus and have the tokens not seen by the original LM (trained on wiki-103) learned in the process of fine-tuning the LM on our corpus. Why doesn't this work when using SentencePiece?

If I’m understanding you correctly, then for every corpus we’d like to actually fine-tune … if we are using SP, we actually have to concatenate it with wiki-103 and train the whole thing?

If you set character_coverage=1.0, then SentencePiece includes in the vocab every character it has seen. SentencePiece can therefore, in most cases, use the character alphabet to tokenize letter sequences it hasn't seen before. This is why piotr writes "Emojis or equivalent": if you have an English text with, say, Chinese characters that weren't in the corpus, then SentencePiece would emit UNK for those new characters.
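To make the fallback behaviour concrete, here's a toy greedy subword tokenizer (my own sketch, not the real SentencePiece API) showing why full character coverage prevents UNKs for unseen English words, while characters that were never in the training corpus still become UNK:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization with character fallback."""
    pieces, i = [], 0
    while i < len(text):
        # try the longest piece in the vocab starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # character not covered by the vocab at all -> UNK
            pieces.append("<unk>")
            i += 1
    return pieces

# vocab learned from an English corpus with full character coverage:
# a few subword pieces plus every individual character ever seen
vocab = {"hell", "o", "h", "e", "l", "w", "r", "d", " "}

print(tokenize("hello world", vocab))  # unseen word "world" -> characters
print(tokenize("hello 你好", vocab))    # characters never seen -> <unk>
```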

I actually use quite a small vocab for English (4K) and it works fine.


Do you have a formula to compare perplexity across different vocab sizes?
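I'm not sure there is a single standard formula, but one common approach is to normalize the total negative log-likelihood by a tokenization-independent unit, such as the number of whitespace words in the same text, so that models with different vocab sizes become comparable. A sketch with hypothetical numbers:

```python
import math

def word_level_perplexity(token_nll_sum, n_words):
    """Perplexity normalized per *word* rather than per token, so models
    with different tokenizations/vocab sizes can be compared directly.
    token_nll_sum: total negative log-likelihood (nats) summed over all tokens.
    n_words: number of whitespace words in the same text."""
    return math.exp(token_nll_sum / n_words)

# Hypothetical numbers: the same text, two different tokenizers.
# Model A emits 1000 tokens at 4.0 nats/token on average,
# Model B emits 1400 tokens at 3.0 nats/token on average.
n_words = 800
ppl_a = word_level_perplexity(1000 * 4.0, n_words)
ppl_b = word_level_perplexity(1400 * 3.0, n_words)
# B has the lower per-token loss but the higher word-level perplexity
print(ppl_a, ppl_b)
```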

Lesson 12 notebooks annotated with video links are available here


Yeah, me too. I started writing my own custom training loop during Part 1 and am now going to refactor it using the Jupyter exporting approach shown in the lectures :smile:

Later I’ll probably try to port it to S4TF also.

Has anyone else run into "AttributeError: 'NBMasterBar' object has no attribute 'update'", and then, after running "!pip install fastprogress -U", seen "AttributeError: 'NBProgressBar' object has no attribute 'fill'"? (11_train_imagenette.ipynb)
I think it's a version problem, but perhaps you have a different experience.

Yes you need to update fastprogress.


Hello! I'm not sure this topic is related to Lesson 12, so my apologies if I've posted in the wrong forum. I'm exploring "depthwise separable convolution" and whether it can be done through fast.ai. From my brief experiment it's quite clear that depthwise separable convolution is the clear winner from a number-of-parameters perspective, but then why haven't our standard models like ResNet and DenseNet adopted it?
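For a rough sense of the savings, here's a back-of-the-envelope parameter count (bias terms ignored; the channel and kernel sizes are arbitrary examples of my own) comparing a standard conv with a depthwise separable one (depthwise conv followed by a 1x1 pointwise conv):

```python
def conv_params(n_in, n_out, k):
    # standard 2d conv: every output channel mixes all inputs with a k x k kernel
    return n_in * n_out * k * k

def dw_separable_params(n_in, n_out, k):
    depthwise = n_in * k * k   # one k x k filter per input channel
    pointwise = n_in * n_out   # 1x1 conv mixing channels
    return depthwise + pointwise

n_in, n_out, k = 128, 256, 3
std = conv_params(n_in, n_out, k)          # 294912
sep = dw_separable_params(n_in, n_out, k)  # 33920
print(std, sep, std / sep)                 # roughly 8.7x fewer parameters
```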


Thank you Jeremy! That seems to have worked on previous lessons, as I do my training on Kaggle. But for Lesson 12, I am able to install apex correctly; however, "!pip install fastprogress -U" doesn't seem to work. So I'll try it later or locally.

Please Jeremy, I have two questions about NLP:
1. Is it possible to use BERT for machine translation (from scratch)?
2. Is there a fastai tool for facilitating the building of language pairs (Source – Reference) for machine translation?

I’m very interested in exploring them more. But I wanted to get xresnet working first. Now that I’ve done that, you should absolutely try replacing some convs with dw separable convs and see if you can get better results!


I can't seem to find the answer in this thread or the video: why do we need to call reset in the SequentialRNN for the AWD_LSTM, and what does reset do?

def reset(self):
    "Reset the hidden states."
    self.hidden = [(self._one_hidden(l), self._one_hidden(l)) for l in range(self.n_layers)]


class SequentialRNN(nn.Sequential):
    "A sequential module that passes the reset call to its children."
    def reset(self):
        for c in self.children():
            if hasattr(c, 'reset'): c.reset()
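The hidden state of the AWD-LSTM is kept between forward passes so that consecutive batches of a long text are treated as one continuous sequence; reset zeroes it so that stale state from a previous text (or a previous epoch) doesn't leak in. A framework-free toy of that behaviour (my own sketch, not fastai code):

```python
class ToyStatefulRNN:
    def __init__(self):
        self.hidden = 0.0  # stands in for the LSTM's (h, c) tensors

    def forward(self, xs):
        for x in xs:
            self.hidden = 0.5 * self.hidden + x  # toy recurrence
        return self.hidden

    def reset(self):
        self.hidden = 0.0  # forget everything before a new sequence

rnn = ToyStatefulRNN()
a = rnn.forward([1.0, 1.0])  # state carries over between calls...
b = rnn.forward([1.0, 1.0])  # ...so the same input gives a different output
rnn.reset()
c = rnn.forward([1.0, 1.0])  # after reset we get the first result again
print(a, b, c)
```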

How do you use label smoothing in multi-class classification problems? :thinking:
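One common recipe for single-label multi-class problems (a sketch of the standard technique, not a specific fastai API) is to replace the one-hot target with a mixture: with K classes and smoothing eps, the true class gets 1 - eps + eps/K and every other class gets eps/K:

```python
def smoothed_targets(label, n_classes, eps=0.1):
    """Label smoothing for a single-label multi-class problem:
    the true class gets 1 - eps + eps/K, every other class gets eps/K."""
    off = eps / n_classes
    t = [off] * n_classes
    t[label] += 1.0 - eps
    return t

def cross_entropy(log_probs, targets):
    # cross-entropy against the (smoothed) target distribution
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# 4 classes, true class is index 2: off-classes share eps, targets sum to 1
t = smoothed_targets(label=2, n_classes=4, eps=0.1)
print(t)
```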

Excellent! I will certainly try this and share my observations. Thanks a lot, Jeremy!

I have a question re: the LSTM explanation in this lesson.
At 1:47:30 of the edited video, Jeremy says that the result of the addition of the hidden state and input is “split into 4 equal size tensors” and that those 4 tensors will then go through the different paths/gates of the LSTM.

The way I understood LSTMs so far (and also if I understand the formulas correctly - which I might not!!), the entire result would be passed into all gates/paths?! Is there something special here in this case?

[EDIT]: After rereading the lesson notebook, this just seems to be a “speedup” trick for making the calculation more efficient (instead of having 4 weight matrices we use one matrix that is 4x the “original” size and then split up the results). But that means we do not multiply input and hidden state by some weight matrix “the usual way”, but rather in a different way. Correct?
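A tiny framework-free check (toy weights of my own) that the fused matmul-then-split gives exactly the same numbers as four separate gate matmuls, so each gate still sees the full input:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

ni, nh = 3, 2          # input size, hidden size
x = [1.0, 2.0, 3.0]

# four separate gate weight matrices (input, forget, cell, output)
Wi = [[1, 0, 0], [0, 1, 0]]
Wf = [[0, 0, 1], [1, 1, 0]]
Wc = [[2, 0, 0], [0, 0, 2]]
Wo = [[1, 1, 1], [0, 2, 0]]

separate = [matvec(W, x) for W in (Wi, Wf, Wc, Wo)]

# one stacked (4*nh x ni) matrix: a single matmul, then a 4-way split
W_big = Wi + Wf + Wc + Wo
big = matvec(W_big, x)
fused = [big[i * nh:(i + 1) * nh] for i in range(4)]

print(separate == fused)  # True: same numbers, but one big matmul on a GPU
```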


Am I right that by adding more regularization the learning rate can be increased, so the model can be trained faster because it takes bigger steps?


Also, what is the idea behind gradient clipping? I understand how it works, but I don't understand why someone would want to use it. If you initialize the parameters well, the gradient will stay within some range, right?


I would like to ask about the fact that deep convolutional nets tend to learn textures rather than shapes. Honestly, I haven't read a lot of resources on how to solve this issue (other than training on style-transferred images), but is there a study, or has someone tried to train an ensemble-like architecture of a network containing sub-networks (at least two) with different depths (numbers of layers) to get around this problem? Since (I think) the shallow layers could capture the general patterns and therefore increase the ability to recognize shapes in addition to textures.
I would think people avoid using multiple networks collaboratively because of the amount of computation and memory exhaustion, but is that the only reason, or was it found to be ineffective?


The picture in this article explains nicely why you would want to use it: https://hackernoon.com/gradient-clipping-57f04f0adae
(TL;DR: so you don't jump off a cliff too far in the wrong direction in the optimization landscape while looking for a global minimum.)
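Gradient clipping by global norm can be sketched in a few lines (my own minimal version, not a specific library API): whenever the gradient's L2 norm exceeds a threshold, rescale it, preserving direction but limiting step size:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.
    Direction is preserved; only the step size is limited."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

# An exploding gradient (e.g. from a steep "cliff" in an RNN's loss surface)
print(clip_by_global_norm([30.0, 40.0], max_norm=5.0))  # rescaled so the norm is 5
print(clip_by_global_norm([0.3, 0.4], max_norm=5.0))    # small gradient: unchanged
```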

I guess initialization for RNNs is not as straightforward as for a normal feedforward NN.
(But maybe someone can clarify this point in detail?)


That's something I never thought about. When Jeremy explained this, I thought it was the same for every type of network. Does it work only with feedforward networks, or also with others like convnets?