# Lesson 12 (2019) discussion and wiki

Hello Jeremy
The plot of the Beta distribution in notebook 10b_mixup_label_smoothing does not look right to me.
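The Beta distribution is given by:

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \qquad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

In our case $\alpha = \beta$, so $B(\alpha,\alpha) = \frac{\Gamma(\alpha)\Gamma(\alpha)}{\Gamma(\alpha+\alpha)} = \frac{\Gamma(\alpha)^2}{\Gamma(2\alpha)}$.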
So in the notebook, it should be:

    for α,ax in zip([0.1,0.8], axs):
        α = tensor(α)
        y = (x**(α-1) * (1-x)**(α-1)) / (Γ(α)**2 / Γ(2*α))
        ax.plot(x,y)
        ax.set_title(f"α={α:.1}")

Would that give the correct distribution?
Thanks
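
As a quick sanity check of the normalisation proposed above, here is a minimal standard-library sketch. It is not the notebook's code: `beta_pdf` is a hypothetical helper name, and it works on plain floats rather than tensors.

```python
import math

def beta_pdf(x, alpha, beta=None):
    """Density of Beta(alpha, beta) at x in (0, 1); beta defaults to alpha."""
    if beta is None:
        beta = alpha
    # Normalising constant B(a, b) = Γ(a)Γ(b) / Γ(a + b)
    norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / norm

# Evaluate the two symmetric cases plotted in the post at a few points.
for alpha in (0.1, 0.8):
    print(alpha, [round(beta_pdf(x, alpha), 3) for x in (0.1, 0.5, 0.9)])
```

With α = 1 the density should be flat at 1, and with α = β the curve should be symmetric around 0.5, which makes both cases easy to eyeball.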


Yeah, it was only recently created by @deena-b so I believe it will cover more topics soon.


It was just 10c_fp16 from yesterday’s class, so IMAGENETTE_160 with just a couple of basic transformations. I did increase the batch size and the number of epochs. I’ll ensure that libjpeg-turbo is in use and will try something heavier next.

Thank you Sylvain for tagging me.

@wgpubs The SentencePiece tokenizer does not add these special tokens, and neither does spaCy. The tokens are added during preprocessing by the fastai library. We added them in our recent experiments. If you add such tokens, you need to ensure they aren’t broken apart by SentencePiece, so you need to list them as special tokens in the SentencePiece params.
The vocab size should be smaller than with regular tokenization: 25k works well, but models with 15k aren’t much worse. Test a few to see which performs best. You can do that at the LM level, as it is possible to compare the perplexity between 25k and 15k.
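
A rough sketch of what such a training call could look like with the `sentencepiece` Python package. The token list is an assumption (a subset of the `xx…` tokens fastai's preprocessing adds), the file names are placeholders, and `spm_train_args` / `train_sentencepiece` are hypothetical helper names, not code from the repo mentioned in this thread.

```python
# Assumed subset of fastai's special tokens; adjust to whatever your preprocessing emits.
FASTAI_SPECIAL_TOKENS = ["xxbos", "xxfld", "xxmaj", "xxup", "xxrep", "xxwrep"]

def spm_train_args(corpus_path, model_prefix, vocab_size=25000,
                   special_tokens=FASTAI_SPECIAL_TOKENS, model_type="unigram"):
    """Collect the keyword arguments for SentencePieceTrainer.train."""
    return {
        "input": corpus_path,            # one sentence per line
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,        # 25k as a starting point, 15k to compare
        "model_type": model_type,        # "unigram" or "bpe"
        # Pieces listed here are always kept as single tokens, never split.
        "user_defined_symbols": list(special_tokens),
    }

def train_sentencepiece(corpus_path, model_prefix, **kwargs):
    # Imported lazily so the argument helper works without the package installed.
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**spm_train_args(corpus_path, model_prefix, **kwargs))

if __name__ == "__main__":
    # "wiki.txt" and "sp25k" are placeholder names.
    print(spm_train_args("wiki.txt", "sp25k"))
```

Training one model per candidate vocab size and comparing language-model perplexity, as suggested above, only requires changing `vocab_size` and `model_prefix`.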

You can find an example implementation in n-waves/ulmfit-multilingual; the repo will be merged into fast.ai once I manage to come up with an okay PR.

SentencePiece has BPE tokenization included; you can select it in the params. I like SentencePiece as the API is OK. I haven’t tried BPE.

It is just a playground that is intended to be merged into fast.ai. Currently, it has a SentencePiece implementation and it lets you simply run tests against IMDB and MLDoc. It is still incompatible with the recent fast.ai, but I’m planning to fix that.


A few more thoughts. We tested models with SentencePiece on MLDoc (9 languages); a model with a 25k vocab performed a bit better on German and comparably (or a bit worse) on other languages, including Russian, which was odd.
It is hard to judge the tokenizer, as the ULMFiT classifier performance differs from execution to execution, and to really test something you need to run the model end to end (~2 days).

The SentencePiece models are a bit faster to train as the vocab is smaller. However, the trick with changing the vocabulary doesn’t work, so if your initial tokenization didn’t have all the characters (for example, emojis) that you are going to use in your downstream task, then you probably want to train a full model end to end on (Wikipedia + your corpus).
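
One cheap way to spot this problem before committing to a two-day run is to compare the character sets of the two corpora. This is only a character-level proxy under the assumption that both corpora fit in memory as strings; `missing_characters` is a hypothetical helper, not part of any library.

```python
def missing_characters(pretraining_text, downstream_text):
    """Return non-whitespace characters that occur downstream but never in pretraining."""
    seen = set(pretraining_text)
    return sorted({ch for ch in downstream_text
                   if ch not in seen and not ch.isspace()})

# Toy example: the emoji and the exclamation mark never appear in the pretraining text.
print(missing_characters("hello world", "hello 🙂 world!"))
```

If the returned list contains characters that matter for the task (emojis, for instance), that is a hint to retrain the tokenizer and language model on the combined corpus.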

@mkardas may add more, as he has been playing with ULMFiT for hate speech detection.


Is there any guideline available on how and what type of augmentation we should use? Should we design the augmentation based on the classes of the images for a classification model?

I was referring specifically to https://github.com/glample/fastBPE but sentencepiece bpe implementation is probably similar.

Thanks for the reply. I haven’t used sentence piece yet but will give it a try.

What are the recommended steps for training ULMFiT from scratch? I remember there being some thread on it but haven’t found it, nor am I sure it is current.

Many thanks for the fix!


It’s not just that. The 2080 uses the Turing architecture, which has special cores for fp16 processing. The Maxwell and Pascal architectures (GTX 1080 and prior) will do fp16 math, but no faster than their fp32 operations. So to get the speed improvement, you must have the Turing or Volta (really high-end) architecture.


Has anyone tried mixup in image superresolution/decrapification or segmentation or other generative models of this type? At first glance it seems like it wouldn’t make sense, but I’m still curious.

Edited for clarity and brevity

Regarding the Bag of Tricks idea “ResNet-C”, which replaces the initial 7×7 conv with three 3×3 convs:

Jeremy is going from 3 to 32 to 64 to 64 channels with the Xresnet model (57:10)

Wouldn’t a better progression be [3,16,32,64], or [3,8,16,32,64]?

This would be more in line with what we learned in lesson 10 (https://youtu.be/HR0lt1hlR6U?t=4267), where Jeremy explained why people use a 7×7 conv instead of a single 3×3 in the first layer.
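
To make the comparison concrete, here is a small sketch that lists, for each 3×3 conv in the stem, how many input values one kernel position sees versus how many output channels it produces. It reflects my reading of the lesson 10 argument (a layer with more outputs than inputs is not compressing anything); `stem_fan_in_out` is a hypothetical helper name.

```python
def stem_fan_in_out(channels, kernel_size=3):
    """For each conv in the stem, return (values seen per kernel position, output channels)."""
    return [(c_in * kernel_size ** 2, c_out)
            for c_in, c_out in zip(channels, channels[1:])]

# The progression from the lesson and the two alternatives proposed above.
for channels in ([3, 32, 64, 64], [3, 16, 32, 64], [3, 8, 16, 32, 64]):
    print(channels, stem_fan_in_out(channels))
```

Only the first layer differs in kind: with 3 input channels a 3×3 kernel sees 27 values, so 32 outputs is slightly more than it takes in, while 16 or 8 outputs is fewer.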

So the 2080 should run faster in fp16 than in fp32, right? Mine isn’t; I’ll do some more testing.

Yes, on my 2070 I see a significant speed increase when switching to FP16.

You should be able to test that easily: set up a NN without .to_fp16() and the same NN with .to_fp16(), like in the example code from the callbacks.fp16 docs, and compare the total times.
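
A minimal timing harness for that comparison could look like the sketch below. It is deliberately generic: `best_time` and `speedup` are hypothetical helpers, and the two callables are whatever runs your fp32 and fp16 training (for example, one epoch of fitting on two otherwise identical learners).

```python
import time

def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def speedup(fp32_fn, fp16_fn, repeats=3):
    """Ratio of fp32 time to fp16 time; values above 1 mean fp16 was faster."""
    return best_time(fp32_fn, repeats) / best_time(fp16_fn, repeats)

if __name__ == "__main__":
    # Stand-in workloads so the sketch runs anywhere; swap in real training calls.
    print(speedup(lambda: time.sleep(0.05), lambda: time.sleep(0.005)))
```

Taking the minimum over a few repeats reduces the effect of warm-up and data-loading noise, which matters when the difference between the two runs is small.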


Check ulmfit-multilingual: it has the tips from Sylvain incorporated, and it has plenty of tools to run transfer learning from different models without needing to specify all the params (as they are saved to .json). For the training I had good results using label smoothing, so you may want to train with that as well. Another piece of good advice is to find a small dataset on which pretraining is fast, to test different hyper-parameters.
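
For readers who have not seen label smoothing written out, here is a plain-Python sketch of one common formulation for a single example: the usual negative log-likelihood is blended with the average negative log-probability over all classes. The function names and the default `eps=0.1` are assumptions for illustration, not the repo's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a target smoothed towards the uniform distribution."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]                       # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)     # cross-entropy against uniform
    return (1 - eps) * nll + eps * uniform

print(label_smoothing_ce([2.0, 0.5, -1.0], target=0))
```

Setting `eps=0` recovers ordinary cross-entropy, so the two can be compared directly on the same small dataset.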


Good tips. Thanks.

Quick question: Why do you all tokenize with Moses first as part of the wiki-text pre-processing?

i am learning a lot by creating a lib with the core ideas and code from the course with a slightly different design. it gives another angle on the subject thus making the concepts stick better


you could use both

you could apply mixup at any level of the architecture, but doing it in the first layers is probably more efficient
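
For reference, the core mixup step is just a convex combination with a weight drawn from Beta(α, α). The sketch below applies it to two plain lists; the same interpolation could in principle be applied to the activations of an intermediate layer instead of the raw inputs. `mixup_pair` and the default `alpha=0.4` are assumptions for illustration.

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Blend two (features, one-hot target) pairs with a Beta(alpha, alpha) weight."""
    rng = rng or random
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy examples with one-hot targets for classes 0 and 1.
x, y, lam = mixup_pair([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0])
print(lam, x, y)
```

The blended target stays a valid probability distribution because the two weights sum to 1.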