Lesson 12 (2019) discussion and wiki

Elfayoumi · April 19, 2019, 3:31am

Hello Jeremy
The plot of the Beta distribution in notebook 10b_mixup_label_smoothing does not look right to me.
The Beta distribution density is given by: $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}$

In our case $\alpha = \beta$ and $\mathrm{B}(\alpha,\alpha) = \frac{\Gamma(\alpha)\Gamma(\alpha)}{\Gamma(\alpha+\alpha)} = \frac{\Gamma(\alpha)^2}{\Gamma(2\alpha)}$.
So in the notebook, it should be:

for α,ax in zip([0.1,0.8], axs):
    α = tensor(α)
    y = (x**(α-1) * (1-x)**(α-1)) / (Γ(α)**2 / Γ(2*α))
    ax.plot(x,y)
    ax.set_title(f"α={α:.1}")

so that the distribution is correct, right?
Thanks
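
For anyone who wants to sanity-check the normalisation without the notebook, here is a minimal sketch in plain Python (standard library only, so not the notebook's PyTorch code). The names `symmetric_beta_pdf` and `midpoint_integral` are made up for illustration. It evaluates the Beta(α, α) density through log-gamma and checks numerically that it integrates to roughly 1.

```python
import math


def symmetric_beta_pdf(x: float, alpha: float) -> float:
    """Density of Beta(alpha, alpha) at a point x strictly inside (0, 1)."""
    # log B(alpha, alpha) = 2*lgamma(alpha) - lgamma(2*alpha)
    log_norm = 2.0 * math.lgamma(alpha) - math.lgamma(2.0 * alpha)
    # log of x**(alpha-1) * (1-x)**(alpha-1)
    log_kernel = (alpha - 1.0) * (math.log(x) + math.log1p(-x))
    return math.exp(log_kernel - log_norm)


def midpoint_integral(alpha: float, n: int = 1000) -> float:
    """Midpoint-rule estimate of the integral of the density over (0, 1)."""
    h = 1.0 / n
    return h * sum(symmetric_beta_pdf((i + 0.5) * h, alpha) for i in range(n))


if __name__ == "__main__":
    print(symmetric_beta_pdf(0.5, 1.0))  # alpha = 1 is the uniform case, so about 1.0
    print(symmetric_beta_pdf(0.5, 2.0))  # 6 * 0.5 * 0.5, so about 1.5
    print(midpoint_integral(2.0))        # close to 1.0
```

For α < 1 (such as 0.1 in the notebook) the density blows up at both ends, so a simple midpoint rule converges slowly there; the check is most convincing for α ≥ 1.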

jcatanza · April 19, 2019, 4:04am

Didn’t know about the glossary.

devforfu · April 19, 2019, 6:01am

Yeah, it was only recently created by @deena-b so I believe it will cover more topics soon.

pcuenq · April 19, 2019, 6:10am

It was just 10c_fp16 from yesterday’s class, so IMAGENETTE_160 with just a couple of basic transformations. I did increase the batch size and the number of epochs. I’ll ensure that libjpeg-turbo is in use and will try something heavier next.

piotr.czapla · April 19, 2019, 9:18am

Thank you Sylvain for tagging me.

@wgpubs The SentencePiece tokenizer does not add these special tokens, and neither does spaCy. The tokens are added during preprocessing by the fastai library. We added them in our recent experiments. If you add such tokens, you need to ensure they aren’t broken apart by SentencePiece, so you need to list them as special tokens in the SentencePiece params.
The vocab size should be smaller than with regular tokenization: 25k works well, but models with 15k aren’t much worse. Test a few to see which performs best. You can do that at the LM level, as it is possible to compare the perplexity between 25k and 15k.

You can find an example implementation in n-waves/ulmfit-multilingual; the repo will be merged into fast.ai once I manage to come up with an okay PR.

SentencePiece has BPE tokenization included; you can select it in the params. I like SentencePiece as the API is OK. I haven’t tried BPE.
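
To make the two points about params concrete, here is a rough sketch of how the training options could be assembled. It is not the code from ulmfit-multilingual. The helper names, the corpus path, the model prefix, and the token list (a subset of fastai v1's `xx...` tokens) are assumptions for illustration; `user_defined_symbols`, `vocab_size` and `model_type` are SentencePiece trainer options.

```python
from typing import Dict, Sequence, Union

# Example subset of fastai v1 special tokens (assumption: adjust to your pipeline).
EXAMPLE_SPECIAL_TOKENS = ("xxbos", "xxeos", "xxfld", "xxmaj", "xxup", "xxrep", "xxwrep")


def build_spm_kwargs(
    corpus_path: str,
    model_prefix: str,
    vocab_size: int = 25000,
    model_type: str = "unigram",
    special_tokens: Sequence[str] = EXAMPLE_SPECIAL_TOKENS,
) -> Dict[str, Union[str, int]]:
    """Assemble keyword arguments for the SentencePiece trainer."""
    if model_type not in {"unigram", "bpe", "char", "word"}:
        raise ValueError(f"unsupported model_type: {model_type}")
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": model_type,
        # Symbols listed here are kept as single pieces instead of being split.
        "user_defined_symbols": ",".join(special_tokens),
    }


def train_tokenizer(kwargs: Dict[str, Union[str, int]]) -> None:
    """Run the actual training (needs `pip install sentencepiece` and a real corpus)."""
    import sentencepiece as spm  # third-party, imported lazily

    spm.SentencePieceTrainer.train(**kwargs)


if __name__ == "__main__":
    # Hypothetical file names; training itself is not run in this sketch.
    for size in (15000, 25000):
        print(build_spm_kwargs("wiki_corpus.txt", f"spm_{size}", vocab_size=size, model_type="bpe"))
```

Building one set of options per vocab size makes it easy to train the 15k and 25k tokenizers side by side and compare the resulting language models.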

It is just a playground that is intended to be merged into fast.ai. Currently, it has a SentencePiece implementation and it lets you simply run tests against IMDB and MLDoc. It is still incompatible with the recent fast.ai, but I’m planning to fix that.

piotr.czapla · April 19, 2019, 9:44am

A few more thoughts. We tested models with SentencePiece on MLDoc (9 languages). A model with a 25k vocab performed a bit better on German, and comparably (or a bit worse) on the other languages, including Russian, which was odd.
It is hard to judge the tokenizer, as the ULMFiT classifier performance differs from run to run, and to really test something you need to run the model end to end (~2 days).

The SentencePiece models are a bit faster to train as the vocab is smaller. However, the trick of changing the vocabulary doesn’t work, so if your initial tokenization didn’t have all the characters (for example emojis) that you are going to use in your downstream task, then you probably want to train a full model end to end on (Wikipedia + your corpus).

@mkardas may add more, as he was playing with ULMFiT for hate speech detection.

amitkayal · April 19, 2019, 9:47am

Is there any guideline available on how and what type of augmentation we should use? Should we design the augmentation based on the classes of the images for a classification model?

benjmann · April 19, 2019, 1:15pm

I was referring specifically to https://github.com/glample/fastBPE but the SentencePiece BPE implementation is probably similar.

wgpubs · April 19, 2019, 1:50pm

Thanks for the reply. I haven’t used sentence piece yet but will give it a try.

What are the recommended steps for training ULMFiT from scratch? I remember there being a thread on it, but I haven’t found it, nor am I sure it is current.

jeremy · April 19, 2019, 6:46pm

Many thanks for the fix!

Interogativ · April 19, 2019, 8:20pm

It’s not just that. The 2080 uses the Turing architecture, which has special cores for fp16 processing. The Maxwell and Pascal architectures (GTX 1080 and prior) will do fp16 math, but no faster than their fp32 operations. So to get the speed improvement, you must have the Turing or Volta (really high end) architectures.

Seb · April 19, 2019, 9:53pm

Has anyone tried mixup in image superresolution/decrapification or segmentation or other generative models of this type? At first glance it seems like it wouldn’t make sense, but I’m still curious.

Seb · April 20, 2019, 12:25am

Edited for clarity and brevity

Regarding the Bag of Tricks idea “ResNet-C”, which replaces the initial 7x7 conv with three 3x3 convs:

Jeremy is going from 3 to 32 to 64 to 64 channels with the Xresnet model (57:10)

Wouldn’t a better progression be [3,16,32,64], or [3,8,16,32,64]?

This would be more in line with what we learned in lesson 10 (https://youtu.be/HR0lt1hlR6U?t=4267), where Jeremy explained why people use a 7x7 conv instead of a single 3x3 in the first layer.
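
To put numbers on the channel progressions, here is a small arithmetic sketch. It assumes the lesson 10 argument is about comparing the number of values each output position sees (kernel x kernel x input channels) with the number of output channels; that reading, and the helper name `stem_summary`, are my own. Biases, strides and batchnorm are ignored.

```python
from typing import List, Tuple


def stem_summary(channels: List[int], kernel: int = 3) -> List[Tuple[int, int, int]]:
    """For each conv layer return (inputs per output position, out channels, weights)."""
    rows = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        fan_in = kernel * kernel * c_in  # values seen by one output position
        rows.append((fan_in, c_out, fan_in * c_out))
    return rows


if __name__ == "__main__":
    candidates = {
        "single 7x7 [3,64]": ([3, 64], 7),
        "3x3 [3,32,64,64]": ([3, 32, 64, 64], 3),
        "3x3 [3,16,32,64]": ([3, 16, 32, 64], 3),
        "3x3 [3,8,16,32,64]": ([3, 8, 16, 32, 64], 3),
    }
    for name, (chans, k) in candidates.items():
        rows = stem_summary(chans, k)
        total = sum(weights for _, _, weights in rows)
        print(name, rows, "total weights:", total)
```

With [3,32,64,64] the first layer maps 27 input values to 32 outputs, whereas [3,16,32,64] maps 27 to 16, which is the difference the question is about.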

pcuenq · April 20, 2019, 1:14am

So the 2080 should run faster in fp16 than in fp32, right? Mine isn’t; I’ll do some more testing.

MicPie · April 20, 2019, 5:18am

Yes, on my 2070 I see a significant speed increase when switching to FP16.

You should be able to test that easily: set up a NN without .to_fp16() and the same NN with .to_fp16(), like in the example code from the callbacks.fp16 docs, and compare the total times.
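
A minimal timing harness for that comparison could look like the sketch below. It is library-agnostic: `compare_runtimes` is a hypothetical helper, and the two workloads in the demo are stand-ins so that the sketch runs without a GPU.

```python
import time
from typing import Callable, Dict


def compare_runtimes(candidates: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each named callable."""
    results = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


if __name__ == "__main__":
    # Stand-in workloads; replace them with closures that train the fp32 and
    # the fp16 learner for the same number of epochs on the same data.
    timings = compare_runtimes({
        "fp32 (stand-in)": lambda: sum(i * i for i in range(200_000)),
        "fp16 (stand-in)": lambda: sum(i * i for i in range(100_000)),
    })
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.4f}s")
```

With fastai you would pass something like `lambda: learn_fp32.fit_one_cycle(1)` and `lambda: learn_fp16.fit_one_cycle(1)`, where `learn_fp32` and `learn_fp16` are placeholder names for the learner built without and with `.to_fp16()`.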

piotr.czapla · April 20, 2019, 12:36pm

Check ulmfit-multilingual: it has the tips from Sylvain incorporated, and it has plenty of tools to run transfer learning from different models without needing to specify all the params (as they are saved to .json). For the training I had good results using label smoothing, so you may want to train with that as well. A good piece of advice is to find a small dataset on which pretraining is fast, so you can test different hyper-parameters.
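
For reference, here is one common formulation of label smoothing written in plain Python for a single example. It is a sketch for illustration, not the code used in ulmfit-multilingual; the function names and the default `eps=0.1` are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def label_smoothing_loss(logits: Sequence[float], target: int, eps: float = 0.1) -> float:
    """(1 - eps) * cross-entropy on the true class + eps * mean cross-entropy over all classes."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - eps) * nll + eps * uniform


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]
    print(label_smoothing_loss(confident, target=0, eps=0.0))  # plain cross-entropy
    print(label_smoothing_loss(confident, target=0, eps=0.1))  # larger: over-confidence is penalised
```

With `eps=0` this reduces to ordinary cross-entropy, so it is easy to switch on and off when testing hyper-parameters on a small dataset.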

wgpubs · April 20, 2019, 7:07pm

Good tips. Thanks.

Quick question: Why do you all tokenize with Moses first as part of the wiki-text pre-processing?

Kaspar · April 21, 2019, 5:23pm

i am learning a lot by creating a lib with the core ideas and code from the course with a slightly different design. it gives another angle on the subject thus making the concepts stick better

Kaspar · April 21, 2019, 5:29pm

you could use both

Kaspar · April 21, 2019, 5:38pm

you could apply mixup at any level of the architecture, but doing it in the first layers is probably more efficient
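
As a rough illustration, the mixing step itself does not care which level it is applied to: the two vectors below could be raw inputs or the activations of some hidden layer. This is a plain-Python sketch, with `mixup_pair` and the default `alpha=0.4` chosen for illustration only.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup_pair(
    a: Sequence[float],
    b: Sequence[float],
    alpha: float = 0.4,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], float]:
    """Blend two equally sized vectors with a weight drawn from Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return mixed, lam


if __name__ == "__main__":
    rng = random.Random(0)
    mixed, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], rng=rng)
    print(lam, mixed)
    # The targets are blended with the same weight, for example:
    # target = lam * target_a + (1 - lam) * target_b
```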