Lesson 2 - Non-beginner discussion

This is the topic for any non-beginner discussion around lesson 2. It won’t be actively monitored by Jeremy or me tonight, but we will answer outstanding questions in here tomorrow (if needed).

8 Likes

For NLU tasks with BERT models, does it make sense, during the fine-tuning phase of transfer learning, to update the deeper Transformer layers less than the upper Transformer layers and the additional classification layers added for the downstream task, the same way you did for the image recognition models?
Did you run any experiments on this topic?

2 Likes

@ganesh.bhat asked:

What is the definition of a head and body of a model given that there can be so many layers?

Is the head the last layer that gets trained, and the body the remaining layers?

The ‘body’ is the actual ResNet part of the model. The ‘head’ is a special set of layers that fastai adds at the very end of the model, and that is what we train. On a regular model not modified by fastai, the head is just the last linear layer.
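
If it helps to see it concretely, a fastai cnn_learner model should just be a Sequential with the body at index 0 and the generated head at index 1. A quick sketch, roughly following the 01_intro pets setup:

```python
from fastai.vision.all import *

def is_cat(fname):  # Pets filenames: capitalized first letter = cat
    return fname[0].isupper()

path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)

body = learn.model[0]   # the pretrained ResNet trunk ("body")
head = learn.model[1]   # the layers fastai adds for our task ("head")
print(head)             # pooling -> Flatten -> BatchNorm/Dropout/Linear layers
```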

2 Likes

@jcatanza asked:

We found out that fine_tune(1) first does a head-only training (with the body frozen), and then retrains the full network.
Why is this a good thing to do? Why not train only the head? Why head-only first and then the whole network, rather than 1 epoch on the whole network followed by 1 epoch head-only?

Jeremy wrote a paper on NLP called Universal Language Model Fine-tuning for Text Classification (ULMFiT). While it does not use transformers, it does apply discriminative learning rates to RNNs.
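
For reference, here is roughly what that two-stage schedule looks like if you write it out by hand in fastai (a simplified sketch of what fine_tune does, not the exact source), including discriminative learning rates via a slice so earlier layers get smaller updates:

```python
# Stage 1: body frozen, only the randomly-initialized head trains
learn.freeze()
learn.fit_one_cycle(1, 2e-3)

# Stage 2: unfreeze and train the whole network with discriminative learning rates.
# The earliest layers get the small end of the slice, the head gets the large end.
learn.unfreeze()
learn.fit_one_cycle(1, lr_max=slice(1e-6, 1e-4))
```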

1 Like

For example, the last layer of the ImageNet model classifies the input image into 1000 possible classes. For our cat/dog classifier, we start with the pre-trained ImageNet model, “chop off its head” – i.e. remove the final layer – and replace it with a new final layer (or head) that is a two-way classifier. Note that we retain the body – i.e. the layers before the head – which have the pre-trained weights from ImageNet.
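
In plain PyTorch (outside fastai), the same “chop off the head” idea is just swapping the final layer. A minimal sketch with torchvision's ResNet:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet34(pretrained=True)       # body + ImageNet head (1000 classes)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the head with a 2-way classifier

# Optionally freeze the body so only the new head trains at first
for name, p in model.named_parameters():
    if not name.startswith("fc."):
        p.requires_grad = False
```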

3 Likes

Does anyone know of any research where some convolutional layers’ weights were manually initialized (as opposed to randomly initialized via Xavier or Kaiming init)?

Things like Sobel operators to give the network a head start on learning useful features (e.g. edges) from the natural world.
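
I haven’t seen a standard library helper for this, but hand-initializing a conv filter with a Sobel kernel is straightforward; a sketch of what that manual initialization could look like:

```python
import torch
import torch.nn as nn

# Horizontal and vertical Sobel kernels
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
sobel_y = sobel_x.t()

# A conv layer with 2 output channels (one per Sobel filter) on a 1-channel input
conv = nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv.weight[0, 0] = sobel_x
    conv.weight[1, 0] = sobel_y
# From here you could leave these weights trainable (just a "head start"),
# or freeze them entirely with conv.weight.requires_grad_(False)
```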

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in an encoder-decoder configuration.

You can train this encoder-decoder on unlabeled data, so you don’t need ground truth.

That way the convolution layers learn how to extract meaningful features from the (unlabeled) data so that the decoder can reconstruct the input image.

You can then detach the encoder part, attach it to a dense part (if the model needs it) and train with labeled data starting from there.
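
A minimal PyTorch sketch of that idea (an autoencoder pretrained on unlabeled images, then the encoder reused with a classifier head; the layer sizes here are just placeholders):

```python
import torch.nn as nn

encoder = nn.Sequential(                      # the conv layers we want to pretrain
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

decoder = nn.Sequential(                      # mirror that reconstructs the input image
    nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1))

autoencoder = nn.Sequential(encoder, decoder)
# 1) Train `autoencoder` on unlabeled images with an MSE reconstruction loss.
# 2) Then detach the encoder and attach a classification head:
classifier = nn.Sequential(encoder,
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(64, 2))
# 3) Fine-tune `classifier` on the (smaller) labeled dataset.
```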

Makes a lot of sense to me, but it also sounds pretty labor intensive :)

3 Likes

Thanks! But these days everyone in NLU has moved to Transformer models, and the behavior there could be different. So I am curious specifically about them.

1 Like

GPT was trained with discriminative fine-tuning as well.
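
If you want to try this on a Hugging Face BERT model, one way (a sketch, assuming a BertForSequenceClassification-style model that exposes `.bert.encoder.layer`) is to build optimizer parameter groups whose learning rate decays for deeper layers:

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr, decay = 2e-5, 0.95
layers = list(model.bert.encoder.layer)          # the 12 transformer blocks
groups = [{"params": model.classifier.parameters(), "lr": base_lr}]  # new head: full lr
for i, layer in enumerate(reversed(layers)):     # top block first, decaying downwards
    groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (i + 1)})
groups.append({"params": model.bert.embeddings.parameters(),
               "lr": base_lr * decay ** (len(layers) + 1)})  # embeddings: smallest lr

opt = torch.optim.AdamW(groups)
```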

2 Likes

Non-beginner question: are there any resources for combining models in production? E.g. I want to label what a person is doing in a picture by combining NLP and CV.

1 Like

Can we visualize fastai models on TensorBoard? https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html

Yes, there is a callback for that.
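
If your fastai version includes it (I believe it is TensorBoardCallback in fastai.callback.tensorboard), you can pass it to a fit call; a sketch:

```python
from fastai.callback.tensorboard import TensorBoardCallback

# Log training metrics (and optionally the model graph) to ./runs for TensorBoard
learn.fit_one_cycle(3, cbs=TensorBoardCallback(log_dir="runs/exp1", trace_model=True))
# Then launch TensorBoard from a terminal:  tensorboard --logdir=runs
```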

2 Likes

Another interesting paper on this topic: http://cips-cl.org/static/anthology/CCL-2019/CCL-19-141.pdf

2 Likes

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in an encoder-decoder configuration.

Self-supervised learning like you described seems to be popular in computer vision, and you don’t need an entire encoder-decoder sequence! For example, you can apply 90, 180, and 270 degree rotations to an image, and then train a convnet to classify which rotation was applied. This “pretext” training seems to be really helpful for jump-starting a convnet (e.g., RotNet).
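
A rough sketch of that rotation pretext task in PyTorch (the backbone and training loop here are placeholders): generate each image’s four rotations and train the network to predict which one was applied.

```python
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """Given a batch of images (N, C, H, W), return all 4 rotations and their labels."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(len(images))   # 0, 90, 180, 270 degrees
    return rotated, labels

def pretext_step(convnet, images, opt):
    # `convnet` is any backbone ending in a 4-way classifier head (placeholder)
    x, y = rotation_batch(images)
    loss = F.cross_entropy(convnet(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```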

Jeremy also had a blog post with lots of great pretext examples!

4 Likes

As we talked about running the first cat-dog example of 01_intro, I noticed that when I trained the model, the second step of fine_tune (updating the whole model) seems to overfit: the error rate and validation loss increase while the training loss drops significantly. Is that overfitting an oversight, or expected behavior?

A link to this awesome website seems very appropriate:

2 Likes

I have a requirement to compare inference metrics (output metrics such as accuracy, F1 score, etc.) of various models on a certain task (e.g. text classification) and pick the best model. How do you do it? Has anybody tried using statistical significance tests for this? Thanks.

Maybe we should build this into fastai at some point, at least the easy incarnation involving rotating the images.

1 Like

Generally I’d imagine you compare their results on a held-out test set. If you wanted to use a significance test you could; just make sure to do multiple runs with each model if you can (i.e. 3 or 5 times).
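
For example, with scores from a few runs of each model on the same held-out test set, a paired t-test via scipy is one simple way to check whether the difference is likely noise (the numbers below are made up):

```python
from scipy import stats

# Hypothetical accuracy from 5 runs of each model on the same held-out test set
model_a = [0.912, 0.908, 0.915, 0.910, 0.913]
model_b = [0.905, 0.901, 0.907, 0.903, 0.904]

t, p = stats.ttest_rel(model_a, model_b)   # paired t-test across matched runs
print(f"t = {t:.3f}, p = {p:.4f}")         # small p -> difference unlikely to be noise
```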

1 Like