Lesson 2 - Non-beginner discussion

This is the topic for any non-beginner discussion around lesson 2. It won’t be actively monitored by Jeremy or me tonight, but we will answer outstanding questions in here tomorrow (if needed).

8 Likes

For NLU tasks with BERT models, does it make sense during the fine-tuning phase of transfer learning to update the deeper Transformer layers less than the upper Transformer layers and the additional classification layers added for the downstream task, the same way you did for the image recognition models?
Did you run any experiments on this topic?

2 Likes

@ganesh.bhat asked:

What is the definition of a head and body of a model given that there can be so many layers?

Is head the last layer that gets trained and body the remaining layers?

The ‘body’ is the actual resnet part of the model. The ‘head’ is a special set of layers that fastai adds at the very end of the model, and that is what we train. On a regular model that fastai hasn’t modified, the head is just the last linear layer.
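
To make the split concrete, here is a minimal sketch (assuming an existing fastai `DataLoaders` called `dls`, e.g. the cat/dog one) showing where the body and head live in a fastai learner, and what the “head” is in a plain torchvision resnet:

```python
from fastai.vision.all import *
import torchvision.models as tvm

# Sketch only: assumes `dls` is an existing fastai DataLoaders (e.g. the cat/dog one).
learn = cnn_learner(dls, resnet34, metrics=error_rate)

body = learn.model[0]   # pretrained resnet body (the convolutional layers)
head = learn.model[1]   # new head fastai appends: pooling + batchnorm + linear layers
print(head)

# In a plain (non-fastai) torchvision resnet, the "head" is just the final linear layer:
m = tvm.resnet34(pretrained=True)
print(m.fc)             # Linear(in_features=512, out_features=1000)
```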

2 Likes

@jcatanza asked:

We found out that fine_tune(1) first does a head-only training (body frozen), and then a full-network retrain.
Why is this a good thing to do? Why not only the head? Why first head-only → whole network, and not 1 epoch whole network → 1 epoch head-only?

Jeremy wrote a paper on NLP called Universal Language Model Fine-tuning for Text Classification. While they do not use transformers, they do use discriminative learning rates applied to RNNs.
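
As a rough sketch of what `fine_tune(1)` does under the hood (simplified; the real implementation also halves the base learning rate between phases, and the learning-rate values here are purely illustrative), assuming an existing `learn`:

```python
# Phase 1: body frozen, train only the randomly-initialised head
learn.freeze()
learn.fit_one_cycle(1, 2e-3)

# Phase 2: unfreeze and train the whole network, with discriminative
# learning rates: earlier layer groups get smaller steps than later ones
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-5, 2e-3))
```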

1 Like

For example, the last layer of the ImageNet model classifies the input image into 1000 possible classes. For our cat/dog classifier, we start with the pre-trained ImageNet model, “chop off its head” – i.e. remove the final layer – and replace it with a new final layer (or head) that is a two-way classifier. Note that we retain the body – i.e. the layers before the head – which has the pre-trained weights from ImageNet.
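
In plain PyTorch terms, the head swap looks roughly like this (an illustrative sketch, not the exact fastai code, which adds a richer head):

```python
import torch.nn as nn
import torchvision.models as tvm

# Keep the pretrained body, swap the 1000-way ImageNet classifier
# for a new 2-way (cat/dog) linear layer.
model = tvm.resnet34(pretrained=True)
n_features = model.fc.in_features      # 512 for resnet34
model.fc = nn.Linear(n_features, 2)    # new, randomly-initialised head

# Optionally freeze the body so only the new head trains at first:
for name, p in model.named_parameters():
    if not name.startswith("fc."):
        p.requires_grad = False
```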

3 Likes

Does anyone know of any research where some convolutional layers’ weights were manually initialized (as opposed to randomly initialized via Xavier or Kaiming init)?

Things like Sobel operators, to give the network a head start on learning useful features (e.g. edges) of the natural world.
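
For illustration, a minimal sketch of what such a manual initialization could look like, using Sobel kernels for the first conv layer (the layer shapes here are arbitrary):

```python
import torch
import torch.nn as nn

# Sobel kernels for horizontal and vertical edges
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
sobel_y = sobel_x.t()

# A first conv layer with one output map per Sobel direction
conv1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    # Apply the same kernel to every input (RGB) channel
    conv1.weight[0] = sobel_x.expand(3, 3, 3)
    conv1.weight[1] = sobel_y.expand(3, 3, 3)
```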

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in an encoder-decoder configuration.

You can train this encoder-decoder on unlabeled data, so you don’t need ground truth.

That way the convolution layers learn how to extract meaningful features from the (unlabeled) data so that the decoder can reconstruct the input image.

You can then detach the encoder part, attach it to a dense part (if the model needs it) and train with labeled data starting from there.
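
A minimal sketch of that encoder-decoder setup (the architecture here is arbitrary and just illustrates the wiring):

```python
import torch.nn as nn

# Pretrain on unlabeled images with a reconstruction loss,
# then keep only the encoder for the downstream task.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)

# Train `autoencoder` with e.g. nn.MSELoss()(autoencoder(x), x) on unlabeled data.
# Afterwards, detach the encoder and attach a classification head:
classifier = nn.Sequential(
    encoder,
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),   # assumes a two-class downstream task
)
```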

Makes a lot of sense to me, but it also sounds pretty labor-intensive. :)

3 Likes

Thanks! But today everyone in NLU has moved to Transformer models, and the behavior there could be different. So I am curious specifically about them.

1 Like

GPT was trained with discriminative fine-tuning as well.

2 Likes

Non-beginner question: are there any resources for combining models in production? I.e. I want to label what a person is doing in a picture, combining NLP and CV.

1 Like

Can we visualize fastai models on TensorBoard? https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html

Yes, there is a callback for that.
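
For example (assuming an existing `learn` and that the `tensorboard` package is installed; `log_dir` is just an illustrative path):

```python
from fastai.callback.tensorboard import TensorBoardCallback

# Log training metrics (and optionally a traced model graph) to TensorBoard
learn.fit_one_cycle(3, cbs=TensorBoardCallback(log_dir="runs/pets", trace_model=True))

# Then, from a shell:
#   tensorboard --logdir runs
```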

2 Likes

Another interesting paper on this topic: http://cips-cl.org/static/anthology/CCL-2019/CCL-19-141.pdf

2 Likes

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in a encoder-decoder configuration.

Self-supervised learning like you described seems to be popular in computer vision, and you don’t need an entire encoder-decoder sequence! For example, you can apply 90, 180, and 270 degree rotations to an image, and then train a convnet to classify the correct rotation. This “pretext” training seems to be really helpful for jump-starting a convnet (e.g., RotNet).

Jeremy also had a blog post with lots of great pretext-task examples!
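
A minimal sketch of the rotation pretext task (assumes square images; the convnet with a 4-way head is up to you):

```python
import torch

def rotation_batch(x):
    "x: a batch of square images (N, C, H, W); returns rotated images and rotation labels 0-3."
    labels = torch.randint(0, 4, (x.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(x, labels)])
    return rotated, labels

# Train any convnet with a 4-way head on (rotated, labels) -- no human labels needed.
# Afterwards, reuse its body as a pretrained backbone for the real task.
```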

4 Likes

As we talked about running the first cat-dog example of 01_intro, I noticed that when I trained the model, the second step of fine_tune (here updating the whole model) seems to overfit: the error rate increases and the validation loss increases, while the training loss drops significantly. Is that overfitting an oversight, or expected behavior?

A link to this awesome website seems very appropriate:

2 Likes

I have a requirement to compare the inference metrics (output metrics such as accuracy, F1 score, etc.) of various models on a certain task (e.g. text classification) and pick the best model. How do you do it? Has anybody tried using statistical significance tests for this? Thanks.

Maybe we should build this into fastai at some point, at least the easy incarnation involving rotating the images.

1 Like

Generally I’d imagine you compare their results on a held-out test set. If you want to use a significance test you can; just make sure to do multiple runs with each model if possible (e.g. 3 or 5 times).
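
For example, a sketch using a paired t-test from scipy over per-run accuracies (the numbers are purely illustrative):

```python
from scipy import stats

# Accuracy of each model over several runs (different seeds) on the same held-out test set
acc_model_a = [0.912, 0.908, 0.915, 0.910, 0.913]   # illustrative numbers only
acc_model_b = [0.921, 0.918, 0.925, 0.919, 0.922]

t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)  # paired t-test
print(f"t={t_stat:.3f}, p={p_value:.4f}")
```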

1 Like