Lesson 2 - Non-beginner discussion

For example, the last layer of the ImageNet model classifies the input image into 1000 possible classes. For our cat/dog classifier, we start with the pre-trained ImageNet model, “chop off its head” – i.e. remove the final layer – and replace it with a new final layer (or head) that is a two-way classifier. Note that we retain the body – i.e. the layers before the head – which has the pre-trained weights from ImageNet.

3 Likes

Does anyone know of any research where some convolutional layers’ weights were manually initialized (as opposed to randomly initialized via Xavier or Kaiming init)?

Things like Sobel operators, to give the network a head start on learning useful things (e.g. edges) about the natural world.
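To make the idea concrete, here is a rough NumPy sketch (the function name, filter counts, and sizes are all made up for illustration) of seeding the first filters of a conv layer’s weight tensor with Sobel kernels, leaving the remaining filters randomly initialized:

```python
import numpy as np

# Hypothetical sketch: start from a Kaiming-style random init, then overwrite
# the first two filters with hand-crafted Sobel edge-detector kernels.
def sobel_seeded_weights(out_channels=8, in_channels=1, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32)
    sobel_y = sobel_x.T  # transpose gives the vertical-gradient kernel

    # Kaiming-style random init for all filters...
    fan_in = in_channels * 3 * 3
    w = rng.normal(0.0, np.sqrt(2.0 / fan_in),
                   size=(out_channels, in_channels, 3, 3)).astype(np.float32)

    # ...then seed the first two filters with the hand-crafted kernels.
    w[0, :] = sobel_x
    w[1, :] = sobel_y
    return w

w = sobel_seeded_weights()
```

The rest of the filters stay random, so the network can still learn whatever else it needs; the seeded filters just give it edges “for free” at the start.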

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in an encoder-decoder configuration.

You can train this encoder-decoder on unlabeled data, so you don’t need ground truth.

That way the convolution layers learn how to extract meaningful features from the (unlabeled) data so that the decoder can reconstruct the input image.

You can then detach the encoder part, attach it to a dense part (if the model needs it) and train with labeled data starting from there.

Makes a lot of sense to me, but it also sounds pretty labor intensive :slight_smile:
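For anyone who wants to see the shape of this, here is a minimal PyTorch sketch of the encoder-decoder idea described above (the layer sizes and image dimensions are made up):

```python
import torch
import torch.nn as nn

# Train the conv layers as the encoder of an autoencoder on unlabeled images,
# then detach the encoder and reuse it with a new head for the real task.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
)
autoencoder = nn.Sequential(encoder, decoder)

x = torch.randn(4, 3, 32, 32)            # a batch of "unlabeled" images
recon = autoencoder(x)                   # reconstruction target is x itself
loss = nn.functional.mse_loss(recon, x)  # no ground-truth labels needed

# After the self-supervised training, attach a new (dense) head and
# continue with labeled data:
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))
classifier = nn.Sequential(encoder, head)
logits = classifier(x)
```

The reconstruction loss forces the encoder to learn features that preserve enough information about the input, which is exactly what makes them reusable downstream.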

3 Likes

Thanks! But today everyone in NLU has moved to Transformer models, and the behavior there could be different. So I am curious specifically about them.

1 Like

GPT was trained with discriminative fine-tuning as well.
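For reference, discriminative fine-tuning (as described in the ULMFiT paper) means using lower learning rates for earlier layer groups and higher rates for the head. A rough PyTorch sketch using optimizer parameter groups (the model and the rate ratios here are made up):

```python
import torch
import torch.nn as nn

# Toy model standing in for body/middle/head layer groups.
model = nn.Sequential(
    nn.Linear(10, 10),  # "early" group: generic features, small LR
    nn.Linear(10, 10),  # middle group
    nn.Linear(10, 2),   # head: task-specific, full LR
)

base_lr = 1e-3
opt = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": base_lr / 4},
    {"params": model[1].parameters(), "lr": base_lr / 2},
    {"params": model[2].parameters(), "lr": base_lr},
])

lrs = [g["lr"] for g in opt.param_groups]
```

The intuition is that early layers already encode fairly general features from pre-training, so they should move less than the freshly initialized head.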

2 Likes

Non-beginner question. Are there any resources for combining models in production? I.e. I want to label what a person is doing in a picture by combining NLP/CV.

1 Like

Can we visualize fastai models on TensorBoard? https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html

Yes, there is a callback for that.

2 Likes

Some other interesting paper on this topic: http://cips-cl.org/static/anthology/CCL-2019/CCL-19-141.pdf

2 Likes

One possibility that I read about to jump-start a model in a new domain (provided you have enough data) is using its convolution layers as the encoder part in an encoder-decoder configuration.

Self-supervised learning like you described seems to be popular in computer vision, and you don’t need an entire encoder-decoder sequence! For example, you can apply 90, 180, and 270 degree rotations to an image, and then train a convnet to classify the correct rotation. This “pre-text” training seems to be really helpful for jump-starting a convnet (e.g., RotNet).

Jeremy also had a blog post with lots of great pre-text examples!
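Here is a quick NumPy sketch of the rotation pre-text idea (dummy data, made-up shapes): each unlabeled image becomes four training examples whose label is simply which rotation was applied.

```python
import numpy as np

# RotNet-style pre-text data generation: for each image, emit the 0/90/180/270
# degree rotations with labels 0..3; a convnet is then trained to predict the
# rotation, which requires it to learn useful visual features.
def make_rotation_batch(images):
    xs, ys = [], []
    for img in images:
        for k in range(4):          # k quarter-turns counter-clockwise
            xs.append(np.rot90(img, k=k))
            ys.append(k)
    return np.stack(xs), np.array(ys)

imgs = np.random.rand(2, 32, 32, 3)  # two dummy square HWC images
x, y = make_rotation_batch(imgs)
# x has 4 examples per input image; y cycles through the 4 rotation labels
```

No labels are needed at any point: the “ground truth” is generated by the transformation itself, which is what makes this self-supervised.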

4 Likes

As we talked about running the first cat-dog example of 01_intro, I noticed that when I trained the model, the second stage of fine_tune (here, updating the whole model) seems to overfit: the error rate increases and validation loss increases, while training loss drops significantly. Is that overfitting an oversight, or expected behavior?

A link to this awesome website seems very appropriate:

2 Likes

I have a requirement to compare inference metrics (output metrics such as accuracy, F1 score, etc.) of various models on a certain task (e.g. text classification) and pick the best model. How do you do it? Has anybody tried using statistical significance tests for this? Thanks.

Maybe we should build this into fastai at some point, at least the easy incarnation involving rotating the images.

1 Like

Generally I’d imagine you compare their results on a held-out test set. If you wanted to use a significance test (e.g. a t-test) you could; just make sure to do multiple runs with your models if you can (i.e. 3 or 5 times).
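To illustrate, here is a plain-Python paired t-test over made-up accuracy numbers from five runs of each model (in practice you could use scipy.stats.ttest_rel instead of rolling your own):

```python
import math

# Paired t-test on per-run accuracies: pairs runs of the two models and asks
# whether the mean difference is large relative to its standard error.
def paired_t(a, b):
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    return mean / se  # t statistic; compare to a t-distribution with n-1 dof

# Made-up accuracies over 5 runs each:
model_a = [0.91, 0.92, 0.90, 0.93, 0.91]
model_b = [0.88, 0.89, 0.90, 0.88, 0.89]
t = paired_t(model_a, model_b)
```

A large positive t here suggests model A’s advantage is unlikely to be run-to-run noise; with few runs, though, take the p-value with a grain of salt.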

1 Like

I see that DataLoader subclasses GetAttr. Can you explain it a bit?
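Not a full answer, but the core idea is delegation: GetAttr forwards attribute lookups that fail on the object itself to a wrapped “default” object. A simplified pure-Python sketch of the pattern (class names here are made up; fastcore’s real GetAttr adds filtering of which attributes get forwarded):

```python
# Minimal sketch of the delegation pattern behind fastcore's GetAttr.
class GetAttrSketch:
    _default = "default"  # name of the attribute to delegate to

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so anything the
        # subclass defines itself takes priority over the delegate.
        return getattr(getattr(self, self._default), name)

class Dataset:
    def show(self):
        return "showing an item"

class DataLoaderSketch(GetAttrSketch):
    def __init__(self, dataset):
        self.default = dataset  # lookups fall through to the dataset

dl = DataLoaderSketch(Dataset())
result = dl.show()  # not defined on DataLoaderSketch; forwarded to Dataset
```

This is why a DataLoader can expose methods of the underlying object without redefining them all.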

Can you share the intuition behind observing a metric vs. the loss on the validation set during training? I thought a metric like accuracy is much more volatile, especially if the validation set is small, so choosing checkpoints based on minimizing validation loss seemed like a good idea to me.

One thing to consider: for simplicity, let’s take a classification setting. The value of the cross-entropy loss depends not only on whether the image is classified correctly, but also on the confidence that the model has in the prediction. So your loss can increase if the model is getting more things wrong, OR if the model is becoming less confident about some predictions.

Intuitively, the second thing might not necessarily be bad: if the model was overconfident for some reason earlier, it’s ok if it becomes less confident now (and so the loss increases) as long as the prediction is still correct. If you think in these terms, you see how you might get a loss that’s increasing and an accuracy that is improving.

For example, the model might be learning now how to classify well some data points that it was getting wrong earlier (which would decrease the loss by a certain amount A), and in order to do so it might need to become less confident about other examples that it was already getting right (which would increase the loss by B). If B > A then you will get a net increase in the loss, but also an improved accuracy.
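A tiny numeric illustration of this (the probabilities below are made up, for a binary classifier where p is the probability assigned to the true class): accuracy goes from 2/3 to 3/3 while the average cross-entropy loss increases.

```python
import math

# Average negative log-likelihood given each example's probability on the
# true class (binary setting: p < 0.5 means the example is misclassified).
def avg_nll(probs_of_true_class):
    return sum(-math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

# "Epoch 1": two examples right with very high confidence, one wrong.
before = [0.99, 0.99, 0.40]   # accuracy 2/3
# "Epoch 2": the wrong one is now (barely) right, but the model got
# less confident on the other two.
after = [0.70, 0.70, 0.55]    # accuracy 3/3

loss_before = avg_nll(before)
loss_after = avg_nll(after)
```

Here the gain A from fixing the third example is smaller than the penalty B from the lost confidence on the first two, so the loss rises even though every example is now classified correctly.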

4 Likes

Hey can anyone point me in the right direction with this:
In chapter one of fastbook, there is this statement

The importance of pretrained models is generally not recognized or discussed in most courses, books, or software library features, and is rarely considered in academic papers. As we write this at the start of 2020, things are just starting to change, but it’s likely to take a while.

My interest is in just how things are changing. Are there any papers tackling this that you can point us to, or any interesting ideas you can share with regards to this?

An example of this is the ULMFiT paper :slight_smile: