Lesson 2 - Non-beginner discussion

An example of this is the ULMFiT paper :slight_smile:

But if I am correct, ULMFit is a 2018 paper

Just because it’s old (three years) doesn’t mean it doesn’t work any more :wink: ULMFiT was the start of using transfer learning for text data. MultiFiT, which came out in the last year or so, uses this approach for multilingual problems.

I didn’t mean to say that it doesn’t work, what I am trying to say is that, from the book:

As we write this at the start of 2020, things are just starting to change, but it’s likely to take a while.

It says that things are just starting to change in 2020, so I wanted to be shown the things that are happening now in that area.

I asked this question yesterday and @sgugger replied, but I wanted to follow up here with some clarifications. Basically, I would like to get an intuition about transfer learning and its relationship with neural nets / DL. As I understand it, the concept of transfer learning predates DL. The way @jeremy explained it yesterday, a base ResNet (trained for a different task) is used to improve a book/pets detector, which can be argued for classical ML classifiers too. So is there something inherent in the architecture of neural networks that makes them more efficient for transfer learning, compared to, say, Random Forests / GBTs? @jeremy @sgugger Thanks!

I am trying to build a databunch/dataloader from already pre-processed data (stored as numpy arrays x_train, x_test, y_train, y_test). However, I am not sure how to do that, as fastai2 expects a ‘path’ (of image file names) as input… Is there any way I can feed this pre-processed data directly to a learner?

Yes, you can totally do that. Look at the Data block API (https://docs.fast.ai/data_block.html). Jeremy gave an overview at the end of last lesson.

That refers to fastai v1; I need it for v2. Moreover, going through the document, I didn’t understand how to pass a numpy array as input to create a dataloader.

In chapter 7, the progressive resizing section says to be careful using this technique on pre-trained models: if the transfer learning dataset is similar to the original dataset in terms of images and sizes, the weights will not change much, so training on smaller images might damage the already-learned pre-trained weights.

Is this what Jeremy referred to as “catastrophic forgetting in transfer learning” in lesson 2?

Have a look here:

The idea with the synthetic Gabor filters seems to be super useful (see the last picture in the blog post).

Be sure to check the publication, as they investigated other interesting concepts too.

The docs are fastai v1, but the same functionality is present in v2 (although slightly different). Look here:

(minute 1:23:37)

You just need to create your own get_items (so remove get_image_files and insert your own).

The key here is convolutions: these are basically filters applied to an image (think of an edge detector filter, a corner detector filter, etc.). These filters are extremely useful in any kind of vision task. On the other hand, linear layers are not so “general” at all; remember that when doing transfer learning we just throw all the linear layers away.
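
To make the “filter” idea concrete, here is a minimal sketch in plain Python (no libraries). The `conv2d` helper, the Sobel-style kernel, and the toy image are all made up for illustration; a real network learns its kernel values rather than having them written by hand.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    This is the cross-correlation that deep learning libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hand-written vertical-edge detector (Sobel-style); illustrative only.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

# Toy 4x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = conv2d(image, VERTICAL_EDGE)
for row in response:
    print(row)
```

The response is non-zero only where the window straddles the dark-to-bright boundary, and that would be true for any image with a vertical edge, which is roughly why such filters carry over between vision tasks.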

For example, there is not a very good transfer learning approach for tabular learning yet. Sure, you can transfer some embeddings, but that’s it…

As for NLP, I’m not the best one to answer because of my limited experience with the topic. I know we have very good transfer learning in NLP, but I’m not sure how it works. Maybe someone more experienced can clarify it here; I would love to know. Transferring word embeddings is a good start, but I don’t know how the rest of it works.

I haven’t seen examples of using transfer learning for the “classic” algorithms like Random Forest. One of the biggest benefits of deep learning is that it learns how to do the feature extraction that we used to have to do manually. The feature extraction process tends to have some strong crossover (e.g. the convolution layers Jeremy talked about in the last lecture) between different computer vision tasks. If you think about the input to a random forest, it’s typically a little bit more processed than what goes into a deep learning model. My intuition is that most of the value of transfer learning comes from the feature extraction / representation learning, and the downstream classification / regression layers are better learned from scratch.

I have seen some really great examples of using a neural network to do the feature extraction / representation learning, and then using those features as part of a more classic algorithm (https://arxiv.org/abs/1604.06737), but I think it’s much harder to integrate the two, and you’d need a pretty compelling reason not to just do it all in one neural network that can be trained end to end.

@dcooper01 Daniel, there’s a reason for that: a Random Forest is not amenable to transfer learning because it is an ensemble of trees that is specific to the data set used to generate it. For that reason, the weights of the trees cannot be transferred to another RF model.

Reposting this here, since it might be a bit much for the beginner discussion:

I wrote a Jupyter notebook, “How can we determine a p-value for an experimental result?”, in which I carry out the experiment Jeremy suggested in Lesson 2 in relation to the discussion of Figure 1(a) of the paper “High Temperature and High Humidity Reduce the Transmission of COVID-19”.

The result may surprise you!


https://forums.fast.ai/t/lesson-2-official-topic/66294/332?u=jcatanza

Follow up:

(1) For some of you, the result of p = 1.e-5 from the notebook might not square with your intuition. That’s OK – as a scientist, it’s your job to be skeptical. Be that as it may, the utility of statistics such as the p-value is that they go beyond our intuition, which is not always well-informed.

The p-value is what it is. There are two possibilities: either I’ve calculated it correctly or not. If you find an error in my notebook, please let me know!

The extremely small p-value tells us that the slope measured in the actual data is significant, i.e. that the trend of R decreasing with Temperature is likely to be real. This conclusion is credible because a trend of decreasing R with increasing T has been observed for other viruses.

(2) In the video, Jeremy generated 100 null-hypothesis data sets, with 100 null-hypothesis slopes to compare with the actual slope from Figure 1(a). As you can see in the attached histogram, the slopes from the Monte Carlo ensemble of null-hypothesis data sets fall into a broad Gaussian distribution. A hundred values drawn from this distribution are not enough to tell us whether or not we should reject the null hypothesis, as we’ll see below.

(3) Jeremy and I are using different distribution parameters for R and T:

Jeremy’s: Temperature: mean, std = (5.0, 5.0); R: mean, std = (1.9, 0.5)
Mine: Temperature: mean, std = (7.0, 7.5); R: mean, std = (1.75, 0.35)

As far as I can tell, both are consistent with the original data in Figure 1(a).

(4) So I repeated the Monte-Carlo simulations using Jeremy’s values for the R and T distributions, and to understand the uncertainty in the p-value estimate, I repeated each simulation 5 times:

  • With an ensemble of 100 null-hypothesis data sets, p-values are [0.01, 0.0, 0.03, 0.02, 0.0]
  • With an ensemble of 100,000 null-hypothesis data sets, p-values are [0.012, 0.012, 0.012, 0.012, 0.012]

From this exercise, we conclude that

  • The p-value is sensitive to the assumed distributions of R and T!
  • 100 synthetic data sets are not enough to quantify the p-value to two decimal places, so we cannot say whether it is below or above the 0.01 threshold.
  • 100,000 synthetic data sets nail the p-value to 3 decimal places; it is (barely) above the 0.01 threshold, so we cannot reject the null hypothesis.
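
For anyone who wants to reproduce the flavor of this without opening the notebook, here is a rough sketch in plain Python. It is not the notebook’s code. The distribution parameters are Jeremy’s values quoted above; everything else is an assumption made for illustration: 100 points per data set, an observed slope of -0.023, independent Gaussian draws for T and R, and a one-sided test. The last function is the standard binomial error of a Monte Carlo fraction, sqrt(p(1-p)/N). For p ≈ 0.012 it gives about 0.011 with N = 100 and about 0.0003 with N = 100,000, which is why the small ensemble cannot resolve the second decimal place.

```python
import math
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def null_slopes(n_sets, n_points, t_mean, t_std, r_mean, r_std, rng):
    """Slopes of synthetic data sets in which R is drawn independently of T."""
    slopes = []
    for _ in range(n_sets):
        ts = [rng.gauss(t_mean, t_std) for _ in range(n_points)]
        rs = [rng.gauss(r_mean, r_std) for _ in range(n_points)]
        slopes.append(slope(ts, rs))
    return slopes

def p_value(slopes, observed):
    """One-sided p-value: fraction of null slopes at least as negative as observed."""
    return sum(s <= observed for s in slopes) / len(slopes)

def p_value_std_error(p, n_sets):
    """Binomial standard error of a Monte Carlo p-value estimate."""
    return math.sqrt(p * (1 - p) / n_sets)

rng = random.Random(0)
N_POINTS = 100            # assumption: number of points per data set
OBSERVED_SLOPE = -0.023   # hypothetical stand-in for the slope in Figure 1(a)

# Small ensembles keep this quick; raise n_sets to see the estimates settle.
for n_sets in (100, 1000):
    estimates = []
    for _ in range(5):
        slopes = null_slopes(n_sets, N_POINTS, 5.0, 5.0, 1.9, 0.5, rng)
        estimates.append(p_value(slopes, OBSERVED_SLOPE))
    print(n_sets, estimates)

print(p_value_std_error(0.012, 100), p_value_std_error(0.012, 100_000))
```

Pure Python gets slow at 100,000 data sets, so a vectorized version would be the practical choice for the full experiment.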

Hi everybody,

Has the get_transforms() function been totally removed? That was handy to add a set of default transformations.
Is there any other way to add default transforms?

It’s aug_transforms() in fastai2.

Oh okay. Thank you :slight_smile:

About catastrophic forgetting.

I was recently (well, still currently) learning about optimization of neural networks in federated learning.

In federated learning, the training is done on (usually) mobile devices that hold their own private training data, and then the results are combined to form a global model. The issue I was investigating was how the non-IID nature of the data (for example, many phones have very different images) affects the learning process.

So the problem is: when learning on one device that has a lot of pictures of birds and on another device that has a lot of images of buildings, how can the models be combined so that the result actually learns to recognize both birds and buildings?

Here’s the paper I’m referring to:

So they use something called Elastic Weight Consolidation (EWC). The basic problem that EWC is trying to solve is how to learn tasks A and B sequentially using a single model. The idea is to recognize the weights that are important for task A and then when learning task B add a penalty to the loss function for modifying those weights.
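
As I understand it, the penalty is usually written as a quadratic term, (λ/2) · Σ F_i (θ_i − θ*_A,i)², where θ*_A are the weights after task A and F_i is a per-weight importance (typically a diagonal Fisher information estimate). A minimal sketch in plain Python, with made-up function names and toy numbers, and not taken from the paper’s code:

```python
def ewc_penalty(params, task_a_params, importance, lam):
    """Quadratic cost for moving weights away from their task-A values.

    importance[i] says how much task A depends on weight i
    (typically a diagonal Fisher information estimate).
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, task_a_params, importance)
    )

def ewc_loss(task_b_loss, params, task_a_params, importance, lam):
    """Task-B loss plus the EWC penalty."""
    return task_b_loss + ewc_penalty(params, task_a_params, importance, lam)

# Toy numbers: weight 0 matters a lot for task A, weight 1 barely matters.
task_a_params = [0.0, 2.0]
importance = [4.0, 0.1]

move_important = ewc_penalty([1.0, 2.0], task_a_params, importance, lam=0.5)
move_unimportant = ewc_penalty([0.0, 3.0], task_a_params, importance, lam=0.5)
print(move_important, move_unimportant)
```

Moving the important weight by one unit costs far more than moving the unimportant one by the same amount, so training on task B is pushed toward the weights that task A does not rely on.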

For some reason I did not see the connection to transfer learning at all until Jeremy explicitly pointed it out during Lesson 2.

So that got me wondering: would it be possible to apply EWC to transfer learning?

In what kind of scenarios does the “catastrophic forgetting” happen in transfer learning?

I’m having trouble coming up with anything that would not actually require replacing the output layer (and thus throwing away the ability to detect ImageNet after learning cats and dogs)…