Loss and error are often synonyms (except if you mean error rate, which is 1 − accuracy). A metric is something like accuracy: it gives a number for how well your model is doing, and it is what you are ultimately trying to optimize for.
A loss function is something that evaluates how badly your model is doing. It needs to satisfy a few requirements (like being smooth), and we will see how it is used to train your model.
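To make the distinction concrete, here is a minimal sketch in plain PyTorch (made-up numbers, nothing from the lesson): accuracy is the step-like quantity you report as a metric, while cross-entropy is the smooth quantity the optimizer can actually follow.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 4 examples and 3 classes, plus true labels.
logits = torch.tensor([[ 2.0, 0.5, -1.0],
                       [ 0.1, 1.5,  0.3],
                       [ 1.2, 0.2,  0.4],
                       [-0.5, 0.0,  2.1]])
targets = torch.tensor([0, 1, 2, 2])

# Metric: accuracy. Easy to interpret, but it is a step function of the weights,
# so its gradient is zero almost everywhere and SGD can't use it directly.
accuracy = (logits.argmax(dim=1) == targets).float().mean()

# Loss: cross-entropy. Less interpretable, but smooth, so a small change in the
# weights gives a small change in the loss that the optimizer can follow.
loss = F.cross_entropy(logits, targets)

print(f"accuracy (metric): {accuracy:.2f}   cross-entropy (loss): {loss:.4f}")
```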
Question from DLenthousiast: We found out that fine_tune(1) first does head-only training (with the body frozen), and then a full-network retrain.
Why is this a good thing to do? Why not train only the head? And why head-only first and then the whole network, rather than one epoch on the whole network followed by one epoch on the head only?
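For reference, here is a simplified sketch of what fine_tune does, based on the frozen-then-unfrozen behaviour described above (this is not the actual fastai source; the real method also picks and scales the learning rates). The usual intuition is that a freshly initialized head would otherwise push large, noisy gradients back into the pretrained body, so you let the head settle first.

```python
# Simplified sketch of Learner.fine_tune(epochs) in fastai -- an approximation,
# not the real implementation (which also handles base_lr and discriminative lrs).
def fine_tune_sketch(learn, epochs, freeze_epochs=1):
    learn.freeze()                      # body frozen: only the random head is trained
    learn.fit_one_cycle(freeze_epochs)  # let the head stop being random
    learn.unfreeze()                    # now every layer can be updated
    learn.fit_one_cycle(epochs)         # full-network retrain
```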
Do people average parameters from the same architecture but trained on k-folded train/test sets? If we did k-fold cross-validation and distributed the computations across a few machines, could we then average the models and end up with something like a bagged deep learning model?
Is it useful to do transfer learning in several steps? For example, start with ImageNet, then a big dataset related to our problem, and finally fine-tune on our dataset of interest?
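As a hypothetical sketch of that staged approach in fastai (`dls_related` and `dls_target` are assumed DataLoaders you would build yourself; swapping the DataLoaders like this only works directly if both stages share the same classes, otherwise the head has to be recreated):

```python
from fastai.vision.all import *

# Stage 1: start from an ImageNet-pretrained body and adapt to a big related dataset.
learn = cnn_learner(dls_related, resnet34, metrics=accuracy)
learn.fine_tune(5)

# Stage 2: swap in the small dataset of interest and fine-tune again.
learn.dls = dls_target
learn.fine_tune(3)
```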
It has a (big) part that has been pretrained, which we call the body of the model. It also has a part that is random and specific to your problem, which we call the head.
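In fastai the split is easy to see: a cnn_learner model is a Sequential of (body, head), where the body holds the pretrained ImageNet weights and the head is freshly initialized for your classes (here `dls` is an assumed DataLoaders for your task).

```python
from fastai.vision.all import *

learn = cnn_learner(dls, resnet18, metrics=error_rate)
body, head = learn.model[0], learn.model[1]
print(body)  # pretrained convolutional stack
print(head)  # new (randomly initialized) pooling + linear layers sized to your classes
```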
Did transfer learning exist before deep learning, e.g. in classical ML? Does anything about the architecture of neural nets make them more suitable for transfer learning?
Here’s what I had in my notes previously, updated with what Jeremy said; I think it covers a couple of people’s questions. In previous lessons he mentioned judging the fit from the training/validation loss, but the update is that the loss isn’t as good a guide as the metrics.
Classical ML can’t use pretrained models. The way deep learning models are built makes transfer learning easier, because they are… well… deep. So the “deep” part can more easily be pretrained.
That’s not easy to do afaik. Due to the random initialization of the network as well as all the other randomness involved (data augmentation, batch composition…), you are not at all guaranteed that the weights will correspond one-to-one between the different instances of the same architecture. So instead of averaging the weights, you can literally use every model you trained in an ensemble, for example with bagging (“majority vote”). I remember reading a few papers using this technique. However, in practice it is very expensive: multiple training runs, and also multiple networks to run inference on.
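Concretely, ensembling means combining predictions rather than weights. A minimal sketch in plain PyTorch, where `models` is an assumed list of the k trained fold models and `x` a batch of inputs:

```python
import torch

def ensemble_predict(models, x):
    # Soft voting: average the predicted probabilities of every fold's model.
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=1) for m in models])  # (k, batch, n_classes)
    return probs.mean(dim=0).argmax(dim=1)  # or take a hard majority vote over per-model argmaxes
```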
Re transfer learning (using results from one task for another): can you actually take an architecture trained in a different domain, say text / language modeling, and use it for another, say image classification / object detection? The intuition being that if you know how objects are related in a text corpus (through language modeling), you could be better at object detection too. Or is this too wild?
Suppose we have some private data (containing customer details), we build a model on this data, and we plan to do transfer learning. Should we consider the learned weights private? Is there a possibility of the data being exposed if weights learned on private data are used in the public domain?
When we look at what is being recognized at these low layers, it is us human beings doing the recognizing. Is it not true that often we won’t know what the machine is actually ‘recognizing’? Aren’t we just cherry-picking the layers that we can recognize?
From what I saw, Captum provides a cool UI which shows which parts of the image are used to classify, say, a cat image as a cat instead of a dog… Would love to integrate it with fastai.
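The UI in question is presumably Captum Insights; the underlying attribution call looks roughly like this sketch (a fastai Learner’s model is a plain PyTorch module, so it can be passed straight to Captum; `learn` and `img_tensor` are assumed to already exist):

```python
import torch
from captum.attr import IntegratedGradients

model = learn.model.eval()
x = img_tensor.unsqueeze(0)                 # (1, 3, H, W): a batch with one image

pred_class = model(x).argmax(dim=1).item()  # the class we want to explain, e.g. "cat"
ig = IntegratedGradients(model)
attributions = ig.attribute(x, target=pred_class)  # per-pixel contribution to that class's score
# Captum Insights (captum.insights) wraps visualizations like this in an interactive web UI.
```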