As my first exercise I decided to imitate the bear classifier but with elephants. I got images from Google for African elephants, Indian elephants, and stuffed toy elephants. I then cleaned the data manually (the image cleaner widget doesn’t work in Colab) and trained for 4 epochs.
The first strange thing I noticed is that the error rate is low by the second epoch but then gets higher again the more I train. Is the learning rate too high? The loss isn’t anywhere near as high as in the examples given in the lesson.
The second strange thing is that the training loss is always quite a bit lower than the validation loss, which from what I gathered means I should train for more epochs, but that increases my loss.
Any help in understanding what is happening and how I can make my model work better is appreciated. I’ve shared links to my data on Google Drive as well as my Colab notebook below.
Your table is typical of over-fitting.
While Jeremy has explained that it can be perfectly desirable for the training loss to be smaller than the validation loss, the important thing to notice here is that your validation accuracy (and loss) gets worse with more epochs while the training loss keeps decreasing (and so would the training accuracy, if you measured it).
So what happens is that the weight updates in your later epochs are so specific to the training examples that they lower the loss there, but in a way that doesn’t transfer to other, unseen examples (like those in your validation set).
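This failure mode is easy to reproduce outside of deep learning. Here is a tiny illustration with synthetic data (pure NumPy, nothing from your actual notebook): giving a model enough capacity to memorise a handful of noisy points drives the training error towards zero, while the error on held-out points stays large — the same pattern as in your table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny noisy dataset: y = sin(x) + noise, split into train and validation halves.
x = rng.uniform(0.0, 3.0, size=20)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
x_train, y_train = x[:10], y[:10]
x_valid, y_valid = x[10:], y[10:]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial (given by coeffs) on points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Higher degree = more capacity, i.e. more room to memorise the training points.
errs = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errs[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_valid, y_valid))
    print(degree, errs[degree])
```

The degree-9 polynomial passes almost exactly through the 10 training points (near-zero training error), yet its validation error is far larger — it has specialised to the training examples instead of learning the underlying curve.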
Things you might want to try to fix this:
- Get more training data – either through more actual labeled examples or through augmentations/transformations of the samples you already have. This is usually very effective but can also be hard to do, especially because the fastai defaults already apply many sensible transformations, and collecting more labeled data is tedious.
- Improve regularization – see the later lectures. Again, this is usually a very good idea, but the fastai defaults are often already well chosen. Still, it’s something you could try, e.g. simply increasing the weight decay (the wd parameter) and seeing if it helps.
- Stop before the over-fitting – either by training for fewer epochs or with a smaller learning rate, as you already mention. The idea here isn’t that your learning rate is so big that you can’t properly navigate the loss landscape and diverge; it’s simply to stop before the weight updates that cause the over-fitting make your model worse.
- Decrease the size of your model (number of layers, or units per layer) – Jeremy advocates against this in a later lecture but keeps it open as a kind of last resort when proper regularization doesn’t help. Since this isn’t straightforward when transfer learning from a ResNet, I don’t think you should really consider it.
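On the third option: this can be automated by watching the validation loss each epoch and keeping the weights from the best one (fastai ships callbacks along these lines, e.g. EarlyStoppingCallback). A minimal sketch of the logic in plain Python — the function name and the example losses are my own, not fastai API:

```python
def early_stop_epoch(valid_losses, patience=1):
    """Return the index of the epoch whose weights you would keep: the last
    epoch before the validation loss stopped improving for `patience` epochs."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # New best validation loss: remember this epoch, reset the counter.
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                break  # validation loss has been getting worse: stop training
    return best_epoch

# Validation losses shaped like the table described above: improves, then degrades.
print(early_stop_epoch([0.90, 0.45, 0.52, 0.61]))  # keeps epoch 1
```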
Great, that’s very helpful. So have I stumbled on one of those rare cases Jeremy mentions where the defaults don’t work? This is surprising because it’s basically exactly what he did for bears, just with elephants. I think after cleaning my final training set was around 300 images, so maybe I should scroll through a few more pages of Google Images. Otherwise I’ll probably pick another problem in the meantime and come back to this one later when I’ve learned more.
Yes, exactly. I’d suspect that the defaults are still a good choice, but that the number of training examples is just too low for how hard it is to distinguish the different kinds of elephants. (Or the concrete examples you have available are not ideal, e.g. because almost all of them are taken from the same angle while some examples in your validation set are from another.)
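On that last point: a random train/validation split usually guards against the two sets differing systematically, and the random 20% hold-out used in the lesson already does this. In plain Python (with made-up filenames standing in for the downloaded images) the idea is just:

```python
import random

# Hypothetical filenames standing in for the ~300 cleaned elephant images.
files = [f"elephant_{i:03}.jpg" for i in range(300)]

random.seed(42)                # reproducible split
random.shuffle(files)          # shuffle first, so the validation set isn't just
split = int(len(files) * 0.8)  # "the last pages of the image search results"
train_files = files[:split]
valid_files = files[split:]
```

If you instead split by download order or by source, the validation images can easily end up systematically different from the training ones, which makes the validation loss look worse than the model really is.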
That said, to me your table indicates that your general setup is working: there is over-fitting, and we can expect that the problem can probably be solved even better, but your model is already much better than random. So while I can’t be 100% sure, I’d suspect that your code is on the right track and more examples will automatically improve performance.
Makes sense. There’s just one thing I don’t understand: if it’s over-fitting, shouldn’t it be doing much better on the training data than on the validation data? Why is the validation loss significantly lower?