How to pick the right pretrained model

How do you pick which pretrained model to use for each individual application? For example, if someone were to build hotdog-or-not-hotdog with 1,000 training images, which pretrained model would you pick, whereas if you had 10,000 images would you pick a different model?

Are bigger models better than smaller models? Or do larger pretrained models only work better when you have more data?

edit: bigger as in more complex and more layers

3 Likes

I am not an expert in this. But here’s my understanding.

You always want to go for the smallest model that works well for your data. Up until earlier this year, people usually started with VGG16 or VGG19, but Resnet is also a great choice for fine tuning. Start with Resnet18, then move to Resnet34 and Resnet50. You could also try the newer models like ResNext or Nascent nets.

Bigger models are not always better. They often overfit your training data. What you really care about is validation loss. Keeping it small, yet close to the training loss, is tricky and may need some regularization like dropout or weight decay. FastAI provides easy access to all of them.
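The “keep validation loss close to training loss” idea can be sketched in plain Python, with no real training. The loss curves and the `tolerance` threshold below are made up purely for illustration: a validation loss that drifts away from the training loss is the classic sign of overfitting.

```python
def overfitting_gap(train_losses, valid_losses, tolerance=0.1):
    """Return the first epoch (0-indexed) where validation loss exceeds
    training loss by more than `tolerance`, or None if no clear gap opens.
    The `tolerance` value is an arbitrary illustrative choice."""
    for epoch, (tr, va) in enumerate(zip(train_losses, valid_losses)):
        if va - tr > tolerance:
            return epoch
    return None

# Hypothetical curves: training keeps improving while validation stalls.
train = [0.9, 0.6, 0.4, 0.25, 0.15]
valid = [0.95, 0.65, 0.5, 0.45, 0.44]
print(overfitting_gap(train, valid))  # prints 3: the gap opens at epoch 3
```

In practice you would watch this gap across epochs and reach for regularization (dropout, weight decay) or a smaller model when it keeps widening.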

Whether you need more data or not depends on the task. If you are trying to identify Hotdog or not-hotdog you may not need lots of data. But if you are trying to read Street Signs and Logo from your pictures, you may need more data.

tldr: When working on a new problem, always start with simple networks - Resnet18 / 34 or 50 are good choices.

12 Likes

This is sort of looking at things from really high up but I hope it’s useful:

One thing we really worry about when picking a big/complex model is the risk of overfitting. A big or overly complex model can give too much importance to small, subtle changes in features. And so when it sees new data, it will “overreact” and perform poorly, compared to a more general model.

Usually, the “simple-but-good-enough” model we choose has similar train/test loss (while hopefully minimizing test loss), which means it focused on the features just as much as it should have.

Side note about needing more data for assessing accuracy on more complex tasks:
a very rough rule is that you would like to have at least 20 (hopefully more) of each thing you try to classify, for each result (success/fail).

So say we are doing the hotdog-or-not predictor, and the images are 20% hotdogs: then you would need at least 100 images in your test set (assuming you got them all right), so that 20 of them are hotdogs.

But what if the model isn’t perfect? Say we get hotdog-or-not right 80% of the time: we would actually need at least 125 images in the test set, since only 80% of the 25 hotdogs among them would be classified correctly (0.2 × 0.8 × 125 = 20).
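The arithmetic above generalizes to a small helper. This is just the thread’s rule of thumb (20 correct examples of the rarest class), not an established result, and the function name is made up:

```python
import math

def min_test_images(class_fraction, accuracy, target=20):
    """Minimum test-set size so that `target` examples of a class making up
    `class_fraction` of the data get classified correctly at `accuracy`.
    Solves n >= target / (class_fraction * accuracy)."""
    return math.ceil(target / (class_fraction * accuracy))

print(min_test_images(0.2, 1.0))  # perfect model: 100 images
print(min_test_images(0.2, 0.8))  # 80%-accurate model: 125 images
```

The rarer the class or the weaker the model, the more test images you need before the accuracy estimate means anything.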

1 Like

@ramesh what is a Nascent network? I haven’t heard of that architecture

Do you mean NasNet? https://research.googleblog.com/2017/11/automl-for-large-scale-image.html

Aren’t you kind of adapting a rule from structured data here? (Namely, that you need the number of observations in the smallest class to be at least equal to 20 x the number of features.) I don’t think that is correct for images/DL :thinking: … also, the number of images needed would be surprisingly small using that rule.

(anyway, I could be wrong, maybe using pretrained models the minimum number of images needed drops that much; if that’s the case and you are aware of some studies about these rules of thumb, it would be great if you shared the link… rules of thumb are a great help! :grinning:)

I’m pretty sure he did :slight_smile:

Yes, this is definitely my attempt at adapting from the structured data rule, which I am much more familiar with. I think that your idea about 20 x number of features is illuminating, in that a feature is less clearly defined in the context of an image as opposed to say a column of numbers.

Generally the idea is that you need enough observations, such that you have a significant number of correctly and incorrectly classified, for each classification. I think the last part of my comment would be closer to the point - we need enough of each thing to be confident in our rate of successful classification, which is certainly inflated by the number of features our model considers, and of course things can get hairy when we think about overfitting…

tldr: this rule of thumb is not clearly defined. Likewise, please share if you find something useful!

1 Like

@taylorpell51 understood, all clear now. :slight_smile: I also can’t find an easy translation from “feature” to images. A direct (but wrong) translation, if you wanted to use a non-DL method, could be “pixel”, because the pixels of one image would be your columns. That would lead to a huge number of images being necessary if you multiplied the number of pixels x 20. But it won’t work, because images can carry almost the same information at different scales and different pixel counts, so… sorry, no rule of thumb (apart from “usually a lot of images is better than a few”). Validation info about how easily you overfit is a good heads-up when you don’t have enough variety of images, but that’s all I can say :slight_smile:

1 Like