In Lesson 11, Jeremy points to a source where they mention that they train their models on 128x128 ImageNet images instead of the standard 224x224 images. They show/claim that the performance of a model trained on the scaled-down images is highly correlated w/ the performance of the same model trained on normal-sized images. That’s interesting because it means you can do your architecture search/parameter tuning over a “cheaper” class of models, then “expensively” retrain the best model you find on full-sized images.
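To make that two-phase idea concrete, here’s a minimal PyTorch-style sketch: rank a pool of candidates cheaply at 128x128, then retrain only the winner at 224x224. This isn’t anything from the lecture — the `path/to/imagenet` directory, the candidate pool, the epoch counts, and scoring on the training loader are all placeholders just to show the shape of the workflow.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def make_loader(data_dir, image_size, batch_size=64):
    # Identical pipeline for both phases; only the resize target changes.
    tfm = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
    ])
    return DataLoader(datasets.ImageFolder(data_dir, transform=tfm),
                      batch_size=batch_size, shuffle=True)

def train(model, loader, epochs, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

@torch.no_grad()
def accuracy(model, loader):
    # In practice you'd score on a held-out validation loader;
    # one loader is used here just to keep the sketch short.
    model.eval()
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# "Cheap" phase: rank candidate architectures on 128x128 images.
candidates = {"resnet18": models.resnet18, "resnet34": models.resnet34}
small_loader = make_loader("path/to/imagenet", image_size=128)  # hypothetical path
scores = {name: accuracy(train(ctor(num_classes=1000), small_loader, epochs=1),
                         small_loader)
          for name, ctor in candidates.items()}

# "Expensive" phase: retrain only the best candidate on full 224x224 images.
best = max(scores, key=scores.get)
full_loader = make_loader("path/to/imagenet", image_size=224)
final_model = train(candidates[best](num_classes=1000), full_loader, epochs=90)
```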
This touches on the general point that, in architecture search/parameter tuning, you don’t actually need to train a bunch of models to completion; you just need to predict what their performance will be. Any metric that’s highly correlated w/ the eventual performance of an expensive-to-train model is very useful. The 128x128 scaling mentioned above is one such trick that lets you predict the performance of a model w/o paying the full training cost. I’m wondering whether people are aware of other such methods – perhaps things like
- training a model on greyscale images
- training a model w/ less data
- only partially training a model
- training a model at low precision
- training w/ higher learning rates
etc., all of which seem reasonable but which I haven’t really seen evaluated to this end. (A rough sketch of how one of these proxies could be tested is below.)
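If someone did evaluate one of these, I imagine the check would look something like the following: train each candidate both w/ the cheap proxy and w/ the full recipe, then see whether the proxy ranking predicts the full ranking. The accuracy numbers here are made-up placeholders, just to show the shape of the test (Spearman rank correlation from scipy):

```python
from scipy.stats import spearmanr

# Hypothetical accuracies for a handful of candidate models (not real results).
proxy_acc = {"A": 0.61, "B": 0.58, "C": 0.64, "D": 0.55}  # e.g. trained on 10% of the data
full_acc  = {"A": 0.74, "B": 0.71, "C": 0.76, "D": 0.69}  # full training run

names = sorted(proxy_acc)
rho, p = spearmanr([proxy_acc[n] for n in names],
                   [full_acc[n] for n in names])
print(f"rank correlation between proxy and full training: {rho:.2f} (p={p:.2f})")
```

A high rank correlation would mean the cheap proxy is good enough to drive the search, even if its absolute accuracy numbers are much lower.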
~ Ben