Approximate training methods

In Lesson 11, Jeremy points to a paper where the authors mention that they train their models on 128x128 ImageNet images instead of the standard 224x224 images. They show/claim that the performance of a model trained on the scaled-down images is highly correlated w/ the performance of the same model trained on full-sized images. That’s interesting because it means you can do your architecture search/parameter tuning over a “cheaper” class of models, then “expensively” retrain the best model you find on full-sized images.
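To make the savings concrete, here’s a minimal sketch (my own illustration, not from the paper or the lesson) of downscaling a batch of images by average pooling, plus the rough compute ratio you’d expect, since per-layer conv cost scales roughly with pixel count:

```python
import numpy as np

def downscale(batch, factor):
    """Average-pool a batch of square images (N, H, W, C) by an integer factor.
    E.g. 256x256 -> 128x128 with factor=2."""
    n, h, w, c = batch.shape
    return batch.reshape(n, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

batch = np.ones((2, 256, 256, 3))       # two dummy 256x256 RGB images
small = downscale(batch, 2)
print(small.shape)                       # (2, 128, 128, 3)

# Conv FLOPs scale ~linearly with pixel count, so 128x128 vs 224x224 is roughly:
print((128 / 224) ** 2)                  # ~0.33x the per-image compute
```

In practice you’d resize with proper interpolation (e.g. in your data pipeline) rather than pooling, but the compute arithmetic is the same.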

This touches on the general point that, in architecture search/parameter tuning, you don’t actually need to train a bunch of models, you just need to predict what their performance will be. Any metric that’s highly correlated w/ the eventual performance of an expensive-to-train model is very useful. The 128x128 scaling mentioned above is one such trick that lets you predict the performance of a model w/o actually fully training it. I’m wondering whether people are aware of other such methods, perhaps things like

  • training a model on greyscale images
  • training a model w/ less data
  • only partially training a model
  • training a model at low precision
  • training w/ higher learning rates

etc., all of which seem reasonable but which I haven’t really seen evaluated to this end.
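For any of these, the evaluation you’d want is a rank correlation between the cheap metric and the expensive one across candidate models: if the proxy preserves the ranking, it’s good enough for search even when absolute accuracies differ. A minimal sketch, with made-up accuracies for illustration (ties handled naively):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    (Naive ranking via double argsort; doesn't average tied ranks.)"""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical accuracies for five candidate architectures:
proxy = np.array([0.61, 0.55, 0.70, 0.58, 0.66])  # cheap training (e.g. 128x128)
full  = np.array([0.74, 0.69, 0.81, 0.72, 0.78])  # full training (224x224)

print(spearman(proxy, full))  # 1.0 here: the proxy ranks the models identically
```

A correlation near 1 means you can pick the winner from the cheap runs alone and only pay full training cost once.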

~ Ben