Leveraging Proxy Datasets for faster CNN development - ImageNette featured!

A really nice paper was just published that takes the concept Jeremy pioneered, using proxy datasets to accelerate CNN development, and subjects it to deeper statistical analysis.

The short summary is: a proxy dataset built from the 10% easiest examples had by far the best correlation with actual results on the full dataset. Proxies built from the 10% hardest examples, a 10% random sample, or testing with only 1 epoch were grossly inferior.
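A minimal sketch of how such a proxy could be built, assuming you already have a per-example "difficulty" score (e.g. the loss each example gets from a quick preliminary model; the random losses and helper name below are purely illustrative, not from the paper):

```python
import numpy as np

# Hypothetical per-example losses from a quick preliminary model;
# lower loss = "easier" example. Random data stands in for real scores.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=1000)

def easiest_proxy_indices(losses, frac=0.10):
    """Return indices of the easiest `frac` of examples (lowest loss)."""
    n_keep = int(len(losses) * frac)
    # argsort ascending puts the smallest losses first
    return np.argsort(losses)[:n_keep]

proxy_idx = easiest_proxy_indices(losses, frac=0.10)
print(len(proxy_idx))  # 100 of the 1000 examples kept
```

You would then index your training set with `proxy_idx` and iterate architectures/hyperparameters on that small subset before confirming on the full data.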

In a similar vein, they also looked at how much information is gleaned per epoch on the full dataset, i.e. how many epochs you need to run (5, 10, etc.) before results correlate well with the final outcome. 2 epochs provided a 72% correlation, 8 provided about 82%, and 15 provided a 96% correlation (bear in mind final results were based on 20 epochs).
Thus, that may provide a 'fail fast' process for testing as well…
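The correlation they report can be understood as a rank correlation: do configurations that rank well after a few epochs also rank well after full training? A small sketch with made-up accuracies (the numbers and the `spearman` helper are illustrative assumptions, not the paper's data):

```python
import numpy as np

# Hypothetical validation accuracies for 8 candidate configurations,
# measured after 2 epochs and after full 20-epoch training.
acc_2ep  = np.array([0.55, 0.61, 0.58, 0.64, 0.52, 0.60, 0.57, 0.63])
acc_20ep = np.array([0.71, 0.77, 0.74, 0.83, 0.69, 0.78, 0.73, 0.81])

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# High rank correlation => the 2-epoch ranking is a usable proxy
# for the full-training ranking, so poor configs can be dropped early.
print(spearman(acc_2ep, acc_20ep))
```

If the correlation at k epochs is high enough for your purposes, you can 'fail fast' by discarding the worst-ranked configurations without ever training them to completion.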

I wrote an article summarizing their findings here:

and full paper is here:

Really neat to see things from the advanced dev course making their way into cutting edge papers and research practices! Plus, now we have some better methods for further developing our own CNNs.