I am a researcher in the area of deep learning. Jeremy and I have been discussing various topics over the past several months. In our most recent conversation, we came up with some interesting ideas about what might work and how well. We would like a curious and enterprising individual or team to empirically test and compare our ideas. Your reward for your effort would be co-authoring a publication with us, if all goes well.
Here is how I described my idea to Jeremy: “In addition, I can suggest a new idea for faster training. You mentioned progressive resizing. I am toying with a different approach. What if one trained in stages, where the first stage includes only one image per class for a couple of hundred iterations/epochs, the second stage includes 10 images per class, the third stage 100 images per class, and the final stage all the images? This should be very fast, so the question is whether it performs well. One can even choose the first image to be iconic for the class so the network learns a good set of weights with which to initialize the next stage.”
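For anyone who wants a starting point, here is a minimal sketch of how the staged subsets could be built. It only selects image indices; the training loop is left out. The function name `staged_subsets`, the random choice of images (rather than hand-picked iconic ones), and the toy label array are all assumptions made for illustration, not part of the original proposal.

```python
import numpy as np

def staged_subsets(labels, per_class_schedule=(1, 10, 100, None), seed=0):
    """Return one sorted index array per stage; None means 'use every image'."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    # Shuffle each class once so that every stage is a superset of the previous one.
    order = {c: rng.permutation(np.flatnonzero(labels == c))
             for c in np.unique(labels)}
    stages = []
    for k in per_class_schedule:
        # Take the first k shuffled images of every class (all of them if k is None).
        idx = np.concatenate([o if k is None else o[:k] for o in order.values()])
        stages.append(np.sort(idx))
    return stages

# Toy stand-in for a real dataset: 10 classes with 500 images each.
labels = np.repeat(np.arange(10), 500)
stages = staged_subsets(labels)
for stage, idx in enumerate(stages, start=1):
    print(f"stage {stage}: {len(idx)} images")
```

Each stage would train on `dataset[idx]` starting from the weights of the previous stage. To test the "iconic image" variant, one could replace the random permutation with a hand-chosen ordering per class.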
Jeremy replied with the following idea: “So I guess another alternative would be: in the first stage just include a small number of very different classes (e.g. one type of fish, one type of plant, one type of vehicle). Then gradually add more classes, and towards the end of training add more similar classes (e.g. different breeds of dog). My intuition is that the latter approach might be more successful, especially when trying to train with large learning rates. But I’d be interested to see!”
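One possible way to turn Jeremy's suggestion into a concrete schedule is sketched below. It assumes you already have a mapping from each class to a coarse group (for ImageNet this might come from the WordNet hierarchy, but that is an assumption). The name `class_curriculum` and the toy mapping are hypothetical. Stage 1 takes one class from every group, so the classes are very different; later stages add siblings within each group, so the similar classes arrive last.

```python
def class_curriculum(superclass_of):
    """Return cumulative lists of classes, one list per stage.

    Stage d contains the first d classes of every coarse group.
    """
    groups = {}
    for cls, sup in superclass_of.items():
        groups.setdefault(sup, []).append(cls)
    max_depth = max(len(members) for members in groups.values())
    return [[c for members in groups.values() for c in members[:d]]
            for d in range(1, max_depth + 1)]

# Hypothetical class-to-group mapping, for illustration only.
superclass_of = {
    "trout": "fish", "oak": "plant", "truck": "vehicle",
    "beagle": "dog", "poodle": "dog", "husky": "dog",
}
curriculum = class_curriculum(superclass_of)
for stage, classes in enumerate(curriculum, start=1):
    print(f"stage {stage}: {classes}")
```

How many epochs to spend in each stage, and how to resize the output layer as classes are added, are open design choices that the experiments would need to settle.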
Ah, the art of science. We have two different hypotheses on ways to potentially speed up training. Which one is right? Or neither? Or both? Experiments must be run!
I believe that trying this on ImageNet will be definitive since it is so computationally intensive, but my approach is always to start as simply as possible to get the bugs out. So start with a small dataset (e.g., MNIST, CIFAR-10, …) and, when the process is working, try it on ImageNet.
I will be checking for replies to this post periodically but infrequently. If you need to contact me directly, my email address is firstname.lastname@example.org.
As a postscript, Jeremy and I were also discussing that it would be interesting to have a study that compares transfer learning with various initialization methods (e.g., Gaussian, MSRA, LSUV) to determine whether one should always start training by transfer learning. It would also be illuminating to compare which source datasets are best for transfer learning to which target datasets.
I am also thinking of a new initialization that is similar to LSUV (see https://arxiv.org/abs/1511.06422) but, instead of decomposing Gaussian noise into an orthonormal basis, uses an orthonormal basis derived from filters such as Gabor filters, as suggested by papers such as https://arxiv.org/abs/1411.1792.
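To make the Gabor idea a little more concrete, here is a rough sketch of one way such a filter bank could be built. It is only an assumption about how the idea might be realized: the helper names, the filter parameters, and the use of QR decomposition to orthonormalize the filters are all illustrative choices. It handles a single input channel and omits the LSUV step that rescales each layer to unit output variance.

```python
import numpy as np

def gabor_kernel(size, theta, wavelength, sigma, phase=0.0):
    """Real-valued Gabor filter on a size x size grid (size should be odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinates so the sinusoid runs along direction theta.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * x_rot / wavelength + phase)

def orthonormal_gabor_bank(size=7, n_orient=4, wavelengths=(3.0, 6.0),
                           phases=(0.0, np.pi / 2)):
    """Build Gabor filters, then orthonormalize the flattened filters with QR."""
    bank = [gabor_kernel(size, np.pi * i / n_orient, w, sigma=size / 4, phase=p).ravel()
            for i in range(n_orient) for w in wavelengths for p in phases]
    a = np.stack(bank, axis=1)       # shape: (size * size, n_filters)
    q, _ = np.linalg.qr(a)           # columns of q are orthonormal
    return q.T.reshape(-1, size, size)

filters = orthonormal_gabor_bank()
flat = filters.reshape(len(filters), -1)
gram = flat @ flat.T                 # should be close to the identity matrix
print(filters.shape, np.allclose(gram, np.eye(len(filters))))
```

Note that QR mixes the later filters with the earlier ones, so only the first filter remains a pure Gabor function. Whether that matters, and whether a different orthogonalization would be preferable, is part of what would need to be tested.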
One more thing - you might want to get organized among yourselves and split up the work. There is plenty to do and this can be a team effort. Enjoy and best of luck to you on this exploration and adventure!