It is really so gratifying to see this group autonomously develop so many proper methods for undertaking this research project. I commend all of you for what you have accomplished in only a few days.
I’d like to make a suggestion that you also start a collaborative paper. Personally, I prefer LaTeX for the way it handles formatting and equations. I use Overleaf (https://www.overleaf.com/) but there are other similar sites.
In my experience, it is better to write my reports/papers while I am doing the investigation/experiments rather than come back to the writing after everything is done. Sure, I’ve tossed/deleted a lot of my writing, but the writing often drives which experiments to do, and it is easier to write while everything is fresh. It is better to have way too much in the paper and need to edit it down than not enough information to replicate the work. Besides, implementation details can go in an Appendix.
Finally, let me suggest a potential title: “Training deep networks via growth spurts”.
Had some spare time today, and I’m glad to say I have preliminary results for your ideas. I started by adding the required functionality to the fastai library and uploaded it to my GitHub account: leslie-vs-jeremy
The change is localized to fastai/dataset.py. My idea was to change the ImageClassifierData class and all the from_* functions to load only the required data from the training set. I added two optional arguments that can be passed:
- max_train_per_class - integer (e.g. max_train_per_class=100). Affects only the training set. For each training class, loads up to the specified number of images, chosen at random. If this number is larger than the number of images in a class, all of that class's images are taken.
- partial_train_classes - list of integers (e.g. partial_train_classes=[0, 2, 5]). Affects only the training set. Loads only training images from the classes listed.
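Roughly, the filtering boils down to something like this stand-alone sketch (not the exact code in my dataset.py; the helper name and array layout are just for illustration):

```python
import numpy as np

def filter_train_set(fnames, labels, max_train_per_class=None,
                     partial_train_classes=None, seed=42):
    """Return a filtered view of the training filenames/labels.

    max_train_per_class: keep at most this many randomly chosen images per class.
    partial_train_classes: keep only images whose label is in this list.
    Only the training set is affected; validation/test are left untouched.
    """
    fnames, labels = np.asarray(fnames), np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)

    if partial_train_classes is not None:
        keep &= np.isin(labels, partial_train_classes)

    if max_train_per_class is not None:
        rng = np.random.RandomState(seed)
        for c in np.unique(labels[keep]):
            idx = np.where(keep & (labels == c))[0]
            if len(idx) > max_train_per_class:
                drop = rng.choice(idx, size=len(idx) - max_train_per_class, replace=False)
                keep[drop] = False

    return fnames[keep], labels[keep]
```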
I ran on the CIFAR-10 dataset using the notebook dl1/lesson7-cifar10.ipynb with the Resnet2 network at the end of the notebook (located at /courses/dl1/leslie-vs-jeremy-cifar10.ipynb on my GitHub).
Baseline - training the network on the whole dataset for 6 epochs.
Jeremy - I trained 2 epochs each for: 2 classes, 4 classes, 6 classes, 8 classes, 10 classes (a total of 10 epochs). The number of images the network saw was (0.2+0.4+0.6+0.8+1)*2 = 6 times the training data, hence computationally equivalent to the baseline.
Leslie - I trained 2 epochs for: 20% training, 40% training, 60% training, 80% training, 100% training (total of 10 epochs). The number of images the network saw was 6 times the training data, hence computationally equivalent to the baseline.
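To spell out the compute matching: with N images in the full training set, each schedule shows the network 2 × (0.2 + 0.4 + 0.6 + 0.8 + 1.0) × N = 2 × 3 × N = 6N image presentations, the same as 6 baseline epochs over the whole set.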
Verdict:
1st place - Baseline - 67%
2nd place - Leslie - 66.6%
3rd place - Jeremy - 58%
No problem, I’m re-running things as we speak. Just as a note, I added the functionality to the fast-ai library, so it is possible to easily run this on any dataset, with any network, using any configuration of training schedule. I’ll be glad to have others adapt my dataset.py changes and try out any idea that comes to mind.
I just read two recent papers that I really liked.
The first is by the good folks at Facebook:
Mahajan, Dhruv, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. “Exploring the Limits of Weakly Supervised Pretraining.” arXiv preprint arXiv:1805.00932 (2018).
I liked how thoroughly they covered the topic of weakly supervised pretraining/transfer learning. Also, that they utilized such a large set of images. I wonder if they created a more universal feature extractor - one that is less dependent on specific classes. This topic is important and they utilized significant computational resources on 3.5 billion images. One of their long jobs took about 20 GPU years of time (“336 GPUs across 42 machines with minibatches of 8,064 images… took ~22 days to train on 3.5B images.”). I’d say that conservatively they ran at least 90 jobs to obtain the results in their paper, which works out to over 180 GPU years of processing time. Impressive!
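(For scale: 336 GPUs × ~22 days ≈ 7,400 GPU-days, or about 20 GPU-years, for that single run; 90 runs averaging roughly 2 GPU-years each would give the 180+ GPU-year total.)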
I sent one of the authors an email today saying that I really enjoyed reading their paper and suggesting, “You are right that other research groups won’t know exactly which images you used, but would you consider making your trained weight files public?” I believe these files will be useful.
The second paper by the good folks at Google is:
Oliver, Avital, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. “Realistic Evaluation of Deep Semi-Supervised Learning Algorithms.” arXiv preprint arXiv:1804.09170 (2018).
I particularly liked their rigorous comparison of semi-supervised learning methods. Too often there are conflicting claims in the literature, so it is refreshing to see an “apples to apples” comparison. The authors discuss the “reproducibility crisis”, and they say they will make their implementation available (I checked - it is not there yet).
NEW RESEARCH TOPIC:
IMO, it would be quite interesting to investigate whether there is any advantage to combinations of fully supervised transfer learning (FSTL), weakly supervised learning (WSL), and semi-supervised learning (SSL). Some combinations might be: FSTL + WSL, FSTL + SSL, FSTL + WSL + SSL, and WSL + SSL. I don’t know if such work has been done (i.e., a literature search is needed), but it would be nice to know which combinations work and which work best. However, unless someone has unlimited access to Facebook’s or Google’s servers, we won’t have the resources to replicate the WSL weight files. I thought I’d mention this potential research topic even if we can’t investigate it.
I ran the models to convergence (notebook), which I interpreted as 80% accuracy.
I tried playing with different learning rates and different numbers of classes / percentages of training images, but in reality I must say both approaches are inferior to the baseline vanilla SGD.
Interesting. On the face of it, the suggested methods appear inferior to the baseline.
What if instead of “10%x10, 40%x10, 60%x10, 80%x3.2, 100%x10” you tried “20%x2, 40%x4, 60%x6, 80%x8, 100%x10”? Perhaps the earlier training doesn’t need to be to convergence since we are only using the weights to initialize the next step. Perhaps we will see that the greater the amount of data, the more epochs are necessary for training.
Or this too can be wrong (it is research, after all, and one is wrong a lot more than one is right). The important thing is to learn something from this.
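As a concrete sketch of that schedule (assuming CIFAR-10’s 5,000 training images per class and the max_train_per_class argument described earlier; get_data and learn are placeholders for whatever the notebook already sets up):

```python
# Proposed schedule: (fraction of training data, epochs at that fraction).
schedule = [(0.2, 2), (0.4, 4), (0.6, 6), (0.8, 8), (1.0, 10)]

IMAGES_PER_CLASS = 5000  # CIFAR-10 training set is balanced: 5,000 images per class

for frac, n_epochs in schedule:
    # get_data stands in for however the notebook builds its ImageClassifierData,
    # here passing the new argument so only a random subset of each class is loaded.
    data = get_data(max_train_per_class=int(frac * IMAGES_PER_CLASS))
    learn.set_data(data)      # keep the current weights, swap in the new data
    learn.fit(lr, n_epochs)   # lr from the usual lr_find / notebook defaults
```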
Yes, exactly - my understanding is that at the start of training, the network learns better if it has easier things to learn. I think of something similar with my toddler - I speak to her in simplified English (as do all parents - studies show that parents do this without even thinking about it).
I don’t expect my approach to be helpful on CIFAR-10 - there aren’t enough similar classes. An earlier post in this thread, however, shows how to hierarchically group CIFAR-100, which I think should help. Another approach would be to use word2vec class similarities on ImageNet (you can use the 64-pixel version to make it faster).
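For the word2vec idea, a rough sketch of grouping classes by label similarity might look like this (gensim and the vector file are just illustrative choices; any pre-trained embeddings in word2vec format would do, and the class names here are made up):

```python
import numpy as np
from gensim.models import KeyedVectors
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative: any pre-trained word2vec-format file would do.
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

classes = ['apple', 'pear', 'orange', 'dog', 'wolf', 'cat', 'truck', 'bus']
vectors = np.stack([kv[c] for c in classes])

# Hierarchically cluster the label embeddings; coarse clusters can act as
# superclasses to train on first, before switching to the full label set.
dists = pdist(vectors, metric='cosine')
tree = linkage(dists, method='average')
superclass = fcluster(tree, t=3, criterion='maxclust')
print(dict(zip(classes, superclass)))
```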
I’d suggest using @radek’s CIFAR10 baseline - that’ll get you 94% accuracy in <15 mins. I’d be cautious about interpreting approaches with a much poorer baseline (I’ve found that papers that build on poor baselines often have conclusions that turn out not to be relevant in practice).
I’m not sure that they’re using the best semi-supervised or transfer learning approaches (e.g. the kind of stuff we used in ULMFiT) and I’m also suspicious of “baselines” that use massive compute to optimize hyper-params (since that’s not how we train models in practice).
@jeremy regarding simple English, it seems like the language modeling implementation would be to start with the most frequent tokens and slowly deflate the “unknown” sentinel by extracting increasingly rare types into their own embeddings (which might increase in size during this process). This also presents an interesting opportunity to systematically size the vocabulary, rather than relying on heuristics to decide on a minimum frequency or vocabulary size.
Of course, for language models that affects the difficulty of both encoding and decoding. For language classification, it changes encoding difficulty but not decoding. For image processing, the analogous progressive resizing would be the equivalent change to encoding difficulty. For classification tasks in general, it seems like we’re exploring decoding difficulty.
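A minimal, codebase-independent sketch of that progressive-vocabulary idea (the frequency cut-offs and corpus are arbitrary):

```python
from collections import Counter

def build_vocab(tokens, min_freq):
    """Keep tokens occurring at least min_freq times; everything else maps to <unk>."""
    counts = Counter(tokens)
    itos = ['<unk>'] + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {t: i for i, t in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    return [stoi.get(t, 0) for t in tokens]  # 0 is <unk>

corpus = "the cat sat on the mat the dog sat on the log".split()

# Stage the vocabulary: start with only very frequent tokens, then lower the
# cut-off so rarer types leave <unk> and get their own ids/embeddings.
for min_freq in (3, 2, 1):
    itos, stoi = build_vocab(corpus, min_freq)
    ids = numericalize(corpus, stoi)
    print(f'min_freq={min_freq}: vocab size={len(itos)}')
    # ...resize the embedding matrix to len(itos) and continue training here...
```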
By this, do you mean some sort of label embeddings trained on ImageNet data and made publicly available, or just the usual word2vec embeddings?
But I believe it’s probably the first one, since it makes more sense for learning, like a t-SNE representation for MNIST. (If my understanding is correct, we can use t-SNE embeddings of MNIST, start with k clusters, and add more classes from each cluster step by step.) For example, you start training with 7 and then at the next stage add class 1, which shouldn’t be that hard to learn after learning 7. But then how applicable would this be in real life? How certain can we be of the unsupervised part?
Leslie’s approach seems easier to implement, by just writing a custom sampler I suppose. Regarding your approach, do we need to dynamically change the model output as we add more classes? Such as initializing a new logistic unit for the newly added class at the final layer?
Please correct me if there is a conceptual misunderstanding.
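On the custom-sampler point, here is a minimal PyTorch-style sketch of sampling only a fraction of the training set each epoch (illustrative only, not the sampler in the fastai code):

```python
import torch
from torch.utils.data import Sampler, DataLoader

class PartialSampler(Sampler):
    """Each epoch, yield a fresh random subset of the dataset indices."""
    def __init__(self, data_source, fraction):
        self.n = len(data_source)
        self.n_keep = max(1, int(self.n * fraction))

    def __iter__(self):
        return iter(torch.randperm(self.n)[:self.n_keep].tolist())

    def __len__(self):
        return self.n_keep

# Usage (train_ds is any torch Dataset):
# loader = DataLoader(train_ds, batch_size=128, sampler=PartialSampler(train_ds, 0.2))
```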
I wanted to point out a flaw in my implementation of Jeremy’s approach. I initialize a full network and take out some of the classes, but this turns out to perform poorly. My guess is that when you only show the network negative examples of a certain class, the FC weights of the missing classes turn out to be pretty bad, and it takes time for the network to correct this once the positive samples are introduced.
Probably a better approach would be to implement a hierarchical scheme and dynamically resize the network accordingly. There are two options here:
1. Showing the network only certain classes each time. This uses the same classes that are in the labels, but we need to gradually increase the size of the last FC layer to match the number of classes currently present. The problem here is that at each step you are introducing uninitialized weights. I assume that in this case you should freeze all of the network, train the uninitialized weights, and then resume training (see the sketch after this list).
2. Showing the network all of the data, but changing the labels of the data to fit a hierarchy. This is what @gokkulnath proposed. We can first train a network on the superclasses and, after a while, train a full network using all of the classes, with the superclass weights as an initialization. The problem with this approach is that it is not general: it assumes the labels can be divided into superclasses, and it also assumes some expert knowledge in choosing these superclasses correctly.
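For option 1, the resize step might look roughly like this (a PyTorch sketch that assumes the classifier head is a single nn.Linear; the helper name is made up):

```python
import torch
import torch.nn as nn

def grow_classifier(old_fc: nn.Linear, n_new_classes: int) -> nn.Linear:
    """Return a larger final layer that keeps the already-trained class weights
    and adds freshly initialized rows for the newly introduced classes."""
    n_old, n_in = old_fc.out_features, old_fc.in_features
    new_fc = nn.Linear(n_in, n_old + n_new_classes)
    with torch.no_grad():
        new_fc.weight[:n_old] = old_fc.weight
        new_fc.bias[:n_old] = old_fc.bias
    return new_fc

# Sketch of the schedule: grow the head, freeze the body, briefly train only the
# head (which contains the uninitialized rows), then unfreeze and resume training.
# model.fc = grow_classifier(model.fc, n_new_classes=2)
# for p in model.parameters(): p.requires_grad = False
# for p in model.fc.parameters(): p.requires_grad = True
# ...short warm-up training...
# for p in model.parameters(): p.requires_grad = True
```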
Valid point. But I still have a hunch that my implementation of your method would yield poor results even with a better network, for the reasons I described in the post above. The question is: what is the best way to implement your approach? I personally don’t like approach (2) because it is not general. Approach (1) seems better.
I personally am not sure about either your method or Leslie’s. I think that showing the network only a partial training set leads to weights that generalize less well to the validation set, and once a larger set of training images is introduced, the network has to correct this, leading to a longer training time. Whereas training on the whole training set from the beginning allows the network to converge to a general set of weights more directly.
But it’s still worth trying the approach on @radek’s CIFAR10 baseline.
@Leslie, @jeremy, @deanmark
I’ve been playing around with exactly that recently: I implemented a custom sampler so that we could use less data to train more quickly. My results so far indicate that this type of approach can speed up learning, but it doesn’t converge to a good end result. I am using CIFAR10 as a benchmark dataset and the baseline is the fast.ai DAWNBench architecture.
E.g. if our target is to find the fastest time to 70-80% accuracy on CIFAR10, then small-sample training could win. But if our target is 90+% accuracy, then small-sample training never achieves it. These are my results so far.
By the way, I modified fast.ai from here and there, and my current fastest CIFAR-10 result is <11 min to 93+% accuracy on my local machine (compared with <15 min to 94% accuracy using the original fast.ai). I am trying to find this last 1% to push it over 94%.
Based on the comments over the last 24 hours, I’d say “Neither” method is better than the baseline.
In hindsight, I realized a flaw in my thinking. The mini-batch size is the same throughout so the speed is the same. We are only limiting the number of training samples, which isn’t what I had in mind. I was wondering if it were possible to very quickly estimate a good solution (i.e., weights) with very few samples. One can think of this as a warm start and I still think there might be a good way to achieve it quickly…
Also, I believe @deanmark’s comments about training with a subset of classes to be correct.
I’d like to share my personal scientific process.
Initial intuition/hypothesis -> experiments -> observation, especially of contradictions to my initial intuition -> THINK -> understanding
I thank everyone for any and all of their efforts with these experiments. This isn’t necessarily over yet. We are at the thinking stage and the goal is understanding. If anyone has any profound (or less than profound) thoughts about these experiments and where to go next and where this might lead, please share. I too will think about this.
Thank you too, @Leslie. I had fun trying something new, so no harm there. It’s nice to have the opportunity to participate in research in my spare time. I hope we’ll all gain some insight from this experience. I’m looking forward to more threads like this in the future.
P.S. The Gabor filter initialization idea seems to have been explored already. Intel Caffe has a Gabor filler to initialize conv layers with Gabor filters. Also here and here.
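For reference, a quick sketch of what Gabor initialization of a first conv layer could look like (NumPy/PyTorch; the filter parameters and orientation spacing are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

def gabor_kernel(ksize, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter of size ksize x ksize at orientation theta."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * x_t / lambd + psi)

def init_conv_with_gabor(conv: nn.Conv2d):
    """Fill each output filter with a Gabor kernel at an evenly spaced orientation,
    replicated across the input channels."""
    out_c, in_c, k, _ = conv.weight.shape
    with torch.no_grad():
        for i in range(out_c):
            g = gabor_kernel(k, theta=np.pi * i / out_c)
            conv.weight[i] = torch.tensor(g, dtype=conv.weight.dtype).expand(in_c, k, k)

conv1 = nn.Conv2d(3, 16, kernel_size=7, padding=3)
init_conv_with_gabor(conv1)
```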