I’m fascinated with the recent paper by Frankle and Carbin (https://arxiv.org/abs/1803.03635) and the subsequent work done at Uber (https://eng.uber.com/deconstructing-lottery-tickets/). In terms of practical deep learning, does this represent a real breakthrough? Would love to see results from anyone who has been playing around with this approach.
Can someone ELI-5 the paper(s) ?
Well, essentially, some researchers from MIT found a way to prune a neural network after training, leading to much smaller networks with similar accuracy.
For some of their results, the pruned networks actually reached better accuracy than the full network. I really wonder though if we could apply transfer learning with those sub-networks.
It feels like this is a kind of transfer learning.
Yeah - this fascinates me as well. I am not sure if those models are as generalizable for transfer learning as the original non-pruned models though.
Still - it would be awesome if we had a fastai function that prunes a final trained model, if for nothing else than to have a smaller, more efficient final neural network we can deploy to production.
Well, pruning is not that hard while using PyTorch. I will implement their solution today and see if 1/ it works, 2/ how to automate it.
Thinking through how it would work…would we just loop through all the final weights and zero out all the ones under a certain threshold? Then reshape the model to only have non-zero weights?
It seems that there are two main ways of pruning: either pruning p% of all the weights globally or pruning p% layer-wise. From the paper, the weights are selected based on their values, so you remove the lowest ones, no fixed threshold involved. I guess “lowest” means lowest in absolute value.
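To make the two options concrete, here is a minimal NumPy sketch of magnitude pruning, once ranking weights inside each layer and once ranking them across all layers. It is not the paper's code: the layer names, shapes and helper functions (`keep_mask`, `layerwise_masks`, `global_masks`) are made up for illustration, and it only builds boolean masks rather than touching a real PyTorch model.

```python
import numpy as np

def keep_mask(abs_values, k):
    """Boolean mask that is False for the k smallest entries of a 1-D array."""
    mask = np.ones(abs_values.size, dtype=bool)
    mask[np.argsort(abs_values)[:k]] = False
    return mask

def layerwise_masks(weights, fraction):
    """Prune the same fraction of every layer, ranking weights within each layer."""
    masks = {}
    for name, w in weights.items():
        flat = np.abs(w).ravel()
        masks[name] = keep_mask(flat, int(fraction * flat.size)).reshape(w.shape)
    return masks

def global_masks(weights, fraction):
    """Prune a fraction of all weights, ranking them across layers together."""
    names = list(weights)
    flat = np.concatenate([np.abs(weights[n]).ravel() for n in names])
    keep = keep_mask(flat, int(fraction * flat.size))
    masks, start = {}, 0
    for n in names:
        size = weights[n].size
        masks[n] = keep[start:start + size].reshape(weights[n].shape)
        start += size
    return masks

# Two made-up layers; fc2 deliberately has much smaller weights than fc1.
rng = np.random.default_rng(0)
weights = {
    "fc1": rng.normal(0.0, 1.0, size=(20, 10)),
    "fc2": rng.normal(0.0, 0.1, size=(10, 5)),
}
layerwise = layerwise_masks(weights, 0.2)
global_ = global_masks(weights, 0.2)
for name in weights:
    print(name, "kept layer-wise:", int(layerwise[name].sum()),
          "kept globally:", int(global_[name].sum()))

# Applying a mask zeroes the pruned weights but keeps the tensor shape.
pruned_fc1 = weights["fc1"] * layerwise["fc1"]
```

With global ranking, a layer whose weights are small overall loses a larger share of its connections, which is one reason to consider the layer-wise variant. Note that masking only zeroes entries; the tensors keep their shape, so this alone does not give a smaller dense model.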
The source code for the paper is available on Lottery Ticket Hypothesis.
There is also an interesting paper from ICLR 2019 on the same subject -> Rethinking the Value of Network Pruning. Their implementation is also available (PyTorch).
DNN theoretician but weak coder here. The implications of this paper are huge. This finding suggests that random initializations are far more important than current practice capitalizes on. The power of a large network may be mostly in the fact that it contains a large number of random initializations that will be more likely to find a decent number of “winning lottery tickets”. Essentially, the random initializations are far more important than the training.
This suggests that it might be more productive to start with a large number of neural nets with random initializations, then prune them aggressively to find the “winning tickets” that do well quickly, than to train a single large neural net with a single set of random initializations.
Just briefly… the implications seem to include:
- Weight decay should be integrated with network pruning
- Dropout should perhaps be more intelligent, maybe using something like momentum based on weights that don’t seem to contribute much to the output
- Complex multi-layer subnetworks/kernels (need a good name) might in the future be isolated as pruned networks and combined in large ensembles. This suggests that we may need the ability to freeze and unfreeze kernels in parallel for transfer learning.
That’s not even touching on the implied value of adding newly randomly-initialized cells to a network, along the lines of evolutionary AI.
If random initialization is as important as this paper implies, then adding randomly activated connections throughout the training cycle could be a big part of the way forward.
Yes, I agree. The starting position is everything. The fact that initialisations are random means you can always be unlucky in the init phase. And if you look at Kaggle competitions, ensembles usually win.
Haven’t had a chance to read the whole paper, but I am confused about why they are using Gaussian Glorot initialization for all of their comparisons. I would think training ResNet and such would be significantly hampered by this choice? I do understand that they wanted a “fair” comparison, but I think this may more or less amount to “our network works well using Gaussian Glorot initialization.” Would be interested in a comparison of this with regular conv nets trained with Kaiming initialization.
Yeah, I found that curious too - if they had a reason for selecting Gaussian Glorot, it wasn’t explained. Given how important the paper suggests initialization is, it would be nice to see a comparison of the various methods from the lottery-ticket perspective.
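For anyone wondering what the difference between the two schemes is in practice, here is a small NumPy sketch. It uses the standard definitions (Glorot normal: std = sqrt(2 / (fan_in + fan_out)); Kaiming normal for ReLU: std = sqrt(2 / fan_in)) and pushes random inputs through a plain stack of ReLU layers. The width, depth and the `final_rms` helper are arbitrary choices for illustration, and the toy network has no batch norm or skip connections, so it says nothing directly about the paper's experiments.

```python
import numpy as np

def glorot_std(fan_in, fan_out):
    # Glorot/Xavier normal initialization
    return np.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in):
    # Kaiming/He normal initialization with the ReLU gain
    return np.sqrt(2.0 / fan_in)

def final_rms(std_fn, width=256, depth=20, seed=0):
    """RMS of the activations after `depth` randomly initialized ReLU layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(512, width))
    for _ in range(depth):
        w = rng.normal(0.0, std_fn(width), size=(width, width))
        x = np.maximum(x @ w, 0.0)  # linear layer followed by ReLU
    return float(np.sqrt(np.mean(x ** 2)))

rms_glorot = final_rms(lambda n: glorot_std(n, n))
rms_kaiming = final_rms(kaiming_std)
print("Glorot  final activation RMS:", rms_glorot)
print("Kaiming final activation RMS:", rms_kaiming)
```

In this toy setting the Glorot-initialized activations shrink layer after layer while the Kaiming ones stay at roughly the same scale, which is the usual argument for Kaiming with ReLU networks.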
Speaking to some PhD researchers today, I heard one point out that these findings might not extend, or might extend in unforeseen ways, to other types of network - language models and networks with Attention were the examples. The function of Attention seems to be to teach the network to focus on the winning lottery tickets; the value of Attention might change or even go away in a pruned network.
A fascinating topic, and I am eager to see what becomes of it.
I may be seeing something similar to the lottery ticket in practice. It’s a layered CNN that does time series prediction. The artificial data is exponential + sine wiggles + random walk, predicting the next step, with the last 20% reserved for validation. In about one third of runs, it learns the training data very well and tracks validation nearly perfectly. In the other two thirds, it learns the training data equally well but fails spectacularly to generalize to validation. The only difference is the random initialization of the weights. It’s as if it can initially fall into one of two different hypotheses, only one of which fits the underlying function.
I finally have working code that applies their pruning method. From the first experiment I ran on VGG16/CIFAR10, I can see that I get better accuracy faster. There is no deterioration of the model’s accuracy when pruning 2~3 times (20% each time). However, I know that I am not following their methodology exactly. After resetting the weights before re-training, the pruned weights are still updated during back-propagation, which I think should be avoided.
Also I am not sure what to do regarding the bias parameters.
I will continue further experiments to have more information about this. But it is fun !
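One way to keep pruned weights from being updated is to multiply the gradient step by the pruning mask. Below is a minimal NumPy sketch of the train / prune / reset-to-init loop on a toy linear regression. It is not the VGG16 code from this post: the data, the `train` and `prune` helpers and all hyperparameters are made up for illustration, and there are no bias parameters, so it leaves the bias question open.

```python
import numpy as np

def train(w_init, mask, X, y, lr=0.1, steps=200):
    """Gradient descent on mean squared error, touching only unpruned weights."""
    w = w_init * mask                      # start from the ORIGINAL init, pruned
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad * mask              # masked step: pruned weights stay at 0
    return w

def prune(w, mask, fraction):
    """Drop the smallest-magnitude `fraction` of the weights that are still alive."""
    alive = np.flatnonzero(mask)
    k = int(fraction * alive.size)
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]
    new_mask = mask.copy()
    new_mask[drop] = False
    return new_mask

# Synthetic data: only 5 of the 50 true coefficients are non-zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true

w_init = rng.normal(0.0, 0.1, size=50)     # saved once, reused every round
mask = np.ones(50, dtype=bool)
for round_ in range(3):
    w = train(w_init, mask, X, y)
    mask = prune(w, mask, 0.2)
    print("round", round_ + 1, "weights left:", int(mask.sum()))

w_final = train(w_init, mask, X, y)        # retrain the final sparse model
```

In PyTorch the same idea would presumably mean multiplying either the gradients or the weights by the mask after each optimizer step; the sketch above only shows the logic.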
After an initial training (just 20 epochs with nothing fancy), I got a validation accuracy of 89.34%. After pruning a first time by 20% (layer-wise) and re-training, I got 89.8%.
Re-pruning by 20% and re-training, I get 90.3%.
Re-pruning by 20% and re-training, I get 90.6%.
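For reference, a quick calculation of how much of the network is left after these rounds, assuming each 20% is taken from the weights that are still remaining (the post does not say). If the 20% were instead taken from the original count each time, three rounds would leave 80%, 60% and 40%.

```python
def remaining_after(rounds, rate=0.2):
    """Fraction of weights left if each round prunes `rate` of the survivors."""
    return (1.0 - rate) ** rounds

for rounds in range(1, 4):
    print(rounds, "round(s):", round(100 * remaining_after(rounds), 1), "% of weights left")
```

Under that assumption the 90.6% result would come from a network with roughly half of its original weights (0.8^3 = 0.512).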
There are already three new research papers building on the hypothesis:
The Lottery Ticket Hypothesis at Scale
Deconstructing Lottery Tickets - Zeros, Signs and the Supermask
Sparse Transfer Learning via Winning Lottery Tickets
Great summary. Look forward to seeing this in the Fastai library very soon.
As an aside, I noticed this effect recently when trying lots of different layer architectures and re-running from scratch each time. Sometimes I would get an MSE loss of 4.5 and other times 2.8, just by restarting everything with the built-in randomness of the init. I was going to “solve” this by fixing all the random seeds, but now I’m thinking that re-running and saving the good networks might be a better strategy … at least until all this pruning stuff is implemented. Time to start the experiments.
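The “re-run and keep the good ones” strategy can be sketched in a few lines. The example below is only a stand-in: instead of a real network it uses a made-up one-dimensional loss with two minima of different depth, so that the random starting point alone decides which one gradient descent ends up in. The `loss`, `grad` and `run` functions and all constants are hypothetical.

```python
import numpy as np

def loss(w):
    # Made-up 1-D loss: a deeper minimum near w = -1, a shallower one near w = +1.
    return (w ** 2 - 1.0) ** 2 + 0.3 * w

def grad(w):
    return 4.0 * w * (w ** 2 - 1.0) + 0.3

def run(seed, lr=0.01, steps=1000):
    """One 'training run': random init from the seed, then plain gradient descent."""
    w = np.random.default_rng(seed).uniform(-2.0, 2.0)
    for _ in range(steps):
        w -= lr * grad(w)
    return float(loss(w)), float(w)

# Try several seeds and remember which one gave the lowest final loss.
results = {seed: run(seed) for seed in range(10)}
best_seed = min(results, key=lambda s: results[s][0])
best_loss, best_w = results[best_seed]
print("best seed:", best_seed, "loss:", round(best_loss, 3))
```

Recording the seed along with the result keeps a good run reproducible, which gets you the benefit of fixed seeds without being stuck with an unlucky one.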
Very interesting indeed.
I haven’t had time to read the paper in full; for now I have only watched the video below from ICML and read the blog post from the team at Uber.
I find the concern raised in the final question at the end of the video quite interesting. Are these lottery tickets overfitting the validation and test sets, in terms of both architecture and weights, or could they generalize as well as the original model?
P.S.: I plan to read both papers by the end of next week. Very interesting research area.