I’m fascinated with the recent paper by Frankle and Carbin (https://arxiv.org/abs/1803.03635) and the subsequent work done at Uber (https://eng.uber.com/deconstructing-lottery-tickets/). In terms of practical deep learning, does this represent a real breakthrough? Would love to see anyone’s results playing around with this approach
Can someone ELI-5 the paper(s)?
Well, essentially some researchers from MIT found that you can prune a trained neural network down to a much smaller sub-network, and that this sub-network, when reset to its original random initialization and retrained, reaches accuracy similar to the full network.
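In rough PyTorch terms, my reading of the procedure looks like the sketch below, on a toy linear layer (the "training" step here is just a dummy weight update, and the pruning fraction is arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 10)
init_weights = layer.weight.detach().clone()  # 1. save the random init

# 2. train the network (stand-in: a dummy update so the weights change)
with torch.no_grad():
    layer.weight.add_(0.1 * torch.randn_like(layer.weight))

# 3. after training, prune the p% smallest-magnitude weights
p = 0.2
flat = layer.weight.abs().flatten()
k = int(p * flat.numel())
threshold = flat.kthvalue(k).values
mask = (layer.weight.abs() > threshold).float()

# 4. reset the surviving weights to their ORIGINAL init values -- the key
#    claim is that this subnetwork + its original init is the "winning ticket"
with torch.no_grad():
    layer.weight.copy_(init_weights * mask)
```

Step 4 is what distinguishes this from ordinary pruning: you retrain the sparse network from the saved initialization, not from the trained weights.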
For some of their results, the pruned networks actually reached better accuracy than the full network. I really wonder though if we could apply transfer learning with those sub-networks.
It feels like this is a kind of transfer learning.
Yeah - this fascinates me as well. I am not sure if those models are as generalizable for transfer learning as the original non-pruned models though.
Still, it would be awesome to have a fastai function that prunes a final trained model, if for nothing else than to have a smaller, more efficient final neural network we can deploy to production.
Well, pruning is not that hard while using PyTorch. I will implement their solution today and see if 1/ it works, 2/ how to automate it.
Thinking through how it would work…would we just loop through all the final weights and zero out all the ones under a certain threshold? Then reshape the model to only have non-zero weights?
It seems that there are two main ways of pruning: either pruning p% of all the weights globally, or pruning p% layer-wise. From the paper, the weight selection is based on their values, so you drop the lowest ones, no fixed threshold involved. I guess "lowest" means lowest in absolute value.
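A minimal sketch of the two selection schemes mentioned above (global vs. layer-wise magnitude pruning); the function names are my own. Note that rather than reshaping the model to drop the pruned weights, it's simpler to zero them with a binary mask, since unstructured sparsity doesn't shrink the tensor shapes anyway:

```python
import torch
import torch.nn as nn

def global_masks(model, p):
    """One magnitude threshold shared across all Linear layers."""
    all_weights = torch.cat([m.weight.abs().flatten()
                             for m in model if isinstance(m, nn.Linear)])
    k = int(p * all_weights.numel())
    threshold = all_weights.kthvalue(k).values
    return {m: (m.weight.abs() > threshold).float()
            for m in model if isinstance(m, nn.Linear)}

def layerwise_masks(model, p):
    """A separate threshold per layer, so each layer loses exactly p%."""
    masks = {}
    for m in model:
        if isinstance(m, nn.Linear):
            flat = m.weight.abs().flatten()
            k = int(p * flat.numel())
            threshold = flat.kthvalue(k).values
            masks[m] = (m.weight.abs() > threshold).float()
    return masks

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 10))
for m, mask in layerwise_masks(model, 0.5).items():
    with torch.no_grad():
        m.weight.mul_(mask)  # zero out the pruned connections
```

Global pruning tends to remove more from over-parameterized layers, while layer-wise keeps the sparsity uniform; to keep the weights pruned during further training, you'd re-apply the masks after each optimizer step.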
The source code for the paper is available on Lottery Ticket Hypothesis.
There is also an interesting paper from ICLR 2019 on the same subject -> Rethinking the Value of Network Pruning. Their implementation is also available (PyTorch).
DNN theoretician but weak coder here. The implications of this paper are huge. This finding suggests that random initializations are far more important than current practice capitalizes on. The power of a large network may be mostly in the fact that it contains a large number of random initializations that will be more likely to find a decent number of “winning lottery tickets”. Essentially, the random initializations are far more important than the training.
This suggests that it might be more productive to start with a large number of neural nets with random initializations, then prune them aggressively to find the "winning tickets" that do well quickly, than to train a single large neural net with a single set of random initializations.
Just briefly… the implications seem to include:
- Weight decay should be integrated with network pruning
- Dropout should perhaps be more intelligent, maybe using something like momentum based on weights that don’t seem to contribute much to the output
- Complex multi-layer subnetworks/kernels (need a good name) might in the future be isolated as pruned networks and combined in large ensembles. This suggests that we may need the ability to freeze and unfreeze kernels in parallel for transfer learning.
That’s not even touching on the implied value of adding newly randomly-initialized cells to a network, along the lines of evolutionary AI.
If random initialization is as important as this paper implies, then adding randomly initialized connections throughout the training cycle could be a big part of the way forward.
Yes, I agree. The starting position is everything. The fact that initialisations are random means you can always be unlucky in the init phase. And if you look at Kaggle competitions, ensembles usually win.