Following up on EfficientNet
@rwightman thank you for pointing out the MnasNet paper (https://arxiv.org/abs/1807.11626). It was indeed one of the missing pieces of the puzzle and explains some parts in more detail (e.g. the squeeze-and-excitation blocks).
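For reference, here is a minimal squeeze-and-excitation sketch in PyTorch; the reduction ratio and the 1x1-conv formulation are my assumptions, not the exact EfficientNet/MnasNet configuration:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    # Minimal squeeze-and-excitation block: globally pool to a per-channel
    # descriptor, pass it through a small bottleneck, and rescale the input.
    # The reduction ratio of 4 is illustrative, not the papers' exact setting.
    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)

    def forward(self, x):
        scale = x.mean(dim=(2, 3), keepdim=True)                       # squeeze
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(scale))))   # excite
        return x * scale                                               # rescale channels
```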
I’m currently running experiments with https://github.com/daniel-j-h/efficientnet on ImageNet. Even without bells and whistles (no swish, no squeeze-and-excitation), my EfficientNetB0 reaches competitive results wrt. Acc@1. These experiments eat a lot of time, though.
I also heard from folks who gave the EfficientNet models a try on mobile devices that inference is quite slow. This might be due to the backend used or to the swish activation function. Maybe you folks have insights here?
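As a very rough first check on my laptop (not a proper mobile benchmark), a micro-benchmark along these lines compares swish against ReLU on a single activation tensor; the tensor shape and iteration count are arbitrary:

```python
import time
import torch
import torch.nn as nn

class Swish(nn.Module):
    # swish(x) = x * sigmoid(x); the extra sigmoid is the suspected cost on CPU/mobile
    def forward(self, x):
        return x * torch.sigmoid(x)

def bench(act, iters=100):
    # time repeated forward passes of a single activation on a mid-sized tensor
    x = torch.randn(1, 32, 112, 112)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            act(x)
    return time.perf_counter() - start

print("relu :", bench(nn.ReLU()))
print("swish:", bench(Swish()))
```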
From initial benchmarks on my laptop (CPU) it seems that e.g. MobileNetV2 (the existing one from torchvision.models) can easily be scaled down via its width multiplier, whereas EfficientNetB0 is already the smallest variant and we have no coefficients to go lower. Has anyone tried scaling down EfficientNets? Or would it make more sense to go to MobileNetV3 and MobileNetV3-small directly?
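For comparison, scaling down MobileNetV2 is a one-liner, since torchvision's implementation exposes the width multiplier (no pretrained weights exist for non-default widths, of course):

```python
import torchvision.models as models

# Halving the width multiplier roughly quarters the per-layer channel cost.
small_mnv2 = models.mobilenet_v2(width_mult=0.5)

# EfficientNetB0 is the smallest published compound-scaling point; going below it
# would mean choosing new width/depth/resolution coefficients ourselves.
print(sum(p.numel() for p in small_mnv2.parameters()))
```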
What I also looked into is the Bag of Tricks paper (https://arxiv.org/abs/1812.01187) and its three insights:
- zero init the batchnorm weights in the last res-block layer
- adapt the res-blocks with optional AvgPool, Conv1x1 to make them all skip-able
- do not apply weight decay to biases (I haven't done experiments with this one yet; see the sketch after this list for this and the zero-init trick)
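A minimal sketch of the first and third trick; the `bn3` attribute name for the last batchnorm in a block is a hypothetical placeholder and needs to match the actual block implementation:

```python
import torch.nn as nn

def zero_init_last_bn(model):
    # Zero the gamma of the last batchnorm in each res-block so the block starts
    # out close to identity. `bn3` is a hypothetical attribute name for that last
    # batchnorm layer; adapt it to the actual block class.
    for m in model.modules():
        if hasattr(m, "bn3") and isinstance(m.bn3, nn.BatchNorm2d):
            nn.init.zeros_(m.bn3.weight)

def split_weight_decay(model, weight_decay=1e-5):
    # Build optimizer parameter groups so biases (and other 1d parameters such as
    # batchnorm gamma/beta) get no weight decay, while regular weights keep it.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.dim() == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# e.g. torch.optim.SGD(split_weight_decay(model), lr=0.1, momentum=0.9)
```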
Especially the Bag of Tricks ResNet-D (Figure 2 c) looks very interesting for EfficientNets.
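Here is a sketch of what such an AvgPool + Conv1x1 shortcut could look like for our bottleneck blocks; how exactly it plugs into the existing EfficientNet block is my assumption, not taken from the paper's code:

```python
import torch.nn as nn

class DownsampleShortcut(nn.Module):
    # ResNet-D style projection shortcut (Bag of Tricks, Figure 2 c): an AvgPool
    # handles the stride, a 1x1 conv plus batchnorm matches the channel count, so
    # stride=2 or channel-changing blocks can carry a residual as well.
    def __init__(self, in_channels, out_channels, stride):
        super().__init__()
        layers = []
        if stride > 1:
            layers.append(nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True))
        if in_channels != out_channels:
            layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
            layers.append(nn.BatchNorm2d(out_channels))
        self.shortcut = nn.Sequential(*layers)

    def forward(self, x):
        return self.shortcut(x)

# inside a bottleneck block: out = branch(x) + DownsampleShortcut(c_in, c_out, stride)(x)
```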
Below are statistics for my EfficientNets and their Bottleneck blocks. You can see that some of these blocks are not skip-able because either the spatial dimensions or the number of channels do not match within the res-block, e.g. in the stride=2 blocks.
The skip ratio is the fraction of blocks in which we can add the residual (skip connection). Note how in the smaller EfficientNets we are missing skip connections for almost half of the blocks!
The Bag of Tricks ResNet-D (Figure 2 c) adaption (adding AvgPool + Conv1x1 to make blocks skip-able) could be especially beneficial for the small EfficientNet models.
```
EfficientNet0 {'n': 16, 'has_skip': 9,  'not_has_skip': 7, 'skip_ratio': 0.5625}
EfficientNet1 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet2 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet3 {'n': 26, 'has_skip': 19, 'not_has_skip': 7, 'skip_ratio': 0.7309}
EfficientNet4 {'n': 32, 'has_skip': 25, 'not_has_skip': 7, 'skip_ratio': 0.7813}
EfficientNet5 {'n': 39, 'has_skip': 32, 'not_has_skip': 7, 'skip_ratio': 0.8205}
EfficientNet6 {'n': 45, 'has_skip': 38, 'not_has_skip': 7, 'skip_ratio': 0.8444}
EfficientNet7 {'n': 55, 'has_skip': 48, 'not_has_skip': 7, 'skip_ratio': 0.8727}
```
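For completeness, this is roughly how such statistics can be gathered; `stride`, `in_channels` and `out_channels` are hypothetical attribute names, adapt them to the actual block class:

```python
def skip_stats(blocks):
    # Count how many bottleneck blocks can carry a residual connection:
    # stride 1 and matching input/output channels.
    n = len(blocks)
    has_skip = sum(
        1 for block in blocks
        if block.stride == 1 and block.in_channels == block.out_channels
    )
    return {
        "n": n,
        "has_skip": has_skip,
        "not_has_skip": n - has_skip,
        "skip_ratio": round(has_skip / n, 4),
    }
```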
I’m also looking into progressive growing to initialize models. So far I transplant only the convolutional weights from the smaller model after regularly initializing the bigger model. I’m wondering if we can and should also transplant the remaining layers, e.g. batchnorm, and whether that would give us a better initialization for training.
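A minimal sketch of that transplant step, assuming parameter names line up between the two models and that "conv" appears in the names of the convolutional weights (both are assumptions about the implementation):

```python
import torch

def transplant_convs(small_state_dict, big_model):
    # Copy only the convolutional weights from a smaller trained model into a
    # freshly initialized bigger one wherever name and shape match; batchnorm and
    # everything else keep their regular init. The "conv" name filter is an
    # assumption about how the layers are named in the state dict.
    target = big_model.state_dict()
    copied = {
        key: value for key, value in small_state_dict.items()
        if "conv" in key and key in target and value.shape == target[key].shape
    }
    target.update(copied)
    big_model.load_state_dict(target)
    return sorted(copied)

# e.g. transplant_convs(torch.load("efficientnet-b0.pth"), efficientnet_b1)
```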
Best,
Daniel J H