EfficientNet

daniel-j-h · June 26, 2019, 10:24pm

Hey folks - for the last week I’ve worked on implementing the EfficientNet family of models in my evenings and I’m now starting to experiment with some of their modifications. Here’s what I found out so far that I want to share with you since there is not a lot of information out there regarding EfficientNets.

Here is my repository if you want to follow along or check it out for inspiration. It’s MIT licensed and I’m happy for feedback and suggestions: https://github.com/daniel-j-h/efficientnet

References

https://arxiv.org/abs/1905.11946 EfficientNet. This is the main paper you want to follow. When they talk about techniques such as Squeeze-and-Excitation and MBConv read the papers below.
https://arxiv.org/abs/1801.04381 MobileNet V2. The EfficientNet’s basic building block (inverted residual linear bottlekneck, simply called “MBConv”) is taken from this paper. To understand MBConv blocks you want to read and understand the MobileNet V2 paper with a focus on the narrow-wide-narrow blocks with depthwise separable convolutions.
https://arxiv.org/abs/1905.02244 MobileNet V3. While the EfficientNet paper only briefly mentions the Squeeze-and-Excitation blocks, the MobileNetV3 paper actually explains where and how to add them to the MBConv blocks. Similar with the swish activation function: the EfficientNet paper only briefly mentions it; the official EfficientNet implementation uses it by default. In addition the MobileNetV3 paper seems to come with more tricks which could be applied to the EfficientNets: in Figure 5 they show how to re-do the last stages to be more efficient; and they explain why they only use 16 filters instead of 32 in the head. I haven’t tested this in the EfficientNet architecture so far.
https://arxiv.org/abs/1709.01507 Squeeze-and-Excitation (cSE). I would call it a simple (but effective) form of attention. There is a follow-up paper https://arxiv.org/abs/1803.02579 introducing a similar block (sSE) and they show how a combination of both (cSE + sSE = scSE block) gives amazing results. I’m seeing good results in an unrelated segmentation project using these scSE blocks. Check out at least the cSE paper for EfficientNet.

Implementation

Depthwise separable convolutions in PyTorch can be expressed using the groups parameter as in: nn.Conv2d(expand, expand, groups=expand, ..).
When the paper is talking about MBConv1 or MBConv6 they mean MBConv with an expansion factor of 1 and 6, respectively. The MBConv1 block does not come with the first expansion 1x1 conv since there is nothing to expand (expansion factor 1); this block starts with a depthwise separable convolution immediately.
In a layer there is a sequence of n MBConv blocks. When the EfficientNet paper talks about e.g. a stride=2 layer they mean: the first MBConv in the sequence implements a stride=2 in the depthwise convolution, all the following n-1 MBConv blocks implement stride=1.
The skip connections in the MBConv blocks are only possible for blocks whith stride=1 (so the in and out spatial resolution is the same, and number of in channels and out channels are the same).
The official EfficientNet implementation at some point was not using using drop connect. This was a bug.
The official EfficientNet implementation at some point had their stride=1 vs stride=2 blocks mixed up compared to the paper. This was a bug (in the paper).
The EfficientNet paper is all about scaling depth, width, and resolution. But they never tell you about the engineering tricks for how to actually scale depth, and width. The official EfficientNet implementation snaps to roughly the nearest multiple of eigth for width, most likely because implementations such as cudnn like multiple of eight sized channels.

Experiments

I’m using https://github.com/pytorch/examples/tree/master/imagenet for a no bells and whistles training setup. I’ll switch to the fastai training setup at some point but for now I want to keep it simple.

For a dataset I’m using https://github.com/fastai/imagenette#imagewoof for quick iteration and training (most) EfficientNet models. I started with ImageNette but it was too easy of a dataset for this task - especially for the bigger models.

What I found out:

With ReLU6 as activation function (instead of Swish), and without Squeeze-and-Excitation blocks, the EfficientNets are already very competitive on ImageWoof (even the smallest EfficientNetB0) even though I’m using the simple training script.
When I add the Swish activation function instead of ReLU6 and scSE blocks I have to train 4-5 times as long. I don’t fully understand why this happens but the Google TPU docs say they have to train EfficientNets for 350 epochs (instead of the default 90) on ImageNet.
With Swish and scSE blocks Acc@1 dropped rapidly for the EfficientNetB0 by 20 percent points.
With Swish and cSE blocks Acc@1 is a bit worse than no Squeeze-and-Excitation blocks at all. It looks like the scSE block (and especially the sSE block) is the problem here. But I can not explain why yet.

My intuition tells me the Squeeze-and-Excitation (attention) blocks might just not be necessary for a dataset such as ImageWoof. We need to re-do these experiments on a full-blown ImageNet dataset.

For now my implementation uses ReLU6 and no Squeeze-and-Excitation blocks until I can confirm their benefits on larger datasets.

Open Questions

The EfficientNet paper is using DropConnect instead of Dropout. I haven’t benchmarked it yet and my implementation is simply using Dropout for now. How much does it help?
ImageNet: how easy is it to train the models on ImageNet. How does Swish vs. ReLU6 behave? Do cSE blocks help? scSE blocks? For ImageWoof neither Swish nor Squeeze-and-Excitation seems to make a difference for my simple setup.
Progressive Growing: can we speed up training the EfficientNet models by starting to train EfficientNetB0, then use it to initialize EfficientNetB1, and so on? Does it help speed up training? How do the resulting models compare vs. the same model trained from scratch without progressive growing?
MobileNetV3: can we apply the paper’s tricks (re-do last stages, and reduce filters in head).
to EfficientNets?
The MobileNetV3 and EfficientNet paper came out roughly at the same time, are very similar in building blocks, overall design, and the authors even contributed to both papers. Yet not one of these papers mentions the other one at all. Why is that?

Hope that helps,
Daniel J H