This repo has B4-B5


I tried EfficientNet-B2 on the cats and dogs classification task with image size 260.
But the error rate isn’t as good as with the resnet30. Is it because I didn’t use concat pooling, or because my batch size is smaller?

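import torch.nn as nn
from efficientnet_pytorch import EfficientNet  # assuming the efficientnet_pytorch package

model_name = 'efficientnet-b2'
def getModel():
    # pretrained backbone, including its ImageNet classifier
    model = EfficientNet.from_pretrained(model_name)
    # the backbone's final linear layer maps 1408 features to 1000 outputs
    rel1 = nn.ReLU(inplace=True)
    bn1 = nn.BatchNorm1d(1000)
    drop1 = nn.Dropout(0.25)
    lin2 = nn.Linear(1000, 512)
    rel2 = nn.ReLU(inplace=True)
    bn2 = nn.BatchNorm1d(512)
    drop2 = nn.Dropout(0.5)
    # data.c is the number of classes in the fastai data object
    lin3 = nn.Linear(512, data.c)
    return nn.Sequential(model, rel1, bn1, drop1,
                         lin2, rel2, bn2, drop2,
                         lin3)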

Hey folks - for the last week I’ve worked on implementing the EfficientNet family of models in my evenings and I’m now starting to experiment with some of their modifications. Here’s what I found out so far that I want to share with you since there is not a lot of information out there regarding EfficientNets.

Here is my repository if you want to follow along or check it out for inspiration. It’s MIT licensed and I’m happy for feedback and suggestions: https://github.com/daniel-j-h/efficientnet


  • https://arxiv.org/abs/1905.11946 EfficientNet. This is the main paper you want to follow. When they talk about techniques such as Squeeze-and-Excitation and MBConv read the papers below.

  • https://arxiv.org/abs/1801.04381 MobileNet V2. The EfficientNet’s basic building block (inverted residual linear bottleneck, simply called “MBConv”) is taken from this paper. To understand MBConv blocks you want to read and understand the MobileNet V2 paper with a focus on the narrow-wide-narrow blocks with depthwise separable convolutions.

  • https://arxiv.org/abs/1905.02244 MobileNet V3. While the EfficientNet paper only briefly mentions the Squeeze-and-Excitation blocks, the MobileNetV3 paper actually explains where and how to add them to the MBConv blocks. Similar with the swish activation function: the EfficientNet paper only briefly mentions it; the official EfficientNet implementation uses it by default. In addition the MobileNetV3 paper seems to come with more tricks which could be applied to the EfficientNets: in Figure 5 they show how to re-do the last stages to be more efficient; and they explain why they only use 16 filters instead of 32 in the head. I haven’t tested this in the EfficientNet architecture so far.

  • https://arxiv.org/abs/1709.01507 Squeeze-and-Excitation (cSE). I would call it a simple (but effective) form of attention. There is a follow-up paper https://arxiv.org/abs/1803.02579 introducing a similar block (sSE) and they show how a combination of both (cSE + sSE = scSE block) gives amazing results. I’m seeing good results in an unrelated segmentation project using these scSE blocks. Check out at least the cSE paper for EfficientNet.


  • The depthwise part of a depthwise separable convolution can be expressed in PyTorch using the groups parameter, as in: nn.Conv2d(expand, expand, groups=expand, ..).

  • When the paper is talking about MBConv1 or MBConv6 they mean MBConv with an expansion factor of 1 and 6, respectively. The MBConv1 block does not come with the first expansion 1x1 conv since there is nothing to expand (expansion factor 1); this block starts with a depthwise separable convolution immediately.

  • In a layer there is a sequence of n MBConv blocks. When the EfficientNet paper talks about e.g. a stride=2 layer they mean: the first MBConv in the sequence implements a stride=2 in the depthwise convolution, all the following n-1 MBConv blocks implement stride=1.

  • The skip connections in the MBConv blocks are only possible for blocks with stride=1 and matching in and out channels (so the input and output have the same spatial resolution and the same number of channels).

  • The official EfficientNet implementation at some point was not using drop connect. This was a bug.

  • The official EfficientNet implementation at some point had their stride=1 vs stride=2 blocks mixed up compared to the paper. This was a bug (in the paper).

  • The EfficientNet paper is all about scaling depth, width, and resolution. But they never tell you about the engineering tricks for how to actually scale depth and width. The official EfficientNet implementation snaps width to roughly the nearest multiple of eight, most likely because implementations such as cuDNN like channel counts that are multiples of eight.
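To make the block structure from the notes above concrete, here is a minimal PyTorch sketch of an MBConv block and a layer of n blocks. It is not taken from any particular implementation: the names MBConv and make_layer are hypothetical, the kernel size is fixed to 3x3 for brevity, and it uses ReLU6 without Squeeze-and-Excitation or drop connect.

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: (1x1 expand) -> 3x3 depthwise -> 1x1 linear projection."""

    def __init__(self, in_ch, out_ch, expansion, stride):
        super().__init__()
        mid_ch = in_ch * expansion
        layers = []
        if expansion != 1:
            # Expansion 1x1 conv; left out for MBConv1 since there is nothing to expand.
            layers += [nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                       nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True)]
        layers += [
            # Depthwise conv: groups == channels; this conv carries the block's stride.
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True),
            # Linear bottleneck: 1x1 projection with no activation afterwards.
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        # The residual only fits when spatial size and channel count are unchanged.
        self.has_skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.has_skip else out


def make_layer(in_ch, out_ch, expansion, stride, n):
    """Sequence of n MBConv blocks; only the first one changes stride and channels."""
    blocks = [MBConv(in_ch, out_ch, expansion, stride)]
    blocks += [MBConv(out_ch, out_ch, expansion, 1) for _ in range(n - 1)]
    return nn.Sequential(*blocks)
```

For example, make_layer(16, 24, expansion=6, stride=2, n=2) halves the spatial resolution in its first block only, and only its second block gets a skip connection.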


I’m using https://github.com/pytorch/examples/tree/master/imagenet for a no bells and whistles training setup. I’ll switch to the fastai training setup at some point but for now I want to keep it simple.

For a dataset I’m using https://github.com/fastai/imagenette#imagewoof for quick iteration and training (most) EfficientNet models. I started with ImageNette but it was too easy of a dataset for this task - especially for the bigger models.

What I found out:

  • With ReLU6 as activation function (instead of Swish), and without Squeeze-and-Excitation blocks, the EfficientNets are already very competitive on ImageWoof (even the smallest EfficientNetB0) even though I’m using the simple training script.

  • When I add the Swish activation function instead of ReLU6 and scSE blocks I have to train 4-5 times as long. I don’t fully understand why this happens but the Google TPU docs say they have to train EfficientNets for 350 epochs (instead of the default 90) on ImageNet.

  • With Swish and scSE blocks Acc@1 dropped rapidly for the EfficientNetB0 by 20 percentage points.

  • With Swish and cSE blocks Acc@1 is a bit worse than with no Squeeze-and-Excitation blocks at all. It looks like the scSE block (and especially the sSE block) is the problem here. But I cannot explain why yet.

My intuition tells me the Squeeze-and-Excitation (attention) blocks might just not be necessary for a dataset such as ImageWoof. We need to re-do these experiments on a full-blown ImageNet dataset.

For now my implementation uses ReLU6 and no Squeeze-and-Excitation blocks until I can confirm their benefits on larger datasets.

Open Questions

  • The EfficientNet paper is using DropConnect instead of Dropout. I haven’t benchmarked it yet and my implementation is simply using Dropout for now. How much does it help?

  • ImageNet: how easy is it to train the models on ImageNet? How does Swish vs. ReLU6 behave? Do cSE blocks help? scSE blocks? For ImageWoof neither Swish nor Squeeze-and-Excitation seems to make a difference for my simple setup.

  • Progressive Growing: can we speed up training the EfficientNet models by starting to train EfficientNetB0, then use it to initialize EfficientNetB1, and so on? Does it help speed up training? How do the resulting models compare vs. the same model trained from scratch without progressive growing?

  • MobileNetV3: can we apply the paper’s tricks (re-do last stages, and reduce filters in head) to EfficientNets?

  • The MobileNetV3 and EfficientNet paper came out roughly at the same time, are very similar in building blocks, overall design, and the authors even contributed to both papers. Yet not one of these papers mentions the other one at all. Why is that?

Hope that helps,
Daniel J H


Thanks for your insights!

I am using EfficientNet for super resolution, and my experience with swish and also with squeeze-and-excite has so far been exactly the same as yours. Both barely made a difference.

I tried EfficientNet b3 with mixup and could get 93.8% on the test set for the Stanford Cars Dataset. What made a great difference for me in making the EfficientNet work was to change the optimizer to RMSprop. With the default (Adam) I got worse results than Resnet50.


@daniel-j-h The MNasNet paper was important in the jump from MobileNet-V2 to MobileNet-V3 and EfficientNet. It introduced SE to the Inverted Residual block, and as you can see MobileNet-V3 and EfficientNet were both very much written with the assumption that you read this first: https://arxiv.org/abs/1807.11626

The MNasNet-A1 vs B1 is worth looking at for an SE vs not comparison. The A1 is SE block based with some reductions in dimensions vs the B1, but it can be trained to 75.2% top-1 as per paper (I’ve managed 75.45) and the B1 is 74.5 (74.66 in my attempt). The A1 has 3.9M params and the B1 4.4M.

The attention from SE blocks typically improves parameter efficiency. Just as with SE-ResNets or ResNeXts, the IR block based networks with SE present give you a higher ratio of performance (accuracy metrics) to parameters. Unfortunately they seem to make the networks harder to train: they need more epochs (maybe more variety is needed, so bigger datasets or better augmentation helps?) and are a little slower to push the images through.

All said though, for PyTorch running on an NVIDIA GPU, I don’t think EfficientNets make sense as a go-to network for most applications. While they are parameter and FLOP efficient, when it comes to GPU memory usage and image throughput they are no faster than ResNe(X)ts, if not worse. Well trained ResNe(X)ts can achieve similar accuracy metrics with better GPU memory characteristics, higher throughput, and half the training epochs (possibly a 3-4x difference in wall-clock time).


I’ve been working on a Colab notebook to compare EfficientNets to some other well trained models, mostly ResNet based. I didn’t feel the choice of comparison models was particularly fair in the paper. Also, despite the big gains in parameter/flop efficiency, the EfficientNet models do not run faster in PyTorch or utilize less GPU memory than larger, appropriately matched ResNet peers.


Edit: A GitHub option if you don’t want to log in to Google: https://github.com/rwightman/pytorch-image-models/blob/master/notebooks/EffResNetComparison.ipynb


In my limited experience shrinking GANs the tradeoff I have noticed is as follows:

Over-parameterised networks = Large learning rate = Fast training = Slow Inference
Under-parameterised networks = Small learning rate = Slow training = Fast Inference

The choice depends on the objective. Where do you want to run it once it’s trained? On a GPU or an NPU (optimised for a particular platform, e.g. mobile)?

Having compressed over-parameterised GANs by removing entire resblocks my thought is progressive training (starting with smallest network then growing) would speed up overall training and might avoid having to resort to LR tricks. Just a thought.


For a lot of tasks, I agree this is a good rule of thumb.

For inference, the slow vs fast part isn’t quite as straightforward. Doing batched inference on a GPU, bigger networks can be as fast or faster than significantly ‘smaller’ ones by param count and FLOP count – depending on the architecture and the framework you’re on.


When working with a medical dataset, should I use U-Net, ResNet, or EfficientNet?

What are you trying to do with the images? U-Net, for example, is for segmentation and not for classification.
Also, your question is likely better posed in its own thread rather than in this one.
In general I would stick with (x)resnet for classification and retinanet for object detection. EfficientNet didn’t seem to live up to my expectations.


I want to do segmentation.

Following up from EfficientNet

@rwightman thank you for pointing out the MnasNet paper (https://arxiv.org/abs/1807.11626) it was indeed one of the missing pieces of the puzzle and explains some parts in more details (e.g. the squeeze-and-excitation blocks).

I’m currently running experiments with https://github.com/daniel-j-h/efficientnet on ImageNet. Without bells and whistles, and without swish or squeeze-and-excitation, my EfficientNetB0 reaches competitive Acc@1 results. But these experiments eat a lot of time.

I also heard from folks giving the EfficientNet models a try on mobile devices that it’s quite slow wrt. inference performance. This might be due to the backend used or the swish activation function. Maybe you folks have insights here?

From initial benchmarks on my laptop (cpu) it seems like e.g. the MobileNetV2 (existing one from torchvision.models) can easily be scaled down via its width multiplier but EfficientNetB0 is the smallest variant in that regard and we do not have coefficients to go lower. Has anyone tried scaling down EfficientNets? Or would it make more sense to go to MobileNetV3 and MobileNetV3-small directly?

What I also looked into is the Bag of Tricks paper (https://arxiv.org/abs/1812.01187) and its three insights:

  • zero init the batchnorm weights in the last res-block layer
  • adapt the res-blocks with optional AvgPool, Conv1x1 to make them all skip-able
  • do not apply weight decay to biases (haven’t done experiments with this one yet)
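As a rough illustration of the first and third insight, here is a small PyTorch sketch. The ResidualBlock is a toy module made up for this example (not an EfficientNet block), param_groups is a hypothetical helper, and following the wording of the bullet only parameters named "bias" are excluded from weight decay.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Toy residual block (hypothetical), only here to demonstrate the tricks."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        # Zero-init the last batchnorm scale so the block starts out as an identity.
        nn.init.zeros_(self.bn.weight)

    def forward(self, x):
        return x + self.bn(self.conv(x))


def param_groups(model, weight_decay):
    """Apply weight decay to everything except parameters named 'bias'."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if name.endswith("bias") else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


model = nn.Sequential(ResidualBlock(8), ResidualBlock(8))
optimizer = torch.optim.SGD(param_groups(model, 1e-4), lr=0.1, momentum=0.9)
```

With the zero-initialized batchnorm scale every residual branch outputs zeros at the start of training, so the freshly initialized model passes its input through unchanged.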

Especially the Bag of Tricks ResNet-D (Figure 2 c) looks very interesting for EfficientNets.

Below are statistics for my EfficientNets and their Bottleneck blocks. You can see how some of these blocks are not skip-able because either the spatial dimension or the number of channels do not match in the res-block e.g. in the stride=2 blocks.

The skip ratio describes the ratio of blocks in which we can add the residual (skip connection). See how in the smaller EfficientNets we are missing skip connections for almost half of the layers!

The Bag of Tricks ResNet-D (Figure 2 c) adaptation (adding AvgPool, Conv1x1 to make them skip-able) could be especially beneficial for the small EfficientNet models.

EfficientNet0 {'n': 16, 'has_skip': 9, 'not_has_skip': 7, 'skip_ratio': 0.5625}
EfficientNet1 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet2 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet3 {'n': 26, 'has_skip': 19, 'not_has_skip': 7, 'skip_ratio': 0.7308}
EfficientNet4 {'n': 32, 'has_skip': 25, 'not_has_skip': 7, 'skip_ratio': 0.7813}
EfficientNet5 {'n': 39, 'has_skip': 32, 'not_has_skip': 7, 'skip_ratio': 0.8205}
EfficientNet6 {'n': 45, 'has_skip': 38, 'not_has_skip': 7, 'skip_ratio': 0.8444}
EfficientNet7 {'n': 55, 'has_skip': 48, 'not_has_skip': 7, 'skip_ratio': 0.8727}
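A quick way to sanity check numbers like these without building the model is to walk over the stage table and count the blocks whose stride is 1 and whose in and out channels match. This is only a sketch: the stage table is the EfficientNet-B0 baseline from the paper, skip_stats is a hypothetical helper, and I am assuming depth is scaled by rounding up depth_coefficient * repeats (width scaling is left out since it does not change which blocks are skip-able).

```python
import math

# EfficientNet-B0 stages as (out_channels, repeats, stride), from Table 1 of the paper.
B0_STAGES = [(16, 1, 1), (24, 2, 2), (40, 2, 2), (80, 3, 2),
             (112, 3, 1), (192, 4, 2), (320, 1, 1)]
STEM_CHANNELS = 32


def skip_stats(depth_coefficient=1.0):
    """Count MBConv blocks and how many of them can carry a skip connection."""
    n = has_skip = 0
    in_ch = STEM_CHANNELS
    for out_ch, repeats, stride in B0_STAGES:
        repeats = math.ceil(depth_coefficient * repeats)
        for i in range(repeats):
            # Only the first block of a stage applies the stage's stride.
            block_stride = stride if i == 0 else 1
            n += 1
            if block_stride == 1 and in_ch == out_ch:
                has_skip += 1
            in_ch = out_ch
    return {"n": n, "has_skip": has_skip, "not_has_skip": n - has_skip,
            "skip_ratio": round(has_skip / n, 4)}


print("EfficientNet0", skip_stats(1.0))
print("EfficientNet1", skip_stats(1.1))
```

The first block of every stage changes either the resolution or the channel count, which is where the constant 7 non-skip-able blocks come from; scaling depth only adds skip-able blocks.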

I’m also looking into progressive growing to initialize models. So far I’m transplanting only the convolutional weights from the smaller model after regularly initializing the bigger models. I’m wondering if we can and should also transplant the remaining layers e.g. batchnorm and if it would give us a better initialization for training.
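For illustration, transplanting only the convolutional weights could look roughly like the sketch below. It assumes both models use the same module names for the layers they share and simply skips everything whose name or shape does not match; transplant_conv_weights is a hypothetical helper and not necessarily how the repository does it.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def transplant_conv_weights(small, big):
    """Copy Conv2d weights from small into big where module name and shape match."""
    small_convs = {name: module for name, module in small.named_modules()
                   if isinstance(module, nn.Conv2d)}
    copied = []
    for name, module in big.named_modules():
        source = small_convs.get(name)
        if (isinstance(module, nn.Conv2d) and source is not None
                and source.weight.shape == module.weight.shape):
            module.weight.copy_(source.weight)
            copied.append(name)
    return copied


# Toy stand-ins for a smaller and a bigger model.
small = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
big = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.Conv2d(8, 8, 3))
copied = transplant_conv_weights(small, big)
```

Transplanting the batchnorm layers as well would mean extending the isinstance check and also copying the running statistics, which live in buffers rather than parameters.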

Daniel J H


@daniel-j-h MobileNet-V3 and MNasNet are essentially the mobile versions of EfficientNet (same building blocks, optimized for mobile). With that in mind, I’m not sure how important it is to make EfficientNet more mobile friendly. For larger/powerful mobile devices, perhaps – the biggest hit is those sigmoids, so hard-swish/hard-sigmoid as per MNV3 and reduction in number of SE blocks would be the biggest win.
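For reference, this is what the hard variants from the MobileNetV3 paper look like next to the regular swish, written as plain scalar Python so it runs without any framework. The function names are just for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    # Needs an exponential per activation, which is the expensive part.
    return x * sigmoid(x)


def relu6(x):
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    # Piecewise linear approximation of the sigmoid: ReLU6(x + 3) / 6.
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    return x * hard_sigmoid(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, round(swish(x), 4), round(hard_swish(x), 4))
```

The hard versions only need an add, a clamp, and a multiply, so they avoid the exponential entirely.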

Some Caffe2 benchmark numbers from exported EfficientNet, MobileNet-V3, MNasNet. Note the sigmoid runtime cost (Intel CPU) vs the mobile specific networks.


As a GPU based DL practitioner, I’d like to see a GPU targeted NAS search. A lot of the NAS based models are aiming to reduce parameter count, FLOP count, improve accuracy, improve latency on mobile devices for an inference optimized network, or runtime metrics on TPU for XLA optimized graph. This has resulted in some useful networks, but they aren’t a particularly great match for GPU training and inference. At comparable accuracy levels, ResNets often still beat them when it comes to GPU memory consumption while training or running inference.

I haven’t finished this experiment, but I explored it for feasibility. I went through an EfficientNet config and tweaked the widths and factors. I basically made them divisible by larger powers of 2, switched the expansion factors to 4 and 8, and used fewer distinct expansion factors. I haven’t trained this monster, but I did a quick check of param count vs PyTorch CUDA memory consumption. With my 15 min hack fest I confirmed I could increase param count by 20% and reduce practical GPU memory usage by about that much. With some more thought, or a NAS search, I think those could be pushed further apart.
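As a sketch of what snapping widths to a larger power of 2 can look like: the helper below rounds a channel count to the nearest multiple of a divisor without letting it drop by more than 10%. The name make_divisible and the 10% rule are assumptions modeled on the helper commonly seen in MobileNet-style code, the widths are the EfficientNet-B0 stage widths from the paper, and none of this reflects the actual config changes described above.

```python
def make_divisible(channels, divisor=8):
    """Round channels to the nearest multiple of divisor, never dropping below 90%."""
    rounded = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * channels:
        rounded += divisor
    return rounded


b0_widths = [16, 24, 40, 80, 112, 192, 320]
widths_16 = [make_divisible(w, 16) for w in b0_widths]
widths_32 = [make_divisible(w, 32) for w in b0_widths]
print(widths_16)
print(widths_32)
```

Larger divisors mostly widen the early, narrow stages, so the parameter count goes up even though the channel counts become friendlier for the GPU kernels.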


I want to reopen the discussion regarding the performance of EfficientNet. Based on @rwightman’s and @Seb’s results it seems that EfficientNets are harder to train and require more epochs to reach similar performance to ResNet50.

However, in current Kaggle competitions, there seems to be a rise in the usage of pretrained EfficientNets, with seemingly successful results. This is even with the fastai library. I have yet to try out EfficientNets for these competitions so I do not know the details.

Have subsequent experiments shown success of pretrained EfficientNets?


@ilovescience It’s worth noting that my comments were for training from scratch, on ImageNet. Fine-tuning from the pretrained model is a different exercise that is typically less sensitive to h-params and requires fewer epochs.

The EffNets are capable models, so no doubt they can produce solid results for Kaggle or other pretraining exercises. I just wouldn’t recommend them as a first choice on any given problem since they are more challenging to work with and a lot of their ‘efficiency’ benefit isn’t realized with PyTorch + GPUs.


The Google AI team posted new EfficientNet weights with AutoAugment training. Improvements across the board, but only .1% top-1 by B7. However, B6 and B7 weights are included this time round.

I did a quick update of weights for my PyTorch impl: https://github.com/rwightman/pytorch-image-models

B6 & B7 are absolute monsters to run so haven’t updated all validation scores just yet.

Also interesting are the MixNet models. A promising idea to explore in other architectures. I had a shelved ResNeXt-like model with bottleneck 3x3 replaced by groups of 2 * 3x3, 5x5, and 7x7. It was converging nicely but I dropped it for other training priorities before completing. Must revisit now.


@rwightman your repo is so great! :slight_smile: One thing that could make it even better would be to add a training time column to the table of models that you’ve trained yourself. And maybe also max GPU memory use? Param count isn’t always a great proxy, as you know!


@jeremy thanks!

Retroactively, the training time is going to be a challenge. The way I train right now, I use a few different local machines of widely differing capabilities, best is probably at least 4x the worst. So, there’d be varying configs. Other challenge is that I often kill a training session and resume later if priorities change and I need to use a machine for another task.

Something I thought would be useful for this and other reproducibility challenges is to keep a train history embedded in the training checkpoint. Basically something that records timestamp + hparam delta whenever an hparam or scheduled hparam changes. Embedded in the checkpoints it won’t be lost across stop/restart or even when moving to different machines. It could later be parsed and used to exactly replay training for a given model despite manual tweaks. So, an ‘hparam replay buffer’ of sorts.
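To sketch what such an embedded history could look like: the class below records a timestamped delta whenever a tracked hparam changes and can be stored in the checkpoint dict next to the model weights. HparamHistory and its methods are hypothetical names for this example, and it tracks training steps rather than anything framework specific.

```python
import time


class HparamHistory:
    """Append-only log of hparam changes, meant to be saved inside a checkpoint."""

    def __init__(self, events=None):
        self.events = list(events or [])
        self.current = {}
        for event in self.events:
            self.current.update(event["delta"])

    def record(self, step, **hparams):
        # Only store what actually changed since the last recorded state.
        delta = {k: v for k, v in hparams.items()
                 if k not in self.current or self.current[k] != v}
        if delta:
            self.events.append({"time": time.time(), "step": step, "delta": delta})
            self.current.update(delta)

    def replay(self, step):
        """Return the hparams that were in effect at the given step."""
        state = {}
        for event in self.events:
            if event["step"] <= step:
                state.update(event["delta"])
        return state

    def state_dict(self):
        return {"events": self.events}


history = HparamHistory()
history.record(0, lr=0.1, weight_decay=1e-4)
history.record(100, lr=0.1, weight_decay=1e-4)  # nothing changed, nothing logged
history.record(200, lr=0.01)                    # manual tweak after a restart
checkpoint = {"hparam_history": history.state_dict()}
restored = HparamHistory(checkpoint["hparam_history"]["events"])
```

Because the log only contains plain dicts, lists, and numbers, it survives whatever serialization the checkpoint itself uses.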

Max gpu usage is on the list of todos. I need to experiment more with appropriate sampling points and rates to capture a representative value. Capturing the state during validation/inference is usually reliable, but I see more variability and sometimes epoch to epoch fluctuation in training, especially with AMP enabled.
