EfficientNet

Sorry if I missed it in the thread, but has anyone got imagenette or imagewoof results from scratch with this yet? If so, how does it compare to resnet for training speed and accuracy?

It would be faster, but I could not get fp16 to work with TensorFlow even using NVIDIA's containers. (Since my target device was the USB TPU, I had to stick with TensorFlow.)

I was going to use ResNet50 as a baseline for some experimenting with compressed-domain input from JPEG, as in this paper. Since TensorFlow was less familiar territory, I wanted to stick with fp32 to prevent errors on my part.

I want to replicate that same paper with EfficientNet and see if there is a further speedup in inference.

So far it looks slower and doesn't converge as fast as xresnet50 in 5 epochs. I thought I was making mistakes in my implementation, but I tried another GitHub implementation as well and wasn't too impressed by the initial results.

Going to wait and see if others end up striking gold with EfficientNet.


I thought this was an interesting post:

Re: the FLOPs issue noted above. One possible source of FLOP discrepancy between the paper and counting via PyTorch could arise if the paper authors are counting after they've exported and optimized the model for inference. Batch norm will usually be folded into the convolutions at this stage, and that would reduce FLOPs.
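
For context on that folding: batch norm's affine transform can be merged into the preceding convolution's weights and bias, so the BN layer's operations simply disappear from the FLOP count. A minimal sketch of the idea (my own illustration, the function name is made up):

import torch

def fold_bn_into_conv(conv, bn):
    # per-output-channel scale from the batch norm statistics
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # fused conv computes: scale * (W x + b - mean) + beta
    fused_weight = conv.weight * scale.reshape(-1, 1, 1, 1)
    fused_bias = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused_weight, fused_bias

Counting FLOPs on the fused model then skips the per-channel multiplies and adds that a separate BN layer would contribute.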


A comment on the training of these networks, since I've spent time trying to train other models leading up to EfficientNets, namely MNASNet and MobileNetV3. All of these have evolved from the block structure of MobileNet-V1 and V2, which were then thrown through the blender with NAS algorithms to choose expansion factors, SE on/off, channel counts, repeat counts, activation functions, and other scaling factors in the likes of MNASNet, FBNet, ChamNet, Single-Path NAS, MobileNetV3, and now EfficientNet. The striding and resolution stage progression in all of these is pretty much the same.

None of the latest iterations of these networks have been particularly easy to train. I've spent time trying to reproduce MNASNet-A1/B1 and MobileNet-V3. I only just managed to match (slightly beat) MobileNet-V3 from scratch after a few weeks of experiments leading up to another week for the final train on a 3-GPU system.

The only scheme that's got me close to the paper results is following the paper h-params as closely as possible. Basically, use RMSprop with learning rates scaled appropriately (for my GPU count/batch size) and a whole lot of epochs (350 or more). In addition to those h-params I replicated the TF RMSprop behaviour. They use a really high eps (0.001), and TF differs from PyTorch in where the eps is applied in relation to the sqrt (TF adds it inside the sqrt, PyTorch adds it outside). With that alone, I got within 1.5-2% of the stated results north of 400 epochs. The last bit is using an exponential moving average of the model weights. This is a common thing in TF training setups (tf.train.ExponentialMovingAverage), less so in other frameworks. I recently implemented my own variant of this in PyTorch. It finally got me to the goal. I was nudged to finally do this by looking at the EfficientNet checkpoints: unlike MNASNet, they include the full training checkpoints, so you can see the gap between the EMA values and the current epoch yourself; it is significant and matched the gap between my training efforts and the paper.
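
For anyone who wants the gist before digging into code, a stripped-down sketch of the idea (not exactly what's in my branch, the class name is made up):

import copy
import torch

class ModelEMA:
    """Keeps a shadow copy of the model whose weights are an exponential moving
    average of the live model's weights, in the spirit of tf.train.ExponentialMovingAverage."""
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow = decay * shadow + (1 - decay) * current, for floating-point
        # params and buffers; integer buffers (e.g. BN num_batches_tracked) are copied
        for ema_v, v in zip(self.ema.state_dict().values(), model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                ema_v.copy_(v)

Usage: call ema.update(model) after every optimizer step, and run validation on ema.ema as well as on the live model.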

A quick spin of EfficientNet-B0 shows it's pretty similar. I'm almost done fine-tuning a set of B0 weights from the TF weights that will work in PyTorch without padding hacks or a non-default BN epsilon. Fine-tuning is also sensitive; it's easy to send those nice numbers sliding off a cliff.

I do plan to revisit some of the other training setups I tried, now with a decent EMA impl at my disposal. I did manage to get within 1.5-2% of MNASNet using cosine LR decay with random erasing instead of dropout in ~150-180 epochs. The rapid drop of cosine decay is probably going to cause some problems, but I'm curious to see whether, with a nice flat cool-down period, the EMA produces a nice set of weights... I guess this is getting pretty close to SWA, which is also worth an experiment here.

I'd love to hear any results people come up with. More network architectures optimized for efficiency, especially for deployment on specific hardware platforms/accelerators, are guaranteed to keep coming. Training efficiency will probably not make the priority list of things to optimize for...


Great to see you here RWightman! I reviewed your codebase on github for my EffNet implementation :slight_smile:

This is a common thing in TF training setups (tf.train.ExponentialMovingAverage), less so in other frameworks. I recently implemented my own variant of this in PyTorch. It finally got me to the goal.

Very interesting, thanks for sharing this info. Is your implementation publicly available?

I have used comparable building blocks (combinations of 1x1 and separable/group convolutions) for image denoising and super resolution. The number of parameters could be reduced immensely, but exactly as stated, training became slower and the usual tricks weren't as helpful. The learning rate finder has been useful so far, but cosine learning rate schedules and the like haven't worked for me yet, meaning there wasn't a meaningful improvement in training time. But I have to admit that I haven't systematically experimented with them yet!


Very interesting, thanks for sharing this info. Is your implementation publicly available?

I just pushed it to a branch. I want to clean up some of the handling of the model init sequence vs. checkpoint load handling (if I can). Things get a bit complicated maintaining a separate set of weights (and being able to run validation on them) while using things like distributed training, DataParallel, AMP, etc.

It'll use some GPU memory, but I experimented with a flag that keeps the EMA weights on the CPU only; you then have to validate those results manually from the checkpoints. Also, pay attention to the decay factor and how it relates to your batch size (update count per epoch). Google trains on TPUs with big batches and uses 0.9999; on less capable systems you'll want to reduce that unless you feel averaging over 10 epochs is useful :slight_smile: The 'N-day' EMA equivalence formula is useful for making that adjustment sensible. I'm using 0.9998 right now.
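
To make that adjustment concrete, a throwaway helper (my own quick sketch, name made up) using the N-period equivalence, decay = 1 - 2 / (N + 1):

def ema_span_epochs(decay, updates_per_epoch):
    # 'N-period EMA' equivalence: decay = 1 - 2 / (N + 1)  =>  N = 2 / (1 - decay) - 1
    n_updates = 2.0 / (1.0 - decay) - 1.0
    return n_updates / updates_per_epoch

# e.g. ema_span_epochs(0.9999, 2000) is roughly 10 epochs,
# while ema_span_epochs(0.9998, 2000) is roughly 5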


Maybe interesting for others who want to understand the EfficientNet architecture in detail:


I got the code working with a different dataset.

Anyone got PyTorch pretrained weights for B4-B7? I need to test my implementation of EfficientNets.

This repo has B4-B5


I tried to use EfficientNet-B2 on the cats and dogs classification with image size 260, but the error rate isn't as good as with the resnet30. Is it because I didn't use concat pooling, or because my batch size is smaller?

import torch.nn as nn
from efficientnet_pytorch import EfficientNet  # assuming the efficientnet_pytorch package

model_name = 'efficientnet-b2'

def getModel():
    # pretrained backbone; its built-in classifier maps 1408 features -> 1000 logits
    model = EfficientNet.from_pretrained(model_name)
    rel1 = nn.ReLU(inplace=True)
    bn1 = nn.BatchNorm1d(1000)
    drop1 = nn.Dropout(0.25)

    lin2 = nn.Linear(1000, 512)
    rel2 = nn.ReLU(inplace=True)
    bn2 = nn.BatchNorm1d(512)
    drop2 = nn.Dropout(0.5)

    # final layer maps down to the number of classes (data.c from the fastai data object)
    lin3 = nn.Linear(512, data.c)

    return nn.Sequential(model, rel1, bn1, drop1,
                         lin2, rel2, bn2, drop2,
                         lin3)

Hey folks - for the last week I've been implementing the EfficientNet family of models in my evenings, and I'm now starting to experiment with some modifications to them. Here's what I've found out so far and want to share with you, since there is not a lot of information out there regarding EfficientNets.

Here is my repository if you want to follow along or check it out for inspiration. It's MIT licensed and I'm happy to get feedback and suggestions: https://github.com/daniel-j-h/efficientnet

References

  • https://arxiv.org/abs/1905.11946 EfficientNet. This is the main paper you want to follow. When they talk about techniques such as Squeeze-and-Excitation and MBConv, read the papers below.

  • https://arxiv.org/abs/1801.04381 MobileNet V2. The EfficientNet's basic building block (the inverted residual linear bottleneck, simply called "MBConv") is taken from this paper. To understand MBConv blocks you want to read and understand the MobileNet V2 paper, with a focus on the narrow-wide-narrow blocks with depthwise separable convolutions.

  • https://arxiv.org/abs/1905.02244 MobileNet V3. While the EfficientNet paper only briefly mentions the Squeeze-and-Excitation blocks, the MobileNetV3 paper actually explains where and how to add them to the MBConv blocks. It is similar with the swish activation function: the EfficientNet paper only briefly mentions it, while the official EfficientNet implementation uses it by default. In addition, the MobileNetV3 paper seems to come with more tricks that could be applied to the EfficientNets: in Figure 5 they show how to re-do the last stages to be more efficient, and they explain why they use only 16 filters instead of 32 in the head. I haven't tested this in the EfficientNet architecture so far.

  • https://arxiv.org/abs/1709.01507 Squeeze-and-Excitation (cSE). I would call it a simple (but effective) form of attention. There is a follow-up paper https://arxiv.org/abs/1803.02579 introducing a similar block (sSE), and they show how a combination of both (cSE + sSE = scSE block) gives amazing results. I'm seeing good results with these scSE blocks in an unrelated segmentation project. Check out at least the cSE paper for EfficientNet.

Implementation

  • Depthwise separable convolutions in PyTorch can be expressed using the groups parameter, as in nn.Conv2d(expand, expand, groups=expand, ...) (see the sketch after this list).

  • When the paper is talking about MBConv1 or MBConv6 they mean MBConv with an expansion factor of 1 and 6, respectively. The MBConv1 block does not come with the first expansion 1x1 conv since there is nothing to expand (expansion factor 1); this block starts with a depthwise separable convolution immediately.

  • In a layer there is a sequence of n MBConv blocks. When the EfficientNet paper talks about e.g. a stride=2 layer they mean: the first MBConv in the sequence implements a stride=2 in the depthwise convolution, all the following n-1 MBConv blocks implement stride=1.

  • The skip connections in the MBConv blocks are only possible for blocks with stride=1, so that the input and output spatial resolutions are the same and the numbers of input and output channels match.

  • The official EfficientNet implementation at some point was not using drop connect. This was a bug.

  • The official EfficientNet implementation at some point had its stride=1 vs. stride=2 blocks mixed up compared to the paper. This was a bug (in the paper).

  • The EfficientNet paper is all about scaling depth, width, and resolution, but it never tells you the engineering tricks for how to actually scale depth and width. The official EfficientNet implementation snaps widths to roughly the nearest multiple of eight, most likely because libraries such as cuDNN like channel counts that are multiples of eight.
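
To tie the points above together, here is a stripped-down sketch of an MBConv block, a stride-2 layer, and the channel rounding, assuming ReLU6, no Squeeze-and-Excitation, and no drop connect (this is illustrative only, not my full implementation; see the repository for the real thing):

import torch.nn as nn

def round_channels(channels, divisor=8):
    # snap channel counts to a multiple of `divisor`, never dropping more than ~10%
    rounded = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * channels:
        rounded += divisor
    return rounded

class MBConv(nn.Module):
    def __init__(self, c_in, c_out, expansion, kernel_size=3, stride=1):
        super().__init__()
        expand = c_in * expansion
        # skip connection only when spatial size and channel count are unchanged
        self.use_skip = stride == 1 and c_in == c_out

        layers = []
        if expansion != 1:
            # 1x1 expansion conv; MBConv1 (expansion factor 1) omits this
            layers += [nn.Conv2d(c_in, expand, 1, bias=False),
                       nn.BatchNorm2d(expand),
                       nn.ReLU6(inplace=True)]
        # depthwise conv: groups equals the number of channels
        layers += [nn.Conv2d(expand, expand, kernel_size, stride=stride,
                             padding=kernel_size // 2, groups=expand, bias=False),
                   nn.BatchNorm2d(expand),
                   nn.ReLU6(inplace=True)]
        # 1x1 linear projection back down, no activation
        layers += [nn.Conv2d(expand, c_out, 1, bias=False),
                   nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

def make_layer(c_in, c_out, n_blocks, expansion=6, stride=2):
    # only the first block in a layer strides; the remaining n-1 blocks keep stride=1
    blocks = [MBConv(c_in, c_out, expansion, stride=stride)]
    blocks += [MBConv(c_out, c_out, expansion) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)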

Experiments

I'm using https://github.com/pytorch/examples/tree/master/imagenet for a no-bells-and-whistles training setup. I'll switch to the fastai training setup at some point, but for now I want to keep it simple.

For a dataset I'm using https://github.com/fastai/imagenette#imagewoof for quick iteration and for training (most of) the EfficientNet models. I started with ImageNette, but it was too easy a dataset for this task, especially for the bigger models.

What I found out:

  • With ReLU6 as the activation function (instead of Swish) and without Squeeze-and-Excitation blocks, the EfficientNets are already very competitive on ImageWoof (even the smallest, EfficientNetB0), even though I'm using the simple training script.

  • When I add the Swish activation function instead of ReLU6 and scSE blocks, I have to train 4-5 times as long. I don't fully understand why this happens, but the Google TPU docs say they have to train EfficientNets for 350 epochs (instead of the default 90) on ImageNet.

  • With Swish and scSE blocks, Acc@1 dropped rapidly by 20 percentage points for EfficientNetB0.

  • With Swish and cSE blocks, Acc@1 is a bit worse than with no Squeeze-and-Excitation blocks at all. It looks like the scSE block (and especially the sSE block) is the problem here, but I cannot explain why yet.

My intuition tells me the Squeeze-and-Excitation (attention) blocks might just not be necessary for a dataset such as ImageWoof. We need to re-do these experiments on a full-blown ImageNet dataset.

For now my implementation uses ReLU6 and no Squeeze-and-Excitation blocks until I can confirm their benefits on larger datasets.

Open Questions

  • The EfficientNet paper uses DropConnect instead of Dropout. I haven't benchmarked it yet and my implementation simply uses Dropout for now. How much does it help?

  • ImageNet: how easy is it to train the models on ImageNet? How does Swish vs. ReLU6 behave? Do cSE blocks help? scSE blocks? For ImageWoof, neither Swish nor Squeeze-and-Excitation seems to make a difference in my simple setup.

  • Progressive Growing: can we speed up training the EfficientNet models by first training EfficientNetB0, then using it to initialize EfficientNetB1, and so on? How do the resulting models compare to the same models trained from scratch without progressive growing?

  • MobileNetV3: can we apply the paper's tricks (re-doing the last stages, and reducing the filters in the head) to EfficientNets?

  • The MobileNetV3 and EfficientNet papers came out at roughly the same time, are very similar in building blocks and overall design, and some of the authors even contributed to both papers. Yet neither paper mentions the other at all. Why is that?

Hope that helps,
Daniel J H


Thanks for your insights!

I am using EfficientNet for super resolution, and my experience with swish, and also with squeeze-and-excitation, has been exactly the same as yours up to this point: both barely made a difference.

I tried EfficientNet-B3 with mixup and could get 93.8% on the test set for the Stanford Cars dataset. What made a big difference in getting EfficientNet to work was changing the optimizer to RMSprop. With the default (Adam) I got worse results than ResNet50.
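
In case it helps anyone reproduce this, here is a rough sketch of paper-style RMSprop settings in plain PyTorch (the hyperparameter values follow the EfficientNet paper and the TF eps note earlier in the thread; the lr is a placeholder and `model` is assumed to already exist):

import torch

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-2,            # placeholder: the paper's 0.256 assumes a very large TPU batch,
                        # so scale it down for your batch size
    alpha=0.9,          # corresponds to the paper's RMSProp "decay 0.9"
    momentum=0.9,
    eps=1e-3,           # much larger than PyTorch's default of 1e-8
    weight_decay=1e-5,
)

Note that PyTorch still applies eps outside the sqrt while TF applies it inside, so results will not match TF exactly even with the same eps.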


@daniel-j-h The MNasNet paper was important in the jump from MobileNet-V2 to MobileNet-V3 and EfficientNet. It introduced SE into the inverted residual block, and as you can see, MobileNet-V3 and EfficientNet were both very much written with the assumption that you have read it first: https://arxiv.org/abs/1807.11626

The MNasNet-A1 vs. B1 comparison is worth looking at for an SE vs. no-SE comparison. The A1 is SE-block based with some reductions in dimensions vs. the B1, but it can be trained to 75.2% top-1 as per the paper (I've managed 75.45), while the B1 is 74.5 (74.66 in my attempt). The A1 has 3.9M params and the B1 4.4M.

The attention from SE blocks typically improves parameter efficiency. Just as with SE-ResNets or ResNeXts, the IR-block-based networks with SE present give you a higher ratio of performance (accuracy metrics) to parameters. Unfortunately they seem to make the networks harder to train, needing more epochs (maybe more variety is needed, so bigger datasets or better augmentation may help?), and a little slower to push images through.

All said though, for PyTorch running on an NVIDIA GPU, I don't think EfficientNets make sense as a go-to network for most applications. While they are parameter and FLOP efficient, when it comes to GPU memory usage and image throughput they are no faster than, if not worse than, ResNe(X)ts. Well-trained ResNe(X)ts can achieve similar accuracy metrics with better GPU memory characteristics, higher throughput, and half the training epochs (possibly up to 3-4x less wall-clock time).


I've been working on a Colab notebook comparing EfficientNets to some other well-trained models, mostly ResNet based. I didn't feel the choice of comparison models in the paper was particularly fair. Also, despite the big gains in parameter/FLOP efficiency, the EfficientNet models do not run faster in PyTorch or use less GPU memory than larger, appropriately matched ResNet peers.

https://colab.research.google.com/drive/1M6dMs7h6SChJe7VXQro1Yk37ibH0IRKE

Edit: a GitHub option if you don't want to log in to Google: https://github.com/rwightman/pytorch-image-models/blob/master/notebooks/EffResNetComparison.ipynb


In my limited experience shrinking GANs, the tradeoff I have noticed is as follows:

Over-parameterised networks = Large learning rate = Fast training = Slow Inference
Under-parameterised networks = Small learning rate = Slow training = Fast Inference

The choice depends on the objective: where do you want to run it once it's trained? On a GPU, or on an NPU (optimised for a particular platform, e.g. mobile)?

Having compressed over-parameterised GANs by removing entire resblocks, my thought is that progressive training (starting with the smallest network, then growing) would speed up overall training and might avoid having to resort to LR tricks. Just a thought.
