Following up on EfficientNet
@rwightman thank you for pointing out the MnasNet paper (https://arxiv.org/abs/1807.11626). It was indeed one of the missing pieces of the puzzle and explains some parts in more detail (e.g. the squeeze-and-excitation blocks).
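For reference, here is a minimal squeeze-and-excitation sketch in PyTorch; the reduction ratio and the 1x1-conv formulation are my assumptions, not the exact EfficientNet/MnasNet configuration:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    # Minimal squeeze-and-excitation block: globally pool to a per-channel
    # descriptor, pass it through a small bottleneck, and rescale the input.
    # The reduction ratio of 4 is illustrative, not the papers' exact setting.
    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)

    def forward(self, x):
        scale = x.mean(dim=(2, 3), keepdim=True)                       # squeeze
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(scale))))   # excite
        return x * scale                                               # rescale channels
```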
I’m currently running experiments with https://github.com/daniel-j-h/efficientnet on ImageNet. Even without bells and whistles (no swish, no squeeze-and-excitation), my EfficientNetB0 reaches competitive results wrt. Acc@1. These experiments eat a lot of time, though.
I also heard from folks who gave the EfficientNet models a try on mobile devices that inference is quite slow. This might be due to the backend used or to the swish activation function. Maybe you folks have insights here?
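As a very rough first check on my laptop (not a proper mobile benchmark), a micro-benchmark along these lines compares swish against ReLU on a single activation tensor; the tensor shape and iteration count are arbitrary:

```python
import time
import torch
import torch.nn as nn

class Swish(nn.Module):
    # swish(x) = x * sigmoid(x); the extra sigmoid is the suspected cost on CPU/mobile
    def forward(self, x):
        return x * torch.sigmoid(x)

def bench(act, iters=100):
    # time repeated forward passes of a single activation on a mid-sized tensor
    x = torch.randn(1, 32, 112, 112)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            act(x)
    return time.perf_counter() - start

print("relu :", bench(nn.ReLU()))
print("swish:", bench(Swish()))
```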
From initial benchmarks on my laptop (CPU) it seems that e.g. MobileNetV2 (the existing one from torchvision.models) can easily be scaled down via its width multiplier, whereas EfficientNetB0 is already the smallest variant and we have no coefficients to go lower. Has anyone tried scaling down EfficientNets? Or would it make more sense to go to MobileNetV3 and MobileNetV3-small directly?
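For comparison, scaling down MobileNetV2 is a one-liner, since torchvision's implementation exposes the width multiplier (no pretrained weights exist for non-default widths, of course):

```python
import torchvision.models as models

# Halving the width multiplier roughly quarters the per-layer channel cost.
small_mnv2 = models.mobilenet_v2(width_mult=0.5)

# EfficientNetB0 is the smallest published compound-scaling point; going below it
# would mean choosing new width/depth/resolution coefficients ourselves.
print(sum(p.numel() for p in small_mnv2.parameters()))
```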
What I also looked into is the Bag of Tricks paper (https://arxiv.org/abs/1812.01187) and its three insights:
- zero init the batchnorm weights in the last res-block layer
- adapt the res-blocks with optional AvgPool, Conv1x1 to make them all skip-able
- do not apply weight decay to biases (I haven't done experiments with this one yet; see the sketch after this list for this and the zero-init trick)
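A minimal sketch of the first and third trick; the `bn3` attribute name for the last batchnorm in a block is a hypothetical placeholder and needs to match the actual block implementation:

```python
import torch.nn as nn

def zero_init_last_bn(model):
    # Zero the gamma of the last batchnorm in each res-block so the block starts
    # out close to identity. `bn3` is a hypothetical attribute name for that last
    # batchnorm layer; adapt it to the actual block class.
    for m in model.modules():
        if hasattr(m, "bn3") and isinstance(m.bn3, nn.BatchNorm2d):
            nn.init.zeros_(m.bn3.weight)

def split_weight_decay(model, weight_decay=1e-5):
    # Build optimizer parameter groups so biases (and other 1d parameters such as
    # batchnorm gamma/beta) get no weight decay, while regular weights keep it.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.dim() == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# e.g. torch.optim.SGD(split_weight_decay(model), lr=0.1, momentum=0.9)
```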
Especially the Bag of Tricks ResNet-D (Figure 2 c) looks very interesting for EfficientNets.
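Here is a sketch of what such an AvgPool + Conv1x1 shortcut could look like for our bottleneck blocks; how exactly it plugs into the existing EfficientNet block is my assumption, not taken from the paper's code:

```python
import torch.nn as nn

class DownsampleShortcut(nn.Module):
    # ResNet-D style projection shortcut (Bag of Tricks, Figure 2 c): an AvgPool
    # handles the stride, a 1x1 conv plus batchnorm matches the channel count, so
    # stride=2 or channel-changing blocks can carry a residual as well.
    def __init__(self, in_channels, out_channels, stride):
        super().__init__()
        layers = []
        if stride > 1:
            layers.append(nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True))
        if in_channels != out_channels:
            layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
            layers.append(nn.BatchNorm2d(out_channels))
        self.shortcut = nn.Sequential(*layers)

    def forward(self, x):
        return self.shortcut(x)

# inside a bottleneck block: out = branch(x) + DownsampleShortcut(c_in, c_out, stride)(x)
```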
Below are statistics for my EfficientNets and their Bottleneck blocks. You can see that some of these blocks are not skip-able because either the spatial dimensions or the number of channels do not match within the res-block, e.g. in the stride=2 blocks.
The skip ratio is the fraction of blocks in which we can add the residual (skip connection). Note how in the smaller EfficientNets we are missing skip connections for almost half of the blocks!
The Bag of Tricks ResNet-D (Figure 2 c) adaption (adding AvgPool + Conv1x1 to make blocks skip-able) could be especially beneficial for the small EfficientNet models.
```
EfficientNet0 {'n': 16, 'has_skip': 9,  'not_has_skip': 7, 'skip_ratio': 0.5625}
EfficientNet1 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet2 {'n': 23, 'has_skip': 16, 'not_has_skip': 7, 'skip_ratio': 0.6957}
EfficientNet3 {'n': 26, 'has_skip': 19, 'not_has_skip': 7, 'skip_ratio': 0.7309}
EfficientNet4 {'n': 32, 'has_skip': 25, 'not_has_skip': 7, 'skip_ratio': 0.7813}
EfficientNet5 {'n': 39, 'has_skip': 32, 'not_has_skip': 7, 'skip_ratio': 0.8205}
EfficientNet6 {'n': 45, 'has_skip': 38, 'not_has_skip': 7, 'skip_ratio': 0.8444}
EfficientNet7 {'n': 55, 'has_skip': 48, 'not_has_skip': 7, 'skip_ratio': 0.8727}
```
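For completeness, this is roughly how such statistics can be gathered; `stride`, `in_channels` and `out_channels` are hypothetical attribute names, adapt them to the actual block class:

```python
def skip_stats(blocks):
    # Count how many bottleneck blocks can carry a residual connection:
    # stride 1 and matching input/output channels.
    n = len(blocks)
    has_skip = sum(
        1 for block in blocks
        if block.stride == 1 and block.in_channels == block.out_channels
    )
    return {
        "n": n,
        "has_skip": has_skip,
        "not_has_skip": n - has_skip,
        "skip_ratio": round(has_skip / n, 4),
    }
```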
I’m also looking into progressive growing to initialize models. So far I transplant only the convolutional weights from the smaller model after regularly initializing the bigger model. I’m wondering if we can and should also transplant the remaining layers, e.g. batchnorm, and whether that would give us a better initialization for training.
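A minimal sketch of that transplant step, assuming parameter names line up between the two models and that "conv" appears in the names of the convolutional weights (both are assumptions about the implementation):

```python
import torch

def transplant_convs(small_state_dict, big_model):
    # Copy only the convolutional weights from a smaller trained model into a
    # freshly initialized bigger one wherever name and shape match; batchnorm and
    # everything else keep their regular init. The "conv" name filter is an
    # assumption about how the layers are named in the state dict.
    target = big_model.state_dict()
    copied = {
        key: value for key, value in small_state_dict.items()
        if "conv" in key and key in target and value.shape == target[key].shape
    }
    target.update(copied)
    big_model.load_state_dict(target)
    return sorted(copied)

# e.g. transplant_convs(torch.load("efficientnet-b0.pth"), efficientnet_b1)
```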
Best,
Daniel J H