EfficientNet

For that I can comment a bit

For image size, you can use a smaller size than the one introduced in the paper, but if you want to squeeze the last drop of performance out of the model, you’d better use the size they specify in the paper.

Based on my understanding, I tried effb2 with image size = 260 (which is the one in the paper)
and effb4 with image size = 256 (which is not the one in the paper).

For LB scores, yes, effb2 can have the same result as effb4. I don’t have a powerful machine, so I didn’t push the limit.

Another thing I tried is using RAdam as the optimizer, and it did increase my effb4 LB score.
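For anyone curious what RAdam actually changes versus Adam, here is a minimal scalar sketch of the rectified update rule from the RAdam paper. This is a toy illustration, not the library implementation; the function name and hyperparameters are my own:

```python
import math

def radam_minimize(grad_fn, x0, lr=0.1, betas=(0.9, 0.999), eps=1e-8, steps=500):
    """Toy scalar RAdam: rectifies the adaptive step while variance is unreliable."""
    beta1, beta2 = betas
    rho_inf = 2.0 / (1.0 - beta2) - 1.0  # maximum length of the approximated SMA
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:
            # variance is tractable: apply the rectified adaptive step
            v_hat = math.sqrt(v / (1 - beta2 ** t))
            r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                            / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            x -= lr * r_t * m_hat / (v_hat + eps)
        else:
            # warmup phase: fall back to an un-adapted momentum step
            x -= lr * m_hat
    return x
```

With beta2 = 0.999 the rectified branch only kicks in after the first few steps, which is the built-in warmup that makes RAdam less sensitive to the learning rate schedule early in training.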


Thanks @heye0507, I’ll settle for a successful training run before asking for max performance :smiley:

Still no joy: with a simplified setup I’m getting hit by NaNs again. I wonder if it’s something to do with my machine… it’s probably not the weights, since they worked for @DrHB.

```python
from efficientnet_pytorch import EfficientNet
from torch import nn

md_ef = EfficientNet.from_pretrained('efficientnet-b5')
md_ef._fc = nn.Linear(2048, data.c)  # swap the head for our number of classes

data_test.batch_size = 64  # couldn't fit 128 in memory
learn = Learner(data_test, md_ef, metrics=[accuracy])

learn.to_fp16()
learn.unfreeze()
learn.lr_find()
```


Interesting! Are you using the Swish activation function or standard ReLU?

Swish, the default with that library. For my earlier successes with b3 I had great results with Mish, but neither is working here.

Wow, this is so interesting. I never experienced a NaN issue while experimenting with EfficientNet. I will be home in a few hours and run some tests…


Hmm, it’s something about the machine/installation, I think. I was running my previous successful b3 runs on a P4000, but I just tried the same b3 run on the P6000 and it’s also giving me NaNs. So maybe I need to try a new environment with a fresh installation. I can see the working machine has PyTorch 1.2 vs 1.0.1 for the P6000; hopefully it’s that!

Not working P6000

Working P4000

Do you have a notebook I can take a look at?

How about removing the weight decay and div factor and just using the defaults?

I’m not sure if this is related, but with 224px images, using eff-b5 and batch_size=16, my train and valid losses come through OK, but the quadratic_kappa metric gives me NaNs on only some of my epochs.

So far I haven’t seen this happen with eff-b2 at batch_size=64.

I’m on an RTX2070 with fp16


This may have more to do with the dataset and the batch size. With a highly imbalanced dataset, if a batch contains only one class, then the quadratic kappa is not defined → NaN. This is more likely to happen with small batch sizes like bs=16, and it is also why you are getting the RuntimeWarning.
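A quick way to see this: here is a from-scratch quadratic-weighted kappa (my own sketch for illustration, not fastai’s implementation) that makes the 0/0 case explicit when a batch contains only one class:

```python
def quadratic_kappa(y_true, y_pred, n_classes):
    """Quadratic-weighted Cohen's kappa, written out by hand for illustration."""
    n = len(y_true)
    # observed confusion matrix
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(col) for col in zip(*O)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic weight
            num += w * O[i][j]                        # weighted observed
            den += w * hist_true[i] * hist_pred[j] / n  # weighted expected
    # a single-class batch zeroes every weighted expected count -> 0/0
    return float('nan') if den == 0 else 1.0 - num / den
```

With a batch like `[2]*16` predicted as all 2s, every weighted term is zero and the metric is undefined, which is exactly the NaN (and RuntimeWarning) reported above.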


Ahh right, I’ve been trying to figure out for a while why it was doing this! Thanks!


Ha, I have a better idea: can you try removing the fp16() call? I remember someone reporting that fp16() doesn’t work on some older-generation GPUs. I don’t know if this is the case here.


I upgraded from PyTorch 1.0.1 → 1.2 and it’s working as expected :man_facepalming:

Thanks for the suggestions all and apologies for the thrash, I’ll update if I get any decent results with b5 :slight_smile:


I am not able to match the FLOP count reported in the paper.
The paper says, in Table 2, 0.39 billion for the B0 model.

But they compare it with 4.1B for ResNet-50, which is actually MACCs. So let’s say the paper means 0.39B MACCs for the B0 model.

If I compute the MACCs in the convolutions only (3×3 depthwise and 1×1), I get only 0.26B MACCs.
Does this match anyone’s calculations?

Adding the SE (squeeze-and-excitation) MACCs will only add half a million or so, and BN adds a few million more.
But there are ~5 million Swish operations. How many FLOPs does one assign to each Swish? The FLOPs of a sigmoid evaluation depend on the polynomial degree of the approximation.

Any comments, suggestions in interpreting the FLOPs mentioned in the paper?
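For reference, the usual per-layer convolution count is one multiply-accumulate per output element per kernel tap (ignoring bias, BN, and activations). A quick sketch, with the function name and example shapes my own:

```python
def conv_macs(h_out, w_out, c_in, c_out, k, groups=1):
    """MACCs for a conv layer: each output element costs k*k*(c_in/groups) MACs."""
    return h_out * w_out * c_out * (c_in // groups) * k * k

# examples on a 112x112 feature map with 32 input channels:
pointwise = conv_macs(112, 112, 32, 16, 1)             # 1x1 conv, 32 -> 16 channels
depthwise = conv_macs(112, 112, 32, 32, 3, groups=32)  # 3x3 depthwise conv
```

Summing this over every layer is how per-model totals like the 0.26B above are typically produced; whether Swish, SE, and BN are counted (and at how many ops each) is exactly where totals start to diverge.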

@jujubi A MACC is generally counted as 2 FLOPs. I did notice that the EfficientNet paper doesn’t seem to stick to that convention and appears to use FLOPs == MACCs.

Either way, I measured the FLOPs on a few networks I ported to PyTorch and they all checked out, including EfficientNet-B0: I get 0.780305 GFLOPs running the Caffe2 benchmark on the ONNX-exported and optimized (BN folded into convolutions) model.

I wrote a blurb on how to reproduce the result, with some comparisons of similar models (pre- and post-optimization), back when I did the checks…

Thank you @rwightman for the response and the link.
Could you tell me how many FLOPs you assign to each sigmoid?

(In https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/count_hooks.py, I see just one op assigned for the exponentiation, though depending on the evaluation method used (say, a piece-wise cubic approximation) it would be more than 1.)

Is there a trick to make EfficientNet work with nn.DataParallel? I’ve tried both the lukemelas and RWightman implementations and both work fine on one GPU, but attempting to use two GPUs via nn.DataParallel results in a cuDNN error: CUDNN_STATUS_BAD_PARAM in their conv2d methods.

I’m creating the learner, then setting learn.model = nn.DataParallel(learn.model). It works with ResNet, just not with any of the EfficientNet implementations I’ve tried.

I recently used torch.distributed (set up as in the docs) with the Luke Melas EfficientNets and it worked right out of the box.

Are Luke Melas’ parameters correct?

Regarding the Luke Melas EfficientNets: I was comparing them with the paper, and it seems to me that either I am missing something (apologies in advance) or the parameters he is using are not the same. This is his code, which you will find in utils.py:

```python
def efficientnet_params(model_name):
    """ Map EfficientNet model name to parameter coefficients. """
    params_dict = {
        # Coefficients:   width,depth,res,dropout
        'efficientnet-b0': (1.0, 1.0, 224, 0.2),
        'efficientnet-b1': (1.0, 1.1, 240, 0.2),
        'efficientnet-b2': (1.1, 1.2, 260, 0.3),
        'efficientnet-b3': (1.2, 1.4, 300, 0.3),
        'efficientnet-b4': (1.4, 1.8, 380, 0.4),
        'efficientnet-b5': (1.6, 2.2, 456, 0.4),
        'efficientnet-b6': (1.8, 2.6, 528, 0.5),
        'efficientnet-b7': (2.0, 3.1, 600, 0.5),
    }
    return params_dict[model_name]
```

Now, when I go to the original paper, the authors say the following:

[image: excerpt from the paper describing the compound scaling method]

the equation to which they refer is the one:

[image: the paper’s compound scaling rule: depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α·β²·γ² ≈ 2]

So basically, if for instance alpha = 1.2 and depth = alpha ** phi, how do I get to Luke Melas’ table? I have not found a way to make sense of the numbers in his scaling. Did somebody review it?

There is one sentence in the paper that confuses me, though:

“we first fix phi=1 (…). We find best values for EfficientNetB0 are alpha = 1.2, beta = 1.1, gamma = 1.15.”

If I interpret this sentence correctly, w=1.0, d=1.0 and r=224 would correspond to EfficientNetB0, and these new parameters are for phi = 1, hence EfficientNetB1… again, different from Luke’s table.
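Plugging the paper’s α, β, γ straight into the formula makes the mismatch easy to see (a quick sanity check on the numbers above, not a claim about how the released coefficients were actually derived):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # the paper's values, found with phi = 1
base_res = 224

for phi in range(1, 8):
    width = beta ** phi
    depth = alpha ** phi
    res = base_res * gamma ** phi
    print(f"phi={phi}: width={width:.2f}, depth={depth:.2f}, res={res:.0f}")
```

For example, phi=2 gives a resolution of about 296, while the table lists 260 for b2, so the published per-model coefficients do not look like pure powers of α, β, γ, which is exactly the discrepancy being asked about.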


I think people interested in EfficientNet or ImageNet SOTA might be interested in this new paper:


@mgloria it might be worth raising an issue on the github project page to ask that question. Let us know what you find!