EfficientNet

For that I can comment a bit

For image size, you can use a smaller size than the one introduced in the paper, but if you want to squeeze the last drop of performance out of the model, you’d better use the size they specify in the paper.

Based on my understanding, I tried effb2 with image size = 260 (which is the one in the paper)
and effb4 with image size = 256 (which is not the one in the paper).

For LB scores, yes, effb2 can have the same result as effb4. I don’t have a powerful machine, so I didn’t push the limit.

Another thing I tried is using RAdam as the optimizer, and it did increase my effb4 LB score.
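For anyone curious what RAdam actually changes versus Adam, here is a minimal scalar sketch of the rectified update rule from the RAdam paper. This is a toy illustration, not the library implementation; the function name and hyperparameters are my own:

```python
import math

def radam_minimize(grad_fn, x0, lr=0.1, betas=(0.9, 0.999), eps=1e-8, steps=500):
    """Toy scalar RAdam: rectifies the adaptive step while variance is unreliable."""
    beta1, beta2 = betas
    rho_inf = 2.0 / (1.0 - beta2) - 1.0  # maximum length of the approximated SMA
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:
            # variance is tractable: apply the rectified adaptive step
            v_hat = math.sqrt(v / (1 - beta2 ** t))
            r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                            / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            x -= lr * r_t * m_hat / (v_hat + eps)
        else:
            # warmup phase: fall back to an un-adapted momentum step
            x -= lr * m_hat
    return x
```

With beta2 = 0.999 the rectified branch only kicks in after the first few steps, which is the built-in warmup that makes RAdam less sensitive to the learning rate schedule early in training.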


Thanks @heye0507, I’ll settle for a successful training run before asking for max performance :smiley:

Still no joy: with a simplified setup I’m getting hit by NaNs again. I wonder if it’s something to do with my machine… it’s probably not the weights, since they worked for @DrHB.

```python
from efficientnet_pytorch import EfficientNet
from torch import nn

md_ef = EfficientNet.from_pretrained('efficientnet-b5')
md_ef._fc = nn.Linear(2048, data.c)  # swap the head for our number of classes

data_test.batch_size = 64  # couldn't fit 128 in memory
learn = Learner(data_test, md_ef, metrics=[accuracy])

learn.to_fp16()
learn.unfreeze()
learn.lr_find()
```


Interesting! Are you using the Swish activation function or standard ReLU?

Swish, the default with that library. For my earlier successes with b3 I had great results with Mish, but neither is working here.

Wow, this is so interesting. I never experienced a NaN issue while experimenting with EfficientNet. I will be home in a few hours and run some tests…


Hmm, it’s something about the machine/installation, I think. I was running my previous successful b3 runs on a P4000, but I just tried the same b3 run on the P6000 and it’s also giving me NaNs. So maybe I need to try a new environment with a fresh installation. I can see the working machine has PyTorch 1.2 vs 1.0.1 for the P6000; hopefully it’s that!

Not working P6000

Working P4000

Do you have a notebook I can take a look at?

How about removing the weight decay and div factor and just using the defaults?

I’m not sure if this is related, but with 224px images, using eff-b5 and batch_size=16, my train and valid losses come through OK, but the quadratic_kappa metric gives me NaNs on only some of my epochs.

So far I haven’t seen this happen with eff-b2 at batch_size=64.

I’m on an RTX2070 with fp16


This may have more to do with the dataset and the batch size. With a highly imbalanced dataset, if a batch contains only one class, then the quadratic kappa is not defined → NaN. This is more likely to happen with small batch sizes like bs=16, and it is also why you are getting the RuntimeWarning.
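A quick way to see this: here is a from-scratch quadratic-weighted kappa (my own sketch for illustration, not fastai’s implementation) that makes the 0/0 case explicit when a batch contains only one class:

```python
def quadratic_kappa(y_true, y_pred, n_classes):
    """Quadratic-weighted Cohen's kappa, written out by hand for illustration."""
    n = len(y_true)
    # observed confusion matrix
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(col) for col in zip(*O)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic weight
            num += w * O[i][j]                        # weighted observed
            den += w * hist_true[i] * hist_pred[j] / n  # weighted expected
    # a single-class batch zeroes every weighted expected count -> 0/0
    return float('nan') if den == 0 else 1.0 - num / den
```

With a batch like `[2]*16` predicted as all 2s, every weighted term is zero and the metric is undefined, which is exactly the NaN (and RuntimeWarning) reported above.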


Ahh right, I’ve been trying to figure out for a while why it was doing this! Thanks!


Ha, I have a better idea: can you try removing the fp16() call? I remember someone reporting that fp16() doesn’t work on some older-generation GPUs. I don’t know if this is the case here.


I upgraded from PyTorch 1.0.1 → 1.2 and it’s working as expected :man_facepalming:

Thanks for the suggestions all and apologies for the thrash, I’ll update if I get any decent results with b5 :slight_smile:


I am not able to match the FLOP count reported in the paper.
The paper says, in Table 2, 0.39 billion for the B0 model.

But they compare it with 4.1B for ResNet-50, which is actually MACCs. So let’s say the paper means 0.39B MACCs for the B0 model.

If I compute the MACCs in the convolutions only (3×3 depthwise and 1×1), I get only 0.26B MACCs.
Does this match anyone’s calculations?

Adding the SE (squeeze-and-excitation) MACCs will only add half a million or so, and BN adds a few million more.
But there are ~5 million Swish operations. How many FLOPs does one assign to each Swish? The FLOPs of a sigmoid evaluation depend on the polynomial degree of the approximation.

Any comments, suggestions in interpreting the FLOPs mentioned in the paper?
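For reference, the usual per-layer convolution count is one multiply-accumulate per output element per kernel tap (ignoring bias, BN, and activations). A quick sketch, with the function name and example shapes my own:

```python
def conv_macs(h_out, w_out, c_in, c_out, k, groups=1):
    """MACCs for a conv layer: each output element costs k*k*(c_in/groups) MACs."""
    return h_out * w_out * c_out * (c_in // groups) * k * k

# examples on a 112x112 feature map with 32 input channels:
pointwise = conv_macs(112, 112, 32, 16, 1)             # 1x1 conv, 32 -> 16 channels
depthwise = conv_macs(112, 112, 32, 32, 3, groups=32)  # 3x3 depthwise conv
```

Summing this over every layer is how per-model totals like the 0.26B above are typically produced; whether Swish, SE, and BN are counted (and at how many ops each) is exactly where totals start to diverge.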

@jujubi A MACC is generally counted as 2 FLOPs. I did notice that the EfficientNet paper doesn’t seem to stick to that convention and appears to use FLOPs == MACCs.

Either way, I measured the FLOPs on a few networks I ported to PyTorch and they all checked out, including EfficientNet-B0: I get 0.780305 GFLOPs running the Caffe2 benchmark on the ONNX-exported and optimized (BN folded into convolutions) model.

I wrote a blurb on how to reproduce the result, with some comparisons of similar models (pre- and post-optimization), back when I did the checks…

Thank you @rwightman for the response and the link.
Could you tell me how many FLOPs you assign to each sigmoid?

(In https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/count_hooks.py, I see just one op assigned for the exponentiation, though depending on the evaluation method used (say, a piece-wise cubic approximation) it would be more than 1.)

Is there a trick to make EfficientNet work with nn.DataParallel? I’ve tried both the lukemelas and RWightman implementations and both work fine on one GPU, but attempting to use two GPUs via nn.DataParallel results in a cuDNN error: CUDNN_STATUS_BAD_PARAM in their conv2d methods.

I’m creating the learner, then setting learn.model = nn.DataParallel(learn.model). It works with ResNet, just not with any of the EfficientNet implementations I’ve tried.

I recently used torch.distributed (set up as in the docs) with the Luke Melas EfficientNets and it worked right out of the box.

Are Luke Melas’ parameters correct?

Regarding the Luke Melas EfficientNets: I was comparing them with the paper, and it seems to me that either I am missing something (apologies in advance) or the parameters he is using are not the same. This is his code, which you will find in utils.py:

```python
def efficientnet_params(model_name):
    """ Map EfficientNet model name to parameter coefficients. """
    params_dict = {
        # Coefficients:   width,depth,res,dropout
        'efficientnet-b0': (1.0, 1.0, 224, 0.2),
        'efficientnet-b1': (1.0, 1.1, 240, 0.2),
        'efficientnet-b2': (1.1, 1.2, 260, 0.3),
        'efficientnet-b3': (1.2, 1.4, 300, 0.3),
        'efficientnet-b4': (1.4, 1.8, 380, 0.4),
        'efficientnet-b5': (1.6, 2.2, 456, 0.4),
        'efficientnet-b6': (1.8, 2.6, 528, 0.5),
        'efficientnet-b7': (2.0, 3.1, 600, 0.5),
    }
    return params_dict[model_name]
```

Now, when I go to the original paper, the authors say the following:

[image: excerpt from the paper describing the compound scaling method]

the equation to which they refer is the one:

[image: the paper’s compound scaling rule: depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α·β²·γ² ≈ 2]

So basically, if for instance alpha = 1.2 and depth = alpha ** phi, how do I get to Luke Melas’ table? I have not found a way to make sense of the numbers in his scaling. Did somebody review it?

There is one sentence in the paper that confuses me, though:

“we first fix phi=1 (…). We find best values for EfficientNetB0 are alpha = 1.2, beta = 1.1, gamma = 1.15.”

If I interpret this sentence correctly, w=1.0, d=1.0 and r=224 would correspond to EfficientNetB0, and these new parameters are for phi = 1, hence EfficientNetB1… again, different from Luke’s table.
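Plugging the paper’s α, β, γ straight into the formula makes the mismatch easy to see (a quick sanity check on the numbers above, not a claim about how the released coefficients were actually derived):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # the paper's values, found with phi = 1
base_res = 224

for phi in range(1, 8):
    width = beta ** phi
    depth = alpha ** phi
    res = base_res * gamma ** phi
    print(f"phi={phi}: width={width:.2f}, depth={depth:.2f}, res={res:.0f}")
```

For example, phi=2 gives a resolution of about 296, while the table lists 260 for b2, so the published per-model coefficients do not look like pure powers of α, β, γ, which is exactly the discrepancy being asked about.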


I think people interested in EfficientNet or ImageNet SOTA might be interested in this new paper:


@mgloria it might be worth raising an issue on the github project page to ask that question. Let us know what you find!