Adding EfficientNet to fastai vision

Yeah, it would be nice, but these models are such beasts to work with at the higher resolutions that I generally haven’t bothered going beyond B2 for training. I may fine-tune B3/B4 at some point, but B5 and beyond are so slow and GPU-memory intensive that I doubt I’ll ever get to them unless someone wants to gift me some time on an 8x V100 machine :slight_smile:

BTW, I updated my stand-alone version of these models this weekend (https://github.com/rwightman/gen-efficientnet-pytorch) and made it a bit more fast.ai friendly

Using the model entrypoint functions (i.e. geffnet.efficientnet_b2(pretrained=True)) or the geffnet.create_model('mixnet_l', pretrained=True, drop_rate=0.25) fn, you can pass as_sequential=True as an additional kwarg and it’ll return a sequential container that should play nicely with fast.ai cut/split workflows. The FC is always the last layer, pooling is adaptive by default, and the dropout and flattens are converted to modules, so nothing is missing.
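
For example, a quick sketch of what that looks like (the calls are the ones mentioned above; the slicing at the end is just one way you might cut off the classifier if you want fastai to add its own head):

import geffnet

# Both return an nn.Sequential whose last layer is the FC classifier,
# with adaptive pooling, dropout and flatten included as modules
m1 = geffnet.efficientnet_b2(pretrained=True, as_sequential=True)
m2 = geffnet.create_model('mixnet_l', pretrained=True, drop_rate=0.25, as_sequential=True)
body = m1[:-1]  # everything up to the final FC layer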

One still needs to make sure to set up the preprocessing pipeline to match the model defaults, though.

Don’t think you actually need to have a sequential model for Learner.split. This seems to work fine to create layer groups with your previous version:

lrn.split([lrn.model.conv_stem,lrn.model.blocks,lrn.model.conv_head])

I think those should be the correct names; adapted from a custom UNet wrapper doing both classification and segmentation. Not quite sure about the best split points though. In particular, that may not be ideal for some of the ways of specifying differential learning rates, which as I recall use a lower LR just for the first group (though I don’t think having the stem and the whole body at a lower rate is necessarily ideal either).

I think the sequential is only required for cnn_learner, for which it might be nice to selectively load the classifier weights (as lukemelas’ version does). That way you can load pretrained weights for everything else and you don’t need cnn_learner at all. You should be able to have _create_model do something like:

from torch.hub import load_state_dict_from_url

state = load_state_dict_from_url(model_urls[variant])
if state['classifier.weight'].shape != model.classifier.weight.shape:
    # Keep the model's randomly initialised classifier (might also want to move it to the loaded state's device here)
    state['classifier.weight'] = model.classifier.weight
    state['classifier.bias'] = model.classifier.bias
model.load_state_dict(state)

This allows generic use outside of fastai and avoids losing many of the layer names the way cnn_learner does (or did your as_sequential deal with that?). It also means you don’t get the fastai head (or have to override it), which is a bit different to the standard one (though I do mean to try the fastai head sometime).

Oh, and following Jeremy’s Twitter post I tested and confirmed that using JIT script matches the performance of my CUDA implementation for Swish (and it should probably work just as well for Mish; I haven’t tested, but I’m pretty sure it should be able to fuse into a single kernel there too).
I did try this in torch 1.2 and didn’t have success, but I think I probably messed up. I was also testing converting whole networks to JIT at the time, which was quite nasty in torch 1.2 but improved in the then-nightly 1.3, so I put off my experiments. But I think you can support activations easily in both (the decorator support has improved, so you might need to play around there, but jit.script was already there in 1.2).
I just used:

import torch

@torch.jit.script
def swish_fwd(i):
    return i * torch.sigmoid(i)

@torch.jit.script
def swish_bwd(i, grad_output):
    sigmoid_i = torch.sigmoid(i)
    return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))

class SwishJIT(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)
        return swish_fwd(i)

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return swish_bwd(i, grad_output)
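
(For reference, a minimal nn.Module wrapper so the function can be used as a layer; the class name here is just illustrative:)

class SwishJITMod(torch.nn.Module):
    def forward(self, x):
        return SwishJIT.apply(x)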

But you could also replicate Jeremy’s decorator stuff, as you have a few activations (and it would allow different implementations for 1.2/1.3 if needed).

I did also note a bit of a bump in performance in 1.3 when I went back to my tests: from 1:02 per epoch down to 51s.

I tried it out in the Kaggle Severstal comp, but didn’t get to spend much time on it, so my first submission was on the last day and I had just 5 submissions to play with. The performance I saw testing locally wasn’t reflected in results on the Kaggle LB, so nothing impressive so far.
I did get it up from an initial 0.8 Dice to a final 0.88419. An empty submission would get a Dice of about 0.86, while the winner was 0.90883 (a small range, given the pretty extreme penalty Dice imposes for any false positives on empty masks). A fastai unet with a resnet34 backbone got 0.88528.
That’s pretty low down the leaderboard, but there was public sharing of some pretty strong-performing models, so there was likely a lot of model duplication there. Still obviously a fair way off the SotA.

Some extra non-model-related fixes may well improve performance a fair bit though. Notably, while I added TTA on my one real day working on it on Kaggle, I didn’t actually adjust thresholds based on it; they’re still based on non-TTA results (and threshold tuning alone moved the score from ~0.80 to ~0.878 with just a couple of educated guesses).
It’s also likely better to use n_classes+1 outputs rather than the n_classes I used. I went with n_classes and sigmoid(output) > threshold rather than n_classes+1 and argmax(output), as allowing thresholds seemed nice. But you can actually combine n_classes+1 outputs with thresholds, as this notebook released post-comp shows.
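Roughly, the combination looks something like this (just a sketch; the thresholds and shapes are illustrative, not taken from that notebook):

import torch

n_classes = 4
logits = torch.randn(2, n_classes + 1, 256, 400)   # raw model outputs, background as channel 0
probs = logits.softmax(dim=1)
fg = probs[:, 1:]                                   # drop the background channel
thresholds = torch.tensor([0.5, 0.6, 0.5, 0.55]).view(1, n_classes, 1, 1)
is_max = fg == fg.max(dim=1, keepdim=True).values   # winning foreground class per pixel
masks = is_max & (fg > thresholds)                  # keep it only if it clears its threshold
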
I also divided the input 1600x256 images into 400x256 blocks, which helped a lot with training, but then I didn’t have time to properly deal with this in prediction. It could likely benefit from fine-tuning on full-size images, and it made some prediction decisions hard (I ended up combining full-image predictions with block-based TTA predictions).

I will look to experiment more now post-comp and will probably also try the architecture out on the Clouds comp. I’ll definitely look to compare the different locations at which to collect the skips too, and try to investigate some of the many ultimately fairly arbitrary decisions I made along the way.

Here’s the notebook for my submission in case anyone still finds it useful. It could really do with some cleanup, but I’m not sure when I’ll get back to it. There might be some interesting things in there still.
I’m fairly happy with the TTA. Getting predictions with TTA (or even without) out of a fastai Learner without OOMing seems to be a problem many hit (as fastai assumes they’ll fit in RAM). The notebook does 1800 1600x256px inputs x 4 classes with TTA predictions in 3:28 (and I think some OpenCV post-processing is a fair part of that).

@rwightman after reading through this I’m going to switch to your EfficientNet library, but I had a couple of quick questions on loading a model.

  1. How do I select Mish as the activation function instead of ReLU? I made a guess with the code below but it didn’t work. I also saw there were several Mish functions (vanilla, jit, MishAuto) in the activations folder; is it up to the user to select which one to pick?

m = geffnet.efficientnet_b2(act_layer=MishAuto, num_classes=3, pretrained=True, drop_rate=0.25, drop_connect_rate=0.2)

  2. When running the code above without the act_layer argument, it gives the following message when it loads:

Discarding pretrained classifier since num_classes != 1000

Does that mean all of the pre-trained weights are discarded, or just those from the last fc layer?

  3. Just to confirm, you’d recommend using the tf efficientnets over your pretrained models IF the same tf preprocessing pipeline is used?

Thanks for a great library. I’d be happy to help out with some documentation around the above, or anything else you feel needs work too!

Things got a bit complex with the activations (different complications between autograd.Function and ONNX export / torchscript, etc.), so I created a string-based factory. I was going to add an if isinstance(nn.Module) exception but forgot to add that. The easiest way to use Mish with the current version is to call this before creating the model:
geffnet.add_override_act_layer('swish', geffnet.activations_jit.MishJit)
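
Putting that together with the creation call from earlier in the thread (the kwargs are just the ones from that snippet):

import geffnet

# Map every 'swish' activation in the model to the JIT Mish implementation
geffnet.add_override_act_layer('swish', geffnet.activations_jit.MishJit)
m = geffnet.efficientnet_b2(pretrained=True, num_classes=3, drop_rate=0.25, drop_connect_rate=0.2)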

I’ll add the ability to pass in the activation fn directly again, but the one thing to be aware of is that some models like MixNet override the model’s base activation on a per-stage basis (some stages are ReLU, some Swish by default), so the string mapping method lets you override each activation type used in any model without adding loads of parameters all the way down to the blocks…

Yes, if you pass in fewer classes and ask for a pretrained model, it just discards the classifier (fc layer) and uses the default random init for it. If you pass in a different number of input channels, it does the same for that conv but spits out a different msg. If you specify a single channel for input, it actually sums the original 3 into one for a pretty decent starting point. The rest of the pretrained weights are still loaded.

If there is a non-tf_ variant of the model you want to use, I’d use that one. The tf versions require additional padding (dependent on the input size), so it has to be calculated at each forward pass, or cached on the first pass with only that same image size used for the rest of the life of that model. There is a small penalty in memory usage / runtime performance for that. The tf models also default to a different BatchNorm epsilon that you will need to keep using in the future to maintain full accuracy with your weights.

For those reasons, if you do start with a tf_ model, I also recommend considering taking the initial hit and loading the checkpoint into a non-tf_ variant. It’ll show a notable drop initially, but if you’re fine-tuning anyway it’ll likely be moot over the duration of training.
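
A rough sketch of that (assuming the tf_ and non-tf_ variants share parameter names, which is worth verifying for the model you use; tf_efficientnet_b2 here just stands in for whichever tf_ variant you started from):

import geffnet

tf_model = geffnet.tf_efficientnet_b2(pretrained=True)
model = geffnet.efficientnet_b2(pretrained=False)
# strict=False tolerates any keys that don't line up; expect an initial accuracy drop
# from the padding / BatchNorm-epsilon mismatch that fine-tuning should recover
model.load_state_dict(tf_model.state_dict(), strict=False)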

While I’m in this thread, a comment/question for @TomB … with the JIT version of Mish I was playing with, I calculated the derivative and ended up with something different from what you and @Diganta have in some of the other versions I’ve seen on GitHub, etc…

I have:

import torch
import torch.nn.functional as F

def mish_jit_bwd(x, grad_output):
    # d/dx of x * tanh(softplus(x))
    x_sigmoid = torch.sigmoid(x)
    x_tanh_sp = F.softplus(x).tanh()
    return grad_output.mul(x_tanh_sp + x * x_sigmoid * (1 - x_tanh_sp * x_tanh_sp))

Elsewhere I’ve seen (1 - torch.exp(-x)) (not in the denominator) used instead of the sigmoid…

I haven’t revisited my analysis so not sure if I missed something…

I’m not the best person to check your maths, but the 1 - exp(-x) is an alternate gradient for softplus that I took from PyTorch when trying to resolve numerical stability issues.

Your calculation passes the torch.autograd.gradcheck finite-differences test (and the torch.autograd.gradgradcheck second-derivative test). And also:

>>> go = torch.randn(1000) * torch.randint(-100, 100, (1000,), dtype=torch.float32)
... inp = torch.randn(1000) * torch.randint(-100, 100, (1000,), dtype=torch.float32)
... torch.allclose(mish_jit_bwd(inp, go), mish_cuda.mish_backward(inp, go))
True

Super helpful, thanks!!

Can you link me to the thread? I was quite surprised to hear about the detached hooked outputs. Does that mean the conv layers in the skip connections are not updated?

(Sorry, replied to wrong post, replying to @themad95)

I couldn’t find the thread with a quick search. But from memory there wasn’t a lot of discussion, just Sylvain confirming that the hooks were intended to be detached (the default for hooks is detached, so it could have been an oversight).
This will not prevent the update of any layers; it just means that gradients flow through the whole downsample/upsample path rather than also flowing directly across the skip connections.

It may be that this is not an unreasonable choice in at least some cases (possibly common ones for fastai users). From quick tests, I think there is a pretty large increase in memory usage when the skip connections are not detached, so the larger batch sizes may balance out the less direct backward path (assuming gradient accumulation is not being used and GPU memory is limited).
Also, my tests that suggested an improvement when hooks weren’t detached were on a low-level segmentation task rather than a COCO-style task involving segmentation of higher-level ImageNet-like categories. This may make a difference, as I would suspect the initial layers matter more in a lower-level task on non-ImageNet categories (though I could be wrong).
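
To make the distinction concrete, a minimal sketch (assuming fastai v1’s hook_outputs signature with its detach flag; skip_modules is just an illustrative stand-in for the backbone layers feeding the skips):

import torch.nn as nn
from fastai.callbacks.hooks import hook_outputs

# Stand-in for whichever backbone layers feed the skip connections
skip_modules = [nn.Conv2d(3, 8, 3), nn.Conv2d(8, 16, 3)]
hooks_detached = hook_outputs(skip_modules)                # detach=True is the default: no grads across the skips
hooks_attached = hook_outputs(skip_modules, detach=False)  # grads flow straight across the skips, at a memory cost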

No, that is a bug I discovered when porting the unet to v2. I thought I had corrected it in v1 but will look again. It didn’t change the results BTW, so it doesn’t look like it’s important to propagate those gradients.

Ah, I guess I misread the thread or some such.
It does indeed now look fixed, as of a commit about a month ago, which was after my testing.
I was testing against a custom UNet with residual connections in the upsampling path, so the detached hooks may well be more likely to have an impact there; I did find some evidence of improvement, at least in initial training (but didn’t test extensively).

Can’t we use the num_classes parameter instead of creating the fully connected layer separately, as mentioned in the GitHub docs of EfficientNet-PyTorch, like this:

from efficientnet_pytorch import EfficientNet
model = EfficientNet.from_pretrained('efficientnet-b1', num_classes=23)

Link: https://github.com/lukemelas/EfficientNet-PyTorch

Sure, you could, but you won’t get fastai’s head with the dual pooling layers etc.
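
For reference, a rough sketch of the idea (simplified; fastai’s actual create_head also adds BatchNorm and Dropout, and the sizes here are just illustrative):

import torch.nn as nn
from fastai.layers import AdaptiveConcatPool2d, Flatten

n_features, n_classes = 1280, 23          # e.g. EfficientNet-B0 feature width
head = nn.Sequential(
    AdaptiveConcatPool2d(),               # concatenated avg + max pooling -> 2 * n_features
    Flatten(),
    nn.Linear(2 * n_features, n_classes),
)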

Is there a way to see how many activations the last layer of the model has, so that we can add the fully connected layer like you have done here:

model = EfficientNet.from_name('efficientnet-b0')
model._fc = nn.Linear(1280, data.c)

What I mean to ask is: here you have used 1280 as the input size of nn.Linear for EfficientNet-B0. What should we do for other EfficientNet models? Do we have to refer to the model architecture every time we create a new model, or is there some kind of code that does it for us?
Also, are EfficientNet models added to the fastai model zoo?

I would recommend reading my article where I integrated the timm library here; it touches on those points. And no, fastai does not have EfficientNet models in its zoo, though timm (and this general guide) should be good enough for working with any model.

It touches on a similar create_body method and a create_head method. (EfficientNet is in timm, hence why I want to point to the article :slight_smile: )
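
(For the specific question of finding the input width: one simple option, sketched here using the _fc attribute and data.c from the earlier snippet, is to read it off the existing classifier before replacing it:)

from efficientnet_pytorch import EfficientNet
import torch.nn as nn

model = EfficientNet.from_name('efficientnet-b0')
n_features = model._fc.in_features        # 1280 for b0; varies per variant
model._fc = nn.Linear(n_features, data.c)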

Thanks a lot.

That’s amazing! Exactly what I was looking for. The timm library has a great variety of newly trained models, now easily integrated with fastai. Thank you

I had a small doubt. How can I use the timm_learner function?
You had said that we can simply write:

from wwf.vision.timm import *
learn = timm_learner(dls, 'efficientnet_b3a', metrics=[error_rate, accuracy])
But I’m getting an error:

No module named 'wwf'

I’ve installed fastai and timm on my Colab notebook, but I’m still getting this error.
How can I fix that?