Adding EfficientNet to fastai vision

balnazzar · October 16, 2019, 9:08pm

TomB:

Yes, I’ve also found EfficientNet is quite slow in PyTorch. I think that rwightman’s version is a bit faster, but that’s not based on particularly extensive testing. The code looks like it’s written more with performance in mind, especially the padding stuff, which is a bit weird in Luke’s. On that, you might want to ensure you’re using the fixed image size versions there, as they looked better (I think you just had to provide your image size).
I think it might be related to some issues with the depthwise convolutions in PyTorch; I’ve seen various things about performance issues there on the forums/code. You might also want to try PyTorch 1.2 if possible, as there might be some improvements there (or 1.3, but I figured if you’re on 1.1 for some reason then that might be an easier jump).

And yeah, a pretty sizeable memory drop using either the autograd or cuda versions of Swish/Mish (time is one epoch, b0, bs48, 256x256, rwightman’s, Swish).
          alloc MB  time
Original  6879      01:11
Autograd  5421      01:14
CUDA      5400      01:02
From this notebook which has the autograd version of swish and the little wrapper you need to use swish cuda with rwightman’s (check my fork for the little change to allow specifying an activation function).
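For readers who have not seen the pattern, here is a minimal sketch of what an "autograd version" of Swish usually looks like. It is not copied from the notebook mentioned above, and the class names are illustrative. The idea is that a plain `x * torch.sigmoid(x)` keeps intermediate results alive for the backward pass, while a custom `torch.autograd.Function` can save only the input and recompute the sigmoid when gradients are needed.

```python
import torch


class SwishFunction(torch.autograd.Function):
    """Swish, x * sigmoid(x), saving only the input tensor for backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        sig = torch.sigmoid(x)  # recomputed here instead of being stored
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * sig * (1 + x * (1 - sig))


class Swish(torch.nn.Module):
    """Drop-in module wrapper (name chosen for illustration)."""

    def forward(self, x):
        return SwishFunction.apply(x)
```

The trade-off is a little extra compute in backward in exchange for fewer stored activations, which is consistent with the Autograd row in the table above being slightly slower but using less memory.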

Thanks @TomB, a lot of useful information in your kind reply.

I’ll try as many things as I can, and let you know. Meanwhile, if you guys have some notebook of yours that you want me to run, don’t hesitate. For various reasons, I happen to have access to quite powerful hardware.

TomB · October 22, 2019, 12:00pm

Did some initial profiling of efficientnet, rwightman’s B0. Full results are here, top few operations:

-------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
ThnnConvDepthwise2DBackward            0.19%            3.006ms          0.77%            12.000ms         75.001us         6.69%            274.624ms        1.716ms          160              
thnn_conv_depthwise2d_backward         0.52%            8.121ms          0.52%            8.121ms          50.754us         6.66%            273.703ms        1.711ms          160              
thnn_conv_depthwise2d                  0.12%            1.868ms          0.60%            9.319ms          58.245us         1.69%            69.432ms         433.952us        160              
thnn_conv_depthwise2d_forward          0.48%            7.451ms          0.48%            7.451ms          46.570us         1.68%            68.970ms         431.064us        160              
CudnnConvolutionBackward               1.09%            16.956ms         8.58%            133.822ms        205.879us        5.98%            245.505ms        377.700us        650              
cudnn_convolution_backward             7.50%            116.866ms        7.50%            116.866ms        179.793us        5.93%            243.625ms        374.808us        650              
conv2d                                 0.75%            11.751ms         16.45%           256.508ms        316.676us        7.35%            301.968ms        372.800us        810              
convolution                            0.72%            11.281ms         15.70%           244.757ms        302.169us        7.30%            299.588ms        369.861us        810              
_convolution                           1.50%            23.395ms         14.98%           233.476ms        288.242us        7.24%            297.223ms        366.942us        810              
cudnn_convolution                      12.73%           198.410ms        12.73%           198.410ms        305.246us        5.43%            222.896ms        342.917us        650              
MulBackward0                           1.39%            21.662ms         4.91%            76.543ms         117.759us        5.12%            210.099ms        323.229us        650
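
A table like the one above can be produced with PyTorch's autograd profiler. The sketch below is not the profiling code used for these results: the three-layer model is a made-up stand-in, and it runs on CPU only, so the CUDA columns would need a GPU run with CUDA profiling enabled.

```python
import torch
import torch.nn as nn

# A depthwise conv is a grouped conv with groups == in_channels.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    nn.Conv2d(32, 32, 3, padding=1, groups=32, bias=False),  # depthwise
    nn.Conv2d(32, 16, 1, bias=False),                         # pointwise
)
x = torch.randn(2, 3, 64, 64)

# Profile one forward and backward pass
with torch.autograd.profiler.profile() as prof:
    model(x).sum().backward()

# Aggregate by operator name and sort by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```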

So, yes, the depthwise convolutions do seem to account for a fair part of the time. It’s a bit hard to compare as the results aren’t split by layer, so operations of various sizes are combined. But the 160 depthwise convs are taking about as long as the 650 non-depthwise convs. This also shows it’s using thnn_ operators for the depthwise convs, not cudnn_ operations (for at least some; some of the ‘depthwise’ convs might be going to non-depthwise-specific cudnn operators). The former are torch operators, while the latter are from NVIDIA’s cuDNN library and quite possibly more optimised (though I think the thnn_ ops are at least newer torch ops, not older legacy stuff). I think there is some depthwise conv support in cuDNN, so I might have a dig in the PyTorch code to see when it is used.
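
To put rough numbers on that comparison, the arithmetic can be done directly on the "CUDA total" and "Number of Calls" columns. This is only a back-of-the-envelope sketch: it uses the innermost forward and backward operators from the table and assumes every cudnn call is a non-depthwise conv, which may not be exactly true.

```python
# (CUDA total in ms, number of calls), copied from the profile table above
depthwise = {
    "thnn_conv_depthwise2d_forward": (68.970, 160),
    "thnn_conv_depthwise2d_backward": (273.703, 160),
}
cudnn = {
    "cudnn_convolution": (222.896, 650),
    "cudnn_convolution_backward": (243.625, 650),
}


def total_ms(rows):
    """Sum of CUDA time over the forward and backward operators."""
    return sum(ms for ms, _ in rows.values())


def per_call_ms(rows, calls):
    """Forward + backward CUDA time per convolution."""
    return total_ms(rows) / calls


dw_per_call = per_call_ms(depthwise, 160)
cudnn_per_call = per_call_ms(cudnn, 650)

print(f"depthwise: {total_ms(depthwise):.1f} ms total, {dw_per_call:.2f} ms per conv")
print(f"cudnn:     {total_ms(cudnn):.1f} ms total, {cudnn_per_call:.2f} ms per conv")
print(f"per-conv ratio: {dw_per_call / cudnn_per_call:.1f}x")
```

On these figures the totals are of the same order (roughly 343 ms against 467 ms), while an average depthwise conv costs about three times as much CUDA time as an average cuDNN conv.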

I’m working on some profiling code at the moment to hopefully give more information. Hopefully get back to that towards the end of the week.

remapears · October 22, 2019, 5:59pm

Dear @muellerzr,

Can you please tell me what the model.fc line does? I guess fc stands for fully connected, so maybe it specifies the number of outputs the model should have? I want to use this model (instead of ResNet) for a multi-label classification problem, btw.

muellerzr · October 22, 2019, 6:10pm

Yes, exactly: fc is the last linear layer. So if we have 3 classes, we want that output to be 3.

TomB · October 22, 2019, 6:23pm

You can also pass the num_classes parameter in override_params like:
EfficientNet.from_name(model_name, override_params={'num_classes': data.c})
Or for pre-trained you can just pass num_classes like:
EfficientNet.from_pretrained(model_name, num_classes=data.c)
(assuming data is a DataBunch)
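
For readers replacing the head by hand instead, a short sketch of where the two sizes of the new linear layer come from. `TinyClassifier` is a hypothetical stand-in so the example runs without downloading weights; it is not the EfficientNet class. The input size must match the number of features the body produces (1280 for EfficientNet-B0; larger variants are wider), and the output size is the number of classes.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained net whose head is called `fc`."""

    def __init__(self, n_features=1280, n_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))


model = TinyClassifier()
n_classes = 3  # would be data.c with a fastai DataBunch

# Read in_features from the existing head rather than hard-coding 1280
model.fc = nn.Linear(model.fc.in_features, n_classes)

logits = model(torch.randn(2, 1280, 7, 7))  # fake feature maps, batch of 2
```

Reading `in_features` from the old layer avoids hard-coding a number that changes between model variants.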

remapears · October 23, 2019, 12:33pm

I see but what about the value 1280?
Thank you!

MassimilianoG · October 23, 2019, 3:56pm

Hi, how can I use an EfficientNet in a Unet? I tried this code:

model = EfficientNet.from_name('efficientnet-b2')
learn = unet_learner(data, model, metrics=[dice])

But I got this error:

/opt/anaconda3/lib/python3.7/site-packages/efficientnet_pytorch/model.py in forward(self, inputs)
    189     def forward(self, inputs):
    190         """ Calls extract_features to extract features, applies final linear layer, and returns logits. """
--> 191         bs = inputs.size(0)
    192         # Convolution layers
    193         x = self.extract_features(inputs)

AttributeError: 'bool' object has no attribute 'size'

TomB · October 23, 2019, 4:51pm

You won’t be able to use unet_learner without some work. That function only works with the more standard models for which information is in cnn_config. It also expects no special logic in the top-level modules as it takes the children and creates an nn.Sequential. This won’t work as EfficientNet has special logic.
So you’ll have to re-implement the top-level EfficientNet module without the final linear layer (you could wrap an existing EfficientNet and just call its extract_features and any subsequent layers that should be included). Then you could construct a models.unet.DynamicUnet from that body.
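
A minimal sketch of the wrapping idea, under stated assumptions: `FakeEfficientNet` is a hypothetical stand-in that only shares the `extract_features` method name seen in the traceback above, and `FeatureBody` is an illustrative name. Whether DynamicUnet accepts such a body unchanged is not confirmed here and more restructuring may be needed.

```python
import torch
import torch.nn as nn


class FeatureBody(nn.Module):
    """Wraps a model and exposes only its convolutional features."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        # Skip the pooling and final linear layer entirely
        return self.model.extract_features(x)


class FakeEfficientNet(nn.Module):
    """Hypothetical stand-in with the same `extract_features` method name."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.fc = nn.Linear(8, 10)

    def extract_features(self, x):
        return torch.relu(self.stem(x))

    def forward(self, x):
        return self.fc(self.extract_features(x).mean(dim=(2, 3)))


body = FeatureBody(FakeEfficientNet())
feats = body(torch.randn(1, 3, 32, 32))  # spatial feature map, not logits
```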

rwightman · October 23, 2019, 5:00pm

Further to that, where the features are supposed to be extracted in EfficientNet / MobileNet-V3 / MixNet is actually non-trivial. It is not the end of each block as in ResNets, due to the fact that these are “Inverse Residual” blocks with the expansion being in the middle of the block, bottlenecks between, and no non-linearity between blocks. You need to grab the expanded features in the middle of each block, after the depthwise conv and SE block and before the final 1x1 pointwise projection conv.

I have a work in progress implementation that uses hooks to grab the features at the necessary points. I managed to get it behaving well with multi-GPU setups. I’m debating between that approach and modifying the blocks to accept feature extraction flags and return OrderedDicts (always).
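
To make the hook idea concrete, here is a rough sketch using a forward pre-hook on the projection conv of a toy block. This is not the work-in-progress code: the block is heavily simplified (no SE, no batch norm, no residual add), and the layer names and the `save_input` helper are illustrative.

```python
import torch
import torch.nn as nn


class ToyInvertedResidual(nn.Module):
    """Simplified block: expand (1x1) -> depthwise (3x3) -> project (1x1)."""

    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.conv_pw = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.conv_dw = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                 groups=mid, bias=False)
        self.conv_pwl = nn.Conv2d(mid, out_ch, 1, bias=False)  # linear projection
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.conv_pw(x))
        x = self.act(self.conv_dw(x))
        return self.conv_pwl(x)  # no activation after the projection


captured = {}


def save_input(name):
    """Build a pre-hook that stores the tensor entering the hooked module."""
    def hook(module, inputs):
        captured[name] = inputs[0]
    return hook


block = ToyInvertedResidual(16, 24, stride=2)
handle = block.conv_pwl.register_forward_pre_hook(save_input("pre_pwl"))
out = block(torch.randn(1, 16, 32, 32))
handle.remove()  # hooks stay registered until removed
```

The captured tensor has the expanded channel count (96 here) at the reduced resolution, while the block output has only the 24 bottleneck channels.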

rwightman · October 23, 2019, 5:03pm

@TomB Your CUDA swish/mish look great. I’ve been dragging my feet on doing that myself, but recognized it would help reduce some of the performance overhead of EfficientNets, etc.

Would you mind if I tried to pull them into my efficientnet/image-models repos at some point (with source level and README attribution)?

TomB · October 23, 2019, 5:18pm

Nope, no problem. Note that installation requires CUDA toolkit which may be undesirable. Though currently Swish is CUDA only. There’s CPU support for Mish but not yet optimised so may be slower than the straight PyTorch version.
You might also look at the extra/package.py script, it creates a standalone Python file that compiles and loads the extension. It should be cached by PyTorch so compilation time shouldn’t be an issue (I’ve only used on Kaggle so have to recompile each time but works).
I’ll be looking to spend some more time on them soon to clean them up (and likely merge the repos). Just finishing up some other things.

rwightman · October 23, 2019, 5:26pm

Cool, thanks. I’m set up with a full CUDA toolkit install, but aware it’s not the default these days. I was envisioning combining the build for all CUDA extensions in one script call, and having the implementations fall back to Python versions (with a warning) if that step hasn’t been performed. I saw a clean looking template for that somewhere recently… have to dig it up.

TomB · October 23, 2019, 5:42pm

Interesting. I implemented a UNet architecture off your EfficientNet by grabbing in between the IR blocks. Well, technically in between the block repeats, but the resolution changes occur in the first repeat. Not extensively tested but seems to work. Seemed to perform better than a custom ResUNet (but more testing and confirmation needed). Will have to look into your suggestions. I didn’t really consider the structure of the IR blocks.
I just call the child blocks separately collecting appropriate intermediates rather than hooking (but that wouldn’t work nicely to collect from the middle of an IR).
For the decoder I concatenate the upsampled features and skips and use ResBlocks with bn->act->conv.
For a kaggle comp that ends tomorrow but can look to clean it up and post after that.

But, regardless, you are right that the DynamicUnet likely won’t work well as I think it will just grab right before/after the stride 2 conv, in the middle of an IR (can’t remember if before/after res change).

Oh, and thanks for the nice EfficientNet implementation it’s been a pleasure to use (and abuse).

rwightman · October 23, 2019, 6:50pm

Either does work, yeah. But I don’t think it’s optimal. I noticed qubvel’s segmentation model repo is grabbing the features from between the blocks too. The official TPU repo and the MobileNetV3 paper mention the intended points for these blocks (‘expansion_output’): tpu/models/official/efficientnet/efficientnet_model.py at master · tensorflow/tpu · GitHub

My work in progress should work well with other Unet or Deeplab impl that accept a backbone encoder, I’ve been testing with a Unet based on qubvel’s as below…

encoder = timm.create_model(
    backbone, features_only=True, out_indices=(0, 1, 2, 3, 4), in_chans=in_chans, pretrained=True, **backbone_kwargs)
encoder.out_shapes = encoder.feature_channels()
self.encoder = encoder

self.decoder = UnetDecoder(
    encoder_channels=self.encoder.out_shapes,
    decoder_channels=decoder_channels,
    final_channels=num_classes,
    norm_layer=norm_layer, norm_kwargs=norm_kwargs,
    center=center)

My WIP for the feature extraction, which also includes the new Conditional Convolution (CondConv) version of EfficientNet implemented in PyTorch, is on a branch here: https://github.com/rwightman/pytorch-image-models/blob/condconvs_and_features/timm/models/gen_efficientnet.py

Feedback on the approach for feature extraction is welcome. I think my approach also works around some issues with the fastai unet hooks (unet_learner fails when using nn.DataParallel (multi-GPU) · Issue #1435 · fastai/fastai · GitHub).

rwightman · October 23, 2019, 7:19pm

If you want to use any of the PyTorch EfficientNet impl with weights ported from Tensorflow, you need to pay attention to several preprocessing details:

1. Use the correct resolution, already covered in the response below.
2. Use the correct crop ratio. I don’t believe Luke’s impl covers this properly. As the resolution scales up the crop pct does not remain .875, it’s img_size / (img_size + 32)
3. Use bicubic interpolation. These models were trained with that, and they definitely have a noticeable gap between bilinear and bicubic. There is even a difference between the TF impl of bicubic and PIL. My repo has a replication of the TF preprocessing pipeline for reference and you can see the differences. I think TF’s bicubic is closer to OpenCV but I’ve yet to test with OpenCV for these models.
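
As a worked example of the crop ratio formula above, the sketch below tabulates img_size / (img_size + 32) for the native resolutions of B0 to B7. The "resize to" column assumes the usual resize-shorter-side-then-center-crop pipeline, which is one way to realise a given crop pct and not necessarily identical to the TF preprocessing.

```python
# Native input resolutions of the EfficientNet variants
RESOLUTIONS = {"b0": 224, "b1": 240, "b2": 260, "b3": 300,
               "b4": 380, "b5": 456, "b6": 528, "b7": 600}


def crop_pct(img_size, pad=32):
    """Fraction of the resized image kept by the center crop."""
    return img_size / (img_size + pad)


for name, size in RESOLUTIONS.items():
    print(f"efficientnet-{name}: resize to {size + 32}, "
          f"center crop {size}, crop_pct={crop_pct(size):.3f}")
```

Only B0 at 224 gives the familiar 0.875; the ratio rises towards about 0.95 for B7 at 600.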

Also, if you’re using the TF ported weights in PyTorch, I wouldn’t recommend continuing to use the SAME style padding that is necessary to replicate the results with those weights. I’d use the weights but use standard PyTorch padding. There is a noticeable accuracy hit initially, but it’ll likely disappear with the fine tuning and you’ll be left with a model that’s a bit more efficient and easier to work with down the line. I’ve finetuned B0, B1, and B2 with this technique, roughly back to original accuracy (on imagenet).
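
For anyone unsure why SAME padding is awkward in PyTorch, here is a small sketch of the arithmetic along one dimension. The helper names are illustrative. Under TF's SAME rule the padding depends on the input size and can be asymmetric, whereas a standard PyTorch conv uses a fixed, symmetric padding value.

```python
import math


def tf_same_padding(in_size, kernel, stride):
    """(before, after) padding along one dimension under TF 'SAME' rules."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    before = total // 2
    return before, total - before  # any odd pixel goes at the end


def torch_out_size(in_size, kernel, stride, padding):
    """Output size of a PyTorch conv with symmetric padding, dilation 1."""
    return (in_size + 2 * padding - kernel) // stride + 1


# 3x3 conv, stride 2, even input: SAME pads one side only
print(tf_same_padding(224, 3, 2))
# Same layer, odd input: SAME pads both sides
print(tf_same_padding(225, 3, 2))
# Standard PyTorch padding=1 gives the same output size for the even input
print(torch_out_size(224, 3, 2, padding=1))
```

For the 224 input both schemes produce 112 outputs, but the sampling grid is shifted by one pixel, which is one plausible reason for an initial accuracy hit when ported weights are run with standard padding.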

TomB · October 23, 2019, 7:46pm

Interesting, thanks for the info. Will take a look at your stuff.

I seem to have taken a bit different approach. Looks like using expansion_output you’d be taking your skip connections (the ones across the U) straight after your resolution reduction. This looks like what fastai does. I take my skip connections from before the resolution reductions (where I use stride 2 convs). Wasn’t really a decided choice here, came mostly as I was borrowing from my implementation of Road Extraction by Deep Residual U-Net (which various Kaggle unets seem to come from). Looks like the original Unet paper takes them after the reduction.
Borrowing from that paper is also where the ‘full pre-activation’ (BN->Act->Conv) in the ResBlocks came from (that paper taking this from He et al).
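
A rough sketch of what a ‘full pre-activation’ residual block and a decoder stage built on it could look like. This is an illustration of the ordering only, not the code from the Kaggle competition: class names, the ReLU activation, the 1x1 shortcut and nearest-neighbour upsampling are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreActResBlock(nn.Module):
    """Full pre-activation block: (BN -> ReLU -> Conv) twice, plus a shortcut."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        # 1x1 conv so the shortcut matches the output channel count
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + self.shortcut(x)


class DecoderStage(nn.Module):
    """Upsample, concatenate with the skip connection, then refine."""

    def __init__(self, up_ch, skip_ch, out_ch):
        super().__init__()
        self.block = PreActResBlock(up_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
        return self.block(torch.cat([x, skip], dim=1))


stage = DecoderStage(up_ch=32, skip_ch=16, out_ch=24)
y = stage(torch.randn(2, 32, 8, 8), torch.randn(2, 16, 16, 16))
```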

rwightman · October 23, 2019, 7:59pm

Yes, within a given block that’s true. That ‘expansion_output’ is Google’s impl. My version currently allows two choices of hook location. With the feature location set to ‘pre_pwl’, my impl hooks the input to the projection conv_pwl, so it’s equivalent to Google’s. This is after the strided layer, so to compensate I hook the last block before the strided block at a given resolution. With the other option, ‘post_exp’, I hook the output of the expanding conv_pw, before the strided DW; for that option I hook the last block with the strides for a given feature map resolution. I also take care of swapping strides with dilations correctly for a given output_stride target (not usually a factor for Unet, but it is for DeepLab and others).
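
For readers unfamiliar with the stride/dilation swap mentioned above, a simplified sketch of the bookkeeping follows. It is the generic DeepLab-style rule, not this implementation: once the accumulated stride reaches the output_stride target, further strided stages keep stride 1 and multiply the dilation instead. The function name is illustrative, and details such as which conv inside a stage receives the new dilation are ignored.

```python
def strides_to_dilations(stage_strides, output_stride):
    """Return a (stride, dilation) pair per stage, capping the total stride."""
    current_stride = 1
    dilation = 1
    plan = []
    for s in stage_strides:
        if current_stride * s > output_stride:
            # Stride budget used up: keep resolution, widen the receptive field
            dilation *= s
            plan.append((1, dilation))
        else:
            current_stride *= s
            plan.append((s, dilation))
    return plan


# Five stride-2 stages give a total stride of 32 by default
print(strides_to_dilations([2, 2, 2, 2, 2], output_stride=32))
print(strides_to_dilations([2, 2, 2, 2, 2], output_stride=16))
print(strides_to_dilations([2, 2, 2, 2, 2], output_stride=8))
```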

TomB · October 23, 2019, 8:18pm

Ah right, yeah, I wasn’t quite sure if you’d take the expansion_output from the strided block or the one before.

One thing I noted in fastai unets is that they detach the hooked outputs, so there are no gradients across the U. Seemed an odd choice to me (saw a post from Sylvain saying it was intended). I didn’t follow it but haven’t done any testing there (I think this does reduce memory quite a lot; I got lower batch sizes with what I think is a smaller ResUnet than a fastai Unet with a resnet backbone). As you didn’t use the fastai hooks you aren’t following this (they default to detaching).
None of the papers I’ve looked at seem to directly address this, in which case I’d assume they intend gradients to propagate.
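
A tiny sketch of what detaching a skip connection does to the gradients, using scalar multiplications as stand-ins for the encoder and decoder. It only illustrates the autograd behaviour of `detach()` and says nothing about which choice trains better.

```python
import torch

x = torch.ones(3, requires_grad=True)

# Skip connection with gradients flowing across the U
encoder_out = x * 2          # stands in for an encoder feature map
deep_path = encoder_out * 3  # stands in for the deeper encoder + decoder path
(deep_path + encoder_out).sum().backward()
grad_with_skip = x.grad.clone()

# Same computation with a detached skip
x.grad = None
encoder_out = x * 2
deep_path = encoder_out * 3
# The detached skip contributes to the forward value but not to the gradient
(deep_path + encoder_out.detach()).sum().backward()
grad_detached = x.grad.clone()

print(grad_with_skip, grad_detached)
```

With the skip attached each element receives a gradient of 8 (6 via the deep path plus 2 via the skip); with it detached only the 6 from the deep path remains.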

balnazzar · October 27, 2019, 5:31pm

First of all, thanks for your detailed reply. I’ll try all the stuff you recommended, and let you know (to be honest, I still have to experiment with your implementation, which people report to be the most effective. Unfortunately I’m drowning in work these days…).

Since you cited the finetuning of the smaller flavors, I’d like to ask if anyone has experimented with the bigger cousins (b5-b7). For me, the most interesting thing about EN was to finally have a network pretrained on bigger images, or, differently stated, capable of recognizing features that almost disappear (or just disappear altogether) as you rescale some big image to 224 or 299. Yes, networks pretrained at such res (e.g. ResNetXXX) have some degree of tolerance as you go up with the resolution, but it stops (in my experience) at some 450px-600px, depending upon the specific features to be learned.
I finetuned b7, using just 600px imgs, and the results wrt resnet, at the same res, were encouraging. It took a lifetime, though, and occupied ~90gb of vram with a bs ~20, as I said above.

I’ll try your implementation with all your recommendations, and let you know.

Again, Thanks!

balnazzar · October 27, 2019, 5:33pm

Very interesting considerations. Keep us posted if you like!