Thanks @TomB, a lot of useful information in your kind reply.
I’ll try as many things as I can, and let you know. Meanwhile, if you guys have a notebook of yours that you want me to run, don’t hesitate. For various reasons, I happen to have access to quite powerful hardware.
Did some initial profiling of efficientnet, rwightman’s B0. Full results are here, top few operations:
------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Name                            Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
ThnnConvDepthwise2DBackward                0.19%         3.006ms        0.77%   12.000ms      75.001us         6.69%   274.624ms        1.716ms              160
thnn_conv_depthwise2d_backward             0.52%         8.121ms        0.52%    8.121ms      50.754us         6.66%   273.703ms        1.711ms              160
thnn_conv_depthwise2d                      0.12%         1.868ms        0.60%    9.319ms      58.245us         1.69%    69.432ms      433.952us              160
thnn_conv_depthwise2d_forward              0.48%         7.451ms        0.48%    7.451ms      46.570us         1.68%    68.970ms      431.064us              160
CudnnConvolutionBackward                   1.09%        16.956ms        8.58%  133.822ms     205.879us         5.98%   245.505ms      377.700us              650
cudnn_convolution_backward                 7.50%       116.866ms        7.50%  116.866ms     179.793us         5.93%   243.625ms      374.808us              650
conv2d                                     0.75%        11.751ms       16.45%  256.508ms     316.676us         7.35%   301.968ms      372.800us              810
convolution                                0.72%        11.281ms       15.70%  244.757ms     302.169us         7.30%   299.588ms      369.861us              810
_convolution                               1.50%        23.395ms       14.98%  233.476ms     288.242us         7.24%   297.223ms      366.942us              810
cudnn_convolution                         12.73%       198.410ms       12.73%  198.410ms     305.246us         5.43%   222.896ms      342.917us              650
MulBackward0                               1.39%        21.662ms        4.91%   76.543ms     117.759us         5.12%   210.099ms      323.229us              650
So, yes, the depthwise convolutions do seem to account for a fair part. It’s a bit hard to compare, as the results aren’t split by layer and so combine operations of various sizes, but the 160 depthwise convs are taking about as long as the 650 non-depthwise convs. This also shows it’s using thnn_ operators for the depthwise convs, not cudnn_ operations (for at least some; some of the ‘depthwise’ convs might be going to non-depthwise-specific cudnn operators). The former are torch operators, while the latter come from NVIDIA’s cuDNN library and are quite possibly more optimised (though I think the thnn_ ones are at least newer torch ops, not older legacy stuff). I think there is some depthwise conv support in cuDNN, so I might have a dig in the PyTorch code to see when it is used.
I’m working on some profiling code at the moment that should give more information. Hopefully I’ll get back to that towards the end of the week.
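To put rough numbers on that comparison, here is a small sketch that just re-adds the CUDA totals from the table. It assumes the outer autograd entries (ThnnConvDepthwise2DBackward, CudnnConvolutionBackward) already include the inner backward kernels, so only one of each pair is counted; the grouping is my reading of the table, not something the profiler reports directly.

```python
# CUDA totals (ms) and call counts copied from the profiler table.
# Assumption: forward = thnn_conv_depthwise2d / cudnn_convolution,
# backward = ThnnConvDepthwise2DBackward / CudnnConvolutionBackward.
depthwise = {"forward": 69.432, "backward": 274.624, "calls": 160}
cudnn = {"forward": 222.896, "backward": 245.505, "calls": 650}


def summarize(row):
    """Return (total CUDA ms, CUDA ms per forward+backward call)."""
    total = row["forward"] + row["backward"]
    return total, total / row["calls"]


dw_total, dw_per_call = summarize(depthwise)
cudnn_total, cudnn_per_call = summarize(cudnn)

print(f"depthwise: {dw_total:.3f} ms total, {dw_per_call:.3f} ms per call")
print(f"cudnn:     {cudnn_total:.3f} ms total, {cudnn_per_call:.3f} ms per call")
print(f"per-call ratio: {dw_per_call / cudnn_per_call:.2f}x")
```

On these figures each depthwise call costs roughly three times as much CUDA time as an average cuDNN conv call, with most of the gap coming from the backward pass.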
Can you please tell me what the model.fc line does? I guess fc stands for fully connected, so maybe it specifies the number of outputs the model should have? I want to use this model (instead of ResNet) for a multi-label classification problem, btw.
You can also pass the num_classes parameter in override_params like: EfficientNet.from_name(model_name, override_params={'num_classes': data.c})
Or for pre-trained you can just pass num_classes like: EfficientNet.from_pretrained(model_name, num_classes=data.c)
(assuming data is a DataBunch)
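On the model.fc question: assuming the line in question assigns a new nn.Linear to the model's final layer, a minimal sketch of what that does is below. TinyNet and n_labels are made-up stand-ins so the snippet runs without any EfficientNet package; the real attribute name depends on the implementation you use.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a backbone with a final `fc` layer."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)  # fully connected classifier head

    def forward(self, x):
        return self.fc(self.body(x))


model = TinyNet()
n_labels = 5  # hypothetical number of labels in your dataset

# Swap the head: keep the input width, change the number of outputs.
model.fc = nn.Linear(model.fc.in_features, n_labels)

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # one raw score (logit) per label
```

For multi-label problems the head stays the same; the difference is in the loss (for example a sigmoid-based one such as nn.BCEWithLogitsLoss) rather than in the number of outputs.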
You won’t be able to use unet_learner without some work. That function only works with the more standard models for which information is in cnn_config. It also expects no special logic in the top-level modules as it takes the children and creates an nn.Sequential. This won’t work as EfficientNet has special logic.
So you’ll have to re-implement the top-level EfficientNet module without the final linear layer (you could wrap an existing EfficientNet and just call its extract_features and any subsequent layers that should be included). Then you could construct a models.unet.DynamicUnet from that body.
Further to that, where the features are supposed to be extracted in EfficientNet / MobileNet-V3 / MixNet is actually non-trivial. It is not the end of each block as in ResNets, because these are “Inverted Residual” blocks with the expansion in the middle of the block, bottlenecks between, and no non-linearity between blocks. You need to grab the expanded features in the middle of each block, after the depthwise conv and SE block and before the final 1x1 pointwise projection conv.
I have a work in progress implementation that uses hooks to grab the features at the necessary points. I managed to get it behaving well with multi-GPU setups. I’m debating between that approach and modifying the blocks to accept feature extraction flags and return OrderedDicts (always).
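As a rough illustration of the hook approach (not the actual work-in-progress code), the sketch below registers a forward pre-hook on the projection conv of a toy block so the expanded features are captured just before the 1x1 projection. ToyInvertedResidual is a simplified assumption: it has no BN or SE, and only borrows the conv_pw / conv_dw / conv_pwl naming.

```python
import torch
from torch import nn


class ToyInvertedResidual(nn.Module):
    """Simplified block: expand (1x1) -> depthwise (3x3) -> project (1x1)."""

    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.conv_pw = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.conv_dw = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                 groups=mid, bias=False)
        self.act = nn.ReLU()
        self.conv_pwl = nn.Conv2d(mid, out_ch, 1, bias=False)

    def forward(self, x):
        x = self.act(self.conv_pw(x))
        x = self.act(self.conv_dw(x))
        return self.conv_pwl(x)  # linear projection, no activation after


features = {}


def save_input(name):
    # A forward pre-hook receives (module, inputs); inputs is a tuple.
    def hook(module, inputs):
        features[name] = inputs[0]
    return hook


block = ToyInvertedResidual(8, 16, stride=2)
handle = block.conv_pwl.register_forward_pre_hook(save_input("pre_pwl"))

out = block(torch.randn(1, 8, 32, 32))
handle.remove()  # remove hooks once they are no longer needed

print(features["pre_pwl"].shape, out.shape)
```

The captured tensor has the expanded channel count (32 here) at the reduced resolution, while the block output has only the 16 bottleneck channels.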
@TomB Your CUDA swish/mish look great. I’ve been dragging my feet on doing that myself, but recognized it would help reduce some of the performance overhead of EfficientNets, etc.
Would you mind if I tried to pull them into my efficientnet/image-models repos at some point (with source level and README attribution)?
Nope, no problem. Note that installation requires the CUDA toolkit, which may be undesirable, and currently Swish is CUDA only. There’s CPU support for Mish, but it’s not yet optimised so may be slower than the straight PyTorch version.
You might also look at the extra/package.py script, it creates a standalone Python file that compiles and loads the extension. It should be cached by PyTorch so compilation time shouldn’t be an issue (I’ve only used on Kaggle so have to recompile each time but works).
I’ll be looking to spend some more time on them soon to clean them up (and likely merge the repos). Just finishing up some other things.
Cool, thanks. I’m set up with a full CUDA toolkit install, but aware it’s not the default these days. I was envisioning combining the build for all CUDA extensions in one script call, and having the implementations fall back to Python versions (with a warning) if that step hasn’t been performed. I saw a clean looking template for that somewhere recently… have to dig it up.
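The fallback idea could look something like this minimal sketch. The module name hypothetical_swish_ext is a placeholder (it is not the name of any real package), and the pure PyTorch fallback is just the textbook x * sigmoid(x).

```python
import warnings

import torch

try:
    # Hypothetical compiled extension; the module name is a placeholder.
    from hypothetical_swish_ext import swish as _swish_impl
    HAS_CUDA_EXT = True
except ImportError:
    HAS_CUDA_EXT = False
    warnings.warn(
        "Compiled swish extension not found; falling back to the "
        "pure PyTorch implementation, which may be slower."
    )

    def _swish_impl(x):
        # Plain PyTorch swish: x * sigmoid(x)
        return x * torch.sigmoid(x)


def swish(x):
    """Dispatch to the compiled kernel if it was built, else the fallback."""
    return _swish_impl(x)


print(HAS_CUDA_EXT, swish(torch.tensor([0.0, 1.0])))
```

Doing the check once at import time keeps the per-call path free of try/except overhead.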
Interesting. I implemented a UNet architecture off your EfficientNet by grabbing in between the IR blocks. Well, technically in between the block repeats, but the resolution changes occur in the first repeat. Not extensively tested but seems to work. Seemed to perform better than a custom ResUNet (but more testing and confirmation needed). Will have to look into your suggestions. I didn’t really consider the structure of the IR blocks.
I just call the child blocks separately collecting appropriate intermediates rather than hooking (but that wouldn’t work nicely to collect from the middle of an IR).
For the decoder I concatenate the upsampled features and the skips, and use ResBlocks with bn->act->conv.
It’s for a Kaggle comp that ends tomorrow, but I can look to clean it up and post after that.
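A decoder step along those lines might be sketched as follows. This is only an illustration of the bn->act->conv ordering and the concat, with made-up channel sizes, ReLU as the activation and nearest-neighbour upsampling as assumptions; it is not the code from the competition.

```python
import torch
from torch import nn
import torch.nn.functional as F


class PreActResBlock(nn.Module):
    """Residual block with full pre-activation ordering: BN -> act -> conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        # 1x1 conv on the shortcut only when the channel count changes.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + self.shortcut(x)


class DecoderBlock(nn.Module):
    """Upsample, concatenate with the skip connection, then a ResBlock."""

    def __init__(self, up_ch, skip_ch, out_ch):
        super().__init__()
        self.block = PreActResBlock(up_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
        return self.block(torch.cat([x, skip], dim=1))


dec = DecoderBlock(up_ch=32, skip_ch=16, out_ch=24)
out = dec(torch.randn(2, 32, 8, 8), torch.randn(2, 16, 16, 16))
print(out.shape)
```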
But, regardless, you are right that the DynamicUnet likely won’t work well as I think it will just grab right before/after the stride 2 conv, in the middle of an IR (can’t remember if before/after res change).
Oh, and thanks for the nice EfficientNet implementation, it’s been a pleasure to use (and abuse).
My work in progress should work well with other Unet or Deeplab impl that accept a backbone encoder, I’ve been testing with a Unet based on qubvel’s as below…
Feedback on the approach for feature extraction welcome, I think my approach works around some issues with the fastai (https://github.com/fastai/fastai/issues/1435) unet hooks as well.
If you want to use any of the PyTorch EfficientNet impl with weights ported from Tensorflow, you need to pay attention to several preprocessing details:
Use the correct resolution, already covered in the response below.
Use the correct crop ratio. I don’t believe Luke’s impl covers this properly. As the resolution scales up the crop pct does not remain .875, it’s img_size / (img_size + 32)
Use bicubic interpolation. These models were trained with that, and they definitely have a noticeable gap between bilinear and bicubic. There is even a difference between the TF impl of bicubic and PIL. My repo has a replication of the TF preprocessing pipeline for reference and you can see the differences. I think TF’s bicubic is closer to OpenCV but I’ve yet to test with OpenCV for these models.
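The crop ratio point is easy to check numerically. A minimal sketch, assuming the usual B0 to B7 input resolutions and the img_size / (img_size + 32) rule quoted above:

```python
def crop_pct(img_size, crop_padding=32):
    """Fraction of the image kept by the center crop at a given resolution."""
    return img_size / (img_size + crop_padding)


# Commonly used EfficientNet B0..B7 input resolutions.
for size in (224, 240, 260, 300, 380, 456, 528, 600):
    print(f"{size}px -> crop_pct {crop_pct(size):.3f}")
```

At 224px this gives the familiar 0.875, but the ratio rises towards 0.95 by 600px, so hard-coding 0.875 crops the larger models too tightly.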
Also, if you’re using the TF ported weights in PyTorch, I wouldn’t recommend continuing to use the SAME style padding that is necessary to replicate the results with those weights. I’d use the weights but with standard PyTorch padding. There is a noticeable accuracy hit initially, but it’ll likely disappear with fine-tuning and you’ll be left with a model that’s a bit more efficient and easier to work with down the line. I’ve fine-tuned B0, B1, and B2 with this technique, roughly back to original accuracy (on ImageNet).
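For anyone wondering why SAME padding is awkward in PyTorch, here is a small sketch of the standard TF SAME rule for one spatial dimension. The function name is mine; the arithmetic is the usual ceil(in / stride) output size with any odd padding going on the end.

```python
import math


def same_padding_1d(in_size, kernel, stride, dilation=1):
    """Return (pad_before, pad_after) under TF-style SAME padding."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + (kernel - 1) * dilation + 1 - in_size, 0)
    before = total // 2
    return before, total - before


# 3x3 conv, stride 2: the padding depends on the input size
# and can be asymmetric.
print(same_padding_1d(224, kernel=3, stride=2))
print(same_padding_1d(225, kernel=3, stride=2))
# 3x3 conv, stride 1: matches the usual symmetric padding=1.
print(same_padding_1d(224, kernel=3, stride=1))
```

A fixed symmetric padding=1 in nn.Conv2d cannot reproduce the asymmetric, input-dependent cases, which is why ported models need extra padding logic and why dropping it (then fine-tuning) simplifies things.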
Interesting, thanks for the info. Will take a look at your stuff.
I seem to have taken a bit different approach. Looks like using expansion_output you’d be taking your skip connections (the ones across the U) straight after your resolution reduction. This looks like what fastai does. I take my skip connections from before the resolution reductions (where I use stride 2 convs). Wasn’t really a decided choice here, came mostly as I was borrowing from my implementation of Road Extraction by Deep Residual U-Net (which various Kaggle unets seem to come from). Looks like the original Unet paper takes them after the reduction.
Borrowing from that paper is also where the ‘full pre-activation’ (BN->Act->Conv) in the ResBlocks came from (that paper taking this from He et al).
Yes, within a given block that’s true. That ‘expansion_output’ is Google’s impl. My version allows two choices right now, each of which hooks one of the locations. With the feature location set to ‘pre_pwl’, my impl hooks the input to the projection conv_pwl, so it’s equivalent to Google’s. This is after the strided layer, so to compensate I hook the last block before the strided block at a given resolution. For the other option in mine, ‘post_exp’, I hook the output of the expanding conv_pw, before the strided DW. For that option I hook the last block with the strides for a given feature map resolution. I also take care of swapping strides with dilations correctly for a given output_stride target (not usually a factor for Unet, but it is for DeepLab and others).
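On the stride/dilation swap, a generic sketch of the idea (not the implementation discussed here) is below. It follows the common convention, as in torchvision's ResNet, that the block losing its stride keeps the previous dilation and later blocks in that stage use the increased one. The function and key names are my own.

```python
def swap_strides_for_dilation(stage_strides, output_stride):
    """Plan per-stage stride and dilation for a target output_stride.

    Once the cumulative stride would exceed output_stride, the stage's
    stride is set to 1 and the dilation is multiplied by it instead.
    """
    current_stride, dilation = 1, 1
    plan = []
    for stride in stage_strides:
        first_dilation = dilation  # used by the first (formerly strided) block
        if current_stride * stride > output_stride:
            dilation *= stride
            stride = 1
        else:
            current_stride *= stride
        plan.append({"stride": stride,
                     "first_dilation": first_dilation,
                     "dilation": dilation})
    return plan


# A backbone with five stride-2 stages (overall stride 32), targeting 16.
for stage in swap_strides_for_dilation([2, 2, 2, 2, 2], output_stride=16):
    print(stage)
```

With output_stride=32 nothing changes; with 16 only the last stage is dilated; with 8 the last two stages are dilated by 2 and 4.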
Ah right, yeah, I wasn’t quite sure if you’d take the expansion_output from the strided block or the one before.
One thing I noted in fastai unets is that they detach the hooked outputs, so there’s no gradients across the U. Seemed an odd choice to me (saw a post from Sylvain saying it was intended). I didn’t follow it but haven’t done any testing there (think this does reduce memory quite a lot, I got lower batch sizes with what I think is a smaller ResUnet than a fastai Unet with resnet backbone). As you didn’t use the fastai hooks you aren’t following this (they default to detaching).
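The effect of detaching a skip connection can be seen on a toy graph. In this sketch the tensors are stand-ins (x for the input, encoder_feat for a hooked encoder activation, deep_path for everything below it in the U); it is not fastai code.

```python
import torch

x = torch.ones(3, requires_grad=True)

encoder_feat = x * 2               # stands in for a hooked encoder feature map
deep_path = encoder_feat * 3       # stands in for the deeper encoder + decoder

skip_attached = encoder_feat           # gradients can flow across the U
skip_detached = encoder_feat.detach()  # same values, cut from the graph

loss_attached = (deep_path + skip_attached).sum()
loss_detached = (deep_path + skip_detached).sum()

(grad_attached,) = torch.autograd.grad(loss_attached, x, retain_graph=True)
(grad_detached,) = torch.autograd.grad(loss_detached, x)

print(grad_attached)  # contribution from both the deep path and the skip
print(grad_detached)  # contribution from the deep path only
```

With the detached skip, the encoder only receives gradient through the deep path, so the skip features are used by the decoder but never directly shape the encoder's training signal.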
None of the papers I’ve looked at seem to directly address this, in which case I’d assume they intend gradients to propagate.
First of all, thanks for your detailed reply. I’ll try all the stuff you recommended, and let you know (to be honest, I still have to experiment with your implementation, which people report to be the most effective. Unfortunately I’m drowning in work these days…).
Since you cited the finetuning of the smaller flavors, I’d like to ask if anyone has experimented with the bigger cousins (b5-b7). For me, the most interesting thing about EN was to finally have a network pretrained on bigger images, or, differently stated, capable of recognizing features that almost disappear (or just disappear altogether) as you rescale some big image to 224 or 299. Yes, networks pretrained at such res (e.g. ResNetXXX) have some degree of tolerance as you go up with the resolution, but it stops (in my experience) at some 450px-600px, depending upon the specific features to be learned.
I finetuned b7, using just 600px imgs, and the results compared to resnet, at the same res, were encouraging. It took a lifetime, though, and occupied ~90 GB of VRAM with a bs of ~20, as I said above.
I’ll try your implementation with all your recommendations, and let you know.