Fastai v2 additional models


I understand that fastai v2 models are built on tricks from the fastai community’s experience plus the (m)xResNet ideas from the Amazon paper, and that this works really well.
I have two questions:

  1. would you (@jeremy, @sgugger) be interested in a PR applying the same tricks to the DenseNet architecture, given that it performs comparably well to mxresnet?

  2. are you planning to describe all these tricks together in some blog/paper/other non-pure-code form?
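For context on question 1, one of the tricks in question is the “deep stem” from the Bag of Tricks paper: replace the single 7x7 stride-2 stem conv with three 3x3 convs. A minimal PyTorch sketch (the `conv_bn_relu` helper and channel counts are illustrative, not fastai’s actual code):

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    """3x3 conv -> BN -> ReLU, the building block of the deep stem."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# "Deep stem" trick: three 3x3 convs (the first strided) in place of the
# usual 7x7/stride-2 stem conv. The same stem can be dropped in front of
# a DenseNet just as easily as a ResNet, which is part of why these
# tricks should transfer.
deep_stem = nn.Sequential(
    conv_bn_relu(3, 32, stride=2),
    conv_bn_relu(32, 32),
    conv_bn_relu(32, 64),
)

x = torch.randn(1, 3, 224, 224)
print(deep_stem(x).shape)  # torch.Size([1, 64, 112, 112])
```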


On 2, I know they are mentioned (and explained) in the new book they are writing :slight_smile:


Also on 1, I’m fairly certain that if you show it has good results/efficiency with the framework, they’ll try to let you include it (like Cadene’s models in v1). I’m trying to do something similar with NODE right now.

(Jeremy or Sylvain can chime in if it’s different :slight_smile: )

Absolutely! Better still if it can be integrated into a single class, or at least refactored so that resnets and densenets share code where it makes sense.

At some point, when we have something we’re happy with, we should certainly document it. @david_page pointed out to me this paper:

…which suggests there are more things we need to add to make our model match current best practices!

I’d love to get help integrating these ideas into fastai2 and documenting them.


Also, @rwightman has been doing interesting experiments, the results of which would be useful to add in. Then we should create a training script and train a few different sizes and variations, and get them hosted on torch hub.


Wow that paper is super impressive! Would love to help out on implementation wherever I can help, even on documentation


They share layers, but not blocks.
I’ll start by sharing my git repo with a notebook here this week.


I’ve been poking around with Selective Kernels this weekend. I wasn’t aware of them before reading the ‘Compounding’ paper. Interestingly, they only partially implemented SK: rather than using different kernel sizes or different dilation rates between the paths, as the original paper explored, they kept it simple with two 3x3 paths (implemented as one conv with 2x output channels) plus attention.
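A rough sketch of that simplified SK variant as I read it, two identical 3x3 branches from a single doubled-width conv, fused by a softmax attention over the branches. The class name, reduction ratio, and layer layout are my own, not from either paper’s code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedSK(nn.Module):
    """Simplified Selective Kernel as described above: two identical 3x3
    paths implemented as one conv with 2x output channels, fused by a
    softmax attention over the two branches."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channels = channels
        # one conv producing both "branches" at once
        self.conv = nn.Conv2d(channels, channels * 2, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels * 2)
        mid = max(channels // reduction, 8)
        # fuse -> squeeze -> per-branch attention logits
        self.fc = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels * 2, 1),
        )

    def forward(self, x):
        b = x.shape[0]
        # (B, 2, C, H, W): the two 3x3 branches
        feats = F.relu(self.bn(self.conv(x))).view(b, 2, self.channels, *x.shape[2:])
        fused = feats.sum(dim=1)                       # element-wise fuse
        pooled = fused.mean(dim=(2, 3), keepdim=True)  # global avg pool
        attn = self.fc(pooled).view(b, 2, self.channels, 1, 1)
        attn = torch.softmax(attn, dim=1)              # select between branches
        return (feats * attn).sum(dim=1)

x = torch.randn(2, 64, 32, 32)
print(SimplifiedSK(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

The full SK from the original paper would give each branch a different kernel size or dilation rate before the same attention fusion.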

SK with differing kernels is quite close to the MixedConvs from MixNet (minus the attention part).

A lot of these ideas can certainly be applied to various base archs … ResNet, Dense, DualPath, MBConv, etc.

I’m going to adapt a DropBlock impl next; I found a nice TF one which looks better than any of the PyTorch attempts I’ve seen so far…
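For anyone unfamiliar, DropBlock zeroes out contiguous spatial regions instead of individual activations, so nearby units can’t trivially compensate. A minimal sketch of the idea (this is not the TF implementation mentioned above, and the seed-expansion-by-max-pool trick is just one common way to do it):

```python
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7, training=True):
    """Minimal DropBlock sketch: zero out contiguous block_size x block_size
    spatial regions instead of single units."""
    if not training or drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # gamma: per-position seed probability chosen so the expected fraction
    # of dropped units is roughly drop_prob
    gamma = (drop_prob / block_size ** 2) * (h * w) / (
        (h - block_size + 1) * (w - block_size + 1))
    # sample block centers, then expand each seed into a block via max-pool
    seeds = (torch.rand_like(x) < gamma).float()
    mask = F.max_pool2d(seeds, kernel_size=block_size, stride=1,
                        padding=block_size // 2)
    keep = 1.0 - mask.clamp(max=1.0)
    # rescale so activation magnitudes are preserved in expectation
    return x * keep * keep.numel() / keep.sum().clamp(min=1.0)

x = torch.randn(2, 8, 28, 28)
print(drop_block(x, drop_prob=0.1, block_size=7).shape)  # torch.Size([2, 8, 28, 28])
```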


So, I’ve been hacking around with some of these ideas. I have experiments running with SelectiveKernels (with some added configurability) and DropBlock.

When setting up ‘ResNet50+SK’ and ‘ResNet50-D+SK’ models as per the ‘Compounding…’ paper, I noticed something: these models are almost as large as a ResNet101! Stating that ‘ResNet50’ hits 82.8 is thus a bit of a stretch. The SK, as they applied it, more than doubles the parameter count of all the 3x3 bottleneck convolutions, increasing total params by ~50% to approx 37-38M (from 25.5M). The last experiment in their paper that bears any resemblance to a ResNet50 is E5, an SE-ResNet50-D at 80.4. Also, a plain ResNet101 has higher throughput than the model with SK applied…

In my impl, I’m experimenting with some different splits / concat vs sum, etc of the channels through the SK conv/attention blocks, trying to see if accuracy can be improved without increasing param count (or possibly lowering it).


In the conclusion of the ‘Compounding…’ paper they mention ‘efficient channel attention’ from ECA-Net (a PyTorch implementation is on GitHub).

Maybe that can help to improve the parameter count and throughput? (I had no time yet to look into it, but the ECA-Net paper looks interesting.)


Looks like a nice lightweight attention option to try out; basically a plug-and-play replacement for an SE module.
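To make the “lightweight” part concrete, here’s a sketch of the ECA idea as I understand it from the paper, not their released code: like SE, but the channel interaction is a single 1-D conv of kernel size k across the pooled channel vector, so the whole module has only k parameters and no FC bottleneck:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention sketch: global average pool, then a
    k-wide 1-D conv across channels, then sigmoid gating. Only k params."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        # (B, C, H, W) -> (B, C) -> (B, 1, C) so the 1-D conv runs over channels
        y = x.mean(dim=(2, 3)).unsqueeze(1)
        # gate per channel: (B, 1, C) -> (B, C, 1, 1)
        y = torch.sigmoid(self.conv(y)).transpose(1, 2).unsqueeze(-1)
        return x * y

m = ECA(k=3)
print(m(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```

Compare that to an SE block, whose two FC layers cost on the order of 2C²/r parameters per instance; that difference is what should help the param count issue raised above.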