Imagenette/Imagewoof Leaderboards

Hi all, I’m excited to join in, even though I don’t quite have a better result yet.

As I had trouble with fastai2 (can’t find Mish module), I took @LessW2020 's repo from 6 months ago, but I couldn’t get 75% (Imagewoof, 5 epochs, 5 runs). I was only getting 67 or 68’ish. Maybe fastai had an update or the dataset wasn’t the same, per @pete88b ? In any case, I was able to get a 1% improvement on that by replacing the 3x3 Conv layer with something a little more complicated (but mathematically well-motivated). I wonder if anyone could add the couple of lines (TwistLayer) in my repo to your 75% performing model and see how it fares. Thanks all (especially to Jeremy for making it all possible.)

(Apologies for the poor code, and poor PyTorch practice. It’s certainly a waste to have 2 extra full-scale conv2d weights. I hope it may be a little easier to understand what’s going on, and to experiment with.)

Welcome! Cool to see your results! Nice Job :slight_smile: Yes, the dataset was changed to make it a bit harder, and so the leaderboard percentages were adjusted (larger validation set)

Is that why the dataset is called imagewoof2? I’m confused what the current leaderboard is based on.

I made an 80 epoch run (still size=128) and got 87.27 at the end (highest 87.42 at second to last epoch), which is slightly higher than the current record of 87.20. Hooray!

Yes it is :slight_smile: and it’s what the current leaderboard is based on. The baselines were run with the original setup, which we found gives a fair comparison.

I’ve made a start by using twist in some of the “standard” models using fastai v2 dev:
hope it helps

Thanks, @pete88b. As I mentioned, there’s a lot of waste in parameters.

If you are testing on ResNeXt, did you give each conv2d a groups argument? (That’s as much as I understand ResNeXt)

Briefly, I’m adding two extra conv2d (that I call convx and convy), but you can see that I “symmetrized” the weights, so instead of 9 parameters for each filter/channel/feature-map, it’s only 4 in effect. I also copied the convx weights into convy so the entire convy is extraneous. Of course once we have tested various possibilities we could write the TwistLayer more efficiently.
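
A minimal NumPy sketch of why the count drops from 9 to 4, assuming the “symmetrization” is W -> (W - rot180(W)) / 2 as in the conv_twist code posted later in this thread. The random kernel and the variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
k = rng.normal(size=(3, 3))                 # one arbitrary 3x3 kernel

# Flipping both spatial axes is a 180 degree rotation,
# the NumPy analogue of weight.flip(2).flip(3) on a conv weight.
k_odd = (k - np.flip(k, axis=(0, 1))) / 2

# The result is odd under the rotation: k_odd[i, j] == -k_odd[2-i, 2-j].
# So the center is zero and the other 8 entries come in 4 opposite pairs.
free = k_odd.flatten()[:4]                  # entries (0,0), (0,1), (0,2), (1,0)
rebuilt = np.concatenate([free, [0.0], -free[::-1]]).reshape(3, 3)

print(np.allclose(rebuilt, k_odd))          # the 4 free entries determine the kernel
print(abs(k_odd.sum()) < 1e-12)             # weights sum to zero: constants map to zero
```

The zero sum is one way to read “first-order operator”: like a finite-difference derivative, such a kernel gives no response to a constant input.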

You asked where you can learn about TwistLayer. It’s related to the Neural ODE paper (and others) that interprets ResNet as solving a differential equation. I wrote about the mathematics here

but at the time I didn’t actually know ResNet (even now I know very little beyond ResNet) and I should do a complete rewrite.

I can open up a separate thread to answer questions. [Update: new thread here]


@liuyao that sounds very interesting. Since you’re decreasing the param count, you should get the best benefits with more epochs (so try 200) and less regularization (so try less mixup and larger random resize crop area).


Might not have followed the developments in this thread correctly, but can somebody briefly explain to me what’s a Twist Layer?

I’ve simplified it a bit and it seems to be doing better (I’ll update with results). Here’s the conv_twist layer, replacing each 3x3 convolution. I don’t know if I can explain more briefly than the code:

import numpy as np
import torch
import torch.nn as nn

class conv_twist(nn.Module):  # replacing 3x3 Conv2d
    def __init__(self, ni, nf, stride=1):
        super(conv_twist, self).__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        self.convx = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        self.convy = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        # start convx as an odd (antisymmetric) kernel and convy as its 90 degree rotation
        self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2
        self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)
        # self.radii = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        self.center_x = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        self.center_y = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        # self.radii.data.uniform_(..., 0.7)
        # one twist center per output channel, drawn uniformly from [-0.7, 0.7]
        self.center_x.data.uniform_(-0.7, 0.7)
        self.center_y.data.uniform_(-0.7, 0.7)

    def forward(self, x):
        self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2  # make convx a first-order operator by symmetrizing it
        self.convy.weight.data = (self.convy.weight - self.convy.weight.flip(2).flip(3)) / 2
        # self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)  # make convy a 90 degree rotation of convx
        x1 = self.conv(x)
        _, c, h, w = x1.size()
        # coordinate grids of shape (1,h,w), shifted per channel by the learned centers -> (nf,h,w)
        XX = torch.from_numpy(np.indices((1,h,w))[2]*2/w).type(x.dtype).to(x.device) - self.center_x.view(-1,1,1)
        YY = torch.from_numpy(np.indices((1,h,w))[1]*2/h).type(x.dtype).to(x.device) - self.center_y.view(-1,1,1)
        # mask = ramp_func((XX**2+YY**2)/(self.radii.type(x.dtype).to(x.device).view(-1,1,1)**2))
        return x1 + (XX * self.convx(x) + YY * self.convy(x)) # * mask
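
The coordinate grids in forward can be checked in isolation. Below is a small NumPy sketch of just that step; the feature-map size and the center values are made-up stand-ins for the learned center_x parameters, not values from the post:

```python
import numpy as np

h, w = 4, 8                                  # hypothetical feature-map height and width
center_x = np.array([-0.7, 0.0, 0.7])        # stand-in for 3 learned per-channel centers

# np.indices((1, h, w)) has shape (3, 1, h, w); index 2 is the column index,
# index 1 is the row index. Scaling by 2/w and 2/h maps them into [0, 2).
xx = np.indices((1, h, w))[2] * 2 / w        # shape (1, h, w)
yy = np.indices((1, h, w))[1] * 2 / h        # shape (1, h, w)

# Subtracting a (nf, 1, 1) array broadcasts to one shifted grid per channel.
XX = xx - center_x.reshape(-1, 1, 1)         # shape (nf, h, w)

print(XX.shape)
print(xx.min(), xx.max(), yy.min(), yy.max())
```

As posted, the unshifted grids run over [0, 2) rather than [-1, 1), so a center value of 1 would sit at the middle of the feature map.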

Update: imagewoof2

| Size (px) | Epochs | model | mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | (Leaderboard) | | 73.37% | 5, mean |
| 128 | 5 | RMS | 0 | 68.54% | 5, mean |
| 128 | 5 | RMS + twist | 0 | 70.95% | 5, mean |
| 128 | 20 | (Leaderboard) | | 85.52% | 5, mean |
| 128 | 20 | RMS | 0 | 84.62% | 5, mean |
| 128 | 20 | RMS + twist | 0 | 85.24% | 5, mean |
| 128 | 80 | (Leaderboard) | | 87.20% | 1 |
| 128 | 80 | RMS + twist | 0.2 | 87.81% | 1 |
| 128 | 80 | RMS + twist | 0.5 | 88.52% | 1 |
| 128 | 200 | (Leaderboard) | | 87.20% | 1 |
| 128 | 200 | RMS + twist | 0.2 | 88.70% | 1 |
| 256 | 200 | (Leaderboard) | | 90.38% | 1 |
| 256 | 200 | RMS + twist | 0.2 | 91.52% | 1 |


| Size (px) | Epochs | model | mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 256 | 200 | (Leaderboard) | | 95.11% | 1 |
| 256 | 200 | RMS + twist | 0.5 | 95.87% | 1 |

@a_yasyrev, if you could help test with your ResNet trick + MaxBlurPool, that would be very nice.

Any literature references for this?

Not that I know of. As I mentioned above, the initial observation in Neural ODE paper (and probably others) is related, but I don’t know about this particular implementation.

Maybe I can write about it in the fastpages blog :slight_smile:


I inadvertently replaced all the 1x1 convolutions (in addition to the 3x3’s) by conv_twist, and it bumped the acc up…

The resnet50 uses 1x1 in the bottleneck block (1x1 3x3 1x1), as well as some of the skip connections.

I don’t know what to make of it.

(Update: It probably only happens with 5 epochs… Ok, not a fair comparison. Still, that it doesn’t just break down is a surprise.)


Found time to make long runs.
I added nbdev to the repo with the notebooks, so you can now find all results and links to the notebooks on the ‘doc’ page.
At size 128 the improvements are small, but at 192 and 256 they are decent.
size 128:
80 eps - 87.63% (now 87.20%)
200 eps - 88.30% (87.20%)
size 192:
80 eps - 89.69% (89.21%)
200 eps - 90.35% (89.54%)
size 256:
80 eps - 90.63% (90.48%)
200 eps - 91.14% (90.38%)
Did 3-4 runs.
Worth mentioning that I used start_pst 0.4-0.2 for the long runs.


I have started testing the twist layer.
Let’s move this discussion to a separate thread!

I think we need more than a single leaderboard with only the best score. It may be a good idea to have another page with a list, so that we can compare results not only with the best but also across different models or settings. For example, we could compare against the default xresnet or resnet rather than against the best.
We could have a template for submissions, with results at different sizes, links to notebooks, etc. We could submit results to some folder, then a script builds the list, and the best can be moved to the main leaderboard.

Sorry for using this thread as a lab notebook… but today I was trying to tweak the “hyperparameter” (the 0.7 in initializing parameters with uniform_(-0.7,0.7) in conv_twist) and the 5-epoch accuracy went further up as I increased it. The way I had in mind is that (center_x, center_y) is the point around which the conv_twist is “twisting” the image, which in this scale is situated at the [-1,1]x[-1,1] square.

turning it up to 1.5 (so the center can be outside the image)

[0.741156 0.73632  0.744973 0.751591 0.72563 ]
0.7399338   0.008722797

at 2.0,

[0.757954 0.747264 0.748282 0.750064 0.75719 ]
0.75215065   0.004522691

at 2.5

[0.759735 0.755663 0.75999  0.753372 0.761262]
0.75800455   0.0029829554
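
The two numbers under each list match the mean and the population standard deviation (NumPy’s default, ddof=0) of the five runs. A quick check using the accuracies above; the dictionary layout is just for illustration:

```python
import numpy as np

# 5-epoch accuracies copied from the lists above, keyed by the init range.
runs = {
    1.5: [0.741156, 0.73632, 0.744973, 0.751591, 0.72563],
    2.0: [0.757954, 0.747264, 0.748282, 0.750064, 0.75719],
    2.5: [0.759735, 0.755663, 0.75999, 0.753372, 0.761262],
}

# np.std defaults to ddof=0, i.e. the population standard deviation.
summary = {r: (float(np.mean(v)), float(np.std(v))) for r, v in runs.items()}

for r, (mean, std) in summary.items():
    print(f"range {r}: mean={mean:.4f} std={std:.4f}")
```

With only five runs per setting, the sample standard deviation (ddof=1) would come out somewhat larger, which is worth keeping in mind when comparing settings that differ by less than a percent.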

(I’m not sure we can still call this ResNet, as it is using 3x3 exclusively.)

I will start a new thread [Update: new thread here]. This reminds me of the PolyMath projects (of Terry Tao and others), a series of online collaborative projects in research mathematics in recent years. Participants would post small comments to a blog post describing the problem, and once in a while (when comments reach the hundreds) the host would start a new post, summarizing what they have learned, and discussion would take off from there. A couple of success stories. I’m new to fastai, and it is probably what’s happening here too.


Very interesting!
I did a short test with Twist, only at size 128, for 5 and 20 epochs. Looks very good!
5 (3 runs) - 75.13%
20 (2 runs) - 86.39%
Here is the notebook -

Right now I am trying to finalize a couple of experiments with new tricks. I hope to post about them tomorrow (2 days at most). I also tried them with Twist and have some findings: the results are VERY promising!
So, on 5 epochs, 76-76.5%. Need to do more runs and checks… Stay tuned! :sunglasses:


I stumbled over this publication, which (somehow) reminded me of your twist layer:


TL;DR: The Ghost module creates additional feature maps from a conv module but with “cheaper” operations.

I’ve been playing around trying to beat the Imagewoof leaderboards and ran into something odd regarding the activations in ResNet. Figured I’d share what I have so far.

TL;DR: Using DoubleMish instead of Mish after the residual connection improves accuracy by 1%. Other tweaks to the activation functions improve accuracy by 2%. At least for 128px, 5 epoch, XResNet-18.

Basically I was looking at the default XResNet-18 architecture and found that the way activations are applied along the residual path is inconsistent in scaling blocks versus non-scaling blocks. In a non-scaling block the residual connection adds non-activated values to activated values. Whereas in a scaling block both are non-activated because the residual side passes through a pooling layer, conv, and batchnorm first.

So I was messing around trying to fix that and noticed that in general adding activations along the residual paths tended to improve accuracy. I tried something crazy, went back to default XResNet-18, but with DoubleMish instead of Mish after every residual connect (i.e. here: return self.act(self.convpath(x) + self.idpath(x))). DoubleMish is just mish(mish(x)).
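
For readers who want to try this, here is a minimal PyTorch sketch of DoubleMish applied after the residual sum. Only the definition mish(mish(x)) and the `self.act(self.convpath(x) + self.idpath(x))` line come from the post; `TinyResBlock` is a hypothetical stand-in, not the fastai XResNet block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

class DoubleMish(nn.Module):
    # DoubleMish as described in the post: Mish applied twice
    def forward(self, x):
        return mish(mish(x))

class TinyResBlock(nn.Module):
    # Hypothetical simplified residual block, for illustration only
    def __init__(self, ch, act):
        super().__init__()
        self.convpath = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
        )
        self.idpath = nn.Identity()
        self.act = act  # activation applied after the residual sum

    def forward(self, x):
        return self.act(self.convpath(x) + self.idpath(x))

x = torch.randn(2, 8, 16, 16)
out = TinyResBlock(8, DoubleMish())(x)
print(out.shape)
```

Passing the activation in as a module makes it easy to swap between Mish and DoubleMish at this one location while leaving the rest of the block untouched.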

Testing against Imagewoof, 128px, 5 epochs, 20 runs each, I saw a mean accuracy of 67.9% for default XResNet18 and 69.1% with the above tweak.

I’ve since rigged up a custom XResNet18 model where the activation functions can be specified in a variety of places. Running evolutionary search using that I’ve gotten mean accuracy up to 69.9%. The discovered configuration for that adds Mish after the stem, adds DoubleMish before the final AdaptiveAvgPool, adds DoubleMish before the convolutional path in each resblock, adds Mish before the pooling along idpath, adds Mish after the convolution in idpath, and uses DoubleMish instead of Mish after residual connections.

Bit of a laundry list of changes. I’m still experimenting to see which of those are important. The DoubleMish after residual connection alone is +1% so that’s at least significant. Using DoubleMish instead of Mish everywhere does not provide a benefit. So it’s only useful in key locations. I find that quite odd.

Some things to note:

I was never able to match the accuracies on the leaderboard, even when I attempted to replicate the same configurations. So I just run all my experiments referenced to the default Imagenette example code. XResNet18 is the simplest to mess with. All of this is at 128px, 5 epochs, 20 runs each. The evolutionary algorithm uses None, Mish, or DoubleMish in all the configurable activation locations. Ranger optimizer with fit_flat_cos. 64 batch size. 1e-2 lr for all runs.

Obviously this might be overfitting the hyperparameters to Imagewoof. I figure after all my experimentation is done I can try the resulting architecture against other datasets to see if the improvement is consistent.

Since I’m not varying LR it’s possible that plays a role in the differences, but Ranger tends to be forgiving. Again, I can always fine-tune LR once I have an architecture I want to validate.

Given the limited training it’s possible this doesn’t scale to higher epochs. I’ve only got a measly 970 to experiment on right now so I’m working with what I’ve got :stuck_out_tongue:

Here’s a gist with the core bits of the code: I’ll try to release a more complete notebook once I’ve finished experimenting. Right now the experimental notebook is … a nightmare of cells.

Here’s a CSV of all the experimental runs thus far:

That’s all I’ve got for now. Just really strange that more aggressive activation specifically along the residual path leads to improved accuracy. Though Mish itself is quite surprising. It’s so similar to Swish and yet whatever subtle difference is there is significant. So maybe it’s not so odd that DoubleMish is useful in certain places.


I seem to have found a simple way to get consistent good results across the board. (It has nothing to do with what I posted above, and I’ll see if I can incorporate that in.) In fastai’s implementation of the Bottleneck block in ResNet50, change the middle one (3x3) as follows:

ConvLayer(ni, nh, 1, act_fn=act_fn, bn_1st=bn_1st),
ConvLayer(nh, nh*4, 3, groups=nh, act=False, bn_1st=bn_1st),
ConvLayer(nh*4, nf, 1, zero_bn=zero_bn, act=False, bn_1st=bn_1st)

This seems to be what MobileNet etc. did with so-called “depthwise separable convolution” with a depth multiplier=4. Not sure why it isn’t more widely used.

[Actually, looking at fastaiv2’s source code it seems that dw=True would be doing just that.]
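
To make the middle layer concrete, here is a sketch in plain PyTorch rather than fastai’s ConvLayer (no batchnorm or activation), comparing a standard 3x3 convolution with a depthwise one that has a depth multiplier of 4. The width nh=64 and the input size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

nh = 64  # hypothetical bottleneck width

# Standard 3x3 conv: every output channel sees every input channel.
standard = nn.Conv2d(nh, nh, 3, padding=1, bias=False)

# Depthwise 3x3 conv with depth multiplier 4: groups=nh means each input
# channel gets its own 4 filters, so there are nh*4 output channels.
depthwise_x4 = nn.Conv2d(nh, nh * 4, 3, padding=1, groups=nh, bias=False)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(2, nh, 8, 8)
y = depthwise_x4(x)

print(y.shape)                                   # channel count is multiplied by 4
print(n_params(standard), n_params(depthwise_x4))
```

The depthwise layer has far fewer weights than the standard one despite producing four times as many channels; the cross-channel mixing is left to the 1x1 convolution that follows it.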

I’ll fill in the results on ImageWoof as they come in, compared with the current leaderboard (though I know many of you have better results but didn’t PR.)

| Size (px) | Epochs | URL (as of 7/7) | acc (as of 7/7) | new | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | fastai2 2020-01 + MaxBlurPool | 73.37% | 77.83 (81.89) | 5 |
| 128 | 20 | fastai2 2020-01 + MaxBlurPool | 85.52% | 86.62 (88.77) | 5 |
| 128 | 80 | fastai2 2020-01 | 87.20% | 88.13 (90.22) | 1 |
| 128 | 200 | fastai2 2020-01 | 87.20% | (90.71) | 1 |
| 192 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 77.87% | 81.15 (81.76) | 5 |
| 192 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 87.85% | 88.37 | 5 |
| 192 | 80 | fastai2 2020-01 | 89.21% | 89.89 (91.44) | 1 |
| 192 | 200 | fastai2 2020-01 | 89.54% | 90.32 | 1 |
| 256 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 78.84% | 82.33 | 5 |
| 256 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 88.58% | 89.53 | 5 |
| 256 | 80 | fastai2 2020-01 | 90.48% | 90.93 | 1 |
| 256 | 200 | fastai2 2020-01 | 90.38% | | 1 |

In parentheses are the results when applying depthwise (x4) to the stem as well, sticking with the [3,32,64,64] that fastai has optimized. Unfortunately it slows down the training considerably (with batch size=16). There must be ways to do better, timewise.

You may run the code directly here (thanks to @a_yasyrev) or implement it easily in your own network (ResNet or not)