Imagenette/Imagewoof Leaderboards

I inadvertently replaced all the 1x1 convolutions (in addition to the 3x3’s) by conv_twist, and it bumped the acc up…

ResNet50 uses 1x1 convolutions in the bottleneck block (1x1, 3x3, 1x1), as well as in some of the skip connections.

I don’t know what to make of it.

(Update: It probably only happens with 5 epochs… Ok, not a fair comparison. Still, that it doesn’t just break down is a surprise.)


I found time to do some long runs.
I added nbdev to the repo with the notebooks, so you can now find all results and links to the notebooks on the ‘doc’ page.
At size 128 the improvements are small, but at 192 and 256 they are decent.
Results (current leaderboard score in parentheses):
size 128:
80 eps - 87.63% (87.20%)
200 eps - 88.30% (87.20%)
size 192:
80 eps - 89.69% (89.21%)
200 eps - 90.35% (89.54%)
size 256:
80 eps - 90.63% (90.48%)
200 eps - 91.14% (90.38%)
I did 3-4 runs.
Worth mentioning that I used start_pst 0.4-0.2 for the long runs.


I have started testing the twist layer.
Let’s move this discussion to a separate thread!

I think we need more than a single leaderboard with only the best score. It might be a good idea to have another page with a list of results, so that we can compare not only against the best entry but also against different models or settings, for example against the default xresnet or resnet.
We could have a template for submissions, with results at different sizes, links to notebooks, etc. Results could be submitted to a folder, a script would then build the list, and the best entries could be moved to the main leaderboard.
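To make the proposal concrete, here is a minimal sketch of what such a script could do. The record fields and entry names are hypothetical, invented for illustration; the accuracies are the 80-epoch numbers quoted earlier in this thread.

```python
from collections import defaultdict

# Hypothetical submission records; field names and entry names are illustrative only.
submissions = [
    {"name": "baseline", "size": 128, "epochs": 80, "acc": 87.20},
    {"name": "long-run", "size": 128, "epochs": 80, "acc": 87.63},
    {"name": "baseline", "size": 192, "epochs": 80, "acc": 89.21},
    {"name": "long-run", "size": 192, "epochs": 80, "acc": 89.69},
]

def build_tables(records):
    """Group records by (size, epochs) and sort each group by accuracy, best first."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["size"], rec["epochs"])].append(rec)
    return {key: sorted(recs, key=lambda r: r["acc"], reverse=True)
            for key, recs in groups.items()}

def main_leaderboard(tables):
    """Keep only the top entry of each group for the main leaderboard."""
    return {key: recs[0] for key, recs in tables.items()}

tables = build_tables(submissions)   # the full list, for comparing against any entry
best = main_leaderboard(tables)      # the short list, best entry per setting
for (size, epochs), rec in sorted(best.items()):
    print(f"{size}px, {epochs} epochs: {rec['name']} {rec['acc']:.2f}%")
```

The full `tables` dictionary keeps every entry, so a new result can be compared against a chosen baseline rather than only against the current best.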

Sorry for using this thread as a lab notebook… but today I was trying to tweak the “hyperparameter” (the 0.7 in initializing parameters with uniform_(-0.7,0.7) in conv_twist) and the 5-epoch accuracy went up further as I increased it. The picture I have in mind is that (center_x, center_y) is the point around which conv_twist is “twisting” the image, and at this scale the image occupies the [-1,1]x[-1,1] square.
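A rough way to see what widening the range does to the initial centers is to ask how often a center lands outside the image square. This is only a sketch and it assumes that center_x and center_y are drawn independently from the same uniform range; the helper name is made up for illustration and is not part of conv_twist.

```python
def fraction_outside(r: float) -> float:
    """Probability that a center drawn uniformly from [-r, r]^2 lies outside [-1, 1]^2.

    Assumes center_x and center_y are sampled independently (an assumption).
    """
    if r <= 0:
        raise ValueError("r must be positive")
    inside_1d = min(1.0, 1.0 / r)   # chance that one coordinate falls in [-1, 1]
    return 1.0 - inside_1d ** 2     # the center is inside only if both coordinates are

for r in (0.7, 1.5, 2.0, 2.5):
    print(f"range +/-{r}: {fraction_outside(r):.0%} of initial centers outside the image")
```

Under that assumption no center starts outside the image at 0.7, while at 2.0 three quarters of them do.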

Turning it up to 1.5 (so the center can be outside the image):

[0.741156 0.73632  0.744973 0.751591 0.72563 ]
mean 0.7399338, std 0.008722797

At 2.0:

[0.757954 0.747264 0.748282 0.750064 0.75719 ]
mean 0.75215065, std 0.004522691

At 2.5:

[0.759735 0.755663 0.75999  0.753372 0.761262]
mean 0.75800455, std 0.0029829554
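For anyone who wants to check the summary numbers, the two values under each list of runs are the mean and the population standard deviation of the five accuracies. A small sketch using only the standard library (the original values look like float32 output, so the last digits may differ slightly):

```python
from statistics import fmean, pstdev

# Five-run accuracies copied from the post, keyed by the uniform_ range used.
runs = {
    1.5: [0.741156, 0.73632, 0.744973, 0.751591, 0.72563],
    2.0: [0.757954, 0.747264, 0.748282, 0.750064, 0.75719],
    2.5: [0.759735, 0.755663, 0.75999, 0.753372, 0.761262],
}

# pstdev is the population standard deviation (divides by n, not n - 1).
summary = {r: (fmean(accs), pstdev(accs)) for r, accs in runs.items()}

for r, (mean, std) in summary.items():
    print(f"range {r}: mean {mean:.4f}, std {std:.4f}")
```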

(I’m not sure we can still call this ResNet, as it is using 3x3 exclusively.)

I will start a new thread [Update: new thread here]. This reminds me of the PolyMath projects (of Terry Tao and others), a series of online collaborative projects in research mathematics in recent years. Participants would post small comments to a blog post describing the problem, and once in a while (when comments reach the hundreds) the host would start a new post, summarizing what they have learned, and discussion would take off from there. There have been a couple of success stories. I’m new to fastai, and that is probably what’s happening here too.


Very interesting!
I did a short test with Twist, only at size 128, for 5 and 20 epochs. It looks very good!
5 epochs (3 runs) - 75.13%
20 epochs (2 runs) - 86.39%
Here is the notebook -

Right now I am trying to finalize a couple of experiments with new tricks. I hope to post about them tomorrow (2 days at most). I also tried them with Twist and have some findings - the results are VERY promising!
So, on 5 epochs: 76-76.5%. I need to do more runs and checks… Stay tuned! :sunglasses:


I stumbled over this publication, which (somehow) reminded me of your twist layer:


TL;DR: The Ghost module creates additional feature maps from a conv module but with “cheaper” operations.

I’ve been playing around trying to beat the Imagewoof leaderboards and ran into something odd regarding the activations in ResNet. Figured I’d share what I have so far.

TL;DR: Using DoubleMish instead of Mish after the residual connection improves accuracy by about 1%. Further tweaks to the activation functions bring the total improvement to about 2%. At least for 128px, 5 epochs, XResNet-18.

Basically I was looking at the default XResNet-18 architecture and found that the way activations are applied along the residual path is inconsistent in scaling blocks versus non-scaling blocks. In a non-scaling block the residual connection adds non-activated values to activated values. Whereas in a scaling block both are non-activated because the residual side passes through a pooling layer, conv, and batchnorm first.

So I was messing around trying to fix that and noticed that in general adding activations along the residual paths tended to improve accuracy. I tried something crazy, went back to default XResNet-18, but with DoubleMish instead of Mish after every residual connection (i.e. here: return self.act(self.convpath(x) + self.idpath(x))). DoubleMish is just mish(mish(x)).
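For reference, a scalar sketch of the two functions in plain Python, using the usual definition mish(x) = x * tanh(softplus(x)). This is for illustration only: in a real network the same formula would be applied elementwise to tensors, and the function names here are not taken from the gist.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def double_mish(x: float) -> float:
    """Mish applied twice, as described in the post."""
    return mish(mish(x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  mish={mish(x):+.4f}  double_mish={double_mish(x):+.4f}")
```

For positive inputs tanh(softplus(x)) is below 1, so each application shrinks the value a little, and DoubleMish sits slightly below Mish there.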

Testing against Imagewoof, 128px, 5 epochs, 20 runs each, I saw a mean accuracy of 67.9% for default XResNet18 and 69.1% with the above tweak.

I’ve since rigged up a custom XResNet18 model where the activation functions can be specified in a variety of places. Running evolutionary search using that I’ve gotten mean accuracy up to 69.9%. The discovered configuration for that adds Mish after the stem, adds DoubleMish before the final AdaptiveAvgPool, adds DoubleMish before the convolutional path in each resblock, adds Mish before the pooling along idpath, adds Mish after the convolution in idpath, and uses DoubleMish instead of Mish after residual connections.

Bit of a laundry list of changes. I’m still experimenting to see which of those are important. The DoubleMish after residual connection alone is +1% so that’s at least significant. Using DoubleMish instead of Mish everywhere does not provide a benefit. So it’s only useful in key locations. I find that quite odd.

Some things to note:

I was never able to match the accuracies on the leaderboard, even when I attempted to replicate the same configurations. So I just run all my experiments referenced to the default Imagenette example code. XResNet18 is the simplest to mess with. All of this is at 128px, 5 epochs, 20 runs each. The evolutionary algorithm uses None, Mish, or DoubleMish in all the configurable activation locations. Ranger optimizer with fit_flat_cos. 64 batch size. 1e-2 lr for all runs.
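To give a feel for the size of the search, here is a sketch of the configuration space and a single mutation step. The location labels are hypothetical names for the six places mentioned above, and the mutation rule is an assumption made for illustration; the actual evolutionary algorithm is not described in this post. The fitness of a configuration would be the mean accuracy over training runs, which is not modelled here.

```python
import random
from itertools import product

# Hypothetical labels for the activation locations listed in the post.
LOCATIONS = [
    "after_stem",
    "before_final_pool",
    "before_convpath",
    "idpath_before_pool",
    "idpath_after_conv",
    "after_residual",
]
CHOICES = [None, "Mish", "DoubleMish"]

# Every possible assignment of a choice to each location.
all_configs = [dict(zip(LOCATIONS, combo))
               for combo in product(CHOICES, repeat=len(LOCATIONS))]

# The discovered configuration from the post, written with the labels above.
found = {
    "after_stem": "Mish",
    "before_final_pool": "DoubleMish",
    "before_convpath": "DoubleMish",
    "idpath_before_pool": "Mish",
    "idpath_after_conv": "Mish",
    "after_residual": "DoubleMish",
}

def mutate(config, rng):
    """Return a copy of config with one location switched to a different choice."""
    child = dict(config)
    loc = rng.choice(LOCATIONS)
    child[loc] = rng.choice([c for c in CHOICES if c != config[loc]])
    return child

print(len(all_configs), "possible configurations")
print(mutate(found, random.Random(0)))
```

With six locations and three choices each there are 3^6 = 729 configurations, which at 20 runs each is too many to evaluate exhaustively on a single GPU.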

Obviously this might be overfitting the hyperparameters to Imagewoof. I figure after all my experimentation is done I can try the resulting architecture against other datasets to see if the improvement is consistent.

Since I’m not varying LR it’s possible that plays a role in the differences, but Ranger tends to be forgiving. Again, I can always fine-tune LR once I have an architecture I want to validate.

Given the limited training it’s possible this doesn’t scale to higher epochs. I’ve only got a measly 970 to experiment on right now so I’m working with what I’ve got :stuck_out_tongue:

Here’s a gist with the core bits of the code: I’ll try to release a more complete notebook once I’ve finished experimenting. Right now the experimental notebook is … a nightmare of cells.

Here’s a CSV of all the experimental runs thus far:

That’s all I’ve got for now. Just really strange that more aggressive activation specifically along the residual path leads to improved accuracy. Though Mish itself is quite surprising. It’s so similar to Swish and yet whatever subtle difference is there is significant. So maybe it’s not so odd that DoubleMish is useful in certain places.


I seem to have found a simple way to get consistently good results across the board. (It has nothing to do with what I posted above, and I’ll see if I can incorporate that in.) In fastai’s implementation of the Bottleneck block in ResNet50, change the middle layer (3x3) as follows:

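ConvLayer(ni, nh, 1, act_fn=act_fn, bn_1st=bn_1st),  # 1x1: reduce to nh channels
ConvLayer(nh, nh*4, 3, groups=nh, act=False, bn_1st=bn_1st),  # 3x3 grouped: 4 output channels per input channel
ConvLayer(nh*4, nf, 1, zero_bn=zero_bn, act=False, bn_1st=bn_1st)  # 1x1: project to nf channels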

This seems to be what MobileNet etc. did with so-called “depthwise separable convolution” with a depth multiplier=4. Not sure why it isn’t more widely used.

[Actually, looking at fastaiv2’s source code it seems that dw=True would be doing just that.]
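To see what the change does to the weight count, here is a small sketch that counts convolution weights (no bias, no batchnorm) for one bottleneck block before and after the change. The widths ni=256, nh=64, nf=256 are assumed for illustration; they match the first stage of a standard ResNet50 but are not taken from the notebook.

```python
def conv_params(ni: int, nf: int, ks: int, groups: int = 1) -> int:
    """Number of weights in a 2D convolution without bias."""
    assert ni % groups == 0 and nf % groups == 0
    return nf * (ni // groups) * ks * ks

ni, nh, nf = 256, 64, 256  # assumed widths, for illustration only

original = [
    conv_params(ni, nh, 1),              # 1x1 reduce
    conv_params(nh, nh, 3),              # standard 3x3
    conv_params(nh, nf, 1),              # 1x1 expand
]
modified = [
    conv_params(ni, nh, 1),                      # 1x1 reduce (unchanged)
    conv_params(nh, nh * 4, 3, groups=nh),       # 3x3 depthwise, multiplier 4
    conv_params(nh * 4, nf, 1),                  # 1x1 now takes nh*4 inputs
]

print("original:", original, sum(original))
print("modified:", modified, sum(modified))
```

With these widths the 3x3 layer itself becomes far smaller, while the final 1x1 grows because it now receives nh*4 channels, so the block total is not reduced.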

I’ll fill in the results on ImageWoof as they come in, compared with the current leaderboard (though I know many of you have better results but didn’t PR.)

| Size (px) | Epochs | URL (as of 7/7) | acc (as of 7/7) | new | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | fastai2 2020-01 + MaxBlurPool | 73.37% | 77.83 (81.89) | 5 |
| 128 | 20 | fastai2 2020-01 + MaxBlurPool | 85.52% | 86.62 (88.77) | 5 |
| 128 | 80 | fastai2 2020-01 | 87.20% | 88.13 (90.22) | 1 |
| 128 | 200 | fastai2 2020-01 | 87.20% | (90.71) | 1 |
| 192 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 77.87% | 81.15 (81.76) | 5 |
| 192 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 87.85% | 88.37 | 5 |
| 192 | 80 | fastai2 2020-01 | 89.21% | 89.89 (91.44) | 1 |
| 192 | 200 | fastai2 2020-01 | 89.54% | 90.32 | 1 |
| 256 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 78.84% | 82.33 | 5 |
| 256 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 88.58% | 89.53 | 5 |
| 256 | 80 | fastai2 2020-01 | 90.48% | 90.93 | 1 |
| 256 | 200 | fastai2 2020-01 | 90.38% | | 1 |

In parentheses are the results when applying depthwise (x4) to the stem as well, sticking with the [3,32,64,64] that fastai has optimized. Unfortunately it slows down the training considerably (with batch size=16). There must be ways to do better, timewise.

You may run the code directly here (thanks to @a_yasyrev) or implement it easily in your own network (ResNet or not)


Very interesting! It reminds me of the lecture where Jeremy said something along the lines of if you see an old network without resnets, try adding them. Maybe now we can add, ‘if you see a network without separable depthwise convolutions, try adding them’. Staying tuned…


Very interesting!
In your code, you didn’t use BN and an activation function in that middle block.
Did you try it with act and BN, and some other variants?

Yes, at first I had both act and bn, then I took out act for most of the 80 and 200 runs because it seemed to perform better. Then, in order to run depthwise (stem+body) on 192px and keep at bs=16 (on a single GPU on colab), I had to remove bn. Sorry I didn’t document it in the notebook.

Turns out I forgot to do stride=2 at the very first layer (3 channels to 32 channels), which meant the entire network was four times as costly (clocking at 7 min per epoch, instead of 2 min). But when I put the stride=2 back, it doesn’t seem to do as well. So the higher numbers in parentheses are attributable to the model being 4 times bigger, rather than to depthwise (x4) as I thought :man_facepalming: I’m re-doing the 80 epoch run to see if I can beat the all-time record 91.44. [Update: Indeed, with no striding at the beginning, the model without any of my tweaks is able to achieve 91.21 after 80 epochs. So I didn’t find a better model after all :frowning:]
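The factor of four follows from the feature-map size: without stride=2 in the first layer every later layer sees a map twice as wide and twice as tall. A minimal sketch, assuming 192px input, 3x3 kernels with padding 1, and the second stem conv (32 to 64 channels) as the example layer; the helper names are made up for illustration:

```python
def conv_out_size(size: int, ks: int, stride: int, padding: int) -> int:
    """Output width/height of a convolution on a square input."""
    return (size + 2 * padding - ks) // stride + 1

def conv_macs(size: int, ni: int, nf: int, ks: int, stride: int = 1) -> int:
    """Multiply-accumulates of a conv with padding ks // 2 on a size x size input."""
    out = conv_out_size(size, ks, stride, ks // 2)
    return out * out * nf * ni * ks * ks

image = 192
map_strided = conv_out_size(image, 3, 2, 1)     # first conv with stride=2
map_unstrided = conv_out_size(image, 3, 1, 1)   # first conv with stride=1

# Cost of the next 3x3 conv (32 -> 64 channels) on each feature map.
macs_strided = conv_macs(map_strided, 32, 64, 3)
macs_unstrided = conv_macs(map_unstrided, 32, 64, 3)

print(map_strided, map_unstrided)
print(macs_unstrided / macs_strided)
```

The number of weights is the same in both cases; it is the activations and the compute per image that grow fourfold, which is roughly in line with 7 min versus 2 min per epoch.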


I played with this code.
I only ran size 192.
On my machine I can’t reproduce the results. On 5 epochs I can reach 80.83 with dw on the stem (stride 2 on the first conv).
I tried to find out why, and it looks like the software version is very important. After upgrading fastai from 1.60 to 1.61 (and pytorch and cuda), the results became better with the same code.
Anyway, I can’t reach the same results, so I kept it as the baseline.
I tried the base model with groups and found that it works very well too.
And this model has about 17M parameters vs 30M.
I played with groups in the stem. I have not tested many variants, but I now have very good results. I will show the code later - it needs more tests.
The stem has 4 convs, with stride=2 on the second conv.
At size 192 I now have:
5 epochs - 80.60,
20 epochs - 89.04 (only 2 runs),
80 epochs - only 1 run so far, but 90.20.

Great. I don’t know how it came to be that ResNets have stride=2 at the stem stage, and why it was chosen at the very first Conv layer. Looking forward to more variations!

I had an 80-epoch run (192px) that got 92.08, the all-time high for any size that I’ve seen. (no stride, had to cut out act and bn in conv2 to run at bs=16.)


Stride 2 is in the stem because of model size.
It is on the first conv because in the first ResNet the stem had only one conv: 7x7 with stride 2! Then, after Bag of Tricks, the stem was changed to three 3x3 convs, as that is equivalent but has fewer parameters. I tried putting stride 2 on the second layer, and sometimes it works, but not always.

Did you try to use blur pooling in the stem instead of striding? Wouldn’t it be better to do that at the end of stem, so as to preserve as much information as possible?

Yes, I did. I find that leaving MaxPool in the stem is at least as good as Blur, sometimes better, and the model becomes a little faster.
I usually try both variants.


I can’t reproduce the results from the fastai v1 pipeline on v2.
I tried the same models (taken from an external lib, like timm) and the same optimizer (Adam) with the same parameters, using fit_one_cycle.
On v2 I can’t reach the accuracy I get on v1.
Can somebody confirm that their results are the same?

Hi everyone,

I just updated my mininet notebook to use the latest version of fastai and found that reducing batch size improves accuracy. Knowing that batch size roughly scales with learning rate, it’s strange that I can’t get the same improvement by changing learning rate. Reducing batch size increases the number of iterations - maybe this is significant when training over just 5 epochs?

I’d be interested to know what you think as I’m working on a tabular problem where we can train with >4000 items in a batch but … bs=32 gets better results!

I would guess that like you mentioned, a smaller batch size means more update steps, meaning more opportunities to learn.

Generally the advantage of larger batch sizes is that it reduces the variance/noise between batches, so the gradients are more consistent between steps. For example if your data is very noisy and you have small batches, then one batch might result in the gradients “pointing” in one direction but in the next batch all the gradients might end up pointing in the opposite direction. So it takes longer for the model to figure out how best to learn. With a larger batch size this noise is averaged out as the model has a better idea of the overall distribution of your data.
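A toy simulation of that averaging effect: treat each sample's gradient as a noisy number and look at how much the batch mean moves from one batch to the next. The Gaussian noise model and all the constants are assumptions chosen for illustration, not measurements from any real network.

```python
import random
from statistics import fmean, pstdev

def batch_mean_spread(batch_size: int, n_batches: int = 1000, seed: int = 0) -> float:
    """Std of the batch-mean 'gradient' across many batches.

    Each per-sample gradient is drawn from a Gaussian with mean 1 and std 5
    (an arbitrary toy choice).
    """
    rng = random.Random(seed)
    means = [
        fmean([rng.gauss(1.0, 5.0) for _ in range(batch_size)])
        for _ in range(n_batches)
    ]
    return pstdev(means)

for bs in (4, 16, 64):
    print(f"bs={bs:3d}: spread of batch mean = {batch_mean_spread(bs):.3f}")
```

In this toy model the spread falls roughly as 1/sqrt(batch size), so going from 4 to 64 items per batch cuts the noise by about a factor of four.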

But if you’re limited in the number of epochs to run, like in ImageNette/Woof/Wang, then you’ve also limited the number of opportunities the network has to learn, so there’s a trade-off between a higher batch size for stability and a lower batch size for more opportunities to learn…
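The "opportunities to learn" can be counted directly as optimizer steps. A minimal sketch, assuming a training set of 9,000 items (an illustrative round number, not the exact dataset size) and that the last partial batch is kept:

```python
import math

def total_steps(n_items: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates when the last partial batch is kept."""
    return math.ceil(n_items / batch_size) * epochs

n_items = 9_000  # assumed dataset size, for illustration only
for bs in (16, 32, 64):
    print(f"bs={bs}: {total_steps(n_items, bs, epochs=5)} updates in 5 epochs")
```

Halving the batch size doubles the number of updates for a fixed epoch budget, which matters most when the budget is as small as 5 epochs.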

It’s been a while since I read about learning rate and batch size scaling, but my guess would be that a larger batch size means that the model has a better idea of the data distribution and so can be more confident in taking larger steps, i.e. using a higher learning rate.

That’s all a little bit hand-wavy, but I hope the intuition is correct. Happy to hear from others!
