Imagenette/Imagewoof Leaderboards

I started testing the twist layer.
Let's move this discussion to a separate thread!

I think we need more than just one leaderboard with the best score. It may be a good idea to have another page with a list, so we can compare results not only with the best but with different models or settings. That way we can compare not against the best, but, for example, against the default xresnet or resnet.
We can have a template for submissions, with results at different sizes, links to notebooks, etc. We can submit results to some folder, then a script forms the list, and the best can be moved to the main leaderboard.

Sorry for using this thread as a lab notebook… but today I was trying to tweak the "hyperparameter" (the 0.7 in initializing parameters with uniform_(-0.7,0.7) in conv_twist), and the 5-epoch accuracy went further up as I increased it. The way I had it in mind, (center_x, center_y) is the point around which conv_twist is "twisting" the image, which at this scale lies in the [-1,1]x[-1,1] square.

turning it up to 1.5 (so the center can be outside the image)

[0.741156 0.73632  0.744973 0.751591 0.72563 ]
mean 0.7399338, std 0.008722797

at 2.0,

[0.757954 0.747264 0.748282 0.750064 0.75719 ]
mean 0.75215065, std 0.004522691

at 2.5

[0.759735 0.755663 0.75999  0.753372 0.761262]
mean 0.75800455, std 0.0029829554
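For concreteness, here is a hypothetical sketch of the kind of initialization being varied (the names `bound` and `centers` are mine; the actual conv_twist code isn't shown in this thread). The per-filter twist centers live in normalized image coordinates, where the image occupies the [-1,1]x[-1,1] square, so a bound of 2.0 lets centers land well outside the image:

```python
import torch

bound = 2.0     # the "hyperparameter": 0.7 originally, larger values did better above
n_filters = 64  # hypothetical number of filters in the layer

# One learnable (center_x, center_y) per filter, in normalized coordinates
# where the image is the [-1, 1] x [-1, 1] square.
centers = torch.nn.Parameter(torch.empty(n_filters, 2).uniform_(-bound, bound))
```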

(I’m not sure we can still call this ResNet, as it is using 3x3 exclusively.)

I will start a new thread [Update: new thread here]. This reminds me of the Polymath projects (of Terry Tao and others), a series of online collaborative projects in research mathematics in recent years. Participants would post small comments to a blog post describing the problem, and once in a while (when the comments reached the hundreds) the host would start a new post summarizing what they had learned, and discussion would take off from there. There were a couple of success stories. I'm new to fastai, but that is probably what's happening here too.

3 Likes

Very interesting!
I did a short test with Twist - only at size 128, 5 and 20 epochs. Looks very good!
5 epochs (3 runs) - 75.13%
20 epochs (2 runs) - 86.39%
Here is the notebook: https://github.com/ayasyrev/imagenette_experiments/blob/master/Woof_twist_s128_e5_7513_e20_8639.ipynb

Right now I'm trying to finalize a couple of experiments with new tricks. I hope to post about them tomorrow (two days max). And I tried them with Twist and have some findings - the results are VERY promising!
So, on 5 epochs, 76-76.5%. I need to do more runs and checks… Stay tuned! :sunglasses:

5 Likes

I stumbled upon this publication, which (somehow) reminded me of your twist layer:


Code: https://github.com/iamhankai/ghostnet.pytorch

TL;DR: The Ghost module creates additional feature maps from a conv module but with “cheaper” operations.
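A minimal PyTorch sketch of that idea (simplified from the paper and repo; the ratio and kernel sizes here are just illustrative defaults):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generate a few 'intrinsic' feature maps with an ordinary conv, then
    derive extra 'ghost' maps from them with a cheap depthwise conv."""
    def __init__(self, ni, nf, ratio=2, dw_size=3):
        super().__init__()
        init_ch = nf // ratio        # intrinsic maps from the primary conv
        cheap_ch = nf - init_ch      # ghost maps from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(ni, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),  # depthwise: the cheap op
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```

With ratio=2, half the output channels come from the ordinary conv and half from the depthwise pass over them, which is where the savings come from.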

I seem to have found a simple way to get consistently good results across the board. (It has nothing to do with what I posted above, and I'll see if I can incorporate that in.) In fastai's implementation of the Bottleneck block in ResNet50, change the middle (3x3) conv as follows:

ConvLayer(ni, nh, 1, act_fn=act_fn, bn_1st=bn_1st),
ConvLayer(nh, nh*4, 3, groups=nh, act=False, bn_1st=bn_1st),
ConvLayer(nh*4, nf, 1, zero_bn=zero_bn, act=False, bn_1st=bn_1st)

This seems to be what MobileNet etc. do with so-called "depthwise separable convolution" with a depth multiplier of 4. Not sure why it isn't more widely used.

[Actually, looking at fastaiv2’s source code it seems that dw=True would be doing just that.]
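For anyone not on fastai, here is a rough, self-contained PyTorch sketch of the same block (my own naming and simplifications: no stride handling, plain ReLU, standard BN placement):

```python
import torch
import torch.nn as nn

def conv_bn(ni, nf, ks, groups=1, act=True):
    layers = [nn.Conv2d(ni, nf, ks, padding=ks // 2, groups=groups, bias=False),
              nn.BatchNorm2d(nf)]
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class SepBottleneck(nn.Module):
    """Bottleneck whose middle 3x3 conv is depthwise with a x4 depth
    multiplier: nh -> nh*4 channels with groups=nh."""
    def __init__(self, ni, nf, nh):
        super().__init__()
        self.convs = nn.Sequential(
            conv_bn(ni, nh, 1),
            conv_bn(nh, nh * 4, 3, groups=nh, act=False),  # depthwise x4
            conv_bn(nh * 4, nf, 1, act=False))
        self.idconv = nn.Identity() if ni == nf else conv_bn(ni, nf, 1, act=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.convs(x) + self.idconv(x))
```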

I'll fill in the results on ImageWoof as they come in, compared with the current leaderboard (though I know many of you have better results but didn't submit a PR).

| Size (px) | Epochs | URL (as of 7/7) | acc (as of 7/7) | new | # Runs |
| --- | --- | --- | --- | --- | --- |
| 128 | 5 | fastai2 train_imagenette.py 2020-01 + MaxBlurPool | 73.37% | 77.83 (81.89) | 5 |
| 128 | 20 | fastai2 train_imagenette.py 2020-01 + MaxBlurPool | 85.52% | 86.62 (88.77) | 5 |
| 128 | 80 | fastai2 train_imagenette.py 2020-01 | 87.20% | 88.13 (90.22) | 1 |
| 128 | 200 | fastai2 train_imagenette.py 2020-01 | 87.20% | (90.71) | 1 |
| 192 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 77.87% | 81.15 (81.76) | 5 |
| 192 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 87.85% | 88.37 | 5 |
| 192 | 80 | fastai2 train_imagenette.py 2020-01 | 89.21% | 89.89 (91.44) | 1 |
| 192 | 200 | fastai2 train_imagenette.py 2020-01 | 89.54% | 90.32 | 1 |
| 256 | 5 | Resnet Trick + Mish + Sa + MaxBlurPool | 78.84% | 82.33 | 5 |
| 256 | 20 | Resnet Trick + Mish + Sa + MaxBlurPool | 88.58% | 89.53 | 5 |
| 256 | 80 | fastai2 train_imagenette.py 2020-01 | 90.48% | 90.93 | 1 |
| 256 | 200 | fastai2 train_imagenette.py 2020-01 | 90.38% | | 1 |

In parentheses are the results when applying depthwise (x4) to the stem as well, sticking with the [3,32,64,64] that fastai has optimized. Unfortunately it slows down the training considerably (with batch size=16). There must be ways to do better, timewise.

You may run the code directly here (thanks to @a_yasyrev): https://github.com/liuyao12/imagenette_experiments/blob/master/Woof_ResNet_separable.ipynb or implement it easily in your own network (ResNet or not).

6 Likes

Very interesting! It reminds me of the lecture where Jeremy said something along the lines of if you see an old network without resnets, try adding them. Maybe now we can add, ‘if you see a network without separable depthwise convolutions, try adding them’. Staying tuned…

1 Like

Very interesting!
In your code, you didn't use BN or an activation function in that middle block.
Did you try it with act and BN, and some variants?

Yes, at first I had both act and bn, then I took out act for most of the 80- and 200-epoch runs, as it seemed to perform better. Then, in order to run depthwise (stem+body) on 192px and keep bs=16 (on a single GPU on colab), I had to remove bn. Sorry I didn't document it in the notebook.

Turns out I forgot to do stride=2 at the very first layer (3 channels to 32 channels), which meant the entire network was four times as costly (clocking at 7 min per epoch, instead of 2 min). But when I put the stride=2 back, it doesn’t seem to do as well. So the higher numbers in parentheses are attributable to the model being 4 times bigger, rather than to depthwise (x4) as I thought :man_facepalming: I’m re-doing the 80 epoch run to see if I can beat the all-time record 91.44. [Update: Indeed, with no striding at the beginning, the model without any of my tweaks is able to achieve 91.21 after 80 epochs. So I didn’t find a better model after all :frowning:]

2 Likes

Played with this code.
Ran only at size 192.
On my machine I can't reproduce the results. On 5 epochs I can reach 80.83 with dw on the stem (stride 2 on the first conv).
I tried to find out why, and it looks like the software version is very important. After upgrading fastai from 1.60 to 1.61 (and pytorch and cuda), results became better with the same code.
Anyway, I can't reach the same results, so I kept that as the baseline.
I tried the base model with groups and found it works very well too.
And this model has about 17M parameters vs 30M.
I played with groups in the stem. Not too many variants tested, but I now have very good results. I'll show the code later, as it needs more tests.
Stem with 4 convs, stride=2 on the second conv.
Now I have, at size 192:
5 epochs - 80.60,
20 epochs - 89.04 (only 2 runs),
80 epochs - only 1 run yet, but 90.20.
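As a placeholder until the code is posted, here is a hypothetical sketch of such a stem. The channel sizes, where the groups go, and the trailing MaxPool are my guesses, not the actual code:

```python
import torch
import torch.nn as nn

def conv_bn(ni, nf, stride=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(ni, nf, 3, stride, 1, groups=groups, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True))

# Hypothetical 4-conv stem with stride=2 on the second conv.
stem = nn.Sequential(
    conv_bn(3, 32),
    conv_bn(32, 64, stride=2),   # the only strided conv in the stem
    conv_bn(64, 64, groups=4),   # grouped conv, as experimented with above
    conv_bn(64, 64),
    nn.MaxPool2d(3, stride=2, padding=1))
```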

Great. I don’t know how it came to be that ResNets have stride=2 at the stem stage, and why it was chosen at the very first Conv layer. Looking forward to more variations!

I had an 80-epoch run (192px) that got 92.08, the all-time high for any size that I’ve seen. (no stride, had to cut out act and bn in conv2 to run at bs=16.)

1 Like

Stride 2 in the stem is because of model size.
It's on the first conv because in the original ResNet the stem had only one conv: 7x7 with stride 2! Then, after Bag of Tricks, the stem was changed to three 3x3 convs, as it's the same but with fewer parameters. I tried putting stride 2 on the second layer, and sometimes it works. But not always.

Did you try to use blur pooling in the stem instead of striding? Wouldn’t it be better to do that at the end of stem, so as to preserve as much information as possible?

Yes, I did. I found that leaving MaxPool in the stem is at least as good as blur, sometimes better, and the model becomes a little bit faster.
Usually I try both variants.
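For reference, a minimal sketch of the MaxBlurPool idea being compared here (after Zhang's "Making Convolutional Networks Shift-Invariant Again"; the 3x3 binomial filter is the common choice, and details are simplified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxBlurPool2d(nn.Module):
    """Max-pool with stride 1, then downsample with an anti-aliasing blur."""
    def __init__(self, channels):
        super().__init__()
        f = torch.tensor([1., 2., 1.])
        k = (f[:, None] * f[None, :]) / 16.0          # 3x3 binomial kernel
        self.register_buffer('kernel', k.expand(channels, 1, 3, 3).contiguous())
        self.channels = channels

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)  # dense max, no subsampling
        return F.conv2d(x, self.kernel, stride=2, padding=1,
                        groups=self.channels)         # blurred downsample
```

A plain stem MaxPool (nn.MaxPool2d(3, 2, 1)) does the max and the subsampling in one step, which is why it's a bit faster.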

1 Like

I can't reproduce results from the fastai v1 pipeline on v2.
I tried the same models (taken from an external lib, like timm), the same optimizer (Adam) with the same parameters, and fit_one_cycle.
And on v2 I can't reach the accuracy from v1.
Can somebody confirm whether the results are the same?

Hi everyone,

I just updated my mininet notebook to use the latest version of fastai and found that reducing the batch size improves accuracy. Knowing that the learning rate roughly scales with batch size, it's strange that I can't get the same improvement by changing the learning rate. Reducing the batch size increases the number of iterations - maybe this is significant when training over just 5 epochs?

I’d be interested to know what you think as I’m working on a tabular problem where we can train with >4000 items in a batch but … bs=32 gets better results!

I would guess that like you mentioned, a smaller batch size means more update steps, meaning more opportunities to learn.

Generally the advantage of larger batch sizes is that it reduces the variance/noise between batches, so the gradients are more consistent between steps. For example if your data is very noisy and you have small batches, then one batch might result in the gradients “pointing” in one direction but in the next batch all the gradients might end up pointing in the opposite direction. So it takes longer for the model to figure out how best to learn. With a larger batch size this noise is averaged out as the model has a better idea of the overall distribution of your data.

But if you're limited in the number of epochs to run, like in ImageNette/Woof/Wang, then you've also limited the number of opportunities the network has to learn, so there's a trade-off there between a higher batch size for stability and a lower batch size for more opportunities to learn…

It's been a while since I read about learning rate and batch size scaling, but my guess would be that a larger batch size means the model has a better idea of the data distribution and so can be more confident taking larger steps, i.e. using a higher learning rate.
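The usual statement of that heuristic is the linear scaling rule: when you multiply the batch size by k, multiply the learning rate by k. It's only a rule of thumb (it tends to break down at very large batches), but it's easy to apply as a sanity check:

```python
def scale_lr(base_lr, base_bs, new_bs):
    """Linear scaling rule: lr grows in proportion to batch size."""
    return base_lr * new_bs / base_bs

# Going from bs=32 at lr=1e-3 up to bs=128 suggests lr=4e-3;
# going down to bs=8 suggests lr=2.5e-4.
print(scale_lr(1e-3, 32, 128))
print(scale_lr(1e-3, 32, 8))
```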

That's all a little bit hand-wavy, but I hope the intuition is correct. Happy to hear from others!

1 Like