Imagenette/Imagewoof Leaderboards

Great work! Amazing how just changing the order of matrix multiplication made such a difference in execution time! I used the xresnets with the older version of ssa from your GitHub repo on the Freesound Audio Tagging Kaggle competition and got an improvement of about 0.002 (comparing an ensemble of 2 models with and without ssa, one run each), but I had to run for significantly fewer epochs due to the Kaggle kernel time limit. I have to try this faster implementation and do a few runs to better check how much it improves :slight_smile:

Wow, I’m very happy that you are using my work.
The new version is much less sensitive to spatial dimensions: O(NC^2) vs O(CN^2 + NC^2), where N = height * width.

Also, I don’t think that trick can be used on the original self-attention layer, due to the presence of the softmax (which I believe is also O(N^2) for an N*N input): the softmax sits between the two matrix multiplications, so they can’t be reordered to avoid building the N*N matrix.
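
For intuition, here is a minimal shape-only sketch of where the softmax sits in the standard (SAGAN-style) self-attention layer; the 1x1 projections are left out, so this is an illustration rather than the actual layer:

    import torch
    import torch.nn.functional as F

    # Shape-only illustration of the original self-attention bottleneck.
    B, C, N = 2, 64, 32 * 32                  # N = H * W
    x = torch.randn(B, C, N)

    scores = torch.bmm(x.transpose(1, 2), x)  # (N,C) * (C,N) = (N,N) => O(CN^2)
    attn = F.softmax(scores, dim=-1)          # softmax over the full N*N matrix
    out = torch.bmm(x, attn)                  # (C,N) * (N,N) = (C,N) => O(CN^2)

    # The softmax sits between the two matmuls, so they cannot be reassociated
    # to avoid materializing the (N, N) matrix.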

I did some comparisons in this notebook: https://github.com/sdoria/SimpleSelfAttention/blob/master/Self%20Attention%20Time%20Complexity.ipynb

2 Likes

Could you explain what you mean by “changing the order of matrix multiplication”? I found these comments in your notebook

    # changed the order of multiplication to avoid O(N^2) complexity
    # (x*xT)*(W*x) instead of (x*(xT*(W*x)))

but I’m not clear on which line of code implements this, or why it makes it go faster.

The comment applies to the following lines of code (here x is the flattened feature map, shape (C,N) per batch element, with N = H*W):

convx = self.conv(x) # (C,C) * (C,N) = (C,N) => O(NC^2)
xxT = torch.bmm(x,x.permute(0,2,1).contiguous()) # (C,N) * (N,C) = (C,C) => O(NC^2)
o = torch.bmm(xxT, convx) # (C,C) * (C,N) = (C,N) => O(NC^2)

Originally, we were doing the operations in this order (note that conv(x) is analogous to a matrix multiplication W*x in this case, where W has dimension (C,C)):
x * (x^T * (conv(x)))

  1. conv(x) (dims: (C,C) and (C,N))
  2. x^T * (conv(x)) (dims: (N,C) and (C,N))
  3. x * (x^T * (conv(x))) (dims: (C,N) and (N,N))

This is the naive/“natural” order of implementing those operations.

Check out the complexity of matrix multiplication: https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

Complexity of those 3 operations:

  1. O(C^2*N)
  2. O(N^2*C)
  3. O(C*N^2)

Now, unless we increase the channel count a lot, the main issue is the terms proportional to N^2. This is because N = H*W: if you double the image’s height and width, N goes up by a factor of 4, so the N^2 terms go up by 2^4 = 16.

By changing the order of operations to (x*xT)*(W*x), we do:

  1. convx = conv(x)
  2. xxT = x*xT
  3. o = xxT * convx

And, as commented in the code at the top of this post, each of those 3 operations is O(NC^2), which means that run time is much less sensitive to image size.
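
To make the difference concrete, here is a minimal sketch (not the actual SimpleSelfAttention code) that computes the same product in both orders; the shapes and the plain matrix standing in for the 1x1 conv are just for illustration:

    import torch

    B, C, H, W = 2, 64, 64, 64
    N = H * W
    x = torch.randn(B, C, N)        # flattened feature map
    W_mat = torch.randn(B, C, C)    # stands in for the (C,C) conv weight

    convx = torch.bmm(W_mat, x)     # (C,C) * (C,N) = (C,N) => O(NC^2)

    # Naive order: x * (x^T * conv(x)) -- builds an (N,N) intermediate, O(CN^2)
    out_naive = torch.bmm(x, torch.bmm(x.transpose(1, 2), convx))

    # Reordered: (x * x^T) * conv(x) -- only (C,C) intermediates, O(NC^2)
    xxT = torch.bmm(x, x.transpose(1, 2))
    out_fast = torch.bmm(xxT, convx)

    # Same result up to floating-point error, very different cost.
    print((out_naive - out_fast).abs().max() / out_fast.abs().max())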

Let me know if you have any other questions.

5 Likes

Great explanation, thank you!

It’s really interesting that just changing the order of operations cuts down on the time so much. Makes me wonder if there are other places where we could save a lot of time by doing matrix multiplications in a clever order.

1 Like

@jamestjw (github handle) trained an xresnet with SimpleSelfAttention on card suits and calculated the mean weights for each pixel on the N * N attention grid. I thought it was pretty neat. This is what it looks like on an example:

source: https://nbviewer.jupyter.org/github/jamestjw/xresnet-self-attn/blob/master/Poker%20Cards%20(training%20an%20Attention%20Model%20from%20scratch).ipynb

7 Likes

Here is my (architecture) improvement of the great https://github.com/lessw2020/Ranger-Mish-ImageWoof-5 submission.

Specifically, I replaced all the pooling layers, AvgPool (inside the net) and MaxPool (after the stem), with MaxBlurPool2d from the “Making Convolutional Networks Shift-Invariant Again” paper: https://arxiv.org/abs/1904.11486

The rest of the entry is an exact clone of Ranger-Mish.
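
For reference, here is a rough sketch of the MaxBlurPool idea (a stride-1 max pool followed by a stride-2 binomial blur); this is not the exact code from the repo, and the kernel size and padding choices are just illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaxBlurPool2d(nn.Module):
        # Sketch: stride-1 max pool, then an anti-aliasing (blur) filter with stride 2,
        # following "Making Convolutional Networks Shift-Invariant Again".
        def __init__(self, channels):
            super().__init__()
            k = torch.tensor([1., 2., 1.])
            blur = k[:, None] * k[None, :]
            blur = blur / blur.sum()
            # one 3x3 blur filter per channel, applied depthwise
            self.register_buffer('blur', blur[None, None].repeat(channels, 1, 1, 1))
            self.channels = channels

        def forward(self, x):
            x = F.max_pool2d(x, kernel_size=2, stride=1)
            x = F.pad(x, (1, 1, 1, 1), mode='reflect')
            return F.conv2d(x, self.blur, stride=2, groups=self.channels)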

Improvement:

acc = [0.76 0.768 0.762 0.746 0.742]
acc_mean = 0.7556
acc_std = 0.009911619

vs original

acc = [0.708 0.74 0.738 0.756 0.734]
acc_mean = 0.7352
acc_std = 0.01552288

Link to the repo: https://github.com/ducha-aiki/Ranger-Mish-ImageWoof-5

Originally posted in the wrong branch.

7 Likes

Hi everyone,

I’m trying to replicate the Imagenette results shown in lesson 11 (starting with 128px, 5 epochs), but I’m getting much lower accuracy than expected when running on my laptop - running on GCP gives the expected results. It would be great if anyone could help me understand what I’m doing wrong (o:

Here’s my notebook showing how I’m trying to replicate it: https://github.com/pete88b/data-science/blob/master/fastai-things/imagenette-replicate-2019-04-08.ipynb

Edit: I think the answer might be that Imagenette in fastai 1.0.58 (on the GCP VM) has 500 validation items, but in 1.0.60 (on my laptop) it has 3,925? Edit 2: pretty sure this is it - if I use the imagenette2-160 data on GCP, we’re back to 75% accuracy.
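
In case it helps anyone else, here’s a quick sketch for checking which version of the dataset fastai downloaded (the 'val' folder layout and .JPEG extension are assumptions based on the standard dataset):

    from fastai.datasets import untar_data, URLs

    # check how many validation images the downloaded Imagenette has
    path = untar_data(URLs.IMAGENETTE_160)
    n_valid = len(list((path/'val').rglob('*.JPEG')))
    print(path, n_valid)
    # the older imagenette-160 has 500 validation images; imagenette2-160 has 3,925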

Pete

Thanks to everybody who has participated in this! This is awesome!
I’ve tried a lot of the tricks from here and I like them a lot! Sometimes I find a small improvement, but by the time I’ve tested the solution, somebody has improved things further, so I go back and implement that too. It’s interesting how far this can go!
The last improvement from ducha-aiki is amazing - it works very well!
I could only beat it with the same attitude. It’s not a big improvement, but anyway…
I tested it only on Woof and only at 5 and 20 epochs.

So here are my results:
size 128,
leaderboard now: 5 ep - 73.37%, 20 ep - 85.52%.
my results:
5 ep: 73.58%, std 0.0084 [0.751082, 0.734029, 0.728939, 0.727412, 0.737847]
20 ep: no improvement - 85.22%, std 0.0061 [0.862560, 0.853143, 0.853652, 0.844490, 0.847544]
size 192,
leaderboard now: 5 ep - 75.94%, 20 ep - 87.25%, 80 ep - 89.21%
my results:
5 ep, bs64: 76.55%, std 0.0028 [0.765335, 0.770934, 0.763808, 0.763044, 0.764571]
20 ep, bs32: 87.85%, std 0.0022 [0.874777, 0.877832, 0.878595, 0.880377, 0.881395]
20 ep, bs64: 87.44%, std 0.0014 [0.874014, 0.874014, 0.873505, 0.877322, 0.873250]
size 256,
leaderboard now: 5 ep - 76.87%, 20 ep - 88.29%.
my results:
5 ep: 78.84% (3 runs), std 0.0042 [0.783151, 0.788496, 0.793586]
20 ep: 88.58%, std 0.0029 [0.887503, 0.882667, 0.887758, 0.889285, 0.881904]
Here is the link to the notebook for size 128:
https://github.com/ayasyrev/imagenette_experiments/blob/master/Woof_MaxBlurPool_ResnetTrick_s128.ipynb
The others are in the repo too.


The notebook runs on Colab, so it’s easy to rerun.
I refactored the xresnet from fastai v1 to better understand the code and make it easier to change the model. And now, thanks to nbdev, I can easily share this code. I’ve been using it for some time, but it’s still not for production (and not intended to be). I change the code whenever I find I want to change something in the model and can’t do it easily. When I started moving it to GitHub with nbdev, I rewrote a lot and realized it was time for more refactoring. So I started rewriting; it’s now more powerful, but still more of a concept. Have a look - I hope it can be helpful.

Back to my solution. I like the trick from “Bag of Tricks” where the stride-2 conv on the identity path is replaced by a pool plus a stride-1 conv.
So I thought: why not do the same on the main path - replace the stride-2 conv with a stride-1 conv plus a pool? We already have the pool, so I changed the ResBlock to first apply the pool to the input and then split it into the conv and identity paths. So look at the code. I wrote an explanation of how I created the model here:
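
To illustrate the idea, here is a rough sketch of the block (not the exact code from my repo; the AvgPool and the 1x1 conv on the identity path are simplifications):

    import torch.nn as nn

    def conv_bn(ni, nf, ks=3, stride=1, act=True):
        # stand-in for fastai's ConvLayer: conv + batchnorm (+ ReLU)
        layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
                  nn.BatchNorm2d(nf)]
        if act:
            layers.append(nn.ReLU(inplace=True))
        return nn.Sequential(*layers)

    class DownResBlock(nn.Module):
        # Sketch of the downsampling block described above: pool the input once,
        # then split into the conv path and the identity path.
        def __init__(self, ni, nf):
            super().__init__()
            self.pool = nn.AvgPool2d(2, ceil_mode=True)      # shared downsampling
            self.convs = nn.Sequential(
                conv_bn(ni, nf, stride=1),                   # stride 1 instead of 2
                conv_bn(nf, nf, stride=1, act=False))
            self.idconv = nn.Conv2d(ni, nf, 1) if ni != nf else nn.Identity()
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            x = self.pool(x)                                 # downsample before the split
            return self.act(self.convs(x) + self.idconv(x))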

4 Likes

@a_yasyrev it would be great if you could submit a PR for the ones that show a reasonable improvement - which looks like the 192 and 256 px ones, AFAICT. Congrats on the results! How does it impact training and inference speed?

Just ran tests on Colab, size 128, Tesla T4:
xresnet with SA - 1:15 per epoch.
xresnet, SA, Mish - 1:18 p/e.
new block, SA - 1:14 p/e.
new block, SA, Mish - 1:17 p/e.
So the speed is about the same here.
MaxBlurPool slows training, but gives very good results:
new block, SA, Mish, MaxBlurPool - 1:20.
Will check on another machine.

3 Likes

One more test on Colab, same Tesla T4.
size 256, bs 32:
xresnet - 2:48 per epoch.
xresnet, SA - 3:06 p/e.
xresnet, SA, Mish - 3:45 p/e.
xresnet, SA, Mish, MaxBlurPool - 4:07 p/e.

new block - 2:31 p/e.
new block, SA - 2:50 p/e.
new block, SA, Mish - 3:23 p/e.
new block, SA, Mish, MaxBlurPool - 3:49 p/e.

@Jeremy just a thought - should we also include inference times, such as batches/second or images/second? Part of this is also looking at how the models hold up from a deployment standpoint. Let me know your thoughts :slight_smile: (obviously we’d focus on accuracy, but it would be nice to know the real-world inference times too)
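
Something like this rough sketch could report images/second (assuming a PyTorch model and a batch already on the same device):

    import time
    import torch

    @torch.no_grad()
    def images_per_second(model, batch, n_iters=50, warmup=10):
        # rough inference throughput; model and batch are assumed to be on the same device
        model.eval()
        for _ in range(warmup):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return n_iters * batch.shape[0] / (time.perf_counter() - start)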

2 Likes

It’s a bit hard, since everyone has different hardware.

1 Like

Hi all, I’m excited to join in, even though I don’t quite have a better result yet.

As I had trouble with fastai2 (I couldn’t find the Mish module), I took @LessW2020 's repo from 6 months ago, but I couldn’t get 75% (ImageWoof, 5 epochs, 5 runs); I was only getting 67-68%. Maybe fastai had an update, or the dataset isn’t the same, per @pete88b ? In any case, I was able to get a 1% improvement on that by replacing the 3x3 conv layer with something a little more complicated (but mathematically well-motivated). I wonder if anyone could add the couple of lines (TwistLayer) from my repo to your 75%-performing model and see how it fares. Thanks all (and especially to Jeremy for making it all possible).

(Apologies for the poor code and poor PyTorch practice. It’s certainly a waste to have 2 extra full-scale conv2d weights, but I hope this makes it a little easier to understand what’s going on, and to experiment with.)

1 Like

Welcome! Cool to see your results - nice job :slight_smile: Yes, the dataset was changed to make it a bit harder, and so the leaderboard percentages were adjusted (larger validation set).

1 Like

Is that why the dataset is called imagewoof2? I’m confused about what the current leaderboard is based on.

I made an 80-epoch run (still size=128) and got 87.27 at the end (the highest was 87.42 at the second-to-last epoch), which is slightly higher than the current record of 87.20. Hooray!

Yes it is :slight_smile: and it’s what the current leaderboard is based on. The baselines were re-run with the original setup we had found, to keep the comparison fair.

I’ve made a start by using twist in some of the “standard” models using fastai v2 dev: https://github.com/pete88b/data-science/blob/master/fastai-things/train-imagewoof-with-TwistLayer.ipynb
Hope it helps.

Thanks, @pete88b. As I mentioned, there’s a lot of waste in parameters.

If you are testing on ResNeXt, did you give each conv2d a groups argument? (That’s about as much as I understand of ResNeXt.)

Briefly, I’m adding two extra conv2d layers (which I call convx and convy), but you can see that I “symmetrized” the weights, so instead of 9 parameters for each filter/channel/feature map, it’s only 4 in effect. I also copied the convx weights into convy, so the entire convy is extraneous. Of course, once we have tested various possibilities, we could write the TwistLayer more efficiently.
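
If it helps to see the kind of constraint in code, here is a rough sketch (not the actual TwistLayer): the odd-under-180-degree-rotation parameterization shown is just one way to get 4 effective parameters out of 9, and convy simply reuses convx’s weights as described above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SymmetrizedConvPair(nn.Module):
        # Illustration only: keep the part of each 3x3 kernel that is odd under
        # 180-degree rotation (center forced to zero), leaving 4 free parameters.
        # This is one possible symmetry, not necessarily the TwistLayer's.
        def __init__(self, channels):
            super().__init__()
            self.convx = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

        def _symmetrized(self):
            w = self.convx.weight
            return 0.5 * (w - torch.flip(w, dims=(2, 3)))  # odd part of the kernel

        def forward(self, x):
            w = self._symmetrized()
            out_x = F.conv2d(x, w, padding=1)
            out_y = F.conv2d(x, w, padding=1)  # "convy": same (copied) weights as convx
            return out_x, out_y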

You asked where you can learn about TwistLayer. It’s related to the Neural ODE paper (and others) that interprets ResNet as solving a differential equation. I wrote about the mathematics here

but at the time I didn’t actually know ResNet (even now I know very little beyond ResNet) and I should do a complete rewrite.

I can open up a separate thread to answer questions. [Update: new thread here]

2 Likes