Imagenette/Imagewoof Leaderboards

The comment applies to the following lines of code:

convx = self.conv(x) # (C,C) * (C,N) = (C,N) => O(NC^2)
xxT = torch.bmm(x,x.permute(0,2,1).contiguous()) # (C,N) * (N,C) = (C,C) => O(NC^2)
o = torch.bmm(xxT, convx) # (C,C) * (C,N) = (C,N) => O(NC^2)

Originally we were doing operations in this order (note that conv(x) is analogous to a matrix multiplication W*x in this case, where W has dimension (C,C)):
x * (x^T * (conv(x)))

  1. conv(x) (dims: (C,C) and (C,N))
  2. x^T * (conv(x)) (dims: (N,C) and (C,N))
  3. x * (x^T * (conv(x))) (dims: (C,N) and (N,N))

This is the naive/“natural” order of implementing those operations.

Check out the complexity of matrix multiplication: https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

Complexity of those 3 operations:

  1. O(C^2 * N)
  2. O(N^2 * C)
  3. O(C * N^2)

Now, unless we increase the number of channels a lot, the main problem is the terms proportional to N^2, because N = H*W: if you double the image height and width, N goes up 4x, so the N^2 terms grow by a factor of 2^4 = 16.

By changing the order of operations to (xxT)(W*x), we do:

  1. convx = conv(x)
  2. xxT = x*xT
  3. o = xxT * convx

And, as commented in the code at the top of this post, all three of those operations are O(NC^2), which means that run time is much less sensitive to image size.
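
To make the difference concrete, here is a small, self-contained PyTorch sketch (the sizes are made up for illustration): by associativity the two orderings give the same result, but the reordered version never materialises an (N,N) intermediate.

import torch

B, C, H, W = 8, 64, 32, 32
N = H * W                                   # N = H*W, grows 4x when the image side doubles
x = torch.randn(B, C, N)
Wc = torch.randn(C, C)                      # stand-in for the (C,C) conv weight

# naive order: x @ (x^T @ (W x)) -- the middle product is (N,N), cost O(C*N^2)
naive = torch.bmm(x, torch.bmm(x.permute(0, 2, 1), Wc @ x))

# reordered: (x @ x^T) @ (W x) -- every intermediate is at most (C,N), cost O(N*C^2)
reordered = torch.bmm(torch.bmm(x, x.permute(0, 2, 1)), Wc @ x)

# same result up to floating point error
print(((naive - reordered).abs().max() / naive.abs().max()).item())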

Let me know if you have any other questions.

5 Likes

Great explanation, thank you!

It’s really interesting that just changing the order of operations cuts down on the time so much. Makes me wonder if there are other places where we could save a lot of time by doing matrix multiplications in a clever order.

1 Like

@jamestjw (github handle) trained an xresnet with SimpleSelfAttention on card suits and calculated the mean weights for each pixel on the N * N attention grid. I thought it was pretty neat. This is what it looks like on an example:

source: https://nbviewer.jupyter.org/github/jamestjw/xresnet-self-attn/blob/master/Poker%20Cards%20(training%20an%20Attention%20Model%20from%20scratch).ipynb

7 Likes

Here is my (architecture) improvement of the great https://github.com/lessw2020/Ranger-Mish-ImageWoof-5 submission.

Specifically, I replace all the pooling layers, AvgPool (inside the net) and MaxPool (after the stem), with MaxBlurPool2d from the "Making Convolutional Networks Shift-Invariant Again" paper https://arxiv.org/abs/1904.11486 (a rough sketch of the layer is below).

The rest of the entry is an exact clone of Ranger-Mish.
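
For reference, a minimal sketch of what a MaxBlurPool2d layer looks like, following the recipe in the paper: a dense max pool with stride 1, then a fixed anti-aliasing blur that does the downsampling. The class name, defaults and padding details here are my own simplification; the actual layer used in the submission is in the linked repo.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxBlurPool2d(nn.Module):
    # Max pool (stride 1) followed by a fixed binomial blur that does the downsampling,
    # as in "Making Convolutional Networks Shift-Invariant Again" (Zhang, 2019).
    # Only the 3x3 binomial blur is sketched here.
    def __init__(self, channels, pool_size=2, stride=2):
        super().__init__()
        self.pool = nn.MaxPool2d(pool_size, stride=1)       # dense max, no downsampling yet
        a = torch.tensor([1., 2., 1.])                       # 1D binomial filter
        blur = a[:, None] * a[None, :]
        blur = blur / blur.sum()
        # one copy of the blur kernel per channel (depthwise conv)
        self.register_buffer('blur', blur.expand(channels, 1, 3, 3).contiguous())
        self.stride, self.channels, self.pad = stride, channels, 1

    def forward(self, x):
        x = self.pool(x)
        return F.conv2d(x, self.blur, stride=self.stride, padding=self.pad, groups=self.channels)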

Improvement:

acc = [0.76 0.768 0.762 0.746 0.742]
acc_mean = 0.7556
acc_std = 0.009911619

vs original

acc = [0.708 0.74 0.738 0.756 0.734]
acc_mean = 0.7352
acc_std = 0.01552288

Link to the repo: https://github.com/ducha-aiki/Ranger-Mish-ImageWoof-5

Originally posted in the wrong thread

7 Likes

Hi everyone,

I’m trying to replicate the Imagenette results shown in lesson 11, starting with 128px and 5 epochs, but I’m getting much lower accuracy than expected when running on my laptop. Running on GCP gives the expected results. It would be great if anyone can help me understand what I’m doing wrong (o:

Here’s my notebook showing how I’m trying to replicate it: https://github.com/pete88b/data-science/blob/master/fastai-things/imagenette-replicate-2019-04-08.ipynb

Edit: I think the answer might be that Imagenette in fastai 1.0.58 (on the GCP VM) has 500 validation items, but 1.0.60 (on my laptop) has 3,925? Edit 2: pretty sure this is it; if I use the imagenette2-160 data on GCP, we’re back to 75% accuracy.

Pete

Thanks to everybody who participates in this! This is awesome!
I tried a lot of the tricks from here and I like them a lot! Sometimes I find a way to improve a little, but by the time I’ve tested my solution, somebody has improved things further, so I go back and implement that. Interesting how far this can go!
The last improvement from ducha-aiki is amazing, it works very well!
I can only beat it with the same attitude. It’s not a big improvement, but anyway…
I tested it only on Woof, and only at 5 and 20 epochs.

So here are my results:
size 128,
current leaderboard: 5 ep - 73.37%, 20 ep - 85.52%.
my results:
5 ep: 73.58%, std 0.0084 [0.751082, 0.734029, 0.728939, 0.727412, 0.737847]
20 ep: no improvement - 85.22%, std 0.0061 [0.862560, 0.853143, 0.853652, 0.844490, 0.847544]
size 192,
current leaderboard: 5 ep - 75.94%, 20 ep - 87.25%, 80 ep - 89.21%
my results:
5 ep, bs 64: 76.55%, std 0.0028 [0.765335, 0.770934, 0.763808, 0.763044, 0.764571]
20 ep, bs 32: 87.85%, std 0.0022 [0.874777, 0.877832, 0.878595, 0.880377, 0.881395]
20 ep, bs 64: 87.44%, std 0.0014 [0.874014, 0.874014, 0.873505, 0.877322, 0.873250]
size 256,
current leaderboard: 5 ep - 76.87%, 20 ep - 88.29%.
my results:
5 ep: 78.84% (3 runs), std 0.0042 [0.783151, 0.788496, 0.793586]
20 ep: 88.58%, std 0.0029 [0.887503, 0.882667, 0.887758, 0.889285, 0.881904]
Here is the link to the size 128 notebook:
https://github.com/ayasyrev/imagenette_experiments/blob/master/Woof_MaxBlurPool_ResnetTrick_s128.ipynb
The others are in the repo too.


The notebook runs on Colab, so it’s easy to rerun.
I refactored xresnet from fastai v1 to understand the code better and to make it easier to change the model. And now, thanks to nbdev, I can easily share this code. I’ve been using it for some time, but it’s still not production-ready (and not intended to be). I changed the code whenever I wanted to change something in the model but couldn’t do it easily, and when I started moving it to GitHub with nbdev I rewrote a lot and realised it was time for a bigger refactor. So I started rewriting; it’s now more powerful, but still more of a concept. Have a look, I hope it can be helpful.

Back to my solution. I like the trick from “Bag of Tricks” where the stride-2 conv on the identity path is replaced by a pool followed by a stride-1 conv.
So I thought: why not do the same with the main path, i.e. replace the stride-2 conv with a stride-1 conv plus a pool? We already have the pool, so I changed the ResBlock to first apply the pool to the input and then split into the conv and identity paths (a rough sketch of the idea is below). Have a look at the code. I wrote an explanation of how I built the model here:
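
A minimal sketch of the pool-first ResBlock idea (simplified; the helper conv_bn_act and the class name are mine, and the actual code is in the notebook linked above):

import torch.nn as nn

def conv_bn_act(ni, nf, ks=3, stride=1, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
              nn.BatchNorm2d(nf)]
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class PoolFirstResBlock(nn.Module):
    # Downsampling ResBlock that pools the input once, then feeds both paths
    # with stride-1 convs (instead of a stride-2 conv on either path).
    def __init__(self, ni, nf, stride=2):
        super().__init__()
        self.pool = nn.AvgPool2d(2, ceil_mode=True) if stride == 2 else nn.Identity()
        self.convs = nn.Sequential(conv_bn_act(ni, nf),
                                   conv_bn_act(nf, nf, act=False))
        self.idconv = nn.Identity() if ni == nf else conv_bn_act(ni, nf, ks=1, act=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.pool(x)                      # downsample once, shared by both paths
        return self.act(self.convs(x) + self.idconv(x))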

4 Likes

@a_yasyrev it would be great if you could submit a PR for the ones that show a reasonable improvement, which looks like the 192 and 256 px ones AFAICT. Congrats on the results! How does it impact training and inference speed?

Just ran tests on Colab, size 128, Tesla T4:
xresnet with SA - 1:15 per epoch.
xresnet, SA, Mish - 1:18 p/e.
new block, SA - 1:14 p/e.
new block, SA, Mish - 1:17 p/e.
So the speed is about the same here.
MaxBlurPool slows training, but gives very good results:
new block, SA, Mish, MaxBlurPool - 1:20 p/e.
Will check on another machine.

3 Likes

One more test on Colab, same Tesla T4.
size 256, bs 32:
xresnet - 2:48 per epoch.
xresnet with SA - 3:06 p/e.
xresnet, SA, Mish - 3:45 p/e.
xresnet, SA, Mish, MaxBlurPool - 4:07 p/e.

new block - 2:31 p/e.
new block, SA - 2:50 p/e.
new block, SA, Mish - 3:23 p/e.
new block, SA, Mish, MaxBlurPool - 3:49 p/e.

@Jeremy just a thought, should we also include inference times, such as batches/second or images/second? Part of this is also looking at how the models stack up realistically from a deployment standpoint. Let me know your thoughts on this :slight_smile: (obviously we’d focus on accuracy only, but it would be nice to know real-time inference speeds too)

2 Likes

It’s a bit hard, since everyone has different hardware.

1 Like

Hi all, I’m excited to join in, even though I don’t quite have a better result yet.

As I had trouble with fastai2 (can’t find the Mish module), I took @LessW2020’s repo from 6 months ago, but I couldn’t get 75% (ImageWoof, 5 epochs, 5 runs); I was only getting 67-68%. Maybe fastai had an update, or the dataset isn’t the same, per @pete88b? In any case, I was able to get a 1% improvement on that by changing the 3x3 conv layer to something a little more complicated (but mathematically well-motivated). I wonder if anyone could add the couple of lines (TwistLayer) from my repo to your 75%-performing model and see how it fares. Thanks all (especially to Jeremy for making it all possible.)

(Apologies for the poor code and poor PyTorch practice. It’s certainly wasteful to have 2 extra full-scale conv2d weights. I hope it’s a little easier to understand what’s going on, and to experiment with.)

1 Like

Welcome! Cool to see your results, nice job :slight_smile: Yes, the dataset was changed to make it a bit harder, and the leaderboard percentages were adjusted accordingly (larger validation set).

1 Like

Is that why the dataset is called imagewoof2? I’m confused about what the current leaderboard is based on.

I made an 80 epoch run (still size=128) and got 87.27 at the end (highest 87.42 at second to last epoch), which is slightly higher than the current record of 87.20. Hooray!

Yes it is :slight_smile: and it’s what the current leaderboard is based on. The baselines were re-run with the original setup we had found, to keep the comparison fair.

I’ve made a start by using twist in some of the “standard” models using fastai v2 dev: https://github.com/pete88b/data-science/blob/master/fastai-things/train-imagewoof-with-TwistLayer.ipynb
hope it helps

Thanks, @pete88b. As I mentioned, there’s a lot of waste in parameters.

If you are testing on ResNeXt, did you give each conv2d a groups argument? (That’s as much as I understand about ResNeXt.)

Briefly, I’m adding two extra conv2d layers (which I call convx and convy), but you can see that I “symmetrized” the weights, so instead of 9 parameters for each filter/channel/feature map there are effectively only 4. I also copied the convx weights into convy, so the entire convy is extraneous. Of course, once we have tested various possibilities, we could write the TwistLayer more efficiently.

You asked where you can learn about TwistLayer. It’s related to the Neural ODE paper (and others) that interprets ResNet as solving a differential equation. I wrote about the mathematics here

but at the time I didn’t actually know ResNet (even now I know very little beyond ResNet) and I should do a complete rewrite.
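
For anyone who hasn’t seen the connection, the standard observation (not specific to TwistLayer) is that a residual block

$$x_{l+1} = x_l + f(x_l)$$

is exactly one forward-Euler step of $\frac{dx}{dt} = f(x)$ with step size $\Delta t = 1$, so stacking residual blocks can be read as numerically integrating an ODE.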

I can open up a separate thread to answer questions. [Update: new thread here]

2 Likes

@liuyao that sounds very interesting. Since you’re decreasing the param count, you should get the best benefits with more epochs (so try 200) and less regularization (so try less mixup and larger random resize crop area).

2 Likes

I might not have followed the developments in this thread correctly, but can somebody briefly explain to me what a Twist Layer is?

1 Like

I’ve simplified it a bit and it seems to be doing better (I’ll update with results). Here’s the conv_twist layer, replacing each 3x3 convolution. I don’t know if I can explain it more briefly than the code does:

import numpy as np
import torch
import torch.nn as nn

class conv_twist(nn.Module):  # replacing each 3x3 Conv2d
    def __init__(self, ni, nf, stride=1):
        super(conv_twist, self).__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        self.convx = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        self.convy = nn.Conv2d(ni, nf, kernel_size=3, stride=stride, padding=1, bias=False)
        self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2
        self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)
        # self.radii = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        self.center_x = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        self.center_y = nn.Parameter(torch.Tensor(nf), requires_grad=True)
        # self.radii.data.uniform_(0.3, 0.7)
        self.center_x.data.uniform_(-0.7, 0.7)
        self.center_y.data.uniform_(-0.7, 0.7)

    def forward(self, x):
        self.convx.weight.data = (self.convx.weight - self.convx.weight.flip(2).flip(3)) / 2  # make convx a first-order operator by symmetrizing it
        self.convy.weight.data = (self.convy.weight - self.convy.weight.flip(2).flip(3)) / 2
        # self.convy.weight.data = self.convx.weight.transpose(2,3).flip(2)                     # make convy a 90 degree rotation of convx
        x1 = self.conv(x)
        _, c, h, w = x1.size()
        XX = torch.from_numpy(np.indices((1,h,w))[2]*2/w).type(x.dtype).to(x.device) - self.center_x.view(-1,1,1)
        YY = torch.from_numpy(np.indices((1,h,w))[1]*2/h).type(x.dtype).to(x.device) - self.center_y.view(-1,1,1)
        # mask = ramp_func((XX**2+YY**2)/(self.radii.type(x.dtype).to(x.device).view(-1,1,1)**2))
        return x1 + (XX * self.convx(x) + YY * self.convy(x)) # * mask
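
A quick shape sanity check (hypothetical usage, not from the original post) showing that the layer is a drop-in replacement for a standard 3x3 conv:

x = torch.randn(2, 64, 32, 32)
layer = conv_twist(64, 64, stride=1)
print(layer(x).shape)   # torch.Size([2, 64, 32, 32]), same as nn.Conv2d(64, 64, 3, padding=1)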

Update: imagewoof2

| Size (px) | Epochs | Model | Mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 128 | 5 | (Leaderboard) | - | 73.37% | 5, mean |
| 128 | 5 | RMS | 0 | 68.54% | 5, mean |
| 128 | 5 | RMS + twist | 0 | 70.95% | 5, mean |
| 128 | 20 | (Leaderboard) | - | 85.52% | 5, mean |
| 128 | 20 | RMS | 0 | 84.62% | 5, mean |
| 128 | 20 | RMS + twist | 0 | 85.24% | 5, mean |
| 128 | 80 | (Leaderboard) | - | 87.20% | 1 |
| 128 | 80 | RMS + twist | 0.2 | 87.81% | 1 |
| 128 | 80 | RMS + twist | 0.5 | 88.52% | 1 |
| 128 | 200 | (Leaderboard) | - | 87.20% | 1 |
| 128 | 200 | RMS + twist | 0.2 | 88.70% | 1 |
| 256 | 200 | (Leaderboard) | - | 90.38% | 1 |
| 256 | 200 | RMS + twist | 0.2 | 91.52% | 1 |

imagenette2

| Size (px) | Epochs | Model | Mixup | Accuracy | # Runs |
|---|---|---|---|---|---|
| 256 | 200 | (Leaderboard) | - | 95.11% | 1 |
| 256 | 200 | RMS + twist | 0.5 | 95.87% | 1 |

@a_yasyrev, if you could help test with your ResNet trick + MaxBlurPool, that would be very nice.

1 Like