Imagenette/Imagewoof Leaderboards


I thought it’d be a good idea to start a thread dedicated to the Imagenette and Imagewoof leaderboards.

I have some work in progress, and will publish if I get interesting results.

But first, I tried to replicate the baseline results, and I am getting different numbers for Imagewoof. For example, at 256 px and 5 epochs, I get a range of 57.4% to 63% (4 runs, average 61.2%), compared to the published 57.6%.

It seems that results vary a lot between runs (at least for 5 epochs).


(Jeremy Howard (Admin)) #2

If you find something that consistently beats a leaderboard result, please send a PR to the repo to update the leaderboard! :slight_smile:



I haven’t quite had the time to figure out how to do a pull request, but I still want to post my results here right now.

The only result I have confidence in so far is 5 epochs on the 256 px Imagewoof, with a 67.6% average accuracy as opposed to 61.9% previously (over 10 runs each). But it looks promising for more epochs.

Edit: Submitted a PR. Also got results for 20 epochs (256px Imagewoof) with an accuracy of 85.7% vs 83.9% (each averaged over 10 runs).



I compared the same two models (the current best xresnet50 vs. the same model with SimpleSelfAttention) over 80 epochs, 10 runs each:

89.9% vs 90.3% accuracy

That difference doesn’t seem large enough to tell them apart, and running 1600 epochs in total was fairly resource-intensive, so I’ll have to figure out a new method to evaluate the new model.



Another interesting toy dataset is “Tiny Imagenet”, which can be downloaded here:

There is also a Kaggle leaderboard, but it’s a private competition:

Edit: the results on Kaggle don’t look realistic.

The paper below claims a top 1 validation accuracy of 62.73%:

And this one 56.9% on test dataset:



Here is another interesting data point:

ImageWoof, 128px, 5 epochs:

  • baseline model (xresnet50):
    results for 12 runs: [63.8, 59.6, 64, 61.6, 58, 65.6, 63.2, 61, 63.2, 64.2, 62.6, 60.8]
    average accuracy: 62.3%

  • xresnet50 + SelfAttention (self attention layer as currently implemented in fastai)
    results for 12 runs: [64.6, 64.6, 64.4, 62.6, 62, 57.2, 66, 65, 67.2, 66.8, 62.2, 65.8]
    average accuracy: 64.0%

  • xresnet50 + “SimpleSelfAttention” (cf. my github repo linked in this thread)
    results for 12 runs: [66.6, 64.8, 65.4, 66.4, 67, 65.6, 66.4, 62, 64.4, 64.2, 65.4, 64.8]
    average accuracy: 65.25%

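Both self-attention layers improve on the xresnet50 baseline when added to it: by about 1.7 percentage points on average for the fastai layer and about 3.0 points for SimpleSelfAttention. The simplified version of self-attention does at least as well as the original one, if not better.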
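As a quick check on the averages above, here is a minimal sketch that recomputes them from the listed runs using only the Python standard library. The dictionary keys are just labels chosen for this example.

```python
from statistics import mean

# Accuracy (%) per run, copied from the lists above (ImageWoof, 128px, 5 epochs).
runs = {
    "xresnet50": [63.8, 59.6, 64, 61.6, 58, 65.6, 63.2, 61, 63.2, 64.2, 62.6, 60.8],
    "xresnet50 + SelfAttention": [64.6, 64.6, 64.4, 62.6, 62, 57.2, 66, 65, 67.2, 66.8, 62.2, 65.8],
    "xresnet50 + SimpleSelfAttention": [66.6, 64.8, 65.4, 66.4, 67, 65.6, 66.4, 62, 64.4, 64.2, 65.4, 64.8],
}

baseline_mean = mean(runs["xresnet50"])
for name, accs in runs.items():
    # Report the average and the gap to the baseline in percentage points.
    gap = mean(accs) - baseline_mean
    print(f"{name}: n={len(accs)} mean={mean(accs):.2f} gap_vs_baseline={gap:+.2f}")
```

With only 12 runs per model and a spread of several points between runs, the gaps are best read as indicative rather than conclusive.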

Another factor is that, when increasing to 256px, I have been running into memory issues (on an RTX 2080 Ti) and have had to decrease the batch size when using the original self-attention layer. This has not been a problem with SimpleSelfAttention, and I have been able to keep the batch size at 64.



Any suggestions on where I could take those results? Is it worth running on imagenet? Writing a blog post?


(Jason Antic) #8

I’m certainly planning on trying your SimpleSelfAttention in DeOldify! Great to see these results. I suspected that image recognition would benefit from self-attention, especially after seeing the bag of features paper. It just makes sense. It certainly makes all the difference in DeOldify.

Twitter seems like a great place to get the word out honestly. It might take a bit before I get results back to you but if it’s looking good I’ll certainly be talking about it there.

@plain-old-dana Here’s the thread.



I ran into some hurdles trying this on Imagenet: mainly financial and technical ones, but I also got stuck wondering what a good test would be.

For example, I ran a single trial of 5 epochs at 128px, bs=256, lr=3e-2:
xresnet50 gets 44.4% top 5 accuracy
xresnet50 + simple self attention gets 48.4%

However, xresnet50 can train at a much higher learning rate (3e-1) and get to 73% accuracy in 5 epochs, while xresnet50 + ssa can’t train at that high a learning rate.

Maybe some modification to the self attention layer would enable higher learning rates.

I’m going to go back to Imagenette/Imagewoof, as they do provide opportunities to figure things out without breaking the bank or spending a day setting things up.


(Jeremy Howard (Admin)) #10

Figuring out why that happens would be really interesting! :slight_smile:



Interestingly, xresnet18 seems to do better (accuracy after n epochs) than xresnet50 on Imagewoof for a small number of epochs. Also, smaller batch sizes do better (32 rather than 64 or 128).

At 128 px and bs=64, xresnet18 runs quite fast, which makes it a good fit for low-compute experimentation.



Update on my resnet + self-attention model.

While my model beats resnet when constrained by number of epochs (especially for a small number of epochs), this is much less evident when constrained by run time. Therefore, it’s hard to make a case that this new model is useful at this point.

The “simple self-attention” layer might still be useful as a replacement for the one proposed in the SAGAN paper.

Edit: related new paper just came out:



I scrapped all my previous results because they did not take execution time into account.

The good news is that I have found a way to significantly improve the speed of my model (by changing the order of matrix multiplication).
Below is the description of my first experiment, where I actually improve accuracy for the same total training time.

I could use some feedback on the method and the results:

We compare a baseline resnet model to the same model with an extra self-attention layer (SimpleSelfAttention or ssa).

Same run time ~50 epochs test (xresnet18, 128px, Imagewoof dataset)

1) We first run the original xresnet18 model for 50 epochs with a range of learning rates and pick the best one:

| Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) |
|---|---|---|---|---|---|---|
| xresnet18 | Imagewoof | 128 | 50 | 1e-3 | 10 | 0.821 |
| xresnet18 | Imagewoof | 128 | 50 | 3e-3 | 30 | 0.845 |
| xresnet18 | Imagewoof | 128 | 50 | 5e-3 | 10 | 0.846 |
| xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.850 |
| xresnet18 | Imagewoof | 128 | 50 | 1e-2 | 20 | 0.846 |
| xresnet18 | Imagewoof | 128 | 50 | 12e-3 | 20 | 0.844 |
| xresnet18 | Imagewoof | 128 | 50 | 14e-3 | 20 | 0.847 |

Note: we are not using mixup.

2) We pick a number of epochs for our modified xresnet18+SimpleSelfAttention model that gives the same runtime or less:

| Model | Dataset | Image Size | Epochs | # of runs | Avg Wall Time (min:sec) |
|---|---|---|---|---|---|
| xresnet18 | Imagewoof | 128 | 50 | 4 | 9:37 |
| xresnet18 + ssa | Imagewoof | 128 | 47 | 4 | 9:28 |

This is using a single RTX 2080 Ti GPU. Wall time is measured with the %%time cell magic in Jupyter notebooks.
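The epoch budget in step 2 can be sanity-checked with a little arithmetic. This is only a sketch based on the two wall times in the table; the helper name `to_seconds` is made up for this example, and the per-epoch cost of the ssa model is estimated from its own 47-epoch run.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '9:37' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

baseline_total = to_seconds("9:37")   # xresnet18, 50 epochs
ssa_total = to_seconds("9:28")        # xresnet18 + ssa, 47 epochs

baseline_per_epoch = baseline_total / 50
ssa_per_epoch = ssa_total / 47

# Largest whole number of ssa epochs that fits inside the baseline's wall time.
max_ssa_epochs = int(baseline_total // ssa_per_epoch)

print(f"baseline: {baseline_per_epoch:.2f} s/epoch, ssa: {ssa_per_epoch:.2f} s/epoch")
print(f"ssa epochs that fit in the baseline budget: {max_ssa_epochs}")
```

Under these numbers, a 48th ssa epoch would push the run past the baseline's wall time, which is consistent with choosing 47.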

3) We compare our two models using the learning rate from step 1 and the number of epochs from step 2:

| Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) |
|---|---|---|---|---|---|---|---|
| xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.8498 | 0.00782 |
| xresnet18 + ssa | Imagewoof | 128 | 47 | 8e-3 | 20 | 0.8567 | 0.00937 |

We can compare the results using an independent samples t-test:

  • Difference: 0.007
  • 95% confidence interval: 0.0014 to 0.0124
  • Significance level: P = 0.0157
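For anyone who wants to reproduce the test from the summary statistics alone, here is a rough sketch using only the standard library. It assumes a pooled-variance (equal-variance) two-sample t-test and that the reported Stdev values are sample standard deviations; neither is stated in the post. The p-value comes from numerically integrating the Student t density, so it is approximate.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3      # P(-|t| < T < |t|)
    return 1.0 - central_mass

def pooled_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Equal-variance two-sample t-test from summary statistics."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean_b - mean_a) / se
    return t, df, se, two_sided_p(t, df)

# Values from the step 3 table: xresnet18 vs xresnet18 + ssa, 20 runs each.
t, df, se, p = pooled_t_test(0.8498, 0.00782, 20, 0.8567, 0.00937, 20)
print(f"difference={0.8567 - 0.8498:.4f} se={se:.5f} t={t:.3f} df={df} p={p:.4f}")
```

If those assumptions hold, the p-value should come out close to the P = 0.0157 reported above.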

Adding a SimpleSelfAttention layer seems to provide a statistically significant boost in accuracy after training for ~50 epochs, without additional run time, and while using a learning rate optimized for the original model.


(Miguel) #14

Great work! Amazing how just changing the order of matrix multiplication made such a difference in execution time! I used the xresnets, with the older version of ssa from your GitHub repo, on the Freesound Audio Tagging Kaggle competition and got an improvement of about 0.002 (comparing an ensemble of 2 models with and without ssa, one run for each), but I had to run for significantly fewer epochs due to the Kaggle kernel time limit. I have to try this faster implementation and do a few runs to better check how much it improves :slight_smile:



Wow I’m very happy that you are using my work.
The new version is much less sensitive to spatial dimensions: O(NC^2) vs. O(CN^2 + NC^2), where N = height*width.

Also, I don’t think that trick can be used on the original self-attention layer due to the presence of softmax (which I believe is also O(N^2) for an N*N input).

I did some comparisons in this notebook:


(Nate) #16

Could you explain what you mean by “changing the order of matrix multiplication”? I found these comments in your notebook

    # changed the order of multiplication to avoid O(N^2) complexity
    # (x*xT)*(W*x) instead of (x*(xT*(W*x)))

but I’m not clear on which line of code implements this, or why it makes it go faster.



The comment applies to the following lines of code:

    convx = self.conv(x)                               # (C,C) * (C,N) = (C,N) => O(NC^2)
    xxT = torch.bmm(x, x.permute(0,2,1).contiguous())  # (C,N) * (N,C) = (C,C) => O(NC^2)
    o = torch.bmm(xxT, convx)                          # (C,C) * (C,N) = (C,N) => O(NC^2)

Originally, we were doing the operations in this order (note that conv(x) is analogous to a matrix multiplication W*x in this case, where W has dimensions (C,C)):
x * (x^T * (conv(x)))

  1. conv(x) (dims: (C,C) and (C,N))
  2. x^T * (conv(x)) (dims: (N,C) and (C,N))
  3. x * (x^T * (conv(x))) (dims: (C,N) and (N,N))

This is the naive/“natural” order of implementing those operations.

Check out the complexity of matrix multiplication:

Complexity of those 3 operations:

  1. O(C^2 * N)
  2. O(N^2 * C)
  3. O(C * N^2)

Now, unless we increase channels a lot, the main issue is with the terms that are proportional to N^2. This is because N = H*W. So if you double the image height and width, N quadruples and those N^2 terms grow by a factor of 2^4 = 16.
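To make the scaling concrete, here is a small sketch that counts multiply-adds for the three steps of the naive order. The channel count and image sizes are made-up values for illustration, not the ones used in the actual model.

```python
def naive_step_costs(C, N):
    """Multiply-add counts for the three steps of x * (x^T * conv(x))."""
    return {
        "1. conv(x)": C * C * N,                  # (C,C) * (C,N)
        "2. x^T * conv(x)": N * C * N,            # (N,C) * (C,N) -> (N,N)
        "3. x * (x^T * conv(x))": C * N * N,      # (C,N) * (N,N) -> (C,N)
    }

C = 64                                   # hypothetical channel count
small = naive_step_costs(C, 32 * 32)     # feature map of 32 x 32
large = naive_step_costs(C, 64 * 64)     # height and width doubled

# How much each step grows when height and width are doubled.
growth = {step: large[step] // small[step] for step in small}
for step, factor in growth.items():
    print(f"{step}: x{factor}")
```

Only the first step grows linearly with N (a factor of 4 here); the two steps that touch the (N,N) matrix grow by 16.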

By changing the order of operations to (x*xT)*(W*x), we do:

  1. convx = conv(x)
  2. xxT = x*xT
  3. o = xxT * convx

And, as commented in the code at the top of this post, those 3 operations are O(NC^2), which means that run time is much less sensitive to image size.
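Here is a tiny pure-Python sketch of why the reordering is safe and cheaper: matrix multiplication is associative, so both orders give the same output, but the reordered version never builds the (N,N) matrix. The sizes and the matrix `W` (standing in for the 1x1 conv) are toy values chosen for this example; integers are used so the equality check is exact.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows; return (product, multiply-add count)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert len(a[0]) == inner, "inner dimensions must match"
    out = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
           for i in range(rows)]
    return out, rows * inner * cols

def transpose(m):
    return [list(col) for col in zip(*m)]

C, N = 2, 6                                                      # toy sizes: 2 channels, 6 positions
x = [[(3 * i + j) % 5 - 2 for j in range(N)] for i in range(C)]  # (C,N)
W = [[1, -1], [2, 0]]                                            # (C,C), stand-in for conv weights
xT = transpose(x)                                                # (N,C)

convx, cost_conv = matmul(W, x)                 # (C,N), costs C*C*N

# Naive order: x * (x^T * conv(x)) builds an (N,N) intermediate.
attn_nn, cost_a = matmul(xT, convx)             # (N,N), costs N*C*N
naive, cost_b = matmul(x, attn_nn)              # (C,N), costs C*N*N
naive_cost = cost_conv + cost_a + cost_b

# Reordered: (x * x^T) * conv(x) only builds a (C,C) intermediate.
xxT, cost_c = matmul(x, xT)                     # (C,C), costs C*N*C
reordered, cost_d = matmul(xxT, convx)          # (C,N), costs C*C*N
reordered_cost = cost_conv + cost_c + cost_d

print("same result:", naive == reordered)
print("multiply-adds, naive vs reordered:", naive_cost, "vs", reordered_cost)
```

The gap widens as N grows relative to C, which is the usual situation for image feature maps.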

Let me know if you have any other questions.


(Nate) #18

Great explanation, thank you!

It’s really interesting that just changing the order of operations cuts down on the time so much. Makes me wonder if there are other places where we could save a lot of time by doing matrix multiplications in a clever order.



@jamestjw (github handle) trained an xresnet with SimpleSelfAttention on card suits and calculated the mean weights for each pixel on the N * N attention grid. I thought it was pretty neat. This is what it looks like on an example:



(Dmytro Mishkin) #20

Here is my (architecture) improvement on a great submission.

Specifically, I replaced all the pooling layers, AvgPool (inside the net) and MaxPool (after the stem), with MaxBlurPool2d from the “Making Convolutional Networks Shift-Invariant Again” paper.

The rest of the entry is an exact clone of Ranger-Mish.


acc = [0.76 0.768 0.762 0.746 0.742]
acc_mean = 0.7556
acc_std = 0.009911619

vs original

acc = [0.708 0.74 0.738 0.756 0.734]
acc_mean = 0.7352
acc_std = 0.01552288
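A short sketch for reproducing the summary numbers above from the per-run accuracies, using only the standard library. It also prints the sample standard deviation for comparison, since the reported acc_std values appear to match the population form (the default of NumPy's `np.std`); that reading is an inference from the numbers, not something stated in the post.

```python
from statistics import mean, pstdev, stdev

# Per-run accuracies copied from the two result blocks above.
maxblurpool_acc = [0.76, 0.768, 0.762, 0.746, 0.742]
original_acc = [0.708, 0.74, 0.738, 0.756, 0.734]

for name, acc in [("MaxBlurPool", maxblurpool_acc), ("original", original_acc)]:
    print(f"{name}: mean={mean(acc):.4f} "
          f"population_std={pstdev(acc):.6f} sample_std={stdev(acc):.6f}")
```

With only 5 runs per variant, the sample standard deviation is noticeably larger than the population one, so it is worth stating which one a leaderboard entry uses.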

Link to the repo:

Originally posted in wrong branch