ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Hi all,
One issue that keeps popping up is what is the proper procedure to “verify” if you have achieved a new high score for the leaderboards.
I’m hoping we can determine a reasonable ‘default’ process so users can readily test on their own if they have achieved a new score to submit b/c right now there are many questions surrounding it.
Issues involved:
1 - Is there a number of times we should run (e.g. 5 times, 10 times, or ?) to prove a new high score…
2 - Is the new score the average of all X runs above (or the top X of Y runs)…
3 - The leaderboards currently show that many of the scores were run on 4 GPUs. Since virtually no one has a 4-GPU setup, is that even an issue of note or is it merely a trivial detail? (I would not expect GPU count to matter for the final results, in the same way that 4 quarters = 1 dollar, but Seb has noted that he saw better results on 1 GPU…). Thus, is there only one leaderboard entry, or are there different categories based on GPU count for each image resolution?
4 - The learning rate is also noted on the leaderboards - but is that a requirement to run under that learning rate or is it there to allow you to replicate the current high score? (I can’t imagine LR matching is enforced as the whole point is if you can implement better training that allows faster learning, that’s a win) but some comments seemed to imply that they should be run under the same learning rate.
example - I beat the current high scores on 5 epoch and 20 epoch (multiple times) by swapping in RAdam instead of Adam, but I used a higher learning rate… b/c RAdam stabilizes things to let you use higher LR which is kind of the whole point of having a better optimizer :slight_smile:

Anyway, lots of questions and I think if there’s a more standardized procedure to allow people to see if they have achieved a new high score, then it would be very helpful for people testing out new activation functions, optimizers, etc. in the FastAI framework.

Hopefully I won’t bore everyone with a wall of text, but I’ll make a list of issues I’ve identified, and then I’ll try to answer the OP’s questions.

A) It appears that some of the scores/accuracies for the baseline have been understated.
When rerunning baselines a few months ago I obtained better results:

E.g. (leaderboard score vs. my rerun):
Imagewoof/128/5 epochs: 55.2% vs 62.3% (12 runs)
Imagenette/128/5 epochs: 84.6% vs 85.5% (10 runs)

This is why I suggest rerunning baselines when running comparative experiments.
I believe the difference comes from Jeremy running those on multiple GPUs while I was running things on 1 GPU. The learning rate gets adjusted when using 4 GPUs, but the adjusted value might not be the optimal one?

B) Variance

There is a lot of variance from run to run, especially when running only 5 epochs. Therefore we need to run things multiple times, both for baseline and test models.

C) Training time
There should be a way to take training time into account.
I have the top accuracy for Imagewoof/256px/5 epochs, and I could have beaten more entries, but didn't submit them.
The reason is that my model took 30-50% more time to run, so it's a bit unfair to say it was better.
So I submit that epoch count is not a perfect constraint. However, runtime is not perfect either, because it would reward whoever can throw more and better GPUs at the problem to get on the leaderboard. That's an important thing to explore, but I think we'd want to prioritize creating better models/training methods.


To answer the issues in the OP:

1 - Is there a number of times we should run (e.g. 5 times, 10 times, or ?) to prove a new high score…

I believe the sample size depends on how much better your model is. It’s a lot easier to prove a 1% increase than a 0.1% increase.
I have a suggestion (and maybe people with more stats knowledge can tell us if it is a bad idea):
Have baselines in the leaderboard include mean, standard deviation, and sample size. Then people with new ideas can test against the baseline and perform a t-test to compare their mean accuracy against the baseline's. All it seems to take is plugging the values into a calculator such as this one: https://www.medcalc.org/calc/comparison_of_means.php
If you get a better result with p<0.05, you have a better model.
We could provide baseline data with 10-20 runs to start with and see if that is enough as we go.
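For those who'd rather do this in Python than in the web calculator, here is a rough sketch using scipy (the means/stdevs/sample sizes below are placeholders, not real leaderboard numbers):

```python
# Rough sketch: compare your runs against a baseline that is summarized only by
# mean / stdev / sample size. The numbers below are placeholders.
from scipy import stats

baseline_mean, baseline_std, baseline_n = 0.8500, 0.0080, 20   # from the leaderboard
new_mean, new_std, new_n = 0.8580, 0.0080, 10                  # from your own runs

t, p = stats.ttest_ind_from_stats(new_mean, new_std, new_n,
                                  baseline_mean, baseline_std, baseline_n,
                                  equal_var=True)  # standard pooled two-sample t-test
print(f't = {t:.3f}, p = {p:.3f}')
```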

2 - Is the new score the average of all X runs above (or the top X of Y runs)…

I’d say it is the average of all runs, and also includes standard deviation and sample size. Personally, I’ve been using the max accuracy for each run (e.g. if you run 80 epochs, epoch 75 might have the best result).
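In code, the aggregation I have in mind looks roughly like this (the accuracies are made-up placeholders):

```python
# Sketch: summarize a set of runs by the best accuracy reached in each run,
# then report mean, stdev and sample size. Accuracies below are placeholders.
import numpy as np

runs = [                                   # one list of per-epoch accuracies per run
    [0.801, 0.823, 0.847, 0.852, 0.849],
    [0.795, 0.818, 0.841, 0.850, 0.853],
    [0.803, 0.825, 0.844, 0.848, 0.846],
]

best_per_run = [max(r) for r in runs]      # max accuracy of each run
print(f'mean={np.mean(best_per_run):.4f}  '
      f'std={np.std(best_per_run, ddof=1):.4f}  n={len(best_per_run)}')
```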

3 - The leaderboards currently show that many of the scores were run on 4 GPUs. Since virtually no one has a 4-GPU setup, is that even an issue of note or is it merely a trivial detail?

If the baseline does better on 1 GPU, we should make sure that this is the result we use, so as not to be unfair.
I believe many people have access to 4 GPUs (e.g. on Salamander or vast.ai), and it might be a pain to run 100+ epochs on 1 GPU.
Maybe we should have a different line on the leaderboard depending on # of GPUs.
Whatever we decide to do, we just have to remember that our objective is to identify improvements in models and in how we train them, not just to beat the leaderboard.

4 - The learning rate is also noted on the leaderboards - but is that a requirement to run under that learning rate or is it there to allow you to replicate the current high score?

I think all the parameters should be there to replicate the current high score, not as requirements for future tests. However, we should make sure that those parameters are the best ones for the current baseline models. Otherwise, your improved model/method might not actually be better, and your new high score might be an artefact of having picked a better lr.
In your example, you picked a higher lr for RAdam and beat Adam. But did you test that higher lr with Adam?
I think we will always find flaws in the leaderboard because new situations will come up. What we have to make sure of is that we design our experiments in a way that properly demonstrates that our new ideas are better. If we act that way, then we can improve how the leaderboard works as we go.

Suggestions:

We could rerun some of the baselines on 1 GPU, making sure the learning rate is decent, over 10-20 runs (the fewer the epochs, the bigger the sample size).
Possibly, when someone comes up with a better result, we might want to keep the original baseline results around as well, in case the new model is very new/unfamiliar and people still want to compare against a generic xresnet50.

I agree with most of the comments from @Seb. Thanks to you both for your interest and contributions. If you can get better results on 1 GPU, then that result should replace my version with more GPUs! If no-one can find a way to get similarly good results on 4 GPUs, then that’s a very interesting insight in itself.

How about we suggest the median of 5 runs? And suggest that people make sure that their result is a clear improvement (i.e. if your 5 runs differ by 0.3% and you're better than the leaderboard by 0.1%, that's not really an interesting result). I don't want to make it too complicated or costly for people to contribute!

We could very well fix the sample size, assume the standard deviation stays the same as in the baseline, and determine a minimum % increase, if that's more accessible to everyone.

Let’s see with one example from my saved results what that would look like. This is our baseline:

| Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.8498 | 0.00782 |

Using this calc again: https://www.medcalc.org/calc/comparison_of_means.php

With a mean 0.3% improvement and a sample size of 5 (assuming stdev stays constant) we get p = 0.45. Not that great!
You’d need an improvement of 0.8% to get around p=0.05, and maybe 1.1% if my baseline sample size was 5 and not 20.

A sample size of 20 with a diff of 0.5% would work in this specific setting.
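For reference, those numbers can be reproduced with scipy instead of the web calculator; a sketch, under the same assumption as above that the new model's stdev equals the baseline's:

```python
# Sketch: p-values for various mean improvements over the xresnet18 baseline above
# (mean 0.8498, stdev 0.00782, n=20), assuming the new model has the same stdev.
# Prints p ~ 0.45 for a 0.3% improvement with n=5, p ~ 0.05 for 0.8% with n=5,
# and p ~ 0.05 for 0.5% with n=20, matching the numbers quoted above.
from scipy import stats

base_mean, base_std, base_n = 0.8498, 0.00782, 20

for new_n in (5, 20):
    for diff in (0.003, 0.005, 0.008):
        t, p = stats.ttest_ind_from_stats(base_mean + diff, base_std, new_n,
                                          base_mean, base_std, base_n)
        print(f'n={new_n:2d}  diff={diff:.3f}  p={p:.2f}')
```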

The variance is just that hard to work with. 4 options:

  1. Not care too much about statistical significance.
  2. Have people run things 20 times, with a 0.5% cutoff (those values should be reassessed), or 5 times with a 1.1% cutoff.
  3. Fix the variance problem (new dataset?)
  4. Switch from accuracy to validation loss (less variance)

I don't like option 1 because it creates misleading results. Option 2 would deter people from participating and would miss small real improvements.
Is 3 possible? IMO this variance in results makes working with Imagewoof/Imagenette a bit difficult.
Option 4 has helped me in the past get a good feel for whether a model was better without running it 50 times. It also shows directly whether an optimizer is doing better. However, accuracy is what we care about, and people might find loss functions that do better in terms of accuracy.

Sometimes I wonder if arxiv is full of results that haven’t been tested for statistical significance, because who would run a model on Imagenet more than once? But I reassure myself by assuming that variance must be much smaller when using the Imagenet test set.

I think that sounds really good - it needs to be simple yet robust.
So basic guidelines:
* 5 runs total, taking the best result from each run (e.g. 20 epochs x5) and averaging those for the final score.
* I also agree that we should be looking for 1%+ type improvements… a 0.3% jump is not interesting.
* GPU count is not a factor other than reporting how it was run - the best score is the best score whether on 1 GPU, 4, 8, etc.

Agree.

Completely disagree :slight_smile: It's not an entrant's job to go back and check whether the current leader really optimized their architecture/lr. I have to assume, and will always assume, that the current leaderboard holder, whether Jeremy, you, or whomever, used proper LR selection.
If the leader didn't, then someone simply picking a better LR through whatever means is by rights the new leaderboard holder, b/c they showed a better result.
Example - if the current record gets beaten by someone showing that some crazy high rate like 1e-1 works great and trounces the old 3e-3… well, congrats, you made the discovery about what lr works better and you get the new high score. It's good info for everyone to see that crazy high rates work great on this architecture.
That's an example, but hopefully it makes my point that it's not at all the entrant's job to go back and re-prove that the current holder optimized their architecture/lr.
Now, that said, personally I'm after testing new things like better activation functions and new optimizers to compare vs the leaderboard, so I'm not likely to be putzing around with lr and claiming a new score, but if a better lr works better, then it's a new entry regardless imo.

Yes for sure and thanks for all your feedback above. I think the leaderboards are a great way to help test new things and see if it’s really making progress or not.
I have tested so many things from papers and very few (two actually) have shown better scores. I think a lot of papers don’t really hold up on unseen datasets so ImageNette and the leaderboard serve as a great proving ground for testing out new ideas.

Yeah, I agree that ideally entrants wouldn't have to retest the baseline. However, at this point it seems necessary if we want to draw any conclusions about a new idea.

I'll probably be satisfied once we rework the baselines, and then we can assume that new entrants have picked the best parameters for their own entry.

Edit to add: also, although it might make the tables too big, I’d like to see more than just the best entry in the leaderboard. Maybe there’s a good way to do that.

Right - if someone gets a clearly better result, I want to show that result on the leaderboard, along with the details of how they got it! :slight_smile:

I finally re-figured out what the issue with the current baseline is. I will detail it here, and I think you will see why I've been telling people to be cautious when comparing new ideas to the leaderboard.

The baseline runs the code in train_imagenette.py [1].

Check this part of the code:

    bs_rat = bs/256
    if gpu is not None: bs_rat *= num_distrib()
    if not gpu: print(f'lr: {lr}; eff_lr: {lr*bs_rat}; size: {size}; alpha: {alpha}; mom: {mom}; eps: {eps}')
    lr *= bs_rat

When I run train_imagenette on 1 GPU, with bs = 64, my learning rate gets divided by 4! My understanding is that, with 4 GPUs, the learning rate stays the same but we would want to increase it.
I think this is a relic of having a hardcoded bs of 256 with train_imagenet.py [2] …

Let’s compare some results between using intended lr/4 and intended lr:

| Dataset | Epochs | Size | Accuracy | Params | GPUs |
| --- | --- | --- | --- | --- | --- |
| Imagenette | 5 | 128 | 85.36% [4] | %run train_imagenette.py --epochs 5 --bs 64 --lr 12e-3 --mixup 0 | 1 |
| Imagenette | 5 | 128 | 82.9% [3] | %run train_imagenette.py --epochs 5 --bs 64 --lr 3e-3 --mixup 0 | 1 |

First line has a learning rate of 12e-3 but an effective lr of 3e-3. Second line: lr=3e-3, eff lr = 0.00075.

[1] https://github.com/fastai/fastai/blob/master/examples/train_imagenette.py
[2] https://github.com/fastai/fastai/blob/master/examples/train_imagenet.py
[3] np.mean([83.8,83.8,81.8,82.4,81.8,85,85,80.4,83,82])
[4] np.mean([86,85.4,84.4,85.2,84.8,85,85.6,85.4,85.4,86.4])

Nice work @Seb!
I couldn't imagine how the number of GPUs would affect results, but it makes sense now that the LR was not being adjusted properly.

Regarding # of GPUs, IIRC, if we add more GPUs, we can increase the BS and thus increase the LR.
It's more of a rule of thumb, so results may vary, and we might not want to add this extra variable if we are testing an idea against the baseline. It depends on what your goal is.

Increasing the # of GPUs effectively also increases the batch size, and the learning rate is meant to (roughly) scale with the batch size. That's why that line of code is there. It would certainly be interesting to hear of examples where it doesn't work well.

I suggest replacing (in train_imagenette.py) the hardcoded 256 divisor with the input BS, so that eff lr = lr for 1 GPU. Then it will take some work to redo the baseline with the intended lr (and ideally a sample size > 1)…
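Concretely, the change I have in mind would look something like this sketch (untested):

```python
# Sketch of the suggested change in train_imagenette.py: scale the lr relative to
# the bs that was actually passed in rather than a hardcoded 256, so that
# eff_lr == lr on 1 GPU and only grows with the number of GPUs.
bs_rat = bs/bs                                # i.e. 1.0 (was: bs/256)
if gpu is not None: bs_rat *= num_distrib()   # unchanged: scale with # of GPUs
if not gpu: print(f'lr: {lr}; eff_lr: {lr*bs_rat}; size: {size}; alpha: {alpha}; mom: {mom}; eps: {eps}')
lr *= bs_rat
```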

It’s fairly common to specify LR @ BS of 256 (or some other k) and then scale according to current runtime capabilities. In most cases it makes the results more consistent for those not fully aware of what’s going on. It is helpful to remind in comments/help text that LR is @ BS of 256 if you do that scaling for the user. I do prefer to specify and calculate the effective LR myself though.

There are numerous other reasons why comparing results of 4 GPU vs 1 GPU training can be problematic…

Batch norm is probably the biggest one. Without synchronized batch norm, you’re using BN stats from one of the N GPUs. This isn’t necessarily a bad thing, sometimes it can be a benefit when BS is big, but it is a significant change from the single GPU case. Even if you enable synchronized BN, the synchronized stats end up a little different. I feel the performance hit of sync_bn is not worth it until you’re in the really small batch size realm.

Validation. Typically, if the validation is also being done on N GPUs with a distributed sampler, it will not be quite correct unless your dataset size % N == 0. Extra samples are inserted, and most implementations don't bother to (or can't easily) remove their impact from the resulting reduction at the end. I always re-run validation at the end on 1 GPU, or with DP instead of DDP, for final comparisons.
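A toy example of the padding issue (just an illustration, not from an actual training run):

```python
# Toy sketch: a 10-sample validation set sharded across 4 processes with
# DistributedSampler gets padded to 12 samples, so 2 predictions are counted twice
# in the reduced metric unless the implementation removes them.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.arange(10))
shards = [list(DistributedSampler(ds, num_replicas=4, rank=r, shuffle=False))
          for r in range(4)]
print(sum(len(s) for s in shards), 'indices across ranks vs', len(ds), 'real samples')
```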

There’s actually some evidence that batchnorm with large batch sizes decreases generalization. See https://arxiv.org/pdf/1705.08741.pdf

That paper suggests using a “ghost” batchnorm that reduces the effective batch size by applying batchnorm to “virtual” minibatches.
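In case it helps, here is a minimal PyTorch sketch of the idea (my own simplification; the GhostBatchNorm2d class and virtual_bs argument are hypothetical names, not the paper's implementation):

```python
# Minimal sketch of "ghost" batch norm: apply batch norm statistics over virtual
# mini-batches of size virtual_bs instead of over the full (large) training batch.
import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.BatchNorm2d):
    """Drop-in replacement for nn.BatchNorm2d that normalizes chunk-wise in training."""
    def __init__(self, num_features, virtual_bs=32, **kwargs):
        super().__init__(num_features, **kwargs)
        self.virtual_bs = virtual_bs

    def forward(self, x):
        if self.training and x.size(0) > self.virtual_bs:
            chunks = x.split(self.virtual_bs, dim=0)   # virtual mini-batches
            return torch.cat([super(GhostBatchNorm2d, self).forward(c) for c in chunks], dim=0)
        return super().forward(x)                      # eval (or small batch): plain BN
```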

There’s also some empirical evidence from Myrtle.ai (who recently trained Cifar10 to 94% accuracy in 34 seconds) that large batch batchnorm performed worse compared to ghost batchnorm with an effective batch size of 32.

I believe the correct dataset in this table is Imagenette, since woof=0 by default.

Good catch, I corrected it.

I propose we reconsider using OneCycle. RAdam and Novograd both don't need warmup, as opposed to Adam. We can exploit this property and introduce a different learning rate policy.

I've used a flat LR followed by cosine annealing.
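Roughly, that means keeping the LR constant for most of training and then annealing it with a cosine curve. A sketch in plain PyTorch (the helper name flat_then_cosine and the 0.7 flat fraction are just illustrative choices, not the exact values I used):

```python
# Sketch of a "flat then cosine annealing" LR schedule using a plain PyTorch LambdaLR.
import math
import torch

def flat_then_cosine(optimizer, total_steps, flat_frac=0.7):
    flat_steps = int(total_steps * flat_frac)
    def lr_lambda(step):
        if step < flat_steps:
            return 1.0                                    # flat part: keep the base LR
        progress = (step - flat_steps) / max(1, total_steps - flat_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))   # cosine anneal towards 0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage: sched = flat_then_cosine(opt, total_steps=epochs * batches_per_epoch)
#        then call sched.step() after every batch
```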

I wrote a simple script to run 5-epoch training 20 times and calculate the mean and std. So far it looks good.

updated.

https://github.com/mgrankin/over9000

Imagenette 128 scored 0.8746 over 20 runs with Over9000. That is 1.69% higher than the leaderboard. Imagewoof improved by +2.89%.

I think the reason why RAdam/Ranger scored worse than Adam is the LR schedule. This new LR schedule (flat then annealing) is the first thing that came to mind, so it may not be great. There should be an LR schedule for RAdam that works as well as OneCycle does for Adam.
