I finally re-figured out what the issue with the current baseline is. I will detail it here, and I think you will see why I've been telling people to be cautious when comparing new ideas to the leaderboard.
The baseline runs the code in train_imagenette.py [1].
Check this part of the code:
```python
bs_rat = bs/256
if gpu is not None: bs_rat *= num_distrib()
if not gpu: print(f'lr: {lr}; eff_lr: {lr*bs_rat}; size: {size}; alpha: {alpha}; mom: {mom}; eps: {eps}')
lr *= bs_rat
```
When I run train_imagenette on 1 GPU with bs = 64, bs_rat = 64/256 = 0.25, so my learning rate gets divided by 4! My understanding is that with 4 GPUs the learning rate stays the same (0.25 * 4 = 1), but we would actually want to increase it.
I think this is a relic of the hardcoded bs of 256 in train_imagenet.py [2] …
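To make the arithmetic concrete, here is a minimal sketch of what that snippet computes. num_gpus is a stand-in I'm using for num_distrib(), assuming it reports the number of distributed processes:

```python
def effective_lr(lr, bs, num_gpus=None):
    # Sketch of the scaling in the snippet above (not the actual script).
    bs_rat = bs / 256              # ratio of batch size to the hardcoded 256
    if num_gpus is not None:
        bs_rat *= num_gpus         # scale by the number of distributed processes
    return lr * bs_rat

print(effective_lr(3e-3, bs=64, num_gpus=1))   # 0.00075 -> lr divided by 4 on 1 GPU
print(effective_lr(3e-3, bs=64, num_gpus=4))   # 0.003   -> lr unchanged on 4 GPUs
```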
Let's compare some results between using the intended lr and the intended lr divided by 4:
| Dataset | Epochs | Size | Accuracy | Params | GPUs |
|---|---|---|---|---|---|
| Imagenette | 5 | 128 | 85.36% [4] | `%run train_imagenette.py --epochs 5 --bs 64 --lr 12e-3 --mixup 0` | 1 |
| Imagenette | 5 | 128 | 82.9% [3] | `%run train_imagenette.py --epochs 5 --bs 64 --lr 3e-3 --mixup 0` | 1 |
The first row passes lr = 12e-3, which the script scales down to an effective lr of 3e-3 (the intended value). The second row passes lr = 3e-3, which becomes an effective lr of 7.5e-4.
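So until the scaling changes, hitting a given effective lr means passing a compensated value on the command line. A rough sketch of that compensation, under the same assumptions as the sketch above:

```python
def lr_to_pass(intended_lr, bs, num_gpus=1):
    # Invert the script's scaling so that lr * bs_rat comes out at intended_lr.
    bs_rat = (bs / 256) * num_gpus
    return intended_lr / bs_rat

print(lr_to_pass(3e-3, bs=64, num_gpus=1))   # 0.012 -> pass --lr 12e-3 to get eff_lr 3e-3
```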
[1] https://github.com/fastai/fastai/blob/master/examples/train_imagenette.py
[2] https://github.com/fastai/fastai/blob/master/examples/train_imagenet.py
[3] np.mean([83.8,83.8,81.8,82.4,81.8,85,85,80.4,83,82])
[4] np.mean([86,85.4,84.4,85.2,84.8,85,85.6,85.4,85.4,86.4])