It feels like I shouldn’t be opening a new thread for this but here we go:
In both train_imagenette and train_imagenet, we have the following line:
bs_rat = tot_bs/256.
Is 256 a hard-coded value that should actually be our chosen batch size per GPU?
That would explain why I got better results when replicating the Imagenette leaderboard numbers to use as my baseline (I hadn’t used that line). Jeremy used bs=64, so his lr would have been divided by 4 unintentionally.
You can definitely make it a parameter if you want. It’s just that the best learning rate was computed for bs=256 in our case, so we adapt it with this rule of thumb: scale the learning rate in proportion to the total batch size (the linear scaling rule).
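Here’s a minimal sketch of what that rule of thumb does, assuming a base lr tuned at bs=256; the names `base_lr` and `scaled_lr` and the lr value are illustrative, not from the actual scripts:

```python
# Linear scaling rule of thumb: the lr was tuned at a reference batch
# size (256 here), so we scale it in proportion to the total batch size
# actually used for training.
REFERENCE_BS = 256
base_lr = 1e-2  # hypothetical best lr found at bs=256

def scaled_lr(tot_bs, base_lr=base_lr, reference_bs=REFERENCE_BS):
    """Scale the base learning rate linearly with total batch size."""
    bs_rat = tot_bs / reference_bs
    return base_lr * bs_rat

print(scaled_lr(64))   # 0.0025 -- with bs=64 the lr is divided by 4
print(scaled_lr(256))  # 0.01   -- unchanged at the reference batch size
```

So if you train at a batch size other than 256, the scaling is intentional; it only surprises you (as above) when the base lr was actually tuned at your own batch size.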