Strange behavior of resnet101 when going big in image size

I was experimenting with a kaggle dataset (whale-categorization-playground).
The dataset has a large number of classes (4251) representing whale tail-fin images.
An exercise using fastai v1.0.21 and resnet101 at image size 128 gave me a model that trained well and reached an accuracy of approximately 0.58 after 40 epochs.

| epoch | train_loss | valid_loss | accuracy |
|------:|-----------:|-----------:|---------:|
|    37 |   1.718895 |   3.353173 | 0.581849 |
|    38 |   1.629597 |   3.378948 | 0.580153 |
|    39 |   1.598755 |   3.379822 | 0.585454 |
|    40 |   1.562429 |   3.398566 | 0.585030 |
|    41 |   1.555568 |   3.449480 | 0.575064 |
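For reference, the size-128 run was set up roughly like the sketch below. This is a reconstruction rather than the exact notebook: the dataset path, CSV name, transforms, and batch size of 64 are assumptions.

```python
from fastai.vision import *

path = Path('data/whale')  # assumed dataset location

# Labels come from the Kaggle train.csv (Image -> Id); bs=64 is a guess
data = ImageDataBunch.from_csv(path, folder='train', csv_labels='train.csv',
                               ds_tfms=get_transforms(), size=128, bs=64)
data.normalize(imagenet_stats)

learn = create_cnn(data, models.resnet101, metrics=accuracy)
learn.fit_one_cycle(40)      # ~40 epochs at size 128
learn.save('rn101-128')      # checkpoint reused for the larger-image run
```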

When I increased the image size to 288 (and reduced the batch size for the health of my 4 GB GPU), loaded the model from the previous run, and trained for 40 epochs overnight, the model refused to learn and accuracy languished at around 0.038.

Total time: 5:10:14

| epoch | train_loss | valid_loss | accuracy |
|------:|-----------:|-----------:|---------:|
|    37 |   7.481616 |  20.684521 | 0.036472 |
|    38 |   7.454990 |  15.015072 | 0.039016 |
|    39 |   7.587584 |  20.919699 | 0.038380 |
|    40 |   7.223733 |  18.816559 | 0.038592 |
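The only intended changes for the 288 run were the image size, a smaller batch size, and loading the earlier weights; roughly like this sketch (again reconstructed, with bs=8 as an assumed value):

```python
from fastai.vision import *

path = Path('data/whale')  # same assumed dataset location as above

# Rebuild the DataBunch at size 288; bs=8 is an assumed value for a 4 GB GPU
data_288 = ImageDataBunch.from_csv(path, folder='train', csv_labels='train.csv',
                                   ds_tfms=get_transforms(), size=288, bs=8)
data_288.normalize(imagenet_stats)

learn = create_cnn(data_288, models.resnet101, metrics=accuracy)
learn.load('rn101-128')    # checkpoint saved from the size-128 run
learn.fit_one_cycle(40)
```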

Any ideas why this could be so??

How large is your batch size? ResNet contains BatchNorm layers, and if your batch size gets too small you can run into issues because you have too few images in the batch to calculate a meaningful variance. How small is too small? There is, AFAIK, no hard rule, but from my experience I would certainly start to be careful with batch sizes smaller than 8.
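If the small batch really is the culprit, one workaround (besides simply raising the batch size) is to keep the pretrained BatchNorm running statistics instead of letting tiny batches re-estimate them. Below is a minimal PyTorch sketch of the idea, independent of fastai; note that `model.train()` flips BatchNorm layers back to train mode, so this would have to be re-applied inside the training loop (fastai v1 has a `BnFreeze` callback along these lines, if I remember correctly).

```python
import torch.nn as nn
from torchvision.models import resnet101

def set_bn_eval(model: nn.Module) -> None:
    """Put every BatchNorm layer into eval mode so it uses its stored running
    mean/variance instead of per-batch estimates, which are very noisy when
    the batch holds only a handful of images."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()

model = resnet101(pretrained=True)
model.train()        # normal training mode for conv/linear layers
set_bn_eval(model)   # ...but keep BatchNorm statistics frozen
```

With the statistics frozen, the layers still apply their learned scale and shift, so training can continue even with very small batches.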


@bbrandt You are right, sir! My dataset contains 7354 images in train and 4716 images in valid. I get good results with resnet34 at image sizes 128 and even 256. At image size 512 I had to reduce the batch size to 16 (for the health of my 4 GB GPU), and there the accuracy started going downhill. So the problem is not limited to resnet101. (I wish Jeremy would mention this every time he shows improvement by going big in his classes.)