Confusion regarding reproducible results between training runs

Hello all,

I am busy training an image classifier, and I noticed something strange regarding reproducible results between training runs (with the notebook kernel restarted between runs).

fastai version: fastai==1.0.60

So in order to ensure reproducible results, I do all the necessary ‘seeding’:

import random

import numpy as np
import torch

seed = 42

# python RNG
random.seed(seed)

# pytorch RNGs
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# numpy RNG
np.random.seed(seed)

Set up the data:

from fastai.vision import *  # ImageDataBunch, ResizeMethod, imagenet_stats

data = ImageDataBunch.from_csv(TRAIN_PATH, folder='images', csv_labels='train.csv', size=224, bs=64,
                               num_workers=4, resize_method=ResizeMethod.SQUISH).normalize(imagenet_stats)
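
As an aside, if I understand the fastai v1 behaviour correctly, from_csv holds out a random validation set (valid_pct defaults to 0.2) using numpy's RNG, so the np.random.seed call above is what fixes the train/validation split. A quick sanity check to confirm the split really is identical across runs could be something like:

# sanity check (sketch): the same files should land in the validation set on every run;
# in fastai v1, data.valid_ds.x.items holds the validation image paths
print(len(data.valid_ds), data.valid_ds.x.items[:5])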

I then set up my learner and train:

from functools import partial
from fastai.callbacks import SaveModelCallback

kappa = KappaScore()
kappa.weights = 'quadratic'  # quadratic-weighted Cohen's kappa
learn = cnn_learner(data, models.resnet18, metrics=[accuracy, kappa])
# save the model whenever the monitored metric (here accuracy) improves
learn.callback_fns.append(partial(SaveModelCallback, every='improvement', monitor='accuracy'))
learn.fit(5, lr=1e-4, wd=1e-4)

I then get the following results for run #1:

epoch	train_loss	valid_loss	accuracy	kappa_score	time
0	0.583971	0.114860	0.966038	0.932593	00:08
1	0.268060	0.064782	0.983019	0.961805	00:07
2	0.149086	0.067273	0.977359	0.957736	00:07
3	0.083732	0.053412	0.981132	0.964634	00:07
4	0.052447	0.064336	0.977359	0.944663	00:07

and run #2:

epoch	train_loss	valid_loss	accuracy	kappa_score	time
0	0.583971	0.114860	0.966038	0.932593	00:08
1	0.268060	0.064782	0.983019	0.961805	00:07
2	0.149086	0.067273	0.977359	0.957736	00:07
3	0.083732	0.053412	0.981132	0.964634	00:07
4	0.052447	0.064336	0.977359	0.944663	00:07

As one can see, the numbers are exactly the same. Initially I found this strange, because I found it hard to believe that the optimiser (Adam), with all its gradient-descent machinery, would follow the exact same path to the same (local or global) minimum of the loss surface. After giving it more thought, I convinced myself that it is possible: since the seeding fixes the data split, the weight initialisation and every other source of ‘randomness’, both runs see exactly the same loss surface and should take exactly the same optimisation steps.
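
To convince myself that this is plausible, here is a minimal, self-contained sketch in plain PyTorch (the seed_everything helper is just my own wrapper around the seeding block above, and the tiny linear model stands in for the real classifier); with identical seeding the two calls should produce bitwise-identical losses:

import random

import numpy as np
import torch
import torch.nn as nn

def seed_everything(seed=42):
    # same seeding as in the block above (hypothetical helper name)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def train_once(seed=42, steps=5):
    # re-seed, rebuild the data and model, and record the loss at every Adam step
    seed_everything(seed)
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    model = nn.Linear(10, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# identical seeding -> identical data, identical init, identical optimiser path
print(train_once() == train_once())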

I then did another test where I monitored valid_loss instead of accuracy, i.e.

learn.callback_fns.append(partial(SaveModelCallback, every='improvement', monitor='valid_loss'))
learn.fit(5, lr=1e-4, wd=1e-4)

This then produces the results for run #1:

epoch	train_loss	valid_loss	accuracy	kappa_score	time
0	0.582052	0.114496	0.964151	0.926979	00:08
1	0.266717	0.064743	0.981132	0.960227	00:07
2	0.149264	0.062584	0.977359	0.949135	00:07
3	0.082239	0.050661	0.986792	0.973122	00:07
4	0.050845	0.065832	0.977359	0.948850	00:07

and results for run #2:

epoch	train_loss	valid_loss	accuracy	kappa_score	time
0	0.584009	0.112865	0.956604	0.912691	00:08
1	0.266808	0.065932	0.979245	0.950349	00:07
2	0.149355	0.061175	0.984906	0.967462	00:07
3	0.082748	0.050527	0.981132	0.956022	00:07
4	0.052076	0.064512	0.979245	0.954532	00:07

As one can see, this time the results are not exactly the same, but they are very similar (which is what I would have expected). I did one last test with no SaveModelCallback at all, and again got results that were similar to, but not exactly the same as, the run where I monitored accuracy with the SaveModelCallback.
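
One check I am considering, to narrow down where the runs start to diverge, is to save the freshly initialised weights right after creating the learner in one run and compare them with the initial weights of the next run. A plain PyTorch sketch (the helper name state_dicts_equal and the file name init_run1.pth are just my own):

import torch

def state_dicts_equal(sd1, sd2):
    # True only if both state dicts have the same keys and bitwise-identical tensors
    if sd1.keys() != sd2.keys():
        return False
    return all(torch.equal(sd1[k], sd2[k]) for k in sd1)

# run #1, straight after cnn_learner(...):
# torch.save(learn.model.state_dict(), 'init_run1.pth')

# run #2, before calling fit:
# print(state_dicts_equal(torch.load('init_run1.pth'), learn.model.state_dict()))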

So at this point I am rather confused about why this is happening: why are the runs exactly identical when I monitor accuracy, but only approximately equal when I monitor valid_loss or leave the SaveModelCallback out entirely? If anyone could provide me with some insight into this, it would be much appreciated.

Thank you!