Reproducible testing and validation


The dataset I’m using was split into training and test sets (as usual), and I was experimenting with different architectures to select the best one for my problem: in this case, all of the ResNets and DenseNets available in fastai.
The thing is, I have now tried to reproduce my test results three times, and they differ on every run. For instance, ResNet-18 yielded a Matthews correlation coefficient (MCC) of 0.694 on the first run, but on the second run the MCC dropped to 0.662, which seems like quite a large difference.

So, my question is: what kind of things could cause this behaviour? I’m wondering whether it could be a high learning rate during fine-tuning, since I can see from the plot that the max value sits in a region where the loss is starting to go up.
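One common source of run-to-run variation (independent of the learning rate) is unseeded randomness: weight initialisation of the new head, data shuffling, and augmentation all draw from RNGs. A minimal sketch of seeding everything before each run, assuming the PyTorch backend that fastai uses (the `set_seed` helper name here is my own, not fastai's API):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG a fastai/PyTorch training run touches."""
    random.seed(seed)            # Python's built-in RNG (some augmentations)
    np.random.seed(seed)         # NumPy RNG (data processing)
    torch.manual_seed(seed)      # PyTorch CPU RNG (weight init, shuffling)
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs, no-op without CUDA
    # cuDNN selects fast but potentially non-deterministic kernels by default
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# With the same seed, random draws (and hence init/shuffling) repeat exactly
set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))
```

If results still differ after seeding, the remaining gap is more likely genuine training instability, which is where a too-high learning rate would show up.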