Why do these identical(?) models give different results?

I’ve been trying to modify a pretrained ResNet and ran into NaNs during training. To debug, I reduced my modifications to the bare minimum. I think these two models should be identical, yet they give very different losses on the same training data. Can anyone spot my mistake?

Model 1…

arch = models.resnet50
learn = create_cnn(data, arch)
learn.fit_one_cycle(1, max_lr=1e-2)
#1 	0.317061 	0.308280 (training and validation losses)

Model 2…

class StatsToFC(nn.Module): #Just call the existing RNmodel
    def __init__(self):
        super().__init__()  # nn.Module subclasses must call this

    def forward(self, xb):
        global RNmodel
        bOut = RNmodel(xb)
        return bOut

arch = models.resnet50
learn = create_cnn(data, arch)
RNmodel = learn.model
learn.model = nn.Sequential(StatsToFC()).cuda()
learn.fit_one_cycle(1, max_lr=1e-2)
#1 	0.391187 	0.392332 (training and validation losses)


  • I set the same random seeds for both runs, and the discrepancy is consistent.
  • Maybe it has something to do with different initialization, or a value-vs.-reference issue?

Thanks so much for helping. I am stuck!

Training a model (especially a short run) is quite stochastic, so if you want to compare two trainings, you need:

  • the same initial state (so transfer the weights from the first model to the second)
  • the same random seed set before launching each training, with torch.manual_seed(...) and maybe torch.cuda.manual_seed_all(...)
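For reference, here is a minimal sketch of seeding everything before each run (plain PyTorch; the function name is my own, and note that dataloader workers and CUDA nondeterminism can still introduce variation):

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG that typically affects a training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs (no-op without CUDA)


seed_everything(42)
a = torch.randn(3)
seed_everything(42)
b = torch.randn(3)
assert torch.equal(a, b)  # identical draws after reseeding
```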

Yes, I did the latter: I set every seed known to humankind and used num_workers=0. Each model is run separately after restarting the kernel and gives consistent losses.

The original resnet50 pretrained weights should still be in RNmodel (Model 2). Maybe something I am doing confuses automatic differentiation? Or perhaps it’s how the loss function is applied at the end?

Lots of theories. If no one can just see the problem, I will need to learn how to trace inside a model evaluation.
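One way to trace inside a model evaluation is to register forward hooks that record each submodule’s output. A minimal sketch with a toy stand-in network (the names here are illustrative, not the actual resnet):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network, just to demonstrate the hook machinery.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Skip the root module (empty name); hook every submodule.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if n]

x = torch.randn(1, 4)
out = model(x)

for h in handles:
    h.remove()  # always remove hooks when done

# Every intermediate output was captured; the last one equals the model output.
assert torch.equal(activations["2"], out)
```

Running the same batch through both models with hooks like these would show at which layer the two outputs first diverge.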

Well, I never figured out the issue. Instead, I found a way to tack two layers onto the start and end of the resnet that gives results consistent with the original resnet and accomplishes the task.
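That workaround can be sketched like this, with a toy module standing in for the pretrained resnet body (the layer shapes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

body = nn.Linear(8, 8)  # stand-in for the pretrained resnet body

model = nn.Sequential(
    nn.Linear(8, 8),  # extra layer tacked onto the start
    body,             # untouched pretrained network
    nn.Linear(8, 2),  # extra layer tacked onto the end
)

x = torch.randn(1, 8)
y = model(x)
assert y.shape == (1, 2)  # output comes from the appended final layer
```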

I was able to trace that RNmodel(xb) (Model 2) yields a different output than the resnet in its original form (Model 1) on the same inputs. Why remains a mystery.
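As a sanity check on the wrapping itself: a bare nn.Module wrapper that only forwards to an inner model should reproduce that model’s output exactly when both are in the same train/eval mode, so a mismatch in practice points at mode, state, or what the surrounding framework sees, not at the forward pass. A minimal sketch with a stand-in model (note that storing the inner model as an attribute, unlike a global, also keeps its parameters visible to an optimizer):

```python
import torch
import torch.nn as nn

inner = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))  # stand-in model

class Wrapper(nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m  # registered submodule: parameters stay discoverable

    def forward(self, xb):
        return self.m(xb)

wrapped = Wrapper(inner)
wrapped.eval()  # eval() propagates to submodules and freezes BatchNorm stats

x = torch.randn(2, 4)
assert torch.equal(inner(x), wrapped(x))    # identical forward passes
assert len(list(wrapped.parameters())) > 0  # visible to an optimizer
```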