Seems like I am following you around on these forums. But I like your style, thoroughness, and a good mystery.
The main difference I see is in the initialization of weights, though I don't know whether that can account for such a large discrepancy. Also, the DataLoader is going to feed training samples in a different order on each run, and CUDA is slightly non-deterministic by default.
I would first try copying the weights from the original learner's model into the new model. Then run a few individual training samples through both models and see whether the outputs are the same or very close.
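Something like this rough sketch, assuming fastai v2 and two Learners I'll call learn_orig and learn_new (placeholder names, adjust to yours):

```python
import torch

# Copy the trained weights into the new model, then compare outputs on one
# batch with both models in eval mode (so dropout/batchnorm don't interfere).
learn_new.model.load_state_dict(learn_orig.model.state_dict())

learn_orig.model.eval()
learn_new.model.eval()

xb, yb = learn_orig.dls.one_batch()   # a single batch from the DataLoaders
with torch.no_grad():
    out_orig = learn_orig.model(xb)
    out_new = learn_new.model(xb)

# If the architectures really do match, this should be ~0 (float rounding aside).
print((out_orig - out_new).abs().max())
```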
If they are, then try setting the random seeds, along with num_workers=1, right before training so you get the same training sequence. And the fastai docs show code for making CUDA deterministic, somewhere.
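Here is a minimal seeding sketch in plain PyTorch (no fastai-specific helpers), to call right before you build the DataLoaders and fit, so both runs see the same shuffling order and CUDA behaviour:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)                      # Python RNG
    np.random.seed(seed)                   # NumPy RNG
    torch.manual_seed(seed)                # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)       # PyTorch GPU RNGs
    # Ask cuDNN for deterministic kernels; this can make training slower.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
# Then rebuild the DataLoaders with num_workers=1 so the worker processes
# don't reorder samples differently between runs.
```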
These steps should at least give you more clues. Good luck!
P.S. Somewhere fastai initializes the optimizer; one hopes it does so the same way for each run you are doing.
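If you want to double-check that, here's one way to peek at it, assuming fastai v2 (my reading of the API is that Learner.create_opt() builds learn.opt and opt.hypers holds the per-group hyperparameters):

```python
# Build the optimizers without training, then compare how they were set up.
learn_orig.create_opt()
learn_new.create_opt()

print(learn_orig.opt_func, learn_new.opt_func)   # same optimizer factory?
print(learn_orig.opt.hypers[-1])                 # lr, mom, wd, ... of the last param group
print(learn_new.opt.hypers[-1])
```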