Troubleshooting: epochs start taking way longer

Hello,

I’m not sure where to post this. I’m trying to train a large model on a large dataset. I have two NVIDIA 2080 Ti cards connected via NVLink.

I launch the training script the usual way with `python -m torch.distributed.launch --nproc_per_node=2 ./train.py`, and the first couple of epochs take, say, 40 minutes each. Then, without changing anything, the next epoch takes 1.5 hours, then 2 hours. Why could this be? If I stop the run and immediately restart it, epochs go back to taking 40 minutes.

I’m monitoring with nvidia-smi, and the temperatures are fine: ~40°C for both cards (I have a fan pointed directly at them). All the data is on an NVMe SSD. What else should I be monitoring? I have no idea what could be happening.

One strange thing: the Volatile GPU-Util reported by nvidia-smi is ~100% for GPU 1 but only ~2% for GPU 2.

Using PyTorch 1.3 with fastai 1.0.58.

GPU utilization should be roughly equal across both cards, and the epoch time should stay constant (at least that’s what I can say from my experience).

I once hit a strange bug because I forgot one step of the setup outlined here: https://docs.fast.ai/distributed.html
Be sure to adapt your code as described there, including the process-group initialization and the learner setup.
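If it helps, the skeleton from that page looks roughly like this for fastai 1.0.x (the dataset and model below are just placeholders; adapt them to whatever your train.py actually does):

```python
# Rough sketch of the fastai v1 distributed setup (fastai 1.0.x / PyTorch 1.3),
# adapted from https://docs.fast.ai/distributed.html -- dataset and model are placeholders.
import argparse
import torch
from fastai.vision import *
from fastai.distributed import *

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)   # injected by torch.distributed.launch
args = parser.parse_args()

# 1) Bind this process to its own GPU and join the process group.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# 2) Build the data and learner as usual, then wrap the learner for distributed training.
path = untar_data(URLs.CIFAR)   # placeholder dataset
data = ImageDataBunch.from_folder(path, valid='test', bs=128).normalize(cifar_stats)
learn = cnn_learner(data, models.resnet18, metrics=accuracy).to_distributed(args.local_rank)

learn.fit_one_cycle(5)
```

In particular, if the script never uses the `--local_rank` it is given, both worker processes can end up on the same GPU, which would look like one busy card and one nearly idle card.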