Hello,
I’m not sure where to post this. I’m trying to train a large model on a lot of data. I have two NVIDIA 2080 Ti cards connected via NVLink.
I run the training script the usual way: python -m torch.distributed.launch --nproc_per_node=2 ./train.py
The first couple of epochs take about 40 minutes each. Then, without my changing anything, the next epoch takes 1.5 hours, and the one after that 2 hours. Why could this be? If I stop the run and immediately start it again, it goes back to 40 minutes per epoch.
I’m monitoring with nvidia-smi, and the temperature is fine: ~40C for both cards (I have a fan pointed directly at them). All the data is on an NVMe drive. What else should I be monitoring? I have no idea what could be happening.
One strange thing: nvidia-smi shows volatile GPU utilization at ~100% for GPU 1, but only ~2% for GPU 2.
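To have something concrete to compare between the fast and slow epochs, per-epoch logging along these lines might help. This is only a rough sketch, assuming a Linux host: snapshot, timed_epochs, and run_epoch are hypothetical names made up for illustration, not part of PyTorch or fastai, and run_epoch stands in for whatever trains one epoch in train.py. The torch import is optional so the sketch also runs on a machine without a GPU.

```python
import resource
import time

try:
    import torch  # optional: only used for the per-GPU memory figures
except ImportError:
    torch = None


def snapshot():
    """Return host and (if available) GPU memory figures for this process."""
    stats = {
        # Peak resident set size of this process (kilobytes on Linux).
        "peak_rss_kb": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
    if torch is not None and torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            # Memory currently held by tensors on device i, in MiB.
            stats[f"gpu{i}_allocated_mb"] = torch.cuda.memory_allocated(i) / 2**20
    return stats


def timed_epochs(run_epoch, n_epochs):
    """Call run_epoch(epoch) n_epochs times, logging duration and memory."""
    history = []
    for epoch in range(n_epochs):
        start = time.perf_counter()
        run_epoch(epoch)  # hypothetical: trains a single epoch
        row = {"epoch": epoch, "seconds": time.perf_counter() - start}
        row.update(snapshot())
        history.append(row)
        print(row)
    return history


if __name__ == "__main__":
    # Dummy epoch so the sketch runs on its own.
    timed_epochs(lambda epoch: time.sleep(0.01), n_epochs=3)
```

With torch.distributed.launch each of the two processes would print its own rows, so the figures can be compared per GPU. If peak RSS keeps climbing while the epochs get slower, host memory pressure would be one thing to look at, though that is a guess rather than a diagnosis.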
I’m using PyTorch 1.3 with fastai 1.0.58.