I am training the language model on a p2 instance. Training is taking more than 1 hour per run. Is that normal?
I checked that my GPU usage is at 97 to 100%, so the GPU is being utilized. If a single run takes this long, how do I get through 10 epochs? Leave it running overnight? Is there a more practical way to do this? How do I experiment at all with such long turnaround times? Using a p2.8xlarge machine won’t help either, since it has 8 GPUs rather than 1 GPU with more RAM, so I can’t increase the batch size even on that machine.
In practice you have to learn to experiment on smaller datasets. Figure out how the things you’re experimenting with differ depending on dataset size, so you can get useful insights from smaller datasets. In the end you’ll also need to get good at writing scripts to run things overnight!
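One simple way to do this is to prototype on a reproducible random fraction of the corpus before committing to full overnight runs. Here's a minimal sketch; `sample_subset` and the stand-in `texts` list are hypothetical, not from the course notebooks:

```python
import random

def sample_subset(items, frac=0.1, seed=42):
    """Return a reproducible random fraction of a dataset.

    Using a fixed seed means repeated experiments see the same
    subset, so changes in results come from your code, not sampling.
    """
    rng = random.Random(seed)
    k = max(1, int(len(items) * frac))
    return rng.sample(items, k)

# Stand-in for the real training corpus
texts = [f"doc_{i}" for i in range(1000)]

# Prototype on 10% of the data; epochs should be roughly 10x faster
small = sample_subset(texts, frac=0.1)
print(len(small))  # 100
```

Once an experiment looks promising on the subset, rerun it on the full dataset overnight (e.g. launch the training script with `nohup` or inside `tmux` so it survives your SSH session ending).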
I was a bit surprised too, because Jeremy’s course notebooks show runs taking 12 min per epoch, but my run on an Amazon p2 instance is taking ~1.5 hours as well.