I am training the language model on a p2 instance, and training is taking more than an hour per epoch. Is that normal?
I checked that my GPU utilization is at 97 to 100%, so the GPU is being used. But if one epoch takes this long, how do I run 10 epochs? Leave it running overnight? Is there a more practical way to do this? How do I experiment at all when each run takes so long? Using a p2.8xlarge won't help either: it has 8 GPUs rather than 1 GPU with more RAM, so I can't increase the batch size even on that machine.
It takes about 30 minutes per epoch on a P100 or GTX 1080. The p2 has a single K80, so I think 1.5 hours makes sense.
In practice you have to learn to experiment on smaller datasets. Figure out how the things you're experimenting with vary with dataset size, so you can get useful insights from smaller datasets. In the end you'll need to get good at writing scripts to run things overnight too!
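One way to do this (a minimal sketch, not from the course code; `sample_subset` and the toy corpus are made up for illustration) is to prototype on a small random subset of the training data and only move to the full set once the pipeline works:

```python
import random

def sample_subset(examples, fraction=0.1, seed=42):
    """Return a reproducible random subset for quick experiments."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

# Prototype on ~10% of the data; an epoch that takes 1.5 hours on the
# full set should finish in very roughly a tenth of that time here.
corpus = [f"sentence {i}" for i in range(50_000)]  # stand-in for real data
small = sample_subset(corpus, fraction=0.1)
print(len(small))  # 5000
```

For the full overnight runs, `nohup python train.py > train.log 2>&1 &` (or a tmux/screen session) keeps the job alive after you disconnect from the instance.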
Yes, it does take about 1.5 hours on a p2. Try a Paperspace P6000 instead: it costs about the same as a p2, is roughly 4 times faster, and has 24 GB of GPU RAM.
I was a bit surprised too, because Jeremy's course notebooks show epochs taking about 12 minutes, yet my run on an Amazon p2 instance is also taking ~1.5 hours per epoch.
I wonder what GPU Jeremy used for his run.