Slow training process for ResNet in lesson 1

Hello guys, the platform I am using is an AWS p2.xlarge instance with the Ubuntu 16.04 Deep Learning AMI. Everything before the training step runs fine, but when I run the line:

learn.fit_one_cycle(4)

It took 15 minutes just to finish the first epoch (so about an hour for the whole training run).
Is this normal for the computational power of a p2.xlarge? Do I need to upgrade?
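
For what it's worth, a quick sanity check (just a sketch using plain PyTorch calls, nothing fastai-specific) is to confirm the GPU is actually visible before assuming the p2.xlarge is simply slow; if it isn't, training silently falls back to the CPU, which would explain epochs this long:

import torch

# If this prints False, training is running on the CPU, which explains
# 15 min/epoch far better than the K80 itself would.
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # On a p2.xlarge this should report a Tesla K80.
    print(torch.cuda.get_device_name(0))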

Hi,

Did you ever find a reason for this? I’m running into a similar issue: training is taking around 10 minutes per cycle (I’m on GCP).

Cheers
Nick

Follow-up: it was definitely a GPU issue. In my case (running on Google Compute) the NVIDIA drivers weren’t functioning correctly. Creating a new VM from scratch fixed it, along with several other problems, for reasons that are still unclear to me (the install was identical to the previous one).

Hi,
Same issue here on AWS p2.xlarge, taking 20 minutes per epoch…
Were you able to resolve it?

Regards,

UPDATE:
Following the instructions here to downgrade CUDA from 10.0 to 9.0 did the trick!
SageMaker is very slow
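
In case it helps anyone diagnose the same thing, a rough way to check for a CUDA-version mismatch (just a sketch with standard PyTorch calls, not from the linked instructions) is to compare the CUDA version PyTorch was built against with what the driver actually supports; when they disagree, CUDA shows up as unavailable and fastai quietly trains on the CPU:

import torch

# CUDA version PyTorch was compiled against (e.g. '10.0' or '9.0').
print(torch.version.cuda)

# cuDNN version bundled with the install, if CUDA is usable at all.
print(torch.backends.cudnn.version())

# False here with a CUDA 10.0 build usually means the installed driver
# is too old for it, which is what the downgrade to 9.0 works around.
print(torch.cuda.is_available())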

Hi! Yes (see also my previous follow-up message). I was able to resolve it with a fresh install of the VM. The iterations were slow because the GPU wasn’t being used. Unfortunately I couldn’t find the root cause, but the fresh install fixed it. Worth a shot! :slight_smile:

Hi, thanks. This post sorted it out for me:
SageMaker is very slow