Slow training process for ResNet in lesson 1

Hello guys, the platform I am using is an AWS p2.xlarge instance with the Ubuntu 16.04 Deep Learning AMI. Everything before the training step runs fine, but when I run the line:

learn.fit_one_cycle(4)

It took 15 minutes just to finish the first epoch (so about an hour for the whole training run).
Is this normal for the computational power of a p2.xlarge? Do I need to upgrade?
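
For what it's worth, a quick sanity check (just a sketch using plain PyTorch calls, nothing fastai-specific) is to confirm the GPU is actually visible before assuming the p2.xlarge is simply slow; if it isn't, training silently falls back to the CPU, which would explain epochs this long:

import torch

# If this prints False, training is running on the CPU, which explains
# 15 min/epoch far better than the K80 itself would.
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # On a p2.xlarge this should report a Tesla K80.
    print(torch.cuda.get_device_name(0))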

Hi,

Did you ever find a reason for this? I’m running into a similar issue: training is taking around 10 minutes per cycle (I’m on GCP).

Cheers
Nick

Follow-up: it was definitely a GPU issue. In my case (running on Google Compute) the NVIDIA drivers weren’t functioning correctly. Creating a new VM from scratch fixed it, along with several other problems, for reasons that are still unclear to me (the install was identical to the previous one).

Hi,
Same issue here on AWS p2.xlarge, taking 20 minutes per epoch…
Were you able to resolve it?

Regards,

UPDATE:
Following the instructions here to downgrade CUDA from 10.0 to 9.0 did the trick!
SageMaker is very slow
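
In case it helps anyone diagnose the same thing, a rough way to check for a CUDA-version mismatch (just a sketch with standard PyTorch calls, not from the linked instructions) is to compare the CUDA version PyTorch was built against with what the driver actually supports; when they disagree, CUDA shows up as unavailable and fastai quietly trains on the CPU:

import torch

# CUDA version PyTorch was compiled against (e.g. '10.0' or '9.0').
print(torch.version.cuda)

# cuDNN version bundled with the install, if CUDA is usable at all.
print(torch.backends.cudnn.version())

# False here with a CUDA 10.0 build usually means the installed driver
# is too old for it, which is what the downgrade to 9.0 works around.
print(torch.cuda.is_available())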

Hi! Yes (see also my previous follow-up message). I was able to resolve it with a fresh install of the VM. The iterations were slow because the GPU wasn’t being used. Unfortunately I couldn’t find the root cause, but the fresh install fixed it. Worth a shot! :slight_smile:

Hi, thanks. This post sorted it out for me:
SageMaker is very slow