SageMaker is very slow

stevenlybeck · February 13, 2019, 1:03am

I had the same problem with my SageMaker notebook instance.

I verified that CUDA was not being used by adding a code cell to my notebook and running:

import torch
torch.cuda.is_available()

For me, the result was False, which means PyTorch cannot see a usable GPU and everything runs on the CPU. It took 40 minutes to run the fit call learn.fit_one_cycle(4).

It turns out that the version of PyTorch installed when fastai is set up on the machine expects CUDA 10.0, but SageMaker has CUDA 9.0 installed and active by default.
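If you want to see the mismatch for yourself, here is a minimal sketch that prints which CUDA version your PyTorch build was compiled for, next to whether a GPU is actually usable. The helper name cuda_report is my own, not something from fastai or the setup script, and it returns early if torch is not installed.

```python
def cuda_report():
    """Collect the CUDA facts relevant to a PyTorch/CUDA version mismatch."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment at all
        return {"torch_installed": False}
    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None for CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        "cuda_available": available,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If built_for_cuda shows a 10.0 build while cuda_available is False, you are probably hitting the same problem I did.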

You can solve this by installing a PyTorch build that targets the 9.0 version of cudatoolkit. I made the changes to the “sagemaker-create” script and submitted a pull request to the fastai course-v3 repository: https://github.com/fastai/course-v3/pull/211

If you already have a SageMaker notebook instance running and would like to fix it so that CUDA/GPU works:

From Jupyter, click the “New” button in the upper-right and select “Terminal”

In the terminal, run:

source activate /home/ec2-user/SageMaker/envs/fastai
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

This should give you a summary of changes it will make. Just type y and press enter to accept the changes. For me, these were the important changes that it made:

The following packages will be UPDATED:
    pytorch:     1.0.1-py3.7_cuda10.0.130_cudnn7.4.2_2 pytorch --> 1.0.1-py3.7_cuda9.0.176_cudnn7.4.2_2 pytorch

The following packages will be DOWNGRADED:
    cudatoolkit: 10.0.130-0                                    --> 9.0-h13b8566_0
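The CUDA version is embedded in the pytorch build string, which is the part to look at in that summary. As a rough illustration, this sketch pulls it out with a regular expression. The function cuda_version_from_build is a hypothetical helper of mine, and it assumes the build string follows the cudaX.Y.Z pattern shown above.

```python
import re


def cuda_version_from_build(build_string):
    """Return the CUDA version embedded in a conda build string, or None."""
    # Matches "cuda" followed by a dotted version; "cudnn" does not match
    match = re.search(r"cuda(\d+(?:\.\d+)*)", build_string)
    return match.group(1) if match else None


before = cuda_version_from_build("py3.7_cuda10.0.130_cudnn7.4.2_2")
after = cuda_version_from_build("py3.7_cuda9.0.176_cudnn7.4.2_2")
print(before, "->", after)  # prints: 10.0.130 -> 9.0.176
```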

After this completes, make sure you shut down and restart any notebooks that are running from the fastai lessons. (You can find these in the “Running” tab.) Or if you have them open in a browser tab, you can restart the kernel by pressing the zero key twice while in command mode: 0, 0.

After I did this, the fit command learn.fit_one_cycle(4) ran in 1m50s on my SageMaker notebook instance, down from 40 minutes.
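Before kicking off a full fit again, you can do a quick sanity check that a tensor operation really lands on the GPU. This is only a sketch: run_on_best_device is a made-up name, the matrix size is arbitrary, and it falls back to the CPU (or returns None if torch is missing) rather than failing.

```python
def run_on_best_device(size=256):
    """Run a small matrix multiply and report which device type it ran on."""
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment
    # Use the GPU when PyTorch can see one, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(size, size, device=device)
    y = x @ x
    return y.device.type


device_type = run_on_best_device()
print("matmul ran on:", device_type)
```

After the fix and a kernel restart, this should report cuda instead of cpu.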

Hope this is helpful!