Platform: SageMaker ✅


(Matt McClean) #21

I created a GitHub project with CloudFormation scripts to set up the SageMaker notebook instance. All you need to do is select the AWS region closest to you and click the Launch Stack button. Click through the options on the CloudFormation page, then wait until the stack is created.

It works with both the new version of fast.ai (v1.0) and the older version (v0.7): pass the version number as an input parameter to the CloudFormation stack. It also downloads the notebooks for the course.

Launch by following the instructions from the README page found here.
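If you'd rather launch the stack from code than from the console button, something like the following should work with boto3. Note the parameter key "FastaiVersion" and the stack name are hypothetical; check the template in the repo for the real names:

```python
# Build the CloudFormation Parameters list for the fastai stack.
# "FastaiVersion" is a guess at the parameter key; see the actual template.
def stack_parameters(version):
    return [{"ParameterKey": "FastaiVersion", "ParameterValue": version}]

# Live call, commented out since it needs AWS credentials and the template URL:
# import boto3
# cfn = boto3.client("cloudformation")
# cfn.create_stack(StackName="fastai-notebook",
#                  TemplateURL=template_url,  # from the repo's README
#                  Parameters=stack_parameters("1.0"),
#                  Capabilities=["CAPABILITY_IAM"])
```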


(Jeremy Howard (Admin)) #22

Very cool @matt.mcclean! :slight_smile: So nice to see the AWS gurus here in our little forum too.

Did you include the extra change from @avishalom (post immediately above yours) that ensures everything is set up correctly when launching an existing instance?


(Harold) #23

I can confirm that the SageMaker tutorial doesn’t show how to properly request a limit increase. The other tutorial is for EC2 instances, and AWS handles those limits separately. However, now I have a question - which resource type should I choose? See the following picture:

If I figure it out, I will let everyone know.


(Harold) #24

Yes, I think we should. Otherwise it will not work if you follow that tutorial directly.


(Harold) #25

Did you figure it out ?


(Matt McClean) #26

Yes, SageMaker Notebooks is the correct resource type.


(Harold) #27

Thanks Matt! Really appreciate it. If this goes well, we’ll be using a lot of SageMaker at my company :slight_smile:


(Harold) #28

By the way, once I get everything running, I will be happy to try to do a PR.


(Harold) #29

Should we use the tutorial or CloudFormation for Sagemaker?


(Matt McClean) #30

Hi @jeremy. Yes, it installs the fastai ipython kernel each time the notebook is started, to ensure everything works even after stopping and starting a notebook. It installs the fastai libraries only when the notebook is first created, as it saves the libs to the separate EBS volume mapped to the /home/ec2-user/SageMaker folder (i.e. the conda env is ~/SageMaker/envs/fastai). The EBS volume is persisted and reattached to the notebook across stops and starts, so there is no need to reinstall the libraries in the OnStart script.

Installing the fastai libs and dependencies takes around 3.5 GB, leaving around 1.5 GB spare on the EBS volume. There is also around 25 GB free on the root volume. Hopefully that is sufficient space to run the lessons, as the models and data are saved to the ~/.fastai and ~/.torch directories, which are mapped to the root volume. This does mean, however, that students will have to download the data and models again after restarting the notebook instance.
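If you want to keep an eye on how much of each volume is left, a quick generic check from inside any notebook cell (nothing SageMaker-specific here, just stdlib):

```python
import shutil

def free_gib(path):
    """Free space, in GiB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 2**30

# On the notebook instance you would check both volumes, e.g.:
# free_gib("/home/ec2-user/SageMaker")  # the persistent EBS volume
# free_gib("/")                          # the root volume (data/models live here)
```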

Hope this helps


(Jeremy Howard (Admin)) #31

@matt.mcclean that’s way better than what we’ve currently got in our setup tutorial. If you happen to have the time and interest, we’d love a docs PR for our setup and update instructions:

http://course-v3.fast.ai/start_sagemaker.html

http://course-v3.fast.ai/update_sagemaker.html


(Harold) #32

Do we still need to request limit increase with this method?


(Jeremy Howard (Admin)) #33

Yes you do.


(James Juan Whei Tan) #34


Did anyone else run into this problem?


(Jeremy Howard (Admin)) #35

Looks like you tried to launch a t2 instead of a p2. Also, as the message mentions, you can find details in your CloudWatch logs.


#36

It took me a while to understand how to use the fastai env in SageMaker. It’s somewhat implicit in the scripts shared in this thread, but I wanted to call it out for beginners like me.

These are some useful troubleshooting steps in case you are having trouble getting it to work:

  • You need to source activate ~/SageMaker/envs/fastai to activate the conda env.
  • After you source activate, register your conda env in ipython using ipython kernel install --name 'fastai' --display-name 'fastai' --user
  • Select the kernel with the ‘display-name’ you entered above.

Otherwise you get errors while executing the import statements and it becomes confusing to see a lot of kernels but not the fastai one.

The space restrictions get annoying too. I think we only have 5 GB, plus 20 GB of /tmp storage that is not persistent. How are people getting around that? EFS?


(Jeremy Howard (Admin)) #37

You can simply use the kernel selector in the jupyter notebook - the only reason to activate the conda env is if you need to do stuff with it in the console. The env is already registered for you by the scripts.


(Matt McClean) #38

Hi @jeremy. No problem, I updated the docs to use CloudFormation and created a pull request here.


(Kaushik Jaiswal) #39

Hey @jeremy. I have followed the instructions to set up SageMaker, but whenever I start my notebook it says the kernel was not found.


(Avishalom Shalit) #40

@Kaushikjais, if you follow what I pasted above you will get the kernel you need.
If you don’t want to add a startup script, you can just open a shell and run

@jeremy I think we are going about this all wrong.
I just had a chat with an Amazon engineer, and he suggested that running a notebook on a p2 is a waste of resources.
The notebook instance should be a t2, and the training job should be sent to a p2 when you train (because a notebook will be on for 3 hours for every few minutes the training runs).
I realize that this is specific to the SageMaker architecture, but maybe you know someone at AWS who might be willing to work through this.

e.g.

create_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": output_location
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.8xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "k": "10",
        "feature_dim": "784",
        "mini_batch_size": "500",
        "force_dense": "True"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 60 * 60
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": data_location,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ]
}

sagemaker = boto3.client('sagemaker')

sagemaker.create_training_job(**create_training_params)
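One wrinkle with this fire-and-forget approach is knowing when the job has finished. A hedged sketch of a follow-up, assuming the sagemaker client and job_name from the snippet above (the live calls are commented out since they need a real job):

```python
# boto3's sagemaker client exposes a waiter that blocks until the job
# completes or stops:
#
#   waiter = sagemaker.get_waiter("training_job_completed_or_stopped")
#   waiter.wait(TrainingJobName=job_name)
#   desc = sagemaker.describe_training_job(TrainingJobName=job_name)

def job_succeeded(desc):
    """True when a describe_training_job response reports a completed job."""
    return desc.get("TrainingJobStatus") == "Completed"
```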