Persistent AWS Spot Instances (How to)

slavivanov · February 24, 2017, 8:08pm

NOTICE

The original post was migrated to the wiki.

You might also refer to this medium article.

xinxin.li.seattle · February 25, 2017, 5:36am

I was getting errors with missing .aws.creds (my.conf is generated automatically from 1st approach)
Do I need to set it up myself in the way you described elsewhere?

2) Create .aws.creds with your actual IAM credentials with EC2 privileges in this format:

AWSAccessKeyId=XXXXXXXXXXXXXXXXXXXX
AWSSecretKey=XXXXXXXXXXXXXXXXXXXXXXXXXX

slavivanov · February 25, 2017, 8:01am

Hi @xinxin.li.seattle,
Sorry about that!
The ondemand_to_spot script creates the .aws.creds file using the same approach as setup_p2.sh (using aws configure get aws_access_key_id and aws configure get aws_secret_access_key).
If these were not set when you ran on_demand_to_spot.sh (e.g. you haven’t run aws configure), you can create .aws.creds in ec2-spotter using this template:
AWSAccessKeyId=XXXXXXXXXXXXXXXXXXXX
AWSSecretKey=XXXXXXXXXXXXXXXXXXXXXXXXXX

PS: Also I missed some crucial steps in the existing instance approach, which I just updated.

Saiyan · February 25, 2017, 8:16am

Thank you this will be quite helpful. I am not a lot familiar with AWS. So this question might seem stupid. It is more sought of clarification. Is the EBS volume running even when the spot instance stops? And if the volume is running, it will cost to keep the EBS volume running right? And how much on average it cost?

slavivanov · February 25, 2017, 8:21am

Hey @Saiyan!
Yes, you pay for the EBS volume regardless of whether it is attached to an instance. Currently it’s $0.1/GB-month. This means that if you have a 100GB volume for a full month, it will cost you $10, which IMO is not that much.

Saiyan · February 25, 2017, 8:34am

Thanks for the clarification and speedy reply

xinxin.li.seattle · February 26, 2017, 5:10am

@slavivanov Thank you for fixing the script. It is working for me very well!!

One advice is to put a word of caution in approach 1 fresh instance, because in step 3 it terminates not just the instance from Step 1, but all of your existing instances created with fastai setup script. Luckily, I always backup my data and code in the cloud, so nothing is lost. Because p2.xlarge was approved with a limit, for those with a small limit, you want to be very cautious about accidentally terminating your only approved instance. Other than that, this script works exceptionally well and is very easy to follow. I highly recommend it. Great job and thank you for sharing it @slavivanov!

slavivanov · February 26, 2017, 9:47am

Thanks @xinxin.li.seattle!
The script will use (and terminate) an instance named “fast-ai-gpu-machine”, which might not be the instance that was just launched. I’ll add a note about this.

z0k · February 27, 2017, 11:53am

Thanks a lot for this!

I’m getting this error when trying to run bash start_spot.sh:

parse error: Invalid numeric literal at line 1, column 8
parse error: Invalid numeric literal at line 1, column 8

It seems to be related to jq. The spot instance seems to otherwise load fine.

I’m running Ubuntu 16.04.1 LTS.

slavivanov · February 27, 2017, 3:32pm

Hi, @z0k
I probably forgot to specify the output type. I’ve pushed a commit to github for this.
Let me know if it works for you.

z0k · March 1, 2017, 6:48am

Thanks a lot! I’ll let you know the next time I spin up a spot instance.

slazien · March 1, 2017, 11:23am

Hey, I’ve tried setting it all up but I get the following error:
ondemand_to_spot.sh: 7: export: i-0278bf10da31b66a9: bad variable name
I suspect some small change in the bash script would do, but I’m still not sure what that should be. Could you please look into that?

Thanks a lot!

EDIT 1:
So using a temporary fix (substitution instance id in the script) worked but then that was the output:

TERMINATINGINSTANCES i-0016ed57539ce3077
CURRENTSTATE 32 shutting-down
PREVIOUSSTATE 16 running
Waiting for volume to become available.
ondemand_to_spot.sh: 91: ondemand_to_spot.sh: cannot create ec2-spotter/.aws.creds: Directory nonexistent
All done, you can start your spot instance with: sh start_spot.sh

Then, when I tried to do sh start_spot.sh, it stated the following:

start_spot.sh: 5: start_spot.sh: Bad substitution
…/ec2spotter-launch: line 38: .aws.creds: No such file or directory
Spot request ID:
Waiting for spot request to be fulfilled…

Waiter SpotInstanceRequestFulfilled failed: Max attempts exceeded
Waiting for spot instance to start up…

Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Spot instance ID:
Please allow the root volume swap script a few minutes to finish.
Then connect to your instance: ssh -i /home/slazien/.ssh/aws-key-fast-ai.pem ubuntu@

I’m not sure what that could be and I’m not sure which variable name from the first issue could be wrong…

EDIT 2:
So I managed to fix my first issue (getting instance ID), but I’m still stuck at “ondemand_to_spot.sh: 91: ondemand_to_spot.sh: cannot create ec2-spotter/.aws.creds: Directory nonexistent”, even though I created the directory manually…

z0k · March 1, 2017, 12:26pm

I think the script assumes that you’re running in the fast_ai directory, so try changing this line

export aws_credentials_file=ec2-spotter/.aws.creds

to the following

export aws_credentials_file=../.aws.creds

Instead of running the script again though, I think it should work if you just manually create the .aws.creds file in the ec2-spotter directory as follows:

export aws_key=`aws configure get aws_access_key_id`
export aws_secret=`aws configure get aws_secret_access_key`
cat > .aws.creds <<EOL
AWSAccessKeyId=$aws_key
AWSSecretKey=$aws_secret
EOL

slavivanov · March 1, 2017, 1:48pm

Hi @slazien, sorry about this!
@z0k is exactly right. The ondemand_to_spot file was previously in a different folder. Follow his instructions to get this solved.
(I’ve also pushed a fix for this to github).

slazien · March 1, 2017, 2:12pm

Hey @z0k and @slavivanov!

Thank you so much for your responses, changing that line (why didn’t I notice that myself?) fixed it all. There is still an error when running start_spot.sh (start_spot.sh: 5: start_spot.sh: Bad substitution), but it seems to work fine.

EDIT: so after terminating the on-demand instance and converting it to spot with the script it turns out nvidia-smi is not working, which is strange:

modprobe: ERROR: …/libkmod/libkmod.c:514 lookup_builtin_file() could not open builtin file '/lib/modules/4.4.0-64-generic/modules.builtin.bin’
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin’
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin’
modprobe: ERROR: …/libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367’
modprobe: ERROR: could not insert ‘nvidia_367’: Unknown symbol in module, or unknown parameter (see dmesg)
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Also, while trying to apt-get update it says dpkg was interrupted, ugh…

E: dpkg was interrupted, you must manually run ‘sudo dpkg --configure -a’ to correct the problem.

Did any of you have a similar problem?

EDIT 2: After fixing dpkg nvidia-smi seems to work fine.

slavivanov · March 1, 2017, 3:59pm

I’m glad you managed to get it working. I haven’t encountered this error.

shgidi · March 6, 2017, 11:19pm

are there part 2 scripts for this?

slavivanov · March 7, 2017, 12:35pm

@shgidi I plan to look at part 2 scripts this week and make any changes if needed.

shgidi · March 7, 2017, 3:01pm

thank you for the great work!

jamestdsmith · March 7, 2017, 4:48pm

This is awesome work, well done - It will save me millions over the next few year.

I’ve spent several hours installing everything and configured it now so the instances launch and worked out how to mount the instance.

One questions is that I don’t have jupyter notebook installed, so when I do installed - it routes to localhost.
Also the nvidia-smi doesn’t seem to work, so I’m wondering if I need to install a bunch of scripts?

Any thoughts?