Persistent AWS Spot Instances (How to)

(Slav Ivanov) #1


The original post was migrated to the wiki.

You might also refer to this Medium article.

(Xinxin) #2

Thank you! @slavivanov

I was getting errors about a missing .aws.creds file (my.conf was generated automatically by the first approach).
Do I need to set it up myself, in the way you described elsewhere?

2) Create .aws.creds with your actual IAM credentials with EC2 privileges in this format:


(Slav Ivanov) #3

Sorry about that!
The ondemand_to_spot script creates the .aws.creds file from your AWS CLI settings (using aws configure get aws_access_key_id and aws configure get aws_secret_access_key).
If these were not set when you ran it (e.g. you haven’t run aws configure), you can create .aws.creds in ec2-spotter using this template:

PS: I also missed some crucial steps in the existing-instance approach, which I have just updated.
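
To check whether the AWS CLI has credentials configured at all before running the script, a small test like the one below may help. This is only a sketch: have_aws_credentials is a made-up helper name, not part of ec2-spotter, and it relies solely on the aws configure get calls mentioned above.

```shell
#!/bin/sh
# Sketch: succeed only if both credential values are set in the AWS CLI config.
# have_aws_credentials is a hypothetical helper, not part of ec2-spotter.
have_aws_credentials() {
    key=$(aws configure get aws_access_key_id 2>/dev/null)
    secret=$(aws configure get aws_secret_access_key 2>/dev/null)
    # Both values must be non-empty for the generated .aws.creds to be usable.
    [ -n "$key" ] && [ -n "$secret" ]
}
```

It could be used as: have_aws_credentials || echo "Run 'aws configure' first".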


Thank you, this will be quite helpful. I am not very familiar with AWS, so this question might seem stupid; it is more of a clarification. Is the EBS volume still running when the spot instance stops? If so, it costs money to keep the volume running, right? And how much does it cost on average?

(Slav Ivanov) #5

Hey @Saiyan!
Yes, you pay for the EBS volume regardless of whether it is attached to an instance. Currently it’s $0.10 per GB-month, so a 100 GB volume kept for a full month costs $10, which IMO is not that much.
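
The arithmetic can be checked for other volume sizes with a short sketch like this one. The helper name ebs_monthly_cost is hypothetical, and the default price is just the rate quoted in this post; actual EBS pricing depends on region and volume type.

```shell
#!/bin/sh
# Sketch: monthly EBS cost for a volume of a given size.
# Usage: ebs_monthly_cost SIZE_GB [PRICE_PER_GB_MONTH]
# The default price (0.10) is the figure quoted in the post, not a live rate.
ebs_monthly_cost() {
    size_gb=$1
    price=${2:-0.10}
    awk -v s="$size_gb" -v p="$price" 'BEGIN { printf "%.2f\n", s * p }'
}

ebs_monthly_cost 100    # prints 10.00
```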


Thanks for the clarification and speedy reply :)

(Xinxin) #7

@slavivanov Thank you for fixing the script. It is working very well for me!

One piece of advice: add a word of caution to approach 1 (fresh instance). Step 3 terminates not just the instance from Step 1 but all of your existing instances created with the fastai setup script. Luckily, I always back up my data and code in the cloud, so nothing was lost. Because p2.xlarge instances are approved with a limit, those with a small limit should be very careful not to accidentally terminate their only approved instance. Other than that, this script works exceptionally well and is very easy to follow. I highly recommend it. Great job, and thank you for sharing it, @slavivanov!

(Slav Ivanov) #8

The script will use (and terminate) an instance named “fast-ai-gpu-machine”, which might not be the instance that was just launched. I’ll add a note about this.
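
Before running the script, it may be worth listing which instances carry that name. The sketch below uses the AWS CLI's describe-instances with a tag filter; list_instances_named is a hypothetical helper and is not part of the script itself.

```shell
#!/bin/sh
# Sketch: print the IDs of all non-terminated instances with a given Name tag.
# list_instances_named is a hypothetical helper, not part of the script.
list_instances_named() {
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$1" \
                  "Name=instance-state-name,Values=pending,running,stopping,stopped" \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text
}
```

Calling list_instances_named fast-ai-gpu-machine and seeing more than one ID would suggest renaming or backing up the other instances first.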

(Zarak) #9

Thanks a lot for this!

I’m getting this error when trying to run bash

parse error: Invalid numeric literal at line 1, column 8
parse error: Invalid numeric literal at line 1, column 8

It seems to be related to jq. The spot instance seems to otherwise load fine.

I’m running Ubuntu 16.04.1 LTS.

(Slav Ivanov) #10

Hi, @z0k
I probably forgot to specify the output type. I’ve pushed a commit to GitHub for this.
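
For context, "Invalid numeric literal" is what jq typically prints when its input is not JSON, which can happen if the default output format set through aws configure is text or table. A minimal sketch of the idea, assuming the script pipes AWS CLI output into jq (first_instance_id_named is a hypothetical helper, not the actual code from the commit):

```shell
#!/bin/sh
# Sketch: request JSON explicitly whenever the result is piped into jq,
# so the call does not depend on the "output" setting from aws configure.
# first_instance_id_named is a hypothetical helper for illustration.
first_instance_id_named() {
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$1" \
        --output json |
        jq -r '.Reservations[0].Instances[0].InstanceId // empty'
}
```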
Let me know if it works for you.

(Zarak) #11

Thanks a lot! I’ll let you know the next time I spin up a spot instance.

(Przemyslaw Zientala) #12

Hey, I’ve tried setting it all up but I get the following error: 7: export: i-0278bf10da31b66a9: bad variable name
I suspect some small change in the bash script would do, but I’m still not sure what that should be. Could you please look into that?

Thanks a lot!

A temporary fix (substituting the instance ID directly in the script) worked, but then this was the output:

TERMINATINGINSTANCES i-0016ed57539ce3077
CURRENTSTATE 32 shutting-down
Waiting for volume to become available. 91: cannot create ec2-spotter/.aws.creds: Directory nonexistent
All done, you can start your spot instance with: sh

Then, when I tried to do sh, it stated the following: 5: Bad substitution
…/ec2spotter-launch: line 38: .aws.creds: No such file or directory
Spot request ID:
Waiting for spot request to be fulfilled…

Waiter SpotInstanceRequestFulfilled failed: Max attempts exceeded
Waiting for spot instance to start up…

Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Spot instance ID:
Please allow the root volume swap script a few minutes to finish.
Then connect to your instance: ssh -i /home/slazien/.ssh/aws-key-fast-ai.pem ubuntu@

I’m not sure what that could be and I’m not sure which variable name from the first issue could be wrong…

So I managed to fix my first issue (getting the instance ID), but I’m still stuck at “91: cannot create ec2-spotter/.aws.creds: Directory nonexistent”, even though I created the directory manually…
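
For anyone hitting the same "bad variable name" message: that wording comes from dash (Ubuntu's /bin/sh), and one plausible cause, not confirmed in this thread, is an unquoted command substitution in an export line returning more than one instance ID. The sketch below shows a way to guard against that; list_ids is a stand-in for the real AWS CLI query.

```shell
#!/bin/sh
# Stand-in for an AWS CLI query that returns two instance IDs on one line
# (hypothetical; the real script would call aws ec2 describe-instances).
list_ids() { printf 'i-aaaaaaaa\ti-bbbbbbbb\n'; }

# Assign first, then export by name. This way export never receives the
# second ID as a separate word that it would try to treat as a variable name.
instance_ids="$(list_ids)"
export instance_ids

# Pick the first ID explicitly instead of relying on word splitting.
instance_id=$(printf '%s\n' "$instance_ids" | awk '{ print $1; exit }')
export instance_id
echo "$instance_id"
```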

(Zarak) #13

I think the script assumes that you’re running in the fast_ai directory, so try changing this line

export aws_credentials_file=ec2-spotter/.aws.creds

to the following

export aws_credentials_file=../.aws.creds

Instead of running the script again though, I think it should work if you just manually create the .aws.creds file in the ec2-spotter directory as follows:

export aws_key=`aws configure get aws_access_key_id`
export aws_secret=`aws configure get aws_secret_access_key`
cat > .aws.creds <<EOL
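
The snippet above is cut off after the heredoc opener in this copy of the thread. A complete version might look like the sketch below. The two key names (AWSAccessKeyId and AWSSecretKey) are an assumption, based on the credential-file format used by the older EC2 command-line tools, and write_aws_creds is a hypothetical wrapper; check the ec2-spotter README for the exact format it expects.

```shell
#!/bin/sh
# Sketch: write a credentials file from the AWS CLI configuration.
# ASSUMPTION: the key names AWSAccessKeyId/AWSSecretKey are not confirmed
# by this thread; adjust them to whatever ec2-spotter actually reads.
write_aws_creds() {
    aws_key=$(aws configure get aws_access_key_id)
    aws_secret=$(aws configure get aws_secret_access_key)
    cat > "$1" <<EOL
AWSAccessKeyId=$aws_key
AWSSecretKey=$aws_secret
EOL
    chmod 600 "$1"   # keep the credentials readable by the owner only
}
```

From inside the ec2-spotter directory this would be called as write_aws_creds .aws.creds.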

(Slav Ivanov) #14

Hi @slazien, sorry about this!
@z0k is exactly right. The ondemand_to_spot file was previously in a different folder. Follow his instructions to get this solved.
(I’ve also pushed a fix for this to GitHub).

(Przemyslaw Zientala) #15

Hey @z0k and @slavivanov!

Thank you so much for your responses. Changing that line (why didn’t I notice that myself?) fixed it all. There is still an error when running (5: Bad substitution), but it seems to work fine.
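
A note on "Bad substitution": this is also dash's wording, and it usually appears when a script that uses bash-only expansions is started with sh. Assuming that is what happens here (the thread does not confirm it), running the script with bash instead of sh should avoid it. A script that needs bash could also warn early, as in this sketch with the hypothetical helper current_shell:

```shell
#!/bin/sh
# Sketch: report which kind of shell is interpreting the script.
# BASH_VERSION is set by bash and is normally unset under dash.
current_shell() {
    if [ -n "${BASH_VERSION:-}" ]; then
        echo bash
    else
        echo sh
    fi
}

# Guard for the top of a script that relies on bash-only features.
if [ "$(current_shell)" != bash ]; then
    echo "warning: not running under bash; bash-only expansions may fail" >&2
fi
```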

EDIT: so after terminating the on-demand instance and converting it to spot with the script it turns out nvidia-smi is not working, which is strange:

modprobe: ERROR: …/libkmod/libkmod.c:514 lookup_builtin_file() could not open builtin file '/lib/modules/4.4.0-64-generic/modules.builtin.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367'
modprobe: ERROR: could not insert 'nvidia_367': Unknown symbol in module, or unknown parameter (see dmesg)
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Also, while trying to apt-get update it says dpkg was interrupted, ugh…

E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Did any of you have a similar problem?

EDIT 2: After fixing dpkg, nvidia-smi seems to work fine.

(Slav Ivanov) #16

I’m glad you managed to get it working. I haven’t encountered this error.

(Gidi Shperber) #17

Are there part 2 scripts for this?

(Slav Ivanov) #18

@shgidi I plan to look at part 2 scripts this week and make any changes if needed.

(Gidi Shperber) #19

Thank you for the great work!

(James Smith) #20

This is awesome work, well done. It will save me millions over the next few years.

I’ve spent several hours installing everything and have now configured it so that the instances launch, and I’ve worked out how to mount the instance.

One question: I don’t have Jupyter Notebook installed, so when I install it, it routes to localhost.
Also, nvidia-smi doesn’t seem to work, so I’m wondering if I need to install a bunch of scripts?

Any thoughts?