Persistent AWS Spot Instances (How to)

NOTICE

The original post was migrated to the wiki.

You might also refer to this medium article.

Thank you! @slavivanov

I was getting errors with missing .aws.creds (my.conf is generated automatically from 1st approach)
Do I need to set it up myself in the way you described elsewhere?

2) Create .aws.creds with your actual IAM credentials with EC2 privileges in this format:

AWSAccessKeyId=XXXXXXXXXXXXXXXXXXXX
AWSSecretKey=XXXXXXXXXXXXXXXXXXXXXXXXXX

Hi @xinxin.li.seattle,
Sorry about that!
The ondemand_to_spot script creates the .aws.creds file using the same approach as setup_p2.sh (using aws configure get aws_access_key_id and aws configure get aws_secret_access_key).
If these were not set when you ran ondemand_to_spot.sh (e.g. you haven't run aws configure), you can create .aws.creds in ec2-spotter using this template:
AWSAccessKeyId=XXXXXXXXXXXXXXXXXXXX
AWSSecretKey=XXXXXXXXXXXXXXXXXXXXXXXXXX

PS: Also I missed some crucial steps in the existing instance approach, which I just updated.

Thank you, this will be quite helpful. I am not very familiar with AWS, so this question might seem stupid; it is more of a clarification. Does the EBS volume keep running when the spot instance stops? If it does, it will cost money to keep the volume running, right? And how much does it cost on average?

Hey @Saiyan!
Yes, you pay for the EBS volume regardless of whether it is attached to an instance. Currently it's $0.10 per GB-month. This means that if you have a 100GB volume for a full month, it will cost you $10, which IMO is not that much.
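
For other volume sizes, the arithmetic above can be sketched as a small shell helper. The function name is made up for illustration, and the default rate is just the $0.10 figure quoted here; actual pricing may differ by region, volume type, and date.

```shell
# Estimate monthly EBS cost: size in GB times price per GB-month.
# The 0.10 USD default is the rate quoted in this thread (an assumption,
# not a lookup of current AWS pricing).
ebs_monthly_cost() {
    size_gb="$1"
    price_per_gb="${2:-0.10}"
    awk -v s="$size_gb" -v p="$price_per_gb" 'BEGIN { printf "%.2f\n", s * p }'
}

ebs_monthly_cost 100   # prints 10.00
ebs_monthly_cost 30    # prints 3.00
```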

Thanks for the clarification and speedy reply :slight_smile:

@slavivanov Thank you for fixing the script. It is working for me very well!!

One piece of advice: add a word of caution to approach 1 (fresh instance), because step 3 terminates not just the instance from step 1 but all of your existing instances created with the fastai setup script. Luckily, I always back up my data and code in the cloud, so nothing was lost. Because p2.xlarge instances are approved with a limit, anyone with a small limit should be very careful not to accidentally terminate their only approved instance. Other than that, this script works exceptionally well and is very easy to follow. I highly recommend it. Great job and thank you for sharing it @slavivanov!

Thanks @xinxin.li.seattle!
The script will use (and terminate) an instance named "fast-ai-gpu-machine", which might not be the instance that was just launched. I'll add a note about this.
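
Before running the conversion, it may be worth checking which instances carry that name. Here is a minimal sketch, assuming the name is stored in the instance's Name tag (the thread does not say how the script looks it up) and that the AWS CLI is configured. The function name is hypothetical.

```shell
# List the ID and state of every instance whose Name tag matches.
# "fast-ai-gpu-machine" is the name mentioned above; pass another name
# as the first argument to override it.
list_instances_by_name() {
    name="${1:-fast-ai-gpu-machine}"
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$name" \
        --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
        --output text
}

# Usage (requires a configured AWS CLI):
#   list_instances_by_name
#   list_instances_by_name my-other-machine
```

If more than one running instance shows up, the script may not pick the one you expect.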

Thanks a lot for this!

I'm getting this error when trying to run bash start_spot.sh:

parse error: Invalid numeric literal at line 1, column 8
parse error: Invalid numeric literal at line 1, column 8

It seems to be related to jq. The spot instance seems to otherwise load fine.

I'm running Ubuntu 16.04.1 LTS.

Hi, @z0k
I probably forgot to specify the output type. I've pushed a commit to GitHub for this.
Let me know if it works for you.
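
For anyone hitting the same jq error: jq tends to report "Invalid numeric literal" when its input is not JSON, which can happen if the default output format set through aws configure is text or table. A rough sketch of forcing JSON follows. The wrapper name and the $request_id variable are placeholders, and this is not necessarily what the commit does.

```shell
# Wrapper that always requests JSON, regardless of the default output
# format configured via `aws configure`.
aws_json() {
    aws "$@" --output json
}

# Usage (requires a configured AWS CLI and jq; $request_id is a placeholder):
#   aws_json ec2 describe-spot-instance-requests \
#       --spot-instance-request-ids "$request_id" \
#       | jq -r '.SpotInstanceRequests[0].InstanceId'
```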

Thanks a lot! I'll let you know the next time I spin up a spot instance.

Hey, I've tried setting it all up but I get the following error:
ondemand_to_spot.sh: 7: export: i-0278bf10da31b66a9: bad variable name
I suspect some small change in the bash script would do, but I'm still not sure what that should be. Could you please look into that?

Thanks a lot!

EDIT 1:
Using a temporary fix (substituting the instance ID directly in the script) worked, but then this was the output:

TERMINATINGINSTANCES i-0016ed57539ce3077
CURRENTSTATE 32 shutting-down
PREVIOUSSTATE 16 running
Waiting for volume to become available.
ondemand_to_spot.sh: 91: ondemand_to_spot.sh: cannot create ec2-spotter/.aws.creds: Directory nonexistent
All done, you can start your spot instance with: sh start_spot.sh

Then, when I tried to do sh start_spot.sh, it stated the following:

start_spot.sh: 5: start_spot.sh: Bad substitution
…/ec2spotter-launch: line 38: .aws.creds: No such file or directory
Spot request ID:
Waiting for spot request to be fulfilled…

Waiter SpotInstanceRequestFulfilled failed: Max attempts exceeded
Waiting for spot instance to start up…

Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Spot instance ID:
Please allow the root volume swap script a few minutes to finish.
Then connect to your instance: ssh -i /home/slazien/.ssh/aws-key-fast-ai.pem ubuntu@

I'm not sure what that could be and I'm not sure which variable name from the first issue could be wrong…

EDIT 2:
So I managed to fix my first issue (getting instance ID), but I'm still stuck at "ondemand_to_spot.sh: 91: ondemand_to_spot.sh: cannot create ec2-spotter/.aws.creds: Directory nonexistent", even though I created the directory manually…
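
A possible factor, offered as a guess rather than a confirmed diagnosis: messages in the form "script: line: export: …: bad variable name" and "Bad substitution" look like they come from dash, which is what sh points to on Ubuntu, rather than from bash. If the scripts rely on bash-only syntax, running them with bash instead of sh may avoid these errors. A minimal guard for the top of a bash-only script could look like this (is_bash is a made-up helper name):

```shell
# Succeeds only when given a non-empty bash version string.
# bash sets BASH_VERSION; dash and other POSIX shells leave it unset.
is_bash() {
    [ -n "$1" ]
}

# Guard that could sit near the top of a bash-only script:
if is_bash "${BASH_VERSION:-}"; then
    echo "running under bash"
else
    echo "not bash: rerun with 'bash script.sh' instead of 'sh script.sh'" >&2
fi
```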

I think the script assumes that you're running in the fast_ai directory, so try changing this line

export aws_credentials_file=ec2-spotter/.aws.creds

to the following

export aws_credentials_file=../.aws.creds

Instead of running the script again though, I think it should work if you just manually create the .aws.creds file in the ec2-spotter directory as follows:

export aws_key=`aws configure get aws_access_key_id`
export aws_secret=`aws configure get aws_secret_access_key`
cat > .aws.creds <<EOL
AWSAccessKeyId=$aws_key
AWSSecretKey=$aws_secret
EOL
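
If aws configure was never run, the commands above will write empty values, so it may help to check the resulting file before launching. A small sketch, assuming the two-line format shown in this thread (the function name is hypothetical):

```shell
# Return 0 if the file has non-empty AWSAccessKeyId and AWSSecretKey entries.
check_creds_file() {
    file="$1"
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    grep -q '^AWSAccessKeyId=..*' "$file" || { echo "empty or missing AWSAccessKeyId" >&2; return 1; }
    grep -q '^AWSSecretKey=..*' "$file" || { echo "empty or missing AWSSecretKey" >&2; return 1; }
}

# Usage:
#   check_creds_file ec2-spotter/.aws.creds && echo "credentials file looks complete"
```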

Hi @slazien, sorry about this!
@z0k is exactly right. The ondemand_to_spot file was previously in a different folder. Follow his instructions to get this solved.
(I've also pushed a fix for this to GitHub).

Hey @z0k and @slavivanov!

Thank you so much for your responses, changing that line (why didn't I notice that myself?) fixed it all. There is still an error when running start_spot.sh (start_spot.sh: 5: start_spot.sh: Bad substitution), but it seems to work fine.

EDIT: so after terminating the on-demand instance and converting it to spot with the script it turns out nvidia-smi is not working, which is strange:

modprobe: ERROR: …/libkmod/libkmod.c:514 lookup_builtin_file() could not open builtin file '/lib/modules/4.4.0-64-generic/modules.builtin.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367'
modprobe: ERROR: could not insert 'nvidia_367': Unknown symbol in module, or unknown parameter (see dmesg)
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Also, while trying to apt-get update it says dpkg was interrupted, ugh…

E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Did any of you have a similar problem?

EDIT 2: After fixing dpkg nvidia-smi seems to work fine.

I'm glad you managed to get it working. I haven't encountered this error.

Are there part 2 scripts for this?

@shgidi I plan to look at part 2 scripts this week and make any changes if needed.

Thank you for the great work!

This is awesome work, well done. It will save me millions over the next few years.

I've spent several hours installing everything and have now configured it so the instances launch, and I've worked out how to mount the instance.

One question: I don't have Jupyter Notebook installed, and when I do install it, it routes to localhost.
Also, nvidia-smi doesn't seem to work, so I'm wondering if I need to install a bunch of scripts?

Any thoughts?