Persistent AWS Spot Instances (How to)

Hi @slazien, sorry about this!
@z0k is exactly right. The ondemand_to_spot file was previously in a different folder. Follow his instructions to get this solved.
(I’ve also pushed a fix for this to GitHub.)

Hey @z0k and @slavivanov!

Thank you so much for your responses. Changing that line (why didn’t I notice that myself?) fixed it all. There is still an error when running start_spot.sh (start_spot.sh: 5: start_spot.sh: Bad substitution), but it seems to work fine.

EDIT: After terminating the on-demand instance and converting it to spot with the script, it turns out nvidia-smi is not working, which is strange:

modprobe: ERROR: …/libkmod/libkmod.c:514 lookup_builtin_file() could not open builtin file '/lib/modules/4.4.0-64-generic/modules.builtin.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/4.4.0-64-generic/modules.dep.bin'
modprobe: ERROR: …/libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367'
modprobe: ERROR: could not insert 'nvidia_367': Unknown symbol in module, or unknown parameter (see dmesg)
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Also, while trying to apt-get update it says dpkg was interrupted, ugh…

E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.

Did any of you have a similar problem?

EDIT 2: After fixing dpkg, nvidia-smi seems to work fine.
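For anyone who hits the same pair of errors, the recovery described in this post can be sketched roughly as below. It is only an outline, assuming an Ubuntu instance with sudo access; the function name repair_dpkg_and_check_gpu is made up for illustration and is not part of the spot scripts.

```shell
# Hypothetical helper (not from the repo): finish the interrupted dpkg run,
# refresh the package lists, then ask the NVIDIA driver for its status.
repair_dpkg_and_check_gpu() {
    sudo dpkg --configure -a || return 1   # the command the apt error asks for
    sudo apt-get update || return 1
    nvidia-smi                             # fails again if the driver is still broken
}
```

Calling repair_dpkg_and_check_gpu once after SSHing in runs the three steps in order and stops at the first one that fails.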


I’m glad you managed to get it working. I haven’t encountered this error.

Are there part 2 scripts for this?

@shgidi I plan to look at part 2 scripts this week and make any changes if needed.

Thank you for the great work!

This is awesome work, well done. It will save me millions over the next few years.

I’ve spent several hours installing and configuring everything, so the instances now launch, and I’ve worked out how to mount the instance.

One question: I don’t have Jupyter Notebook installed, and when I do install it, it routes to localhost.
Also, nvidia-smi doesn’t seem to work, so I’m wondering if I need to install a bunch of scripts?

Any thoughts?

@jamestdsmith I believe @slazien had the same issue. Check out his post above, as it might be helpful.
As for jupyter notebook, you might want to look into this script:

Especially this part:
# configure jupyter and prompt for password
jupyter notebook --generate-config
jupass=`python -c "from notebook.auth import passwd; print(passwd())"`
echo "c.NotebookApp.password = u'"$jupass"'" >> $HOME/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False" >> $HOME/.jupyter/jupyter_notebook_config.py

Awesome, many thanks. I’m very new to development, so this helps me loads.


@slavivanov I am getting this error message when I try to run the bash script fast_ai/start_spot.sh (the second approach, using an existing instance).

“An error occurred (InvalidAMIID.NotFound) when calling the RequestSpotInstances operation: The image id ‘[ami-6edd3078]’ does not exist
Spot request ID:
Waiting for spot request to be fulfilled…”

It doesn’t seem to like the image id, but the conf file specifically says not to change this image id. Can you help me take a look at this? Thank you!

Hi,

The AMI should correspond to the region you’re in. Here are a couple of snippets from the ondemand_to_spot.sh script:

export region=`aws configure get region`
# The ami to boot up the spot instance with.
# Ubuntu-xenial-16.04 in diff regions.
# Ubuntu 16.04.1 LTS
if [ "$region" = "us-west-2" ]; then
	export ami=ami-a58d0dc5 # Oregon
elif [ "$region" = "eu-west-1" ]; then
	export ami=ami-405f7226 # Ireland
elif [ "$region" = "us-east-1" ]; then
	export ami=ami-6edd3078 # Virginia
fi

# The AMI to be used as the pre-boot environment. This is NOT your target system installation.
# Do Not Modify this unless you have a need for a different Kernel version from what's supplied.
ec2spotter_preboot_image_id=$ami
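AMI ids are specific to a region, which is why an id from one region produces InvalidAMIID.NotFound in another. As a rough sketch of the same lookup, here it is with a guard for regions that are not in the list. The function name ami_for_region and the error message are my own additions, not part of ondemand_to_spot.sh; the three ids are the ones quoted above.

```shell
# Map a region name to the pre-boot AMI quoted above. Any other region is
# reported as an error instead of silently leaving the AMI unset.
ami_for_region() {
    case "$1" in
        us-west-2) echo "ami-a58d0dc5" ;;  # Oregon
        eu-west-1) echo "ami-405f7226" ;;  # Ireland
        us-east-1) echo "ami-6edd3078" ;;  # Virginia
        *)
            echo "No pre-boot AMI known for region '$1'" >&2
            return 1
            ;;
    esac
}
```

It could be used as ami=$(ami_for_region "$(aws configure get region)") || exit 1, so the script stops early when the configured region has no matching image.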

@z0k I used the correct AMI and got the spot instance launched, but the root volume swap still hasn’t happened after 15 minutes (see attached screenshot). I did uncomment and update the value for the elastic IP, and other than that I followed every step in the instructions. Is there anything I need to do manually to swap the root device? If not, can you point me to the right script to debug?

The root volume attached to the spot instance is the 8GB one shown in green (in-use), and my spotter volume (available) is shown in blue.

Hm, can you verify that the name in your .conf script matches what you wrote (spotter) in the console?

# Name of root volume.
ec2spotter_volume_name=spotter

Other than that, I’m not sure what the problem is, but for the time being you can manually attach the volume to your spot instance in the AWS console, and then mount it after SSHing into your instance:

$ mkdir spotter
$ sudo mount /dev/xvdf1 spotter
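A slightly more defensive version of the same two commands might look like this. It is a sketch only: the helper name mount_spotter is hypothetical, and /dev/xvdf1 is just the device name from the commands above, which may differ on your instance (lsblk lists what is actually attached).

```shell
# Hypothetical helper: mount the manually attached volume, refusing to go on
# if the expected partition is not visible to the instance.
mount_spotter() {
    dev="${1:-/dev/xvdf1}"   # assumed device name; confirm with lsblk
    dir="${2:-spotter}"
    if [ ! -b "$dev" ]; then
        echo "Block device $dev not found; run lsblk to see attached volumes" >&2
        return 1
    fi
    mkdir -p "$dir" && sudo mount "$dev" "$dir"
}
```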

Hopefully @slavivanov can shed some light on this.

Hi @xinxin.li.seattle,
My first guess is the same as @z0k’s: the name of the volume in the my.conf file is different than the actual name of the volume (spotter).

Secondly, the Elastic IP setting in my.conf should be the elastic IP id, not the IP itself. You can find the id from the IP by running:
aws ec2 describe-addresses --public-ips $ip --output text --query 'Addresses[0].AllocationId'
Replace $ip with your elastic IP.
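Since the id and the IP are easy to mix up, a quick shape check can help before launching. This is a minimal sketch based on the fact that allocation ids start with eipalloc- followed by hexadecimal characters; the function name is_allocation_id is made up here and the check does not confirm that the id exists in your account.

```shell
# Succeed only if the argument looks like an Elastic IP allocation id
# (eipalloc-<hex>) rather than a dotted IP address.
is_allocation_id() {
    case "$1" in
        eipalloc-*[!0-9a-f]*) return 1 ;;  # non-hex character after the prefix
        eipalloc-?*)          return 0 ;;
        *)                    return 1 ;;
    esac
}
```

For example, is_allocation_id "$ec2spotter_elastic_ip" || echo "my.conf holds an IP, not an allocation id" would flag the mistake described above.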

Another reason might be that the ec2spotter_volume_zone is not set correctly (it should be us-west-2a by your screenshot). You can post (or message me) your my.conf file if unsure of any of the settings.
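To compare the volume name and zone against my.conf from the command line, something like the following should work. It is a sketch that assumes the name shown in the console is the volume's Name tag; the wrapper name find_volume_by_name is hypothetical.

```shell
# List id, state and availability zone of every volume whose Name tag
# matches the given name (default: spotter).
find_volume_by_name() {
    aws ec2 describe-volumes \
        --filters "Name=tag:Name,Values=${1:-spotter}" \
        --query 'Volumes[*].[VolumeId,State,AvailabilityZone]' \
        --output text
}
```

An empty result would suggest the name does not match, and the third column can be compared with ec2spotter_volume_zone.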

If the above are all set correctly, there might have been some hiccup during the boot. To check for this go to Instances in EC2 Dashboard, select your instance, then Actions, then Instance Settings, then Get System Log. See the last few lines of the log for any errors (or post here if unsure).
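The same log can usually be fetched with the AWS CLI instead of the console. A rough sketch, where show_boot_log_tail is a made-up wrapper name and the instance id is whatever your spot instance was assigned:

```shell
# Print the last N lines (default 20) of an instance's system log.
# $1 = instance id, $2 = optional number of lines.
show_boot_log_tail() {
    aws ec2 get-console-output --instance-id "$1" \
        --query Output --output text | tail -n "${2:-20}"
}
```

Note that the log may be empty for a few minutes after the instance starts.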

Lastly, you can check if the swap commands failed for some reason by running them by hand:
SSH into the server, run sudo su - to use the root account, and then:

  1. Check if the credentials file exists in /root/.aws.creds and that the credentials are correct.
  2. Check that awscli is installed
  3. Check if there are files in /root/ec2-spotter/
  4. Finally, try to run the swap root volume script by hand:
    cd ec2-spotter
    ./ec2spotter-remount-root --force 1 --vol_name ${ROOT_VOL_NAME} --vol_region ${ROOT_REGION} --elastic_ip $ec2spotter_elastic_ip
    but replace ${ROOT_VOL_NAME} with spotter, ${ROOT_REGION} with us-west-2a, and $ec2spotter_elastic_ip with your elastic IP id.
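Checks 1 to 3 above can be bundled into one small function, sketched below. The name spotter_preflight is hypothetical, and it only verifies that the pieces are present; it cannot tell whether the credentials themselves are correct.

```shell
# Hypothetical pre-flight check for steps 1-3; run it as root.
# The optional argument overrides the home directory (default: /root).
spotter_preflight() {
    home_dir="${1:-/root}"
    rc=0
    [ -s "$home_dir/.aws.creds" ] || { echo "missing or empty: $home_dir/.aws.creds"; rc=1; }
    command -v aws >/dev/null 2>&1 || { echo "awscli not found on PATH"; rc=1; }
    [ -n "$(ls -A "$home_dir/ec2-spotter" 2>/dev/null)" ] || { echo "no files in $home_dir/ec2-spotter/"; rc=1; }
    return "$rc"
}
```

If it prints nothing and returns success, step 4 is the next thing to try.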

Let me know what happens.


I just got the spot instance working! I hope I didn’t take up too much of your time @slavivanov, now it’s working beautifully at a fraction of the original cost, can’t thank you enough!

I’m glad to have helped!
If it’s not much to ask, please “Like” the original post.

@shgidi Part 2 should work exactly the same. Then, after you start up your instance, run the commands that Jeremy listed.

You are a GODSEND! Saving me so much money! Thank you!!!

One thing I’ve noticed is that my elastic IP is not attaching to the instance, has anyone had this problem before?

I have to manually attach it in the AWS Console page, which is no big deal, but I’m wondering why it’s not updating directly. The start_spot.sh script is listing the correct elastic IP address and telling me to connect to it, but the console is listing a different IP address.

I double-checked the names in my.conf, and they match what I have in my console. Odd!
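For reference, the manual attach done in the console has a CLI equivalent, sketched here. This is not the code from the spot scripts; attach_elastic_ip is a made-up wrapper, and both ids are values you have to look up yourself.

```shell
# Associate an Elastic IP with a running instance.
# $1 = instance id, $2 = allocation id (eipalloc-...), not the IP itself.
attach_elastic_ip() {
    if [ "$#" -ne 2 ]; then
        echo "usage: attach_elastic_ip <instance-id> <allocation-id>" >&2
        return 2
    fi
    aws ec2 associate-address --instance-id "$1" --allocation-id "$2"
}
```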


I’m glad to have helped @stevelizcnao.
The elastic IP code got removed at some point (I was probably debugging something) and I forgot to add it back. The fix is now pushed to the repo, so download the latest code and it should work from now on.

Firstly @slavivanov, thank you so much for this! This works neatly for the most part.

Here are a couple of issues I ran into.
a) I was unable to attach my existing volume to the spot instance. Not sure why. I followed the instructions to the T.
The script created a new instance for me. I have set the name of the volume in my.conf, etc. What the script does is create a new volume AND attach the existing volume to the instance.
b) The new instance was only 8GB, which I noticed while running gpu-install.sh. So if you’re creating a new instance, check the size before running that script, because it’s painful to debug that script and re-run parts of it.

How do I debug it?
Thanks!