Persistent AWS Spot Instances (How to)

(oscar) #103

Solved: the script runs fine with the bash command but fails with errors under the sh command. I don’t know why.

(Olivier Ma) #104

Hi, I saw on the AWS description page that g2 has 8 cores, 26 Gigabytes GPU, and 15 Gigabytes RAM, while p2 has 4 cores, 12 Gigabytes GPU, and 61 Gigabytes RAM. g2 looks somewhat better to me in that it has more cores and a much larger GPU (though less memory), but based on your description it can only run smaller jobs. Does this mean that most of the jobs are very memory intensive? I’m rather new to deep learning, so how much memory do the examples in this course actually consume?


Thanks a lot!!! It works perfectly for me!
I suspected that the two systems’ kernels didn’t match, but I didn’t know how to fix it.
Your solution is excellent!!!


I faced the same “nvidia-smi command not found” issue. It was a PATH issue, as Jeremy mentioned earlier in the thread. I searched for nvidia-smi and found it at “/usr/lib/nvidia-367/bin”. After adding that directory to the PATH, it works fine now:

Thu Sep 14 05:10:10 2017
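+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   72C    P0    67W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+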
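
The PATH fix described in the post above can be sketched as follows. The directory is the one reported in the post; the "367" part depends on the installed driver version, so treat it as an example rather than a fixed location.

```shell
# Directory where the post above found nvidia-smi; the "367" part depends
# on the installed driver version, so treat this path as an example.
NVIDIA_BIN_DIR="/usr/lib/nvidia-367/bin"

# Optional: locate the binary if the directory differs on your machine.
# find /usr/lib -name nvidia-smi 2>/dev/null

# Prepend the directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$NVIDIA_BIN_DIR:"*) ;;              # already on PATH, nothing to do
  *) PATH="$NVIDIA_BIN_DIR:$PATH" ;;
esac
export PATH
```

Putting the same lines in ~/.bashrc should make the change persist across logins.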

(Rishab Gulati) #107

Thanks a lot, johnny. It took me two straight hours to figure out why it wasn’t working, and I tried a lot of things that didn’t work until I found your post. Thanks a lot. If I could give you multiple likes I would :smile:

(john v) #108

You’re welcome :slight_smile:

(Navin Kumar) #109

I was able to successfully launch a p2 spot instance using the wiki.
Thank you.

I used the Persistence for Spot Instances: Approach 2 — Swap root volume - With a new instance mentioned in the wiki :

My assumption is that this should result in “fast-ai-volume” being the only volume attached, with the status “in-use” when viewed in the EC2 dashboard.

But in my case, a new volume is created when the spot instance starts, and the volume “fast-ai-volume” stays in the “available” state. I am attaching a screenshot of this.

To work around this problem, I had to attach “fast-ai-volume” to the instance and then run:
$ sudo mount
After this I can see the persistent data storage.

I am using a p2 spot instance from the eu-west-1 (Ireland) AMI.
I started the instance as:
$ bash ec2-spotter/fast_ai/

After the instance started, I waited around 10 minutes before trying to ssh into it.

I tried starting the instance with the script from inside “ec2-spotter/fast_ai” as well as from inside the “ec2-spotter” directory. The result does not change.

I assume this is not how swap root is supposed to work. It should mount the volume “fast-ai-volume” as the root partition, and the newly created volume should not be attached to the instance.
Am I right?

I doubt whether the command “ec2spotter-remount-root” in the file is succeeding in my case.

Any suggestions to solve the problem would be welcome.

  • navin

(Tait Larson) #110

Finding this was a huge help. I couldn’t figure out why my AMI (I used this one from Amazon) gave me the following error.

 failed: Volume of size 8GB is smaller than snapshot 'snap-03129c5bb8793afea', expect size >= 50GB
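
The error says the requested root volume (8 GB) is smaller than the snapshot behind the AMI (50 GB); a volume created from a snapshot has to be at least as large as that snapshot. A minimal sketch of a pre-launch size check follows. The helper name `pick_volume_size` is hypothetical, and the sizes are hard-coded from the error message rather than looked up.

```shell
# Hypothetical helper: print the volume size (in GB) to request, never
# smaller than the snapshot the volume is created from.
pick_volume_size() {
  requested_gb=$1
  snapshot_gb=$2
  if [ "$requested_gb" -lt "$snapshot_gb" ]; then
    echo "$snapshot_gb"
  else
    echo "$requested_gb"
  fi
}

# Sizes taken from the error message above.
# The snapshot size could instead be looked up with the AWS CLI, e.g.:
#   aws ec2 describe-snapshots --snapshot-ids snap-03129c5bb8793afea \
#       --query 'Snapshots[0].VolumeSize' --output text
VOLUME_GB=$(pick_volume_size 8 50)
echo "Request a root volume of at least ${VOLUME_GB} GB"
```

With the values from the error message this prints a minimum of 50 GB, which is the size the launch configuration would need to request.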

FWIW I was following the instructions here:

(Samuel Kiefer Reh) #111

Check out Spotr, a tool I wrote to help automate this. Here’s a writeup on it:

and GitHub:

(Pavan Alluri) #112

Hey @samuelreh, thanks for the package. I tried to use spotr but ran into a config file path error. Reported this at

It would be great if you could have a look at this. Thank you.

(Stian Selnes) #113

I tried to use spotr yesterday too and ran into the same problem. To work around it (until it’s fixed in spotr) you can create the config manually before running spotr: create the file .spotr/config with a config section similar to the text below. However, I also had to hack around a security group issue before it launched a spot instance (but that may depend on your AWS setup, I assume).


(Pavan Alluri) #114

Thank you @stianse. I created that folder, but after the instance was created I couldn’t connect to it! I received the error below.

matching_rules = (x for x in group['IpPermissions'] if x['FromPort'] == port
                  and x['ToPort'] == port)
KeyError: 'FromPort'

I thought it might be a security group inbound-rules issue, but it wasn’t, because the inbound rules accept traffic from All.

I also tried to simply connect to the instance using ssh, but that failed too, with a “Connection timed out” error.

Any ideas on resolving this issue? @stianse & @samuelreh

(Stian Selnes) #115

@pavan_alluri: It seems you’re running into the exact same issues as I did. At this point I fetched spotr from the git repo and hacked it to use a specific security group that I had manually created in the AWS console. But I don’t recommend following that path, since I ran into more issues further down the road with the snapshot functionality. I’m sure @samuelreh is interested in fixing these issues, and that this will become a quick and neat way of spawning/resuming spot instances. In the meantime, however, I propose waiting for an updated version :slight_smile:

(Pavan Alluri) #116

@stianse thank you :slight_smile:

I wasn’t able to follow the tutorial @slavivanov posted either; it appears to be outdated, with several issues caused by changes made in the meantime (I looked at potential fixes from others but got lost somewhere!). I did try to get it working, since the amount I could potentially save on billing is high, but after 5 hours of struggling I simply ended up choosing a normal EC2 instance. Bummer!

(Samuel Kiefer Reh) #117

@stianse @pavan_alluri Please give it another try, I’ve resolved the config and security group issues in version 0.0.12.

(mo) #118

Hi,
If anyone can help me, it would be most appreciated. I have been trying to get the spot instance code to work for a while, and so far I could only:
  • create an instance with Jeremy’s script
  • detach the volume
  • make all required changes in my.conf
I have verified that my.conf and example.conf have the same permissions.
When I run sh fast_ai/
I get -> fast_ai/ 5: fast_ai/ Bad substitution
fast_ai/ 7: .: Can’t open …/my.conf
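
“Bad substitution” is the message a strict POSIX shell such as dash (the default sh on Ubuntu) prints when a script uses a bash-only expansion, which would also explain why the same script runs fine when started with bash. The script’s actual line 5 is not shown in the thread, so the expansion below is only an assumed example of the kind of construct that triggers the error, and `is_bash` is a hypothetical helper name.

```shell
# A bash-only construct: substring expansion ${var:offset:length}.
# (Assumed example; the script's real line 5 is not shown in the thread.)
snippet='name=fast_ai_volume; echo "${name:0:7}"'

# Under bash this prints "fast_ai".
bash -c "$snippet"

# Under a strict POSIX sh such as dash, the same line fails with
# "Bad substitution". On systems where sh is itself bash, it succeeds.
sh -c "$snippet" 2>/dev/null || echo "sh could not run the bash-only line"

# A script can make the requirement explicit by checking for bash up front.
is_bash() { [ -n "${BASH_VERSION:-}" ]; }
is_bash || echo "Run this script with bash, not sh" >&2
```

If this is the cause, starting the script with bash instead of sh should avoid both errors' common trigger, though the second message may also depend on the directory the script is run from.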

(Md Tauseef) #119

Hey @slavivanov, I’m facing a similar issue. On checking the logs I saw that pip wasn’t installed on bootup, which causes the installation of aws to fail. Any thoughts?
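
One possible workaround, sketched under the assumption that the boot script runs on an Ubuntu-based AMI and installs the AWS CLI through pip, is to check for pip first and install it when it is missing. The helper name `have_cmd` is hypothetical, and the install commands are left commented out because the actual boot script is not shown in the thread.

```shell
# Hypothetical helper: succeed if the given command is on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

if have_cmd pip; then
  echo "pip found"
else
  echo "pip missing; install it before anything that depends on it" >&2
  # On an Ubuntu-based AMI this could be (assumption, needs root):
  # apt-get update && apt-get install -y python-pip
fi

# Only then install the AWS CLI (assumption: the boot script uses pip for this):
# pip install awscli
```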