Persistent AWS Spot Instances (How to)

I found one problem; however, fixing it does not seem to resolve the issue. It looks like the system is missing a specific version of linux-image-extra.
Try running sudo apt-get install --reinstall nvidia-375 and watch for any fatal error. You might notice:
depmod: ERROR: could not open directory /lib/modules/4.4.0-64-generic: No such file or directory
To fix this, run sudo apt-get install linux-image-extra-4.4.0-64 (or whatever version you are missing).
After that you might want to reinstall CUDA:
sudo apt-get purge --autoremove cuda
sudo apt-get install cuda
sudo modprobe nvidia
nvidia-smi

And now I’m getting

sudo modprobe -v nvidia
modprobe: ERROR: ../libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_375'
modprobe: ERROR: could not insert 'nvidia_375': Unknown symbol in module, or unknown parameter (see dmesg)

Thanks a lot @rqyang. I tried all of the options mentioned in the post to swap the volume, with no success. Your option worked great for me! The only setbacks are:

  1. If you have changes to the system/installations that you wish to keep, then you need to update the AMI
  2. The extra cost for the AMI snapshot.

Hi guys, I am actually facing a MaxSpotInstanceCountExceeded error and cannot figure out how to proceed further. I am currently following the course’s wiki to start a spot instance. Any help regarding this is highly appreciated. Thank you!!

Sorry, Ashish, I was away. I just opened a customer service case with AWS and asked them to increase my Spot Instance limit. Took 24 hours and they did it.
Hope it helps.
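If you want to see how many spot requests are currently counting against your limit before opening the case, the AWS CLI can list them; a rough sketch (the state filter values are my guess at what is relevant):

aws ec2 describe-spot-instance-requests \
    --filters Name=state,Values=open,active \
    --query 'SpotInstanceRequests[].[SpotInstanceRequestId,State,LaunchSpecification.InstanceType]' \
    --output table
# requests in the open/active state are, as I understand it, the ones that count against the limit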


OMG, I think I just fixed it. It’s early days, but this might help someone.
For context, I was getting all the errors that everyone else has had: nvidia-smi not working, open-iscsi not loading, sudo modprobe nvidia not working.
I decided to set up my stuff in Ireland, as that is where it was originally tested, but I think the actual solution is not related to that.
I believe the problem is that the kernel versions of the host system and the chroot environment are different, and that the chroot environment doesn’t have the right kernel information. So, after booting into the chroot’d spot instance, try the following:

Run uname -r to find out what kernel the container is on; in my case it was 4.4.0-59.
Now we need to install the image and headers for this kernel. If you run
ls /lib/modules
you should see that your kernel information is not actually there.
To install the kernel stuff type:
sudo apt-get install linux-image-<your kernel number>-generic
and
sudo apt-get install linux-headers-<your kernel number>-generic
replacing the placeholder with your kernel number, for instance:
sudo apt-get install linux-headers-4.4.0-59-generic
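If you’d rather not type the version by hand, the same thing can probably be done in one line, assuming the version reported by uname -r inside the chroot is the one that is missing:

sudo apt-get install linux-image-$(uname -r) linux-headers-$(uname -r)
# $(uname -r) expands to something like 4.4.0-59-generic, so this pulls in
# both the image and headers packages for the running kernel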

Then, load the nvidia kernel module:
sudo modprobe nvidia
Now try
nvidia-smi
and if it doesn’t error, and instead shows you something like

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   43C    P0    72W / 149W |      0MiB / 11439MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then you are in business!


Hi everybody,

I’m new to Linux and need some help. Due to the high cost of on-demand instances, I decided to switch to spot instances. I have followed the steps in Using an Existing Instance to reuse the same instance that was on demand.
These are the steps I followed:

  1. Stopped the on-demand instance and detached its root volume.

  2. Renamed the detached volume.

  3. Created a copy of example.conf named my.conf.

  4. Modified my.conf, specifically:

ec2spotter_volume_name=spotter
ec2spotter_volume_zone=eu-west-1a
ec2spotter_launch_zone=eu-west-1a
ec2spotter_key_name=EUwest
ec2spotter_instance_type=p2.xlarge
ec2spotter_subnet=subnet-xxxxxx
ec2spotter_security_group=sg-xxxxxxx
ec2spotter_preboot_image_id=ami-d8f4deab

Here I launch start_spot.sh and I get the following error:

fast_ai/start_spot.sh: 5: fast_ai/start_spot.sh: Bad substitution
fast_ai/start_spot.sh: 7: .: Can’t open …/my.conf

I think I’m doing something wrong with my working directory.

Thank you!!

Solved: the script runs fine with the bash command but errors out with the sh command. I don’t know why.
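For anyone hitting the same thing: on Ubuntu, sh is dash, which rejects bash-only expansions, and “Bad substitution” is typically dash choking on one of those. A minimal illustration (the exact line in start_spot.sh may well be different; this is just the general failure mode):

sh -c 'echo ${BASH_SOURCE[0]}'     # dash: Bad substitution
bash -c 'echo ${BASH_SOURCE[0]}'   # fine under bash (prints an empty line)
# so invoke the script with bash rather than sh:
bash fast_ai/start_spot.sh

The “Can’t open …/my.conf” part also suggests the script resolves the config path relative to where it is invoked from, so running it from the repo root as in the wiki is probably safest.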

Hi @xinxin.li.seattle, I saw on the AWS description page that g2 has 8 cores, 26 GB of GPU, and 15 GB of RAM, while p2 has 4 cores, 12 GB of GPU, and 61 GB of RAM. g2 looks somewhat better to me in that it has more cores and a much larger GPU (though less memory), but based on your description it can only run smaller jobs. Does this mean that most of the jobs are very memory intensive? I’m rather new to deep learning, so how much memory would the examples in this course actually consume?
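One way to get a feel for the memory question is to watch GPU memory while a notebook is training; nvidia-smi can report it directly (the 5-second refresh is just a choice):

nvidia-smi --query-gpu=memory.total,memory.used,utilization.gpu --format=csv -l 5
# prints total GPU memory, memory in use, and GPU utilisation every 5 seconds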

Thanks a lot!!! It works perfectly for me!
I had some thoughts that the two systems’ kernels didn’t match, but I didn’t know how to fix it.
Your solution is excellent!!!


Faced the same “nvidia-smi command not found” issue. It was a path issue, as Jeremy had mentioned earlier in the thread. I searched for nvidia-smi and found it at “/usr/lib/nvidia-367/bin”. Added it to the PATH (see the snippet below) and it works fine now:
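In case it helps anyone, this is roughly what adding it to the PATH looks like; the nvidia-367 path is the one from my driver version, yours may differ:

export PATH=$PATH:/usr/lib/nvidia-367/bin
# make it stick across logins
echo 'export PATH=$PATH:/usr/lib/nvidia-367/bin' >> ~/.bashrc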

nvidia-smi
Thu Sep 14 05:10:10 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   72C    P0    67W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks a lot, johnny. It took me two straight hours to figure out why it wasn’t working, and I tried a lot of things that didn’t work until I found your post. Thanks a lot. If I could give you multiple likes I would :smile:


You’re welcome :slight_smile:

Hi,
I am able to successfully launch a p2 spot instance using the wiki help.
Thank you.

I used “Persistence for Spot Instances: Approach 2 — Swap root volume - With a new instance” as mentioned in the wiki: http://wiki.fast.ai/index.php/AWS_Spot_instances

My assumption is that this should result in the volume “fast-ai-volume” being the only volume attached, with status “in-use” when viewed in the EC2 dashboard.

But in my case, upon starting the spot instance, I am seeing a new volume being created, and the volume “fast-ai-volume” stays in the “available” state. I am attaching a snapshot of this.

To overcome this problem, I had to attach the “fast-ai-volume” to the instance and then run:
$ sudo mount
After this I am able to see the persistent data storage.
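(For reference, the manual workaround amounts to roughly the following; the volume and instance IDs are placeholders, and on Ubuntu a volume attached as /dev/sdf typically shows up as /dev/xvdf:)

aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/sdf
sudo mkdir -p /mnt/fast-ai
sudo mount /dev/xvdf /mnt/fast-ai   # persistent data is then visible under /mnt/fast-ai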

I am using a p2 spot instance from the eu-west-1 (Ireland) AMI.
I started the instance as:
$ bash ec2-spotter/fast_ai/start_spot.sh

After the instance started, I waited around 10 minutes before sshing into it.

I tried starting the instance with the start_spot.sh script from inside the “ec2_spotter/fast_ai” directory as well as inside the “ec2_spotter” directory. The result does not change.

I assume this is not the way the swap-root mechanism is supposed to work… It should mount the volume “fast-ai-volume” as the root partition, and the other newly created volume should not be attached to the instance.
Am I right?

I am not sure whether the “ec2spotter-remount-root” command in the file ec2-spotter-launch.sh is succeeding in my case.
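A quick way to check that from inside the instance, using nothing ec2-spotter-specific, is to look at what is actually mounted as root:

lsblk       # lists the attached block devices and their mount points
df -h /     # shows which device is mounted as the root filesystem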

Any suggestions to solve the problem would be welcome.

  • navin

Finding this was a huge help. I couldn’t figure out why my AMI (I used this one from Amazon) gave me the following error.

 failed: Volume of size 8GB is smaller than snapshot 'snap-03129c5bb8793afea', expect size >= 50GB

FWIW I was following the instructions here:

http://wiki.fast.ai/index.php/AWS_Spot_instances
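In case it saves someone else the same head-scratching: the error means the launch request asked for an 8GB root volume while the AMI’s snapshot needs at least 50GB. If you are issuing the spot request yourself rather than through the scripts, one fix is to ask for a big enough root volume in the block device mapping, roughly like this (the AMI ID and bid are placeholders, and /dev/sda1 assumes the usual Ubuntu root device name):

aws ec2 request-spot-instances \
    --spot-price "0.30" \
    --launch-specification '{
        "ImageId": "ami-xxxxxxxx",
        "InstanceType": "p2.xlarge",
        "BlockDeviceMappings": [
            { "DeviceName": "/dev/sda1", "Ebs": { "VolumeSize": 50 } }
        ]
    }'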

Check out Spotr, a tool I wrote to help automate this. Here’s a writeup on it:


and the GitHub repo:

Hey @samuelreh, thanks for the package. I tried to use spotr but ran into a config file path error. I reported it at https://github.com/samuelreh/spotr/issues/15

It would be great if you could have a look at this. Thank you.


I tried to use spotr yesterday too and ran into the same problem. To work around it (until it’s fixed in spotr) you can create the config manually before running spotr: create the file .spotr/config with a config section similar to the text below. However, I also had to hack around a security group issue before it launched a spot instance (but that may depend on your AWS setup, I assume).

[config]
max_bid=.30
type=p2.xlarge
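Spelled out as shell commands, the workaround above looks roughly like this (the bid and instance type are just the values from my config; adjust to taste):

mkdir -p .spotr
cat > .spotr/config <<'EOF'
[config]
max_bid=.30
type=p2.xlarge
EOF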

Thank you @stianse. I created that config, but then after the instance was created I couldn’t connect to it, as I received the error below.

matching_rules = (x for x in group['IpPermissions'] if x['FromPort'] == port and x['ToPort'] == port)
KeyError: 'FromPort'

I thought it might be a security group / inbound rules issue, but it wasn’t, because the inbound rules accept traffic from everywhere.

I also tried to simply connect to the instance using ssh, but that too failed with a “Connection timed out” error.

Any ideas on resolving this issue? @stianse & @samuelreh


@pavan_alluri: It seems you’re running into exactly the same issues as I did. At this point I fetched spotr from the git repo and hacked it to use a specific security group that I had manually created in the AWS console. But I don’t recommend following that path, since I ran into more issues further down the road with the snapshot functionality. I’m sure @samuelreh is interested in fixing these issues, and that spotr will then be a quick and neat way of spawning/resuming spot instances. In the meantime, however, I propose waiting for an updated version :slight_smile:


@stianse thank you :slight_smile:

I wasn’t able to follow the tutorial @slavivanov posted either; it appears to be outdated, with a few changes in the meantime and several open issues (I looked at potential fixes from others but got lost somewhere!). I did try to get it working, since the amount I could potentially save on billing is high, but after 5 hours of struggling I simply ended up choosing a normal EC2 instance. Bummer!
