I found one problem; however, it does not seem to resolve the issue. It seems like the system is missing a specific version of linux-image-extra.
Try running sudo apt-get install --reinstall nvidia-375 and watch for any fatal error. You might notice:
depmod: ERROR: could not open directory /lib/modules/4.4.0-64-generic: No such file or directory
To fix this, run sudo apt-get install linux-image-extra-4.4.0-64 (or whatever version you are missing).
After that you might want to reinstall CUDA:
sudo apt-get purge --autoremove cuda
sudo apt-get install cuda
sudo modprobe nvidia
nvidia-smi
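The reinstall sequence above can be sketched as a guarded script. The guards (dpkg and modinfo checks) are my addition, so the destructive steps only run when they actually apply; the package and module names are the ones from this thread:

```shell
# Reinstall CUDA only if it is currently installed
if dpkg -s cuda >/dev/null 2>&1; then
  sudo apt-get purge --autoremove -y cuda
  sudo apt-get install -y cuda
fi
# Load the driver module and verify, but only if the module is available
if modinfo nvidia >/dev/null 2>&1; then
  sudo modprobe nvidia
  nvidia-smi
fi
```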
And now I’m getting
sudo modprobe -v nvidia
modprobe: ERROR: ../libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_375'
modprobe: ERROR: could not insert 'nvidia_375': Unknown symbol in module, or unknown parameter (see dmesg)
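As the error message suggests, the "Unknown symbol" failure usually leaves details in the kernel log. A quick way to inspect it (the dkms check only applies if the driver was built via DKMS):

```shell
# Show the most recent nvidia-related kernel log lines
dmesg 2>/dev/null | grep -i nvidia | tail -n 20
# If the driver is managed by DKMS, check whether it was built for this kernel
command -v dkms >/dev/null 2>&1 && dkms status || true
```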
Thanks a lot @rqyang. I tried all of the options mentioned in the post to swap the volume, with no success. Your option worked great for me! The only setback is:
If you have changes to the system/installations that you wish to keep, then you need to update the AMI
Hi guys, I'm actually facing a MaxSpotInstanceCountExceeded error and can't figure out how to proceed further. I'm currently following the course's wiki to start a spot instance. Any help regarding this is highly appreciated. Thank you!
Sorry, Ashish, I was away. I just opened a customer service case with AWS and asked them to increase my Spot Instance limit. Took 24 hours and they did it.
Hope it helps.
OMG, I think I just fixed it. It’s early days but this might help someone.
For context, I was getting all the errors that everyone else has had: nvidia-smi not working, open-iscsi not loading, sudo modprobe nvidia not working.
I decided to set up my stuff in Ireland, as that is where it was originally tested, but I think the actual solution is not related to that.
I believe the problem is that the kernel version of the host system and the chroot environment are different. And that the chroot environment doesn’t have the right kernel information. So, after booting into the chroot’d spot instance, try the following:
Run uname -r to find out what the kernel of the container is; in my case it was 4.4.0-59.
Now we need to install the image and headers for this kernel. If you run ls /lib/modules,
you should see that your kernel information is not actually there.
To install the kernel packages, type:
sudo apt-get install linux-image-<your kernel number>-generic
sudo apt-get install linux-headers-<your kernel number>-generic
replacing <your kernel number> with your kernel number, for instance sudo apt-get install linux-headers-4.4.0-59-generic
Then load the nvidia kernel module: sudo modprobe nvidia
Now try nvidia-smi
and if it doesn’t error, and instead shows you something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   43C    P0    72W / 149W |      0MiB / 11439MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
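The diagnosis above (kernel of the chroot environment not matching what is under /lib/modules) can be sketched as a small check script. Ubuntu package names are assumed; it only prints the install commands rather than running them, so adapt as needed:

```shell
# Compare the running kernel against what's installed under /lib/modules
KERNEL=$(uname -r)          # e.g. 4.4.0-59-generic
echo "Running kernel: ${KERNEL}"
if [ -d "/lib/modules/${KERNEL}" ]; then
  echo "Kernel modules for ${KERNEL} are present."
else
  echo "Kernel modules missing; install them with:"
  echo "  sudo apt-get install linux-image-${KERNEL} linux-headers-${KERNEL}"
fi
```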
I'm new to Linux and need some help. Due to the high cost of on-demand instances, I decided to switch to spot instances. I followed the steps in Using an Existing Instance to reuse the same instance that was on demand.
These are the steps I followed:
Stop the on-demand instance and detach its root volume.
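If you prefer the command line, that step can be done with the AWS CLI roughly like this. The instance and volume IDs are placeholders (not values from this thread), and the RUN_AWS guard is mine so the commands only execute when you opt in:

```shell
INSTANCE_ID="i-0123456789abcdef0"   # placeholder: your on-demand instance ID
VOLUME_ID="vol-0123456789abcdef0"   # placeholder: its root volume ID
# Set RUN_AWS=1 to actually execute against your account
if command -v aws >/dev/null 2>&1 && [ "${RUN_AWS:-0}" = "1" ]; then
  aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
  aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
  aws ec2 detach-volume --volume-id "$VOLUME_ID"
fi
```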
Hi @xinxin.li.seattle, I saw on the AWS description page that g2 has 8 cores, 26 GB GPU, and 15 GB RAM, while p2 has 4 cores, 12 GB GPU, and 61 GB RAM. g2 looks somewhat better to me in that it has more cores and a much larger GPU (though less memory), but based on your description it can only run smaller jobs. Does this mean that most jobs would be very memory intensive? I'm rather new to deep learning, so how much memory would the examples in this course actually consume?
Thanks a lot!!! It works perfectly for me!
I had suspected that the two systems' kernels didn't match, but I didn't know how to fix it.
Your solution is excellent!!!
Faced the same "nvidia-smi command not found" issue. It was a path issue, as Jeremy mentioned earlier in the thread. I searched for nvidia-smi and found it at "/usr/lib/nvidia-367/bin". Added it to the PATH and it works fine now.
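For anyone hitting the same thing, the PATH fix can look like this. The nvidia-367 directory is the one from this post; your driver version directory may differ, which is what the find command checks:

```shell
# See where nvidia-smi actually lives (directory varies with driver version)
find /usr/lib -maxdepth 2 -name nvidia-smi 2>/dev/null
# Add the directory to PATH for the current shell session
export PATH="$PATH:/usr/lib/nvidia-367/bin"
# Persist it for future logins (assumes bash)
echo 'export PATH="$PATH:/usr/lib/nvidia-367/bin"' >> "$HOME/.bashrc"
```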
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Thanks a lot, Johnny. It took me two straight hours to figure out why it wasn't working, and I tried a lot of things that didn't help until I found your post. Thanks a lot. If I could give you multiple likes, I would.
My assumption is that the volume "fast-ai-volume" should end up as the only volume attached, with the status "in-use" when viewed in the EC2 dashboard.
But in my case, upon starting the spot instance, I see a new volume being created. The volume "fast-ai-volume" is left in the "available" state. I am attaching a snapshot of this.
To overcome this problem, I had to attach "fast-ai-volume" to the instance and then run:
$ sudo mount
After this I am able to see persistent data storage.
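For reference, the attach-and-mount workaround might look like the sketch below. The device name /dev/xvdf and the mount point are my assumptions, not values from this thread; check lsblk for the actual device on your instance:

```shell
DEVICE=/dev/xvdf          # assumed device name; confirm with: lsblk
MOUNT_POINT=/mnt/fast-ai  # assumed mount point
if [ -b "$DEVICE" ]; then
  sudo mkdir -p "$MOUNT_POINT"
  sudo mount "$DEVICE" "$MOUNT_POINT"
fi
# Confirm what is mounted where
df -h
```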
Using p2 spot instance from eu-west1 (Ireland) AMI.
I started the instance as: $ bash ec2-spotter/fast_ai/start_spot.sh
After the instance started, I waited around 10 minutes before I could ssh into it.
I tried starting the instance with the start_spot.sh script from inside "ec2-spotter/fast_ai" as well as inside the "ec2-spotter" directory. The result does not change.
I assume this is not the way swap root is supposed to function… It should mount the volume "fast-ai-volume" as the root partition, and the newly created volume should not be attached to the instance.
Am I right?
I doubt whether the command "ec2spotter-remount-root" in the file ec2-spotter-launch.sh is succeeding in my case.
Any suggestions to solve the problem would be welcome.
I tried to use spotr yesterday too and ran into the same problem. To work around it (until it's fixed in spotr), you can create the config manually before running spotr: create the file .spotr/config with a config section similar to the text below. However, I also had to hack around a security group issue before it launched a spot instance (but that may depend on your AWS setup, I assume).
@pavan_alluri: It seems you're running into exactly the same issues as I did. At this point I fetched spotr from the git repo and hacked it to use a specific security group that I had manually created in the AWS console. But I don't recommend following that path, since I ran into more issues further down the road with the snapshot functionality. I'm sure @samuelreh is interested in fixing these issues, and that this will become a quick and neat way of spawning/resuming spot instances. In the meantime, however, I propose waiting for an updated version.
I wasn't able to follow the tutorial @slavivanov posted either; it appears to be outdated, with several issues caused by changes in the meantime (I looked at potential fixes from others but got lost somewhere!). I did try to get it working, since the amount I could potentially save on billing is high, but after 5 hours of struggling I simply ended up choosing a normal EC2 instance. Bummer!