Making your own server

I think this is my favorite thread :slight_smile: I benefited greatly from all this advice and just finished my first build this week. I wrote a quick blog post documenting my experience here:

It was my first time building and despite the amazing resources available online, I still got stuck on a number of occasions. Hopefully my experience can be useful to someone.

11 Likes

Nice work
…Having port 22 open like that, you might want to consider putting the rest of your network behind a second router plugged into one of the free ports on your main router, just in case your DL box gets hacked.

You could also consider setting up port knocking on the DL box to hide the connection when not in use.
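A minimal knockd configuration for that could look roughly like the following sketch (the ports, timeout, and iptables rules here are placeholder examples only, so pick your own knock sequence):

# /etc/knockd.conf -- example sketch, not a recommendation
[options]
    UseSyslog

[openSSH]
    sequence    = 7000,8000,9000
    seq_timeout = 5
    command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags    = syn

[closeSSH]
    sequence    = 9000,8000,7000
    seq_timeout = 5
    command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags    = syn

You then send the knock sequence from the client (for example with the knock utility) before opening the SSH session.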

I have managed to set up just the CPU (an i3) without any GPU and complete lesson 1 successfully. Here is the edited script that I used on Ubuntu 16.04.
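The key Theano change for a CPU-only box is simply pointing it at the CPU instead of a GPU, e.g. something like this in ~/.theanorc (a sketch, assuming the Theano version used in the course):

[global]
device = cpu
floatX = float32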

Big thanks to James & Rachel for their big hearts! :slight_smile:

5 Likes

Bigggg Thanks! That is really helpful!!! I need to buy a computer and then try to build this!!!

Chris, have you made any other configuration changes to get the 229s fit time you're seeing? Anything special about the Theano config?

I'm running the CatsDogs Redux notebook (25K images) and my first fit is ~245s. The difference I can see between my setup and yours is that I'm running a 1080 vs your 1070, so I assumed I would have an advantage…

Kaby Lake 7700K
MSI Z270 Pro
Zotac GTX 1080 8GB
32GB RAM

Also, I can confirm the AWS p2 comparison. I just ran the CatsDogs Redux notebook on a p2.xlarge and the first fit took 602 secs vs my 245.

Do you have cuDNN installed? That makes a big difference. When you import Theano, look for the line that says it is using your GPU; it will tell you if it is using cuDNN. My guess is that this is the difference.

Are you assigning 80% or 90% of RAM to the GPU? You can get away with 90% in Linux, but in Windows you need to set it to 80%. This is set in the Theano flags.
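For the 80% case, that would be something along these lines (assuming the cnmem interface from this generation of Theano, set either via the environment or via ~/.theanorc as shown later in the thread):

THEANO_FLAGS='device=gpu,floatX=float32,lib.cnmem=0.8' jupyter notebook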

Are you using a 16x slot? This shouldn't be as important, since even the 1080 has a hard time saturating an 8x slot, but it will make a small difference.

I'm guessing a 1080 properly configured would get just under 200s. You can also overclock the 1080 to get even more out of it (roughly 10%).

@brendan
Dude, thank you so much for the tutorial. I've gotten all the way to remoting in, but instead of TeamViewer I'm using NoMachine. I really wanted to SSH in, though, so I'm going to use your SSH instructions to jump-start my attempt.

I figure I'll list my specs here when I get home (or successfully remote in from work), as I'm using an older setup than most of you right now.

I'm dual-booting my gaming desktop with Ubuntu.
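For anyone else going the SSH route, the bare-bones version from another machine on the same network is something like this (the username and IP are placeholders):

# on the DL box
sudo apt-get install openssh-server

# from the laptop / work machine
ssh yourname@192.168.1.42

# forward Jupyter's port so notebooks served on the DL box open at localhost:8888
ssh -L 8888:localhost:8888 yourname@192.168.1.42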

Has anyone had any problems with the Jupyter notebook password? The "dl_course" default isn't working for me for some reason.

The Jupyter password is system specific, so you should be able to set your own if you run this:
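For the notebook versions of that era, one way to generate and set a password is roughly:

python -c "from notebook.auth import passwd; print(passwd())"
# then paste the resulting hash into ~/.jupyter/jupyter_notebook_config.py, e.g.:
# c.NotebookApp.password = u'sha1:...'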

Here's my output from "import theano":
Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)
/home/bfortuner/anaconda3/lib/python3.6/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.

I confirmed the card is in the 16x slot. Interestingly, my board has another 16x-sized slot, but that one doesn't have 16 lanes going into it (only x4). Strange. I should have read this article, which explains what to watch out for. I compared our two boards and it seems the primary difference is that yours has an extra PCI-E x16 slot.

Here are the specs on my Zotac 1080 if you see anything interesting there. Strange again that it doesn't mention PCIe x16, just PCI Express 3.0.

Are you overclocking?

I am not currently overclocking, but plan to when I have time to go through the testing.

I would expect the 1080 to be faster; not only is it a quicker GPU, it also has faster memory throughput.

If you are on Windows, I would recommend running a benchmark and comparing it to other 1080s to see if you are running at what it should be.

It's not too strange that it doesn't mention x16; that isn't commonly listed, but PCI Express 3.0 is very important. x16 will only be within 1-2% of x8, since the link isn't fully saturated. PCI Express 3.0 over 2.0, though, does make a big difference: 10-25%.

I have nothing to compare to, but I have a feeling there is something wrong with my GPU machine. I currently SSH into it from my Mac and run the notebook from the SSH terminal. The GPU machine is running Ubuntu 16.04 with an nVidia GTX 1070 (8GB), 16GB of RAM, and an i5-6600 3.3GHz quad-core.

As I was running the data sets, this is what I got:

vgg = Vgg16()
...
vgg.fit(batches, val_batches, nb_epoch=1)

/home/username/anaconda3/lib/python3.6/site-packages/keras/layers/core.py:622: UserWarning:
`output_shape` argument not specified for layer lambda_3 and cannot be automatically inferred with the
Theano backend. Defaulting to output shape `(None, 3, 224, 224)` (same as input shape). If the expected
output shape is different, specify it via the `output_shape` argument.
 .format(self.name, input_shape))
Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Epoch 1/1
23000/23000 [==============================] - 4724s - loss: 0.1153 - acc: 0.9683 - val_loss: 0.0472 - val_acc: 0.9835

It took 4724 seconds to run this, which seems like a long time. Am I missing something?

I also checked the GPU machine physically, and the fan on the GPU didn't even come on while it was processing this. Should this have kicked in a fan at least?

I installed the nVidia drivers and cuDNN, but how can I tell if the GPU is actually using them?

My first run with a GPU server via SSH, so not sure what to expect. Thanks for the help out there.
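A quick way to sanity-check whether the GPU is actually doing the work (assuming the NVIDIA driver installed correctly) is to watch it while a fit is running:

watch -n 1 nvidia-smi       # GPU utilization and memory use should jump during training
python -c "import theano"   # the startup banner should say "Using gpu device 0: ..." and mention cuDNN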

OK, I had to seriously customize this install from the FastAI script. It took me a good 6 hours to get CUDA and cuDNN installed, the paths correct, Theano updated, and everything running. I now have my compute times down to 246s, which I thought was pretty good for a $1k machine.

It still gives me this, though: Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5110)

I suppose there's got to be a way to get this working as well.

@reverts
If you don't mind me asking, what changes did you have to make? (I'll probably be running into those issues shortly, lol)

As for the CNMeM message, it may have something to do with memory.

I'd take a look at this page; it seems to directly address the issue.

http://ankivil.com/making-theano-faster-with-cudnn-and-cnmem-on-windows-10/

@brendan
Thanks for the Jupyter help. I'm swimming in the deep end with a lot of this stuff right now ; )

@Kradoc

These are the resources that finally got me through.

Install nVidia, CUDA, and cuDNN

This was the best of the blog posts, though I installed the latest versions of cuDNN and CUDA 8.0, not exactly the versions in the blog.
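For reference, a manual cuDNN install on top of CUDA 8.0 is usually just unpacking NVIDIA's archive and copying it into the CUDA directory; a rough sketch (the archive name depends on the exact version you download):

tar xzvf cudnn-8.0-linux-x64-v5.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*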

The tough part is figuring out the post-installation actions. nVidia does a great walkthrough here:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions
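The key post-installation step is adding CUDA to your paths, e.g. in ~/.bashrc (per that guide, assuming CUDA is installed under /usr/local/cuda-8.0):

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}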

Here's how I fixed my CNMeM error:

nano ~/.theanorc

[global]
device = gpu
floatX = float32

[cuda]
root = /usr/local/cuda-8.0

[lib]
cnmem=0.95
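After saving the file, re-importing Theano should report CNMeM as enabled, with a banner roughly like the one earlier in the thread:

python -c "import theano"
# Using gpu device 0: GeForce GTX 1070 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5110)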

Yes, Nvidia GPUs need 16x. That's why I told you to get the 7700K; the boards for that chip should have non-PLX solutions for extra PCIe lanes beyond just 16. You should be able to confirm it's running at 16x somewhere in your BIOS. You should also be able to confirm with some monitoring what % of your interface you're using at any given point in time, if you think this is a reason for the slowdown. Generally, with one card in a board, almost all the slots SHOULD go to 16x if capable; you can usually suss this out visually from the leads inside the PCIe slot.
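On Linux you can also check the negotiated link width and speed directly rather than digging through the BIOS, for example:

lspci | grep -i nvidia                      # note the GPU's bus ID, e.g. 01:00.0
sudo lspci -vv -s 01:00.0 | grep -i lnksta  # should show something like: LnkSta: Speed 8GT/s, Width x16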

16x won't actually give you much of an improvement (1-2% tops); even the 1080 won't saturate an 8x lane. What does make a big improvement is PCI Express 3.0 over 2.0, which will give you a 10-15% boost. A faster CPU can be a significant boost as well. That said, it is always best to use 16x when you can.

For example, I went from an Ivy Bridge 3770K overclocked to 4.3GHz with a 1070 (16x bus, Gen 2 PCI Express) to dropping the GPU into a 7700K (16x bus, Gen 3 PCI Express), and noticed a 30% drop in train times.

From what others have tested, going from PCI Express 2.0 to 3.0 provides a little over a 10% boost. The rest was from the CPU.

1 Like

That's true up until the point you get a 2nd card. https://youtu.be/rctaLgK5stA

CUDA does not use SLI, so it would be using the card directly through PCI Express, and thus this is a moot point. The 16x vs 8x difference is still minor even with SLI; in many cases SLI setups get dropped down to 8x anyway, so again, moot.

The point of that video wasn't SLI; it had a nice comparison of 8x vs 16x for people who didn't understand the difference. My system, which has 4 GPUs and the appropriate amount of PCIe lanes to support said GPUs, doesn't ever slow down to 8x. The claim I made isn't moot. If anything the problem is actually exacerbated, because everything you send over goes across the PCIe lanes.

1 Like

After running through these exercises on a slow CPU and trying Amazon, I attempted to build my own server. I spent more than 10 hours on the build; I guess an expert would have taken less than 2 hours. Thanks a lot to @brendan. His write-up was my inspiration, and his notes helped a lot; otherwise this would have been a multi-day project.

Quoting his link.