I think this is my favorite thread. I benefited greatly from all this advice and just finished my first build this week. I wrote a quick blog post documenting my experience here:
It was my first time building and despite the amazing resources available online, I still got stuck on a number of occasions. Hopefully my experience can be useful to someone.
Nice work
…Having port 22 open like that, you might want to consider putting the rest of your network behind a second router plugged into one of the free ports on your main router, just in case your DL box gets hacked.
You could also consider setting up port knocking on the DL box to hide the connection when not in use.
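For reference, port knocking is commonly set up with `knockd`; here is a sketch of an `/etc/knockd.conf` along the lines of the stock example (the knock sequences and iptables commands are placeholders you would adapt to your own setup):

```ini
[options]
    UseSyslog

[openSSH]
    # knock these three ports in order to open port 22 for your IP
    sequence    = 7000,8000,9000
    seq_timeout = 5
    command     = /sbin/iptables -A INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags    = syn

[closeSSH]
    # the reverse sequence closes it again
    sequence    = 9000,8000,7000
    seq_timeout = 5
    command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags    = syn
```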
Chris, have you made any other configuration changes to get the 229s fit time you're seeing? Anything special about the Theano config?
I'm running the CatsDogs Redux notebook (25K images) and my first fit is ~245s. The one difference I can see between my setup and yours is that I'm running a 1080 vs your 1070, so I assumed I would have an advantage…
Kaby Lake 7700K
MSI Z270 Pro
Zotac GTX 1080 8GB
32GB RAM
Also, I can confirm the AWS p2 comparison. I just ran the CatsDogs Redux notebook on a p2.xlarge and the first fit took 602 secs vs my 245.
Do you have cuDNN installed? That makes a big difference. When you import Theano, look for the line that says it's using your GPU; it will tell you whether it's using cuDNN. My guess is that's the difference.
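For anyone unsure what to look for: the line the old Theano 0.8/0.9 sandbox backend prints on import (one is quoted later in this thread) names the GPU and, if present, the cuDNN version. A small hypothetical helper that reads such a banner line:

```python
import re

def parse_banner(line):
    """Return (using_gpu, cudnn_version_or_None) from a Theano import banner.

    The banner format is assumed from the one quoted in this thread, e.g.
    "Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled ..., cuDNN 5110)"
    """
    using_gpu = line.startswith("Using gpu device")
    m = re.search(r"cuDNN (\d+)", line)
    return using_gpu, (m.group(1) if m else None)

banner = ("Using gpu device 0: GeForce GTX 1080 "
          "(CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)")
print(parse_banner(banner))  # (True, '5110')
```

If the second element comes back `None`, Theano found the GPU but not cuDNN, which would explain a large slowdown.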
Are you assigning 80% or 90% of RAM to the GPU? You can get away with 90% on Linux, but on Windows you need to set it to 80%. This is set in the Theano flags.
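The flag in question is `lib.cnmem`. A sketch of a `~/.theanorc` for the old sandbox CUDA backend used in this course (the values are examples; 0.8 is the Windows-safe fraction mentioned above):

```ini
[global]
device = gpu
floatX = float32

[lib]
# fraction of GPU memory CNMeM reserves up front:
# ~0.9 is usually fine on Linux, use 0.8 on Windows
cnmem = 0.8
```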
Are you using a 16x slot? This shouldnāt be as important as even the 1080 has a hard time saturating an 8x slot, but it will make a small difference.
Iām guessing a 1080 properly configured would get just under 200s. You can also overclock the 1080 to get even more out of it (roughly 10%).
@brendan
Dude, thank you so much for the tutorial. I've gotten all the way to remoting in, though instead of TeamViewer I'm using NoMachine. I really wanted to SSH in, so I'm going to use your little SSH section to jump-start my attempt.
I figure I'll list my specs here when I get home (or successfully remote in from work), as I'm using an older setup than most of you right now.
I'm dual-booting my gaming desktop with Ubuntu.
Has anyone had any problems with the Jupyter notebook's password? The 'dl_course' one isn't working for me for some reason.
Here's my output from `import theano`:
Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)
/home/bfortuner/anaconda3/lib/python3.6/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
I confirmed the card is in the 16x slot. Interestingly, my board has another x16-sized slot, but that one only has four lanes (x4) going into it. Strange. I should have read this article, which explains what to watch out for. I compared our two boards and it seems the primary difference is that yours has an extra PCI-E x16 slot.
Here are the specs on my Zotac 1080, if you see anything interesting there. Strange again that it doesn't mention PCIe x16, just PCI Express 3.0.
I am not currently overclocking, but plan to when I have time to go through the testing.
I would expect the 1080 to be faster; not only is it a quicker GPU, it also has faster memory throughput.
If you are on Windows, I would recommend running a benchmark and comparing it to other 1080s to see if you are running at what it should be.
It's not too strange that it doesn't mention x16; that isn't commonly listed, but PCI Express 3.0 is very important. x16 will only be within 1-2% of x8, since even a 1080 doesn't fully saturate the link. PCI Express 3.0 over 2.0 does make a big difference, though: 10-25%.
I have nothing to compare it to, but I have a feeling there is something wrong with my GPU machine. I currently SSH into it from my Mac and run the notebook from the SSH terminal. The GPU machine is running Ubuntu 16.04, with an Nvidia GTX 1070 with 8GB of VRAM, 16GB of system RAM, and an i5-6600 3.3GHz quad-core.
As I was running the data sets, this is what I got:
vgg = Vgg16()
...
vgg.fit(batches, val_batches, nb_epoch=1)
/home/username/anaconda3/lib/python3.6/site-packages/keras/layers/core.py:622: UserWarning:
`output_shape` argument not specified for layer lambda_3 and cannot be automatically inferred with the
Theano backend. Defaulting to output shape `(None, 3, 224, 224)` (same as input shape). If the expected
output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Epoch 1/1
23000/23000 [==============================] - 4724s - loss: 0.1153 - acc: 0.9683 - val_loss: 0.0472 - val_acc: 0.9835
It took 4724 seconds to run this, and that seems like a long time. Am I missing something?
I also checked the GPU machine physically, and the fan on the GPU didn't even come on while it was processing this. Should that have kicked a fan on, at least?
I installed the Nvidia drivers and cuDNN, but how can I tell if the GPU is actually using them?
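One quick check is to run `nvidia-smi` while a fit is in progress: if GPU utilization sits near 0%, the GPU isn't doing the work. A small hypothetical wrapper around the standard query flags (it returns None when the tool isn't available, so it degrades gracefully on a non-GPU machine):

```python
import shutil
import subprocess

def gpu_utilization():
    """Return current GPU utilization in percent via nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver/tools not installed or not on PATH
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if out.returncode != 0 or not out.stdout.strip():
        return None
    return int(out.stdout.split()[0])

print(gpu_utilization())
```

Run it (or just `watch nvidia-smi`) in a second terminal while the first epoch is training; a healthy run should show high utilization and the Python process listed under the GPU.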
My first run with a GPU server via SSH, so not sure what to expect. Thanks for the help out there.
Ok, I had to seriously customize this install from the FastAI script. It took me a good 6 hours to get CUDA and cuDNN installed, paths correct, Theano updated, and everything running. I now have my compute times down to 246s, which I thought was pretty good for a $1k machine.
It still gives me this warning: Using GPU device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5110)
I suppose there's got to be a way to get that working as well.
Yes, Nvidia GPUs want 16x. That's why I told you to get the 7700K; the boards for that chip should have non-PLX solutions for extra PCIe lanes beyond just 16. You should be able to confirm it's running at 16x somewhere in your BIOS. You should also be able to confirm with some monitoring what % of your interface you're using at any given point in time, if you think this is a reason for the slowdown. Generally, with one card in a board, the slot SHOULD run at 16x if capable; you can usually suss this out visually from the leads inside the PCIe slot.
16x won't actually give you much of an improvement (1-2% tops); even the 1080 won't saturate an x8 link. What does make a big improvement is PCI Express 3.0 over 2.0, which will give you a 10-15% boost. A faster CPU can be a significant boost as well. That said, it is always best to use 16x when you can.
For example, I went from an Ivy Bridge 3770K overclocked to 4.3GHz with a 1070 (16x bus, Gen 2 PCI Express) to dropping the GPU into a 7700K (16x bus, Gen 3 PCI Express), and noticed a 30% drop in train times.
From what others have tested, PCI Express 2.0 -> 3.0 provides a little over a 10% boost. The rest was from the CPU.
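For context, the back-of-envelope numbers behind those percentages: PCIe 2.0 moves roughly 500 MB/s per lane (5 GT/s with 8b/10b encoding) and PCIe 3.0 roughly 985 MB/s per lane (8 GT/s with 128b/130b), so a Gen 3 x8 link carries about as much as a Gen 2 x16 link. A tiny sketch (`bandwidth_mb` is a hypothetical helper, not from any library):

```python
# Effective per-lane PCIe throughput in MB/s after encoding overhead:
# Gen2: 5 GT/s * 8/10 ~ 500 MB/s; Gen3: 8 GT/s * 128/130 ~ 985 MB/s
PER_LANE_MB = {"gen2": 500, "gen3": 985}

def bandwidth_mb(gen, lanes):
    """Theoretical one-direction bandwidth of a PCIe link, in MB/s."""
    return PER_LANE_MB[gen] * lanes

# Note how gen3 x8 roughly matches gen2 x16
for gen, lanes in [("gen2", 16), ("gen3", 8), ("gen3", 16)]:
    print(f"PCIe {gen} x{lanes}: {bandwidth_mb(gen, lanes) / 1000:.2f} GB/s")
```

This is why the gen-2-to-gen-3 jump matters far more than x8 vs x16 for a single card that never saturates the link.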
CUDA does not use SLI, so it would be using it directly through PCI Express and thus this is a moot point. The difference is still minor with SLI and 16x, in many cases it is getting dropped down to 8x anyway, so again moot.
The point of that video wasn't SLI; it had a nice comparison of 8x vs 16x for people who didn't understand. My system, which has 4 GPUs and the appropriate amount of PCIe lanes to support them, doesn't ever slow down to 8x. I didn't make a moot claim. If anything the problem is actually exacerbated, because everything you send over goes across the PCIe lanes.
After running through a slow CPU for these exercises and trying Amazon, I attempted to build my own server. I spent more than 10 hours on the build; I guess an expert would have taken less than 2. Thanks a lot to @brendan. His write-up was my inspiration, and his notes helped a lot; otherwise it would have been a multi-day project.