Making your own server

stephenl · April 11, 2017, 11:55pm

FYI

Just changing the NVIDIA driver from 375 (as installed by the CUDA 8.0 deb) to 378.13 took the run on a 1080 Ti GPU from 201 seconds to the time below. No other changes were made to the install instructions above, not even a reboot of the system.

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Epoch 1/1
23000/23000 [==============================] - 193s - loss: 0.1226 - acc: 0.9675 - val_loss: 0.0550 - val_acc: 0.9845

RogerS49 · April 12, 2017, 11:13am

@stephenl, @leonletto Hi. Well done, I can appreciate that a lot of effort and frustration went into that analysis. Thanks.

leonletto · April 12, 2017, 2:37pm

Thanks for writing this up in an organized manner @stephenl. I hope it saves many people much of our frustration

RogerS49 · April 12, 2017, 6:10pm

Just a note about the 378.13 and 375 drivers: I finally installed 378.13, but then I had no visuals. I have two graphics cards; the other, a GT 610, is just for video. Now when I run deviceQuery it can’t see this card. It can see the GTX 1080 Ti. Same with nvidia-smi.

stephenl · April 12, 2017, 7:26pm

Roger - you may have the Nouveau problem? I’m not sure, but I heard you may have to deal with the Ubuntu driver. I have only one card, so mine didn’t get snagged on a two-card setup. NVIDIA’s instructions are below, and there’s more in the install guide. Let us know how it goes so I can add an addendum.

To install the Display Driver, the Nouveau drivers must first be disabled. Each distribution of Linux has a different method for disabling Nouveau.

The Nouveau drivers are loaded if the following command prints anything:
lsmod | grep nouveau

Create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following contents:
blacklist nouveau
options nouveau modeset=0
Regenerate the kernel initramfs:
$ sudo update-initramfs -u
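
As a rough sketch of the steps quoted above, the following wraps the check and the blacklist file in two small helpers. The function names `nouveau_in` and `write_nouveau_blacklist` are made up for illustration, and the example writes into a scratch directory rather than /etc/modprobe.d so that it can be tried without root.

```shell
# Hypothetical helper: succeeds if the given lsmod output lists nouveau.
nouveau_in() {
    printf '%s\n' "$1" | grep -q '^nouveau[[:space:]]'
}

# Hypothetical helper: write the blacklist file into a target directory.
# The location named in the guide is /etc/modprobe.d (needs root).
write_nouveau_blacklist() {
    target_dir="$1"
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' \
        > "$target_dir/blacklist-nouveau.conf"
}

# Dry run against a scratch directory so nothing on the system changes.
scratch="$(mktemp -d)"
write_nouveau_blacklist "$scratch"
cat "$scratch/blacklist-nouveau.conf"
```

On a real system you would call `nouveau_in "$(lsmod)"`, write the file with sudo into /etc/modprobe.d, and then run `sudo update-initramfs -u` as in the quoted instructions.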

Read more at: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

stephenl · April 12, 2017, 7:29pm

Almost certainly it will, based on the lack of detailed advice anywhere on tackling this issue. Got to pay it forward!

topbots · April 12, 2017, 8:15pm

Thanks to all the detailed articles, step-by-step instructions, and active feedback from this community, I’ve managed to get my own deep learning server set up. Couldn’t have been successful without all the research in this forum thread.

I’m still working on getting code ported from AWS working locally, so you’ll be seeing continued posts from me on the forums here, but in the meantime I whipped up a blog post in the hopes that someone out there might find it useful.

http://www.topbots.com/deep-confusion-misadventures-in-building-a-machine-learning-server/

stephenl · April 13, 2017, 12:08am

This setup may not appreciate ‘suspend’ mode on Ubuntu, so watch for that one. If you start the labs again after a suspend and run lesson 1 etc., it starts throwing py GPU errors. It may be an aberration, but if it happens you probably need to reboot or find the relevant service and restart it. I’m not sure which service yet, but rebooting fixes it.

One to watch.

Surya501 · April 13, 2017, 2:41am

@topbots Congrats on getting your machine built and running. I read your blog comment about masculine-sounding names for gaming components. Search for ATI/AMD Ruby and Nvidia Nalu (the mermaid).

RogerS49 · April 13, 2017, 10:35am

@stephenl Thanks for that.
Not sure that’s my problem, but I’ll look into it.
When I rebooted after installing 378.13 and my video was gone, I thought whoops. After Ctrl-Alt-F1 I ran the samples and nvidia-smi; both showed that the 1080 Ti was there but not the GT 610. You have to appreciate that I have two cards, and the 1GB card is connected to the display. Perhaps all I need to do is connect the display to the 1080 Ti card.
Please note there is a new driver in beta, released 6th April. The 378.13 is short lived whereas the 375.51 is long lived. The new driver that supports the 1080 Ti is 381.09; maybe this will be the long-lived supported version. See

http://www.nvidia.com/object/linux-amd64-display-archive.html

for the full archive list or

http://www.nvidia.com/Download/driverResults.aspx/117002/en-us

for the 381.09 driver

Jason · April 13, 2017, 1:15pm

Hey everyone,

This is an awesome thread! Thanks to all the contributors.

I’m currently on lesson 3 (Part 1) and I’ve decided to build my own machine as well. I definitely foresee getting a ton of answers from everyone’s previous struggles, so thanks for that, haha. And hopefully, from my experience I can give back as well - answering questions on the thread and planning on writing a blog article (trying to build a “fast/good enough” machine at $1,000 Canadian).

RogerS49 · April 13, 2017, 3:48pm

@stephenl Caution

echo "alias ju='jupyter notebook --no-browser --port=8889'" > ~/.bashrc

Should that have ‘>>’ instead of ‘>’? As written, ‘>’ would overwrite the existing file with a new one.

Similarly with

echo "alias remote='ssh -N -f -L localhost:8888:localhost:8889 sl@.localdomain'" > ~/.bash_profile
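
To illustrate the difference between the two redirections, here is a minimal sketch that uses a scratch file in place of ~/.bashrc, so the real file is never touched. The `EXISTING_SETTING` line is a made-up stand-in for whatever is already in the file.

```shell
# Scratch file standing in for ~/.bashrc.
rc="$(mktemp)"
echo "export EXISTING_SETTING=1" > "$rc"

# '>>' appends: the earlier line survives.
echo "alias ju='jupyter notebook --no-browser --port=8889'" >> "$rc"
lines_after_append="$(grep -c '' "$rc")"

# '>' truncates the file first: only the new line remains.
echo "alias ju='jupyter notebook --no-browser --port=8889'" > "$rc"
lines_after_overwrite="$(grep -c '' "$rc")"

echo "after >>: $lines_after_append line(s); after >: $lines_after_overwrite line(s)"
```

For adding an alias to an existing ~/.bashrc or ~/.bash_profile, the appending form is the one that keeps the rest of the file.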

leonletto · April 13, 2017, 4:09pm

I just wanted to recommend that, at this time, I would stick with Intel Kaby Lake CPUs rather than AMD Ryzen. I am still getting weird network driver issues with my Ryzen system (regular intermittent network dropouts) after I install the CUDA drivers, and I cannot diagnose them. I suspect it’s a BIOS issue on my ASUS 370 motherboard, but if you are building right now, YMMV and it’s frustrating. I even replaced the onboard NIC with an Intel server card and the same thing happens.

There are no performance or usability issues except for the timeouts. My performance is within 1% of the top speeds I have seen here.

I will post an update if I get this resolved. If someone here works for AMD or ASUS, feel free to reach out.

Leon

stephenl · April 13, 2017, 6:33pm

Thanks Roger - I will try to edit. The greater-than and less-than symbols are also treated as markup instructions, it seems, or there is a bug; I had great trouble with these characters as they disappear on the forum.

stephenl · April 13, 2017, 8:29pm

Roger - driver 381.09 and I are already acquainted. I ran tests on it with cuDNN 6.0.20 and couldn’t see a difference from 378.13; it seemed comparable. 381.09 got the shove when I hit a guest boot cycle issue on a reboot. I had this issue before I did the upgrade to cuDNN 6020, while running (I believe) driver 381.09 on cuDNN 5.1 at the time. It’s a nasty issue where the machine reboots into a guest account (which is set to false, BTW; I never configured it in the first place). You put your password in, it loops back to the guest login wanting the password again, and on it goes ad infinitum. Killing off the issue requires a major rip and replace of CUDA and everything from that point in the instructions, including the lightdm service. So 381.09 got the blame. The most likely cause is really rebooting with the Jupyter server running, as it’s linked to the lightdm daemon or service, but it’s not beyond reasonable doubt that driver 381.09 played a part. So, in order to get on with the course labs, and seeing no real benefit, I stuck with driver 378.13. That’s my story on NVIDIA driver 381.09.

stephenl · April 13, 2017, 11:20pm

Leon - have you seen better Vgg test times with the server in ‘headless’ (text only) mode after your change?

I still get some value at this point out of the GUI, mostly around pulling files from USB and clicking and dragging stuff; it’s easier to visualise what’s where.

But if there’s an actual performance boost in test runs, I will put it into headless mode and use the command line.

leonletto · April 13, 2017, 11:31pm

@stephenl I don’t think there is any extra performance. There is just a little more video RAM available, and having the GUI login makes it harder to troubleshoot things on servers.

stephenl · April 14, 2017, 12:07am

OK, figured as much. I know there’s more RAM on tap if I need it; so far I haven’t used it all except with batch size=128, where I was only just short of it succeeding. I have played with the screen resolution to increase the available RAM. I was just checking with you in case the system is spending noticeable time refreshing the screen and chewing up CPU/GPU cycles. You would think it consumes CPU and/or GPU time or bus time just updating the screen, but if everything is stationary, apparently not.

What’s interesting is that taking larger batch sizes did not increase performance. You’d think bigger batches of data would help if you have the RAM, but it appears not. I have the RAM to do it, but there’s no benefit.

Rothrock42 · April 14, 2017, 4:33am

Has anybody set up their system to use the Wake-on-LAN magic packet? If so, can you share a bit about your home network and what router you’re using?

RogerS49 · April 14, 2017, 5:14am

@stephenl

Appreciate that input; not going there yet. I have 378.13 running with 6020, but my time for the lesson 1 single epoch (the 7th cell input, just after “The punchline: state of the art custom model in 7 lines of code”) was 366 secs.

As with the 375.39 driver, 378.13 is not recognising the name of the GPU. It was good with 375.51. Not sure if that’s an issue.

Here is the output of cell 4 in lesson 1

/home/dl/anaconda2/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6020 on context None
Mapped name None to device cuda0: Graphics Device (0000:03:00.0)
Using Theano backend.

Is that as expected?

My issue may be with the slot position of the card. Many others have built systems based on the motherboard and which lanes are available for which slot. Unfortunately, this being HP hardware, I don’t know what gives the best support for the 1080 Ti.
Or maybe some parameter or option needs to be set in its configuration.

Anyway, here is my take on the solution to getting 6020 recognised.

In ~/.bashrc I had exported CPLUS_INCLUDE_PATH, LIBRARY_PATH and LD_LIBRARY_PATH with cuda-8.0 but had neglected to add /usr/local/lib and /usr/local/include/gpuarray, where gpuarray was installed. When I did this, the result is as above.

Any comments?
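
For anyone following along, the ~/.bashrc exports described above might look roughly like the sketch below. The CUDA prefix /usr/local/cuda-8.0 and its include and lib64 subdirectories are assumptions (the usual default install location, not stated in the post), and which variable each gpuarray path goes into is also a guess based on the variable names.

```shell
# Assumed CUDA 8.0 install prefix (not stated in the post).
CUDA_HOME=/usr/local/cuda-8.0

# Header search path: CUDA headers plus the gpuarray headers.
export CPLUS_INCLUDE_PATH="$CUDA_HOME/include:/usr/local/include/gpuarray${CPLUS_INCLUDE_PATH:+:$CPLUS_INCLUDE_PATH}"

# Link-time and run-time library search paths: CUDA plus /usr/local/lib.
export LIBRARY_PATH="$CUDA_HOME/lib64:/usr/local/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The `${VAR:+:$VAR}` form keeps whatever was already in each variable and avoids a trailing colon when the variable was empty.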