Making your own server


#325

@Christina Hi, I am not sure, as I bought a ready-assembled, brand-new workstation direct from HP. The base specification, with all the available options, is defined in this document:

http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04400040

The workstation did not come with a graphics card, so I did not waste money on one I did not need; I then added a 1080 Ti, plus a GT 610 purely for display output.

I went with a single CPU, an E5-2620 v4, which has 8 cores/16 threads, but you can add another later. It came with only 16GB of RAM, but that can be extended to 128GB per CPU (256GB max); it can only use DDR4 registered DIMMs. It has PCIe 3.0, so it is reasonably future-proofed.

The second CPU is a module that includes a further four memory slots; I am not sure if that's available with older CPUs. But I am not an expert.

Make sure you find out which version the CPU is; earlier versions of the E5 2620 have different core counts and support DDR3 rather than DDR4. You can always google the exact model (e.g. “E5 2670”) to get its spec sheet:

http://ark.intel.com/products/75275/Intel-Xeon-Processor-E5-2670-v2-25M-Cache-2_50-GHz
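If you want to double-check what a machine you already have access to actually reports, a quick standard-library-only Python check is enough (output will vary by machine):

```python
import os
import platform

# Report the CPU identifier string and the logical core count
# (physical cores x threads per core). On Linux the full model
# name also lives in /proc/cpuinfo.
print(platform.processor() or platform.machine())
print(f"{os.cpu_count()} logical CPUs")
```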

I know in the UK there are lots of flavours of used Z600 and Z800 workstations on eBay, which I guess have come to the end of their commercial life. I looked at these and decided that, in the long run, getting new would save me buying again next year, plus I now have a 3-year warranty.

Anything else, don't hesitate to ask.


(Christina Young) #326

Roger, thanks for the tip on the used Z800 workstations on eBay… I just ended up buying a Z820 with 2 x 8-core E5-2670s, 128GB RAM, a 2TB HD, and an Nvidia Quadro 4000 card. It's loaded, for just over $1200… now I just have to add a GTX 1080 Ti (the Quadro only has 2GB, so I'll just use it for graphics). That is cheaper than I could have built a single-processor system myself! And I can spend the time I would have spent building it actually building and training my neural nets instead! :wink:


(Karel J Zuiderveld) #327

FYI, I started this course using an HP Z800 workstation but was very disappointed with its performance; I was much happier after I moved my GTX 1080 Ti into an HP Z640.

Having progressed with the course, I now know why performance was so poor. During training, especially when using data augmentation, the CPU is heavily used - and the current pipeline hardly takes advantage of multiple CPU cores either. The slow cores of the Z800 therefore incur a severe performance penalty. I presume this can be fixed with more advanced data loaders (and I might look into that eventually), but that's not done during the first few lessons.
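Until the loaders improve, one workaround is to fan the augmentation work out across cores yourself. A minimal standard-library sketch of the idea (the `augment` function here is a hypothetical stand-in for real image transforms):

```python
import multiprocessing as mp

def augment(item):
    # stand-in for a real augmentation (flip, rotate, scale, ...)
    return item * 2

def make_batches(data, batch_size, workers=4):
    # Run the augmentation on several CPU cores, then slice into batches.
    with mp.Pool(workers) as pool:
        augmented = pool.map(augment, data)
    return [augmented[i:i + batch_size]
            for i in range(0, len(augmented), batch_size)]

if __name__ == "__main__":
    batches = make_batches(list(range(8)), batch_size=4)
    print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14]]
```

The same pattern applies whether the work items are integers or image arrays; the point is that the augmentation is pickled out to worker processes instead of running serially on one core.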


(Thundering Typhoons) #328

Would appreciate some thoughts from the hardware gurus.

Some Core i5-3570 PCs are available on eBay for ~$150. This gets you a basic PC with power supply, motherboard, etc., and one can upgrade the RAM, HDD, and SSD as necessary. For example, see http://www.ebay.com/itm/Dell-OptiPlex-7010-Mini-Tower-Intel-Core-i5-3570-3-40-GHz-42E1-/172629857056?hash=item28318a9320:g:7kcAAOSwvKtY9NiK

These CPUs have a PassMark score of 7153, whereas a brand-new Intel Core i7-7700K @ 4.20GHz has a PassMark score of ~12K. Is it possible to attach a 1080 or 1070 to this machine, or to other similar machines? What are the tradeoffs?


(Karel J Zuiderveld) #329

Older machines like that have slower memory (DDR3), a slower CPU, etc. To get decent performance, you'll want to upgrade the RAM/SSD - and I don't know how much money you would then actually save compared to a “modern” machine.

Oh, and a Google search just revealed that the Optiplex 7010 doesn’t have a PCI-E power cable, so you can’t put in a 1080/1070 anyway.

So: don’t get that system.


#330

Just to share some personal experience: I recently bought a used Dell workstation and built my machine for < $350, and I’ve been rather pleased with the performance, considering how little it cost.

$125 - Dell Precision T3500 workstation (Xeon W3565 processor and 24G of RAM) (Craigslist)
$75 - Corsair CX750M PSU (new)
$140 - GTX 970. (actually, I already had a 970 sitting around, but this is what it’d cost on Craigslist/ebay.)

I installed Ubuntu 16.04 and used the setup script provided with the course materials. Although this PC can accommodate two GPUs, I did find it helpful to remove the original card (an nVidia Quadro FX 580), as its presence seemed to cause some trouble installing nVidia drivers / CUDA etc. I run this machine headless, so the GPU is 100% available for DL.

Using the Lesson 1 Cats/Dogs benchmark, this ran about 400 seconds per epoch on the full set. So far, I’ve not found the performance all that limiting. When I was doing the State Farm homework, e.g., I was able to experiment aplenty and got a result that was well into the top 1/3.

I do plan to replace the 970 with a 1070 soon; it will be interesting to see how much of a gain I get. It’s certainly possible that the PC could bottleneck a more powerful card.

A word of caution: if you go this route, you do have to replace the stock power supply.


(Christina Young) #331

Thanks for sharing, Karel. The one I bought has 128GB of RAM, so I could probably just load most datasets right into memory. :wink:

It has PCIe 3 and a mix of SATA 2 and SATA 3 interfaces. Were those the source of your bottlenecks? They make PCIe adapter cards for SATA devices now to get around any SATA bottlenecks - they are actually pretty cheap.

At any rate, it can’t be any slower than the motherboard and processor I am currently using!


(Christina Young) #332

Shawn, great point about the power supply! I blew out two of them with my GTX 960 in my Win7 machine. Bumped up from 500 watts to 875 and haven't had any problems since! :slight_smile:


(Karel J Zuiderveld) #333

The Z800 workstation is 6 years old, so obviously there are bottlenecks “everywhere” :slight_smile: I suspect the slower CPU/RAM was a performance bottleneck, as well as the slower PCI Express bus.

I came into this course thinking “all training is done on the GPU, so let me get a 1080 Ti and I'm set”. It's now clear to me that the software infrastructure in this field is still young and not optimized in many respects - I see bottlenecks that should not be there. For example, why is data augmentation not done on the GPU? There's special texture-mapping hardware in there to do real-time scaling/rotation/whatever, so at some point I hope to see that functionality in GPU libraries.

Until then, CPU and memory speed will be important if you want to run networks fast. Older computers are fine if you’re not in a hurry though…


(Christina Young) #334

Karel, I hear you… the one I picked up is supposedly only 2 years old, just came off a lease. So I am hoping that performance will be a little better than what you describe. :slight_smile:

It has DDR3 memory also.

Are you sure that keras doesn’t do data augmentation on the GPU? I didn’t notice any big slowdowns when I did it for the dogs/cats and fisheries competitions.

What I DO notice now is how slow image preprocessing is using OpenCV with Python for the cervical cancer data! My machine has been running preprocessing algos on the CPU for a couple of days now, since OpenCV's GPU support is really only exposed from C++, not Python. I seriously considered rewriting my code in C++, but decided that I really needed a faster machine to do this stuff anyway.


(Karel J Zuiderveld) #335

As far as I know, data augmentation is done by an affine_transform() call in scipy, which of course calls other functions to do its job; I didn't bother to drill down further into the stack. If this were done on the GPU (unlikely), you'd still have the problem that the result would likely be pulled back from the GPU, copied into a buffer somewhere, and then uploaded to the GPU again as part of the training.
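For reference, the scipy call in question behaves like this; a toy sketch on a small array (the shift values are arbitrary):

```python
import numpy as np
from scipy.ndimage import affine_transform

img = np.arange(16, dtype=float).reshape(4, 4)

# Each output pixel (i, j) is sampled from the input position
# matrix @ (i, j) + offset. An identity matrix with offset [1, 0]
# shifts the image up by one row; out-of-bounds pixels fill with 0.
shifted = affine_transform(img, matrix=np.eye(2), offset=[1, 0], order=1)
# shifted[0] equals img[1]; the last row of shifted is all zeros
```

Real augmentation just swaps in a rotation/zoom matrix for `np.eye(2)` - and all of it runs on the CPU.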

I'd like to see the GPU apply any affine transform just before an image is used for training; the CPU would then only be responsible for queueing up the “raw” input images rather than also having to do the augmentation.

I'm now using a Z640 (two years old) and indeed, the big slowdown is not happening anymore. But as I was using the same graphics card in both machines, I now know from experience that older systems are not necessarily great systems for deep learning. My Z800 was really state of the art at the time - 2 x 6 Xeon cores at 2.8 GHz, 24 threads - and I wrote OpenMP C++ code with Intel intrinsics on it to do high-throughput processing of digital pathology images. All those cores are not used in the current Keras stack, though; having fewer cores at a higher frequency is much better.

I’m about to start experimenting with the cervical cancer data - cannot give useful feedback on your OpenCV Python experience yet.


(Christina Young) #336

Karel,
For image augmentation I ran across a simple python utility that may help – you basically use it beforehand to generate your augmented images, then train your CNN afterwards. I assume this would speed things up a bit, because 1) you can run augmentations on multiple datasets in different processes at the same time, and 2) you don’t have to generate the augmentations in real time every time you are training (helpful if you are training multiple models on the same data). The drawback, of course, is that now you have a lot of extra files on your disk, and if you decide you want to change the augmentation you have to run it again, producing even more files!

Rob Dawson: https://github.com/codebox/image_augmentor
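That ahead-of-time approach is simple enough to sketch with numpy alone (here `.npy` files and flips stand in for real images and real augmentations):

```python
import os
import tempfile
import numpy as np

def flips(img):
    # two trivial augmentations: horizontal and vertical flip
    return [img[:, ::-1], img[::-1, :]]

src_dir, out_dir = tempfile.mkdtemp(), tempfile.mkdtemp()

# pretend these are your decoded training images
for i in range(3):
    np.save(os.path.join(src_dir, f"img{i}.npy"), np.random.rand(8, 8, 3))

# one-off pass: write every augmented variant to disk,
# then point your training run at the enlarged dataset
for name in os.listdir(src_dir):
    img = np.load(os.path.join(src_dir, name))
    for k, aug in enumerate(flips(img)):
        np.save(os.path.join(out_dir, f"{name[:-4]}_aug{k}.npy"), aug)
```

Three source images times two flips yields six augmented files - and, as you say, the disk usage grows with every augmentation you add.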


(Karel J Zuiderveld) #337

Christina,

Jeremy already used a similar “trick” in one of his classes: capturing the output of the data augmentation, combining it with the original data, and then storing everything in a bcolz file (I think 5 augmentations + the original). That's a perfectly valid approach of course - but the augmented data is no longer truly random, as it is constantly reused during training.

I suspect that best performance can be obtained by preprocessing the input images (to avoid having to do jpeg decoding all over again), stash them in one big file and then memory map that file. Data augmentation should be done on-the-fly; I’d imagine multiple processes can each generate a batch that is then queued for training.
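That one-big-file idea maps nicely onto numpy's memmap; a rough sketch (the sizes and path are made up, and random pixels stand in for decoded jpegs):

```python
import os
import tempfile
import numpy as np

n_images, h, w, c = 100, 32, 32, 3
path = os.path.join(tempfile.mkdtemp(), "images.dat")

# one-time preprocessing pass: decode each jpeg once and write the
# float pixels into a single flat file on disk
store = np.memmap(path, dtype="float32", mode="w+", shape=(n_images, h, w, c))
store[:] = np.random.rand(n_images, h, w, c).astype("float32")
store.flush()

# training time: map the file back in; batches are paged in lazily,
# with no jpeg decoding anywhere in the loop
data = np.memmap(path, dtype="float32", mode="r", shape=(n_images, h, w, c))
batch = data[0:16]
```

On-the-fly augmentation would then read raw slices like `batch` and transform them in worker processes just before training.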

Now, this stuff is not really that important when processing small jpeg images like ImageNet. My passion is processing of large medical 3D datasets - and I'd like to understand how to optimize preprocessing before I attack the challenge of really large data.


(Christina Young) #338

Karel, then you are going to love the cervical cancer contest (except that it's 2D)! The images are messy - medical devices in frame, off-center, blurry, some pictures entirely in green, etc.

I haven’t even started with the CNNs yet. There are some more things I want to explore with the image processing then maybe get to the CNN at the end of this week or next.

BTW, yes I do remember Jeremy doing something with bcolz… seems like ancient history now, and there was so much stuff packed in each lesson that even though I watched every one 2 or 3 times, I still don’t feel like I fully absorbed everything!

One other thought - I suppose you could write code yourself to generate the augmentations in real time between epochs… not as easy as having Keras do it for you, but you could probably use Dask to do it with multi-core utilization…


(Thundering Typhoons) #339

Thanks @kzuiderveld

@shawn – This was exactly the kind of information I was looking for: whether there are cheap PCs coming off eBay that can be augmented with a 1070 to make a reasonable deep learning rig.


(Karel J Zuiderveld) #340

I found the following article that describes the hardware requirements for deep learning: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/.

The comments have some additional suggestions.


(David Gutman) #341

There is a spatial transform function in TensorFlow, but I'm not sure whether it uses the GPU or not. Sadly, it also only works in 2D. It takes a flat transform vector; for an affine transform, that's just the first two rows of the matrix:

| a1 a2 a3 |
| a4 a5 a6 |
| 0 0 1 |

tf.contrib.image.transform
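In plain Python terms, those six parameters map a coordinate (x, y) like this (the values below are arbitrary and encode a pure translation):

```python
# the six affine parameters, laid out as the first two rows of the
# matrix above; these particular values translate by (10, 5)
a1, a2, a3, a4, a5, a6 = 1.0, 0.0, 10.0, 0.0, 1.0, 5.0

def apply_affine(x, y):
    # (x', y') = (a1*x + a2*y + a3, a4*x + a5*y + a6)
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6

apply_affine(2.0, 3.0)  # -> (12.0, 8.0)
```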

If anyone has some C skills and wants to add a 3D version of this (and separable_conv2d!) I’m sure people would be appreciative.


(Karel J Zuiderveld) #342

Thanks for the tip, David. It seems that function converts the images into tensors that are then processed somewhere else in the software (one might assume the tensor backend, i.e. the GPU). But I presume the processed image is then made available in Python again, which implies a download and re-upload over the PCI-E bus - which is not good.

In a perfect world, I'd like to see data augmentation done during the training itself, i.e. there should be an option to apply an affine transform to the images just before they're fed into training (on the GPU). I assume this can be done very efficiently there by the texture-mapping hardware.

But: training takes a while, so there's a good argument for doing data augmentation on the CPU - if only we did it more in parallel. I'll look into Dask this week to understand how to improve parallelism in Python.

I’m still a Data Science newbie though, still on the learning curve - will take some time.


(David Gutman) #343

Dask definitely helps if you are multiprocessing on the CPU. It’s also really simple to get something up and running.

from keras import backend as K
from dask.delayed import delayed
import dask.array as da

SHAPE = (256, 256, 3)
DTYPE = K.floatx()
FILES = [
    '/file/path/1',
    '/file/path/2',
    # ....
]

@delayed
def process_image(file_path):
    # your logic here: load the file and return an np array
    # of shape SHAPE and dtype DTYPE
    ...

# note: from_delayed (not from_array) is what wraps a lazy result
my_data = da.stack([da.from_delayed(process_image(fp), shape=SHAPE, dtype=DTYPE)
                    for fp in FILES])

# you can then use this in model.fit, you don't need fit_generator.

I think you could create a custom Keras layer using the tf.contrib.image.transform function and build it directly into your model, but I haven't tried that yet (and again, it might not run on the GPU).


(Karel J Zuiderveld) #344

Dave, awesome, thanks for your reply! I’ll do some experiments this week wrt performance and hopefully report on it.

I'd hesitate to add a custom layer to Keras; I don't see data augmentation as an integral part of the network we need to train. But I was surprised to find that folks haven't implemented GPU-based image augmentation in the various libs yet (or, if that's a bad idea, reported why that's the case). You'd think it would be a great add-on for NVIDIA's cuDNN…