Could you elaborate on that?
i’ll google it for you =)
…and so on…
Thank you. It looks quite interesting. Pity the 2066 platform is so pricey as of now, not to mention ddr4. I think I’ll stick with my 2011/ddr3 machine for a while: in the end, what counts for us is gpu performance.
Hmm, I guess it depends on your situation, but I see using hosted servers as meaning zero hardware maintenance, since the host does it for you. 1 machine at home is much more hardware maintenance than 2 hosted ones (I couldn’t deal with it, I have around 10 remote servers doing different things but my only computer at home is a small laptop). Software maintenance: maybe running multiple machines takes some getting used to, but really if they’re both running the same software then 2 machines isn’t more software maintenance than 1 machine. If you have 1 Linux and 1 Windows then of course that’s more work.
Re ECC and nvme on those servers: good points, though you can (if the servers are in stock at all) get nvme drives at extra cost. Does it matter at all for this sort of thing, which is compute intensive on the gpu? Also I’d expect the gpu itself uses non-ecc ram, so there’s an entry point for ram unreliability. After that it’s just a cost-benefit tradeoff so you have to compare numbers.
736 euro for a new 1080ti is less than I would have expected. Maybe prices of this stuff are finally coming down. Nice.
Regarding the 1180, current rumored announcement date is July 31 so easy availability would be sometime after that. And I’m generally paranoid and cynical toward nvidia so it won’t surprise me if they mess it up some way on purpose, to make it less useful for ML. Look at the GTX vs Quadro pricing difference and sw license nonsense. I’m sure they’d like to subdivide the market even more (gamers, ML, crypto mining) so maybe they’ll try with the 1180.
I wonder what the current issues are with running the fast.ai code on non-nvidia hardware. Is it not just a pytorch client? I plan to look into it at some point. It would be great if it can build with HIP (AMD’s cuda lookalike) since the new Ryzen APUs are quite cheap compared to outboard cards. GPU performance with any luck would be in the area of a low end nvidia (1040 or something?), i.e. still much better than pure CPU.
True. GTXs and Titans have non-ecc vram, while Teslas have ecc. But the point is that vram is in use only while you train your NN. If some bit flips inside it, you end up with one corrupted data point inside some minibatch. Hardly important.
With non-ecc system memory (which is in use for the entire uptime of your computer), on the contrary, you can end up with corrupted system files, which can easily lead to a full OS reinstall on your carefully configured DL rig.
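To get a feel for what one flipped bit does to a single float32 value, here is a minimal sketch using only the Python standard library. The value 0.5 and the chosen bit positions are made up for illustration, and `flip_bit` is a hypothetical helper, not part of any DL library. How much a flip matters depends on which bit is hit: a low mantissa bit is invisible, while a high exponent bit gives a huge outlier.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as float32, with one bit of its encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign (IEEE 754 single precision).
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


pixel = 0.5  # hypothetical normalized input value in a minibatch

print("low mantissa bit :", flip_bit(pixel, 0))   # barely changes
print("sign bit         :", flip_bit(pixel, 31))  # same magnitude, negated
print("top exponent bit :", flip_bit(pixel, 30))  # becomes a huge number
```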
They started to fall down to more reasonable prices as cryptos started their retracement.
The problem with nvidia is that they are a de facto monopoly. Indeed, as soon as the market understood the AI industry was totally dependent on their hardware, the company capitalization jumped up by 1500%.
In some sense they deserved it: they invested heavily in gpgpu back when other brands didn’t give it a f*ck. Look at AMD: their ROCm still is a total mess: Rachel said in a recent blog post “I have a phd in math, and I still cannot manage to understand how to use the darn contraption”. We haven’t seen anything workable from intel nervana as of yet, either.
It is not so, unfortunately. Any DL framework relies upon cuDNN, the low level cuda library for deep learning. From top to bottom you have python/numpy -> high level libraries like fastai and keras -> proper DL frameworks like pytorch and TF -> cuDNN -> cuda calls to hardware.
Like I said, AMD is trying to do something similar with ROCm, but no one managed to do anything useful with it AFAIK.
But pytorch is buildable without cudnn. Do you mean if you build it for cpu, some of the functions don’t work, as opposed to just being slower? Another possibility is that fastai lib makes some direct cudnn calls - could that be happening? I think you can use tf with no cudnn and take a speed hit. That’s interesting about Rocm. I haven’t tried using any of it at all. I know leela zero / leela chess zero can use tf, cudnn, or its own opencl nn code, but cudnn is by far the fastest.
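One way to see what a particular PyTorch install reports about CUDA and cuDNN is sketched below. `describe_torch_backend` is a hypothetical helper name; it only reads flags that PyTorch exposes, and it returns a stub if PyTorch is not installed. It does not answer whether fastai itself calls cuDNN directly.

```python
def describe_torch_backend() -> dict:
    """Report which GPU backends the local PyTorch build says it can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cudnn_enabled": torch.backends.cudnn.enabled,
    }


info = describe_torch_backend()
for key, value in info.items():
    print(f"{key}: {value}")
```

On a CUDA build, setting `torch.backends.cudnn.enabled = False` before training should make PyTorch fall back to its non-cuDNN kernels, which gives a rough way to measure the speed hit on your own model.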
I think if you build it for CPU it will just be terribly slow (that is, not useful in any realistic scenario, no matter how powerful the cpu is).
I think not, but I’ll leave it for @jeremy to answer it properly.
Well if it works at all, that isolates the issue to pytorch, so it becomes a “simple” matter of targeting pytorch to use other hardware. Web search on pytorch and “amd vega” gives conflicting results but at minimum it looks like it’s being worked on. Here’s a post from a facebook AI guy about it 11 months ago (found via someone else’s link on anandtech): https://old.reddit.com/r/MachineLearning/comments/6kv3rs/mopen_10_released_by_amd_deep_learning_software/djpfmu1/
So maybe there is some hope.
Added: another post from same guy: https://old.reddit.com/r/Amd/comments/6lkuzb/facebook_ai_researcher_were_seriously_looking/djwge7k/
Ok don’t want to be hijacking thread but the ROCm port of Torch is here: https://github.com/ROCmSoftwarePlatform/cutorch_hip
Also here’s a tensorflow benchmark suggesting the high end AMD stuff is competitive with high end nvidia: http://blog.gpueater.com/en/2018/04/23/00011_tech_cifar10_bench_on_tf13/
However, the tf benchmark is tf 1.3 (old!) as newer versions have not yet been ported.
I looked at the torch port and the total amount of cuda code is not huge and it’s mostly just bookkeeping, in fact it’s way ugly. Most of it looks like it should be in a lower level wrapper rather than directly in torch.
I think AMD has been ignoring machine learning til recently, focusing on gaming and enjoying the cash windfall of selling gpus to crypto miners. But they are waking up now: https://www.anandtech.com/show/12910/amd-demos-7nm-vega-radeon-instinct-shipping-2018
So nvidia may lose its de facto monopoly soon.
Worry not, this is useful information.
It’s just that it’s cutorch, that is, the cuda backend for torch. How come it works with ROCm?
Cuda is a C++ dialect developed by NVidia for gpu programming. HIP (part of ROCm) is basically an alternate implementation of Cuda or something close to it, for AMD gpus. The idea is you can take a cuda program (cutorch in this case) and run it on an amd board with fairly small changes using hip. I’m new at this and don’t really know the current status or how realistic the plan is. I’ve written C++ code but so far not any cuda. I do have some interest in hacking on cuda/hip code as well as in using the higher level libraries that this course discusses. But that’s not likely in my current situation of using hourly paperspace instances.
One thing I don’t understand is why the opencl backend for leela chess is so much slower than the cudnn back end. It makes me wonder whether cudnn has some special difficult optimizations that maybe use special features of the nvidia board. Otherwise porting torch to opencl seems reasonably doable, and AMD has a decent opencl implementation from what I’ve heard. But based on the leela chess observation, an opencl torch might not be competitive in performance with the cudnn version. I’m getting the impression that opencl is falling out of favor and people see hip/cuda as the way things are going. So let’s hope amd keeps making progress.
Guys, I’m doing some testing with the 1080 ti, and noticed that during training with precompute=False GPU utilization (as per GPU-Z and a plethora of similar apps) never goes over ~50%.
Even worse, when precompute=True, GPU utilization is always under 4% (yes, four).
The same happens with the 1070.
Is this normal? Could you run some testing on your GPUs?
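For anyone who wants to log the same numbers from a script rather than from GPU-Z, here is a minimal sketch that polls `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that the driver reports plain integers for both fields. `parse_smi_output` and `sample_gpus` are hypothetical helper names.

```python
import subprocess

# Ask nvidia-smi for utilization (%) and used memory (MiB), one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_smi_output(text: str) -> list:
    """Turn lines like '47, 3012' into (utilization_percent, memory_mib) tuples."""
    readings = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        readings.append((int(util), int(mem)))
    return readings


def sample_gpus() -> list:
    """Run nvidia-smi once and return one reading per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_smi_output(result.stdout)


# Parsing a captured sample, so this part runs even without a GPU.
print(parse_smi_output("47, 3012\n3, 512\n"))
```

Calling `sample_gpus()` in a loop with a short `time.sleep` while a training run is in progress gives a utilization trace you can compare across cards.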
It would be great to see some competition in the marketplace for Deep Learning GPUs! Glad people are putting in the work to get AMD in the game.
It depends on the training: the size of the model, the batch size, and how many layers are unfrozen.
With precompute=True the amount of computation is reduced even further, so it may not use the whole GPU.
But GPU utilization should normally be close to 90-100%.
Where the bottleneck is (CPU, RAM throughput, IO) is something you have to find out, or it may just be a Python loop somewhere in your code.
For testing, use a “heavy” architecture like ResNet50, a large batch size, and unfreeze all layers; your GPU should then reach 100% unless you have a bottleneck somewhere else.
Mmmhh, thanks. I’ll keep you all posted.
I personally bought the 1060 6gb. It was waaaay cheaper and quite fast. So fast that NVIDIA purposely disabled SLI in these cards since two of them were faster than a 1080Ti.
My plan was to buy a second one later on, at a cheaper price. 15 months later their price is up by 10%! What a world we live in, right? Electronics prices are like good wine nowadays.
I’m currently pondering the possibility to buy a second GPU, but a newer and faster one this time. I am guessing I can have one for prototyping / graphics and another (faster) for training. I have not made sure this would work yet though.
You were right. Once all the layers of a NN are unfrozen, gpu utilization rises to 100% during retraining.
Now, this still leaves us with a question or two:
When we train just the last layer (and PC=False), the computation takes a good amount of time nonetheless. Why is the GPU not fully leveraged in order to finish the training as soon as possible?
When PC=True, the training is a lot quicker, but again: it takes time no matter how much. Since in that case the GPU is not used at all (3-4% at the very best), I wonder what is actually making the calculations (CPU?) and how that time is spent (moving minibatches back and forth?)
It is still leveraged, but only for very short bursts (GPU utilization is an averaged metric). The rest of the time is taken by slower operations on the CPU, loading data onto the GPU, etc.
So the overhead in such cases takes more time than computing the minibatch on the GPU.
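To put rough numbers on the averaging argument, here is a small sketch. All timings are invented for illustration, and it assumes the GPU work and the CPU-side overhead do not overlap, which real data loaders only partly respect.

```python
def average_gpu_utilization(gpu_ms_per_batch: float, overhead_ms_per_batch: float) -> float:
    """Fraction of wall-clock time the GPU is busy for one minibatch.

    Assumes GPU compute and CPU/IO overhead run strictly one after the other.
    """
    total = gpu_ms_per_batch + overhead_ms_per_batch
    if total <= 0:
        raise ValueError("total time per batch must be positive")
    return gpu_ms_per_batch / total


# Hypothetical timings: a tiny head on precomputed activations
# versus a fully unfrozen network.
frozen = average_gpu_utilization(gpu_ms_per_batch=2.0, overhead_ms_per_batch=48.0)
unfrozen = average_gpu_utilization(gpu_ms_per_batch=190.0, overhead_ms_per_batch=10.0)

print(f"last layer only: {frozen:.0%}")
print(f"all layers     : {unfrozen:.0%}")
```

With these made-up numbers the GPU sits idle 96% of the time in the first case, even though each burst of GPU work runs at full speed.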
In other words, if I’m not misunderstanding you:
A more powerful CPU could be useful both when precompute=True, and when precompute=False but with the inner layers frozen.
For the same reasons, an NVMe drive could be important.
Maybe a larger batch size could speed up the training?
Last but not least:
- GPU memory usage changes drastically when you unfreeze the network (with Resnet101 I had to reduce BS from 128 to 24 to avoid cuda out of memory errors on a 1080 ti). This in turn forces us to retune the LR.
- You can fully exploit an expensive GPU (power and memory) only when you retrain big models.
Am I right?
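On the point about retuning the LR after shrinking the batch size: one common starting guess is the linear scaling heuristic, sketched below. This is an assumption brought in from general practice, not something established in this thread, and the base LR of 0.01 is invented. Treat the result as a first guess to check, not as a replacement for re-running a learning rate search.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate in proportion to the batch size (linear scaling heuristic)."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Batch sizes from the post above (128 -> 24); the base LR is hypothetical.
new_lr = scale_learning_rate(base_lr=0.01, base_batch_size=128, new_batch_size=24)
print(f"suggested starting LR: {new_lr:.6f}")
```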
My point is that trying to get 100% GPU utilization while fine-tuning just the last layer may cost you much more effort than the time the fine-tuning itself takes.
Anyway, fine-tuning the last layer runs fast.
You can exploit full GPU power pretty easily by adjusting batch size, but perhaps not on edge cases like these.