The Tesla V100 PCIe currently sells for roughly $10,000 (depending on the 12GB or 16GB version, and local currency).
Nvidia yesterday announced a new Titan GPU called “Titan V”, already available for purchase, which shares quite a lot of specs with the V100 PCIe version but costs $3,000.
The obvious/tricky question (for @jeremy) is: for $3,000, what’s the best Christmas deal for deep learning:
- getting four units of the GTX 1080 Ti ($750 apiece), or
- a single Titan V ?
My 2 cents:
I would get the four GTX 1080 Tis. Why? If you’re already at the point of making such a hefty investment, I think it’d be much more effective to learn how to write NNs for multi-GPU architectures, in addition to regular single-GPU systems.
As Jeremy might also tell you, having multiple GPUs allows one to test more than one model at the same time (i.e. using different hyperparameters or architectures).
Six months down the road, Nvidia will probably release a newer, faster, flashier GPU. Again, I believe that if one’s work is important enough to warrant the latest and greatest, programming for a GPU-clustered environment becomes inevitable. So why not start learning that now!
Again, just my 2 cents, and your situation could surely be different.
Totally agree - wish I had more GPUs. I’ve literally been running one training job for two days and I’m basically stuck.
It’ll be interesting to see if tensor-core performance is enabled on their next generation of GTX cards (GTX 1180?) when they drop (expected Q1).
I agree that having multiple GPUs is very desirable. But could the increased tensor-processing performance make up the difference? I guess we could benchmark an AWS P3 vs. a Paperspace P6000 and/or a home 1080 Ti on the shape of problem we are working on, to get an idea.
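For a rough sense of how one might structure such a benchmark, here is a CPU-only sketch with NumPy as a stand-in (the sizes and the `time_matmul` helper are my own invention, not fastai code; on a real card you would time `torch` tensors on the device instead):

```python
import time
import numpy as np

def time_matmul(n, dtype, repeats=5):
    """Best-of-N wall-clock time for an n x n matmul in the given dtype.
    A CPU stand-in for the kind of GPU benchmark discussed above."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b  # the operation being timed
        best = min(best, time.perf_counter() - t0)
    return best

print('float32:', time_matmul(512, np.float32))
print('float64:', time_matmul(512, np.float64))
```

Taking the best of several repeats filters out warm-up and caching noise, which matters even more on a GPU where the first call pays kernel-launch and allocation costs.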
Writing NNs for multi-GPU rigs seems to be “as simple as” inserting two lines at the top of your notebook, in the spirit of:
`os.environ['CUDA_VISIBLE_DEVICES'] = '3,2,1,0'` and `NUM_CUDA_DEVICES = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))`.
Then `torch.cuda.set_device(0)` targets the specific GPU in PCIe slot “0” in a first notebook, `torch.cuda.set_device(1)` the one in slot “1” in a second notebook, and so on, to run several notebooks at the same time on a multi-GPU rig.
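Spelled out as a runnable sketch (the `'3,2,1,0'` slot ordering and the per-notebook pattern are just illustrations; the `torch` part is guarded since it only matters on a CUDA machine):

```python
import os

# Restrict which physical GPUs this process can see, and in which order.
# With this string, logical device 0 maps to the card in PCIe slot 3.
os.environ['CUDA_VISIBLE_DEVICES'] = '3,2,1,0'
NUM_CUDA_DEVICES = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))
print(NUM_CUDA_DEVICES)  # 4

# In a second notebook you would instead expose a single card, e.g.:
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # only the GPU in slot 1

try:
    import torch
    if torch.cuda.is_available():
        # PyTorch numbers the *visible* devices 0..NUM_CUDA_DEVICES-1
        torch.cuda.set_device(0)
except ImportError:
    pass  # torch not installed; the env-var trick is CUDA-level anyway
```

Note that `CUDA_VISIBLE_DEVICES` must be set before CUDA is initialized in the process, which is why it goes at the very top of the notebook.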
What remains unclear is the impact of the “640 tensor cores” in the Titan V/Tesla V100, which don’t exist in the 1080 Ti or the Titan Xp.
I would much rather have even two 1080 Tis. The value just doesn’t seem to be there on these, at least for what I’m working on. For me it’s about value, and the 1080 Ti is just a much better value than this beast.
Also, there are additional hardware costs involved in the “four 1080 Ti vs. single Titan V” choice.
The platform required to host four 1080Ti is quite expensive (motherboard, power supply etc.), while a Titan V will fit in a regular motherboard/PSU.
Check out Tim Dettmers’ discussion on Twitter (he’s the author of http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/).
Do you know the total cost of building a PC with that GPU (the new one)? I am interested.
It would be the same as building one that supports a single 1080 Ti, plus the difference in the cost of the card (same power requirements, etc.).
Not quite. You have to use
I responded to a similar question here:
Basically, I would opt for a Titan V for the aforementioned reasons.
Btw, there is something else I didn’t think about at first, but which can orient your choice toward a Titan V even more (and correct me if I’m wrong).
Apart from the new “tensor cores” (which seem to mix fp16/fp32 units), the interesting thing is the fp16 compute capability. As said here, the V100 architecture (Volta) seems to have twice as much fp16 compute power as fp32. This means that if you manage to get your deep learning framework to work on fp16 matrices, you halve your VRAM usage, which translates to bigger batch sizes in your code.
So you can consider your current GTX 1080 Ti stuck with fp32 operations on 11GB of VRAM (as it does not work well with fp16), while a Titan V can use its fp16 compute capability on 12GB of VRAM; since matrices take half the memory, you can think of your Titan V as having 24GB of effective VRAM.
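The memory-halving claim itself is easy to sanity-check on the CPU with NumPy (the tensor shape below is an arbitrary example; in PyTorch the analogous step would be calling `.half()` on your tensors/model):

```python
import numpy as np

# A toy batch of activations: 64 images x 256 channels x 28 x 28.
acts_fp32 = np.zeros((64, 256, 28, 28), dtype=np.float32)
acts_fp16 = acts_fp32.astype(np.float16)  # same values, half the bytes

print(acts_fp32.nbytes)  # 51380224 bytes (~49 MB)
print(acts_fp16.nbytes)  # 25690112 bytes (~24.5 MB), exactly half
```

The open question, as discussed below, is whether the conversion is really “for free” numerically, not whether it saves memory.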
Interesting point, did you share it on KN ?
I didn’t, as I’m not an expert, so maybe I’m not 100% correct (hence the “correct me if I’m wrong”). The part that looks suspicious to me is the idea that you can turn your float32 matrices into float16 “for free”. If you look at what the tensor cores of the Volta architecture actually do, it’s a mix of fp32 and fp16. So the question is: why do they use this mix if we could just use the fp16 compute capability directly? A lot of people (myself included) claim we can just convert to float16, but it may be more complicated than that.
In any case, what is certain is that the Volta architecture is tailored for fp16, and there is a way to run DL models in that “configuration”, which in all cases translates to a smaller footprint in GPU VRAM.
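One plausible reason for the fp16-multiply/fp32-accumulate mix (my own illustration, not something from the NVIDIA docs): a pure-fp16 running sum stops making progress once the gap between representable fp16 values around the accumulator exceeds the addend.

```python
import numpy as np

# Accumulate 10,000 small "gradient-sized" values two ways.
# True total: 10000 * 1e-4 = 1.0

acc16 = np.float16(0.0)
for _ in range(10000):
    acc16 += np.float16(1e-4)   # rounds into fp16 at every step

acc32 = np.float32(0.0)
for _ in range(10000):
    acc32 += np.float32(1e-4)   # fp32 accumulator, as the tensor cores use

print(float(acc16))  # stalls far below 1.0 (at 0.25, where fp16 spacing > 1e-4)
print(float(acc32))  # ~1.0
```

So even if weights and activations live in fp16, doing the reduction in fp32 is not just a hardware quirk; it protects exactly this kind of long accumulation.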
The good thing with KN is that you can share questions/resources on hardware/software and maths/stats claims; there will probably be an advanced user to pick it up and confirm/deny it (like Anokas/CPMP/KazAnova/Laurae & co).
I’d post it with a “does it make sense ?”
BTW, the discussion on the link you provided from Nvidia forums for “Titan V FP16 Performance” evolved a bit since yesterday.
As I am using a Titan V (and Titan XP) and am trying to benchmark their performance, I moved a previous post to this thread.
As background, I decided to subsidize/rationalize my deep learning GPU purchase of a Titan V and a Titan Xp by using them for Ethereum crypto-mining. As a result, as discussed below, I have come across some puzzling phenomena.
When I run the following code without any other jobs running, it is significantly slower than when the GPU is running another process (specifically, when it is under heavy load running crypto-mining software). I have repeated the trials numerous times to make sure there were no differences in pre-computing or caching taking place. Moreover, I have tested this on and off over several weeks with the same result. I have used `nvidia-smi` to verify what jobs are running on the GPU. Here are the times:
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))  # build the data object from the lesson-1 folder layout
learn = ConvLearner.pretrained(arch, data, precompute=True)  # precompute activations: the step being timed
This really doesn’t make sense to me.
In trying to figure it out, I was wondering if anyone else is using either a Titan V or a Titan X. If so, could you let me know how long the above code takes for you? This is straight out of Lesson 1. Note that in `learn.fit(0.01, 5)` I am running 5 epochs vs. 3.
That is a pretrained model in those first few steps; it runs “fast” no matter what. I would be more interested in you running the entire notebook. Then look at the widgets and processing times for the `learn.fit` operations that are more computationally intensive, in the data augmentation and fine-tuning sections of that lesson1 notebook. If you could do that and let me know, or post to this thread, it would be appreciated.
@FourMoBro thanks for the response. I will run it and post some times there.
Note that fastai isn’t currently optimized for the tensor cores on this GPU.
I considered delaying my new car and investing in a Titan V. After all, 3 grand is not too much for 110 TFLOPS, if you are serious about DL.
But what I find a bit unsettling is that your tensors have to have all dimensions be multiples of 8 in order to leverage the tensor cores.
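If that constraint holds (as I understand the cuBLAS tensor-core requirements for fp16 GEMMs on Volta), the practical workaround is just to round layer sizes up. A hypothetical helper:

```python
def round_up_to_multiple_of_8(n: int) -> int:
    """Smallest multiple of 8 that is >= n, so an fp16 matrix dimension
    can satisfy the tensor-core constraint mentioned above."""
    return ((n + 7) // 8) * 8

# e.g. pad an awkward embedding size up before building the layer:
print(round_up_to_multiple_of_8(100))  # 104
print(round_up_to_multiple_of_8(64))   # 64 (already a multiple of 8)
```

A few wasted columns of padding is usually a small price for hitting the fast path, since GEMMs that miss the alignment fall back to the ordinary CUDA cores.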