Yeah, PCIe lane speed is clearly not the most critical factor for a lot of ML workloads. Again, I can't find much hard evidence online that it matters that much. For example, see this server:
It includes 10 GTX 1080 Ti cards in one chassis. While it's dual CPU, there's no way it can provide enough PCIe 3.0 x16 lanes for all those cards! Ten cards at x16 would need 160 lanes, while even two Xeons provide only around 80, so most of those slots must be running at x8 or sitting behind PCIe switches.
But this graph shows that it still gets a linear speed-up in performance:
Most of the benchmarks on slot speed are for gaming, which is a very different workload.
I remember a post here showing that a convolutional workload couldn't saturate the bus (or even a x1 PCIe slot). The bottleneck is the computation and the memory bandwidth on the GPU card itself (ie. transferring data within the card, without even going out to the CPU's memory).
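If you want to sanity-check this on your own box, here's a minimal sketch (assuming PyTorch and a CUDA GPU; the shapes are just illustrative, not from that post) that times the host-to-device copy against a conv forward pass:

```python
import time
import torch

batch = torch.randn(64, 3, 224, 224)  # a typical conv-net input batch (~39 MB)
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
conv(batch.cuda())                    # warm-up: CUDA context + cuDNN setup
torch.cuda.synchronize()

# Time the PCIe transfer (host -> device).
t0 = time.time()
batch_gpu = batch.cuda()
torch.cuda.synchronize()
transfer_s = time.time() - t0

# Time one forward pass on the GPU.
t0 = time.time()
out = conv(batch_gpu)
torch.cuda.synchronize()
compute_s = time.time() - t0

mb = batch.numel() * batch.element_size() / 1e6
print(f"copied {mb:.0f} MB in {transfer_s * 1000:.1f} ms "
      f"(~{mb / transfer_s / 1000:.2f} GB/s over PCIe)")
print(f"conv forward: {compute_s * 1000:.1f} ms")
```

In training you'd also overlap these copies with compute (async data loading), which hides the transfer cost even further.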
I bought an AMD Ryzen Threadripper 1900X mainly because it supports up to 64 PCIe lanes. But in hindsight I don't think it matters all that much.
When using the AWS P2.xlarge instances, the bottlenecks I see all the time when I run `watch -n 0.5 nvidia-smi` are to do with GPU utilisation, total GPU RAM capacity, single-threaded CPU utilisation while preprocessing data, and CPU RAM while loading my training data into memory (get as much as you can).
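If it helps, here's a rough Python equivalent of what I watch (assuming the standard `nvidia-smi` query flags are available on your driver; Ctrl-C to stop):

```python
import subprocess
import time

# The fields I actually care about when hunting bottlenecks.
FIELDS = "utilization.gpu,memory.used,memory.total"

while True:
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"])
    print(out.decode().strip())  # one line per GPU, e.g. "97 %, 10432 MiB, 11178 MiB"
    time.sleep(0.5)
```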
I bought 2 GTX 1080 Ti cards. Not because I think I'll need to run both of them in parallel, but for faster iteration. Ie. while I'm training and testing on one GPU, I can run another variation on the other GPU.
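A hypothetical sketch of how that works in practice: pin each training script to its own card via `CUDA_VISIBLE_DEVICES`, which has to be set before the framework initialises CUDA (so before importing torch here):

```python
import os

# "0" in the first experiment's script, "1" in the second; each process
# then only ever sees its own 1080 Ti.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.get_device_name(0))  # the masked card shows up as device 0
```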
I think time to iteration is more important in general. At least for Kaggle competitions.
And for that: GPU RAM bandwidth; GPU speed (try to get the best architecture you can afford - eg. currently Pascal); GPU RAM capacity, so you can fit bigger models in memory (I'm constantly reducing model size and batch size to avoid annoying memory capacity issues); and CPU system RAM. I'd recommend at LEAST 32GB, depending on your datasets. I'm personally starting with 64GB, as I've regularly had 50+GB datasets loaded in memory on AWS.
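It's worth doing the arithmetic before you load anything. A quick sketch (the dataset shape here is purely illustrative):

```python
import numpy as np

# Hypothetical dataset: 1M images at 64x64x3.
n_samples, height, width, channels = 1_000_000, 64, 64, 3
n_bytes = n_samples * height * width * channels * np.dtype(np.float32).itemsize

print(f"~{n_bytes / 1e9:.1f} GB held as float32")        # ~49.2 GB
# Storing as uint8 and converting per batch cuts this 4x:
print(f"~{n_bytes / 4 / 1e9:.1f} GB held as uint8")      # ~12.3 GB
```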
I don't think PCIe lane speed is a big issue in practice, as long as we're talking about PCIe 3.0.
Each generation of PCIe doubles the bandwidth of the previous one, so a x8 3.0 slot is about as fast as a x16 2.0 slot, I believe. So it's possible that some of the information out there about PCIe speed refers to older PCIe versions.
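A back-of-the-envelope check of that claim, assuming the standard per-lane, per-direction rates (PCIe 2.0: 5 GT/s with 8b/10b encoding, ~500 MB/s; PCIe 3.0: 8 GT/s with 128b/130b encoding, ~985 MB/s):

```python
PER_LANE_MB_S = {"2.0": 500, "3.0": 985}  # usable MB/s per lane, per direction

for gen, lanes in [("2.0", 16), ("3.0", 8), ("3.0", 16)]:
    gb_s = PER_LANE_MB_S[gen] * lanes / 1000
    print(f"PCIe {gen} x{lanes}: ~{gb_s:.1f} GB/s")

# PCIe 2.0 x16: ~8.0 GB/s
# PCIe 3.0 x8:  ~7.9 GB/s   <- effectively the same as 2.0 x16
# PCIe 3.0 x16: ~15.8 GB/s
```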
But with most motherboards made in the last 2-3 years you should be OK.
Does anyone have some more hard data on the PCIe issue?
The funny thing is, I can't even BUY a motherboard that supports the full 64 PCIe lanes of the Threadripper processor! I thought I'd be "futureproofing" my build, but I doubt it matters. I did buy a full-tower case and a 1500W power supply to support up to 4 GPUs in the future.
Though I think there's a better than even chance that we won't be using GPUs in the future anyway.
Look at Google's TPU and Intel's Nervana.
I think it's likely we'll be running dedicated ASIC hardware in the future which will be faster, higher density and lower power.
If anyone's interested, I highly recommend reading Google's paper on the Tensor Processing Unit - it's a very interesting insight into the kind of machine learning workloads that occur at scale in production.
I thought this table was very illuminating:
Ie. only 5% of their models use CNNs; 61% are Multi-Layer Perceptrons (ie. dense-layer models) and 29% are LSTMs.
Can anyone else comment on the PCIe issue in practice?