I came across this paper (https://www.cs.rice.edu/~as143/Papers/SLIDE_MLSys.pdf) implementing efficient sampling on the CPU, apparently outperforming TensorFlow on a GPU. I would love to get your take on this: do you - the fast.ai community - think it is legit, and if so, how could we implement it?
training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 […] On the same CPU hardware, SLIDE is over 10x faster than TF.
I would put this paper in the "technically true but unlikely to be useful" category, as it requires a large number of CPU-specific and algorithmic modifications (which might not be applicable to arbitrary networks) just to equal one GPU.
AMD has desktop CPUs with 64 cores now. I think this could be very useful.
Interesting. Some consideration of how well this generalizes would be good.
I understand your skepticism. It certainly remains to be seen whether this generalises to different kinds of networks. However, GPUs are, generally speaking, the more constrained computing device, so I hope CPUs can make a comeback. Maybe someone with more hardware knowledge than me can comment on this. It might be that in terms of raw FLOPS the gap is just too wide; I don't know.
One problem with CPUs is that you get far fewer FLOPS per unit of hardware volume and energy consumed (which is why most supercomputers are converting to GPUs).
I am sure you can beat a single GPU if you have 50 cores (even with old tricks such as network averaging), but that takes a powerful server, whereas the GPU would fit in a laptop.
Furthermore, while GPU operations are limited in general, that is not a problem for deep learning because you parallelize identical operations across a batch (which might seem obvious, but it is a genius way to use restricted hardware on arbitrary problems).
A good question, however, is: how easily can one currently parallelize training on CPUs only? PyTorch has some functions to help with that, and I have been told Horovod can also help, but I don't know how accessible it currently is.
It's true that the GPU is the more constrained computing device, but that's only relevant if you need the extra computing options. Given the nature of neural networks, I'm not immediately convinced that the CPU's flexibility counteracts the GPU's immense amount of extra raw compute power.
Looking at this paper specifically: if the results hold up, this could be pretty major. Their proposal doesn't need the extreme parallel capabilities of a GPU, but instead uses smart sampling and asynchronous SGD for improved performance. However, the advantages only apply in a very specific domain, namely high-dimensional, extreme-multiclass tabular data. At least in this paper, it is not shown to generalize to the visual, textual, or audio domains.
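For intuition, the "smart sampling" is locality-sensitive hashing: hash each neuron's weight vector into buckets once, then at training time activate only the neurons whose bucket matches the input, skipping the full-layer matrix multiply. A minimal toy sketch of the idea with SimHash-style random hyperplanes (all sizes and names here are invented, and this omits SLIDE's multiple hash tables, periodic rehashing, and asynchronous updates):

```python
import random

random.seed(0)
D, N, K = 8, 100, 4  # input dim, neurons in the layer, hash bits

# K random hyperplanes, shared between inputs and neuron weights.
planes = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def simhash(v):
    # K-bit signature: sign of the dot product with each hyperplane.
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) > 0)
                 for p in planes)

# Hash every neuron's weight vector into buckets ahead of time.
weights = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
buckets = {}
for idx, w in enumerate(weights):
    buckets.setdefault(simhash(w), []).append(idx)

# Forward pass: only neurons colliding with the input are activated,
# so work scales with the bucket size, not with N.
x = [random.gauss(0, 1) for _ in range(D)]
active = buckets.get(simhash(x), [])
print(len(active), "of", N, "neurons activated")
```

Because similar vectors are likely to collide, the activated subset is biased toward neurons with large inner products with the input, which is what lets SLIDE train on a tiny fraction of the network per sample.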
I thought it was saying that an 8th-gen Intel i5 can beat a Tesla V100 with some code.
One advantage of using a GPU is that it runs in parallel with your CPU: while the GPU is busy training, the CPU can already load and preprocess the next batch. By moving everything to the CPU, you lose this free parallelism, so that would be a downside.
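That overlap is a classic producer-consumer pipeline: the loader fills a small prefetch buffer while the trainer drains it. A toy sketch of the pattern using only the standard library (the sleeps stand in for real preprocessing and training work; on a CPU-only setup you would dedicate some cores to the loader side to recover part of this overlap):

```python
import queue
import threading
import time

def loader(q, n_batches):
    # Producer: simulates CPU-side data loading / preprocessing.
    for i in range(n_batches):
        time.sleep(0.01)  # pretend preprocessing work
        q.put(i)
    q.put(None)  # sentinel: no more batches

def train():
    q = queue.Queue(maxsize=2)  # small prefetch buffer
    t = threading.Thread(target=loader, args=(q, 5))
    t.start()
    seen = []
    # Consumer: "trains" on each batch while the next one is prepared.
    while (batch := q.get()) is not None:
        time.sleep(0.01)  # pretend training work
        seen.append(batch)
    t.join()
    return seen

print(train())  # → [0, 1, 2, 3, 4]
```

With the buffer in place, loading batch i+1 overlaps with training on batch i, so total wall time approaches the max of the two stages rather than their sum.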
… + an external >300W power source?