Has anyone tried setting up a cluster of machines with GPUs using Uber's Horovod?
I am curious to know what people's experiences were in this regard. For example, things that you wanted to know but that were not clearly spelled out in the documentation.
I know this is a super late response, but we've done quite a bit with Horovod using both TensorFlow/Keras and PyTorch. It was really awesome! We clustered together 100 V100s and it went very smoothly (powers of 2 tend to work better, but 100 GPUs at lower efficiency still beat 64 at "peak" efficiency). I can't remember the exact numbers, but there was a not-insignificant loss of efficiency per GPU; overall, though, the additional parallelization made it worthwhile. I've been interested in trying to see how well a fastai model would do, and exactly what it would take, but unfortunately haven't been able to make time for it yet.
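To give a rough intuition for that per-GPU efficiency loss: Horovod's default collective is ring all-reduce, whose per-worker bandwidth cost is nearly constant in the number of workers, while the number of sequential communication steps grows linearly, so latency and stragglers eat into scaling. Here is a minimal sketch of that cost model (the function name and the 100 MB payload are my own illustrative choices, not anything from Horovod's API):

```python
def ring_allreduce_traffic(num_workers: int, payload_bytes: float) -> float:
    """Bytes each worker sends in one ring all-reduce.

    Ring all-reduce runs in 2*(N-1) steps, each moving a 1/N-sized
    chunk of the gradient buffer, so per-worker traffic is
    2*(N-1)/N * payload -- bounded by 2x the payload no matter
    how many workers participate.
    """
    n = num_workers
    return 2 * (n - 1) / n * payload_bytes


# Example: a 100 MB gradient buffer.
payload = 100.0  # MB
print(ring_allreduce_traffic(2, payload))    # -> 100.0 MB per worker
print(ring_allreduce_traffic(100, payload))  # -> 198.0 MB per worker
```

Per-worker traffic only doubles going from 2 workers to 100, which is why bandwidth usually isn't the bottleneck; the 2*(N-1) sequential steps per all-reduce (plus synchronization overhead) are a more plausible source of the per-GPU efficiency drop at scale.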
@jaredwads Hi, can you share some dummy code for scaling TensorFlow across multiple nodes? I am unable to do so using the latest Horovod and tf12.