Hi @hwasiti,
If, after looking at Horovod, you still think our fast.ai group can get to scalable multi-GPU training faster, then I'll take your word for it! I have read Jeremy's DAWNBench blog, and I just wasn't sure whether it was a one-time, model-specific AWS solution or something we can all use. I hope he gets into it more in the current class!
My problem is NLP and building a big fat language model faster, and I didn't know how well the DAWNBench experience would generalize to other models. I think your last URL
https://forums.fast.ai/t/distributeddataparallel-init-hanging/41218/3
ends on a very positive note: @kcturgutlu got his code working and saw linear scaling with the number of GPUs! My goal would be a clean multi-GPU approach that handles 2-4 GPUs on one node and doesn't require a lot of detailed tweaking for each model, roughly the pattern sketched below.
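
For concreteness, here's a minimal sketch of the kind of setup I mean, assuming fastai v1's `fastai.distributed` helpers and the standard `torch.distributed.launch` launcher (the MNIST model/data are just the placeholder from the docs, not my language model):

```python
# train.py -- a minimal single-node multi-GPU sketch, assuming fastai v1.
# Launch one process per GPU with:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse
import torch
from fastai.vision import *
from fastai.distributed import *

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

# Placeholder data/model from the fastai docs; swap in any Learner here.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# Wraps the model in DistributedDataParallel for this process's GPU.
learn = learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1)
```

If I understand the docs right, `to_distributed` is the only extra line per model, which is exactly the kind of low-tweaking approach I'm after, but someone should correct me if a language-model Learner needs more than that.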