SaveModelCallback in distributed training

bwangwp · August 30, 2019, 2:21pm

I followed the tutorial at https://github.com/fastai/fastai/blob/master/docs/distributed.md and converted my notebook to run on multiple GPUs. One thing I noticed is that at the end of every epoch my model is validated on all the GPUs, so the absolute best model might be overwritten by a save from another process. I tried to name the best model differently per process by appending the GPU ID to the model name, but it seems that only the model on device 0 is saved. Is there an example of how to save the best model when training on multiple GPUs?
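
To make the question concrete, here is a minimal, library-independent sketch of the behaviour I am after: every process tracks the best validation metric, but only the process with rank 0 writes the checkpoint. `BestOnRankZero`, `rank`, and `save_fn` are hypothetical names for illustration, not fastai API. The sketch assumes each process sees the same validation metric and that `save_fn` would wrap whatever actually saves the model.

```python
from typing import Callable, Optional


class BestOnRankZero:
    """Track the best validation metric; write a checkpoint only on rank 0.

    Hypothetical helper for illustration, not part of fastai.
    """

    def __init__(self, rank: int, save_fn: Callable[[str], None],
                 name: str = "bestmodel", higher_is_better: bool = False):
        self.rank = rank                  # process rank (assumed 0 for the main process)
        self.save_fn = save_fn            # assumed to persist the model under `name`
        self.name = name
        self.higher_is_better = higher_is_better
        self.best: Optional[float] = None

    def _improved(self, value: float) -> bool:
        if self.best is None:
            return True
        if self.higher_is_better:
            return value > self.best
        return value < self.best

    def on_epoch_end(self, value: float) -> bool:
        """Return True if this process wrote a checkpoint."""
        if not self._improved(value):
            return False
        self.best = value                 # every process remembers the best value
        if self.rank != 0:                # non-zero ranks never touch the disk
            return False
        self.save_fn(self.name)
        return True
```

With this shape there is a single file name and a single writer, so no process can overwrite a better checkpoint saved by another one. Whether this matches what `SaveModelCallback` already does under the distributed setup is exactly what I am unsure about.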