A few features not working with distributed training on SageMaker

I’m doing distributed training of a U-Net segmentation model on SageMaker. I’m able to train with a single fit_one_cycle call, but three things I haven’t gotten working are reloading the best model, two-phase training, and using lr_find.

  1. What I’d like to do is train the network head using SaveModelCallback so that the best epoch is saved rather than just the last one, then load the best weights, unfreeze, and train the rest of the network (related to #2). The issue seems to be that SaveModelCallback only saves best.pth to the filesystem of the master instance, so the slave instances fail to find the file to load. I could have the master instance push its weights to S3 and have the slave instances pull from S3 before loading (roughly sketched below this list), but I’d think there’s probably a cleaner way to do this. Anybody have an idea?

  2. When I call learn.unfreeze() and then do a second fit call for fine-tuning, training crashes with an NCCL error (one thing I plan to try is sketched below the list):

Starting training of entire network…
epoch train_loss valid_loss acc_unet time
algo-1:50:87 [0] transport/net_socket.cu:188 NCCL WARN Message truncated : received 1048576 bytes instead of 32768

algo-1:50:87 [0] transport.cu:153 NCCL WARN transport.cu:153 -> 3 [Proxy thread error]

  3. And finally, lr_find seems to fail on the slave instances with a similar issue: a required file isn’t found.
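
For #1, here’s roughly the S3 round trip I had in mind, in case it’s useful to anyone. It’s a minimal sketch, not tested end to end: the bucket name and key prefix are placeholders, and I’m assuming torch.distributed is already initialized and that learn is the unet_learner from before.

```python
import boto3
import torch.distributed as dist

BUCKET, PREFIX = 'my-training-bucket', 'checkpoints/unet'  # placeholders
s3 = boto3.client('s3')
rank = dist.get_rank()

best_path = learn.path/learn.model_dir/'best.pth'
best_path.parent.mkdir(parents=True, exist_ok=True)

if rank == 0:
    # master: SaveModelCallback wrote best.pth locally, push it to S3
    s3.upload_file(str(best_path), BUCKET, f'{PREFIX}/best.pth')
dist.barrier()  # make sure the upload has finished before anyone reads

if rank != 0:
    # slaves: pull the master's copy so learn.load('best') can find it
    s3.download_file(BUCKET, f'{PREFIX}/best.pth', str(best_path))
dist.barrier()

learn.load('best')
```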
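
For #2, one thing I want to try (purely a guess on my part): unfreezing changes which parameters get gradients, so maybe the DDP wrapper’s NCCL buckets end up disagreeing across processes, which would line up with the “Message truncated” warning. Re-wrapping the model after unfreeze might get everything back in sync:

```python
from torch.nn.parallel import DistributedDataParallel

learn.unfreeze()

# Re-wrap so DDP re-registers its gradient buckets over the newly
# trainable parameters; local_rank is whatever the launch script passed in.
learn.model = DistributedDataParallel(
    learn.model.module,   # unwrap the existing DDP instance first
    device_ids=[local_rank],
    output_device=local_rank,
)

learn.fit_one_cycle(10, slice(1e-6, 1e-4))  # illustrative hyperparameters
```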

I’m still able to train a well-performing model without these features, but I’d like to get a better understanding of the underlying issues. Any tips or suggestions greatly appreciated!

So it looks like a lot of my issues stem from this line. Monkey-patching the save and export methods with that line commented out seems to work for now (rough sketch below).
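
In case it helps anyone, here’s roughly what the patch looks like for save (export gets the same treatment). Rather than literally editing the library, I re-implement Learner.save without the guard, so treat the body as my approximation of what fastai v1 does, not an exact copy:

```python
import torch
from fastai.basic_train import Learner
from fastai.torch_core import get_model

def save_on_all_ranks(self, name, return_path=False, with_opt=True):
    # Same behavior as Learner.save, as far as I can tell, minus the
    # check that skips saving on non-master processes.
    path = self.path/self.model_dir/f'{name}.pth'
    if with_opt:
        state = {'model': get_model(self.model).state_dict(),
                 'opt': self.opt.state_dict()}
    else:
        state = get_model(self.model).state_dict()
    torch.save(state, path)
    if return_path: return path

Learner.save = save_on_all_ranks
```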

I’m currently doing distributed training with single-GPU instances (unet_learner); maybe this line was added because of issues with parallel training, where multiple processes on one machine try to write to the same file?