I’m doing distributed training of the U-Net segmentation model in SageMaker. I’m able to train with a single `fit_one_cycle` call, but three things I haven’t got working are reloading the best model, two-phase training, and using `lr_find`.
What I’d like to do is train the network head using `SaveModelCallback` to save the best epoch rather than just the last, then load the best-epoch weights, unfreeze, and train the rest of the network (related to #2). The issue seems to be that `SaveModelCallback` only saves `best.pth` to the filesystem of the master instance, so the worker instances fail to find the file to load. I could have the master instance push its weights to S3 and have the workers pull and load them, but I’d think there’s probably a cleaner way to do this. Anybody have an idea?
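One workaround that avoids S3 entirely would be to load the checkpoint on rank 0 and broadcast the tensors over the existing process group. This is a minimal sketch, assuming `torch.distributed` is already initialized and that `best.pth` holds a plain `state_dict` (fastai’s actual save format may wrap it differently, so adjust the `torch.load` step to match):

```python
import torch
import torch.distributed as dist

def sync_best_weights(model, path="best.pth"):
    """Load best.pth on rank 0, then broadcast every tensor to all ranks.

    Only the master instance needs the file on disk; the workers receive
    the weights over the wire instead of reading them from a filesystem
    they don't have.
    """
    if dist.get_rank() == 0:
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state)
    # Broadcast parameters and buffers from rank 0 to every worker.
    for t in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(t.data, src=0)
```

Each rank would call `sync_best_weights(learn.model)` at the same point in the script, so the broadcasts line up across processes.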
When I call `learn.unfreeze()` and then make a second fit call for fine-tuning, training also crashes:
```
Starting training of entire network…
epoch train_loss valid_loss acc_unet time
algo-1:50:87  transport/net_socket.cu:188 NCCL WARN Message truncated : received 1048576 bytes instead of 32768
algo-1:50:87  transport.cu:153 NCCL WARN transport.cu:153 -> 3 [Proxy thread error]
```
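That “Message truncated” warning is what you’d expect if the ranks end up all-reducing different-sized gradient buffers, which can happen when `requires_grad` changes after `DistributedDataParallel` has already wrapped the model and registered its reduction buckets. A sketch of one conservative workaround, run identically on every rank (this rebuild approach is an assumption on my part, not a confirmed fix):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def unfreeze_and_rewrap(model):
    """Unfreeze every parameter and rebuild the DDP wrapper.

    DDP decides how to bucket gradients when it wraps the model, so
    unfreezing layers afterwards can leave ranks exchanging buffers of
    different sizes. Unwrapping, unfreezing, and re-wrapping makes the
    buckets consistent again.
    """
    if isinstance(model, DDP):
        model = model.module          # drop the stale wrapper
    for p in model.parameters():
        p.requires_grad = True        # unfreeze everything
    return DDP(model)                 # fresh wrapper, fresh buckets
```

If this is the cause, the key point is that the unfreeze-and-rewrap step has to happen on every process, not just the master, before the second fit call.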
And finally, `lr_find` seems to fail on the worker instances due to a similar issue: a required file isn’t found.
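Since `lr_find` checkpoints and restores the model through the local filesystem, one hedged workaround is to run the sweep on rank 0 only and broadcast the chosen learning rate to the workers. `lr_find_rank0` and `default_lr` below are names I made up for illustration; how you extract the suggested LR from `learn` depends on your fastai version:

```python
import torch
import torch.distributed as dist

def lr_find_rank0(learn, default_lr=1e-3):
    """Run the LR sweep on rank 0 only and share the result.

    The sweep's temporary checkpoint only exists on the instance that
    ran it, so keeping the whole thing on rank 0 avoids missing-file
    failures on the workers.
    """
    lr = torch.tensor(float(default_lr))
    if dist.get_rank() == 0:
        learn.lr_find()
        # Replace default_lr with the suggestion your fastai version
        # exposes (e.g. via learn.recorder) before broadcasting.
    dist.broadcast(lr, src=0)
    return lr.item()
```

All ranks then pass the returned value to the subsequent fit call, so everyone trains with the same learning rate.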
I’m still able to train a well-performing model without these features, but I’d like to get a better understanding of the underlying issues. Any tips or suggestions greatly appreciated!