To_distributed() with SaveModelCallback

They are getting called no matter what, even with 1 epoch; see fit().

I have edited the command:

NCCL_DEBUG=WARN python ../fastai/fastai/launch.py --gpus=3456 ./multi_gpu_debug.py --arch_name=resnet18 --model_suffix=_non_overlap --fold_num=0

Let's see if any meaningful error comes up.

So it looks like this monkey patch has worked at least initially. I'll try multi-phase training now.

Ok, the hang problem is with EarlyStoppingCallback.

Without any callbacks the script works fine.

Probably the master and the workers get out of sync because training is stopped early?

Hmm, just ran a multi-phase test and that completed successfully, but with too few epochs to trigger early stopping. Will retry with more epochs and low patience now.

Yeah, if the error occurs again after early stopping while going from stage-1 to stage-2, we can be sure: early stopping breaks the fit function. Maybe the best option would be to use your monkey patch if it seems to work. That way we would at least have access to the best models during training, a kind of early stopping without actually stopping :slight_smile: But it would be nice to understand why breaking out of the fit function causes such a hang, hmm…
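
For intuition on why a one-sided stop hangs: with DistributedDataParallel every backward pass performs a gradient all-reduce, so if one rank leaves fit() early, the remaining ranks block forever waiting for it in the next collective. A minimal sketch (just an illustration with plain torch.distributed, not code from either of our scripts) of how a stop decision would have to be agreed on by all ranks:

import torch
import torch.distributed as dist

def should_stop(local_stop: bool) -> bool:
    # Share this rank's stop decision with every other rank and take the max,
    # so all processes see the same answer and leave the training loop together.
    flag = torch.tensor([1 if local_stop else 0], device='cuda')
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())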

   self.learn.load(f'{self.name}', purge=False)
 File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
   state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
 File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
   return _load(f, map_location, pickle_module)
 File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
   deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 1776574026747881532 got 128

Got this error using only SaveModelCallback with the monkey patch.

That error is definitely a new one to me. I was just able to use both callbacks successfully, though (ignore the #quality_metric printout callback I added):

Is this with multiple stages, with unfreezing, and with both EarlyStoppingCallback and SaveModelCallback?

If that's the case, can you share your script? Thanks.

Yes, I used both callbacks and trained the network head, unfroze, then trained the rest of the network with differential lrs. I'm not really allowed to share the entire script, but I can answer any questions. One difference I see is that I was having issues with setup_distrib in SageMaker, so I instead initialized distributed training similar to the way the SageMaker PyTorch examples recommend:

import argparse, ast, logging, os
import torch

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
# SageMaker exposes the cluster layout through environment variables.
parser.add_argument('--hosts', type=str, default=ast.literal_eval(os.environ['SM_HOSTS']))
parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])
# (--backend and --num_gpus are parsed the same way, then: args = parser.parse_args())

print('Turning on distributed training.')
print('hosts:', args.hosts)
print('current_host:', args.current_host)
# One process per host: world size = number of hosts, rank = index of this host.
world_size = len(args.hosts)
os.environ['WORLD_SIZE'] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ['RANK'] = str(host_rank)
print('world_size:', world_size)
print('host_rank:', host_rank)
# Each instance has a single GPU, so device 0 is used everywhere.
torch.cuda.set_device(0)
torch.distributed.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
logger.info('Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
        args.backend, world_size) + 'Current host rank is {}. Number of gpus: {}'.format(
        host_rank, args.num_gpus))
learn.to_distributed(0)

Why are you specifying the gpu id in learn.to_distributed(0)? Does this actually run on multiple gpus? Have you checked?

I'm only using distributed training, not parallel, and each machine only has 1 GPU, so I hardcoded the gpu index. This is because the Dynamic U-Net docs have a warning note saying that parallel training doesn't work: https://docs.fast.ai/vision.learner.html#unet_learner

I am also using to_distributed, but when I hardcoded the gpu id I saw that multiple processes were spawned on the same gpu, not on multiple gpus. If you see processes being spawned on multiple gpus then it should be fine; check with e.g. watch gpustat.
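
To illustrate the single-machine case (a sketch, assuming the script is started via fastai v1's launch.py, which spawns one process per GPU and appends a --gpu=<id> argument to each, and assuming the setup_distrib helper from fastai.distributed): each process should pin itself to its own device and pass that same id to to_distributed(), rather than hardcoding 0 for every process.

import argparse
from fastai.distributed import *  # setup_distrib and the to_distributed patch for Learner

parser = argparse.ArgumentParser()
parser.add_argument('--gpu', type=int, default=0)  # filled in per process by launch.py
args = parser.parse_args()

setup_distrib(args.gpu)                 # set the CUDA device and join the process group
# ... build the DataBunch and Learner here as usual ...
learn = learn.to_distributed(args.gpu)  # train on the GPU this process owns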

Also, this thread suggests not saving the model from the slave processes, as they can corrupt the file when they all try to write to it at the same time; I guess that's why we had that condition in save(): https://github.com/pytorch/pytorch/issues/12042
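
For reference, here is the idea behind that condition, sketched with plain torch.distributed rather than fastai's own helpers (an illustration, not the actual code in save()):

import torch.distributed as dist

def is_master() -> bool:
    # Non-distributed runs count as master; otherwise only rank 0 does.
    return not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0

def save_checkpoint(learn, name: str):
    # Only the master process touches the .pth file, so a worker can never corrupt
    # a checkpoint that rank 0 is in the middle of writing.
    if is_master():
        learn.save(name)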

In my case the training script is being run identically on completely separate EC2 instances rather than using distributed training on a single multi-gpu machine. Could that be the difference?

I wonder if that issue doesn't affect me because I'm not doing parallel + distributed and have separate filesystems, so nothing is trying to write to the same file.

Hmm, I see, thanks for the clarification. I will try some other stuff.

No luck, couldn't solve the issue :frowning:

If any other info about my setup would help, please let me know. Just to make sure I understand: the hardware you're testing on is a single multi-gpu machine?

Yes, it is a single-node machine with 8 gpus, of which I use 3-4. I can share my scripts with you:

Thanks a lot for your help!

And you said just turning off EarlyStoppingCallback makes everything work, or is turning off all callbacks required?

I could see how EarlyStoppingCallback wouldn't work without patching, because it calls learn.load at the end of training, but I'm not quite sure why patching has fixed both callbacks for me but only SaveModelCallback for you.
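
One guess on my side (an assumption, not a confirmed fix): when the callback calls learn.load at the end of training, a worker could start reading the .pth file while rank 0 is still writing it, which would also explain the "storage has wrong size" error above. Syncing before loading would look roughly like this:

import torch.distributed as dist
from fastai.basic_train import Learner

_orig_load = Learner.load

def _synced_load(self, *args, **kwargs):
    # Hypothetical patch: wait at a barrier so no worker starts reading the checkpoint
    # before the master process has finished saving it.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    return _orig_load(self, *args, **kwargs)

Learner.load = _synced_load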

Also, I'm not using ReduceLROnPlateauCallback or CSVLogger, although those probably aren't an issue.

It fails with either learn.load or EarlyStoppingCallback.