They are getting called no matter what, even with 1 epoch; see fit().
I have edited the command:
NCCL_DEBUG=WARN python ../fastai/fastai/launch.py --gpus=3456 ./multi_gpu_debug.py --arch_name=resnet18 --model_suffix=_non_overlap --fold_num=0
let's see if any meaningful error comes up.
So it looks like this monkey patch has worked, at least initially. I'll try multi-phase training now.
Ok, the hang problem is with EarlyStoppingCallback. Without any callbacks the script works fine.
Probably the master and workers are out of sync because training was stopped early?
Hmm, I just ran a multi-phase test and it completed successfully, but with too few epochs to trigger early stopping. Will retry with high epochs and low patience now.
Yeah, if the error occurs again after early stopping while going from stage-1 to stage-2, we can be sure: early stopping breaks the fit function. Maybe the best option would be to use your monkey patch if it seems to work. That way we would at least have access to the best models during training, a kind of early stop without actually stopping. But it would be nice to understand why breaking out of the fit function causes such a hang, hmm…
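A minimal sketch of that "save the best model without actually stopping" idea, independent of fastai (the class name `BestSnapshot` and the `save_fn` hook are hypothetical, not from the actual monkey patch): track the monitored metric each epoch, snapshot on improvement, and never request `stop_training`, so every distributed worker runs the same number of epochs and stays in sync.

```python
class BestSnapshot:
    "Hypothetical sketch: snapshot the best model, but never stop training early."

    def __init__(self, save_fn, mode='min'):
        self.save_fn = save_fn  # e.g. lambda: learn.save('best') in fastai
        self.mode = mode
        self.best = float('inf') if mode == 'min' else float('-inf')

    def on_epoch_end(self, metric):
        # Save a snapshot whenever the monitored metric improves ...
        improved = metric < self.best if self.mode == 'min' else metric > self.best
        if improved:
            self.best = metric
            self.save_fn()
        # ... but never ask the training loop to stop, so all ranks
        # keep calling fit() for the same number of epochs.
        return False
```

After training, you would reload the `'best'` snapshot manually, which gives you the early-stopping benefit without desynchronizing the workers.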
self.learn.load(f'{self.name}', purge=False)
  File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
    state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 1776574026747881532 got 128
I got this error using only SaveModelCallback with the monkey patch.
That error is definitely a new one to me. I was just able to use both callbacks successfully, though (ignore the #quality_metric printouts from a callback I added):
Is this with multiple stages, with unfreezing, and with both EarlyStoppingCallback and SaveModelCallback? If that's the case, can you share your script? Thanks.
Yes, I used both callbacks and trained the network head, unfroze, then trained the rest of the network with differential lrs. I'm not really allowed to share the entire script, but I can answer any questions. One difference I see is that I was having issues with setup_distrib in SageMaker, so I instead initialized distributed training similar to the way the SageMaker PyTorch examples recommend:
import argparse, ast, os  # imports needed by this snippet
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--hosts', type=list, default=ast.literal_eval(os.environ['SM_HOSTS']))
parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])
args = parser.parse_args()  # args.backend and args.num_gpus are parsed elsewhere in the script

print('Turning on distributed training.')
print('hosts:', args.hosts)
print('current_host:', args.current_host)
world_size = len(args.hosts)
os.environ['WORLD_SIZE'] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ['RANK'] = str(host_rank)
print('world_size:', world_size)
print('host_rank:', host_rank)
torch.cuda.set_device(0)  # each instance has a single gpu
torch.distributed.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
logger.info('Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
    args.backend, world_size) + 'Current host rank is {}. Number of gpus: {}'.format(
    host_rank, args.num_gpus))
learn.to_distributed(0)  # learn is created earlier in the script
Why are you specifying the gpu id in learn.to_distributed(0)? Does this actually run on multiple gpus, have you checked?
I'm only using distributed training, not parallel, so each machine has just 1 GPU, and I hardcoded the gpu index. This is because the Dynamic U-Net docs had a warning note saying that parallel training didn't work: https://docs.fast.ai/vision.learner.html#unet_learner
I am also using to_distributed, but when I hardcoded the gpu id I saw that multiple processes were spawned on the same gpu rather than across multiple gpus. If you see processes spawning on multiple gpus then it should be fine; check with e.g. watch gpustat.
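The "all processes on one gpu" symptom usually comes from every worker hardcoding device 0 on a multi-gpu node. A tiny sketch of the usual fix (the helper name `gpu_for_process` is made up for illustration): derive each process's gpu index from its rank instead of hardcoding it, then pass that index to torch.cuda.set_device / learn.to_distributed.

```python
def gpu_for_process(global_rank, gpus_per_node):
    # Map each worker's global rank to a distinct local gpu index on its
    # node, instead of hardcoding 0 for every process.
    return global_rank % gpus_per_node
```

On a single 8-gpu node running 4 workers, ranks 0-3 would get gpus 0-3, so watch gpustat should then show one process per gpu.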
Also, this thread suggests not saving the model from the slave processes, as they can corrupt the file when they all try to write to it at the same time; I guess that's why we had that condition in save(): https://github.com/pytorch/pytorch/issues/12042
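A minimal sketch of that rank guard, independent of fastai (the function name `save_on_master` is hypothetical): only the process with rank 0 writes the checkpoint, so concurrent writers on a shared filesystem can't corrupt the file.

```python
import os

def save_on_master(save_fn, rank=None):
    # Only the master process (rank 0) writes the checkpoint; the other
    # workers on the same node skip the save, so no two processes ever
    # write to the same .pth file concurrently.
    if rank is None:
        rank = int(os.environ.get('RANK', 0))
    if rank == 0:
        save_fn()  # e.g. lambda: learn.save('model') in fastai
        return True
    return False
```

In real torch.distributed code you would typically follow the rank-0 save with a barrier before any rank calls load, so workers don't read a half-written file.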
In my case the training script is run identically on completely separate EC2 instances rather than using distributed training on a single multi-gpu machine; could that be the difference? I wonder if that issue doesn't affect me because I'm not doing parallel+distributed and the instances have separate filesystems, so nothing tries to write to the same file.
Hmm, I see, thanks for the clarification. I will try some other things.
No luck, I couldn't solve the issue.
If any other info about my setup would help, please let me know. Just to make sure I understand: the hardware you're testing on is a single multi-gpu machine?
Yes, it is a single-node machine with 8 gpus, of which I use 3-4. I can share my scripts with you:
Thanks a lot for your help!
And you said just turning off EarlyStoppingCallback makes everything work, or is turning off all callbacks required? I could see how EarlyStoppingCallback wouldn't work without patching, because it calls learn.load at the end of training, but I'm not quite sure why patching has fixed both callbacks for me but only SaveModelCallback for you.
Also, I'm not using ReduceLROnPlateauCallback or CSVLogger, although those probably aren't the issue.
It fails with either learn.load or EarlyStoppingCallback.