To_distributed() with SaveModelCallback

They are getting called no matter what, even with 1 epoch; see fit().

I have edited the command:

NCCL_DEBUG=WARN python ../fastai/fastai/launch.py --gpus=3456 ./multi_gpu_debug.py --arch_name=resnet18 --model_suffix=_non_overlap --fold_num=0

Let's see if any meaningful error comes up.

So it looks like this monkey patch has worked at least initially. I'll try multi-phase training now.

Ok, the hang problem is with EarlyStoppingCallback.

Without any callbacks the script works fine.

Probably the master and the workers get out of sync because training is stopped early?

Hmm, just ran a multi-phase test and that completed successfully, but with too few epochs to trigger early stopping. Will retry with more epochs and low patience now.

Yeah, if the error occurs again after early stopping while going from stage-1 to stage-2, we can be sure: early stopping breaks the fit function. Maybe the best option would be to use your monkey patch if it seems to work. That way we would at least have access to the best models during training, a kind of early stopping without actually stopping :slight_smile: But it would be nice to understand why breaking out of the fit function causes such a hang, hmm…
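
For intuition on why a one-sided stop hangs: with DistributedDataParallel every backward pass performs a gradient all-reduce, so if one rank leaves fit() early, the remaining ranks block forever waiting for it in the next collective. A minimal sketch (just an illustration with plain torch.distributed, not code from either of our scripts) of how a stop decision would have to be agreed on by all ranks:

import torch
import torch.distributed as dist

def should_stop(local_stop: bool) -> bool:
    # Share this rank's stop decision with every other rank and take the max,
    # so all processes see the same answer and leave the training loop together.
    flag = torch.tensor([1 if local_stop else 0], device='cuda')
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())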

   self.learn.load(f'{self.name}', purge=False)
 File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
   state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
 File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
   return _load(f, map_location, pickle_module)
 File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
   deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 1776574026747881532 got 128

Got this error using only SaveModelCallback with the monkey patch.

That error is definitely a new one to me. I was just able to use both callbacks successfully, though (ignore the #quality_metric printout callback I added):

Is this with multiple stages, with unfreezing, and with both EarlyStoppingCallback and SaveModelCallback?

If that's the case, can you share your script? Thanks.

Yes, I used both callbacks and trained the network head, unfroze, then trained the rest of the network with differential lrs. I'm not really allowed to share the entire script, but I can answer any questions. One difference I see is that I was having issues with setup_distrib in SageMaker, so I instead initialized distributed training similar to the way the SageMaker PyTorch examples recommend:

import argparse, ast, logging, os
import torch

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
# SageMaker exposes the cluster layout through environment variables.
parser.add_argument('--hosts', type=str, default=ast.literal_eval(os.environ['SM_HOSTS']))
parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])
# (--backend and --num_gpus are parsed the same way, then: args = parser.parse_args())

print('Turning on distributed training.')
print('hosts:', args.hosts)
print('current_host:', args.current_host)
# One process per host: world size = number of hosts, rank = index of this host.
world_size = len(args.hosts)
os.environ['WORLD_SIZE'] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ['RANK'] = str(host_rank)
print('world_size:', world_size)
print('host_rank:', host_rank)
# Each instance has a single GPU, so device 0 is used everywhere.
torch.cuda.set_device(0)
torch.distributed.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
logger.info('Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
        args.backend, world_size) + 'Current host rank is {}. Number of gpus: {}'.format(
        host_rank, args.num_gpus))
learn.to_distributed(0)

Why are you specifying the gpu id in learn.to_distributed(0)? Does this actually run on multiple gpus? Have you checked?

I'm only using distributed training, not parallel, and each machine only has 1 GPU, so I hardcoded the gpu index. This is because the Dynamic U-Net docs have a warning note saying that parallel training doesn't work: https://docs.fast.ai/vision.learner.html#unet_learner

I am also using to_distributed, but when I hardcoded the gpu id I saw that multiple processes were spawned on the same gpu, not on multiple gpus. If you see processes being spawned on multiple gpus then it should be fine; check with e.g. watch gpustat.
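
To illustrate the single-machine case (a sketch, assuming the script is started via fastai v1's launch.py, which spawns one process per GPU and appends a --gpu=<id> argument to each, and assuming the setup_distrib helper from fastai.distributed): each process should pin itself to its own device and pass that same id to to_distributed(), rather than hardcoding 0 for every process.

import argparse
from fastai.distributed import *  # setup_distrib and the to_distributed patch for Learner

parser = argparse.ArgumentParser()
parser.add_argument('--gpu', type=int, default=0)  # filled in per process by launch.py
args = parser.parse_args()

setup_distrib(args.gpu)                 # set the CUDA device and join the process group
# ... build the DataBunch and Learner here as usual ...
learn = learn.to_distributed(args.gpu)  # train on the GPU this process owns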

Also, this thread suggests not saving the model from the slave processes, as they can corrupt the file when they all try to write to it at the same time; I guess that's why we had that condition in save(): https://github.com/pytorch/pytorch/issues/12042
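
For reference, here is the idea behind that condition, sketched with plain torch.distributed rather than fastai's own helpers (an illustration, not the actual code in save()):

import torch.distributed as dist

def is_master() -> bool:
    # Non-distributed runs count as master; otherwise only rank 0 does.
    return not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0

def save_checkpoint(learn, name: str):
    # Only the master process touches the .pth file, so a worker can never corrupt
    # a checkpoint that rank 0 is in the middle of writing.
    if is_master():
        learn.save(name)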

In my case the training script is being run identically on completely separate EC2 instances rather than using distributed training on a single multi-gpu machine. Could that be the difference?

I wonder if that issue doesn't affect me because I'm not doing parallel + distributed and have separate filesystems, so nothing is trying to write to the same file.

Hmm, I see, thanks for the clarification. I will try some other stuff.

No luck, couldn't solve the issue :frowning:

If any other info about my setup would help, please let me know. Just to make sure I understand: the hardware you're testing on is a single multi-gpu machine?

Yes, it is a single-node machine with 8 gpus, of which I use 3-4. I can share my scripts with you:

Thanks a lot for your help!

And you said just turning off EarlyStoppingCallback makes everything work, or is turning off all callbacks required?

I could see how EarlyStoppingCallback wouldn't work without patching, because it calls learn.load at the end of training, but I'm not quite sure why patching has fixed both callbacks for me but only SaveModelCallback for you.
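
One guess on my side (an assumption, not a confirmed fix): when the callback calls learn.load at the end of training, a worker could start reading the .pth file while rank 0 is still writing it, which would also explain the "storage has wrong size" error above. Syncing before loading would look roughly like this:

import torch.distributed as dist
from fastai.basic_train import Learner

_orig_load = Learner.load

def _synced_load(self, *args, **kwargs):
    # Hypothetical patch: wait at a barrier so no worker starts reading the checkpoint
    # before the master process has finished saving it.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    return _orig_load(self, *args, **kwargs)

Learner.load = _synced_load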

Also, I'm not using ReduceLROnPlateauCallback or CSVLogger, although those probably aren't an issue.

It fails with either learn.load or EarlyStoppingCallback.