To_distributed() with SaveModelCallback

SaveModelCallback seems to be causing pickle errors while training on multiple GPUs with to_distributed():

 File "/home/turgutluk/fastai/fastai/basic_train.py", line 111, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/home/turgutluk/fastai/fastai/basic_train.py", line 111, in fit
    finally: cb_handler.on_train_end(exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 322, in on_train_end
    finally: cb_handler.on_train_end(exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 322, in on_train_end
    finally: cb_handler.on_train_end(exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 322, in on_train_end
    self('train_end', exception=exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 250, in __call__
    self('train_end', exception=exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 250, in __call__
    self('train_end', exception=exception)
  File "/home/turgutluk/fastai/fastai/callback.py", line 250, in __call__
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/home/turgutluk/fastai/fastai/callback.py", line 240, in _call_and_update
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/home/turgutluk/fastai/fastai/callback.py", line 240, in _call_and_update
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/home/turgutluk/fastai/fastai/callback.py", line 240, in _call_and_update
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/home/turgutluk/fastai/fastai/callbacks/tracker.py", line 104, in on_train_end
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/home/turgutluk/fastai/fastai/callbacks/tracker.py", line 104, in on_train_end
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/home/turgutluk/fastai/fastai/callbacks/tracker.py", line 104, in on_train_end
    self.learn.load(f'{self.name}', purge=False)
  File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
    self.learn.load(f'{self.name}', purge=False)
  File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
    self.learn.load(f'{self.name}', purge=False)
  File "/home/turgutluk/fastai/fastai/basic_train.py", line 264, in load
    state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
    state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
    state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 532, in _load
    return _load(f, map_location, pickle_module)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 532, in _load
    return _load(f, map_location, pickle_module)
  File "/home/turgutluk/.conda/envs/my_fastai/lib/python3.7/site-packages/torch/serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
    magic_number = pickle_module.load(f)

Here is the full script: https://github.com/KeremTurgutlu/hist_cancer_detection/blob/master/multi_gpu_training.py
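For reference, the rough shape of the setup is something like this (a minimal sketch, not the script linked above; the dataset, architecture and callback arguments are placeholders), launched with `python -m torch.distributed.launch --nproc_per_node=<num_gpus> script.py`:

```python
import argparse
from fastai.vision import *
from fastai.distributed import *
from fastai.callbacks import SaveModelCallback

# torch.distributed.launch passes --local_rank to each spawned process
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.MNIST_SAMPLE)                 # placeholder dataset
data = ImageDataBunch.from_folder(path, bs=64)
learn = cnn_learner(data, models.resnet18, metrics=accuracy).to_distributed(args.local_rank)

# SaveModelCallback saves on improvement and reloads the best weights in on_train_end
learn.fit_one_cycle(1, callbacks=[SaveModelCallback(learn, monitor='accuracy', name='best')])
```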


That is weird. Does it pickle okay without distributed?

Yes, without distributed it works fine. I will turn a couple of things on and off in my script to better understand what might be causing the problem, because even without this callback training sometimes hangs, e.g. it trains stage-1 and stage-2, then gets stuck at stage-3. Or sometimes, if I create another learner after destroying the current one in the same script, it gets stuck after training for several epochs. There might be many reasons for such behavior, so let’s see.

I posted a similar issue recently: A few features not working with distributed training on SageMaker

If I make an os.listdir call to the save location of SaveModelCallback at the end of my training script, it seems only 1/4 of the distributed instances I’m using has actually saved the model.pth, so of course 3/4 will fail when attempting to reload it. The same thing happens when trying to call learn.export – only one of the instances ends up having an export.pkl.
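Concretely, the check at the end of the training script is roughly this (assuming `learn` is the trained Learner; in fastai v1 the `rank_distrib` helper lives in fastai.torch_core):

```python
import os
from fastai.torch_core import rank_distrib   # fastai v1 helper that reads the RANK env var

# each process prints what it actually sees in the model save directory after training
print(f"rank {rank_distrib()}: {sorted(os.listdir(learn.path/learn.model_dir))}")
```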

Any chance you happen to be using SageMaker as well?

Btw, I’m also having issues with distributed multi-stage training (and unfreezing in between phases).

Hmm, we should definitely dig deeper into the callbacks, but fit_one_cycle also works through a callback and it seems to run fine.

No, I am not using SageMaker; I am on a desktop Ubuntu machine.

Are your errors with SaveModelCallback related to the actual callback erroring out, or is the issue when attempting to learn.load the saved model? For me, the callback itself does not error, it just doesn’t save a file like it’s supposed to for all but one of the instances.

It’s good that you’re not using SageMaker or unet_learner so I think we can eliminate those variables.

One error was with learn.load, which is called by SaveModelCallback. The second issue is that training sometimes hangs independently of callbacks, but to make sure I will test it by disabling all callbacks. BTW, examples/train_cifar.py works for me, but it doesn’t do stage-wise training or any loading.

Hmm, I didn’t notice that SaveModelCallback actually calls learn.load internally. That doesn’t error for me, but explicitly calling learn.load at the expected path does. I’m going to try to see if the save path is different from what’s expected on the other instances.

It looks like SaveModelCallback uses learn.load twice: the first time in jump_to_epoch, which I haven’t fully understood yet, and the second time in on_train_end, where it first checks that the file exists, so in that case it wouldn’t error out. But then, if you continue to a second phase, one of the instances will have loaded the best epoch at train end while the others haven’t, so they become out of sync.
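For reference, the logic in on_train_end is roughly the following (paraphrased from memory, not copied verbatim from tracker.py), which is why only the process that actually has the file ends up reloading the best weights:

```python
# Rough paraphrase of SaveModelCallback.on_train_end in fastai v1 (not verbatim):
def on_train_end(self, **kwargs):
    "Load the best model saved during training, if the file is actually there."
    if self.every == "improvement" and (self.learn.path/self.learn.model_dir/f'{self.name}.pth').is_file():
        self.learn.load(self.name, purge=False)
```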

Does training only hang when trying to do multi-phase? Also does it still happen if you don’t modify freezing in between?

Yes at the end of training it loads the best model so far.

One possibility is that it might be silently failing to load the best model at end of training on all but one of the instances, in which case they would be out of sync for the next phase of training.

It looks like it waits for the other processes, but why some processes are not able to get back in sync, I don’t know.

So I did another simple test where I called save_path = learn.save('testing_save', return_path=True) right after instantiating the learner, and then printed save_path. The master instance returns a path, while the slave instances return None.
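In code, the test was essentially this (the rank print is just to label the output, using fastai’s `rank_distrib`):

```python
from fastai.torch_core import rank_distrib

# right after creating the learner, every process calls save and reports what it got back
save_path = learn.save('testing_save', return_path=True)
print(f"rank {rank_distrib()} -> {save_path}")   # rank 0: .../models/testing_save.pth, other ranks: None
```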

Do we need a shared filesystem? How can we manipulate or wrap these callbacks to work with distributed?

Ok, I’m pretty positive now that this line is the issue: https://github.com/fastai/fastai/blob/276f20bfaf413680367edd54d7d2fa8199151f6b/fastai/basic_train.py#L246

I don’t see any obvious way to turn that off, though. If I try to trick it by redefining the environment variable RANK used by rank_distrib, I think it’s likely that will screw with other things.

So in this case rank=0 would be the master and it would save its state dict, but the slaves would not?

Yeah I believe so. It may be that distributed training was typically tested for benchmarking with a single training phase as with DAWNBench, so saving on slaves in that case was either just not useful or even potentially detrimental to maximizing speed.
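If the goal is only to keep the processes in sync between phases, one possible workaround (a hedged sketch, not something fastai does for you) is to let rank 0 keep doing the saving and have every process block on a barrier before loading the same file:

```python
import torch.distributed as dist

learn.save('stage1')   # no-op on non-zero ranks in fastai v1, so only rank 0 writes stage1.pth
dist.barrier()         # every process waits here until rank 0 has finished writing the file
learn.load('stage1')   # now all ranks load the same weights before the next training phase
```

This assumes every process can see the same `models/` directory (a single machine or a shared filesystem), which also relates to the shared-filesystem question above.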

Edit: So I think we could monkey patch the save and export methods like this, although I’m a bit hesitant to do so.

As a quick solution I am manually saving the model and not loading anything; let’s see if it still hangs between the different stages. Will let you know.

EarlyStoppingCallback and ReduceLROnPlateauCallback together might compensate for SaveModelCallback.
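For anyone trying that combination, a minimal sketch (the monitor and patience values are just illustrative):

```python
from fastai.callbacks import EarlyStoppingCallback, ReduceLROnPlateauCallback

learn.fit(30, callbacks=[
    EarlyStoppingCallback(learn, monitor='valid_loss', min_delta=0.01, patience=3),
    ReduceLROnPlateauCallback(learn, monitor='valid_loss', patience=2),
])
```

Neither of those callbacks reloads weights from disk in on_train_end, so they sidestep the load-on-slave problem entirely.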


Ok, I’ll check if the monkey patch works as a temporary solution. Something like this:

import types
# the helpers used below (PathOrStr, get_model, ModelOnCPU, try_save, torch) are assumed to
# already be in scope via the script's fastai v1 imports (e.g. from fastai.vision import *)

def _export(self, fname:PathOrStr='export.pkl', destroy=False):
    "Export the state of the `Learner` in `self.path/fname`."
    # if rank_distrib(): return # don't save if slave proc  <- the check being disabled
    args = ['opt_func', 'loss_func', 'metrics', 'true_wd', 'bn_wd', 'wd', 'train_bn', 'model_dir', 'callback_fns']
    state = {a:getattr(self,a) for a in args}
    state['cb_state'] = {cb.__class__:cb.get_state() for cb in self.callbacks}
    #layer_groups -> need to find a way
    #TO SEE: do we save model structure and weights separately?
    with ModelOnCPU(self.model) as m:
        state['model'] = m
        xtra = dict(normalize=self.data.norm.keywords) if getattr(self.data, 'norm', False) else {}
        state['data'] = self.data.valid_ds.get_state(**xtra)
        state['cls'] = self.__class__
        try_save(state, self.path, fname)
    if destroy: self.destroy()

def _save(self, name:PathOrStr, return_path:bool=False, with_opt:bool=True):
    "Save model and optimizer state (if `with_opt`) with `name` to `self.model_dir`."
    self._test_writeable_path()
    # if rank_distrib(): return # don't save if slave proc  <- the check being disabled
    path = self.path/self.model_dir/f'{name}.pth'
    if not hasattr(self, 'opt'): with_opt=False
    if not with_opt: state = get_model(self.model).state_dict()
    else: state = {'model': get_model(self.model).state_dict(), 'opt': self.opt.state_dict()}
    torch.save(state, path)
    if return_path: return path

# bind the patched methods onto this particular Learner instance
learn.save = types.MethodType(_save, learn)
learn.export = types.MethodType(_export, learn)
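One thing to keep in mind with this: types.MethodType binds the patched functions to that one learn instance only, so a learner created later in the same script would need to be patched again (alternatively, you could assign the functions to Learner.save / Learner.export to patch the class for every learner).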

When doing

epochs = {'stage1':30, 'stage2':50, 'stage3':100}

it hangs at the start of stage-2 and never actually begins training stage-2.

But below works fine.

epochs = {'stage1':1, 'stage2':1, 'stage3':1}

What about with all of them set to 2? Maybe with only 1 epoch, on_epoch_end and on_train_end don’t both get called?