To_distributed() with SaveModelCallback

That error is definitely a new one to me. I was just able to use both callbacks successfully, though (ignore the extra #quality_metric printout callback I added).

Is this with multiple training stages (unfreezing), and with both EarlyStoppingCallback and SaveModelCallback?

If that’s the case, can you share your script? Thanks!

Yes, I used both callbacks: I trained the network head, unfroze, then trained the rest of the network with differential learning rates. I’m not really allowed to share the entire script, but I can answer any questions. One difference I see is that I was having issues with setup_distrib on SageMaker, so I instead initialized distributed training the way the SageMaker PyTorch examples recommend:

import argparse, ast, logging, os
import torch

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument('--hosts', type=str, default=ast.literal_eval(os.environ['SM_HOSTS']))
parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])
# --backend and --num_gpus are referenced below; the defaults shown here are assumptions.
parser.add_argument('--backend', type=str, default='gloo')
parser.add_argument('--num_gpus', type=int, default=int(os.environ.get('SM_NUM_GPUS', 1)))
args = parser.parse_args()

print('Turning on distributed training.')
print('hosts:', args.hosts)
print('current_host:', args.current_host)
world_size = len(args.hosts)
os.environ['WORLD_SIZE'] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ['RANK'] = str(host_rank)
print('world_size:', world_size)
print('host_rank:', host_rank)
torch.cuda.set_device(0)  # each instance has a single GPU
torch.distributed.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
logger.info('Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
    args.backend, world_size) + 'Current host rank is {}. Number of gpus: {}'.format(
    host_rank, args.num_gpus))
learn.to_distributed(0)  # `learn` is the fastai Learner built earlier in the script

Why are you specifying the GPU id in learn.to_distributed(0)? Does this run on multiple GPUs? Have you checked?

I’m only using distributed training, not parallel, and each machine has only one GPU, so I hardcoded the GPU index. I did this because the Dynamic U-Net docs had a warning note saying that parallel training didn’t work: https://docs.fast.ai/vision.learner.html#unet_learner

I am also using to_distributed, but when I hardcoded the GPU id I saw multiple processes spawned on the same GPU rather than across GPUs. If you see processes spawning on multiple GPUs then it should be fine; you can check with e.g. watch gpustat.

Also, this thread suggests not saving the model from the worker (non-master) processes, as it can corrupt the file when several processes try to write to it at the same time; I guess that’s why we had that condition in save(). https://github.com/pytorch/pytorch/issues/12042

In my case the training script is run identically on completely separate EC2 instances, rather than distributed on a single multi-GPU machine. Could that be the difference?

I wonder if that issue doesn’t affect me because I’m not doing parallel+distributed and each instance has its own filesystem, so nothing is trying to write to the same file.

Hmm, I see, thanks for the clarification. I will try some other things.

No luck, I couldn’t solve the issue :frowning:

If any other info about my setup would help, please let me know. Just to make sure I understand: the hardware you’re testing on is a single multi-GPU machine?

Yes, it is a single-node machine with 8 GPUs, of which I use 3–4. I can share my scripts with you:

Thanks a lot for your help!


And you said that just turning off EarlyStoppingCallback makes everything work, or is turning off all callbacks required?

I could see how EarlyStoppingCallback wouldn’t work without patching, because it calls learn.load at the end of training, but I’m not quite sure why patching fixed both callbacks for me but only SaveModelCallback for you.

Also I’m not using ReduceLROnPlateauCallback or CSVLogger, although those probably aren’t an issue.

It fails with either learn.load or EarlyStoppingCallback.

I think your different hardware setup makes the situation very different, but I’d guess that if you fix the learn.load issues the others may get resolved as well. You have a single filesystem, so there’s no chance that the file learn.load is trying to load doesn’t exist, right? Alternatively, could it be that multiple processes are trying to read from the same file at the same time and crashing?

That might be possible, I’m not sure. As a temporary solution I am not loading, only saving the best models as I train, with a custom SaveModelCallback.


I have come across a similar issue, so I thought I’d share as well. It is indeed SaveModelCallback that breaks distributed training: training runs fine when I remove SaveModelCallback from fit_one_cycle. Interestingly, one of the GPUs then keeps going and hangs there forever. Here is the traceback (the same error is raised in every worker process):

Traceback (most recent call last):
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 2294, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 1090, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 1026, in frombuf
    raise EmptyHeaderError("empty header")
tarfile.EmptyHeaderError: empty header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/py_new/lib/python3.6/site-packages/torch/serialization.py", line 595, in _load
    return legacy_load(f)
  File "/mnt/py_new/lib/python3.6/site-packages/torch/serialization.py", line 506, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 1586, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 1616, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 1479, in __init__
    self.firstmember = self.next()
  File "/mnt/py_new/lib/python3.6/tarfile.py", line 2309, in next
    raise ReadError("empty file")
tarfile.ReadError: empty file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "us_cm_training.py", line 91, in <module>
    callbacks=[NotificationCallback('cls_stage3'), SaveModelCallback(cls_leanrner, every='improvement', monitor='accuracy', name='best_cls_model')]
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/basic_train.py", line 112, in fit
    finally: cb_handler.on_train_end(exception)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/callback.py", line 323, in on_train_end
    self('train_end', exception=exception)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/callback.py", line 251, in __call__
    for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/callback.py", line 241, in _call_and_update
    new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/callbacks/tracker.py", line 105, in on_train_end
    self.learn.load(f'{self.name}', purge=False)
  File "/mnt/py_new/lib/python3.6/site-packages/fastai/basic_train.py", line 267, in load
    state = torch.load(source, map_location=device)
  File "/mnt/py_new/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/mnt/py_new/lib/python3.6/site-packages/torch/serialization.py", line 597, in _load
    if _is_zipfile(f):
  File "/mnt/py_new/lib/python3.6/site-packages/torch/serialization.py", line 75, in _is_zipfile
    if ord(magic_byte) != ord(read_byte):
TypeError: ord() expected a character, but string of length 0 found
Better model found at epoch 1 with accuracy value: 0.9131627082824707.

@kcturgutlu, @austinmw,

This could be a general synchronization issue among multiple producer/consumer processes.

I ran into a corrupted file when attempting to DDP-ify lesson 3’s IMDB language-model training. I was able to solve part of my problem in application code, with help from the PyTorch team’s tutorial on DDP.

When in distributed data parallel (DDP) mode in a single-node/multi-GPU setting, only one process should be responsible for save()-ing the checkpoint. This is the producer, sometimes also known as “the master process”.

All other consumer processes on the other GPUs, which need to load() the checkpoint from the same pathname on the shared filesystem, must wait for the sole producer to complete the save() above.

These two kinds of operations need to be coordinated via barrier synchronization, to avoid file corruption from inadvertent simultaneous writes, or a read-before-write hazard. The PyTorch tutorial on DistributedDataParallel explains this point in the section on “Save and Load Checkpoints”. In summary:

  1. Guard save() using a unique rank number to ensure that only the master process can save(). In your setup (single node/multi-GPU), local_rank should suffice. In a multi-node/multi-GPU setting (perhaps more relevant to your situation, @austinmw?), PyTorch issue 12042 (towards the end of the thread) recommends using torch.distributed.get_rank() to detect the master process, as a more reliable method across all setups.

  2. Use dist.barrier(), i.e. torch.distributed.barrier(), to place a synchronization barrier among all processes between a pair of save() and load() operations, regardless of their ordering, e.g. save()-then-load(), or load()-then-erase; see the sketch after this list.
    The barrier says: nobody can proceed beyond this point until everybody has arrived.
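
A minimal sketch of that save-then-barrier-then-load pattern, following the PyTorch DDP tutorial (the checkpoint path and helper name are just illustrative, and it assumes a single-node setup where the global rank doubles as the local GPU index):

import torch
import torch.distributed as dist

CHECKPOINT = 'model_best.pth'  # illustrative path on the shared filesystem

def save_and_reload(model):
    # 1. Only the master process (rank 0) writes the checkpoint.
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), CHECKPOINT)
    # 2. Barrier: nobody proceeds until every process, including the writer, gets here.
    dist.barrier()
    # 3. Now every process can safely read the same file, mapping it onto its own GPU.
    map_location = {'cuda:0': 'cuda:%d' % dist.get_rank()}
    model.load_state_dict(torch.load(CHECKPOINT, map_location=map_location))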

These two tricks helped me solve the spurious checkpoint-file corruption around learn.load() and learn.save() in the high-level code, but I still run into problems when attempting to train a text-classifier learner starting from an encoder tuned on a language model. I bet the SaveModelCallback and EarlyStoppingCallback insight you and @austinmw brought up is part of the remaining puzzle, thanks!

Any thoughts on this, @sgugger? I will experiment with this idea in that pair of callbacks. Perhaps there are more save()/load() calls elsewhere in the fastai library that may affect distributed training…

I totally agree with everything here.

What I also found is that a customized best-model callback will work in this case. Importantly, it should only save, and not load the best model at the end of training the way SaveModelCallback does. Then it works.

@MartinBai, agreed: the load() inside SaveModelCallback.on_train_end() can hit a file that is still being (or not yet completely) written by the save() in SaveModelCallback.on_epoch_end(), if not synchronized properly.

Perhaps placing a barrier before the call to self.learn.load() in SaveModelCallback.on_train_end(), if running in distributed mode? That would require all GPUs to be exactly there, and no longer in on_epoch_end(), before the load() happens…

Also, I notice that the save() in basic_train.py nicely guards against multiple simultaneous writes by checking that the environment variable RANK == 0, via rank_distrib().
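
For context, the shape of that guard (just a sketch, not the actual fastai source) is:

import torch
from fastai.torch_core import rank_distrib

def guarded_save(learn, path):
    # rank_distrib() reads the RANK environment variable (0 on the master,
    # non-zero on the other workers), so only the master writes the file.
    if rank_distrib() != 0:
        return
    torch.save(learn.model.state_dict(), path)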

I created a gist here to simulate the condition, and suggest a fix.

A conditional torch.distributed.barrier() before the self.learn.load() call in on_train_end() would resolve this:

            if torch.distributed.is_available() and torch.distributed.is_initialized(): torch.distributed.barrier()
            self.learn.load(f'{self.name}', purge=False)

I would suggest this to the fastai dev folks. For now, an application can subclass SaveModelCallback and override the on_train_end() method to insert the fix, as sketched below.
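
For example, a minimal subclass along those lines (a sketch assuming fastai v1’s callback API; the class name is made up) could look like:

import torch
from fastai.callbacks.tracker import SaveModelCallback

class DistributedSaveModelCallback(SaveModelCallback):
    "Wait for the master's final save before loading the best model at train end."
    def on_train_end(self, **kwargs):
        # All workers wait here, which guarantees the master has finished its last
        # save in on_epoch_end() before anyone calls learn.load() in the parent class.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.barrier()
        super().on_train_end(**kwargs)

You would then pass e.g. DistributedSaveModelCallback(learn, every='improvement', monitor='accuracy', name='best_model') to fit_one_cycle in place of SaveModelCallback.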