Distributed training in fastai v2 example code error: train_imagenette.py

cooli46 · February 5, 2020, 9:52am

Hi all,

I am trying to do distributed training with fastai v2, and I tried running the example script:

fastai/fastai2/blob/master/nbs/examples/train_imagenette.py

from fastai2.basics import *
from fastai2.vision.all import *
from fastai2.callback.all import *
from fastai2.distributed import *
from fastprogress import fastprogress
from torchvision.models import *
from fastai2.vision.models.xresnet import *
from fastai2.callback.mixup import *
from fastscript import *

torch.backends.cudnn.benchmark = True
fastprogress.MAX_COLS = 80

def get_dls(size, woof, bs, sh=0., workers=None):
    if size<=224: path = URLs.IMAGEWOOF_320 if woof else URLs.IMAGENETTE_320
    else        : path = URLs.IMAGEWOOF     if woof else URLs.IMAGENETTE
    source = untar_data(path)
    if workers is None: workers = min(8, num_cpus())
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       splitter=GrandparentSplitter(valid_name='val'),

This file has been truncated.

Running python train_imagenette.py works.

But running python -m fastai.launch train_imagenette.py fails with the following error (AttributeError: 'XResNet' object has no attribute 'module'):

  File "train_imagenette.py", line 79, in main
    learn.fit_flat_cos(epochs, lr, wd=1e-2, cbs=cbs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 112, in fit_flat_cos
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 300, in fit
    finally:                               self('after_fit')
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 228, in __call__
    def __call__(self, event_name): L(event_name).map(self._call_one)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 362, in map
    return self._new(map(g, self))
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 315, in _new
    def _new(self, items, *args, **kwargs): return type(self)(items, *args, use_list=None, **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 41, in __call__
    res = super().__call__(*((x,) + args), **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 306, in __init__
    items = list(items) if use_list else _listify(items)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 242, in _listify
    if is_iter(o): return list(o)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 208, in __call__
    return self.fn(*fargs, **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 231, in _call_one
    [cb(event_name) for cb in sort_by_run(self.cbs)]
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 231, in <listcomp>
    [cb(event_name) for cb in sort_by_run(self.cbs)]
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 25, in __call__
    if self.run and _run: getattr(self, event_name, noop)()
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/distributed.py", line 106, in after_fit
    self.learn.model = self.learn.model.module
  File "/envs/mainpy3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'XResNet' object has no attribute 'module'

cooli46 · February 6, 2020, 1:56am

@sgugger

Sorry to bother you, would you like to give some advice?
Thanks a lot.

muellerzr · February 6, 2020, 2:29am

He probably can’t look at this until after their book deadline this week, or at least can’t give it much attention. XResNet has changed a bit since then, so it’s possible the example needs to be updated.

cooli46 · February 6, 2020, 9:19am

Hi @muellerzr,

Thanks for the reply. Hope sgugger will look at it when he is free.
Anyone is welcome to give any clue…

gokkulnath · March 17, 2020, 11:48pm

@cooli46
Were you able to get past this issue? I am facing the same issue when I run the script with defaults, with no changes to the existing script.

cooli46 · March 23, 2020, 2:27am

Hello,
I have not been able to solve this problem yet. I am using v1 instead.

philchu · March 23, 2020, 5:30pm

Hello @cooli46, @gokkulnath,

Stack traces can be tricky to read. In this case the full output shows the root cause: the real error is another traceback, printed above the XResNet error, which shows that the distributed data parallel process group was not properly initialized:

  File "/home/ndim1/.local/lib/python3.7/site-packages/fastai2/distributed.py", line 91, in begin_fit
    self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 273, in __init__
    self.process_group = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 268, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
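
To illustrate why the AttributeError hides the RuntimeError, here is a minimal pure-Python sketch of the pattern visible in the two tracebacks: fit runs the after_fit step in a finally block, so it still executes after begin_fit has failed. All class names here (ToyModel, ToyDistributedCallback, ToyLearner) are made up for illustration and are not fastai2 code; only the try/finally shape is taken from the tracebacks.

```python
class ToyModel:
    """Stands in for a plain model: it has no `.module` attribute."""


class ToyDistributedCallback:
    def __init__(self, learn):
        self.learn = learn

    def begin_fit(self):
        # Fails before the model is wrapped, like the uninitialized process group.
        raise RuntimeError("Default process group has not been initialized")

    def after_fit(self):
        # Assumes begin_fit succeeded and left a wrapper exposing `.module`.
        self.learn.model = self.learn.model.module


class ToyLearner:
    def __init__(self):
        self.model = ToyModel()
        self.cb = ToyDistributedCallback(self)

    def fit(self):
        try:
            self.cb.begin_fit()
        finally:
            # Runs even though begin_fit raised, and raises a second error.
            self.cb.after_fit()


def run_and_capture():
    """Return the exception that finally escapes from fit()."""
    try:
        ToyLearner().fit()
    except AttributeError as err:
        return err


err = run_and_capture()
# The last error is the AttributeError; the root cause is kept as its context.
root_cause = err.__context__
```

In a sketch like this, Python prints the RuntimeError first and the AttributeError last, which is why the final line of the output points at the symptom rather than the cause.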

This traces back to line 55 of train_imagenette.py, which was commented out in this commit 3 months ago.

Uncommenting line 55 should get you going.

Also, the command line should probably use -m fastai2.launch, not fastai.launch. (Yes, line 76 says # Requires -m fastai.launch; that should have been updated for v2.)

Have fun!

–

@sgugger any reason this line should still be commented out?

sgugger · March 23, 2020, 7:47pm

Fixed those two things.

gokkulnath · March 23, 2020, 11:37pm

Thanks a lot. I will give it a try with the changes today!

philchu · April 5, 2020, 10:20am

Hello,

train_imagenette.py has been updated to work, without modification, in two ways: invoked stand-alone, it uses data parallel; launched via -m fastai2.launch, it uses distributed data parallel.

If only a single GPU or a single member is available, both modes fall back to the base case of single-GPU, single-process training.

It also showcases an alternative way to express parallel/distributed training using a context manager, which simplifies usage and minimizes certain side effects after training.
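
As a rough illustration of the context-manager idea (not the actual fastai2 implementation), the sketch below wraps a model only for the duration of a with block and always restores it afterwards. The names wrapped_model, ToyWrapper, ToyLearner and the n_workers parameter are hypothetical, chosen only for this example.

```python
from contextlib import contextmanager


class ToyWrapper:
    """Stands in for a parallel wrapper that exposes the original as `.module`."""

    def __init__(self, module):
        self.module = module


class ToyLearner:
    def __init__(self, model):
        self.model = model


@contextmanager
def wrapped_model(learn, n_workers):
    """Hypothetical helper: wrap learn.model inside the block, restore it after."""
    original = learn.model
    if n_workers > 1:
        # With a single worker there is nothing to parallelize, so skip wrapping.
        learn.model = ToyWrapper(original)
    try:
        yield learn
    finally:
        # Restored even if training raises, so no wrapper leaks out of the block.
        learn.model = original


learn = ToyLearner(model="net")
with wrapped_model(learn, n_workers=2):
    inside = learn.model   # a ToyWrapper while training would run
after = learn.model        # the original model again
```

Because the restore happens in the context manager's own finally clause, the cleanup no longer depends on a separate after_fit step having a wrapper to unwrap.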