Distributed training in fastai v2 example code error: train_imagenette.py

Hi all,

I am trying to do distributed training with fastai v2, using the example script train_imagenette.py.

Running python train_imagenette.py works.

But python -m fastai.launch train_imagenette.py fails with AttributeError: 'XResNet' object has no attribute 'module':

  File "train_imagenette.py", line 79, in main
    learn.fit_flat_cos(epochs, lr, wd=1e-2, cbs=cbs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/callback/schedule.py", line 112, in fit_flat_cos
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 300, in fit
    finally:                               self('after_fit')
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 228, in __call__
    def __call__(self, event_name): L(event_name).map(self._call_one)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 362, in map
    return self._new(map(g, self))
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 315, in _new
    def _new(self, items, *args, **kwargs): return type(self)(items, *args, use_list=None, **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 41, in __call__
    res = super().__call__(*((x,) + args), **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 306, in __init__
    items = list(items) if use_list else _listify(items)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 242, in _listify
    if is_iter(o): return list(o)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastcore/foundation.py", line 208, in __call__
    return self.fn(*fargs, **kwargs)
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 231, in _call_one
    [cb(event_name) for cb in sort_by_run(self.cbs)]
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 231, in <listcomp>
    [cb(event_name) for cb in sort_by_run(self.cbs)]
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/learner.py", line 25, in __call__
    if self.run and _run: getattr(self, event_name, noop)()
  File "/envs/mainpy3.7/lib/python3.7/site-packages/fastai2/distributed.py", line 106, in after_fit
    self.learn.model = self.learn.model.module
  File "/envs/mainpy3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'XResNet' object has no attribute 'module'

@sgugger

Sorry to bother you, could you give some advice?
Thanks a lot.

He probably can’t look at this until after their book deadline this week, or at least can’t give it much attention. :slight_smile: XResNet has changed a bit since then, so it’s possible the script needs to be updated.

Hi @muellerzr,

Thanks for the reply. Hope sgugger will look at it when he is free.
Anyone else is welcome to offer a clue…

@cooli46
Were you able to get past this issue? I am facing the same issue when I run the existing script with defaults, with no changes.

Hello,
I have not been able to solve this problem yet. I am using v1 instead.

Hello @cooli46, @gokkulnath,

Stack traces can be tricky to read. In this case the full stack trace shows the root cause: the real error is an earlier traceback, printed above the XResNet error, which shows that the distributed data parallel process group was not properly initialized:

  File "/home/ndim1/.local/lib/python3.7/site-packages/fastai2/distributed.py", line 91, in begin_fit
    self.learn.model = DistributedDataParallel(self.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 273, in __init__
    self.process_group = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 268, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

This traces back to line 55 of train_imagenette.py, which was commented out in this commit 3 months ago.

Uncommenting line 55 should get you going.

Also, the command line should probably use -m fastai2.launch, not fastai.launch. (Yes, line 76 says # Requires -m fastai.launch; that comment should have been updated for v2.)
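
For anyone wondering why the second error hides the first: judging from the two traces, fit runs after_fit in a finally block, so when begin_fit fails before the model is wrapped, after_fit still runs and trips over the missing .module attribute. Here is a minimal sketch of that pattern in plain Python. The classes below (ToyModel, ToyWrapper, ToyLearner) and the root_cause helper are made-up stand-ins for illustration, not fastai2 or PyTorch code.

```python
class ToyModel:
    """Hypothetical stand-in for the bare model (XResNet in the trace)."""


class ToyWrapper:
    """Hypothetical stand-in for DistributedDataParallel: keeps the model in .module."""

    def __init__(self, module):
        self.module = module


class ToyLearner:
    """Hypothetical learner whose fit() mirrors the try/finally seen in the trace."""

    def __init__(self, model, group_ready):
        self.model = model
        self.group_ready = group_ready  # pretend flag: was the process group set up?

    def begin_fit(self):
        if not self.group_ready:
            raise RuntimeError("Default process group has not been initialized")
        self.model = ToyWrapper(self.model)

    def after_fit(self):
        # Assumes begin_fit succeeded; fails on a model that was never wrapped.
        self.model = self.model.module

    def fit(self):
        try:
            self.begin_fit()
        finally:
            self.after_fit()  # runs even when begin_fit raised


def root_cause(exc):
    """Follow the implicit exception chain back to the first error."""
    while exc.__context__ is not None:
        exc = exc.__context__
    return exc
```

Calling ToyLearner(ToyModel(), group_ready=False).fit() ends with an AttributeError about the missing module attribute, but root_cause on that exception returns the RuntimeError. In a real traceback this is the part printed above "During handling of the above exception, another exception occurred", so it is worth scrolling up.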

Have fun!

@sgugger any reason this line should still be commented out?

Fixed those two things.

Thanks a lot. I will give it a try with the changes today!

Hello,

train_imagenette.py has been updated to work, without modification, in two ways: invoked stand-alone, it uses data parallel; launched via -m fastai2.launch, it uses distributed data parallel.

If only a single GPU or a single member is available, both modes fall back to the base case of single-GPU, single-process training.

It also showcases an alternative way to set up parallel/distributed training using a context manager, which simplifies usage and minimizes certain side effects after training.
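
To give a rough idea of why a context manager helps with the side effects, here is a small sketch in plain Python. All names in it (ToyLearner, ToyWrapper, wrapped_model) are hypothetical and this is not the fastai2 API; it only assumes that training temporarily swaps the learner's model for a wrapper and should put the original back afterwards.

```python
from contextlib import contextmanager


class ToyWrapper:
    """Hypothetical stand-in for a parallel wrapper around a model."""

    def __init__(self, module):
        self.module = module


class ToyLearner:
    """Hypothetical learner that just holds a model."""

    def __init__(self, model):
        self.model = model


@contextmanager
def wrapped_model(learn, wrap=ToyWrapper):
    """Wrap learn.model for the duration of the block, then restore it."""
    original = learn.model
    learn.model = wrap(original)
    try:
        yield learn
    finally:
        # Restore the saved original instead of reading .module back,
        # so cleanup works even if training failed part-way.
        learn.model = original
```

Used as "with wrapped_model(learn): ..." around the training call, the learner ends up with its plain model again whether or not training raised, which is the kind of post-training side effect this style avoids.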