Context: I'm running distributed training across multiple GPUs, launching the script with python3 -m fastai.launch main.py.
The program fails only when I wrap the training call in with learn.distrib_ctx():
All other attempts to train the model work: training on a single GPU, or running without the context manager (which amounts to the same thing as single-GPU training). Below is the stack trace:
File "tabnet-distributed.py", line 122, in <module>
lr=1e-3,
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 221, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 163, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 212, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 163, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 206, in _do_epoch
self._do_epoch_train()
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 198, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 163, in _with_events
try: self(f'before_{event_type}'); f()
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/learner.py", line 169, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/data/load.py", line 111, in __iter__
yield self.after_batch(b)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/distributed.py", line 107, in after_batch
self.i += find_bs(b)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastai/torch_core.py", line 569, in find_bs
return item_find(b).shape[0]
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastcore/basics.py", line 389, in __getattr__
if attr is not None: return getattr(attr,k)
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastcore/transform.py", line 204, in __getattr__
def __getattr__(self,k): return gather_attrs(self, k, 'fs')
File "/home/ubuntu/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/fastcore/transform.py", line 165, in gather_attrs
if not res: raise AttributeError(k)
AttributeError: shape
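For context on the failure, here is a minimal sketch (my own simplified re-implementation, not fastai's actual code) of what find_bs / item_find are doing at the bottom of that trace: they dig out the first item in the batch and read .shape[0] as the batch size, so any batch item that isn't tensor-like (no .shape attribute) produces exactly this AttributeError:

```python
def item_find(x):
    """Recursively grab the first item of a (possibly nested) batch."""
    if isinstance(x, (list, tuple)):
        return item_find(x[0])
    if isinstance(x, dict):
        return item_find(next(iter(x.values())))
    return x

def find_bs(b):
    """Batch size = size of the first dimension of the first item found."""
    return item_find(b).shape[0]

class FakeTensor:
    """Stand-in for a tensor: only carries a .shape tuple."""
    def __init__(self, shape): self.shape = shape

# Works when the batch holds tensor-like objects:
print(find_bs((FakeTensor((64, 10)), FakeTensor((64,)))))  # 64

# Fails the same way as the stack trace when an item has no .shape:
try:
    find_bs(("not a tensor",))
except AttributeError as e:
    print("AttributeError:", e)
```

So the question is effectively: why does the batch yielded under distrib_ctx() contain items without a .shape, when the same pipeline works fine on a single GPU?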