Is there a way to get the TensorBoard and distributed callbacks to work together? I would like to capture metrics for TensorBoard visualization. I put together a simple example using the code from
https://docs.fast.ai/distributed.html
https://docs.fast.ai/callbacks.tensorboard.html
The code to register the TensorBoard writer seems straightforward enough:
from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
from fastai.callbacks.tensorboard import *
import argparse

# torch.distributed.launch passes each worker its GPU index via --local_rank
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# standard CIFAR-10 setup from the fastai distributed docs
path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)

# log to data/tensorboard/distributed under the run name "run1"
project_id = 'distributed'
tboard_path = Path('data/tensorboard/' + project_id)
name = "run1"
learn.callback_fns.append(partial(LearnerTensorboardWriter,
                                  base_dir=tboard_path,
                                  name=name))

learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
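For completeness, I launch the script with the PyTorch launcher (which is what supplies --local_rank), with 2 GPUs:

python -m torch.distributed.launch --nproc_per_node=2 distributed-train-tb.py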
Unfortunately, running the distributed training on a dual-GPU system resulted in the following errors:
epoch train_loss valid_loss accuracy time
Traceback (most recent call last):
  File "distributed-train-tb.py", line 27, in <module>
    learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
  File "/opt/conda/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd: loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [384]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
(The second worker process printed an identical traceback, omitted here.)
terminate called without an active exception
terminate called without an active exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'distributed-train-tb.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
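Following the hint at the end of the RuntimeError, my next step is to rerun with autograd anomaly detection enabled, so the forward op that produced the offending tensor gets reported. Just a debugging aid, added right before the fit call:

# Debug only: makes autograd report the forward op responsible for the
# failing backward pass; adds overhead, so remove once the culprit is found.
torch.autograd.set_detect_anomaly(True)

learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)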
Any tips on how I can get this to work?
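In the meantime, one workaround I'm considering is to register the writer only in the rank-0 process, so only one worker ever touches the SummaryWriter. This is an untested sketch, and I don't know whether LearnerTensorboardWriter is actually safe to skip on the other ranks:

# Untested sketch: only the rank-0 worker writes TensorBoard logs.
# Assumes the callback has no side effects the other ranks depend on.
if args.local_rank == 0:
    learn.callback_fns.append(partial(LearnerTensorboardWriter,
                                      base_dir=tboard_path,
                                      name=name))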
Thanks,
Jeff