Tensorboard Integration

Embedding writing is missing from the LearnerTensorboardWriter callback.

Could I add a few lines of code to LearnerTensorboardWriter.on_epoch_end?

It would follow the existing pattern, calling a new _write_embedding() helper.
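Roughly what I have in mind is the sketch below (the subclass name, the _write_embedding() helper, and the tbwriter attribute are assumptions on my part, not the actual implementation):

# Sketch only: add embedding writing to the existing callback.
# Assumes the callback exposes a SummaryWriter-like object as `self.tbwriter`
# and receives `iteration` in its epoch-end kwargs; names may differ in the
# real implementation.
from fastai.basics import *
from fastai.callbacks.tensorboard import LearnerTensorboardWriter

class EmbeddingTensorboardWriter(LearnerTensorboardWriter):
    "LearnerTensorboardWriter that also logs nn.Embedding weights each epoch."
    def _write_embedding(self, iteration:int)->None:
        # Hypothetical helper: walk the model and log any embedding matrices.
        for name, module in self.learn.model.named_modules():
            if isinstance(module, nn.Embedding):
                self.tbwriter.add_embedding(module.weight.data, tag=name, global_step=iteration)

    def on_epoch_end(self, iteration:int, **kwargs)->None:
        super().on_epoch_end(iteration=iteration, **kwargs)
        self._write_embedding(iteration=iteration)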

By the way, I would also like to add this nice callback to the fastai docs: https://docs.fast.ai/callbacks.html

Speaking for myself (the author of the callback): that all sounds great!

I’ve submitted a PR already. If I need to change anything, please let me know.

I’ve also submitted a PR for the docs, but only for LearnerTensorboardWriter.

Is there a way for the tensorboard and distributed callbacks to work together? I would like to capture metrics for TensorBoard visualization. I put together a simple example using the code from

https://docs.fast.ai/distributed.html
https://docs.fast.ai/callbacks.tensorboard.html

The code to use the tensorboard writer seems pretty straightforward:

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
from fastai.callbacks.tensorboard import *

import argparse

# Standard distributed setup: torch.distributed.launch passes --local_rank to each process
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# CIFAR data and a WideResNet-22 learner, wrapped for distributed training
path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)

# Attach the tensorboard writer as a callback factory
project_id = 'distributed'
tboard_path = Path('data/tensorboard/' + project_id)
name = "run1"
learn.callback_fns.append(partial(LearnerTensorboardWriter,
                                  base_dir=tboard_path,
                                  name=name))

learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)

Unfortunately, running the distributed training on a dual-GPU system resulted in the following errors:

epoch     train_loss  valid_loss  accuracy  time    
Traceback (most recent call last):
  File "distributed-train-tb.py", line 27, in <module>
    learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
  File "/opt/conda/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [384]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Traceback (most recent call last):
  File "distributed-train-tb.py", line 27, in <module>
    learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
  File "/opt/conda/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [384]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
terminate called without an active exception
terminate called without an active exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'distributed-train-tb.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Any tips on how I can get this to work?

Thanks,
Jeff

Hi! This callback was totally new to me, but I loved the idea! So, as usual, I went to the documentation and tried implementing it with a very minimalist example (resnet18 on MNIST), but I may be missing something. I have seen there are two threads on this topic on our forum. Can somebody take a quick look at my question here? Thanks!

Hello @jsa169,

I have used your tensorboard callback, but I really could not make it work with DynamicUnet. The model_stats graph does not show up at all, and neither do the distributions or histograms.
It seems that the way we send messages asynchronously conflicts somehow with the Hooks used by DynamicUnet. Any thoughts?

Is there an error message and stack trace you’re getting that you can share?

I do know the graph functionality added to the callback in recent months was causing me problems, so I commented those lines out.

While running learn.fit_one_cycle(), the message below appeared in the Colab notebook. I don’t know why it keeps printing out tensor values, which might be the inputs and outputs of each of the network’s modules.


And sometimes it also displayed the following message:

IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable --NotebookApp.iopub_data_rate_limit

For my part, I moved the graph-writing part from on_train_begin to on_train_end. It is unstable in showing the graph across different runs, but at least it doesn’t conflict with the other streamed data like histograms, loss, metrics, and images.

Yeah, I just commented out the graph writer logic in the tensorboard.py file.

It’s not the final solution, but it’s the hack that got me productive again in the meantime. Note that this graph functionality isn’t something I added; I haven’t had the time to investigate or fix it and won’t for a while. I originally omitted graph writing when I created the tensorboard callback because it was causing too many perplexing problems back then, and the benefits were minimal in my opinion.
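If you’d rather not edit the library file, a subclass along these lines should give the same effect (a sketch only; it assumes the parent’s on_train_begin does nothing besides the graph write in the version being discussed, which you’d want to verify first):

# Workaround sketch, not the official fix: skip the model-graph write by
# overriding on_train_begin instead of editing fastai's tensorboard.py.
from fastai.callbacks.tensorboard import LearnerTensorboardWriter

class NoGraphTensorboardWriter(LearnerTensorboardWriter):
    "LearnerTensorboardWriter without the model-graph write at train begin."
    def on_train_begin(self, **kwargs)->None:
        # Intentionally a no-op: the (problematic) graph writer never runs.
        # Scalars, histograms and images are still written by the batch/epoch
        # callbacks inherited from the parent class.
        pass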

Maybe we should remove it then?

Honestly, I don’t have any desire to use it. Not sure about anybody else.

@jeffchen72 did you ever resolve this issue? I’m having the same problem.

Hi there,
Your file is not opening…

Thanks.

Since the original posting, the code was merged into fastai’s code base, so now I have DeOldify calling that: https://github.com/fastai/fastai1/blob/master/fastai/callbacks/tensorboard.py

Thanks a lot for the quick reply.
This is for fastai1, am I right?
Have you also implemented it for fastai2?
Thanks…

They actually already have a callback built into V2 as well (I wasn’t involved): https://github.com/fastai/fastai/blob/master/fastai/callback/tensorboard.py
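I haven’t used it myself, but from a quick read of that file, usage would look roughly like the sketch below (the TensorBoardCallback name and its arguments are my reading of that source, so double-check them against the current version):

# Rough usage sketch for the fastai v2 callback (untested by me).
from fastai.vision.all import *
from fastai.callback.tensorboard import TensorBoardCallback

path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, metrics=accuracy)

# log_dir and trace_model are the options I'd expect from the source; the
# callback is passed per-fit via `cbs` rather than appended to callback_fns.
learn.fit_one_cycle(1, cbs=TensorBoardCallback(log_dir='runs/mnist', trace_model=False))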

Thanks.
Can we use it for logging images to TensorBoard, like in GANs, etc.?

I think you’ll have to try it for yourself and dig in from here. I’m as out of the loop as you are, as I haven’t been involved in any fastai2 development and am still on fastai1 myself.

Ok, thanks…