Tensorboard Integration

Hello everyone-

I’ve already discussed this with @jeremy and @sgugger, but the gist is that I’m planning to submit a pull request to add built-in TensorboardX functionality. There’s still a lot more that could be built on top of this, but here’s what I intend to put up:

  • Image generation visualization (for both GAN and non-GAN learners)
  • Model histograms/distributions for each of the layers
  • Various gradient stats
  • All available metrics/losses (losses are reported as a ‘metric’ currently).

The single file I plan on putting up is here:

https://github.com/jantic/fastai/blob/Tensorboard-Integration/fastai/callbacks/tensorboard.py

The usage is as follows (example from DeOldify, with the imports spelled out):

from fastai.vision import *   # provides Path and partial
from fastai.callbacks.tensorboard import GANTensorboardWriter   # the file linked above

proj_id = 'Colorize'
tboard_path = Path('data/tensorboard/' + proj_id)
learn.callback_fns.append(partial(GANTensorboardWriter, base_dir=tboard_path, name='GanLearner'))

The approach Jeremy suggested was that I’d just submit the single tensorboard.py file in callbacks without adding dependencies to the install files (which would be tensorboard and tensorboardX). He’d add the logic from there to handle the case where the (still optional) prerequisites aren’t installed- logic that would basically inform the user that if they want to use these callbacks, they’ll have to install these additional dependencies.
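To make that concrete, the guard could look something like this (just a sketch of the kind of check described above; the exact message and where it lives would be up to Jeremy):

# Sketch of an optional-dependency guard at the top of callbacks/tensorboard.py:
# fail with a helpful message only when this module is imported, so the rest of
# fastai keeps working without the extra packages installed.
try:
    from tensorboardX import SummaryWriter
except ImportError:
    raise ImportError("The tensorboard callbacks require optional dependencies: "
                      "run `pip install tensorboard tensorboardX` to use them.")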

I tried to tackle the performance issues, and that basically amounted to putting the blocking I/O operations into a simple request/queue, daemon-thread-based writer (AsyncTBWriter). This shouldn’t actually be necessary on our part though- it should be handled on the TensorboardX end, so I plan on digging further into that and raising the issue in that project. Anyway, my Python isn’t all that great yet, so there’s a good chance there’s a better way than what I did there. Just let me know- it won’t hurt my feelings :slight_smile:
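For reference, the shape of the idea behind AsyncTBWriter is roughly this (a simplified sketch with made-up names, not the actual class in the file):

import queue, threading

class AsyncWriterSketch:
    "Simplified sketch: a daemon thread drains queued write requests off the training loop."
    def __init__(self):
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def request_write(self, fn, *args, **kwargs):
        "Called from the training loop; returns immediately instead of blocking on I/O."
        self.queue.put((fn, args, kwargs))

    def _run(self):
        while True:
            fn, args, kwargs = self.queue.get()
            fn(*args, **kwargs)   # the blocking TensorboardX call happens on this thread
            self.queue.task_done()

So a call like writer.add_scalar('loss', value, iteration) becomes async_writer.request_write(writer.add_scalar, 'loss', value, iteration), and the training loop never waits on the write.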

A few other things to scrutinize would be the GPU-to-CPU logic on the tensors, the way I’m getting and caching batches from one_batch calls (which is slow, with ImageNet at least), and the defaults I set for how often these things get written (stats_iters, hist_iters, etc.). I basically did what worked for me, but this isn’t battle-tested for everything. That said, I’ve been running it as-is for a while and haven’t hit any noticeable issues.
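For context, the batch caching amounts to roughly this (a sketch with a made-up class name, not the exact code in the file):

class CachedOneBatch:
    "Sketch: fetch a batch via one_batch() once, detach it to the CPU, and reuse it."
    def __init__(self, learn, ds_type):
        self.learn, self.ds_type, self._cache = learn, ds_type, None

    def get(self):
        if self._cache is None:
            x, y = self.learn.data.one_batch(ds_type=self.ds_type)
            # detach and move to CPU once, so repeated tensorboard writes stay off the GPU
            self._cache = (x.detach().cpu(), y.detach().cpu())
        return self._cache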

Anyway, I really do recommend using Tensorboard, in particular for image generation and the model histograms. Scrubbing through the image slider makes the subtle changes between generated images enormously easy to see.

And some puzzling bugs in the model can be readily exposed by the histogram graphs- that’s how I found a bug in the new fastai SelfAttention module a few weeks ago, where it wasn’t actually learning. It was obvious in the graphs: gamma remained 0.

I’m new to the whole pull request process, so just let me know if I screwed anything up. And I’m certainly willing to put up documentation. Also- the pull request instructions for new features suggest adding tests, but I’m honestly not sure what that would consist of in this case (and I’m a guy who is really into testing). Any pointers?

I’ll formally submit the pull request once I get the green-light here.

19 Likes

This would be great!

I also played around with it some time ago.

Maybe the new tensorboard(X) notebook and slides/video from the CMU DL course could be interesting for you.

Keep up the great work! :slight_smile:

3 Likes

Hello @jsa169,

I just saw the first parts of the tensorboard callback on GitHub.
If I can help you (with my limited skills), just tell me - I would be happy to learn and contribute! :slight_smile:

Kind regards
Michael

1 Like

Thanks @MicPie! So I can tell you this much- there’s probably still a lot of functionality that could be added, for one. Like… I wasn’t sure which stats would be relevant or useful, so I just did what worked for me at the time. It should be easy to add more from here, so simply having a second pair of eyes look at it from that angle would help.

There are also other types of models not covered yet- audio and text generation, for example. But I know Tensorboard has support for these.

Also- I’m just not sure if I did everything 100% legit. So that definitely needs to be scrutinized. I must have screwed something up. That’s a given.

1 Like

@jsa169 great work. I wasn’t aware of the performance bottlenecks, but I mostly used the scalar logger of tensorboard(X).

I think that self.metrics_root = '/metrics/' in LearnerTensorboardWriter should be self.metrics_root = 'metrics/', otherwise you get warnings like this:
Summary name /metrics/valid_loss is illegal; using metrics/valid_loss instead.

1 Like

Thanks for pointing that out! I’m surprised I didn’t notice that. I’ll put it on my todo list.

Really great work! Love the visualizations. Do you think it would be possible to merge some metrics for training and validation in the same graph with different colors like so:

2 Likes

Thanks! I’d definitely like to have this functionality too. Unfortunately I just personally don’t have the time to look into it yet. Others (you included!) are certainly encouraged to do that.

Tensorboard is now natively supported in PyTorch 1.1, so there’s no need to use tensorboardX anymore :slight_smile:!

1 Like

Indeed: https://pytorch.org/docs/stable/tensorboard.html
Very nice!
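The native API mirrors tensorboardX, so switching is mostly a matter of changing the import (a minimal example; the log directory here is made up):

# PyTorch >= 1.1: SummaryWriter ships with torch itself
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='data/tensorboard/example_run')
writer.add_scalar('valid/loss', 0.25, global_step=100)   # same calls as tensorboardX
writer.close()

Note that the tensorboard package itself still needs to be installed for the import to work.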

I have created a repository to show how to use Tensorboard in fastai:

With the great callback system it’s very easy!
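For anyone who just wants the gist without opening the repository, a minimal scalar-logging callback looks roughly like this (a sketch only, not the code from the repo; the class name and log directory are made up):

from fastai.basic_train import LearnerCallback
from torch.utils.tensorboard import SummaryWriter   # or: from tensorboardX import SummaryWriter

class SimpleTBScalarWriter(LearnerCallback):
    "Sketch of a minimal callback: log the smoothed training loss every batch."
    def __init__(self, learn, log_dir='data/tensorboard/example_run'):
        super().__init__(learn)
        self.writer = SummaryWriter(log_dir=log_dir)

    def on_batch_end(self, iteration, smooth_loss, train, **kwargs):
        if train: self.writer.add_scalar('train/smooth_loss', float(smooth_loss), iteration)

    def on_train_end(self, **kwargs):
        self.writer.close()

# attach it the same way as the built-in callback:
# learn.callback_fns.append(partial(SimpleTBScalarWriter, log_dir='data/tensorboard/my_run'))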

5 Likes

Embedding support is missing from the LearnerTensorboardWriter callback.

Can I add a few lines of code to LearnerTensorboardWriter.on_epoch_end, following the existing pattern, by calling a _write_embedding() method?
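Something along these lines is what I have in mind (just a sketch; _write_embedding and self.tbwriter are placeholder names, and the real method would follow the existing _write_* pattern in the callback):

import torch

def _write_embedding(self, iteration):
    "Write the weights of any nn.Embedding layers so TensorBoard's projector can show them."
    for name, module in self.learn.model.named_modules():
        if isinstance(module, torch.nn.Embedding):
            self.tbwriter.add_embedding(module.weight.detach().cpu(),
                                        tag=name, global_step=iteration)

on_epoch_end would then just gain a self._write_embedding(iteration) call alongside the existing writes.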

By the way, I would like to add this nice callback to the fastai docs: https://docs.fast.ai/callbacks.html

1 Like

Speaking personally (as the author of the callback)- that all sounds great!

1 Like

I’ve submitted the PR already. If I need to change anything, please let me know.

1 Like

I’ve also submitted a PR for the docs, but only for LearnerTensorboardWriter.

3 Likes

Is there a way for the tensorboard and distributed callbacks to work together? I would like to capture metrics for TensorBoard visualization. I put together a simple example using the code from:

https://docs.fast.ai/distributed.html
https://docs.fast.ai/callbacks.tensorboard.html

The code to use the tensorboard writer seems pretty straightforward:

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
from fastai.callbacks.tensorboard import *

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)


project_id = 'distributed'
tboard_path=Path('data/tensorboard/' + project_id)
name="run1"
learn.callback_fns.append(partial(LearnerTensorboardWriter,
                         base_dir=tboard_path,
                         name=name))


learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)

Unfortunately, running the distributed training on a dual GPU system resulted in the following errors:

epoch     train_loss  valid_loss  accuracy  time    
Traceback (most recent call last):
  File "distributed-train-tb.py", line 27, in <module>
    learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
  File "/opt/conda/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [384]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Traceback (most recent call last):
  File "distributed-train-tb.py", line 27, in <module>
    learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)
  File "/opt/conda/lib/python3.6/site-packages/fastai/train.py", line 23, in fit_one_cycle
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 200, in fit
    fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 101, in fit
    loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
  File "/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py", line 34, in loss_batch
    if not skip_bwd:                     loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [384]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
terminate called without an active exception
terminate called without an active exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'distributed-train-tb.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Any tips on how I can get this to work?

Thanks,
Jeff

1 Like

Hi! This callback was totally new to me, but I loved the idea! So, as usual, I went to the documentation and tried implementing it with a very minimalist example (resnet18 on MNIST), but I may be missing something. I have seen there are two threads for this topic on our forum. Can somebody take a quick look at my question here? Thanks!

Hello @jsa169,

I have used your tensorboard callback, but I really could not make it work with DynamicUnet. The model_stats graph does not show at all, and neither do the distributions or histograms.
It seems that the way we send messages asynchronously conflicts somehow with the Hooks used in DynamicUnet. Any thoughts?

Is there an error message and stack trace you’re getting that you can share?

I do know the graph functionality added to the callback in recent months was causing me problems, so I commented those lines out.

While running learn.fit_one_cycle(), the Colab notebook showed the message below. I don’t know why it keeps printing out tensor values, which might be the inputs and outputs of each of the network’s modules.


And sometimes it also displayed this message:

IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable --NotebookApp.iopub_data_rate_limit

For me, I moved the graph-writing part from on_train_begin to on_train_end. It’s unstable in showing the graph across different runs, but at least it doesn’t conflict with the other streamed data like the histograms, losses, metrics, and images.