Tensorboard Callback for Fastai

Pendar · July 8, 2018, 6:41am

I created a fast.ai callback that logs model and training information that can be viewed in tensorboard.

Tensorboard is a visualization tool that can help debug and explore your model. Read more about it here. Tensorboard is made for Tensorflow, but thanks to TensorboardX it also works with Pytorch.

Download the callback and an example notebook similar to lesson 5 at https://github.com/Pendar2/fastai-tensorboard-callback.

Currently this callback plots training loss, validation loss, and metrics. These plots can be viewed in Tensorboard's scalars tab. More could be added in the future, such as learning rate and momentum. Every X iterations a snapshot of the model’s weights is logged and can be viewed in Tensorboard's histogram and distribution tabs. Every epoch, the embedding layers are saved and can be viewed in 3D with dimensionality reduction in the projector tab. Lastly, the model’s dataflow graph can be viewed in the graph tab (can be buggy with RNNs). Below are screenshots of each.

To use it, you must have Tensorboard and TensorboardX installed:
pip install tensorflow
pip install git+https://github.com/lanpa/tensorboard-pytorch
Graph visualization requires Pytorch >= 0.4. Fastai currently uses 0.3. I have only tested with Pytorch 0.4.

Launch the Tensorboard server with tensorboard --logdir="path to the logs directory" (the default location is PATH/logs).
Then navigate your browser to http://localhost:6006

I made an example notebook on how to use the callback. The logs are stored at the ModelData path in the logs directory. The constructor requires an nn.Module instance, a ModelData instance, and a name for the log. The metrics_names parameter is a list of names for the fit function’s metrics. If this callback ever gets merged into fastai then these parameters (except for the log name) wouldn’t be required. To change the save path and the histogram save frequency, use the path=None and histogram_freq=100 parameters.
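The "every X iterations" histogram logging controlled by histogram_freq can be sketched roughly as below. This is only an illustration, not the callback's actual code: log_weight_histograms and its arguments are hypothetical names, writer is assumed to be a tensorboardX SummaryWriter (only its add_histogram(tag, values, global_step) method is used), and named_weights stands in for (name, array) pairs built from something like model.named_parameters().

```python
def log_weight_histograms(writer, named_weights, iteration, histogram_freq=100):
    """Log one histogram per weight tensor every `histogram_freq` iterations.

    Hypothetical helper for illustration. Returns True if histograms were
    written on this call, False if the iteration was skipped.
    """
    # Skip every iteration that is not a multiple of the frequency.
    if histogram_freq <= 0 or iteration % histogram_freq != 0:
        return False
    for name, values in named_weights:
        # tensorboardX: SummaryWriter.add_histogram(tag, values, global_step)
        writer.add_histogram(name, values, iteration)
    return True
```

With a real writer, named_weights could be produced by converting each parameter to a NumPy array before passing it in; a larger histogram_freq keeps the log files smaller.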

TheShadow29 · July 8, 2018, 10:18am

This is so cool. Tensorboard does have many interesting visualizations. Thanks for this. I made something similar, using callbacks for visdom, here: https://github.com/TheShadow29/FAI-notes/blob/master/notebooks/Visdom-With-FastAI.ipynb.

Cheers!

fredguth · August 15, 2018, 2:29pm

@Pendar, thanks for this code!

I am new to Tensorboard and although it seems to be working fine, I wasn’t able to see the training loss and the validation loss in the same graph. How can I do that?

Pendar · August 19, 2018, 12:15pm

Each line on the graph is a different run where the run name is defined when creating the callback object. I made it this way so it can be used to evaluate and compare the performance of multiple models.

The new fastai_v1 progress bar could do this for you: https://twitter.com/GuggerSylvain/status/1031109930353352705
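For anyone who wants both losses in a single Tensorboard chart, one possible approach (not something this callback does) is tensorboardX's add_scalars, which takes a main tag and a dict of named values. A minimal sketch, where log_losses_together is a hypothetical helper and writer is assumed to be a tensorboardX SummaryWriter:

```python
def log_losses_together(writer, train_loss, valid_loss, epoch):
    """Write training and validation loss under one chart named 'loss'.

    Hypothetical helper for illustration; `writer` is assumed to be a
    tensorboardX SummaryWriter.
    """
    # tensorboardX: SummaryWriter.add_scalars(main_tag, tag_scalar_dict, global_step)
    writer.add_scalars(
        "loss",
        {"train": float(train_loss), "valid": float(valid_loss)},
        epoch,
    )
```

Note that add_scalars stores each key as its own sub-run, so extra entries show up in Tensorboard's run list.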

maw501 · September 21, 2018, 1:59pm

@Pendar

This is wonderful and worked out of the box - you hero.

jamesp · October 16, 2018, 4:12pm

Out of curiosity, does this still work with the v1 fastai library?

Pendar · October 17, 2018, 1:38am

Probably not, as v1 had major callback changes. Will update this to work with v1 soon.

TheShadow29 · October 18, 2018, 3:12am

It works pretty much the same with a few changes. @jamesp @Pendar
Here is my current code (checked on fastai 1.0.5)

from tensorboardX import SummaryWriter
from fastai.callback import Callback
from pathlib import Path
import shutil


class TensorboardLogger(Callback):
    """
    A general-purpose logger for TensorboardX.
    Also saves a .txt file with the important parts.
    """

    def __init__(self, learner, log_name, cfgtxt, del_existing=False, histogram_freq=100):
        """
        learner: the fastai Learner (a ConvLearner in this case)
        log_name: name of the log directory to create; supplied
        for each run
        cfgtxt: hyper-parameters, as a string
        del_existing: remove previous logs with the same name and run
        the experiment from scratch
        """
        super().__init__()
        self.learn = learner
        self.model = learner.model
        self.md = learner.data

        self.metrics_names = ["validation_loss"]
        self.metrics_names += [m.__name__ for m in learner.metrics]

        self.best_met = 0

        self.histogram_freq = histogram_freq
        self.cfgtxt = cfgtxt

        path = Path(self.md.path) / "logs"
        self.log_name = log_name
        self.log_dir = path / log_name

        self.init_logs(self.log_dir, del_existing)
        self.init_tb_writer()
        self.init_txt_writer(path, log_name)

    def init_logs(self, log_dir, del_existing):
        if log_dir.exists():
            if del_existing:
                print(f'removing existing log with same name {log_dir.stem}')
                shutil.rmtree(self.log_dir)

    def init_tb_writer(self):
        self.writer = SummaryWriter(
            comment='main_mdl', log_dir=str(self.log_dir))
        self.writer.add_text('HyperParams', self.cfgtxt)

    def init_txt_writer(self, path, log_name):
        self.fw_ = path / f'{log_name}.txt'
        self.str_form = '{} \t {} \t '
        for m in self.metrics_names:
            self.str_form += '{} \t '
        self.str_form += '\n'
        self.out_str = self.str_form.format(
            'epoch', 'trn_loss', *self.metrics_names)

        with open(self.fw_, 'w') as f:
            f.write(self.cfgtxt)
            f.write('\n')
            f.write(self.out_str)

    def on_batch_end(self, **kwargs):
        self.trn_loss = kwargs['last_loss']
        num_batch = kwargs['num_batch']
        self.writer.add_scalar(
            'trn_loss_batch', self.trn_loss, num_batch)

    def on_epoch_end(self, **kwargs):
        metrics = kwargs['last_metrics']
        epoch = kwargs['epoch']
        trn_loss = kwargs['smooth_loss']
        self.writer.add_scalar('trn_loss', trn_loss, epoch)

        for val, name in zip(metrics, self.metrics_names):
            self.writer.add_scalar(name, val, epoch)

        self.file_write(self.str_form.format(epoch,
                                             self.trn_loss, *metrics))

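        # metrics[0] is the validation loss; metrics[1] is the first metric
        # and only exists if the learner was given at least one metric.
        if len(metrics) > 1:
            m = metrics[1]
            if m > self.best_met:
                self.best_met = m
                self.learn.save(self.log_name)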

    def on_train_end(self, **kwargs):
        self.writer.add_text('Total Epochs', str(kwargs['epoch']))
        self.writer.close()
        self.file_write(f'Epochs done, {kwargs["epoch"]}')

    def file_write(self, outstr):
        with open(self.fw_, 'a') as f:
            f.write(outstr)

And you use it with your learner function like this:

import json

tb_callback = TensorboardLogger(
    learn, uid, json.dumps(cfg), del_existing=del_existing)
learn.callbacks = [tb_callback]

Here uid is just a unique identifier (the name of the log); del_existing, if True, deletes the previous log with the same name; and cfg is a dictionary with all the hyper-parameters.
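The .txt log written by init_txt_writer is a tab-separated table with one column for the epoch, one for the training loss, and one per metric. A simplified sketch of that layout (make_row_format is a hypothetical helper, the numbers are made up, and the exact spacing differs slightly from the code above):

```python
def make_row_format(metric_names):
    """Build a tab-separated format string: epoch, trn_loss, one slot per metric."""
    return "\t".join(["{}"] * (2 + len(metric_names))) + "\n"


names = ["validation_loss", "accuracy"]
row_format = make_row_format(names)

# Header line, then one illustrative row for epoch 0 (values are invented).
header = row_format.format("epoch", "trn_loss", *names)
row = row_format.format(0, 0.6931, 0.7012, 0.52)
```

Because the format string is built from the metric names, the header and every row always have the same number of columns.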

Pendar · October 19, 2018, 11:08am

Updated to support fastai v1. Added lr and mom logging. Also simplified params:
learn.fit(1, 1e-3, callbacks=[TensorboardLogger(learn, "run-1")])

Uttam · October 26, 2018, 6:52am

@TheShadow29, I want to plot the graphs of training and validation losses as well as accuracy through Tensorboard for the ULMFiT model. Can you help me out with the implementation part? I am not sure how to add the hyperparameters.

TheShadow29 · October 26, 2018, 5:12pm

I added hyper-params in a config dict. So my config dict is like cfg = {'bs': 64, 'lr': 1e-3}, then I do json.dumps(cfg) which converts it into a string, and then in tb_callback use writer.add_text('Hyp-Param', cfgtxt).
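Spelled out as a small sketch: log_hyperparams is a hypothetical helper name, and writer is assumed to be a tensorboardX SummaryWriter, whose add_text(tag, text_string) is the only method used.

```python
import json

cfg = {"bs": 64, "lr": 1e-3}   # hyper-parameters collected in one dict
cfgtxt = json.dumps(cfg)       # serialised to a JSON string


def log_hyperparams(writer, cfgtxt, tag="Hyp-Param"):
    """Store the hyper-parameter string in Tensorboard's text tab."""
    # tensorboardX: SummaryWriter.add_text(tag, text_string, global_step=None)
    writer.add_text(tag, cfgtxt)
```

Since the string is plain JSON, json.loads(cfgtxt) recovers the original dictionary later.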

tinhb · December 24, 2018, 8:44am

I tried to install tensorflow, but it does not work with Python 3.7, which fastai is using. So effectively I can't use Tensorboard with fastai. Is anyone else facing the same issue?

xeTaiz · December 24, 2018, 9:15am

I had the same problem. I downgraded to Python 3.6, added a few imports (callbacks and the Learner were missing, I think) and removed the dataclass decorator to make it work. Removing the dataclass decorator means writing the constructor yourself as well. After that, it works fine.
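To illustrate what removing the decorator involves: dataclasses joined the standard library in Python 3.7, so on 3.6 the generated constructor has to be written by hand. The two classes below are hypothetical stand-ins (not the callback's real fields) that end up with the same attributes:

```python
from dataclasses import dataclass  # standard library from Python 3.7 onwards


@dataclass
class LoggerConfigDC:
    # The decorator generates __init__(self, log_name, histogram_freq=100).
    log_name: str
    histogram_freq: int = 100


class LoggerConfig:
    """Same fields with a handwritten constructor, no decorator needed."""

    def __init__(self, log_name, histogram_freq=100):
        self.log_name = log_name
        self.histogram_freq = histogram_freq
```

Only the second form avoids the dataclasses import, which is why it also runs on Python 3.6.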

bhoomit · May 25, 2019, 6:25pm

Any thoughts on torch.utils.tensorboard? https://pytorch.org/docs/stable/tensorboard.html

mgloria · October 24, 2019, 1:58pm

Thanks a lot @Pendar for having such an awesome vision! I wanted to give it a try with a very minimalist example (resnet18 on MNIST running on Sagemaker). For some reason I get the classic error "Failed to load the set of active dashboards." even though my back-end is indeed running. I pretty much followed the documentation. Could somebody take a look at the code and tell me what obvious thing I am missing?

!pip install tensorboard
!pip install tensorboardx

from fastai.vision import *
from fastai.callbacks.tensorboard import *
path = untar_data(URLs.MNIST_SAMPLE)
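data = ImageDataBunch.from_folder(path)
# Assumption: the learner was not shown in the post; resnet18 as described above
learn = cnn_learner(data, models.resnet18, metrics=accuracy)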
project_id = 'project1'
tboard_path = Path('data/tensorboard/' + project_id)
learn.callback_fns.append(partial(LearnerTensorboardWriter, 
                                    base_dir=tboard_path, 
                                    name='run1'))
learn.fit_one_cycle(2)
!tensorboard --logdir=data/tensorboard/project1 --port=6006

In my case, localhost should be https://pytorch-tensorboard.notebook.eu-west-1.sagemaker.aws/proxy/6006 according to this.

[UPDATE]: I tried out the sample tutorial (colab data) without any changes, same issue

egm · November 16, 2019, 10:12am

Hello,
torch.utils.tensorboard is based on tensorboardX (more exactly, the code has been taken from tensorboardX). The fastai callback also uses tensorboardX (imported as a module).
The fastai callback is more convenient if you use fastai. See more details here: https://perfstories.wordpress.com/2019/11/13/how-to-visualize-your-pytorch-model-in-tensorboard/

Hope this is helpful.

SravanVoonna · January 31, 2020, 3:47pm

Hey, in your last line of code, where you call tensorboard, the directory you have given is incorrect, because the log files are stored inside the folder run1 (in your case). So the correct line would be:
!tensorboard --logdir=data/tensorboard/project1/run1 --port=6006
Hope this helps you!!

Thank you
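A quick way to check where the log files actually ended up before choosing --logdir is to search for them. Tensorboard event files conventionally have names starting with events.out.tfevents; find_event_files below is a hypothetical helper that uses only the standard library:

```python
from pathlib import Path


def find_event_files(logdir):
    """Return all Tensorboard event files under `logdir`, searched recursively."""
    return sorted(Path(logdir).rglob("events.out.tfevents.*"))
```

For example, printing f.parent for each f in find_event_files("data/tensorboard/project1") shows which run folders contain logs. Tensorboard normally also scans subdirectories of --logdir, so if the files are present and the dashboard is still empty, the cause may lie elsewhere (for instance in the notebook proxy setup).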

imkhoa99 · June 25, 2020, 12:35pm

Hi,

Thank you for your work. I have one question: I tried your TensorboardLogger and it works perfectly fine, but when I include multiple runs at the same time, or use a name other than “run1” for the log, the Graph tab no longer shows the network architecture; it is blank.

Could you check how I could fix that issue?