Workflow to compare & monitor models using WandbCallback

Useful resources

Features

  • Log and compare runs and hyperparameters
  • Keep track of code, models and datasets
  • Automatically log prediction samples to visualize during training
  • Monitor computer resources
  • Make custom graphs and reports with data from your runs
  • Launch and scale hyperparameter search on your own compute, orchestrated by W&B
  • Collaborate in a transparent way, with traceability and reproducibility

Original Post

Hi,

I’ve been working on WandbCallback for the past few months (with a lot of help from @sgugger) and I’m very excited to show how it works!

This is still in very active development so I’d love all the feedback you have regarding bugs or new features.

To use it:

import wandb
from fastai2.callback.wandb import *

# start logging a wandb run
wandb.init()  # optional -> wandb.init(project='my_project')

# just add WandbCallback to your learner
learn.fit(..., cbs=WandbCallback())

It lets you:

  • quickly compare models -> I used it to debug and check GradientAccumulation
  • make lots of custom graphs or reports pulling data from your runs
  • watch long training runs on your phone
  • automatically log prediction samples

You can test it with your own project or this small demo notebook.

When you run it, you will have access to:

Now what I like most is that if you run the notebook several times trying different parameters (batch size, number of epochs, learning rate, GradientAccumulation callback…), then open your project page, you will see that more than 100 parameters have automatically been logged for you from all the fastai functions you used.

Press the little wand at the top right of your runs summary table, reorganize or hide columns as you like, and you get a nice comparative summary.

You can easily create graphs to compare runs.

And finally you can use them to create cool reports where your results are fully traceable.

I’d love any feedback you may have and I’m here to help if you have any questions.

25 Likes

Thanks for all the work on this, that report functionality is a great example! I used the callback a little last week and enjoyed using it, especially for looking at the gradients in the model (although I still need to figure out a good rule of thumb of what is the most informative layer(s) I should be looking at when examining gradients :smiley: )

1 Like

One question I had is how to log slices?

Let’s assume learn.fit(lr=slice(5)):

  • Option 1: string, i.e. {"learn.fit.lr": "slice(None, 5, None)"}: the issue is that we won’t be able to make graphs from it or use the parameter importance feature
  • Option 2: log all parts, i.e.
    {"learn.fit.lr.start": None, "learn.fit.lr.stop": 5, "learn.fit.lr.step": None}
  • Option 3: same as option 2 but also log the stop value as the main parameter value, i.e.
    {"learn.fit.lr.start": None, "learn.fit.lr.stop": 5, "learn.fit.lr.step": None, "learn.fit.lr": 5}

I think option 2 probably makes the most sense and will be more convenient to make graphs, except when we alternate between using slices and single values.
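
To make option 2 concrete, here is a minimal sketch in plain Python. The helper name flatten_slice is hypothetical and is not part of fastai or wandb; it only illustrates the key layout discussed above.

```python
def flatten_slice(name, value):
    """Expand a slice into one entry per part; pass other values through unchanged.

    Hypothetical helper for illustration only (not fastai or wandb API).
    """
    if isinstance(value, slice):
        return {
            f"{name}.start": value.start,
            f"{name}.stop": value.stop,
            f"{name}.step": value.step,
        }
    # Single values keep the plain key, which is why alternating between
    # slices and single values ends up in different columns.
    return {name: value}


print(flatten_slice("learn.fit.lr", slice(5)))
print(flatten_slice("learn.fit.lr", 1e-3))
```

The second call shows the caveat mentioned above: a single value is logged under "learn.fit.lr" while a slice is logged under the three ".start/.stop/.step" keys.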

The main impact I see is on the parameter importance feature (e.g. in this report). A random forest is run on all the parameters to see which ones are the most important. I’m not sure what makes the most sense in this instance, but I think it would be option 2.

1 Like

Also just learnt about an awesome new feature for semantic segmentation.
Basically, images are logged so you can decide which masks to show as overlays and at what opacity.
Definitely want to add it to the callback.

See their example (and they are actually using fastai!).

1 Like

Sounds reasonable. Though I think it should still be logged like this then post-processed at some point with a regex (e.g., not make log_args have special behavior on slices).
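
A rough sketch of what that post-processing could look like, assuming the slice was logged as its repr string (as in option 1). The function name expand_logged_slices and the regex are assumptions made for illustration, and only literal parts (numbers, None) are handled.

```python
import ast
import re

# Matches the repr of a slice, e.g. "slice(None, 5, None)".
SLICE_RE = re.compile(
    r"^slice\((?P<start>[^,]+),\s*(?P<stop>[^,]+),\s*(?P<step>[^,]+)\)$"
)


def expand_logged_slices(config):
    """Return a copy of config where slice strings are split into their parts.

    Hypothetical post-processing step, not part of fastai or wandb.
    """
    out = {}
    for key, value in config.items():
        match = SLICE_RE.match(value) if isinstance(value, str) else None
        if match is None:
            out[key] = value  # leave everything else untouched
            continue
        for part, text in match.groupdict().items():
            # literal_eval only accepts literals such as 5, 0.001 or None
            out[f"{key}.{part}"] = ast.literal_eval(text.strip())
    return out


logged = {"learn.fit.lr": "slice(None, 5, None)", "bs": 64}
print(expand_logged_slices(logged))
```

This keeps the logging itself simple (a plain string) and moves the special handling of slices to a separate step that can be changed later.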

1 Like

Don’t see this in the docs … just wondering what the status of the callback is. Would love to give it a whirl.

@wgpubs: https://dev.fast.ai/callback.wandb

To my knowledge it’s fully integrated :slight_smile:

2 Likes

Correct, it is already integrated. I think @sgugger has different plans for third party integrations which is probably why it’s not in the docs at the moment.

2 Likes

I just added a new feature: custom logging of semantic segmentation samples.

You can create custom graphs:

  • select which classes to display
  • optionally add the input underneath the mask
  • set up opacity of mask and image


You can even plot the evolution of predictions over time.

See a sample run

And here I compiled a report showing all the current features of WandbCallback.

5 Likes

I got this

Installed with pip install --upgrade wandb; the version is 0.8.36.

Any hint?


Well, just after posting, I saw that you import it from fastai, not the way they show in their docs, so I guess I see the difference now.

The doc on the Weights & Biases website is for v1 of fastai (until v2 is officially released).

@muellerzr do you have admin rights to add a link to the report and to dev documentation in the first post? I don’t seem to be able to edit it.

I do indeed. I went ahead and made it a Wiki so you can update it in the future if need be :slight_smile:

1 Like

Does it work with the text classifier?

In the future please copy paste the full stack trace to help us debug :slight_smile: You can wrap it in “```” above and below and it’ll format the text

like this

And make it easier to read

1 Like

@tyoc213 I can’t see the full error stack.

It should have continued to run and not log sample predictions as it is not supported in this case.

You can also set “log_preds” arg to False.

Thanks will do that.

And yes, I have the stack trace… because I saved it to look at later… but you gave me a hint just from the error.


epoch	train_loss	valid_loss	accuracy	time
0	3.589976	3.483551	0.098774	18:48
epoch	train_loss	valid_loss	accuracy	time
0	0.000000	01:37
WARNING:wandb.util:requests_with_retry encountered retryable exception: ('Connection aborted.', OSError("(104, 'ECONNRESET')")). args: ('https://api.wandb.ai/files/tyoc213/trec-covid/t3htw2er/file_stream',), kwargs: {'json': {'complete': False, 'failed': False}}
WandbCallback was not able to get prediction samples -> 'TextLearner' object has no attribute 'preds'
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed eval> in <module>

~/Documentos/github/fastcore/fastcore/utils.py in _f(*args, **kwargs)
    429         init_args.update(log)
    430         setattr(inst, 'init_args', init_args)
--> 431         return inst if to_return else f(*args, **kwargs)
    432     return _f
    433 

~/Documentos/github/fastai2/fastai2/callback/schedule.py in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
    163     base_lr /= 2
    164     self.unfreeze()
--> 165     self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
    166 
    167 # Cell

~/Documentos/github/fastcore/fastcore/utils.py in _f(*args, **kwargs)
    429         init_args.update(log)
    430         setattr(inst, 'init_args', init_args)
--> 431         return inst if to_return else f(*args, **kwargs)
    432     return _f
    433 

~/Documentos/github/fastai2/fastai2/callback/schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    112     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    113               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 114     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
    115 
    116 # Cell

~/Documentos/github/fastcore/fastcore/utils.py in _f(*args, **kwargs)
    429         init_args.update(log)
    430         setattr(inst, 'init_args', init_args)
--> 431         return inst if to_return else f(*args, **kwargs)
    432     return _f
    433 

~/Documentos/github/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    201                     try:
    202                         self.epoch=epoch;          self('begin_epoch')
--> 203                         self._do_epoch_train()
    204                         self._do_epoch_validate()
    205                     except CancelEpochException:   self('after_cancel_epoch')

~/Documentos/github/fastai2/fastai2/learner.py in _do_epoch_train(self)
    173         try:
    174             self.dl = self.dls.train;                        self('begin_train')
--> 175             self.all_batches()
    176         except CancelTrainException:                         self('after_cancel_train')
    177         finally:                                             self('after_train')

~/Documentos/github/fastai2/fastai2/learner.py in all_batches(self)
    151     def all_batches(self):
    152         self.n_iter = len(self.dl)
--> 153         for o in enumerate(self.dl): self.one_batch(*o)
    154 
    155     def one_batch(self, i, b):

~/Documentos/github/fastai2/fastai2/learner.py in one_batch(self, i, b)
    157         try:
    158             self._split(b);                                  self('begin_batch')
--> 159             self.pred = self.model(*self.xb);                self('after_pred')
    160             if len(self.yb) == 0: return
    161             self.loss = self.loss_func(self.pred, *self.yb); self('after_loss')

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     98     def forward(self, input):
     99         for module in self:
--> 100             input = module(input)
    101         return input
    102 

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/Documentos/github/fastai2/fastai2/text/models/core.py in forward(self, input)
     79             #Note: this expects that sequence really begins on a round multiple of bptt
     80             real_bs = (input[:,i] != self.pad_idx).long().sum()
---> 81             o = self.module(input[:real_bs,i: min(i+self.bptt, sl)])
     82             if self.max_len is None or sl-i <= self.max_len:
     83                 outs.append(o)

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/Documentos/github/fastai2/fastai2/text/models/awdlstm.py in forward(self, inp, from_embeds)
    105         new_hidden = []
    106         for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
--> 107             output, new_h = rnn(output, self.hidden[l])
    108             new_hidden.append(new_h)
    109             if l != self.n_layers - 1: output = hid_dp(output)

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/Documentos/github/fastai2/fastai2/text/models/awdlstm.py in forward(self, *args)
     52             #To avoid the warning that comes because the weights aren't flattened.
     53             warnings.simplefilter("ignore")
---> 54             return self.module.forward(*args)
     55 
     56     def reset(self):

~/miniconda3/envs/fastai2/lib/python3.7/site-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    568         if batch_sizes is None:
    569             result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
--> 570                               self.dropout, self.training, self.bidirectional, self.batch_first)
    571         else:
    572             result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,

RuntimeError: CUDA out of memory. Tried to allocate 102.00 MiB (GPU 0; 7.79 GiB total capacity; 5.68 GiB already allocated; 105.19 MiB free; 5.82 GiB reserved in total by PyTorch)

CPU: 15/20/2590 MB | GPU: 5895/110/7767 MB | Time 0:20:43.136 | (Consumed/Peaked/Used Total)

yeah, I know that :slight_smile:

Looks like you just ran out of CUDA memory.
Does it work without WandbCallback?

Answering the question: no, but interestingly enough, each time it is activated it always fails in the same location involving the callback.

But I can’t do learn.fine_tune(1, 1e-1, bs=32) even with a batch size of 32… but I could complete learn.unfreeze() followed by learn.fit_one_cycle(10, 1e-3) (I think the default batch size is 64?)… I didn’t use the callback just to be sure, but let me check tomorrow if it works with the cb.
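
For this kind of trial and error with batch sizes, a small retry loop can help. This is only a sketch in plain Python: fit_with_backoff and train_fn are hypothetical names, and train_fn is assumed to rebuild the DataLoaders with the given batch size and run the training. With real PyTorch you would probably also want to free cached GPU memory between attempts.

```python
def fit_with_backoff(train_fn, bs, min_bs=1):
    """Call train_fn(bs), halving bs after each out-of-memory RuntimeError.

    Hypothetical helper: train_fn is assumed to build the data with batch
    size `bs` and train. Returns (batch size that worked, train_fn result).
    """
    while True:
        try:
            return bs, train_fn(bs)
        except RuntimeError as err:
            # Re-raise anything that is not an OOM error, or if we cannot shrink further.
            if "out of memory" not in str(err) or bs // 2 < min_bs:
                raise
            bs //= 2


def fake_train(bs):
    # Stand-in for a real training run: pretend anything above 16 does not fit.
    if bs > 16:
        raise RuntimeError("CUDA out of memory. Tried to allocate 102.00 MiB")
    return "ok"


print(fit_with_backoff(fake_train, 64))
```

Running the same loop once with and once without WandbCallback would show whether the callback changes the largest batch size that fits.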

@boris are there any plans in the works to support Tabular dataloaders? While training I got this error:

“Could not gather input dimensions
WandbCallback was not able to prepare a DataLoader for logging prediction samples -> list indices must be integers or slices, not list”
This is due to the input being two tensors, not one.
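
The error message itself can be reproduced in plain Python, which may help explain it. In this sketch plain lists stand in for the two input tensors (categorical and continuous), and select_rows is a hypothetical helper, not the callback's actual code.

```python
# Plain lists standing in for the two tabular inputs of a batch.
cat = [[1, 0], [2, 1], [3, 0]]
cont = [[0.5], [1.5], [2.5]]
batch_x = [cat, cont]

# Indexing the container of inputs with a list of row indices fails,
# because the container is a Python list, not a single tensor.
try:
    batch_x[[0, 2]]
except TypeError as err:
    print(err)


def select_rows(parts, idxs):
    """Pick the rows `idxs` from every input part separately (hypothetical helper)."""
    return [[part[i] for i in idxs] for part in parts]


print(select_rows(batch_x, [0, 2]))
```

Selecting the sample rows from each input separately is one way such a case could be handled.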