Workflow to compare & monitor models using WandbCallback

wandb versus TensorBoard: what are the pros and cons? Please ignore this if the question doesn't make sense.

I don't think there is any stupid way. Whatever works well for you is good, and you'll probably change how you do it several times!
I personally like to keep my notebook as short and concise as possible. Whenever you change any parameter (learning rate, batch size, new callback…), it is automatically tracked by this integration, so you can easily see the difference between your experiments on your project's run page.
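To make that concrete, here is a minimal sketch of the setup I have in mind, assuming a standard vision pipeline; 'my-project' and the model choice are placeholders:

import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

wandb.init(project='my-project')   # placeholder project name

# assumes `dls` (a DataLoaders) was built earlier in the notebook
learn = cnn_learner(dls, resnet34, metrics=accuracy, cbs=WandbCallback())
learn.fit_one_cycle(3, 1e-3)       # lr, batch size, callbacks, etc. are logged
wandb.finish()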

I've not looked into it too much, but we could patch Learner to do sweeps. The only issue is that it would probably cover a limited number of parameters (batch size, learning rate, epochs…), and it may be hard to make it as flexible as traditional sweeps. It may be sufficient for how most people would use it, though. Let me know your thoughts.
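For comparison, a traditional sweep drives a plain training function through the wandb sweep API. This is only a hedged sketch: the method, parameter grid, and project name are all made up.

import wandb

sweep_config = {
    'method': 'grid',
    'parameters': {
        'lr':     {'values': [1e-4, 1e-3, 1e-2]},
        'epochs': {'values': [3, 5]},
    },
}

def train():
    # each agent call gets its own run with a config drawn from the sweep
    with wandb.init() as run:
        cfg = run.config
        # build your dls/learn here, then e.g.:
        # learn.fit_one_cycle(cfg.epochs, cfg.lr, cbs=WandbCallback())

sweep_id = wandb.sweep(sweep_config, project='my-project')
wandb.agent(sweep_id, function=train)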

It is a great question. W&B includes the TensorBoard dashboard (when you use TensorBoard for logging) plus additional features.
My favorite feature is the ability to centralize experiments and quickly compare them.
When I have a complex project, I typically try a few different ideas and pull my comparisons into W&B reports to write up my reasoning along the way. It helps me think more clearly.
There is a comparison section in their documentation, but I would recommend just testing a few runs yourself, as that will be easier to understand.


That’s indeed a good idea! Hadn’t thought about it.

Just an update: Jeremy merged the PR for handling tabular data.
Many thanks to @muellerzr for the help on that one!

The prediction table is now automatically logged along with losses, metrics, etc.


With the recent update from @muellerzr, detailed config parameters are now logged automatically!

These are saved as strings, so if you experiment with them a lot it may not be completely straightforward to organize your runs. As an alternative we could extract each value individually, such as dls.after_batch.IntToFloatTensor.div (a float), but that may not help in every case (Normalize.mean is a tuple of 3 floats here). I'm planning to leave the explicit string description as is for now; a sketch of the alternative is below.
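To illustrate that alternative, extracting individual values could look something like the sketch below; the flatten_tfms helper is hypothetical and not part of the callback:

import wandb

def flatten_tfms(tfms, prefix):
    # hypothetical helper: flatten simple transform attributes into
    # individual wandb.config keys instead of one long string
    config = {}
    for tfm in tfms:
        for k, v in getattr(tfm, '__dict__', {}).items():
            if isinstance(v, (bool, int, float, str)):
                config[f'{prefix}.{type(tfm).__name__}.{k}'] = v
    return config

# e.g. dls.after_batch.IntToFloatTensor.div would become a float config entry
wandb.config.update(flatten_tfms(dls.after_batch, 'dls.after_batch'))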

Let me know if you have any suggestions.


Thank you! I was missing this information!


Just a quick update I’m really excited about: the integration of artifacts in the callback.

I'll add more documentation about it, but basically this is how it works (a short sketch follows the list):

  • the callback can log & track your datasets through the log_dataset argument (set it to True or to a custom path)
  • you can manually log datasets with the log_dataset(path…) function (for example to log train/valid splits separately)
  • models are now logged as artifacts (through log_model)
  • you can also manually log models with the log_model(…) function
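Here is a hedged sketch of both styles; the paths and names are placeholders, and the exact argument lists may differ slightly in the released version:

from fastai.callback.wandb import WandbCallback, log_dataset, log_model

# automatic: let the callback version the dataset and the trained model
cbs = WandbCallback(log_dataset=True, log_model=True)

# manual: log a dataset split or a saved model file yourself
log_dataset('data/train', name='train-split')   # placeholder path/name
log_model('model.pth', name='my-model')         # placeholder path/name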

I think WandbCallback is now ready for prime time with the new release of fastai :wink:

I added some new documentation and quick examples (also added to top post):

And here is a quick summary of the features included in the callback:

  • Log and compare runs and hyperparameters
  • Keep track of code, models and datasets
  • Automatically log prediction samples to visualize during training
  • Monitor computer resources
  • Make custom graphs and reports with data from your runs
  • Launch and scale hyperparameter search on your own compute, orchestrated by W&B
  • Collaborate in a transparent way, with traceability and reproducibility

Pretty excited about it! Feel free to share your feedback!


@boris thank you so much for all your work. This callback is really cool and a delight to use.

I think even a limited use of the sweeps functionality as a patch to Learner would be 100% worth it. I personally wouldn’t write a custom script (yet) to use sweeps, so trying it out without additional effort is a huge win for users like me.


I think it makes sense for the log_model and log_dataset functions to expose description as a parameter rather than fixing it to 'trained_model' and 'raw dataset' respectively.


Thanks for the feedback, @rsomani95.
I made a PR for custom description.

I still have the sweeps functionality in my todo list!


Just an update: the callback no longer uses log_args, which was removed for efficiency reasons.

Arguments are still captured automatically through store_attr.
You may notice some changes in the config parameters logged in the upcoming version.

Please feel free to give any feedback, and let me know if we are missing any important parameters so we can add them!

I'm now getting this error when calling learn.fit with the WandbCallback:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-1c1a865cde40> in <module>
----> 1 learn.fit_one_cycle(10, lr_max=1e-3, cbs=WandbCallback(log_preds=False))
      2 learn.recorder.plot_loss()

/usr/local/lib/python3.6/dist-packages/fastai/callback/wandb.py in __init__(self, log, log_preds, log_model, log_dataset, dataset_name, valid_dl, n_preds, seed, reorder)
     26         # W&B log step
     27         self._wandb_step = wandb.run.step - 1  # -1 except if the run has previously logged data (incremented at each batch)
---> 28         self._wandb_epoch = 0 if not(wandb.run.step) else math.ceil(wandb.run.summary['epoch']) # continue to next epoch
     29         store_attr('log,log_preds,log_model,log_dataset,dataset_name,valid_dl,n_preds,seed,reorder')
     30 

/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_summary.py in __getitem__(self, key)
     38 
     39     def __getitem__(self, key):
---> 40         item = self._as_dict()[key]
     41 
     42         if isinstance(item, dict):


KeyError: 'epoch'

Any ideas?

@vrodriguezf Did you log anything else in the same run?


Yes, and actually removing that log solved the issue, thanks!! Why is that happening?

The callback also logs the epoch.
When you have items previously logged (maybe from a previous loop), it wants to make sure it continues at the next epoch, so it tries to read the last epoch logged.
Maybe I could change the logic and not assume that an epoch has always been logged (see the sketch below).

There should be no issue if you do your manual logging after at least one point has been logged by the callback.
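For illustration, the safer lookup could be as simple as this sketch (just the idea, not the actual patch):

import math
import wandb

def resume_epoch(run) -> int:
    # fall back to epoch 0 instead of raising a KeyError when nothing
    # has been logged under 'epoch' yet (cf. the traceback above)
    try:
        return math.ceil(run.summary['epoch'])
    except KeyError:
        return 0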


(I realize this is a year later, but I see no other related posts on the forum.)

When trying to call learn.fit with the WandbCallback on a GANLearner.wgan, I get the error WandbCallback was not able to prepare a DataLoader for logging prediction samples -> list index out of range.

What exactly is out of range? I'm able to run learn.show_results() with no problem.

Update: I started writing my own callback for wandb, but I'm confused about how to get the output of the generator for "preds" instead of just the output of the critic. Here's where I'm at so far – it doesn't work – I'd welcome suggestions!

import torch
import wandb
from PIL import Image
from fastai.vision.all import *  # Callback, ProgressCallback, store_attr, tuplify

class WandB_WGAN_Images(Callback):
    "Progress-like callback: log WGAN predictions to WandB"
    order = ProgressCallback.order+1
    def __init__(self, n_preds=6):
        store_attr()

    def after_epoch(self):
        if not self.learn.training:
            with torch.no_grad():
                self.learn.switch(gen_mode=True)
                inp,preds,targs,out = self.learn.pred  # fails: pred is not a 4-tuple
                b = tuplify(inp) + tuplify(targs)
                self.dl.show_results(b, out, show=False, max_n=self.n_preds)
                preds = preds.detach().permute(1, 2, 0).cpu().squeeze().numpy()
            images = [Image.fromarray(image) for image in preds]
            wandb.log({"examples": [wandb.Image(image) for image in images]})
            self.learn.switch(gen_mode=False)

Currently fails at the inp,preds,targs... line with ValueError: too many values to unpack (expected 4)

I see that show_results() uses “samples” and “outs” – but I can’t figure out how to obtain samples & outs while inside a callback.


Update: Got it:

import wandb
from fastai.vision.all import *  # Callback, ProgressCallback, store_attr
from torchvision.utils import make_grid

class WandB_WGAN_Images(Callback):
    "Progress-like callback: log WGAN predictions to WandB"
    order = ProgressCallback.order+1
    def __init__(self, n_preds=10):
        store_attr()

    def after_epoch(self):
        if self.gen_mode:
            # last_gen holds the most recent batch produced by the generator
            preds = self.learn.gan_trainer.last_gen.cpu()
            img_grid = make_grid(preds[:self.n_preds], nrow=5)
            img_grid = img_grid.permute(1, 2, 0).squeeze()  # CHW -> HWC
            wandb.log({"examples": wandb.Image(img_grid)})

NB: This callback should be passed to fit() rather than included in the learner definition. Otherwise you'll get an error if you call learn.show_results() after wandb.finish().
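For example, usage would look something like this; the learner construction is a placeholder:

# placeholder usage: pass the logging callbacks to fit(), not to the learner
learn = GANLearner.wgan(dls, generator, critic)
learn.fit(10, 2e-4, cbs=[WandbCallback(log_preds=False), WandB_WGAN_Images()])
wandb.finish()
learn.show_results()  # safe, since the callbacks were not stored on the learner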

Example: Anime Faces GAN results on WandB:


Hi there

I am unable to get wandb to log metrics from my fastai learner, no matter what I try.
I'm currently running it like so:

import wandb
from fastai.callback.wandb import *

wandb.login()
wandb.init(project_name)
learn = cnn_learner_3d(dls, resnet18_3d, metrics=accuracy,
                       cbs=[WandbCallback(log='all', log_preds_every_epoch=True)])

Metrics are calculated in my progress bar, but they just won't appear in wandb.

Any thoughts?

Thanks in advance!

Try passing project="project_name" to wandb.init. I think the first param is job_type.
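For example (keeping your placeholder name):

wandb.init(project="project_name")  # pass the project as a keyword argument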

I am unable to reproduce your issue; this notebook is logging fine for me:

Thanks, you are right, it works for the simple example. Not sure why it doesn't work with the add-on library I was using. I'll do more investigation and report back if I figure it out.