Online logging, monitoring and comparison of models with W&B

Hi everyone,

I contributed to a module that can log, monitor and easily compare different runs by adding compatibility with fast.ai. It uses Weights & Biases (free for public code) and only requires adding a callback to your existing code.
It’s convenient for monitoring long runs on your phone and comparing runs on the same graph.

I’d be happy if anyone could test it and give me any feedback before the guys from W&B post a release note.

Here is a sample project:

You can also log images and all kinds of custom data. Here is an example where I used PyTorch to colorize black & white images: https://app.wandb.ai/borisd13/colorizer/runs/0uiwhl8e?workspace=user-borisd13

Feel free to reach out if you have any questions or comments. The objective is to make it as convenient as possible to log everything (losses, metrics, model topology, weights & gradients histograms, model file, predictions…).
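To give an idea of the usage, here is a minimal sketch with fastai v1 (the project name is made up, and the exact behaviour may differ slightly depending on your installed wandb version):

import wandb
from wandb.fastai import WandbCallback
from fastai.vision import *

wandb.init(project="my-project")  # hypothetical project name

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)

# Passing the callback through callback_fns logs training/validation losses and metrics
learn = cnn_learner(data, models.resnet18, metrics=accuracy,
                    callback_fns=WandbCallback)
learn.fit_one_cycle(1)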


Hi Boris, this looks cool. Does it slow down the model when you’re training though? How does the logging work for saving stuff?

Hi Carey!
So far I haven’t noticed any slowdown.
You just add the “WandbCallback” callback and it will log graphs for all your metrics as well as training and validation losses. You will also be able to see all parameters and the model.
There are a few custom options too for logging gradients, weights, any type of file (such as the trained model), or even predictions as the model trains.
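To give an idea, here is a sketch with a few of those options turned on; the parameter names below (log, save_model, input_type) are my recollection of the wandb.fastai callback options and may differ in your version:

from functools import partial
from wandb.fastai import WandbCallback

learn = cnn_learner(data, models.resnet18, metrics=accuracy,
                    callback_fns=partial(WandbCallback,
                                         log="all",             # assumed option: gradient + weight histograms
                                         save_model=True,       # assumed option: upload the trained model file
                                         input_type="images"))  # assumed option: log sample predictions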


@boris How does W&B use GPUs? I see the wandb.fastai import, which is a nice add-on.

I don’t think it uses the GPU at all (unless some of the underlying libraries such as Pillow use it for converting images). When doing predictions (for computer vision problems), the .predict function is used, which should not affect the GPU.

Something interesting is that your computer resources (including the GPU) are monitored, so you can track GPU allocation over time: https://app.wandb.ai/borisd13/DeOldify/runs/fb9p2rsy/system

It does allocate GPU resources, though, so clearly training is being run on the GPU, right? It seems to allocate it automatically. That seems to be what the system outputs suggest at least.

The callback in itself should not. What allocates GPU resources is your regular training loop. Even if you remove this callback, you can see the GPU resources allocated by your training by typing nvidia-smi in the terminal (assuming you use Ubuntu). Hope this answers your question.

Does it need an active internet connection to do its job? I’m currently working through Kubernetes pods and there won’t be any internet access in them.

Yes, that will work too. You can use the wandb sync functionality to upload your data from the local run folders afterwards.
See here, and feel free to ask if you have any questions.
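As a rough sketch of the offline workflow (the environment variable value and the run-folder pattern below are assumptions based on how wandb handled offline runs at the time, so check the docs for your version):

WANDB_MODE=dryrun python train.py   # run inside the pod: wandb stores the run locally instead of uploading
wandb sync wandb/dryrun-*           # later, from a machine with internet access, upload the stored run folders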


Also, just for future reference, @sgugger added this functionality by default in v2.
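For reference, a minimal sketch of how that looks in fastai v2 (assuming the built-in callback lives in fastai.callback.wandb; the project name is a placeholder):

import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback  # assumption: import path of the built-in v2 callback

wandb.init(project="my-project")  # hypothetical project name

path = untar_data(URLs.MNIST_SAMPLE)
dls = ImageDataLoaders.from_folder(path)

# In v2, callbacks are passed with cbs= instead of callback_fns=
learn = cnn_learner(dls, resnet18, metrics=accuracy, cbs=WandbCallback())
learn.fit_one_cycle(1)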


Thank you very much, that’s exactly what I needed.
I have one more doubt.
For example, in the case of Keras, I am using a callback to compute a validation metric. How can I get wandb to log this callback metric as well, apart from the values from model.fit()?

Correct me if I am wrong.
Let’s say we have a callback as follows, and the last line of the code below does the logging (apart from this, wandb should also log other things like val loss, val acc, etc. through fit_generator()):

from keras.callbacks import Callback
import wandb

class Metrics_crm(Callback):

    def __init__(self, model, val_folders, batch_size):
        self.model = model
        self.val_folders = val_folders
        self.batch_size = batch_size

    def on_train_begin(self, logs={}):
        self.val_sdr = []

    def on_epoch_end(self, epoch, logs={}):
        # Predict on the validation data and compute SDR with custom (non-tensor) operations;
        # DataGenerator, compute_sdr and the ground-truth targets `true` come from my own code.
        pred = self.model.predict_generator(DataGenerator(self.val_folders, self.batch_size))
        sdr = compute_sdr(pred, true)
        print('SDR:', sdr)

        wandb.log({'sdr': sdr})      # To let wandb log SDR (will this work?)

Update: I just tested the code, and the last line is indeed able to log the metric ‘sdr’.
But there is a small problem: it counts double the number of epochs, so on the X-axis of all graphs the number of points is twice the number of epochs.
Any solution for this?

Yes, there are 2 possible solutions:

  • add sdr as a regular metric, as metrics are automatically tracked (see the standard approach for doing it, depending on whether you use Keras or fastai)
  • call wandb.log({ your_dict }, commit=False): when you don’t pass a step argument, there must be only one commit per step, and that commit is done by the callback (see the sketch below)
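Applied to the callback above, the second option is a one-argument change (a sketch; DataGenerator, compute_sdr and true are from the earlier snippet):

    def on_epoch_end(self, epoch, logs={}):
        pred = self.model.predict_generator(DataGenerator(self.val_folders, self.batch_size))
        sdr = compute_sdr(pred, true)
        # commit=False attaches 'sdr' to the current step instead of starting a new one,
        # so the wandb Keras callback still commits exactly once per epoch.
        wandb.log({'sdr': sdr}, commit=False)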

Awesome. Including commit=False works perfectly.
Regarding the first suggestion, I cannot add sdr as a metric in Keras because my compute_sdr function contains non-tensor operations, such as the inverse short-time Fourier transform from SciPy, among others. In Keras all the operations in a metric must be tensor operations, so Keras throws an error for this.
Thank you for your help.


The WandbCallback didn’t slow down my training, but the SaveModelCallback slowed it down considerably.