LearnerTensorboardWriter causing RAM overload


I noticed an annoying behavior in LearnerTensorboardWriter. When the training phase ends with iteration % self.loss_iters == 0 or iteration % self.hist_iters == 0 (I am not sure whether only one of them or both cause the problem), the value of iteration doesn’t change throughout the validation phase, so the callback keeps trying to write the loss and the histograms to TensorBoard on every validation batch. As nothing seems to get written, I guess the requests get stuck in the queue for the whole phase and keep accumulating in RAM (it was something like 300MB/batch, so at least all the histograms are getting stuck).
I am not sure if this behavior can be changed directly within the async writer, but for now I patched it by testing whether the model is in the training phase at the beginning of on_batch_end:

def on_batch_end(self, last_loss:Tensor, iteration:int, train:bool, **kwargs)->None:
    "Callback function that writes batch end appropriate data to Tensorboard."
    if iteration == 0 or not train: return
    if iteration % self.loss_iters == 0: self._write_training_loss(iteration=iteration, last_loss=last_loss)
    if iteration % self.hist_iters == 0: self._write_weight_histograms(iteration=iteration)

If that solution is OK, I could submit a PR, but there may be a more general solution (setting iteration = -1 in CallbackHandler when training is finished, or changing something in the writer itself, I don’t know).
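To make the effect of the guard easier to see, here is a toy reproduction that does not use fastai at all. FakeWriter and run_epoch are hypothetical names made up for this sketch, the list stands in for the async writer's queue, and the way the counter is advanced (after the callback, and only while training) is my assumption about how the training loop behaves:

```python
class FakeWriter:
    """Hypothetical stand-in for the callback; not the fastai class."""

    def __init__(self, loss_iters, hist_iters, guard):
        self.loss_iters = loss_iters
        self.hist_iters = hist_iters
        self.guard = guard      # True -> apply the `not train` check
        self.requests = []      # stands in for the async writer's queue

    def on_batch_end(self, iteration, train):
        if iteration == 0:
            return
        if self.guard and not train:
            return              # the patch: skip validation batches
        if iteration % self.loss_iters == 0:
            self.requests.append(("loss", iteration))
        if iteration % self.hist_iters == 0:
            self.requests.append(("hist", iteration))


def run_epoch(writer, n_train, n_valid):
    """Simulate one epoch and return the number of queued write requests."""
    iteration = 0
    for _ in range(n_train):
        writer.on_batch_end(iteration, train=True)
        iteration += 1          # assumed: counter only advances in training
    for _ in range(n_valid):
        # The counter is frozen here, so the modulo tests give the
        # same result on every validation batch.
        writer.on_batch_end(iteration, train=False)
    return len(writer.requests)


unguarded = run_epoch(FakeWriter(25, 50, guard=False), n_train=100, n_valid=10)
guarded = run_epoch(FakeWriter(25, 50, guard=True), n_train=100, n_valid=10)
print(unguarded, guarded)
```

With 100 training batches the counter stops at 100, which is a multiple of both intervals, so the unguarded version queues two extra requests on each of the 10 validation batches while the guarded one queues none.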

It seems like a good patch; you should definitely submit a PR with it.

I’m on it! I also suggest switching from tensorboardX to torch.utils.tensorboard. Do you think that is a good idea?

Yeah, we’d want that ultimately. I haven’t touched the current implementation of the callback/integration with tensorboard yet, but if you want to tackle it, by all means, go ahead!

I am currently using it with no change other than importing it instead of tensorboardX, and it works fine (it is basically the same implementation). The only thing that doesn’t work is graph writing, but that is because my model contains hooks, which are not compatible with tensorboard. I’ll run some more tests and maybe come back with another PR.
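For anyone who wants to try the same swap without breaking setups that only have tensorboardX, a fallback import along these lines might be enough. This is only a sketch: load_summary_writer is a hypothetical helper name, and it assumes both packages expose a class called SummaryWriter with the same basic methods:

```python
import importlib


def load_summary_writer():
    """Return a SummaryWriter class, preferring torch.utils.tensorboard.

    Falls back to tensorboardX, and returns None if neither can be imported.
    """
    for module_name in ("torch.utils.tensorboard", "tensorboardX"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue            # not installed (or missing a dependency)
        return module.SummaryWriter
    return None


SummaryWriter = load_summary_writer()
if SummaryWriter is None:
    print("No TensorBoard writer available")
else:
    print("Using", SummaryWriter.__module__)
```

The rest of the callback code can then keep using SummaryWriter as before, whichever package it came from.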

Looking at the source code, it looks like a copy/paste of tensorboardX.