I noticed an annoying behavior in `LearnerTensorboardWriter`. In the specific case where the training phase ends with `iteration % self.loss_iters == 0` or `iteration % self.hist_iters == 0` (I am not sure whether only one of them or both cause the problem), the value of `iteration` doesn't change throughout the validation phase, so the callback keeps trying to write the loss and the weight histograms to TensorBoard on every validation batch. Since nothing actually gets written, I guess the requests get stuck in the async writer's queue for the whole phase and keep accumulating in RAM (it was something like 300 MB per batch, so at the very least all the histograms are piling up).
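To make the failure mode concrete, here is a toy illustration (standalone Python, not fastai code; the interval values are made up) of why every validation batch re-queues writes once the frozen `iteration` satisfies the modulo checks:

```python
# Toy illustration: iteration is frozen during validation, so the modulo
# conditions stay true for every single validation batch.
loss_iters, hist_iters = 25, 500
iteration = 500  # last training iteration; unchanged for the whole validation phase

queued_writes = 0
for _ in range(100):  # 100 validation batches
    if iteration % loss_iters == 0:
        queued_writes += 1  # another scalar write queued
    if iteration % hist_iters == 0:
        queued_writes += 1  # another (large) histogram write queued

print(queued_writes)  # 200 writes queued instead of 0
```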
I am not sure whether this behavior can be changed directly within the async writer, but for now I patched it by checking at the beginning of `on_batch_end` whether the model is in the training phase:
```python
def on_batch_end(self, last_loss:Tensor, iteration:int, train:bool, **kwargs)->None:
    "Callback function that writes batch end appropriate data to Tensorboard."
    if iteration == 0 or not train: return  # patched: also skip validation batches
    self._update_batches_if_needed()
    if iteration % self.loss_iters == 0:
        self._write_training_loss(iteration=iteration, last_loss=last_loss)
    if iteration % self.hist_iters == 0:
        self._write_weight_histograms(iteration=iteration)
```
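With that guard in place, nothing new is queued while `iteration` is frozen, so the validation phase no longer accumulates pending writes in RAM.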
If that solution is ok, I could suggest a PR, but there may be a more general solution (setting `iteration = -1` in `CallbackHandler` when training is finished, or something within the writer itself, I don't know).
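Just to make the `iteration = -1` idea concrete, here is a self-contained sketch of the sentinel check a callback could then rely on (purely illustrative and untested; `should_write` is a made-up helper, not fastai code):

```python
# Hypothetical sketch: if CallbackHandler set iteration to -1 outside
# training, any callback keyed on iteration could bail out early.
def should_write(iteration: int, every: int) -> bool:
    if iteration < 0:  # sentinel: not a training iteration
        return False
    return iteration > 0 and iteration % every == 0

assert should_write(500, 25)      # training batch: write
assert not should_write(-1, 25)   # validation batch: skip
assert not should_write(0, 25)    # first batch: skip
```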