I noticed an annoying behavior in `LearnerTensorboardWriter`. In the specific case where the training phase ends with `iteration % self.loss_iters == 0` or `iteration % self.hist_iters == 0` (I am not sure whether only one of them or both cause the problem), the value of `iteration` doesn't change throughout the validation phase, so the callback keeps trying to write the loss and the weight histograms to TensorBoard on every validation batch. Since nothing actually gets written, I guess the requests get stuck in the async writer's queue for the whole phase and keep accumulating in RAM (it was something like 300 MB per batch, so at the very least all the histograms are piling up).
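To make the failure mode concrete, here is a toy illustration (standalone Python, not fastai code; the interval values are made up) of why every validation batch re-queues writes once the frozen `iteration` satisfies the modulo checks:

```python
# Toy illustration: iteration is frozen during validation, so the modulo
# conditions stay true for every single validation batch.
loss_iters, hist_iters = 25, 500
iteration = 500  # last training iteration; unchanged for the whole validation phase

queued_writes = 0
for _ in range(100):  # 100 validation batches
    if iteration % loss_iters == 0:
        queued_writes += 1  # another scalar write queued
    if iteration % hist_iters == 0:
        queued_writes += 1  # another (large) histogram write queued

print(queued_writes)  # 200 writes queued instead of 0
```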
I am not sure whether this behavior can be changed directly within the async writer, but for now I patched it by checking at the beginning of `on_batch_end` whether the model is in the training phase:
```python
def on_batch_end(self, last_loss:Tensor, iteration:int, train:bool, **kwargs)->None:
    "Callback function that writes batch end appropriate data to Tensorboard."
    if iteration == 0 or not train: return  # patched: also skip validation batches
    self._update_batches_if_needed()
    if iteration % self.loss_iters == 0:
        self._write_training_loss(iteration=iteration, last_loss=last_loss)
    if iteration % self.hist_iters == 0:
        self._write_weight_histograms(iteration=iteration)
```
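With that guard in place, nothing new is queued while `iteration` is frozen, so the validation phase no longer accumulates pending writes in RAM.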
If that solution is ok, I could suggest a PR, but there may be a more general solution (setting `iteration = -1` in `CallbackHandler` when training is finished, or something within the writer itself, I don't know).
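Just to make the `iteration = -1` idea concrete, here is a self-contained sketch of the sentinel check a callback could then rely on (purely illustrative and untested; `should_write` is a made-up helper, not fastai code):

```python
# Hypothetical sketch: if CallbackHandler set iteration to -1 outside
# training, any callback keyed on iteration could bail out early.
def should_write(iteration: int, every: int) -> bool:
    if iteration < 0:  # sentinel: not a training iteration
        return False
    return iteration > 0 and iteration % every == 0

assert should_write(500, 25)      # training batch: write
assert not should_write(-1, 25)   # validation batch: skip
assert not should_write(0, 25)    # first batch: skip
```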