CUDA out of memory with TensorBoard callback

I am encountering strange behavior when running my model on a P100 GPU with 16 GB of memory.
With a batch size of 8, the first epoch and its validation step run without problems. However, at the start of the second epoch (or, more likely, during the end-of-epoch callbacks) the run goes out of memory.

RuntimeError: CUDA out of memory. Tried to allocate 60.00 MiB (GPU 0; 15.90 GiB total capacity; 15.02 GiB already allocated; 17.75 MiB free; 15.07 GiB reserved in total by PyTorch)
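
For context, here is a stripped-down sketch of how the training loop, validation step, and TensorBoard logging are wired. The model and data below are tiny stand-ins (the real model and images are much larger), but the structure matches what I described above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter

# Tiny stand-ins for the real model and dataset, just to show the structure.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
).cuda()
train_set = TensorDataset(torch.randn(64, 3, 128, 128), torch.randint(0, 2, (64,)))
val_set = TensorDataset(torch.randn(16, 3, 128, 128), torch.randint(0, 2, (16,)))
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
writer = SummaryWriter()  # TensorBoard logging, invoked at the end of each epoch

for epoch in range(3):
    model.train()
    for images, targets in train_loader:
        images, targets = images.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    # validation step at the end of the epoch
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, targets in val_loader:
            val_loss += criterion(model(images.cuda()), targets.cuda()).item()
    writer.add_scalar("val/loss", val_loss / len(val_loader), epoch)

writer.close()
```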

If I use a dev dataset (the same data, just fewer images) of about 20 images, this does not happen with the same batch size as before. Once the dataset exceeds roughly 30 images, the error occurs again. The validation step itself runs through and my metrics are displayed, so I do not think the validation step is the cause.

The same thing happens even with a batch size of 2.

Tracking the memory usage manually, step by step, with nvidia-smi:
Initializing CUDA: 0.8 GB
Model on GPU: 1.4 GB
Forward pass with a single batch: 14.4 GB
After loss.backward(): 13.9 GB
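
The same checkpoints can also be read from inside the process with PyTorch's own counters (torch.cuda.memory_allocated / torch.cuda.memory_reserved), which separate memory held by live tensors from blocks reserved by the caching allocator. A small helper for that, as a sketch:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print PyTorch's view of GPU memory: live tensors vs. cached blocks."""
    allocated = torch.cuda.memory_allocated() / 1024**3  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3    # memory reserved by the caching allocator
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Called at the same points as the nvidia-smi readings above, e.g.:
# log_gpu_memory("after model.cuda()")
# log_gpu_memory("after forward")
# log_gpu_memory("after loss.backward()")
```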

Training the small dataset for more epochs, so that the total amount of data going through the GPU matches the full training dataset, does not lead to an error. If I do not use the TensorBoard callback, everything works fine. Everything also works on the CPU.

Any ideas why this happens?