I’m training a language model on a fairly large dataset (~60 million examples) and I’m running into issues where memory usage (system memory, not GPU memory) continually increases during training to the point where my instance runs out of memory.
At the start of training, my instance has 44 GB of memory available. By the end of the first epoch, only 3 GB is available. Some memory is released after each epoch, but at some point during training the instance runs out entirely. This happened overnight, so I'm estimating from model checkpoint save times, but the out-of-memory point seems to hit about 3-4 epochs in.
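Since I'm only inferring the timeline from checkpoint timestamps, here's the kind of minimal logging I could add per epoch to track the growth directly (a sketch using the stdlib `resource` module; the epoch loop is hypothetical):

```python
import resource

def log_peak_memory(tag):
    # Peak resident set size of this process so far.
    # On Linux, ru_maxrss is reported in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{tag}: peak RSS = {peak_kb / 1024:.1f} MB")
    return peak_kb

# Hypothetical training loop: log once per epoch to see
# whether the peak keeps climbing epoch over epoch.
for epoch in range(3):
    log_peak_memory(f"after epoch {epoch}")
```

If the peak climbs monotonically across epochs rather than plateauing, that would confirm a genuine leak rather than normal caching.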
Does anyone know where this memory usage is coming from?