Running into a bit of a strange issue. I’m running a fairly large language model on a GPU with 12 GB of memory.
I’m able to load the pretrained wikitext103 model, run through days of fine-tuning on another large corpus, and load and save my models, all without any memory issues. Now I’m trying to do some evaluation (for now, just testing on the validation set in the same data bunch). To avoid having to fit even once (which takes close to an hour), I’m attempting to use the `validate` function found in `basic_train.py` to get just the validation loss. My workflow is the following (a rough sketch follows the list):
- Create a data bunch with the same data as before
- Load the previously saved model weights
- Call `validate` with the model and the validation dataloader
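In code, it looks roughly like this. This is a minimal sketch, not my exact script: `path` and `'fine_tuned'` are placeholders for my actual data folder and saved-weights name, and the learner-construction call may differ slightly depending on the fastai version, but the `validate` call at the end is the part in question.

```python
from pathlib import Path
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM
from fastai.basic_train import validate

path = Path('data/my_corpus')  # placeholder for my actual corpus folder

data = TextLMDataBunch.from_folder(path)        # same data as during training
learn = language_model_learner(data, AWD_LSTM)  # same architecture as before
learn.load('fine_tuned')                        # previously saved fine-tuned weights

# Compute just the validation loss, without calling fit() first
val_loss = validate(learn.model, data.valid_dl, loss_func=learn.loss_func)
print(val_loss)
```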
This runs for about 50 batches and then crashes with `RuntimeError: CUDA error: out of memory`. I’ve even tried loading a much smaller validation dataset into the data bunch, but I still run into the memory issue. Checking memory with `nvidia-smi`, it looks like the usage keeps growing after each batch.
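In case it helps to narrow things down, here is a small sketch of how the growth can be checked from inside the process rather than via `nvidia-smi` (standard `torch.cuda` counters; where to call it is just an illustration):

```python
import torch

# Sketch: PyTorch's own view of GPU memory, to compare against nvidia-smi.
# memory_allocated() counts tensors currently held by the process; nvidia-smi
# also includes blocks the caching allocator has reserved but freed.
def report_mem(tag=''):
    mb = torch.cuda.memory_allocated() / 1024 ** 2
    print(f'{tag}: {mb:.0f} MB allocated')

report_mem('before validate')
# ... run validate(...) here ...
report_mem('after validate')
```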
What could be the issue here, given I’ve had no issues with training? Also, is this even the right way to approach this, or is there a more efficient workflow I can employ?