RuntimeError: CUDA error ONLY when using `validate()`

steveyang · October 30, 2018, 7:16am

Running into a bit of a strange issue. I’m running a fairly large language model on a GPU that has 12 GB of memory.

I’m able to load the pretrained wikitext103 weights, run through days of fine-tuning on another large corpus, and save and load my models, all without any memory issues. Now I’m trying to do some evaluation (for now just testing on my validation set in the same data bunch). To avoid having to fit even once (which takes close to an hour), I’m attempting to use the validate function found in basic_train.py to just get the validation loss. My workflow is the following:

1. Create a data bunch with the same data as before
2. Load the previously saved model weights
3. Run: validate(learn.model, learn.data.valid_dl)

This runs for about 50 batches, then crashes with RuntimeError: CUDA error: out of memory. I’ve even tried loading a much smaller validation dataset into the data bunch, but I still run into the memory issue. Checking the memory in nvidia-smi, it looks like the usage keeps growing after each batch.

What could be the issue here, given I’ve had no issues with training? Also, is this even the right way to approach this, or is there a more efficient workflow I can employ?

sgugger · October 30, 2018, 1:44pm

Note that you should use learn.validate.

The problem here is that you didn’t pass a loss function to validate, so it is accumulating the outputs of the model and the targets. Since the vocab size is probably large, you’re running out of memory pretty quickly! Passing the loss function will fix that issue: validate will then accumulate the losses and metrics instead, which is what learn.validate does.
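
To get a feel for the scale, here is a rough back-of-the-envelope sketch of how fast stored outputs add up. The batch size, sequence length, and vocab size are hypothetical placeholders (the thread doesn't give the real ones), and the helper names are made up for illustration. It also ignores the memory the model itself takes, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope estimate. Every setting below is a hypothetical
# placeholder, not a value taken from this thread.

def output_bytes_per_batch(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Size of one batch of language-model outputs:
    one float32 score per vocab entry, per token, per sequence."""
    return batch_size * seq_len * vocab_size * bytes_per_value


def batches_until_full(memory_bytes, per_batch_bytes):
    """How many batches of stored outputs fit in memory_bytes
    (ignores the model weights and everything else on the GPU)."""
    return memory_bytes // per_batch_bytes


GPU_BYTES = 12 * 1024**3  # the 12 GB card mentioned in the question

# Assumed settings: 32 sequences of 70 tokens, 60,000-word vocab.
per_batch = output_bytes_per_batch(batch_size=32, seq_len=70, vocab_size=60_000)

print(f"outputs kept per batch: {per_batch / 1e6:.1f} MB")
print(f"batches until 12 GB is full: {batches_until_full(GPU_BYTES, per_batch)}")

# With a loss function, only one number per batch has to be kept.
print(f"losses kept for 1000 batches: {1000 * 4} bytes")
```

With these assumed settings, each batch of outputs is around half a gigabyte, so a 12 GB card fills up after a few dozen batches, whereas the per-batch losses stay negligible no matter how long validation runs.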

steveyang · October 30, 2018, 4:46pm

Thank you! Looks like I have a slightly older version that doesn’t have validate as a method on the learner, only as a function outside the class. Will update and give that a try.