Learner uses GPU on the training set but not the validation set

Hello,

I am training a text classification learner:

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, Precision(average='micro'), Recall(average='micro')])

When I train it with learn.fit_one_cycle(1, 5e-3) and watch the GPU in a terminal with watch -n 0.5 nvidia-smi, the GPU is being used (i.e. Volatile GPU-Util is not at 0%) during the first progress bar (which I gather covers the training set, since the length of the training set corresponds to batch size * total steps of the progress bar). However, when the second progress bar appears (for the validation set), the GPU usage drops to 0%.

Is there a reason for this? Can I force the learner to use the GPU in some way?

Thanks

Apparently it’s a quirk of fastai: inference is computed on the CPU. If anyone has a way to force the use of the GPU at inference time, I’d be interested, because my classifier has more than 10k labels and would benefit greatly from it.

No, that is not true. Validation is done on the GPU as well, just like training. If your GPU is less used, it may be because your CPUs can’t open the images fast enough to keep up.
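
A quick sanity check, assuming the learner from the first post and a standard fastai v1 install, is to ask PyTorch directly where the model lives:

    import torch

    print(torch.cuda.is_available())              # True means PyTorch can see the GPU
    print(next(learn.model.parameters()).device)  # should print something like cuda:0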

Thanks for the reply! I’m not opening images; I am doing text classification. Could lowering the batch size help? My batch size is currently 8. I have 16 GB of RAM.

Then there is no reason it wouldn’t use your GPU, especially if it’s already using it for training.
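
If you want to be explicit about the device anyway, something along these lines should work (a sketch, assuming fastai v1, where defaults lives in fastai.torch_core):

    import torch
    from fastai.torch_core import defaults

    # Set the default device *before* building the DataBunch and the learner
    defaults.device = torch.device('cuda')

    # Or move an already-created learner's model onto the GPU explicitly
    learn.model = learn.model.cuda()

And if your CPUs turn out to be the bottleneck, raising num_workers when you build data_clas (the DataBunch factory methods accept it) is worth a try.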

I lowered the batch size to 2 and I can now see that the GPU is being used (spikes of 2 to 6% GPU usage in nvidia-smi). However, the ETA is still in the same order of magnitude (2 hours for a single cycle). On the CPU side, my 4 cores are used at max capacity, so I guess the CPU is indeed the limiting factor.

In the end, after some memory profiling, I found the exact line that was slowing down the whole validation.
I created a pull request:
https://github.com/fastai/fastai/pull/2516

I replied on the PR. This change does not give the same result as before, so it won’t really work. Note that fastai v2 will rely only on scikit-learn to compute those metrics, so it might be good to switch to v2 for your problem.
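
In the meantime, a possible workaround (a sketch, not an official recipe; it assumes the learner from the first post and that get_preds defaults to the validation set, as it does in fastai v1) is to drop the expensive metrics from the learner and compute them once with scikit-learn on the collected predictions:

    from sklearn.metrics import precision_score, recall_score

    # Inference over the validation set still runs on the GPU
    preds, targs = learn.get_preds()
    pred_labels = preds.argmax(dim=1)

    # The metrics are computed once, on the CPU, outside the validation loop
    print(precision_score(targs.numpy(), pred_labels.numpy(), average='micro'))
    print(recall_score(targs.numpy(), pred_labels.numpy(), average='micro'))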


Just bringing this topic up again, because I’m having the exact same problem now, and I’m using fastai v2 (0.0.21). The GPU seems to be used during training, but not during validation. It’s also an NLP task (language translation).
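
For what it’s worth, here is how I’m checking device placement on my side (assuming a standard fastai v2 Learner, whose DataLoaders sit in learn.dls):

    # Both should report a CUDA device if the GPU is actually in use
    print(learn.dls.device)                       # device the DataLoaders move batches to
    print(next(learn.model.parameters()).device)  # device the model lives on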