GPU memory issue - low BS - small model

Hey team,

I am running into a CUDA out of memory error on Kaggle.

RuntimeError: CUDA out of memory. Tried to allocate 1.63 GiB (GPU 0; 15.90 GiB total capacity; 13.57 GiB already allocated; 993.88 MiB free; 14.27 GiB reserved in total by PyTorch)

I read through the GPU issue section
https://docs.fast.ai/dev/gpu.html

I have also read through a number of questions posted here and elsewhere, with no real luck.
The dataset is tabular, with 1,557,000 rows and 66 features.

Even the smallest model hits the CUDA error, i.e. layers=[1, 1] and batch_size=1.

I am unable to run learn.fit_one_cycle without hitting the CUDA error.
I have tried refreshing the Kaggle page, confirming that the GPU memory is empty, and then re-running, but the problem persists.
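
For what it's worth, this is how I'm confirming the GPU is empty before training (plain PyTorch calls; on older PyTorch versions memory_reserved() was named memory_cached()):

```python
import torch

# How much GPU memory PyTorch currently holds
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# Release cached blocks back to the driver (live tensors are unaffected)
torch.cuda.empty_cache()
```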

Is it likely that something weird in my dataset is causing a huge memory leak somehow? I have run models on Kaggle with far more than 1.5 million data points before.

Any ideas are much appreciated.

Have you tried this one?
https://docs.fast.ai/utils.collect_env.html#show_install
Can you run show_install and paste it here?
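
That is, in a notebook cell:

```python
from fastai.utils.collect_env import show_install

# Prints fastai/PyTorch versions plus GPU name, driver, and memory info
show_install()
```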

I hadn’t seen this one. Not 100% sure how to interpret it though.

Can you please share the code?

Yeah, sure - I can't share the data itself, but I can print out descriptions of the dataset.

I simplified everything to try to find the issue:

[screenshots of the simplified data prep and learner code]
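
Roughly, the pipeline looks like this (a minimal sketch of my fastai v1 tabular setup; the dataframe and column names below are placeholders, not my real ones):

```python
import numpy as np
import pandas as pd
from fastai.tabular import *

# Placeholder frame standing in for the real 1.5M-row, 66-feature data
df = pd.DataFrame({'cat1': ['a', 'b'] * 50,
                   'cont1': np.random.randn(100),
                   'target': np.random.randint(0, 2, 100)})

procs = [FillMissing, Categorify, Normalize]
data = (TabularList.from_df(df, cat_names=['cat1'], cont_names=['cont1'], procs=procs)
        .split_by_rand_pct(valid_pct=0.5)   # originally 0.3
        .label_from_df(cols='target')
        .databunch(bs=1))

learn = tabular_learner(data, layers=[10, 10], metrics=accuracy)
learn.fit_one_cycle(1)
```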

This does actually work, but only because I changed the split_by_rand_pct valid_pct from 0.3 to 0.5 - however, the GPU is still at capacity, I want a bigger model than [10, 10], and I can't change the batch_size.

As in my first post, the dataset is only 1.5 million rows with 66 features. It seems like that shouldn't be such a load on the GPU.
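
A rough sanity check of the raw size (assuming everything were float32 and resident on the GPU at once, which it isn't):

```python
# Back-of-envelope: the entire dataset in float32, all at once
rows, cols, bytes_per_val = 1_557_000, 66, 4
print(f"{rows * cols * bytes_per_val / 1024**3:.2f} GiB")  # ~0.38 GiB, far below 16 GiB
```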

Do you mind telling us what GPU you are using?

It’s the free Kaggle GPU.

It’s strange, but everything seemed to go back to normal when I changed the validation split from 0.3 to 0.5. I can now change the batch_size, the size of the layers, etc., and have not had any further issues. It could be a Kaggle issue, or the relative sizes of the train and validation sets could matter more than I expected?
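
One thing I plan to test (assuming I'm reading the fastai v1 docs right that databunch accepts a val_bs argument) is capping the validation batch size explicitly, reusing the placeholder df and procs from the sketch above:

```python
# Untested idea: keep validation batches small as well
data = (TabularList.from_df(df, cat_names=['cat1'], cont_names=['cont1'], procs=procs)
        .split_by_rand_pct(valid_pct=0.3)
        .label_from_df(cols='target')
        .databunch(bs=64, val_bs=64))
```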

I might try some further testing with that during the week if I find some time.

Thanks for the inputs team.

I haven’t used the Kaggle GPU before. I suggest trying Colab instead.
