I am having problems with a CUDA out-of-memory error on Kaggle:
RuntimeError: CUDA out of memory. Tried to allocate 1.63 GiB (GPU 0; 15.90 GiB total capacity; 13.57 GiB already allocated; 993.88 MiB free; 14.27 GiB reserved in total by PyTorch)
I have read through a number of questions posted here and elsewhere, with no real luck.
The dataset is tabular, with 1,557,000 rows and 66 features.
I have tried running the smallest possible model, with layers=[1, 1] and batch_size=1, and I still get the CUDA error.
I am unable to run learn.fit_one_cycle without hitting the CUDA error.
I have tried refreshing the Kaggle page, confirming that the GPU is empty, and then rerunning, but the problem persists.
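Rather than refreshing the page, it may help to inspect and clear PyTorch's view of GPU memory from inside the notebook. A minimal sketch, assuming PyTorch is installed (the function names here are my own; the `torch.cuda` calls are the standard API):

```python
import gc
import torch

def report_gpu_memory():
    """Print how much GPU memory PyTorch has allocated vs. reserved."""
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

def free_cached_memory():
    """Drop unreferenced tensors, then return cached blocks to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Note that `empty_cache()` only releases memory PyTorch has cached but is no longer using; tensors still referenced by a `Learner` (e.g. a previous `learn` object) must be deleted first for `gc.collect()` to reclaim them.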
Is it likely that there is something weird in my dataset that is somehow causing a huge memory leak? I have run models on Kaggle with far more than 1.5 million data points before.
This does actually work, but only because I changed split_by_rand_pct from 0.3 to 0.5. However, the GPU is still at capacity, I want a bigger model than [10, 10], and I can't change the batch_size.
As in my first post, the dataset is only 1.5 million rows with 66 features, which shouldn't be such a load on the GPU.
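To back that up with rough arithmetic: even if every value were held on the GPU as a float32 (4 bytes, an assumption on my part, since fastai may use a different dtype), the raw data would only be a few hundred MiB:

```python
# Rough memory footprint of the raw tabular data, assuming float32 values.
rows, cols = 1_557_000, 66
bytes_per_value = 4
total_bytes = rows * cols * bytes_per_value
gib = total_bytes / 1024 ** 3
print(f"{gib:.2f} GiB")  # → 0.38 GiB, far below the 15.90 GiB card capacity
```

So the data itself cannot explain a 13+ GiB allocation; something else (model activations, a leak, or duplicated copies of the data) must be consuming the memory.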
It’s strange, but everything seemed to go back to normal when I changed the validation split from 0.3 to 0.5. I can now change the batch_size, the size of the layers, etc., and have not had any further issues. It could be a Kaggle issue, or perhaps the relative sizes of the train and validation sets matter more than I expected?
I might try some further testing with that during the week if I find some time.