Tips for training on a large tabular dataset?

I have a fairly large dataset (a pandas DataFrame with 90 columns and 7 million rows), and I keep getting CUDA out-of-memory errors when trying to train on it. I tried a batch size of 1, but that did not help (it did not seem to change anything). Are there any guidelines for training on a dataset like this without maxing out GPU memory?

Do you have a minimal reproducible example? 90 columns doesn't seem that big. What is the exact cause of the OOM?


I was able to get it to train! The embedding sizes were causing the problem: I had a categorical column with almost 8 million unique values. I guess the default embedding size is no longer the min(50, (unique_vals + 1) / 2) rule that prior fastai versions used, due to learner-reloading issues: Loading saved TabularModel fails due to embeddings
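For anyone who hits the same wall, here is a minimal sketch of the workaround: cap the embedding width for the high-cardinality column by passing an emb_szs dict to tabular_learner. This assumes the fastai v2 tabular API, and the file path and column names ('user_id', 'some_cat', 'cont_a', 'cont_b', 'target') are placeholders for your own data:

```python
from fastai.tabular.all import *
import pandas as pd

# Placeholder load -- swap in however you build your 7M-row DataFrame.
df = pd.read_csv('data.csv')

# 'user_id' stands in for the ~8M-unique-value categorical column;
# all other column names here are hypothetical too.
to = TabularPandas(
    df,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=['user_id', 'some_cat'],
    cont_names=['cont_a', 'cont_b'],
    y_names='target',
    splits=RandomSplitter(valid_pct=0.2)(range_of(df)),
)
dls = to.dataloaders(bs=1024)

print(get_emb_sz(to))  # inspect the default (cardinality, width) pairs

# Cap the embedding width for the huge column so its embedding matrix
# fits in GPU memory; every other column keeps the default size.
learn = tabular_learner(dls, layers=[200, 100], emb_szs={'user_id': 32})
learn.fit_one_cycle(1)
```

If I'm reading the fastai v2 source right, the current default rule (emb_sz_rule) is min(600, round(1.6 * n_cat**0.56)), so an 8-million-category column gets a width of 600, and an 8M x 600 float32 embedding matrix is roughly 19 GB on its own, which would explain the OOM no matter the batch size. If the column is an ID-like field, dropping it or hashing it into far fewer buckets may work even better than shrinking the embedding.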
