I was just trying my hand at the Corporación Favorita Grocery Sales Forecasting competition on Kaggle, using only a year and a half's worth of data to train my model. I just finished preprocessing the datasets, but when I tried to create a TabularDataBunch from the dataframes, I got an out of memory error. According to the .info() method, the train, valid and test dataframes are 6.1 GB, 560 MB and 376 MB respectively. I have never worked with a dataset this big before and I'm not sure what to do. Are there any rules of thumb for dealing with big datasets like these? I could reduce the size of the training set, but I wonder if there is a simpler and more elegant solution.
Best would be to search the #ml1 category. This is a topic that’s been discussed there a lot. The details may be slightly different for fastai v1, but the basic idea is the same.
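In the meantime, one trick that comes up a lot in those threads is shrinking the dataframes before you build the DataBunch: downcast numeric columns to smaller dtypes and turn low-cardinality string columns into pandas categoricals. Below is a minimal, generic pandas sketch of that idea (the helper name `downcast_dtypes` is just made up here, it's not part of fastai), so treat it as a starting point rather than a drop-in fix:

```python
import numpy as np
import pandas as pd

def downcast_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink a dataframe's memory footprint by downcasting column dtypes.

    float32 is usually a safe compromise; float16 can lose too much precision
    for model features, so it is not used here.
    """
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.integer):
            # Pick the smallest integer dtype that still fits the values.
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif np.issubdtype(col_type, np.floating):
            df[col] = df[col].astype(np.float32)
        elif col_type == object:
            # Low-cardinality strings are much cheaper as categoricals.
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype("category")
    return df

# Usage: compare memory before and after (paths are placeholders).
# train_df = pd.read_csv("train.csv")
# print(train_df.info(memory_usage="deep"))
# train_df = downcast_dtypes(train_df)
# print(train_df.info(memory_usage="deep"))
```

On a dataframe like yours, where most columns are int64/float64 by default, this kind of downcasting often cuts memory use by half or more, which may be enough to get the TabularDataBunch created without sampling down the training set.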