Out of memory when using >70 million rows for training the Grocery model

My machine has 64GB of memory, and I am following @jeremy's lesson 3: I run np.array(trn, dtype=np.float32) first, before fitting RandomForestRegressor to train the Grocery prediction model. However, Python (or Jupyter) quits during the execution of that line if I use more than 70 million rows of the train set (I found this threshold by trying multiple times). If I don't run the np.array call separately, the out-of-memory issue still happens during the regressor's fit().

I am pretty sure this is an out-of-memory issue: watching the top output, I can see used memory climb all the way to about 64GB, and then Python/Jupyter suddenly quits.
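For a rough sense of scale, here is a back-of-the-envelope estimate of what a single dense float32 copy costs; the column count is just a guess on my part, since it depends on how many features have been joined in:

```python
import numpy as np

# One dense float32 copy needs rows * cols * 4 bytes. 40 columns is a guess;
# the real number depends on the joins you've done.
rows, cols = 70_000_000, 40
gib = rows * cols * np.dtype(np.float32).itemsize / 2**30
print(f"~{gib:.1f} GiB for a single copy")

# This copy exists alongside the original (often float64) DataFrame, plus
# whatever the fit itself allocates, so the total footprint is several
# times this figure.
```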

I doubt I am the only person having this issue. Should I choose another machine with more memory, or is there some other solution?

thanks!

It’s unlikely you need to build a model on that many rows - so you may simply want to consider using just the most recent year or so of data. Also, perhaps you’ve got more columns than you need? If you’ve joined extra columns into the dataset, first use a smaller sample to get feature importance, and then keep only the columns that turn out to be useful.
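A minimal sketch of both ideas, using a synthetic stand-in for the real data and made-up column names (substitute your own DataFrame and target):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the joined training data; replace with your own df and y.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "date": pd.Timestamp("2016-01-01") + pd.to_timedelta(rng.integers(0, 600, n), unit="D"),
    "store_nbr": rng.integers(1, 55, n),
    "item_nbr": rng.integers(1, 4000, n),
    "onpromotion": rng.integers(0, 2, n),
})
y = rng.normal(size=n)

# 1) Keep only roughly the most recent year of data.
recent = df["date"] > df["date"].max() - pd.Timedelta(days=365)
df_recent, y_recent = df.loc[recent], y[recent.to_numpy()]

# 2) Rank features on a small random sample, then keep only the useful
#    columns for the full-size run.
feats = ["store_nbr", "item_nbr", "onpromotion"]
idx = rng.choice(len(df_recent), size=min(50_000, len(df_recent)), replace=False)
m = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
m.fit(df_recent.iloc[idx][feats].astype(np.float32), y_recent[idx])

fi = pd.Series(m.feature_importances_, index=feats).sort_values(ascending=False)
print(fi)
keep = fi[fi > 0.005].index.tolist()   # threshold is arbitrary; adjust after inspecting fi
```

On the real data, the recent-year subset restricted to the kept columns should come in well below 64GB; `df.memory_usage(deep=True).sum()` is a quick way to check before converting.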

Also consider how many Jupyter notebooks you have running, and how many times you’ve renamed variables pointing to huge DataFrames. The Python interpreter keeps those objects alive until you drop the references or restart the kernel(s).
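To make that concrete, a throwaway sketch (the array stands in for a huge DataFrame):

```python
import gc
import numpy as np

big = np.zeros((10_000, 1_000), dtype=np.float32)   # stand-in for a huge DataFrame
big_renamed = big                                    # a second name points at the same data
big = big * 2                                        # ...and now two large arrays are in memory

del big, big_renamed   # drop every name that still points at the data
gc.collect()           # then ask the interpreter to release it

# Restarting the kernel (Kernel -> Restart) releases everything at once, and
# shutting down other notebooks' kernels frees the memory they hold as well.
```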


Thank you! So helpful! This should have been obvious to me…
