How to handle dataframes too large to fit in memory?

marcmuc · February 27, 2019, 11:08am

Sorry if this is obvious and you have already done it, but first of all make sure you specify the datatypes of the columns when reading the data in with pandas. In one Kaggle competition this reduced the memory needed by more than 50%. I have created a kernel about that:

https://www.kaggle.com/marcmuc/large-csv-datasets-with-pandas-use-less-memory

The key is that pandas by default assigns the 64-bit versions of int and float to numeric columns, whereas your data can sometimes live with 8-bit ints and most of the time with 32-bit floats. Choosing the smaller types significantly reduces your memory footprint.
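To make that concrete, here is a minimal sketch of passing explicit dtypes to `pd.read_csv`. The column names, the values and the chosen types are made up for illustration, and the CSV is an in-memory string standing in for a large file. Check your own value ranges before picking a narrow type (int8 only holds -128 to 127, for example).

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = "user_id,age,price\n1,34,9.99\n2,51,120.50\n3,27,3.25\n"

# Default behaviour: pandas infers int64 / float64 for numeric columns.
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicit, narrower dtypes (assumed to be wide enough for this data).
dtypes = {"user_id": "int32", "age": "int8", "price": "float32"}
df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Compare the bytes used by the column data (index excluded).
default_bytes = df_default.memory_usage(index=False, deep=True).sum()
small_bytes = df_small.memory_usage(index=False, deep=True).sum()

print(df_default.dtypes)
print(df_small.dtypes)
print(f"default: {default_bytes} bytes, with dtypes: {small_bytes} bytes")
```

On a real dataset you would pass the file path instead of the `StringIO` object. The saving depends entirely on how many columns can be narrowed.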

Also, if running the model fails, try setting the number of workers to 0. There are still often problems when using workers in pytorch/fastai due to memory consumption, see this thread
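For reference, this is roughly what setting the workers to 0 looks like with a plain PyTorch `DataLoader`. The tensors and sizes are invented for illustration. fastai's data-loading helpers have generally accepted a similar `num_workers` argument, but the exact call depends on your fastai version, so treat that part as an assumption and check the docs for the version you use.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 64 samples with 10 features each, and binary labels.
features = torch.randn(64, 10)
labels = torch.randint(0, 2, (64,))
dataset = TensorDataset(features, labels)

# num_workers=0 loads batches in the main process, so no worker
# subprocesses are started for data loading.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

# Pull one batch to check that loading works.
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

Loading in the main process is usually slower, so it is worth raising the value again once the memory problem is sorted out.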