Loading the data for Corporación Favorita Grocery Sales Forecasting on a Low-RAM Machine

Dhruv · August 9, 2018, 6:41am

Hi

My question relates to the ML course offered by fast.ai. In lesson 3, Jeremy started the class by explaining this Kaggle competition.

So I tried to load the train.csv file, but I found out that the file itself is around 4.65 GB, which I believe will not fit in my system's memory.

My system specs are: Core i7, 2 GB Nvidia GeForce 840M, 8 GB RAM. I usually use this system only for ML, and I really do not want to go through the hassle of setting up fast.ai on another system in the cloud. So is there any way I can load the dataset in smaller pieces so that I can work with it using pandas?

MicPie · August 9, 2018, 7:09pm

Hi @Dhruv,

You could use "chunking" to load the file in smaller pieces, but I would recommend having a look at the kernel posted here:

With pandas data frames, it is always best to use the appropriate data format to save memory.
With the approach from the kernel above, memory reductions of 50% are not uncommon.
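
As a rough illustration of the idea (assuming "appropriate data format" here means smaller column dtypes), the dtypes can be passed to `pd.read_csv` directly, so the full-size frame never has to exist in memory first. This is only a sketch: the tiny inline CSV stands in for train.csv, the column names are assumed to match the competition file, and the dtype choices in `DTYPES` are hypothetical and should be checked against the real value ranges.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv. The column names are an assumption about
# the competition file; swap in the real path when trying this out.
SAMPLE_CSV = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-01,25,103665,7.0,
1,2013-01-01,25,105574,1.0,
2,2013-01-02,1,105575,2.0,
3,2013-01-02,1,108079,1.0,
"""

# Hypothetical dtype choices: verify that the real values fit these ranges.
DTYPES = {
    "id": "int32",
    "store_nbr": "int8",
    "item_nbr": "int32",
    "unit_sales": "float32",
}


def load(source, dtypes=None):
    """Read the CSV, optionally forcing smaller dtypes at parse time."""
    return pd.read_csv(source, dtype=dtypes, parse_dates=["date"])


default_df = load(io.StringIO(SAMPLE_CSV))
compact_df = load(io.StringIO(SAMPLE_CSV), DTYPES)

# Compare the in-memory footprint of both frames.
default_bytes = default_df.memory_usage(deep=True).sum()
compact_bytes = compact_df.memory_usage(deep=True).sum()
print(f"default dtypes: {default_bytes} bytes")
print(f"compact dtypes: {compact_bytes} bytes")
```

Because the dtypes are applied while parsing, this can also help when the file does not fit with the default 64-bit types at all.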

Best regards
Michael

Dhruv · August 10, 2018, 7:35pm

Yeah, but in the notebook he is able to load the data initially and then compress it to show the difference. I am not able to load the dataset even in that initial step.

MicPie · August 10, 2018, 7:53pm

There is an older thread about this issue:

Maybe using dask instead of pandas would already solve your problem?

Or chunking with pandas: http://pandas-docs.github.io/pandas-docs-travis/io.html#iterating-through-files-chunk-by-chunk ?
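
A minimal sketch of chunking, assuming the goal is an aggregate that can be built up piece by piece: `pd.read_csv` with `chunksize` returns an iterator of small data frames, so only one chunk is in memory at a time. The inline CSV is a stand-in for train.csv, the column names are assumed, and `total_sales_per_store` is a hypothetical helper for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv; column names are an assumption.
SAMPLE_CSV = """id,date,store_nbr,item_nbr,unit_sales
0,2013-01-01,25,103665,7.0
1,2013-01-01,25,105574,1.0
2,2013-01-02,1,105575,2.0
3,2013-01-02,1,108079,1.0
4,2013-01-03,25,103665,3.0
"""


def total_sales_per_store(source, chunksize):
    """Sum unit_sales per store while holding only one chunk in memory."""
    totals = None
    reader = pd.read_csv(
        source,
        chunksize=chunksize,                  # rows per chunk
        usecols=["store_nbr", "unit_sales"],  # skip columns we do not need
    )
    for chunk in reader:
        part = chunk.groupby("store_nbr")["unit_sales"].sum()
        # Merge this chunk's partial sums into the running totals.
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals


totals = total_sales_per_store(io.StringIO(SAMPLE_CSV), chunksize=2)
print(totals)
```

For the real file, a path such as "train.csv" would replace the `io.StringIO` object, with a much larger `chunksize` (for example, a few million rows) chosen to fit the available RAM.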

I once had a problem where, after merging all the data, I got an out-of-memory error when writing to a compressed file format, because the compression needed additional memory. In that case the feather format solved the problem (I guess it writes the in-memory data directly to the file without doing anything additional).

Hope some of this helps!

Best regards
Michael