Dealing with Large Datasets in Python

I’m trying to work on the Corporación Favorita Grocery Sales Forecasting competition on Kaggle, and I’m running into an issue that I feel has to have a good solution. Basically, I have 16 GB of RAM and I’m having trouble loading the whole dataset without hitting an OOM error. Is there a way to batch this, or what kinds of strategies are there for handling it? If I had the toolkit I have at work, I would use something like Alteryx to split it into multiple files with a month’s data each, but even then I’m not sure the problem would be fixed.

How is everyone handling big datasets currently? Can a generator be used to process a small amount of data at a time and keep memory usage at something more manageable? One of the things I’ve seen suggested is changing the data types, but that seems like a stopgap that will eventually need a more permanent solution.
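For reference, this is the kind of dtype trick I mean (the column names and dtypes below are just my guesses for this competition’s train.csv, so adjust to the actual schema):

```python
import pandas as pd

# Guessed column dtypes for train.csv -- adjust to the real schema.
# Downcasting from pandas' default int64/float64 cuts the per-column
# memory footprint roughly in half or better.
dtypes = {
    'id': 'int32',
    'store_nbr': 'int8',
    'item_nbr': 'int32',
    'unit_sales': 'float32',
    'onpromotion': 'object',  # has missing values, so left as object here
}

train = pd.read_csv('train.csv', dtype=dtypes, parse_dates=['date'])
print(train.memory_usage(deep=True).sum() / 1024 ** 3, 'GB')
```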

5 Likes

For me, loading train.csv from this competition with pandas takes up about 6 GB of RAM.
I think you should be OK if you don’t try to keep multiple copies in memory at once.
If not, maybe you can do some bcolz juggling like in lesson 10?
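Something like this is what I mean by bcolz juggling — a rough sketch, not the exact lesson 10 code, and the array and path are just placeholders:

```python
import bcolz
import numpy as np

# Write a (possibly large) numpy array to a compressed, chunked on-disk carray.
arr = np.random.rand(1000000, 10).astype('float32')  # stand-in for your features
c = bcolz.carray(arr, rootdir='train_feats.bc', mode='w')
c.flush()

# Later: reopen without pulling everything into RAM, then slice out what you need.
c = bcolz.open('train_feats.bc', mode='r')
batch = c[:1000]  # only this slice is decompressed into memory
```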

Yup, if you’re doing DL stuff for it, bcolz would be perfect. Otherwise, look into using Dask.
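For the non-DL route, a minimal Dask sketch might look like this (the file path, column names, and dtypes are assumptions about the competition data):

```python
import dask.dataframe as dd

# Dask reads the CSV in partitions and only materialises what a computation
# needs, so the full file never has to sit in RAM at once.
df = dd.read_csv('train.csv',
                 dtype={'unit_sales': 'float32', 'store_nbr': 'int16'},
                 parse_dates=['date'])

# Lazy until .compute(): the aggregation runs partition by partition.
daily_sales = df.groupby('date')['unit_sales'].sum().compute()
print(daily_sales.head())
```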

4 Likes

So at some point you run into actual hardware limitations? I know that with images you can run through, say, 32 at a time. But when it’s rows of data, can you run through them in batches? Maybe that’s my real question: can you batch over rows so that only 1000 rows of data are loaded into RAM at a time, to avoid memory issues?

Absolutely. Both the suggestions I made use that approach.
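To make the row-batching idea concrete, here’s a minimal pandas-only sketch using `chunksize`, which gives you an iterator of DataFrames. The column names are assumptions about train.csv, and the per-chunk work is just an example aggregation:

```python
import pandas as pd

def chunked_totals(path, chunksize=1000):
    """Stream the CSV in row batches and accumulate a per-store total."""
    totals = {}
    # read_csv with chunksize returns an iterator, so only `chunksize`
    # rows are parsed and held in memory at any one time.
    for chunk in pd.read_csv(path, chunksize=chunksize,
                             usecols=['store_nbr', 'unit_sales']):
        for store, sales in chunk.groupby('store_nbr')['unit_sales'].sum().items():
            totals[store] = totals.get(store, 0.0) + sales
    return totals

totals = chunked_totals('train.csv')
```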

2 Likes

OK, great. Thanks for the clarification! I’ll make a post here once I figure out how it works.

Good to hear. I’m involved in this competition too, and in others with terabytes of data.
I’ll try these approaches and report back with feedback as well.

1 Like