Dealing with Large Datasets in Python

I’m trying to work on this Kaggle competition, Corporación Favorita Grocery Sales Forecasting, and I’m running into an issue that I feel has to have a good solution. Basically I have 16 GB of RAM and I’m having trouble loading the data without hitting an OOM error. Is there a way to batch this, or what kinds of strategies are there for handling it? If I had the toolkit I have at work, I’d use something like Alteryx to split it into multiple files with a month’s data each, but even then I’m not sure that would fix the problem. How is everyone handling big datasets currently? Can a generator be used to push a small amount of data through at a time and keep memory usage manageable? One thing I’ve seen suggested is changing the data types, but that seems like a temporary fix that will eventually need a more permanent solution.
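To make the generator question concrete, something like this is what I have in mind — a sketch where rows are yielded in fixed-size batches so only one batch lives in memory at a time (all names here are made up):

```python
# Sketch: a generator that yields fixed-size batches of rows, so only one
# batch is materialized in memory at a time. Names are illustrative.
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Works on any iterable (a file handle, a csv.reader, a DB cursor, ...),
# so the full dataset never has to sit in RAM at once.
sizes = [len(b) for b in batched(range(2500), 1000)]  # → [1000, 1000, 500]
```

Is this roughly the right idea, or is there a more standard tool for it?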


For me, loading train.csv from this competition with pandas takes up about 6 GB of RAM.
I think you should be OK if you don’t try to keep multiple copies in memory at once.
If not, maybe you can do some bcolz juggling like in lesson 10?

Yup, if you’re doing DL stuff for it, bcolz would be perfect. Otherwise look into using Dask.


So at some point you run into actual hardware limitations? I know that with images you can run through 32 at a time, but when it’s rows of data, can you process them in batches too? Maybe that’s my real question: can you run batches over rows so that only, say, 1000 rows are loaded into RAM at a time to avoid memory issues?

Absolutely. Both the suggestions I made use that approach.
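For example, even with plain pandas you can get that effect via `read_csv(chunksize=...)`, which gives you an iterator of DataFrames instead of one giant frame. A minimal sketch — a tiny in-memory CSV stands in for train.csv here, and the column names and dtypes are illustrative, not the competition’s exact schema:

```python
# Sketch: pandas reads the CSV in fixed-size chunks; each chunk is an
# ordinary DataFrame, so peak memory is bounded by the chunk size.
# The in-memory CSV and its columns are made up for illustration.
import io
import pandas as pd

csv_data = io.StringIO(
    "store_nbr,unit_sales\n"
    + "\n".join(f"{i % 5},{i * 0.5}" for i in range(10))
)

# Smaller dtypes up front (int16/float32 vs the default int64/float64)
# roughly halve per-column memory, which compounds with the chunking.
dtypes = {"store_nbr": "int16", "unit_sales": "float32"}

total = 0.0
n_chunks = 0
for chunk in pd.read_csv(csv_data, dtype=dtypes, chunksize=4):
    total += float(chunk["unit_sales"].sum())  # aggregate without a full load
    n_chunks += 1
```

Dask does essentially the same partitioned reading for you, but with a DataFrame-like API on top.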


Ok, great. Thanks for the clarification! I will make a post on here once I figure out how it works.

OK, good. I’m involved in this competition too, and in others with terabytes of data.
I’ll try these approaches and give feedback as well.
