Most effective ways to merge "big data" on a single machine

One other option could be to dump the CSVs into a database table (SQL, Redshift, or even Google BigQuery); pandas can read from all of them. That way you can sample the data or process it in chunks. Most DL methods use mini-batches, so the big data doesn’t need to live on the machine, it just needs to be accessible.
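
In case it helps, a minimal sketch of the chunked-read pattern with pandas. The database file and table name ("data.db", "transactions") are hypothetical; the same pattern works against Redshift or BigQuery through the appropriate connection or SQLAlchemy engine.

```python
# Read a large table in manageable chunks instead of all at once.
import sqlite3

import pandas as pd

conn = sqlite3.connect("data.db")  # hypothetical local database

chunks = pd.read_sql_query(
    "SELECT * FROM transactions",  # hypothetical table name
    conn,
    chunksize=100_000,             # yields DataFrames of up to 100k rows each
)

for chunk in chunks:
    # do per-chunk feature engineering / mini-batch preparation here
    print(len(chunk))
```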


As an option you can upload the CSVs to S3 and use Amazon Athena and/or AWS Glue to merge and transform the datasets with SQL queries. Under the hood these are essentially managed, serverless Presto/Spark engines.
The benefits: no need to spin up any clusters, and Glue will infer the data schemas for you automatically.
http://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
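
Roughly, submitting such a merge from Python could look like the sketch below. The bucket, database and table names are made up, and it assumes Glue has already crawled the CSVs into external tables.

```python
# Hedged sketch: submit a join as an Athena SQL query via boto3.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="""
        SELECT t.*, s.city
        FROM train t
        JOIN stores s ON t.store_nbr = s.store_nbr
    """,                                                     # hypothetical tables
    QueryExecutionContext={"Database": "grocery"},           # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])  # results land in S3 as CSV, readable with pandas
```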


This is the key insight :)

I think this is the right place to post this question, since people are discussing storage formats above.

I have been trying to use feather with the grocery forecasting Kaggle competition, and even after I optimize the data types, saving and reading the dataframe in feather format is very, very slow. It was taking 10-15 minutes for each operation on the joined dataframe.

Is there something I am missing? Everything I’ve read says feather is supposed to be extremely fast, and this isn’t that big a dataset to hold in memory.

That’s not related to feather - feather format is simply saving a copy on disk.

Sorry, I meant the df.to_feather() and pd.read_feather() operations are taking that long. It has been faster to read in the original csv files and go through my entire feature engineering process than to just read in the already-processed feather files saved from before.

Oh that’s odd. Maybe check how big those files are? It could just be a slow hard drive.
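
For what it’s worth, a quick way to check is to time the round trip and look at the size on disk directly. This sketch assumes df is the already-joined dataframe:

```python
# Time the feather round trip and check the size on disk.
import os
import time

import pandas as pd

t0 = time.time()
df.to_feather("joined.feather")
print(f"write: {time.time() - t0:.1f}s")

print(f"size on disk: {os.path.getsize('joined.feather') / 1e9:.2f} GB")

t0 = time.time()
df2 = pd.read_feather("joined.feather")
print(f"read: {time.time() - t0:.1f}s")
```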

Admittedly I still need to dig into feather and bcolz more, but for @mark-hoffmann’s case numpy could address some of these issues.

I make heavy use of numpy record arrays at work and am always surprised they aren’t better known. You get named columns, a binary data format, compression and memory mapping from numpy. numpy’s savez and load for .npz files are far faster than reading .csv for large data. If you ever find the need to accelerate your code, record arrays, being a core numpy type, play very nicely with numba (where dataframes do not) and cython - this is a big reason I use them.

You can pass them to a dataframe constructor and get them back with df.to_records().
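
A minimal sketch of that round trip (toy columns, not real competition data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "store": np.array([1, 2, 3], dtype=np.int8),
    "sales": np.array([10.5, 3.2, 7.7], dtype=np.float32),
})

rec = df.to_records(index=False)   # numpy record array with named columns
print(rec.dtype)                   # e.g. [('store', 'i1'), ('sales', '<f4')]

np.savez("data.npz", rec=rec)      # binary save; savez_compressed shrinks it further
rec2 = np.load("data.npz")["rec"]

df2 = pd.DataFrame(rec2)           # back to a dataframe
print(rec2["sales"].mean())        # columns accessed by name; numba/cython friendly
```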

There are caveats since they are basically arrays of structs: they aren’t as nice to work with if you have a lot of text data (I use fixed-width byte strings - you can convert back using byte_string.decode(…), but pandas handles them as objects if I recall correctly). They also aren’t great for resizing, so I really just use them as an efficient data format.

Sorry - this doesn’t speak to the merge issue - curious what others find. Feather and bcolz are both columnar formats, which should be much better in that case.

I believe feather should be quite a bit faster than numpy record arrays.

Feather looks good - being able to target it in many languages is valuable - I’ll put it back in the test loop as we move away from numba.

As far as read, write and data size performance goes, I did some tests on a 1.7 GB data set. Timings and sizes are almost identical for numpy record arrays and feather. Standard ndarrays were 2x slower and wound up 20% larger on disk though. Of course this was just a small test (caveat, caveat…)


Thanks for testing!

Why are you moving away from numba? What project is that for? (I’m a fan of numba, so curious to hear about folks finding it not working for them)

I’ll start by saying I really do like numba - it is awesome when it’s the right tool for the job. Numba is great for speeding up tight loops. I’ve written numba versions of iterative stats routines that give two to three orders of magnitude speedups over the rolling stats in pandas in some cases (note that the speedups are due to the algorithm - the same implementations run at around the same speed - it’s just so easy to do this with numba).
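
Not the actual routines, but the flavour is roughly this: a single O(n) pass with a running sum instead of recomputing each window, jitted with numba.

```python
import numpy as np
from numba import njit

@njit
def rolling_mean(x, window):
    out = np.full(x.shape[0], np.nan)
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
        if i >= window:
            total -= x[i - window]   # drop the value leaving the window
        if i >= window - 1:
            out[i] = total / window
    return out

x = np.random.randn(10_000_000)
print(rolling_mean(x, 30)[:35])      # first 29 entries are NaN, then window means
```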

This stuff is all for trading simulators / back testers. The core of one of our back testers is optimized with numba. In its current state, numba is restrictive - mainly in the limited ability to use objects, strings and python libraries outside of numpy (hence record arrays being so useful to me).

You can use numba jit classes (but you cannot inherit from them) and they can have members that are other jit classes (this post shows how - an undocumented feature AFAIK - https://stackoverflow.com/questions/38682260/how-to-nest-numba-jitclass). You can use lists and arrays, but there are some caveats, and if you need other data structures or would benefit from more OOP it gets really tough. Also be careful with exception handling: the code will compile fine, but it tends to kick you out into object mode (slow mode). I recommend avoiding strings in a jit class (lost a weekend to that) - they are supported in stand-alone functions though.
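
For anyone curious, a minimal jitclass sketch (a hypothetical class, nothing from my actual code base) showing the spec-based typing. The import path below is for recent numba versions; older releases expose jitclass directly from the numba namespace.

```python
import numpy as np
from numba import float64, int64
from numba.experimental import jitclass  # older numba: from numba import jitclass

spec = [
    ("window", int64),
    ("buf", float64[:]),
    ("count", int64),
]

@jitclass(spec)
class RunningSum:
    def __init__(self, window):
        self.window = window
        self.buf = np.zeros(window)
        self.count = 0

    def update(self, value):
        # overwrite the oldest slot and return the sum over the buffer
        self.buf[self.count % self.window] = value
        self.count += 1
        return self.buf.sum()

rs = RunningSum(3)
for v in (1.0, 2.0, 3.0, 4.0):
    print(rs.update(v))
```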

My project would now be better served by cython - it is much better suited to optimizing a wider part of the code base and plays well with lots of python libraries. It is a heavier burden up front, though.

OK that makes sense. Cool to hear some positive stories there. I found I couldn’t get a random forest implementation to run fast with numba - possibly similar reasons. Need cython there too.

To come back to the main topic (big data): I tried many different things to load, transform and preprocess the data from the groceries competition (which has a csv file of ~5 GB). I have a machine with 32 GB of RAM, and even when loading the data with the right types (int8 instead of the default int64) I quickly ran out of RAM, because as soon as you merge and transform the data you create new columns and eventually duplicate dataframes to work on.
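
For reference, the dtype trick looks roughly like this; the column names are from the grocery competition but treat them (and the exact narrow types) as illustrative:

```python
import pandas as pd

dtypes = {
    "store_nbr": "int8",      # small cardinality, fits easily in int8
    "item_nbr": "int32",
    "unit_sales": "float32",
}

train = pd.read_csv("train.csv", dtype=dtypes, parse_dates=["date"])
print(train.memory_usage(deep=True).sum() / 1e9, "GB in memory")
```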

Of course you could just work on a subset of the data, but let’s say you are using RNNs and loading your data in batches: you would find yourself doing the preprocessing (merges, transformations, etc.) each time you extract a batch from the whole data set. This is really inefficient because:

  • You are limited to a few operations (you can forget mean/median operations, for instance)
  • You face new challenges (what if the batch you extracted does not have the right indexes to be merged with the other tables?)
  • You are doing the preprocessing for each batch, which means you lose the optimizations made for preprocessing large amounts of data, and you also preprocess again every time you rerun your neural network.

So basically: doing the preprocessing after extracting batches -> trash.
I need to perform all the operations on all the data.
So I ran into Dask and tried to use it as a replacement for pandas. Big mistake… As they say on their website, Dask is not a replacement for pandas, as it does not have all the capabilities pandas has. While it was very useful for merging the tables, doing the preprocessing with it was a pain in the ass, as it lacks some useful functions pandas has, so you find yourself switching between the two very often and sometimes, well… your memory explodes.
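
The merge step itself was the part Dask handled well. Roughly the kind of thing I mean (not my exact code, file and column names illustrative):

```python
import dask.dataframe as dd

train = dd.read_csv("train.csv", dtype={"store_nbr": "int8", "unit_sales": "float32"})
stores = dd.read_csv("stores.csv")

merged = train.merge(stores, on="store_nbr", how="left")  # lazy, nothing computed yet

# either pull a pandas dataframe into RAM (if it fits)...
# df = merged.compute()
# ...or stream the result straight back to disk:
merged.to_parquet("merged_parquet/")
```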

While @jeremy’s method is valid (get an AWS instance with a lot of RAM, then get the result back onto your machine), I see it as more of a hack than a sustainable solution. The issue with such a hack is that every time you find you transformed your data incorrectly, you’ll spawn a new AWS instance again and pay a little fee (in terms of time and $$) each time you do that.

I think next time I face such issues I’ll just use Spark, as @yinterian advised. I also found this tool (which runs on Spark) to have a very clear and neat API.


Probably the easiest way (as mentioned earlier in the thread) is to just use SQL.


That’s also a good solution, yes :)

I’d also echo the SQL or Spark based approaches others mentioned - they are great general solutions.

Apologies that I didn’t tie the file formats discussion back to the main topic. One reason we cared about it at my shop is that in some pipelines we generate data really quickly and have hard daily deadlines to finish processing by (you need to finish your data work on yesterday’s market data before the market opens today). There were a couple of merge, sort and search steps in the pipeline, and all the canned solutions (SQL and the like) struggled - I/O can become a bottleneck with some of these. In the end, the team wrote its own merging (and related) routines over carefully structured binary data formats - it gave big gains (at least an order of magnitude in speed and data size).

I definitely don’t recommend doing this most of the time - but I thought it might be worth mentioning as an odd edge case.

Don’t worry about that, that was really great info! :)

Interesting, what kind of custom “tool” did they make?

Couldn’t spawning a large Spark cluster have sped the whole thing up?

Their solution (as much as I can really talk about it) had some analogs to batching, and became relatively specific to the workflow (i.e. not easily generalizable), leveraging knowledge of the entire flow to best ensure relative data locality and to optimize for bandwidth at the various levels (network, disk, even cache).

I believe the cost of getting equivalent performance with Spark in the cloud was the deal breaker there. High-performance general solutions are amazing for 99+% of use cases. In specific cases you do have to drop all the generality to get the best solution (in some cases that solution is massively valuable IP - high-frequency trading is a canonical example).
