Hello,
I’m currently participating in the grocery forecasting Kaggle competition, and one of the issues I stumbled upon is “how do you effectively merge very big datasets?”.
This question applies not only to this competition but to all kinds of problems where you face “big data” challenges.
Some context:
With 7 csv files (train.csv, test.csv, transactions.csv, stores.csv, oil.csv, items.csv, holidays_events.csv) I want to obtain 2 resulting files (train.csv and test.csv) which contain all the metadata from the other tables. Here is an overview of the size of each table:
# File sizes
train.csv 4997.45MB
test.csv 126.16MB
transactions.csv 1.55MB
stores.csv 0.0MB
oil.csv 0.02MB
items.csv 0.1MB
holidays_events.csv 0.02MB
As you can see, it will be very difficult to keep the whole train.csv file in memory while doing merge operations on it.
1st attempt
So I started my journey by implementing a “home-made” solution (big mistake): opening all the files except train.csv into pandas DataFrames in memory and merging them all together via the pandas.merge() function.
As for train.csv, I opened it with a chunksize, so basically I was firing up as many processes as I have CPU cores, and each train.csv chunk would get merged with the other tables and then saved as an individual csv file. At the end of the process, when all chunks had been merged with the other tables, I would join the .csv files together to get the resulting merged_train.csv file.
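For reference, here is a minimal sketch of what that chunked approach looked like. The join keys, chunksize, and file names are my own assumptions based on the competition schema, and I only show two of the metadata tables:

```python
import pandas as pd
from multiprocessing import Pool

# The small metadata tables fit in memory; load them once.
stores = pd.read_csv("stores.csv")
items = pd.read_csv("items.csv")

def merge_chunk(args):
    i, chunk = args
    # Assumed join keys from the competition schema.
    chunk = chunk.merge(stores, on="store_nbr", how="left")
    chunk = chunk.merge(items, on="item_nbr", how="left")
    chunk.to_csv(f"merged_chunk_{i:04d}.csv", index=False)

if __name__ == "__main__":
    # Stream train.csv in pieces instead of loading it whole.
    chunks = pd.read_csv("train.csv", chunksize=1_000_000)
    with Pool() as pool:  # one worker per CPU core by default
        # imap consumes the chunk stream lazily, one task at a time
        for _ in pool.imap(merge_chunk, enumerate(chunks)):
            pass
```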
Although this process worked, it was not optimal, as it was creating a lot of .csv files:
- which were eating all my disk space,
- whose format is inefficient for I/O operations,
- and whose final merge (to create merged_train.csv) was very slow and single-threaded.
2nd attempt
So at that point I told myself: “Why reinvent the wheel? Other people must have run into the same kind of issues.” And indeed I discovered many tools. Here are my findings:
Data containers
- HDF5: Looks very neat; it’s basically a filesystem within a file, so you can do all kinds of operations without loading your whole dataset into memory.
- Bcolz: Seems to be faster than HDF5 and the gold standard for fast I/O operations. You can think of it as an “on-disk” numpy.
- Feather: Is fairly new, and very few tools have been developed for it yet.
- SQLite: A portable relational database with a very powerful query language (SQL), but it does not generalize to all kinds of data.
High-level data libraries
- Blaze: A frontend for all kinds of data sources (SQL DBs, NoSQL DBs, csv, numpy arrays, etc.) giving you a “unified” API to handle any kind of data.
- PyTables: Built on top of HDF5 to make the code more “python-ish” and object-oriented.
- Dask: A “pandas-like” tool. It allows you to do a lot of the operations you would do with pandas (not all of them, though), but in contrast it loads the dataframe into memory only when it needs to (for example, when you want to visualize your data or export it). It’s very handy, as with this solution I don’t have to load my train.csv in chunks.
And I’m sure there are many more, but I tried to stick to these as they seem to be the most used ones.
So now back to my issue: “Effectively merging big tables”.
Today what I’m doing is opening my datasets with Dask, doing the merge operations, and then trying to save my merged_train to the bcolz format so I can open it back very fast later when I have to fit the data into my models.
The thing is, Dask does not have a to_blosc/to_bcolz function to do that easily, only to_hdf.
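For illustration, here is roughly what that Dask workflow looks like. This is a minimal sketch: the join keys are my assumptions again, and I fall back to to_hdf since to_bcolz doesn’t exist:

```python
import dask.dataframe as dd

# Everything below is lazy: nothing is read until an output is requested.
train = dd.read_csv("train.csv")
stores = dd.read_csv("stores.csv")
items = dd.read_csv("items.csv")

merged = (train
          .merge(stores, on="store_nbr", how="left")   # assumed key
          .merge(items, on="item_nbr", how="left"))    # assumed key

# The only built-in on-disk binary export is HDF5:
merged.to_hdf("merged_train.h5", "/train")
```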
I saw @jeremy using this format a lot, but I’m not sure how he goes from csv (or multiple csvs) to this format. I mean, you can surely open the csv files as text files and fill in the Bcolz ctable, but isn’t there a more effective way? Btw, I believe you lose your column names by doing this, so it’s harder to keep track of your data transformations.
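The closest thing I found is going through pandas, which does keep the column names. A sketch under my own assumptions about paths and chunk size, not necessarily how @jeremy does it:

```python
import bcolz
import pandas as pd

# Build an on-disk ctable from csv chunks via pandas,
# so the column names come along for free.
reader = pd.read_csv("train.csv", chunksize=1_000_000)
first = next(reader)
ct = bcolz.ctable.fromdataframe(first, rootdir="train.bcolz", mode="w")
for chunk in reader:
    # ctable.append accepts a list of per-column arrays
    ct.append([chunk[name].values for name in ct.names])
ct.flush()

# Reopen later, out of core, with the column names intact:
ct = bcolz.open("train.bcolz")
print(ct.names)
part = ct[:100]  # reads only the sliced rows; ct.todataframe() would load everything
```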
Do you have any tips or recommendations regarding this issue? Maybe I’m all wrong and I should instead use HDF5 and merge my tables with something like h5py? I would also like to find a pattern I can replicate whenever I run into the same issue with any type of “big data” dataset.
Thanks