Merging data from multiple datasets

Ekami · November 8, 2017, 12:00pm

Hello there,
This question is more ML-related than DL-related, but I find this community to be the best place to ask this kind of question.
I’m participating in this Kaggle competition, and my first approach to the challenge was to merge the different datasets together, as you can see in my kernel.
This would result in a 12 GB training dataset, compared to 5 GB for the original training data.
Of course, I don’t plan to work on the entire dataset at first, and I plan to use the new TensorFlow Dataset API to read the resulting train/test CSV files iteratively.
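
For readers who have not read a CSV iteratively before, here is a minimal sketch of the idea. It uses pandas' `chunksize` option instead of the TensorFlow Dataset API mentioned above, and the in-memory CSV and its column names are made up for illustration, not taken from the competition data.

```python
import io

import pandas as pd

# In-memory stand-in for a large merged CSV (hypothetical columns).
csv_text = "item_id,units\n1,5\n2,3\n2,7\n3,1\n1,4\n"

total_units = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk needs to be in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_units += int(chunk["units"].sum())
    n_rows += len(chunk)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path, and the chunk size would be chosen to fit the available memory.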

But my questions are:

Is it good practice to start by merging the tables when you begin a new Kaggle competition, and then do feature engineering later to transform/add/remove features on the “big” dataset?
Is there a much better way to merge/join the tables than using pandas or SQLite? (I hesitated between using pandas and its .merge function, as in my notebook, and putting all the data in SQLite to join it with SQL, because, as you can see, the CSV files look like they were extracted from a relational database.)
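
To make the two options in the second question concrete, here is a small sketch that performs the same left join once with pandas and once with SQLite. The toy tables and column names (`item_id`, `units`, `family`) are assumptions for illustration, not the competition's actual schema.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for two of the CSV files (hypothetical columns).
train = pd.DataFrame({"item_id": [1, 2, 2, 3], "units": [5, 3, 7, 1]})
items = pd.DataFrame({"item_id": [1, 2, 3], "family": ["dairy", "bread", "produce"]})

# Option 1: pandas. A left join keeps every training row; validate raises
# an error if the lookup table unexpectedly contains duplicate keys.
merged_pd = train.merge(items, on="item_id", how="left", validate="many_to_one")

# Option 2: SQLite. Load both tables into a database, then join with SQL.
conn = sqlite3.connect(":memory:")
train.to_sql("train", conn, index=False)
items.to_sql("items", conn, index=False)
merged_sql = pd.read_sql_query(
    "SELECT t.item_id, t.units, i.family "
    "FROM train AS t LEFT JOIN items AS i ON t.item_id = i.item_id",
    conn,
)
conn.close()
```

Both approaches should give the same rows here. The practical difference is that pandas does the join in memory, while an on-disk SQLite database can join tables that are larger than the available RAM.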

Thanks a lot for your help

ar_ai · November 8, 2017, 12:18pm

Some of these questions are addressed in Jeremy’s Machine Learning lesson 3 video. Check that thread. He discusses the exact same competition.

Ekami · November 8, 2017, 1:17pm

Oh that’s perfect then, I’m currently at ML lesson 2. Thanks a lot!!

KevinB · November 9, 2017, 11:13pm

I’m working on the same thing. I really need to catch up on the ML1 videos!

Ekami · November 9, 2017, 11:20pm

Unfortunately, I didn’t find the answer in the lesson 3 video. As @jeremy requested, I moved this conversation to the appropriate thread. I will keep you updated if I find something outside this forum, @KevinB.