Hello there,
This question is more ML related than DL related, but I find this community to be the best place to ask this kind of question.
I’m participating in this Kaggle competition, and my first approach was to merge the different datasets together, as you can see in my kernel.
This results in a train dataset of 12 GB, compared to 5 GB for the original training data.
Of course I don’t plan to work on the entire dataset at first, and I plan to use the new TensorFlow Dataset API to read the resulting train/test csv files iteratively.
But my questions are:
- Is it good practice to start a new Kaggle competition by merging the tables, and only later do some feature engineering to transform/add/remove features on the “big” dataset?
- Is there a much better way to merge/join the tables than using pandas or sqlite? (I hesitated between using the pandas `.merge` function, as in my notebook, and putting all the data in sqlite to join it using SQL, because as you can see the csv files look like they were extracted from a relational database.)
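
To make the two options in the second question concrete, here is a minimal sketch on toy data. The table and column names (`train`, `stores`, `store_id`, `city`, `sales`) are made up for illustration and are not the competition's actual schema. It only shows that both routes give the same joined table, not which one scales better:

```python
import sqlite3

import pandas as pd

# Hypothetical toy tables standing in for two of the csv files.
train = pd.DataFrame(
    {"id": [1, 2, 3], "store_id": [10, 20, 10], "sales": [5.0, 2.0, 7.0]}
)
stores = pd.DataFrame({"store_id": [10, 20], "city": ["A", "B"]})

# Option 1: pandas left join, keeping every training row.
merged_pd = train.merge(stores, on="store_id", how="left")

# Option 2: load both tables into an in-memory SQLite database and join in SQL.
conn = sqlite3.connect(":memory:")
train.to_sql("train", conn, index=False)
stores.to_sql("stores", conn, index=False)
merged_sql = pd.read_sql_query(
    "SELECT t.id, t.store_id, t.sales, s.city "
    "FROM train AS t LEFT JOIN stores AS s ON t.store_id = s.store_id "
    "ORDER BY t.id",
    conn,
)
conn.close()

print(merged_pd)
print(merged_sql)
```

With the real files, the SQLite database would presumably live on disk rather than in memory, so the join would not need to fit in RAM the way a pandas merge does.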
Thanks a lot for your help