Ekami (Tuatini GODARD), November 9, 2017, 11:10pm
In that case I'll also move my question here, since I didn't find an answer to it in lesson 3:
I'm participating in the Favorita Grocery Kaggle competition, and my first approach to this challenge was to start by merging the different datasets together, as you can see in my kernel.
This results in a training dataset of about 12 GB, compared to the original training data, which is about 5 GB.
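To make it concrete, here is roughly what that merging step looks like with pandas; a minimal sketch using the standard Favorita file names and join keys (item_nbr, store_nbr, date), not literally the code from my kernel:

```python
import pandas as pd

# Load the individual tables (file and column names taken from the competition's data page)
train = pd.read_csv('train.csv', parse_dates=['date'])
items = pd.read_csv('items.csv')                                       # keyed by item_nbr
stores = pd.read_csv('stores.csv')                                     # keyed by store_nbr
transactions = pd.read_csv('transactions.csv', parse_dates=['date'])   # keyed by (date, store_nbr)
oil = pd.read_csv('oil.csv', parse_dates=['date'])                     # keyed by date

# Left-join everything onto the training rows so no sales records get dropped
merged = (train
          .merge(items, on='item_nbr', how='left')
          .merge(stores, on='store_nbr', how='left')
          .merge(transactions, on=['date', 'store_nbr'], how='left')
          .merge(oil, on='date', how='left'))

merged.to_csv('train_merged.csv', index=False)
```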
Of course I don't plan to work on the entire dataset at first, and I intend to use the new TensorFlow Dataset API to read the resulting train/test CSVs iteratively.
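This is roughly what I mean by reading the merged CSV iteratively instead of loading it all in memory; just a sketch with a placeholder schema and batch size, written against the tf.data names in current TensorFlow:

```python
import tensorflow as tf

# Placeholder schema: one default value per column of the merged CSV.
# The defaults also fix each column's dtype for decode_csv; the real file
# mixes strings, ints and floats, so this would need the actual column list.
COLUMN_DEFAULTS = [[0.0]] * 10  # assume 10 float columns, for illustration only

def parse_line(line):
    # Decode one CSV row into a list of scalar column tensors
    fields = tf.io.decode_csv(line, record_defaults=COLUMN_DEFAULTS)
    features = tf.stack(fields[:-1])  # every column except the last
    label = fields[-1]                # assume the target is the last column
    return features, label

dataset = (tf.data.TextLineDataset('train_merged.csv')
           .skip(1)          # skip the CSV header row
           .map(parse_line)
           .shuffle(10_000)
           .batch(128)
           .prefetch(1))     # stream batches instead of holding 12 GB in RAM

for features, labels in dataset.take(1):
    print(features.shape, labels.shape)
```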
But my questions are:
Is it good practice to start off by merging the tables when you begin a new Kaggle competition, and then later do some feature engineering to transform/add/remove features on your "big" dataset?
Is there a much better way to merge/join the tables than using pandas or SQLite? (I hesitated between using pandas with the .merge function, as in my notebook, and putting all the data into SQLite to join it with SQL, roughly as sketched below, since the CSV files look like they were extracted from relational databases.)
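For reference, the SQLite route I hesitated over would look something like this; the joined columns (family, city, transactions, dcoilwtico) are just the ones listed on the competition's data page, not necessarily what I'd keep:

```python
import sqlite3
import pandas as pd

# Load each CSV into a table of a local SQLite database
con = sqlite3.connect('favorita.db')
for name in ['train', 'items', 'stores', 'transactions', 'oil']:
    pd.read_csv(f'{name}.csv').to_sql(name, con, if_exists='replace', index=False)

# Let SQLite do the joins, then pull the result back into pandas
query = """
SELECT t.*, i.family, s.city, tr.transactions, o.dcoilwtico
FROM train t
LEFT JOIN items        i  ON t.item_nbr  = i.item_nbr
LEFT JOIN stores       s  ON t.store_nbr = s.store_nbr
LEFT JOIN transactions tr ON t.date = tr.date AND t.store_nbr = tr.store_nbr
LEFT JOIN oil          o  ON t.date = o.date
"""
merged = pd.read_sql_query(query, con)
```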
Thanks a lot for your help
This is more or less related to what we discussed earlier about the curse of dimensionality. Do we really want to merge this metadata into the rest of the training/test sets from the start?