When we do data analysis, we often plot data to see the relationships between variables (bivariate analysis), and when the dataset is very large, the plots become overpopulated and unreadable.
Choosing a sample that represents the whole dataset is also important when we perform hyperparameter tuning: it gives us quick feedback on how our model is performing.
So what are the techniques for creating a good sample for both regression and classification problems?
Please let me know your views and feel free to list out useful resources which can help me understand these techniques.
I suppose it depends on the type of data you have. If it’s not sequential (i.e., not a time series), a random sample would do. For instance, the code below from the ‘translate’ notebook (lesson 11) splits the data into training and validation sets, holding out roughly 10% of the rows (see the first line):
trn_keep = np.random.rand(len(en_ids_tr)) > 0.1              # True for ~90% of rows
en_trn,fr_trn = en_ids_tr[trn_keep],fr_ids_tr[trn_keep]      # training pairs
en_val,fr_val = en_ids_tr[~trn_keep],fr_ids_tr[~trn_keep]    # validation pairs
For time series, I think you’d take the first part of the time-sorted dataset (say 90%) as training data and leave the remainder for testing purposes.
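A minimal sketch of that time-ordered split, using a toy array as a stand-in for a time-sorted dataset (the names and the 90% ratio are just illustrative):

```python
import numpy as np

# Toy stand-in for a time-sorted series; in practice, sort your rows by date first.
y = np.arange(100)

split = int(len(y) * 0.9)            # first 90% of the timeline for training
y_trn, y_val = y[:split], y[split:]  # latest 10% held out for validation
```

The point is that the validation rows come strictly after the training rows in time, so the split mimics predicting the future.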
Hope this helps.
You are talking about strategies for the train/validation split, which is not what I was looking for.
Perhaps you didn’t understand my question; let me try to explain again.
So when dealing with huge datasets, especially structured data, how do you select a sample to work on so you can save time and run models quickly? Having a sample that represents the whole dataset also helps when we do data visualization. For example, in the ML course Jeremy uses a sample of the dataset to work on random forests. This is after the train/validation split has been done.
When you have a huge dataset, suppose you want to do a scatter plot: the plot becomes overpopulated and it’s hard to conclude anything from it. Most of the time, working from a sample removes the clutter and results in a clean plot.
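For the plotting case, one common approach is to scatter-plot a random subset of rows rather than the full table. A sketch with a hypothetical DataFrame (the column names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for a large structured dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=100_000),
                   "y": rng.normal(size=100_000)})

# Plot a random subset instead of every row so the scatter stays readable.
sample = df.sample(n=5_000, random_state=42)
# sample.plot.scatter(x="x", y="y")  # uncomment to draw the plot
```

For a uniform random sample and a large enough `n`, the shape of the point cloud in the subsample should look very much like the full data’s.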
My question is about how to choose a sample from the training data that represents the whole dataset, for both regression and classification problems, in order to do quick model tuning and also data visualization.
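One standard technique for this is stratified sampling: for classification, sample the same fraction within each class so the label distribution is preserved; for regression, first bin the continuous target into quantiles and then sample within each bin. A pandas-only sketch with made-up column names and data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Classification: sample the same fraction within each class so the
# sample's label proportions match the full data's.
df = pd.DataFrame({
    "feat": rng.normal(size=n),
    "label": rng.choice(["a", "b", "c"], size=n, p=[0.7, 0.2, 0.1]),
})
clf_sample = df.groupby("label", group_keys=False).sample(frac=0.1, random_state=0)

# Regression: bin the continuous target into quantiles, then sample
# within each bin to preserve the target's distribution.
df_reg = pd.DataFrame({"feat": rng.normal(size=n), "target": rng.normal(size=n)})
df_reg["bin"] = pd.qcut(df_reg["target"], q=10, labels=False)
reg_sample = (df_reg.groupby("bin", group_keys=False)
                    .sample(frac=0.1, random_state=0)
                    .drop(columns="bin"))
```

The same idea works with `sklearn.model_selection.train_test_split(..., stratify=...)` if you prefer scikit-learn; the quantile-binning trick is how people usually adapt stratification to regression targets.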
I think the only way is to take a random sample (that’s why I gave that example: it showed the code to do it, even though the code is framed as splitting data into training and testing parts). However, I don’t think that would necessarily work for time-series data.