How to choose sample data which represents the whole dataset?

vineeth_k · June 22, 2018, 12:29pm

When we do data analysis, we often plot variables against each other to see the relationships between them (bivariate analysis). When the dataset is very large, the plots become overcrowded and unreadable.

Choosing a sample that represents the whole dataset is also important when we perform hyperparameter tuning. It gives us quick feedback on how our model is performing.

So what are the techniques to create a good sample for both regression problems and classification problems?

Please let me know your views and feel free to list out useful resources which can help me understand these techniques.

dannyp · June 22, 2018, 1:36pm

Hi Vineeth,

I suppose it depends on the type of data you have. If it’s not sequential (e.g., not a time series), a random sample would do. For instance, the code below from the ‘translate’ notebook (lesson 11) splits the data into training and validation sets by sending each row to the validation set with a probability of about 10% (see the second line):

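import numpy as np
np.random.seed(42)  # fixed seed so the split is reproducible
# Boolean mask: True (keep for training) for roughly 90% of the rows.
# en_ids_tr and fr_ids_tr must be NumPy arrays for mask indexing to work.
trn_keep = np.random.rand(len(en_ids_tr)) > 0.1
en_trn,fr_trn = en_ids_tr[trn_keep],fr_ids_tr[trn_keep]
en_val,fr_val = en_ids_tr[~trn_keep],fr_ids_tr[~trn_keep]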

For time series, I think you’d take the first part of the time-sorted dataset (say 90%) as the training data and leave the remainder for testing.
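
A minimal sketch of that time-ordered split, assuming the timestamps are available as a NumPy array. The helper name `chronological_split` and the toy timestamps are made up for illustration; they are not from the notebook:

```python
import numpy as np

def chronological_split(timestamps, train_frac=0.9):
    """Return (train_idx, test_idx); the earliest rows go to training."""
    # Row indices ordered from oldest to newest timestamp.
    order = np.argsort(timestamps, kind="stable")
    # Position in the sorted order where the training part ends.
    cut = int(len(order) * train_frac)
    return order[:cut], order[cut:]

# Toy example: 10 rows with shuffled (hypothetical) timestamps.
ts = np.array([5, 1, 4, 2, 3, 9, 7, 8, 6, 0])
train_idx, test_idx = chronological_split(ts)
```

Every training row is then older than every test row, so the model is never evaluated on data that precedes what it was trained on.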

Hope this helps.

Dan

vineeth_k · June 22, 2018, 4:25pm

Danny,

You are talking about strategies for the train and validation split, which is not what I was looking for.
Perhaps you didn’t understand my question, so let me try to explain it again.

So when dealing with huge datasets, especially structured data, how do you select a sample that you can work on to save time and run models quickly? Having a sample that represents the whole dataset also helps with data visualization. For example, in the ML course Jeremy uses a sample of the dataset to work on random forests. This is after you’ve done the train and validation split.

When you have a huge dataset and want to do, say, a scatter plot, the plot becomes overcrowded and it is hard to conclude anything from it. Most of the time a sample will help here by removing the clutter and producing a clean plot.

My question is about how to choose a sample from the training data that represents the whole dataset, for both regression and classification problems, so that we can do quick model tuning and also data visualization.

dannyp · June 25, 2018, 10:56am

Hi Vineeth,

I think the only way is to take a random sample. That’s why I gave the example: it shows the code to do that, although I understand that the code is about splitting the data into training and validation parts. However, I think that wouldn’t necessarily work for time-series data.
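
Applied to the training data only, a random sample could look roughly like the sketch below. The arrays `X_train` and `y_train` are synthetic stand-ins and `random_subsample` is a made-up helper name, so treat this as an illustration under those assumptions rather than a recipe:

```python
import numpy as np

def random_subsample(n_rows, sample_size, seed=42):
    """Pick sample_size distinct row indices uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_rows, size=sample_size, replace=False)

# Synthetic stand-in for a large training set: 100,000 rows, binary labels.
data_rng = np.random.default_rng(0)
X_train = data_rng.normal(size=(100_000, 3))
y_train = (data_rng.random(100_000) < 0.3).astype(int)

# Keep 5,000 rows for quick plotting and model tuning.
idx = random_subsample(len(X_train), sample_size=5_000)
X_small, y_small = X_train[idx], y_train[idx]

# Rough check that the sample resembles the full data:
# compare the positive-class rate (for regression, compare the
# target mean or a few quantiles instead).
full_rate = y_train.mean()
sample_rate = y_small.mean()
```

If the two rates (or, for regression, the target summaries) are far apart, a larger sample or a different seed is worth trying.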

HTH,

Dan