Randomly sampling data - Random forests - Lesson 2


I am confused about the following piece of code in lesson 2:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

In the above piece of code, to speed things up, we are randomly selecting a subset of 30,000 rows. Earlier in the lecture, the Professor states that when a dataset has a time-series element in it, we need to make sure that the training, validation, and test sets cover different time periods.

So my question is, how can we maintain different time ranges for each of the training, validation and test sets when the data is sampled randomly?
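For context, a time-aware split just means sorting by the date column and holding out the most recent rows as validation, rather than sampling at random. A minimal sketch with a toy DataFrame (the column names mirror the bulldozers data, but the values are made up):

```python
import pandas as pd

# Toy stand-in for the bulldozers data; 'saledate' is the time column
df = pd.DataFrame({
    "saledate": pd.date_range("2010-01-01", periods=10, freq="D"),
    "SalePrice": range(10),
})

# Sort by date, then hold out the most recent rows as validation
df = df.sort_values("saledate")
n_valid = 3
train, valid = df[:-n_valid], df[-n_valid:]

# Every validation date is on or after every training date
assert valid["saledate"].min() >= train["saledate"].max()
```

Random sampling, by contrast, would mix dates freely across the two sets, which is exactly the tension being asked about here.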

Kindly clarify.

Kiran Hegde

Earlier in the code, he creates his validation set by splitting the entire dataset into two; the validation set consisted of the last 12,000 data points.

Over here, he uses a random subset of 30,000 rows from the entire dataset and then splits it into X_train and _ (data that is just thrown away). He trains the model with X_train and y_train, which consist of the first 20,000 rows of the subset, and then validates it with the 12,000 rows he earlier set aside as X_valid and y_valid.

Even if the 30,000-row subset chosen for df_trn overlaps with the validation set, the split_vals call that keeps only the first 20,000 rows will ensure that the majority of the validation data is not used for training.
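For reference, split_vals in the notebook is essentially just a first-n split, which you can sketch like this:

```python
import numpy as np

def split_vals(a, n):
    # First n rows go to the first piece, the rest to the second
    return a[:n].copy(), a[n:].copy()

a = np.arange(30000)          # stand-in for the 30,000-row subset
X_train, rest = split_vals(a, 20000)
# X_train holds the first 20,000 rows; the remaining 10,000
# are what gets discarded with `_` in the lesson code
```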

Hope that helps :slight_smile:

Hello @Buddhi

Thanks for responding. My question really is: do we actually want to randomly sample data in practice, as the Professor has shown here? Is the point of randomly sampling a 30,000-row subset just to test our model quickly, save time while setting our parameters, and then, once the parameters are chosen, extend this to the entire dataset?
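That iterate-on-a-subsample, then retrain-on-everything workflow can be sketched generically with scikit-learn (the hyperparameter values here are arbitrary placeholders, not the lesson's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the full training set
X, y = make_regression(n_samples=5000, n_features=10, random_state=0)

# Step 1: iterate quickly on a random subsample while tuning hyperparameters
idx = np.random.default_rng(0).choice(len(X), 1000, replace=False)
quick = RandomForestRegressor(n_estimators=20, min_samples_leaf=3,
                              n_jobs=-1, random_state=0)
quick.fit(X[idx], y[idx])

# Step 2: once the parameters look reasonable, retrain on the full set
final = RandomForestRegressor(n_estimators=20, min_samples_leaf=3,
                              n_jobs=-1, random_state=0)
final.fit(X, y)
```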

Please let me know.

Kiran Hegde

Hi Kiran,

I share your confusion. Also, I don’t understand the points about time series in the bulldozer dataset. It is not sorted by saledate, so the initial split into test and validation (last 12,000 rows) does not take any temporal component into consideration. Maybe I’m missing something and someone can explain?


Will T.

My understanding is that the example is used to show off one of the features of the fastai library. It cannot be used on time-series data, especially if you think the date has a significant impact on the model.

Sorry to resurrect an old thread.

I share the confusion here too. Does this mean we take a random subset of a big dataset to represent the whole data so we can train faster? Since it's random, it roughly represents the whole data, just with fewer rows. If we tune our hyperparameters on that subset, the model should work well when given the whole dataset. Is my understanding correct?

Hi, I share the same confusion. Basically, in lecture 2, @jeremy discussed that by default each tree is trained on data randomly sampled with replacement, and the sample size is equal to the number of rows in the training set. That way, on average, each tree sees about 63% of the distinct rows in the training set.
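The ~63% figure comes from the bootstrap: drawing n rows with replacement from n rows leaves any given row out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n, so about 63.2% of distinct rows appear in each sample. A quick simulation confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One bootstrap sample: n draws with replacement from n rows
sample = rng.integers(0, n, size=n)

# Fraction of distinct rows that appear at least once
frac_seen = len(np.unique(sample)) / n
print(f"{frac_seen:.3f}")  # close to 1 - 1/e ≈ 0.632
```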

By using set_rf_samples, we force each tree to look at a different, smaller set of samples every time, resulting in less correlated individual trees. Each tree overfits on its particular sample of the data, but the forest generalizes well on the overall dataset (predictions are averaged).
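As an aside, fastai's set_rf_samples works by patching scikit-learn's internal sampling; in modern scikit-learn (0.22+) the same idea is exposed directly via the max_samples parameter. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5000, n_features=10, random_state=0)

# Each tree trains on its own random draw of 1,000 rows (with replacement),
# rather than a bootstrap the size of the full training set -- the same
# idea as fastai's set_rf_samples(m, 1000)
m = RandomForestRegressor(n_estimators=40, bootstrap=True,
                          max_samples=1000, n_jobs=-1, random_state=0)
m.fit(X, y)
```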

In lesson 4, however, I noticed that once the feature importance was gauged, the sampling was reset with reset_rf_samples and the model was eventually trained on full-size bootstrap samples.

Does that mean we use sub-sampling only for getting an overall sense of the features, and not for the final training of the model?