At 30:35 of lesson 2, Jeremy gets a random sample of 30k rows. He then says the validation set should not change, and that the training set should not overlap with the dates (not sure which dates he is referring to).
The original validation set is made up of the last 12k rows. Since proc_df is run on a random subset of 30k rows, isn't it possible that some of the new, smaller training data consists of rows from the validation set? Furthermore, I would think that the smaller training set is no longer necessarily ordered by date, since the rows were picked at random.
Edit: I checked out the source code, and get_sample returns the data in sorted order, so that addresses the ordering question. I still think it's possible that the training data could overlap with the original validation set.
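
To make the overlap concern concrete, here is a rough sketch using plain NumPy row positions instead of the actual fastai helpers. The total row count (400,000) and the variable names are my own assumptions for illustration, not values from the lesson. It draws 30k random positions, sorts them the way get_sample does, and counts how many land in the last 12k rows.

```python
import numpy as np

n_rows = 400_000    # hypothetical total row count (assumption, not from the lesson)
n_valid = 12_000    # the validation set is the last 12k rows
n_sample = 30_000   # size of the random subset

rng = np.random.default_rng(0)

# Draw 30k distinct row positions at random, then sort them,
# mirroring the sorted order that get_sample returns.
sample_idx = np.sort(rng.choice(n_rows, size=n_sample, replace=False))

# Any sampled position at or beyond this point lies in the validation range.
valid_start = n_rows - n_valid
overlap = sample_idx[sample_idx >= valid_start]
print(f"sampled rows inside the validation range: {len(overlap)}")

# One way to rule out overlap: sample only from rows before the validation range.
safe_idx = np.sort(rng.choice(valid_start, size=n_sample, replace=False))
print(f"largest safe position: {safe_idx.max()} (validation starts at {valid_start})")
```

Because the sample is sorted, any overlapping rows sit at the very end of it, so whether they actually reach the training set depends on how the 30k subset is split afterwards.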