Why use a continuous chunk of data as the validation set for tabular data

PegasusWithoutWinds · November 15, 2018, 5:37am

Here is how Jeremy constructed the data-bunch for the lesson4-tabular notebook.

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test, label=0)
                           .databunch())

where the line

.split_by_idx(list(range(800,1000)))

sets aside the rows at indices 800 through 999 (200 rows, since range excludes its upper bound) as the validation set.
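To make the index arithmetic concrete, here is a minimal plain-Python sketch of what a contiguous index split amounts to. It does not use fastai; `n_rows` is a hypothetical row count chosen only for illustration, and the list comprehension is an assumption about the effect of the split, not the library's implementation.

```python
# Hypothetical number of rows; the real DataFrame may be a different size.
n_rows = 1200

# Same index list as in the notebook: 800, 801, ..., 999 (1000 is excluded).
valid_idx = list(range(800, 1000))

# Every row that is not in the validation list is left for training.
valid_set = set(valid_idx)
train_idx = [i for i in range(n_rows) if i not in valid_set]

print("first/last validation index:", valid_idx[0], valid_idx[-1])
print("validation rows:", len(valid_idx))
print("training rows:", len(train_idx))
```

Note that the training rows here come from both before and after the held-out block, which matters if row order carries meaning.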

Jeremy provided a brief explanation in this part of the lecture.

It is very common to keep your validation set as a continuous group of things. Like, if they are map tiles, they should be map tiles that are next to each other; if they are time periods, they should be days that are next to each other; if they are video frames, they should be video frames next to each other. Cause otherwise, you are cheating. So it is often a good idea to use split_by_idx to grab a range that is next to each other if your data has a structure like that.

I have two questions about this. First, why is it cheating to use discontinuous data as the validation set when the dataset has some continuity in its structure? Second, even if we accept the previous statement, how does our ADULT dataset have any such structure? It is not really ordered and does not seem to have any kind of continuity.

I would really appreciate it if anyone could shed some light on these questions. Thank you!

utkb · March 20, 2019, 5:36pm

Hi George,

I only just came across this while searching for topics related to the tabular learner. You’ve probably figured these out by now, but I think your two questions are very nicely answered in Rachel’s post here. Your Q1 is illustrated in the “Time Series” part of her post, which covers how and why it is “cheating”. As for your Q2, I haven’t looked into the ADULT dataset in much detail, but I figure the notebook splits the data that way for one of two reasons: either it simply illustrates a quick and easy way to split data (without considering its continuity or lack thereof), or it accounts for the type of “new” data mentioned in Rachel’s post (e.g. the last chunk of ADULT data might be for “new” individuals who had not shown up in the earlier data points).
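For anyone who wants to see the “cheating” effect in numbers, below is a toy sketch of the time-series case. It is not taken from Rachel’s post or the notebook: the `series` function is a made-up trend-plus-wave signal, and the “model” is a deliberately naive nearest-neighbour lookup that predicts each held-out time step from the closest training time step. With a random split, nearly every validation point has a training neighbour right next to it, so the score looks far better than it would on genuinely future data.

```python
import math
import random

def series(t):
    # Hypothetical time-ordered target: a slow upward trend plus a gentle wave.
    return 0.05 * t + math.sin(t / 10)

def nearest_neighbour_mae(train_idx, valid_idx):
    # Naive "model": predict each validation point with the value at the
    # closest training time step, then average the absolute errors.
    total = 0.0
    for t in valid_idx:
        nearest = min(train_idx, key=lambda s: abs(s - t))
        total += abs(series(nearest) - series(t))
    return total / len(valid_idx)

n = 1000
random.seed(0)

# Random split: validation rows are scattered among the training rows.
random_valid = sorted(random.sample(range(n), 200))
random_valid_set = set(random_valid)
random_train = [t for t in range(n) if t not in random_valid_set]

# Contiguous split: the last 200 time steps are held out as a block.
contig_valid = list(range(800, 1000))
contig_train = list(range(800))

random_mae = nearest_neighbour_mae(random_train, random_valid)
contig_mae = nearest_neighbour_mae(contig_train, contig_valid)

print(f"random split MAE:     {random_mae:.3f}")
print(f"contiguous split MAE: {contig_mae:.3f}")
```

The random split should report a much smaller error, even though the model has learned nothing about how the series evolves. The contiguous split gives the more honest estimate of how it would do on data it has not seen yet.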

Hopefully I didn’t misunderstand (or mis-explain) these too badly! Thanks.

Yijin

PegasusWithoutWinds · March 21, 2019, 1:07am

Hey @utkb, thanks a lot for your reply! I had almost forgotten about this post. Rachel’s post, especially the picture she attached there, is very illustrative!