Tabular: validation set percentage

Hi,
Going through the lesson 4 tabular part, I see that the following lines define 200 rows (rows 800 to 1000) as the validation set:

test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

The adult.csv file has 32561 rows (+ header) so 200 lines as the validation set is around 0.6% of the data. Is that a meaningful enough percentage? In previous lessons, the validation set is typically 10%-20% of our data. Is the logic different for tabular data sets?

Thanks a lot

To my knowledge it was just an example, to show split_by_idx() and how to add a test set. I always do a 70/20/10 split so I get a decent representation of the data.

Thanks :slight_smile:
Ummm… sorry for the newbie question, but I was thinking of selecting 3000 rows (e.g. 1000:4000) in order to have roughly a 90/10 (training/validation) split.
Can you provide an example on how you create 70/20/10 split and use it? (so you are creating also a test set?)

Thanks again

Sure :slight_smile: so essentially I’d create two dataframes, where one is the first 90% of the original and the rest is my test set.

idx = int(len(df) * .9)   # row index at the 90% mark
train = df.iloc[:idx]     # first 90% of the rows
test = df.iloc[idx:]      # last 10% of the rows

Then do a split_by_rand_pct() on the train, and pass the test in as a separate dataframe (see the sketch just below). Let me know if this is confusing; this is how I make labeled test sets, but you can also do the same for unlabeled ones. I can show some more code if needed :slight_smile:
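
Put together with the split above, it would look roughly like this (an untested sketch, reusing the path, cat_names, cont_names, procs and dep_var from the lesson code earlier in the thread; test_tl is just an illustrative name):

test_tl = TabularList.from_df(test.copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(train, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                   .split_by_rand_pct(0.2)        # a random 20% of the training rows become the validation set
                   .label_from_df(cols=dep_var)
                   .add_test(test_tl)
                   .databunch())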

You could also keep the same format as above: in your split_by_idx, work out which index sits at the 70% mark of your data, then how many rows 10% of your data is, and pass those in as the idxs.

I think I get it (but more code examples are always welcome :slight_smile: )
I was thinking something like the following (less elegant than your approach :wink: )


The data set has roughly 30K rows. I think I want to do the following:

  • 70% train
  • 20% validation
  • 10% test

So I am defining a test set with the last 3000 rows: [-3000:-1]
And defining the validation set by doing a split_by_idx with a range of roughly 6000 rows, e.g. [1000:7000]


test = TabularList.from_df(df.iloc[-3000:-1].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(1000,7000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

Hope this makes sense!

Sure! One question for you: passing in a negative index, what does that do in pandas?

Otherwise, here is how I would set it up, but I believe what you did does the same thing, following the same split_by methodology:

start = int(len(df) * .7)        # index at the 70% mark
end = int(len(df) * .1) + start  # 10% of the rows after that

test = TabularList.from_df(df.iloc[start:end].copy(), path=...)
data = (TabularList.from_df(df, path=...)
             .split_by_idx(list(range(start, end)))
             .label_from_df(cols=dep_var)
             .add_test(test)
             .databunch())

Just indexing the dataframe: -1 means the last element, -10 means the 10th element starting from the end.

For example, in a dataframe:
df[3000:-1000] will select rows from row 3000 up to (but not including) the row that is 1000 from the end.

Of course, with .iloc you can also use a second range to select columns ( [x:y, n:p] ), and columns can likewise use negative indexes to count from the end.
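
For example (a small illustrative snippet with a toy dataframe):

import pandas as pd

toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

toy.iloc[-3:]       # last 3 rows
toy.iloc[3:-2]      # rows 3 up to (but not including) the 2nd row from the end
toy.iloc[:5, -1:]   # first 5 rows, last column only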

Ah that makes sense. Thank you for the explanation :slight_smile: Let me know if the above is confusing at all, I can show my other method which uses functionality from sklearn (which is what I normally use in my research)

Oh yes, please show me. I am learning/improving this code every hour, lol, you can tell I am starting with all this.

Sure! So here’s how I go about it. I keep this function in my private library of functions since I use it so often.

import pandas as pd
from sklearn.model_selection import train_test_split

def SplitSet(df):
    # Hold out 10% of the rows as the test set
    train, test = train_test_split(df, test_size=0.1)
    # Split the remaining rows into train/validation (20% of them become validation)
    train, valid = train_test_split(train, test_size=0.2)
    split_val = len(train)              # index where the validation rows will start
    train = pd.concat([train, valid])   # put the validation rows after the training rows
    return train, test, split_val

So essentially what this sets you up to do is something like this:

df = pd.read_csv(...)
traindf, testdf, idx = SplitSet(df)

test = TabularList.from_df(testdf, path=...)

data = (TabularList.from_df(traindf, path=...)
            .split_by_idx(list(range(idx, len(traindf))))
            .label_from_df(cols=dep_var)
            .add_test(test)
            .databunch())

nice!
Can you also share the train_test_split code?

Thanks

Ah my bad! See the edit above. It is taken from the sklearn library :slight_smile:

Hi again @muellerzr (and all!)
When you define the “data” variable you are defining the training set and the validation set (by splitting), and passing the test set in as a separate df. That’s understood.

Then I train and I get something like:

epoch  train_loss  valid_loss  accuracy  time
0      0.322909    0.357733    0.835833  00:03
1      0.327918    0.359473    0.834833  00:03
2      0.335293    0.361857    0.832667  00:03

My understanding is that the accuracy is based on the validation set… so, how is the test set being used in this case? (Since the whole original dataset is labeled, the test set is basically a second validation set…)

I hope this question makes sense :wink:
Thanks!

The test set is not used until you are finished training and you only want to evaluate how you are doing. You do it at the very end. Does that make sense? :slight_smile: I’ll also say, the examples above only have unlabeled test sets. You need to do things differently if you want to grade the test set, instead of just getting predictions.

Yes, I was looking at:

learn.get_preds(ds_type=DatasetType.Test)
[tensor([[0.5895, 0.4105],
         [0.9798, 0.0202],
         [0.7655, 0.2345],
         ...,
         [0.9958, 0.0042],
         [0.5941, 0.4059],
         [0.7750, 0.2250]]), tensor([0, 0, 0,  ..., 0, 0, 0])]

So I could write some code to parse the test dataset, map the categories to 0’s and 1’s, and compare them against the predictions (a rough idea of what I mean is sketched below)…
I can see the use if this were a Kaggle challenge where you had no labels in the test set and needed to submit results.
But in my case I have a single dataset with everything labeled… not sure if I should just forget about the test set, have bigger train/validation sets, and look at the accuracy after training.
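
Something like this rough sketch is what I had in mind (untested; it assumes data.classes gives the label order the learner uses, and that dep_var and df are the same as above):

import torch
from fastai.basic_data import DatasetType

preds, _ = learn.get_preds(ds_type=DatasetType.Test)   # class probabilities for each test row
pred_class = preds.argmax(dim=1)                        # predicted class index (0 or 1)
# Map the true labels of the test rows to the same 0/1 indices (data.classes order is an assumption here)
true_class = torch.tensor([data.classes.index(v) for v in df.iloc[-3000:-1][dep_var]])
test_accuracy = (pred_class == true_class).float().mean()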

Or maybe I am missing the smart way to use the test set

No, absolutely not! The test set is extremely important. Give me a moment and I’ll give you some code on how to do the evaluation. Essentially we label the test TabularList and turn the valid_dl into that one.

See my post here:

Thanks a lot one more time. Will look at that post and will be waiting for that code! :slight_smile:

The code is in the post :wink:

Lol. Then I think I missed something. Will ask baby questions first. This is the code (similar to before):

test = TabularList.from_df(df.iloc[-3000:-1].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df[0:-3001], path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(1000,7000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

Why am I adding a test set (.add_test) to “data”?
Shouldn’t I be able to do something more direct, like learn.validate(learn.data.test_dl)?

Sorry, I meant the code here:

We cannot directly pass a data loader to validate; what will happen is it’ll run the default validation set no matter what. So what I describe in the post linked above is how to override that with a labeled test set so we can use it properly. Roughly, the idea looks like the sketch below.
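
This is an untested sketch of the idea rather than the exact code from that post; it assumes the same df, path, cat_names, cont_names, procs and dep_var as above, and that the last 3000 rows are the held-out test rows:

# Build a second DataBunch whose *validation* split is the labeled test rows
data_test = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                        .split_by_idx(list(range(len(df) - 3000, len(df))))
                        .label_from_df(cols=dep_var)
                        .databunch())

# Point the learner's validation DataLoader at the labeled test rows; validate() then scores them
learn.data.valid_dl = data_test.valid_dl
learn.validate()

Note that building it this way refits the procs on df itself, so the preprocessing statistics may differ slightly from training; see the linked post for the exact approach.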