Tabular: validation set percentage

Hi,
Going through the lesson 4 tabular part, I see that the following lines define 200 rows (rows 800 to 1000) as the validation set:

test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(800,1000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

The adult.csv file has 32561 rows (+ header) so 200 lines as the validation set is around 0.6% of the data. Is that a meaningful enough percentage? In previous lessons, the validation set is typically 10%-20% of our data. Is the logic different for tabular data sets?

Thanks a lot

To my knowledge it was just an example, to show split_by_idx() and how to add a test set. I always do a 70/20/10 split so I get a decent representation of the data.

Thanks :slight_smile:
Ummm… sorry for the newbie question, but I was thinking of selecting 3000 rows (e.g. 1000:4000) in order to have roughly a 90/10 (training/validation) split.
Can you provide an example on how you create 70/20/10 split and use it? (so you are creating also a test set?)

Thanks again

Sure :slight_smile: so essentially I’d create two dataframes, where one is the first 90% of the original and the rest is my test set.

idx = int(len(df) * .9)   # row index at the 90% mark
train = df.iloc[:idx]     # first 90% of the rows
test = df.iloc[idx:]      # last 10% of the rows

Then do a split_by_rand_pct() on the train, and pass the test in as a separate dataframe (see the sketch just below). Let me know if this is confusing; this is how I make labeled test sets, but you can also do the same for unlabeled ones. I can show some more code if needed :slight_smile:
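
Put together with the split above, it would look roughly like this (an untested sketch, reusing the path, cat_names, cont_names, procs and dep_var from the lesson code earlier in the thread; test_tl is just an illustrative name):

test_tl = TabularList.from_df(test.copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(train, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                   .split_by_rand_pct(0.2)        # a random 20% of the training rows become the validation set
                   .label_from_df(cols=dep_var)
                   .add_test(test_tl)
                   .databunch())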

You could also keep the same format as above: in your split_by_idx, work out which index sits at the 70% mark of your data, then how many rows 10% of your data is, and pass those in as the idxs.

I think I get it (but more code examples are always welcome :slight_smile: )
I was thinking something like the following (less elegant than your approach :wink: )


The data set has roughly 30K rows. I think I want to do the following:

  • 70% train
  • 20% validation
  • 10% test

So I am defining a test set with the last 3000 rows: [-3000:-1]
And defining the validation set by doing a split_by_idx with a range of roughly 6000 rows, e.g. [1000:7000]


test = TabularList.from_df(df.iloc[-3000:-1].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(1000,7000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

Hope this makes sense!

Sure! One question for you: passing in a negative index, what does that do in pandas?

Otherwise, here is how I would set it up, but I believe what you did does the same thing, following the same split_by methodology:

start = int(len(df) * .7)        # index at the 70% mark
end = int(len(df) * .1) + start  # 10% of the rows after that

test = TabularList.from_df(df.iloc[start:end].copy(), path=...)
data = (TabularList.from_df(df, path=...)
             .split_by_idx(list(range(start, end)))
             .label_from_df(cols=dep_var)
             .add_test(test)
             .databunch())

Just indexing the dataframe: -1 means the last element, -10 means the 10th element starting from the end.

For example, in a dataframe:
df[3000:-1000] will select rows from row 3000 up to (but not including) the row that is 1000 from the end.

Of course, with .iloc you can also use a second range to select columns ( [x:y, n:p] ), and columns can likewise use negative indexes to count from the end.
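
For example (a small illustrative snippet with a toy dataframe):

import pandas as pd

toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

toy.iloc[-3:]       # last 3 rows
toy.iloc[3:-2]      # rows 3 up to (but not including) the 2nd row from the end
toy.iloc[:5, -1:]   # first 5 rows, last column only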

Ah that makes sense. Thank you for the explanation :slight_smile: Let me know if the above is confusing at all, I can show my other method which uses functionality from sklearn (which is what I normally use in my research)

Oh yes, please show me. I am learning/improving this code every hour, lol, you can tell I am starting with all this.

Sure! So here’s how I go about it. I keep this function in my private library of functions since I use it so often.

import pandas as pd
from sklearn.model_selection import train_test_split

def SplitSet(df):
    # Hold out 10% of the rows as the test set
    train, test = train_test_split(df, test_size=0.1)
    # Split the remaining rows into train/validation (20% of them become validation)
    train, valid = train_test_split(train, test_size=0.2)
    split_val = len(train)              # index where the validation rows will start
    train = pd.concat([train, valid])   # put the validation rows after the training rows
    return train, test, split_val

So essentially what this sets you up to do is something like this:

df = pd.read_csv(...)
traindf, testdf, idx = SplitSet(df)

test = TabularList.from_df(testdf, path=...)

data = (TabularList.from_df(traindf, path=...)
            .split_by_idx(list(range(idx, len(traindf))))
            .label_from_df(cols=dep_var)
            .add_test(test)
            .databunch())

nice!
Can you also share the train_test_split code?

Thanks

Ah my bad! See the edit above. It is taken from the sklearn library :slight_smile:

Hi again @muellerzr (and all!)
When you define the “data” variable you are defining the training set and the validation set (by splitting), and passing the test set in as a separate df. That’s understood.

Then I train and I get something like:

epoch  train_loss  valid_loss  accuracy  time
0      0.322909    0.357733    0.835833  00:03
1      0.327918    0.359473    0.834833  00:03
2      0.335293    0.361857    0.832667  00:03

My understanding is that the accuracy is based on the validation set… so, how is the test set being used in this case? (Since the whole original dataset is labeled, the test set is basically a second validation set…)

I hope this question makes sense :wink:
Thanks!

The test set is not used until you are finished training and you only want to evaluate how you are doing. You do it at the very end. Does that make sense? :slight_smile: I’ll also say, the examples above only have unlabeled test sets. You need to do things differently if you want to grade the test set, instead of just getting predictions.

Yes, I was looking at:

learn.get_preds(ds_type=DatasetType.Test)
[tensor([[0.5895, 0.4105],
         [0.9798, 0.0202],
         [0.7655, 0.2345],
         ...,
         [0.9958, 0.0042],
         [0.5941, 0.4059],
         [0.7750, 0.2250]]), tensor([0, 0, 0,  ..., 0, 0, 0])]

So I could write some code to parse the test dataset, map the categories to 0’s and 1’s, and compare them against the predictions (a rough idea of what I mean is sketched below)…
I can see the use if this were a Kaggle challenge where you had no labels in the test set and needed to submit results.
But in my case I have a single dataset with everything labeled… not sure if I should just forget about the test set, have bigger train/validation sets, and look at the accuracy after training.
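
Something like this rough sketch is what I had in mind (untested; it assumes data.classes gives the label order the learner uses, and that dep_var and df are the same as above):

import torch
from fastai.basic_data import DatasetType

preds, _ = learn.get_preds(ds_type=DatasetType.Test)   # class probabilities for each test row
pred_class = preds.argmax(dim=1)                        # predicted class index (0 or 1)
# Map the true labels of the test rows to the same 0/1 indices (data.classes order is an assumption here)
true_class = torch.tensor([data.classes.index(v) for v in df.iloc[-3000:-1][dep_var]])
test_accuracy = (pred_class == true_class).float().mean()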

Or maybe I am missing the smart way to use the test set

No, absolutely not! The test set is extremely important. Give me a moment and I’ll give you some code on how to do the evaluation. Essentially we label the test TabularList and turn the valid_dl into that one.

See my post here:

Thanks a lot one more time. Will look at that post and will be waiting for that code! :slight_smile:

The code is in the post :wink:

Lol. Then I think I missed something. Will ask baby questions first. This is the code (similar to before):

test = TabularList.from_df(df.iloc[-3000:-1].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df[0:-3001], path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(1000,7000)))
                           .label_from_df(cols=dep_var)
                           .add_test(test)
                           .databunch())

Why am I adding a test set (.add_test) to “data”?
Shouldn’t I be able to do something more direct, like learn.validate(learn.data.test_dl)?

Sorry, I meant the code here:

We cannot directly pass a data loader to validate; what will happen is it’ll run the default validation set no matter what. So what I describe in the post linked above is how to override that with a labeled test set so we can use it properly. Roughly, the idea looks like the sketch below.
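
This is an untested sketch of the idea rather than the exact code from that post; it assumes the same df, path, cat_names, cont_names, procs and dep_var as above, and that the last 3000 rows are the held-out test rows:

# Build a second DataBunch whose *validation* split is the labeled test rows
data_test = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                        .split_by_idx(list(range(len(df) - 3000, len(df))))
                        .label_from_df(cols=dep_var)
                        .databunch())

# Point the learner's validation DataLoader at the labeled test rows; validate() then scores them
learn.data.valid_dl = data_test.valid_dl
learn.validate()

Note that building it this way refits the procs on df itself, so the preprocessing statistics may differ slightly from training; see the linked post for the exact approach.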