Going through the tabular part of lesson 4, I see that the following lines define 200 rows (800 to 1000) as the validation set:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
The adult.csv file has 32561 rows (+ header) so 200 lines as the validation set is around 0.6% of the data. Is that a meaningful enough percentage? In previous lessons, the validation set is typically 10%-20% of our data. Is the logic different for tabular data sets?
Ummm… sorry for the newbie question, but I was thinking of selecting 3000 rows (e.g. 1000:4000) in order to have a roughly 90/10 (training/validation) split.
Can you provide an example of how you create a 70/20/10 split and use it? (So you are also creating a test set?)
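As a quick sanity check on the percentages discussed above, here is a minimal sketch that computes the validation share for both the lesson's 200 rows and the proposed 3000 rows. The row count comes from the post; the helper name validation_fraction is made up for illustration.

```python
# Row count of adult.csv as stated in the post above (header excluded).
TOTAL_ROWS = 32561


def validation_fraction(n_valid, n_total=TOTAL_ROWS):
    """Return the share of rows that end up in the validation set."""
    return n_valid / n_total


# The lesson's split: rows 800 to 1000, i.e. 200 rows.
lesson_frac = validation_fraction(len(range(800, 1000)))
# The proposed split: rows 1000 to 4000, i.e. 3000 rows.
proposed_frac = validation_fraction(len(range(1000, 4000)))

print(f"lesson split:   {lesson_frac:.2%}")    # 0.61%
print(f"proposed split: {proposed_frac:.2%}")  # 9.21%
```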
Then do a split_by_rand_pct() on the train set, and pass the test set in as a separate dataframe. Let me know if this is confusing. This is how I make labeled test sets, but you can do the same for unlabeled ones. I can show some more code if needed.
You could also keep the same format as above and, for your split_by_idx, work out which index sits at the 70% mark, then work out how many rows 10% of your data is, and pass those in as the idxs.
Ah, that makes sense. Thank you for the explanation.
Let me know if the above is confusing at all. I can show my other method, which uses functionality from sklearn (which is what I normally use in my research).
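The index arithmetic described above could look roughly like the sketch below, here for the 70/20/10 split asked about earlier. It uses only plain Python; the helper name split_ranges is hypothetical, and it assumes the rows are already in random order (if the file is sorted in some way, shuffle the dataframe first). How the ranges plug into fastai is shown only in comments, as an assumption about the lesson's data block chain.

```python
def split_ranges(n_rows, train_pct=70, valid_pct=20):
    """Return (train, valid, test) index ranges for a contiguous split.

    Integer arithmetic is used so the boundaries do not depend on
    floating-point rounding. Whatever is left after train and valid
    becomes the test range.
    """
    train_end = n_rows * train_pct // 100
    valid_end = n_rows * (train_pct + valid_pct) // 100
    return range(0, train_end), range(train_end, valid_end), range(valid_end, n_rows)


# 32561 is the adult.csv row count quoted earlier in the thread.
train_idx, valid_idx, test_idx = split_ranges(32561)

print(len(train_idx), len(valid_idx), len(test_idx))

# Assumed usage with the lesson's code (not run here):
#   test = TabularList.from_df(df.iloc[list(test_idx)].copy(), ...)
#   ... .split_by_idx(list(valid_idx)) ...
```

Note that if the test rows are also left inside df, they will still be part of the training set, so the dataframe passed to the main TabularList would need to exclude them.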
Hi again @muellerzr (and all!)
When you define the “data” variable you are defining the training set, the validation set (by splitting) and passing the test set as a separate df. That’s understood.
Then I train and I get something like:
My understanding is that the accuracy is based on using the validation set… so, how is the test set being used in this case? (As originally the whole dataset is labeled, the test set is like a second validation set…)
The test set is not used until you are finished training and you only want to evaluate how you are doing. You do it at the very end. Does that make sense? I’ll also say, the above only handles unlabeled test sets. You need to do things differently if you want to grade the test set instead of just getting predictions.
So I could write some code to parse the test dataset, get the categories into 0’s and 1’s and compare with the second list…
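A manual comparison like the one described above might be sketched as follows. This is plain Python and does not touch fastai; the label strings "<50k" and ">=50k" and the sample values are assumptions for illustration, and pred_classes stands in for whatever class indices you extract from the model's test predictions.

```python
def accuracy(pred_classes, true_labels, positive_label=">=50k"):
    """Fraction of predictions that match the true labels.

    true_labels are the raw strings from the dataframe; they are mapped
    to 1 for positive_label and 0 for anything else before comparing.
    """
    if len(pred_classes) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    true_classes = [1 if label == positive_label else 0 for label in true_labels]
    correct = sum(p == t for p, t in zip(pred_classes, true_classes))
    return correct / len(true_classes)


# Made-up example values, only to show the call.
preds = [0, 1, 0, 0, 1]
labels = ["<50k", ">=50k", ">=50k", "<50k", ">=50k"]

acc = accuracy(preds, labels)
print(f"test accuracy: {acc:.2f}")  # 0.80
```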
I can see the use if this were a Kaggle challenge where you have no labels in the test set and need to submit results.
But in my case I have a single dataset with everything labeled… not sure if I should just forget about the test set and just have bigger train/validation sets and look at the accuracy after training.
Or maybe I am missing the smart way to use the test set
No, absolutely! The test set is extremely important. Give me a moment and I’ll give some code on how to do the evaluation. Essentially we label the test TabularList and we turn the valid_dl into that one.
We cannot directly pass a data loader to validate. What will happen is it’ll run the default validation set no matter what. So what I describe in the link above is how to go about overriding that with the labeled test set so we can properly use it.