Split my dataset

Hello!

I have collected my own dataset and am trying to run a random forest on it, but fastai complains that there are labels in my validation set that do not appear in the training set. How do I prepare and split my data?

The data is collected from a website selling used cars. My columns are brand, model, type (sedan…), gear, fuel, model_year and milage. I have some 150k rows. I have left out the brand feature because I think I need to make it less diverse in the future. I have also tried leaving out all brands that are not among the top 20.

The error message does not help me figure out which feature contains the missing category. It just says:

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

Here is my code, running fastai v1.0.x:

from pathlib import Path
from fastai.tabular import *

path = Path('/my_path')
df = pd.read_csv(path/'data.csv')
procs = [FillMissing, Categorify, Normalize]
valid_idx = range(len(df)-2000, len(df))
dep_var = 'price'
cat_names = ['brand','model_year','gear','fuel','type']
cont_names = ['milage']
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs,
                                cat_names=cat_names, cont_names=cont_names)

Are there any good tools for doing this split? Are there any good guides on how to prepare data for training?

Thanks!

Try making your validation set larger. With 150K samples, you’re only assigning 1.3% to your validation set. Try making it 10% or more and see if you get the same error.
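A minimal sketch of that suggestion, with one change that is my own assumption rather than part of the advice above: it draws a random 10% of the rows instead of the last 10%, in case the file is ordered in some way (by brand or scrape date, say). The helper name random_valid_idx and the toy DataFrame are made up for illustration; only the resulting valid_idx would be handed to fastai.

```python
import numpy as np
import pandas as pd

def random_valid_idx(n_rows, valid_pct=0.1, seed=42):
    """Return sorted row positions for a random validation set (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n_valid = int(round(n_rows * valid_pct))
    # Shuffle all row positions and keep the first n_valid of them
    return sorted(rng.permutation(n_rows)[:n_valid].tolist())

# Toy stand-in for the real data; the values are invented
df = pd.DataFrame({'milage': range(1000), 'price': range(1000)})
valid_idx = random_valid_idx(len(df), valid_pct=0.1)
print(len(valid_idx), len(valid_idx) / len(df))
```

Fixing the seed keeps the split reproducible between runs, so results from different experiments stay comparable.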

Thanks, but it didn't help.

valid_idx = range(len(df)-(len(df)//10), len(df))
print(valid_idx)

range(138281, 153645)

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

Exception: Your validation data contains a label that isn’t present in the training set, please fix your data.
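Since the message does not name the offending column, one way to narrow it down is to compare, column by column, which values occur in the validation rows but never in the training rows. This is a rough sketch in plain pandas, not a fastai API; the helper name unseen_values and the toy data are invented for illustration. It checks the dependent variable as well, because a continuous target that is treated as a set of class labels will almost always contain values the training rows have never seen.

```python
import pandas as pd

def unseen_values(df, valid_idx, cols):
    """Map each column to the values found only in the validation rows (hypothetical helper)."""
    positions = list(valid_idx)
    valid = df.iloc[positions]
    train = df.drop(df.index[positions])  # assumes a unique index
    report = {}
    for col in cols:
        missing = set(valid[col].dropna().unique()) - set(train[col].dropna().unique())
        if missing:
            report[col] = sorted(missing)
    return report

# Toy data, invented for illustration
df = pd.DataFrame({
    'fuel':  ['petrol', 'diesel', 'petrol', 'diesel', 'petrol', 'diesel'],
    'gear':  ['manual', 'auto', 'manual', 'auto', 'manual', 'auto'],
    'price': [10500, 8200, 12900, 7400, 9900, 11150],
})
valid_idx = range(len(df) - 2, len(df))
report = unseen_values(df, valid_idx, ['fuel', 'gear', 'price'])
print(report)
```

In this toy example only price shows up in the report: every fuel and gear value in the validation rows also appears in training, but the two validation prices do not.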

I think I have found what is wrong. It seems I need to be explicit that I am trying to create a regression model and not a classifier.

Right now I don’t know how to use this information.

Define your databunch like this:

data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .databunch())

The label_cls=FloatList parameter tells fastai that you want to do regression.

Then you’ll want to define your model something like this:

learn = tabular_learner(data, layers=[200, 100], y_range=y_range, metrics = exp_rmspe)

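Here exp_rmspe exponentiates the predictions and targets and then computes the root mean squared percentage error (RMSPE). Since log=True takes the log of the dependent variable when the databunch is created, the metric comes out as the RMSPE on the original price scale. Note that y_range is not defined above: it is the (min, max) range that the model's output is squashed into, and it should be given in log space to match the log targets.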
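To make the metric concrete, here is a NumPy sketch of what exp_rmspe computes, as far as I understand the fastai v1 implementation. The function name exp_rmspe_np and the prices are made up for illustration. The last lines show one assumed way to pick y_range, following the pattern I recall from the course's Rossmann notebook (0 up to the log of the largest target plus a 20% margin); treat it as a starting point, not as the required definition.

```python
import numpy as np

def exp_rmspe_np(log_pred, log_targ):
    """RMSPE on the original scale, given predictions and targets in log space."""
    pred, targ = np.exp(log_pred), np.exp(log_targ)
    pct_err = (targ - pred) / targ
    return float(np.sqrt(np.mean(pct_err ** 2)))

# Invented prices, only for illustration
prices = np.array([10000.0, 20000.0, 40000.0])
preds = np.array([11000.0, 18000.0, 40000.0])

# With log=True the model works on log(price), so the metric is fed log values
score = exp_rmspe_np(np.log(preds), np.log(prices))
print(round(score, 4))

# Assumed choice of y_range: from 0 to a bit above the largest log price
max_log_y = float(np.log(prices.max() * 1.2))
y_range = (0.0, max_log_y)
print(y_range)
```

The percentage errors here are 10%, 10% and 0%, so the score is the root of their mean square, which is a little over 8%.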

Thanks, I will try that! Before you wrote your response I went and changed my dataset to use a categorical label instead. Now it works, but I get lousy results.

learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
learn.fit_one_cycle(10, 1e-1)

Any tips?

Change your learning rate to 1e-2:

learn.fit_one_cycle(10, 1e-2)

Much more than that I can’t really say. You’d have to play around with your data. I couldn’t get mine to work when I tried to label it as categorical with multiple classes.

OK, thanks. I will try that. It was the “default” from the documentation. I chose 1e-1 because I think I saw Jeremy choose a learning rate that was near the bottom of the loss curve (look at my line chart above) but not quite at it, where the slope was still negative.