I have collected my own dataset and am trying to run a random forest on it, but fastai complains that there are labels in my validation set that do not appear in the training set. How should I prepare and split my data?
The data was collected from a website selling used cars. My columns are brand, model, type (sedan…), gear, fuel, model_year and milage. I have some 150k rows. I have left out the brand feature because I think I need to make it less diverse in the future. I have also tried leaving out all brands outside the top 20.
The error message does not help me figure out which feature contains the missing category. It just says:
Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
Here is my code, running on fastai v1.0x:
    from pathlib import Path
    from fastai.tabular import *

    path = Path('/my_path')
    df = pd.read_csv(path/'data.csv')

    procs = [FillMissing, Categorify, Normalize]
    valid_idx = range(len(df)-2000, len(df))
    dep_var = 'price'
    cat_names = ['brand','model_year','gear','fuel','type']
    cont_names = ['milage']

    data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs,
                                    cat_names=cat_names, cont_names=cont_names)
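To narrow down which column is responsible, this is the kind of check I have in mind: a minimal pandas-only sketch that lists, per column, the values found in the validation rows but not in the training rows. The helper name `unseen_categories` and the toy data are made up for illustration; it assumes `valid_idx` holds row positions (as in my code) and that the DataFrame index is unique. I include `price` only because the message says "label"; whether the dependent variable is actually the culprit is a guess on my part.

```python
import pandas as pd

def unseen_categories(df, columns, valid_idx):
    """Map each column to the values that occur in the validation rows
    but never in the training rows (hypothetical helper, not part of fastai)."""
    valid = df.iloc[list(valid_idx)]   # validation rows, selected by position
    train = df.drop(valid.index)       # everything else (assumes a unique index)
    report = {}
    for col in columns:
        train_values = set(train[col].dropna().unique().tolist())
        valid_values = set(valid[col].dropna().unique().tolist())
        missing = valid_values - train_values
        if missing:
            report[col] = sorted(missing)
    return report

# Toy stand-in for the real data; the last two rows play the validation set.
toy = pd.DataFrame({
    'gear':  ['manual', 'automatic', 'manual', 'manual', 'automatic'],
    'fuel':  ['petrol', 'diesel', 'petrol', 'diesel', 'electric'],
    'price': [5000, 7000, 5000, 6500, 7000],
})
valid_idx = range(len(toy) - 2, len(toy))

report = unseen_categories(toy, ['gear', 'fuel', 'price'], valid_idx)
print(report)  # {'fuel': ['electric'], 'price': [6500]}
```

On the real data the call would be something like `unseen_categories(df, cat_names + [dep_var], valid_idx)`, using the names from my code above.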
Are there any good tools for doing this split? Are there any good guides on how to prepare data for training?