TabularDataBunch Error: "Your validation data contains a label that isn't present in the training set, please fix your data."

I’ve been getting this error ever since I upgraded to 1.0.38. I get the same error on 1.0.37, but not on 1.0.36.

However, my Learner results are really bad in 1.0.36, so maybe this error is trying to keep me from doing something stupid.

I pushed a minimal test case to: https://github.com/matanhershberg/fastai-houses/tree/tabular-data-bunch-issue

The notebook in question is called Houses.ipynb

Any help would be greatly appreciated.

Matan

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process_one(self, item)
    277     def process_one(self,item):
--> 278         try: return self.c2i[item] if item is not None else None
    279         except:

KeyError: 229456

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-8-2f10f34fa878> in <module>
----> 1 data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/tabular/data.py in from_df(cls, path, df, dep_var, valid_idx, procs, cat_names, cont_names, classes, test_df, **kwargs)
     93         src = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
     94                            .split_by_idx(valid_idx)
---> 95                            .label_from_df(cols=dep_var, classes=classes))
     96         if test_df is not None: src.add_test(TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names,
     97                                                                  processor = src.train.x.processor))

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    391             self.valid = fv(*args, **kwargs)
    392             self.__class__ = LabelLists
--> 393             self.process()
    394             return self
    395         return _inner

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process(self)
    438         "Process the inner datasets."
    439         xp,yp = self.get_processors()
--> 440         for i,ds in enumerate(self.lists): ds.process(xp, yp, filter_missing_y=i==0)
    441         return self
    442 

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    563     def process(self, xp=None, yp=None, filter_missing_y:bool=False):
    564         "Launch the processing on `self.x` and `self.y` with `xp` and `yp`."
--> 565         self.y.process(yp)
    566         if filter_missing_y and (getattr(self.x, 'filter_missing_y', None)):
    567             filt = array([o is None for o in self.y])

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process(self, processor)
     66         if processor is not None: self.processor = processor
     67         self.processor = listify(self.processor)
---> 68         for p in self.processor: p.process(self)
     69         return self
     70 

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process(self, ds)
    284         ds.classes = self.classes
    285         ds.c2i = self.c2i
--> 286         super().process(ds)
    287 
    288     def __getstate__(self): return {'classes':self.classes}

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process(self, ds)
     36     def __init__(self, ds:Collection=None):  self.ref_ds = ds
     37     def process_one(self, item:Any):         return item
---> 38     def process(self, ds:Collection):        ds.items = array([self.process_one(item) for item in ds.items])
     39 
     40 class ItemList():

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in <listcomp>(.0)
     36     def __init__(self, ds:Collection=None):  self.ref_ds = ds
     37     def process_one(self, item:Any):         return item
---> 38     def process(self, ds:Collection):        ds.items = array([self.process_one(item) for item in ds.items])
     39 
     40 class ItemList():

~/miniconda3/envs/houses/lib/python3.7/site-packages/fastai/data_block.py in process_one(self, item)
    278         try: return self.c2i[item] if item is not None else None
    279         except:
--> 280             raise Exception("Your validation data contains a label that isn't present in the training set, please fix your data.")
    281 
    282     def process(self, ds):

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.

What happens when you remove this line from your notebook:

cat_names = cat_names[cat_names < 50]

It seems that you are giving the databunch fewer categories than there are in the dataframe.

I agree, this is a problem.

A friend of mine has the same problem, and I went through his code to try to debug it. I don’t remember the details of what I found, but even with just two variables (one independent, one dependent) and a small, cleaned dataset, this error kicked in.

The same code worked in version 1.0.36 without any problem.

Solution :slight_smile:

Sorry

I get the same error without that line.

That line is trying to take a subset of the categories, so it does pass fewer categories than the dataframe has, with the rest being automatically treated as continuous variables, as far as I understand the code.

Thanks Willismar,

It says I don’t have access to that thread. Not sure why.

Hi, sorry …

It’s because the thread is inside the course v3 category, which isn’t shared yet. But I will post the solution I found here.

Found a definitive solution, works even with version 1.0.39

# After loading the dataset, grab the targets and build a sorted list of unique values
classes = df['SalePrice'].unique()
classes.sort()

# Later, pass that list so the targets are treated as categorical values
.label_from_df(cols=dep_var, classes=classes)

The error is thrown as soon as… “your validation data contains a label that isn’t present in your training set” (don’t know how I can be clearer than that :wink: )
This is going to be a problem because, when you validate your model, it can’t really predict something it has never seen before. The real fix is to make sure your training set contains every label at least a few times, by either:

  • making sure your validation indexes are class-balanced
  • removing items with rare labels from your data.
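
The two checks above can be sketched in plain Python. This is only an illustration, not part of fastai: the helper names `unseen_validation_labels` and `drop_rare_labels`, the toy labels, and the `min_count` threshold are all made up for the example.

```python
from collections import Counter

def unseen_validation_labels(labels, valid_idx):
    """Return the labels that occur in validation rows but in no training row."""
    valid_set = set(valid_idx)
    train_labels = {y for i, y in enumerate(labels) if i not in valid_set}
    valid_labels = {labels[i] for i in valid_idx}
    return valid_labels - train_labels

def drop_rare_labels(labels, min_count=2):
    """Return the positions of rows whose label occurs at least `min_count` times."""
    counts = Counter(labels)
    return [i for i, y in enumerate(labels) if counts[y] >= min_count]

# Toy data (hypothetical): label "c" appears once, and only in the validation rows.
labels = ["a", "a", "b", "b", "c", "a", "b"]
valid_idx = [3, 4]

missing = unseen_validation_labels(labels, valid_idx)  # labels that would trigger the error
keep = drop_rare_labels(labels, min_count=2)           # row positions that survive the filter
```

If rows are dropped this way, the validation indexes have to be recomputed against the filtered data, since the old positions no longer line up.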

Hi @sgugger,

I just found that this is a problem when dealing with regression models: the logic in data_block.py tries to categorize all the continuous target values and split them into training and validation sets. At that point the categorized target values (for a regression model) will never match, and the error is raised.

Earlier, in version 1.0.36, it was just the line below, which didn’t cause any trouble with regression models:

self.c2i.get(item)

My suggestion, if I may, is to change it back, because the .get method of a dictionary supports a default value:

def process_one(self,item):
    try: return self.c2i.get(item, None)
    except:
        raise Exception("Your validation data contains a label that isn't present in the training set, please fix your data.")
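
To make the difference between the two lookups concrete, here is a simplified stand-in written in plain Python. It is not fastai's code: `process_one_strict` and `process_one_lenient` are hypothetical names, and the training prices are invented. Only 229456 comes from the KeyError in the traceback above.

```python
# Invented training targets; 229456 is the value from the KeyError in the traceback.
train_labels = [208500, 181500, 223500]
valid_labels = [181500, 229456]

# A class-to-index mapping built from the training labels only.
classes = sorted(set(train_labels))
c2i = {c: i for i, c in enumerate(classes)}

def process_one_strict(item):
    """Indexing with [] fails for any label the training set never contained."""
    try:
        return c2i[item] if item is not None else None
    except KeyError:
        raise Exception("Your validation data contains a label that isn't present "
                        "in the training set, please fix your data.")

def process_one_lenient(item):
    """dict.get returns None for unknown labels instead of raising."""
    return c2i.get(item)

lenient = [process_one_lenient(y) for y in valid_labels]

try:
    strict = [process_one_strict(y) for y in valid_labels]
    strict_error = None
except Exception as e:
    strict_error = str(e)
```

With a continuous target such as a sale price, almost every validation value is "unseen" in this sense, which is why the strict lookup fails for regression data while the lenient one silently maps those rows to None.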

If you have a regression model, you should use label_cls = FloatList when you label it, if the data block API doesn’t automatically detect that it’s a regression target.


Oh yeah …

Now I get it. Thank you so much for this clarification.

Hi @rubyrhod, @Lankinen

Sorry about what I said before, I was wrong!

The better way to solve this is to pass this information in your call:
.label_from_df(cols=dep_var, label_cls=FloatList, log=True)

Then your target column will be treated as a regression target.


Thanks @sgugger

This is pointing me in the right direction. I was trying to follow the tabular tutorial but I see I have more reading to do.

I get the same error and have created a topic on it. Did you get past the error? If so, can you explain what your solution was?

Thanks!

I was using sklearn’s train_test_split to generate the training and validation dataframes. Using the stratify option in the function call ensures that the labels are present in both dataframes, which fixes this error.
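
A minimal sketch of that approach, assuming a classification target, could look like the following. The label values, `test_size` and `random_state` are made up for the example. Note that `stratify` requires every class to have at least two members.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical categorical target with an uneven class distribution.
labels = np.array(["cheap"] * 6 + ["mid"] * 4 + ["expensive"] * 2)
all_idx = np.arange(len(labels))

# Split the row positions, keeping the class proportions the same in both parts.
train_idx, valid_idx = train_test_split(
    all_idx, test_size=0.5, stratify=labels, random_state=0
)

# Every label seen in validation is also present in training.
assert set(labels[valid_idx]) <= set(labels[train_idx])
```

The resulting `valid_idx` is the kind of index list that can be passed as `valid_idx` to `TabularDataBunch.from_df`, as in the call from the original post.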

How do I use the last line of your solution? I get ‘invalid syntax’ when I try to use this in my script.

That line only works if your problem is a classification problem; it’s not the solution for a regression problem.
Classification Problem

data = (TabularList
            .from_df(.....args....)
            .split_by_idx(....args.....)
            .label_from_df(cols=dep_var, classes=["class1", "class2", .... etc ])
            .databunch() )

Regression problem

data = (TabularList
            .from_df(.....args....)
            .split_by_idx(....args.....)
            .label_from_df(cols=dep_var, label_cls=FloatList)
            .databunch() )

Something like that.

You need to read the documentation to determine what you need exactly.
