There are nan values in field but there were none in the training set

I’m trying to build a regressor for tabular data. When I create a databunch as follows:

data = (TabularList.from_df(train_df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .random_split_by_pct(valid_pct=0.2)
                           .label_from_df(cols=dep_var)
                           .add_test(test, label=0)
                           .databunch())

I get the following error:

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 42))

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-15-06fcbb6e7104> in <module>()
      2                            .split_by_idx(list(range(800,1000)))
      3                            .label_from_df(cols=dep_var)
----> 4                            .add_test(test, label=0)
      5                            .databunch())

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in add_test(self, items, label)
    433         if label is None: label = self.train[0][1].obj
    434         labels = [label for _ in range_of(items)]
--> 435         if isinstance(items, ItemList): self.test = self.valid.new(items.items, labels, xtra=items.xtra)
    436         else: self.test = self.valid.new(items, labels)
    437         return self

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in new(self, x, y, **kwargs)
    473             return self.__class__(x, y, tfms=self.tfms, tfm_y=self.tfm_y, **self.tfmargs)
    474         else:
--> 475             return self.new(self.x.new(x, **kwargs), self.y.new(y, **kwargs)).process()
    476 
    477     def __getattr__(self,k:str)->Any:

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, xp, yp, filter_missing_y)
    519             filt = array([o is None for o in self.y])
    520             if filt.sum()>0: self.x,self.y = self.x[~filt],self.y[~filt]
--> 521         self.x.process(xp)
    522         return self
    523 

/usr/local/lib/python3.6/dist-packages/fastai/data_block.py in process(self, processor)
     65         if processor is not None: self.processor = processor
     66         self.processor = listify(self.processor)
---> 67         for p in self.processor: p.process(self)
     68         return self
     69 

/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in process(self, ds)
     60             return
     61         for i,proc in enumerate(self.procs):
---> 62             if isinstance(proc, TabularProc): proc(ds.xtra, test=True)
     63             else:
     64                 #cat and cont names may have been changed by transform (like Fill_NA)

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in __call__(self, df, test)
     30         "Apply the correct function to `df` depending on `test`."
     31         func = self.apply_test if test else self.apply_train
---> 32         func(df)
     33 
     34     def apply_train(self, df:DataFrame):

/usr/local/lib/python3.6/dist-packages/fastai/tabular/transform.py in apply_test(self, df)
     81             elif pd.isnull(df[name]).sum() != 0:
     82                 raise Exception(f"""There are nan values in field {name} but there were none in the training set. 
---> 83                 Please fix those manually.""")
     84 
     85 class Normalize(TabularProc):
Exception: There are nan values in field BsmtFinSF2 but there were none in the training set. 
                Please fix those manually.

How can I fix this? Any hints would be appreciated.

Filling the test dataframe with 0 fixed the issue:

test_df = test_df.fillna(0)

But are there other options? I mean something better than filling with 0, as I thought the fastai library would fill in missing values when the FillMissing transformation is used!

FillMissing is meant to handle missing values that already appeared in your training set. As the error message said, the NaN values only appeared in the validation or test set, and the library can’t handle that (it can’t add a new NaN column at evaluation time, for instance).
It’s yours to fix, in the sense that we don’t want to make the choice for you. You can either create a NaN in that column in your training set, choose a value to fill those NaNs in your validation set, or decide to remove those samples.
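
The three options above could look roughly like this in plain pandas. This is a minimal sketch on made-up data: the frames `train_df` and `test_df` and their values are hypothetical, and only the column name `BsmtFinSF2` comes from the error message earlier in the thread. The median is just one possible fill value, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the column is complete in train but has a NaN in test.
train_df = pd.DataFrame({"BsmtFinSF2": [0.0, 10.0, 20.0, 30.0]})
test_df = pd.DataFrame({"BsmtFinSF2": [5.0, np.nan, 15.0]})

col = "BsmtFinSF2"

# Option 1: put a NaN into the training column so that a missing-value
# transform fitted on the training set sees missing values in this column.
# This sacrifices one real training value (here, the first row).
train_with_nan = train_df.copy()
train_with_nan.loc[train_with_nan.index[0], col] = np.nan

# Option 2: choose the fill value yourself, computed from the training data only.
fill_value = train_df[col].median()
test_filled = test_df.fillna({col: fill_value})

# Option 3: drop the test rows that have a NaN in that column.
test_dropped = test_df.dropna(subset=[col])
```

Which option is appropriate depends on the data: dropping rows is not possible if every test row needs a prediction, and a fill value of 0 may or may not be meaningful for the column.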

I have the same problem: one of my variables has no missing values in the training set, but it has one missing value in the test set.
My data is tabular and I am using TabularPandas to prepare my independent and dependent variables for my random forest classifier. I am doing the following:

train = pd.read_csv(path/'train.csv', low_memory=True)
test = pd.read_csv(path/'test.csv', low_memory=True)

to = TabularPandas(train, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=dep_var, splits=splits)

to_test = TabularPandas(test, procs=procs, cat_names=cat_names, cont_names=cont_names)

But when I try to look at my data with to.show(3), it gives the following error:
"['Fare_na'] not in index"

The reason is that the Fare column has no NaN values in the training data but does have NaN values in the test data. What should I do to fix this?

I’d recommend following the last two sentences of Sylvain’s reply. While they’re two years old now, the answer hasn’t changed.

Thank you for your response.
I thought there might be a better approach. I remember that in a very early fastai class (Introduction to Machine Learning), the function we used to transform the training data returned an attribute that we could pass in when transforming the test data, and it took care of all these things so that the training set and test set ended up with the same set of columns.

Another related question: when we use TabularPandas, categorical columns are converted to numbers by simply replacing each unique level with a number, and the numbers are assigned consecutively in the order the levels are seen in a column. If we run TabularPandas separately on the train and test data, is it possible that the same level of a categorical variable is replaced with different numbers because the levels appear in a different order in the two datasets?

We still do that with cat and cont names. It’s just that the procs are based on the training data, not your entire dataset.

Yes, see my previous comment. We always preprocess our data based on the training dataset and apply this to the validation set and any new test data we have. This is what our model was trained to identify and work with, so we can’t have it perform black magic :wink:
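
To illustrate the principle in plain pandas (this is not fastai’s internal implementation, only a sketch of the same idea): the level-to-number mapping is built once from the training column and then reused for the test column, so the same level gets the same code regardless of row order. The column name `Embarked` and its values are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
# Different row order than train, plus a level ("X") never seen in training.
test = pd.DataFrame({"Embarked": ["Q", "S", "X"]})

# Build the vocabulary from the training data only (sorted unique levels).
categories = pd.Categorical(train["Embarked"]).categories

# Encode both frames with the same vocabulary. Codes start at 0, and any
# value that is not in the vocabulary (or is NaN) gets the code -1.
train_codes = pd.Categorical(train["Embarked"], categories=categories).codes
test_codes = pd.Categorical(test["Embarked"], categories=categories).codes
```

Encoding the test frame on its own would instead build a vocabulary from the test levels, which is how separate preprocessing can end up assigning different numbers to the same level.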

Thanks for your answer, but I did not get how we apply the preprocessing that we did on the training dataset to any new test data (assuming that we use separate TabularPandas objects for the training set and the test set). Don’t we need to concatenate the train data and the test data, preprocess them using TabularPandas, and then separate them again?

The documentation is pretty clear on that, see the bottom of this section:

learn.dls.test_dl will do this automatically

I get the issue and why there is no one-size-fits-all solution. In my case I would drop these observations and write some information to a log, as there may be something wrong with them. However, if I want to build production-ready code, I need a way to apply the procs/transformers to the test set and then take action on the observations with issues (similar to fit_transform in sklearn). The thing is, before running learn.dls.test_dl it might be okay for missing values to be present (if they are also present in the training set), but I can’t really know unless I can access the procs generated at training. So it is hard to account for this in the ETL flow, because you don’t really know what is going to show up in the test set or inference set. Any suggestions?

I did a hotfix here, although it is not fastai-specific:

    # Columns that have no missing values in the training data
    missing_rate = df_train.isnull().mean()
    none_missing_list = list(missing_rate[missing_rate == 0].index)

    # Drop prediction rows with a NaN in any of those columns
    df_pred_x_len_before = len(df_pred_x)
    df_pred_x = df_pred_x.dropna(subset=none_missing_list)
    df_pred_x_len_after = len(df_pred_x)
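
A slightly more reusable variant of this hotfix, as a sketch only: the function name `drop_unseen_missing` and the logger name are hypothetical, and it relies only on pandas and the standard `logging` module, not on anything in fastai. It finds the columns that were complete in the training frame, drops prediction rows that have a NaN in any of them, and logs which rows were dropped.

```python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger("etl")  # hypothetical logger name


def drop_unseen_missing(df_train: pd.DataFrame, df_pred: pd.DataFrame) -> pd.DataFrame:
    """Drop rows of df_pred with NaNs in columns that had no NaNs in df_train."""
    # Columns that are complete in training and also present in the prediction frame.
    complete_cols = [
        c for c in df_train.columns
        if c in df_pred.columns and not df_train[c].isnull().any()
    ]
    # True for rows that have a NaN in at least one of those columns.
    bad_rows = df_pred[complete_cols].isnull().any(axis=1)
    if bad_rows.any():
        logger.warning(
            "Dropping %d row(s) with unseen missing values; index: %s",
            int(bad_rows.sum()),
            list(df_pred.index[bad_rows]),
        )
    return df_pred.loc[~bad_rows]


# Toy example with made-up data: Fare is complete in train, Age is not.
df_train = pd.DataFrame({"Fare": [7.25, 8.05, 13.0], "Age": [22.0, np.nan, 35.0]})
df_pred_x = pd.DataFrame({"Fare": [np.nan, 9.5], "Age": [np.nan, 40.0]})
clean = drop_unseen_missing(df_train, df_pred_x)
```

Rows with a NaN only in columns that also had NaNs in training (such as `Age` here) are kept, on the assumption that the fitted preprocessing can handle them.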