Error due to unexpected missing values with TabularPandas

Postradamus · March 6, 2020, 4:12pm

Hi,

I’m new here and still getting to know fastai2 (great toy, thank you for all the hard work!), so apologies if this has already been answered and I missed it, or if the solution should be obvious. I also hope this is the appropriate place to ask.

To the problem. I was playing with tabular data today when I noticed that the transformation pipeline of TabularPandas does not seem to be happy about unexpected missing data, throwing this assertion error:

AssertionError: nan values in cont_prop but not in setup training set

where cont_prop is just some random column name.

I’ve set up a notebook to reproduce the behavior here.

The assertion error seems to be thrown in two cases: when TabularPandas is first initialized over the training/validation data and the validation part contains unexpected missing values, and when an existing TabularPandas instance is applied to a test set with unexpected missing values.
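For anyone hitting the same error, the condition described above can be checked up front with plain pandas, before anything is handed to TabularPandas. This is only a rough sketch: the helper name `unexpected_nan_columns` and the toy frames are made up for illustration, and it does not use or reproduce any fastai2 internals.

```python
import numpy as np
import pandas as pd

def unexpected_nan_columns(train_df, other_df, cont_names):
    """Return the columns in cont_names that contain NaNs in other_df
    but have no NaNs at all in train_df (hypothetical helper)."""
    return [
        col for col in cont_names
        if other_df[col].isna().any() and not train_df[col].isna().any()
    ]

# Toy data: "cont_prop" is complete in training, "other" already has a NaN there.
train = pd.DataFrame({"cont_prop": [1.0, 2.0, 3.0], "other": [0.5, np.nan, 1.5]})
valid = pd.DataFrame({"cont_prop": [np.nan, 4.0], "other": [np.nan, 2.0]})

print(unexpected_nan_columns(train, valid, ["cont_prop", "other"]))
# prints ['cont_prop']
```

Only `cont_prop` is reported, because `other` already has missing values in the training frame and so is not "unexpected".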

A fix which seems to work for my toy data sets is to just add a row to the training set which contains a bunch of nans for the relevant columns.

But I’m wondering what to do if one cannot easily anticipate which columns may contain missing values? Just adding a row with every entry being nan seems like it would be inefficient. Is there possibly some best practice for dealing with this?

Thanks!

sgugger · March 6, 2020, 4:24pm

In general, you should fix this manually: if there are no NaNs in your training set, your model won’t have any idea how to deal with them in the validation set. Just adding one NaN value to the training set won’t really solve your problem: the model will still be inexperienced with them.

fastai2 raises an error to force you to deal with this, so just think of the best value to put in place of your NaN in the specific case of your problem, and you should be good.
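One way to put a value in place of the NaNs before the data reaches TabularPandas is sketched below with plain pandas. The helper `fill_unseen_nans` is hypothetical, and falling back to the training median is only an assumed default for illustration; the right replacement depends on the problem, and an explicit value per column can be passed instead.

```python
import numpy as np
import pandas as pd

def fill_unseen_nans(train_df, other_df, cont_names, fill_values=None):
    """Fill NaNs in other_df for columns that have no NaNs in train_df.

    fill_values maps column name -> replacement value. Columns not listed
    fall back to the training median (an assumption, not a recommendation).
    """
    fill_values = fill_values or {}
    filled = other_df.copy()  # leave the caller's frame untouched
    for col in cont_names:
        if train_df[col].isna().any():
            continue  # NaNs were seen in training; leave these to the usual preprocessing
        value = fill_values.get(col, train_df[col].median())
        filled[col] = filled[col].fillna(value)
    return filled

train = pd.DataFrame({"cont_prop": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"cont_prop": [np.nan, 4.0]})

filled = fill_unseen_nans(train, test, ["cont_prop"])
print(filled["cont_prop"].tolist())
# prints [2.0, 4.0]
```

Passing something like `fill_values={"cont_prop": 0.0}` overrides the median for that column when a domain-specific value makes more sense.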

Postradamus · March 6, 2020, 4:44pm

That was fast. Very good point, thanks!