Question About Tabular FillMissing Behavior

sean.shahkarami · April 11, 2020, 9:59pm

Hello everyone!

I was trying out the FillMissing transformer in fastai.tabular.transform and noticed that if there are no missing values in a training set column, the transformer will throw this exception when applied to a test set.

I was curious whether this behavior was an explicit decision, or if any alternatives are being considered?

In my particular case, I went ahead and just manually filled the missing test values with the median of the training values without using any _na columns.
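
For anyone wanting to do the same, here is a minimal sketch of that workaround in plain pandas. It does not use fastai's API, and the column name `age` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the training column has no missing values,
# while the test column does.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"age": [25.0, np.nan]})

# Compute the medians on the training set only, so that no statistics
# leak from the test set into the fill values.
train_medians = train[["age"]].median()

# Fill missing test values with the training medians; no _na indicator
# column is created here.
test_filled = test.fillna(train_medians)
```

The key point is that the fill value comes from the training data, not from the test data itself.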

One immediate problem I can see with not halting immediately is that all the _na columns would be False in the transformed training set, which could lead to unpredictable behavior when doing predictions.

Issues like this may indicate that there’s no good default in general, hence forcing the user to address it, but I figured I’d ask to hear other folks’ thoughts.

sgugger · April 11, 2020, 10:24pm

Yes, it is a design choice. If we always added _na columns even when there are no NaNs in the training set, your model would not learn anything about the _na column or the missing values. So this error warns you that there is something new in the validation set and lets you fix it as appropriate for your problem.
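
As a rough illustration of that reasoning (plain pandas again, not fastai's actual implementation; the column name `age` and the data are hypothetical), an indicator built from a training column with no missing values is constant, so it carries no signal a model could learn from:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): no NaNs in training, one NaN in validation.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
valid = pd.DataFrame({"age": [25.0, np.nan]})

# Missing-value indicators, analogous in spirit to an "_na" column.
train_na = train["age"].isna()
valid_na = valid["age"].isna()

# In training the indicator is False everywhere, i.e. a constant feature.
indicator_is_constant = bool(train_na.nunique() == 1)

# At validation time the indicator takes a value (True) that was never
# seen during training. This is the situation the error flags.
unseen_missing = bool((not train_na.any()) and valid_na.any())
```

A check like `unseen_missing` is one way to detect the case early and decide how to handle it for a given problem.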

sean.shahkarami · April 11, 2020, 10:30pm

Got it. Yes, after thinking about it a bit more, even just doing something like filling the median without the na columns seems like it would be a questionable default… So, I agree halting is probably the best option.

Thanks!