Silent Killer? Dubious bug in fastai.tabular

jaxondk · November 12, 2020, 8:56pm

Hello friends!

At work, I do a fair amount with tabular data, and I’m using fastai. Unfortunately, I’m tied to fastai v1 and can’t upgrade to fastai v2 at this time, so I’m posting this on the forums instead of filing an issue.

I’ve just discovered a behavior I was not expecting that has major consequences for me, and could for others. This may be old news but I haven’t found anything about it on the forums / google, so I wanted to document it here in case it’s helpful to others. I’ll give my context for discovering the behavior first, and then the unexpected behavior.

Typically, I don’t ever extract the pytorch model from the learner. However, fastai cannot do everything and at times we need to extract the pytorch models (such as when using 3rd party packages that expect pytorch models and not fastai learners). This is the case for the model interpretability package “shap”, which accepts a trained pytorch model as a parameter and uses it to make predictions and generate explanations based on features.

I’ve discovered that TabularDataBunch.from_df has an undocumented and (for me at least) unexpected behavior: when you don’t pass cont_names, it silently changes the order of the features from the order in df.columns to an arbitrary one, as a side effect of the set operations it performs. Here are the first couple of lines of the function:

cat_names = ifnone(cat_names, []).copy()
cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-set(dep_var)))

As a result, the features can end up shuffled even if you have no categorical features. I am using no categorical features, so I expected my continuous features to be in the same order as df.columns, but they are not. To make this even harder to detect, the databunch’s inner_df does not show the reordering; only show_batch does, so I did not notice it. This means that when I predict on data (ordered as in df.columns) using the extracted pytorch model, no error is raised, but the predictions are completely bogus, because the model was trained with the features arriving in a different order.
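Here is a minimal sketch of why the order gets lost, in plain Python with no fastai involved. The column names are made up for illustration, and I use {dep_var} instead of set(dep_var) so that the only difference between the two results is the ordering:

```python
# Hypothetical column names, for illustration only.
columns = ["age", "income", "balance", "tenure", "target"]
cat_names = []
dep_var = "target"

# Set-based inference: the right elements, but the iteration order of a set
# of strings is arbitrary and can differ from one Python process to the next.
via_sets = list(set(columns) - set(cat_names) - {dep_var})

# Order-preserving alternative: filter the original list instead.
excluded = set(cat_names) | {dep_var}
in_order = [c for c in columns if c not in excluded]

print(via_sets)  # the same four names, in no guaranteed order
print(in_order)  # ['age', 'income', 'balance', 'tenure']
```

Since string hashing is randomized per process by default, the set-based list may even come out in a different order between two runs of the same script.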

So, if you ever manually call learner.model.forward on data in your df, beware! You need to be sure you re-order your feature columns into the random ordering you see when calling learner.data.show_batch(). An easy fix if you haven’t already spent a lot of time training your models is to just manually pass cont_names to from_df so it doesn’t try to figure them out for you.
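If the model is already trained, the reordering can be done with a small helper like the one below. This is only a sketch: reorder_row and the column names are hypothetical, and trained_order is whatever order you read off learner.data.show_batch(). I believe the same list is also stored on the training TabularList as cont_names, but treat that as an assumption and check it against show_batch.

```python
def reorder_row(row, source_order, trained_order):
    """Rearrange the values in `row` from source_order into trained_order."""
    if set(source_order) != set(trained_order):
        raise ValueError("source and trained column names do not match")
    position = {name: i for i, name in enumerate(source_order)}
    return [row[position[name]] for name in trained_order]


# Hypothetical example values.
source_order = ["age", "income", "balance"]   # order of df.columns
trained_order = ["balance", "age", "income"]  # order seen in show_batch()
row = [42, 55000.0, 1200.5]

reordered = reorder_row(row, source_order, trained_order)
print(reordered)  # [1200.5, 42, 55000.0]
```

With a pandas DataFrame the same thing is just df[trained_order] before converting to a tensor.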

I would expect the order of columns to be preserved when possible, or at least for the documentation to mention this. Since this is fastai v1, I wasn’t sure whether I should submit an issue or just post about it on the forums. I could also be way off, and this isn’t actually an issue or is very rare, and I have just been very dense/unlucky.

LMK if you’ve run into problems with this! And also lmk if this should be submitted as an issue

PS
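One other odd quirk of the two lines of code I included is that they do not actually remove the dep var from cont_names: set(dep_var) on a string gives the set of its characters, not a set containing the column name, so the column name itself is not subtracted unless it is a single character. I think this doesn’t matter because later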

src = src.label_from_df(cols=dep_var) if classes is None else src.label_from_df(cols=dep_var, classes=classes)

gets called.
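To see the quirk from the PS in isolation, here is a small plain-Python sketch, assuming dep_var is passed as an ordinary string. The column names are again made up:

```python
columns = ["age", "income", "target"]  # hypothetical column names
dep_var = "target"

# set() on a string iterates over its characters.
print(sorted(set(dep_var)))  # ['a', 'e', 'g', 'r', 't']

with_set_call = set(columns) - set(dep_var)   # subtracts single characters
with_set_literal = set(columns) - {dep_var}   # subtracts the column name

print("target" in with_set_call)     # True
print("target" in with_set_literal)  # False
```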