Getting tabular_learner to Production: Best Way to Bring in Data for Predictions?

I'm training tabular_learner models and would like to get them to production, but I'm struggling to understand the best way to bring in the data I want to run predictions on, since the tabular datasets require parameters such as continuous variables, y_names, etc.

When training these models, the process of prepping the data can be represented as:
Data -> DataFrame -> Categorizing Columns (using .astype('category')) -> TabularPandas -> TabDataLoader -> DataLoader
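
In code, that looks roughly like this (a minimal sketch; the column names and the classification target are placeholders):

```python
from fastai.tabular.all import *

# df is the raw DataFrame; column names below are made up for illustration
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df,
                   procs=[Categorify, FillMissing, Normalize],  # replaces the manual .astype('category') step
                   cat_names=['cat_col'],
                   cont_names=['cont_col'],
                   y_names='target',
                   splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy)
```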

Obviously this applies my categoricals, continuous variables, Normalize(), etc., but is this same process required in my production environment? Just to experiment, I have been saving out my models, importing them back in, and trying to run predictions on my test set with learn.predict, which just yields:

AttributeError: 'DataFrame' object has no attribute 'conts'

when using learn.predict(test_dl.iloc[0]), for example. Running learn.predict(test_dl) just yields
"AttributeError: to_frame".

This test set works when I run learn.validate(dl=dls.test_dl) to check my accuracy while training models, but that will not work in production, where I don't have the ground truth for future predictions.

Does anybody have some code they could share showing how they prep their data for a tabular_learner before running predictions in a more production-like environment?

I can of course share the full tracebacks, but the question is more of a general best-practices one rather than "I believe I'm doing this correctly, so what is wrong with my code?". I'm somewhat lost because, as a new user who has gone through the course, I'm still very confused by DataLoaders, DataBunches, and all of the transforms and data-type changes we must apply between loading data and running predictions/training.

Have you checked out the tabular tutorial? https://docs.fast.ai/tutorial.tabular.html

It covers using both learn.predict and learn.dls.test_dl for preprocessing the data and running predictions.
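
In short, with test_df standing in for your new data (same columns as training, minus the label):

```python
# batch inference: test_dl re-applies the fitted Categorify/FillMissing/Normalize
dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=dl)  # no labels needed

# single-row inference
row, clas, probs = learn.predict(test_df.iloc[0])
```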

Thanks Zach. I've used this, among others you've posted, such as your tutorial notebooks.

If this is the de facto best way, then I will continue to pursue it.

I do have one more issue to tackle, however. One of my general concerns, which I believe is often overlooked, has to do with normalizing data.

In the past, when I did machine learning in MATLAB, one issue I had to overcome was that normalizing data across a large training set can have a profoundly different effect than normalizing over the small sets you run predictions on in production. I got around this by bringing all of my data in together with the set I wanted to predict on, normalizing, and then pulling my prediction data back out. I felt comfortable doing that because I wrote everything myself and had multiple verification steps to ensure the data was not shuffled around. Have you ever addressed this issue in fastai? My prediction sets in production will be on the order of 3-8 rows of data, as opposed to the thousands I am training on.

I actually have 🙂 Check out my Walk with fastai article on it. Essentially, we calculate the statistics over the full dataset just once (you can use some fancy math, reading various partial dataframes in and out, to help calculate them), then load in special versions of Normalize, Categorify, and FillMissing designed for this and feed in the partials: https://walkwithfastai.com/tab.stats
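
Something along these lines (an untested paraphrase of the idea, not the article's exact code; train_means/train_stds are hypothetical dicts of column name -> statistic, computed once over the full data):

```python
from fastai.tabular.all import *
import pandas as pd

class NormalizeFromStats(TabularProc):
    "Normalize cont columns with externally supplied per-column means/stds"
    def __init__(self, means, stds):
        self.means, self.stds = means, stds
    def setups(self, to): pass  # nothing to fit; the stats were supplied up front
    def encodes(self, to):
        to.conts = (to.conts - pd.Series(self.means)) / pd.Series(self.stds)
        return to

# used in place of the stock Normalize:
# procs = [Categorify, FillMissing, NormalizeFromStats(train_means, train_stds)]
```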

From there, when you export, the normalization statistics will already be stored, so learn.predict and test_dl will pick up the stats you trained with just as they normally would.
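
i.e. the usual round trip (the filename and new_df are illustrative):

```python
learn.export('model.pkl')              # the fitted procs (including stats) are pickled too

learn_inf = load_learner('model.pkl')
dl = learn_inf.dls.test_dl(new_df)     # re-applies the training-time procs
preds, _ = learn_inf.get_preds(dl=dl)
```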

Eventually this will (hopefully) be added into the library once Jeremy focuses more on the tabular portions.

Huge thanks, Zach. I really want to thank you not only for the help you've given me, but for the amount of help and content you have provided to this whole community. I have found answers to many of my other questions in threads you've contributed to.
