Proc_df alternative (again)

The proc_df function from fastai.structured is not available in fastai v1; instead you are supposed to pass procs such as ‘Categorify’ to the data block API. This works well for the standard models (such as the tabular learner).

But how do you use the fastai preprocessing pipeline for models that do not use databunches (like a random forest from sklearn)? Or what do you do if you want to compare a deep learning model to a classical model on the same data?

fastai v2 is much better suited to this with the TabularPandas class. I have an example of using it with fastai, RF, and XGBoost here:

Thanks, that is exactly what I was looking for! Seems like I have to switch over to fastai2. Or is there an established solution for V1?

I’d switch to v2; I don’t see a reason not to. In v1 it’d probably be a bit more convoluted to get what you want from it.

@muellerzr, how would I use this for new data? Let’s say the model is deployed and I want to process new data. With sklearn I could create a pipeline and route any new data through it, so that it is transformed by objects that were fitted on the training set. Is there a way to do the same with the fastai procs (i.e. Categorify, FillMissing, and Normalize)?
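For reference, the sklearn workflow described in this question might look roughly like the sketch below. The toy data and column names are made up, and the choice of OrdinalEncoder, median imputation and StandardScaler is only an assumed rough counterpart to Categorify, FillMissing and Normalize, not an exact equivalent.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical training data and new data arriving after deployment.
train = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                      "size": [1.0, np.nan, 3.0, 5.0]})
new = pd.DataFrame({"color": ["green", "purple"],
                    "size": [np.nan, 3.0]})

preprocess = ColumnTransformer([
    # Categories are learned from the training set; unseen ones become -1.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), ["color"]),
    # Median and mean/std are learned from the training set as well.
    ("cont", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", StandardScaler())]), ["size"]),
])

preprocess.fit(train)               # statistics come from training data only
new_X = preprocess.transform(new)   # new data reuses the fitted statistics
print(new_X)
```

The point is that fit sees only the training set, while transform reuses the stored categories, median, mean and standard deviation on anything that arrives later.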

Have you looked through the documentation? There’s a nice example using test_dl in the tabular tutorial, and it works on models exported with learn.export() as well: https://docs.fast.ai/tutorial.tabular.html

I was thinking of the scenario where the model is XGBoost.

This would be what you want then: https://walkwithfastai.com/tab.export

Just run !pip install wwf and then from wwf.tab.export import *

Followed by to.export()

Thanks for your help. I will try it.