Proc_df alternative (again)

CommanderCool · March 31, 2020, 3:03pm

The proc_df function from fastai.structured is not available in fastai V1; you are supposed to pass ‘Categorify’ etc. as procs to the data block API. This works great for the standard models (like with the tabular learner).

But how do you use the fastai preprocessing pipeline for models that do not use databunches (like a random forest from sklearn)? Or what do you do if you want to compare a deep learning model to a classical model on the same data?

muellerzr · March 31, 2020, 3:04pm

Fastai v2 is much better geared towards this idea with the TabularPandas module. I have an example of using it with fastai, RF, and XGBoost here:

github.com

muellerzr/Practical-Deep-Learning-for-Coders-2.0/blob/master/Tabular Notebooks/02_Ensembling.ipynb

CommanderCool · March 31, 2020, 3:11pm

Thanks, that is exactly what I was looking for! Seems like I have to switch over to fastai2. Or is there an established solution for V1?

muellerzr · March 31, 2020, 3:15pm

I’d switch to v2; I don’t see a reason not to. In v1 it’d probably be a bit more convoluted to get what you want out of it.

elie · January 5, 2021, 3:03pm

@muellerzr, how would I use this for new data? Let’s say the model is deployed and I want to process new data. With sklearn I could create a pipeline and send any new data through it, so that it is transformed by objects that were fitted on the training set. Is there a way to do the same with fastai procs (i.e., Categorify, FillMissing, and Normalize)?
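
To make the sklearn pattern described here concrete, a minimal sketch could look like the following. The column names and values are invented for illustration, and the particular transformers are only rough stand-ins for Categorify, FillMissing and Normalize, not exact equivalents.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical training data and new data arriving after deployment.
train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, np.nan, 3.0, 5.0],
})
new = pd.DataFrame({
    "color": ["blue", "purple"],  # "purple" was never seen during training
    "size": [np.nan, 3.0],
})

# Continuous columns: fill missing values with the training median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: integer codes, with unseen categories mapped to -1.
categorical = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

preprocess = ColumnTransformer([
    ("cat", categorical, ["color"]),
    ("num", numeric, ["size"]),
])

# All statistics (categories, median, mean, std) come from the training set only.
preprocess.fit(train)

# New data reuses the fitted statistics; nothing is re-estimated here.
X_new = preprocess.transform(new)
```

The point of the pattern is that `fit` is called once on the training data, and every later batch only goes through `transform`.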

muellerzr · January 5, 2021, 3:11pm

Have you looked through the documentation? There’s a nice example with test_dl in there, and it works on models exported with learn.export() as well: https://docs.fast.ai/tutorial.tabular.html

elie · January 5, 2021, 4:04pm

I was thinking of the scenario where the model is XGBoost.

muellerzr · January 5, 2021, 4:12pm

This would be what you want then: https://walkwithfastai.com/tab.export

Just run !pip install wwf and then from wwf.tab.export import *, followed by to.export()

elie · January 6, 2021, 4:45am

Thanks for your help. I will try it.