Tabular Data, how to get mean and std from TabularDataBunch object?

Hello to all.
I’m trying to create a function that makes predictions on a tabular model for a batch of data (the existing .predict() method can process only one item at a time, which can be pretty slow). I managed to use the existing model (learner.model) as a function to make predictions (here I assume that it can predict directly on a batch, even though for now I have only tested it with a batch of one sample), but for that I have to process all categorical and continuous columns with the parameters (categories and their codes, mean and standard deviation) used at databunch creation. I was able to extract all the categories from the databunch object, but I couldn’t find any way to do the same for the mean and standard deviation of the continuous columns. They must be stored somewhere in that object (otherwise it wouldn’t be able to predict on one item or run one_batch()). So for now my plan is to process the original dataframe’s continuous columns and calculate the mean and std from scratch. That should work, but it’s not the most elegant or universal way.
So :slight_smile: Is it possible to extract the mean and std of the continuous columns directly from the databunch object?
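For reference, the fallback plan described above (computing the statistics from scratch with plain pandas) might look like this sketch; the dataframe, column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical dataframe standing in for the original training data
df = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0],
    "hours_per_week": [40.0, 38.0, 45.0, 50.0],
})
cont_names = ["age", "hours_per_week"]

# Means and stds computed from scratch (pandas uses ddof=1 by default,
# which I believe matches what fastai v1's Normalize stores)
means = df[cont_names].mean()
stds = df[cont_names].std()

# A new value would then be normalized as (value - mean) / std
normalized = (df[cont_names] - means) / stds
```

This reproduces the normalization, but as the post says, it only works if you still have the original training dataframe at hand.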

Have you tried Learner.pred_batch?

Thank you.
No, I will try to dig into it when I’m back home, but I assume that it takes a DataBunch object as an argument, not the actual source data. And my trouble is converting new data into its processed form (categories applied and continuous columns normalized) with the parameters used at train databunch creation.

In case someone finds it useful…
Well, you can’t…
At least not from the DataBunch object, but there is some good news too :slight_smile:
After days of diving into the code (as well as, apparently, into my own inability to understand it :slight_smile: ) I have found what I was looking for.
You can get the mean and std the following way:

  1. I assume that you take the standard approach to the tabular learning process. In particular, you use
    procs = [FillMissing, Categorify, Normalize]
    for processing the dataframe.
  2. You should split your DataBunch creation process into two phases. Instead of making the DataBunch in one go
    data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
            .split_by_idx(list(range(800,1000)))
            .label_from_df(cols=dep_var)
            .add_test(test)
            .databunch())
    (as in lesson4-tabular.ipynb), you should get a LabelList (data_prep in my example) first, as you can get the mean and std only from this type of object:
    data_prep = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                 .split_by_idx(list(range(800,1000)))
                 .label_from_df(cols=dep_var)
                 .add_test(test))
    and only then convert it into a DataBunch:
    data = data_prep.databunch()
    After that you can easily grab the Normalize object, which contains all the means and stds used for normalization:
    proc = data_prep.get_processors()[0][0]
    norm = proc.procs[2]  # Normalize is the third proc in the procs list
    print(norm.means)
    print(norm.stds)
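To connect this back to the original goal (batch prediction on new data), the extracted statistics can be applied to a new dataframe the same way the Normalize proc applies them. Here is a minimal pandas emulation, with made-up numbers standing in for norm.means / norm.stds:

```python
import pandas as pd

# Stand-ins for the values read off norm.means / norm.stds
# (these numbers are made up for the example)
means = {"age": 38.75, "hours_per_week": 43.25}
stds = {"age": 12.28, "hours_per_week": 5.38}

# New, unseen data to prepare for a batch prediction
new_df = pd.DataFrame({
    "age": [29.0, 44.0],
    "hours_per_week": [40.0, 55.0],
})

# Apply the training-time statistics, mirroring what Normalize does
for col in means:
    new_df[col] = (new_df[col] - means[col]) / stds[col]
```

The normalized columns (together with the categorified categorical columns) can then be fed to learner.model as a batch.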

Hi Pavel,
Great work digging through this tabular stuff. Did you ever figure out how we can decouple the processors from the data loader? When I apply this algorithm to new data, I want to reuse the processors (which were also fitted during the training phase).

Hi, it depends on what your end goal is. If you just want to apply the processors (Categorify, Normalize and FillMissing) to new data, you can do that. The first post in this topic should help you: Some useful functions for tabular models
Feel free to ask if something is confusing there (there is also a working example on the Rossmann data).
As for the first post in the link I gave, in the function get_model_real_input, I think you can export (pickle) the object named proc = data_prep.get_processors()[0][0], which contains all the processors.
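Exporting the fitted statistics for reuse could look like this sketch with plain pickle. A simple dict stands in for the real proc object here, since the mechanism is the same:

```python
import io
import pickle

# In the real workflow this would be the fitted processor object, e.g.
# proc = data_prep.get_processors()[0][0]; a plain dict stands in here.
proc_stats = {
    "means": {"age": 38.75},
    "stds": {"age": 12.28},
}

# An in-memory buffer keeps the example self-contained;
# open("proc.pkl", "wb") works the same way on disk.
buf = io.BytesIO()
pickle.dump(proc_stats, buf)

# Later, at inference time, load it back and reuse the statistics
buf.seek(0)
restored = pickle.load(buf)
```

Whether the actual fastai processor object pickles cleanly is worth verifying; if it doesn’t, pickling just the extracted means/stds (and the category codes) as a plain dict is a safe alternative.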

Hi, what is the tabular model actually predicting? And what are we passing to learn.predict()?
@Pak

Hi, sorry, maybe I did not quite catch your question.
We pass a row of new data (with all the fields/features used in the train set) to learn.predict() and get the model’s prediction from the learner, as in the docs here.
(The type of prediction (float, class, etc.) depends on the type of model you set up when building the learner.)

Oh ok. Suppose I have separate test data for a person; how do I input that data to predict his salary?

You should, for example, put his data in a pandas dataframe with categorical and continuous fields named just like in your train data, then take a row from this dataframe (like row = new_df.iloc[0]) and run predict() like learn.predict(row). You will get the probability of his salary class, or the salary itself, depending on how you constructed your learner.
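A minimal sketch of that flow; the column names and values are made up, and the fastai call is shown commented out since it needs a trained learner:

```python
import pandas as pd

# One person's data, with the same categorical and continuous column
# names as in the training set (names and values here are hypothetical)
new_df = pd.DataFrame({
    "workclass": ["Private"],
    "education": ["Bachelors"],
    "age": [29.0],
    "hours_per_week": [40.0],
})

# Take a single row, exactly as described above
row = new_df.iloc[0]

# With a trained fastai tabular learner you would then run:
# pred_class, pred_idx, probs = learn.predict(row)
```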

Understood, thank you