Q: Can I adjust categorical order in columns of TabularDataBunch?

cwerner · October 19, 2018, 7:42pm

Apologies if this is the wrong place to ask this.

I’m currently trying to replicate the ML1 course with the v1 API to familiarize myself with the new version. Lesson 1 demonstrates how to change the order of the categories in a categorical variable:

df_raw['UsageBand'] = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
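
To make the effect of that call concrete, here is a minimal sketch on a toy frame (the data is made up for illustration; only the UsageBand column name comes from the post). It compares the integer codes pandas assigns under the default alphabetical order with those under the custom order. Recent pandas releases no longer accept inplace=True for set_categories, so the sketch assigns the result back to the column.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'UsageBand': ['Low', 'High', 'Medium', 'High']})

# Default conversion: categories are sorted alphabetically
df['UsageBand'] = df['UsageBand'].astype('category')
default_order = list(df['UsageBand'].cat.categories)
default_codes = df['UsageBand'].cat.codes.tolist()

# Custom order: the position in the list determines the integer code
df['UsageBand'] = df['UsageBand'].cat.set_categories(
    ['High', 'Medium', 'Low'], ordered=True)
custom_order = list(df['UsageBand'].cat.categories)
custom_codes = df['UsageBand'].cat.codes.tolist()

print(default_order, default_codes)
print(custom_order, custom_codes)
```

The values stay the same; only the mapping from label to integer code changes, which is what matters for a model that consumes the codes.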

I’m not 100% sure, but I don’t see a way to modify individual columns in the new TabularDataBunch object. Do I have to do this manually beforehand?

At the moment I simply identify all non-numeric columns and pass them to the TabularDataBunch as the categorical variables:

cat_names = list(df_raw.select_dtypes(exclude=np.number))

Am I missing something?
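
As a quick check of the column selection above, the following sketch runs the same select_dtypes call on a small made-up frame (SalePrice, YearMade and saledate are illustrative column names, not taken from the post). One thing it shows is that "non-numeric" covers more than strings: datetime columns (and boolean ones) are also picked up, so they may need handling before being treated as categories.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    'SalePrice': [66000.0, 57000.0, 10000.0],
    'YearMade': [2004, 1996, 2001],
    'UsageBand': ['Low', 'High', 'Medium'],
    'saledate': pd.to_datetime(['2006-11-16', '2004-03-26', '2004-02-26']),
})

# Everything that is not a numeric dtype is treated as categorical
cat_names = list(df.select_dtypes(exclude=np.number))
# The complement gives the continuous columns
cont_names = list(df.select_dtypes(include=np.number))

print(cat_names)
print(cont_names)
```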

sgugger · October 19, 2018, 8:15pm

You’d have to modify the Categorify transform for this; it is the one that assigns category codes to the variables.

cwerner · October 19, 2018, 9:24pm

Thanks for the pointer @sgugger

I came up with this, but it feels clunky. How would I pass an optional cat_name dictionary to my Transform class?

import pandas as pd
from pandas import DataFrame
from fastai.tabular import *  # assumed to export TabularTransform in fastai v1

class CustomOrder:
    "Information for optional custom categorical ordering"
    cat_order={'UsageBand': ['High', 'Medium', 'Low']}
   
class CategorifyWithCustomOrder(TabularTransform, CustomOrder):
    "Transform the categorical variables to that type."

    def apply_train(self, df:DataFrame):
        self.categories = {}
        for n in self.cat_names:
            df[n] = df[n].astype('category').cat.as_ordered()
            if n in CustomOrder.cat_order:
                df[n] = df[n].cat.set_categories(CustomOrder.cat_order[n], ordered=True)
            self.categories[n] = df[n].cat.categories

    def apply_test(self, df:DataFrame):
        for n in self.cat_names:
            df[n] = pd.Categorical(df[n], categories=self.categories[n], ordered=True)
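
One possible way to pass the dictionary without the mixin class is to take it as an optional constructor argument and bind it with functools.partial. The sketch below is only an illustration under stated assumptions: OrderedCategorify is a hypothetical stand-alone class, not the fastai one, and it assumes the transform gets instantiated with the categorical and continuous column names as its first two arguments. Whether TabularDataBunch accepts a partial in place of the transform class is not confirmed here.

```python
from functools import partial

import pandas as pd


class OrderedCategorify:
    "Hypothetical stand-in for a Categorify-like transform with optional custom ordering."

    def __init__(self, cat_names, cont_names, cat_order=None):
        self.cat_names, self.cont_names = cat_names, cont_names
        # Maps column name -> desired category order; empty means default order
        self.cat_order = cat_order or {}
        self.categories = {}

    def apply_train(self, df):
        for n in self.cat_names:
            col = df[n].astype('category').cat.as_ordered()
            if n in self.cat_order:
                col = col.cat.set_categories(self.cat_order[n], ordered=True)
            df[n] = col
            # Remember the categories so the test set gets the same codes
            self.categories[n] = col.cat.categories

    def apply_test(self, df):
        for n in self.cat_names:
            df[n] = pd.Categorical(df[n], categories=self.categories[n], ordered=True)


# Bind the dictionary up front; the result is still called with (cat_names, cont_names)
make_tfm = partial(OrderedCategorify,
                   cat_order={'UsageBand': ['High', 'Medium', 'Low']})

train = pd.DataFrame({'UsageBand': ['Low', 'High', 'Medium', 'Low']})
test = pd.DataFrame({'UsageBand': ['Medium', 'High']})

tfm = make_tfm(['UsageBand'], [])
tfm.apply_train(train)
tfm.apply_test(test)

train_codes = train['UsageBand'].cat.codes.tolist()
test_codes = test['UsageBand'].cat.codes.tolist()
print(train_codes, test_codes)
```

Keeping the order in an instance attribute rather than a class attribute means two transforms with different orderings can coexist, and columns missing from the dictionary fall back to the default order.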