How to normalize tabular pandas columns separately?

shado · October 21, 2023, 11:12pm

I have a dataset with some columns that require normalization across specific groups, rather than across the entire training set. The rest of the columns don’t have that requirement.

I can do this manually on those columns before creating the TabularPandas object, but then passing Normalize as a proc means they get normalized twice: once by my grouped normalization and once by the global normalization.

This is not ideal because the second, global normalization ignores the grouping.

The only ways I can think of to get around this are to tell TabularPandas to normalize specific columns only (i.e. ignore the ones I have already normalized), or to create two TabularPandas objects, one using the Normalize proc and one without it, and then join the two objects together.

Does anyone know if this is possible? Or are there any other suggestions for getting around this problem?

Thanks

shado · October 28, 2023, 11:11pm

Anyone?

vbakshi · October 29, 2023, 4:28am

I haven’t come across this situation before, so I’m not really sure how you’d go about it, but in this Walk with fastai tutorial he shows how to modify tabular procs. Here is how he modifies Normalize to handle means and standard deviations differently than the built-in way. Not sure how @patch works, but I’m wondering if something like this modification would apply to your situation? I also don’t know (or see in the docs) how to combine DataLoaders or TabularPandas objects, so not sure if that route is possible.

Lastly, as an alternative solution, what if you did all of your normalization manually, and then did not pass Normalize as one of the procs when creating your TabularPandas object? My understanding is that TabularPandas doesn’t normalize the data by default.
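A minimal pandas-only sketch of that manual route, under some assumptions: the column names (`store`, `price`, `temp`), the helper names, and the choice of z-scoring with training-set statistics are all made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

def fit_group_stats(train, group_col, cols):
    """Per-group mean and std of `cols`, computed on training rows only."""
    grouped = train.groupby(group_col)[cols]
    return grouped.mean(), grouped.std(ddof=0)

def normalize_manually(df, train_idx, group_col, grouped_cols, global_cols, eps=1e-7):
    """Copy of df with grouped_cols z-scored per group and global_cols z-scored globally."""
    out = df.copy()
    train = df.loc[train_idx]

    # Grouped columns: look up each row's group statistics by its group key.
    g_mean, g_std = fit_group_stats(train, group_col, grouped_cols)
    keys = df[group_col]
    row_mean = g_mean.reindex(keys).to_numpy()
    row_std = g_std.reindex(keys).to_numpy()
    out[grouped_cols] = (df[grouped_cols].to_numpy() - row_mean) / (row_std + eps)

    # Remaining columns: ordinary z-score with training-set statistics.
    mean = train[global_cols].mean()
    std = train[global_cols].std(ddof=0)
    out[global_cols] = (df[global_cols] - mean) / (std + eps)
    return out

# Toy data: `price` is normalized within each `store`, `temp` across all rows.
df = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b", "b"],
    "price": [1.0, 2.0, 3.0, 100.0, 200.0, 300.0],
    "temp":  [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
})
train_idx = [0, 1, 3, 4]  # rows 2 and 5 play the role of a validation set
normalized = normalize_manually(df, train_idx, "store", ["price"], ["temp"])
```

The resulting frame could then be passed to TabularPandas with a procs list that leaves out Normalize, so nothing is normalized a second time. Groups that appear only outside the training rows would come out as NaN here and would need separate handling.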

pankaj_pansari · October 31, 2023, 12:44pm

Hi,

Can you get by with normalizing the data beforehand and not passing Normalize as one of the procs when creating the Tabular dataset? In this case, you would be normalizing the two sets of columns separately beforehand.

If you need to pass in Normalize as part of procs, one solution may be to write your own custom Normalize class. Something like this:

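from fastai.tabular.all import *  # brings in DisplayedTransform

class NormalizeByFactor(DisplayedTransform):
    "Sketch: z-score some columns within each group and the others globally"
    def __init__(self, norm_by_factor, norm_not_by_factor, factor_col):
        self.norm_by_factor = list(norm_by_factor)          # cols to be normalized by factor
        self.norm_not_by_factor = list(norm_not_by_factor)  # cols to be normalized ordinarily
        self.factor = factor_col  # normalize a subset of cols based on this factor
        self.mean = self.std = None              # global statistics
        self.group_mean = self.group_std = None  # per-group statistics

    def setups(self, x):
        # Obtain the statistics from the training split only.
        # Assumes x is a TabularPandas whose DataFrame is available as x.items.
        df = getattr(x, 'train', x).items
        self.mean = df[self.norm_not_by_factor].mean()
        self.std = df[self.norm_not_by_factor].std(ddof=0) + 1e-7
        grouped = df.groupby(self.factor)[self.norm_by_factor]
        self.group_mean, self.group_std = grouped.mean(), grouped.std(ddof=0) + 1e-7
        return self(x)  # apply right away, as the built-in tabular Normalize does

    def _row_stats(self, x):
        # Per-row group mean/std; groups not seen in training yield NaN.
        keys = x.items[self.factor]
        return self.group_mean.reindex(keys).to_numpy(), self.group_std.reindex(keys).to_numpy()

    def encodes(self, x):
        m, s = self._row_stats(x)
        # Normalize based on self.factor
        x.items[self.norm_by_factor] = (x.items[self.norm_by_factor].to_numpy() - m) / s
        # Normalize normally
        x.items[self.norm_not_by_factor] = (x.items[self.norm_not_by_factor] - self.mean) / self.std
        return x

    def decodes(self, x):
        # Analogously, in reverse: scale by std, then add the mean back
        m, s = self._row_stats(x)
        x.items[self.norm_by_factor] = x.items[self.norm_by_factor].to_numpy() * s + m
        x.items[self.norm_not_by_factor] = x.items[self.norm_not_by_factor] * self.std + self.mean
        return x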

Now you can pass NormalizeByFactor as part of procs. Hope it helps.
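The decode step mirrors the encode step: multiply by the group's standard deviation and add back the group's mean. Here is a small pandas-only sketch of that round trip, independent of fastai. The helper names `encode_by_group` and `decode_by_group` and the toy columns are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def encode_by_group(df, col, group_col, mean, std):
    """z-score `col` using each row's group statistics (Series indexed by group)."""
    keys = df[group_col]
    return (df[col] - keys.map(mean)) / keys.map(std)

def decode_by_group(df, col, group_col, mean, std):
    """Invert encode_by_group: scale by the group's std, then add back its mean."""
    keys = df[group_col]
    return df[col] * keys.map(std) + keys.map(mean)

df = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [2.0, 4.0, 30.0, 50.0],
})
# Per-group statistics; pandas uses the sample std (ddof=1) here.
stats = df.groupby("group")["value"].agg(["mean", "std"])

encoded = df.assign(value=encode_by_group(df, "value", "group", stats["mean"], stats["std"]))
decoded = df.assign(value=decode_by_group(encoded, "value", "group", stats["mean"], stats["std"]))
```

The round trip only works if the grouping column still holds the same keys at decode time, so any proc that re-encodes that column would need to be undone first.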