I have a dataset with some columns that require normalization across specific groups, rather than across the entire training set. The rest of the columns don’t have that requirement.
I can do this manually on those columns before creating the TabularPandas object, but then calling Normalize means they get normalized twice: once in my grouped normalization and once in the global normalization.
This is not ideal because the second normalization ignores the grouping.
The only ways I can think of to get around this are to tell TabularPandas to normalize specific columns only (i.e. ignore the ones I have already normalized), or to create two TabularPandas objects, one using the Normalize proc and one without it, and then join the two objects together.
Does anyone know if this is possible? Or are there any other suggestions for how to get around this problem?
I haven’t come across this situation before, so I’m not really sure how you’d go about it, but in this Walk with fastai tutorial he shows how to modify tabular procs. Here is how he modifies Normalize to handle means and standard deviations differently than the built-in way. Not sure how @patch works, but I’m wondering if something like this modification would apply to your situation? I also don’t know (or see in the docs) how to combine TabularPandas objects, so not sure if that route is possible.
Lastly, as an alternative solution, what if you did all of your normalization manually, and then did not pass Normalize as one of the procs when creating your TabularPandas object? My understanding is that TabularPandas doesn’t normalize the data by default.
Can you get by with normalizing the data beforehand and not passing Normalize as one of the procs when creating the Tabular dataset? In this case, you would be normalizing the two sets of columns separately beforehand.
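For the manual route, a minimal pandas sketch could look like the one below. The helper name normalize_grouped and the columns store, sales and temp are made up for illustration. It also computes the statistics over every row it is given, which is an assumption; if you have a validation split you would probably want to take the means and standard deviations from the training rows only.

```python
import pandas as pd

def normalize_grouped(df, group_col, grouped_cols, global_cols):
    """Return a copy of df with grouped_cols z-scored within each group
    of group_col and global_cols z-scored over all rows.
    Hypothetical helper, not part of fastai or pandas."""
    out = df.copy()
    grouped = out.groupby(group_col)[grouped_cols]
    # transform() broadcasts each group's statistic back onto the original rows
    out[grouped_cols] = (out[grouped_cols] - grouped.transform("mean")) / grouped.transform("std")
    # Ordinary column-wise normalization for the remaining columns
    out[global_cols] = (out[global_cols] - out[global_cols].mean()) / out[global_cols].std()
    return out

# Toy data: sales is normalized per store, temp across the whole frame
df = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b", "b"],
    "sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "temp":  [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
})
normed = normalize_grouped(df, "store", ["sales"], ["temp"])
print(normed)
```

The resulting frame can then go into TabularPandas with Normalize left out of procs, so nothing gets normalized a second time.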
If you need to pass in Normalize as part of procs, one solution may be to write your own custom Normalize class. Something like this:
class NormalizeByFactor:  # skeleton only; a real fastai proc would likely subclass TabularProc
    def __init__(self, norm_by_factor, norm_not_by_factor, factor_col):
        self.norm_by_factor = norm_by_factor          # cols to be normalized by factor
        self.norm_not_by_factor = norm_not_by_factor  # cols to be normalized ordinarily
        self.factor = factor_col                      # normalize a subset of cols based on this factor
        self.mean = None
        self.std = None
    def setup(self, x):
        # Obtain self.mean, self.std from x
        pass
    def encodes(self, x):
        for col in x.columns:
            if col in self.norm_by_factor:
                pass  # Normalize col based on self.factor
            elif col in self.norm_not_by_factor:
                pass  # Normalize normally
        return x
    def decodes(self, x):
        # Reverse both normalizations
        return x
Now you can pass NormalizeByFactor as part of procs. Hope it helps.
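To make the skeleton above a bit more concrete, here is a plain-pandas sketch of the same idea. It is deliberately not wired into fastai: the class name GroupedNormalizer, its methods and the toy columns are hypothetical, and how to hook it into TabularPandas as a proc is left open. It assumes every group seen at encode time was present in the training data and has at least two rows, otherwise the results are NaN.

```python
import pandas as pd

class GroupedNormalizer:
    """Hypothetical standalone sketch, not a fastai TabularProc."""
    def __init__(self, norm_by_factor, norm_not_by_factor, factor_col):
        self.norm_by_factor = list(norm_by_factor)          # cols normalized per group
        self.norm_not_by_factor = list(norm_not_by_factor)  # cols normalized globally
        self.factor = factor_col                            # grouping column
        self.group_mean = self.group_std = None
        self.mean = self.std = None

    def setup(self, train_df):
        # Per-group statistics: one row per group, one column per grouped col
        grouped = train_df.groupby(self.factor)[self.norm_by_factor]
        self.group_mean, self.group_std = grouped.mean(), grouped.std()
        # Global statistics for the remaining columns
        self.mean = train_df[self.norm_not_by_factor].mean()
        self.std = train_df[self.norm_not_by_factor].std()
        return self

    def _group_stats(self, df):
        # Look up each row's group statistics and align them to df's index
        keys = df[self.factor].to_numpy()
        mean, std = self.group_mean.reindex(keys), self.group_std.reindex(keys)
        mean.index, std.index = df.index, df.index
        return mean, std

    def encodes(self, df):
        out = df.copy()
        mean, std = self._group_stats(out)
        out[self.norm_by_factor] = (out[self.norm_by_factor] - mean) / std
        out[self.norm_not_by_factor] = (out[self.norm_not_by_factor] - self.mean) / self.std
        return out

    def decodes(self, df):
        # Reverse encodes using the stored statistics
        out = df.copy()
        mean, std = self._group_stats(out)
        out[self.norm_by_factor] = out[self.norm_by_factor] * std + mean
        out[self.norm_not_by_factor] = out[self.norm_not_by_factor] * self.std + self.mean
        return out

train = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b", "b"],
    "sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "temp":  [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
})
valid = pd.DataFrame({"store": ["b"], "sales": [40.0], "temp": [15.0]})

norm = GroupedNormalizer(["sales"], ["temp"], "store").setup(train)
enc = norm.encodes(train)
valid_enc = norm.encodes(valid)  # uses the training statistics for store "b"
dec = norm.decodes(enc)          # round trip back to the original values
```

Storing the per-group statistics at setup time is what lets the same object normalize a validation set consistently and undo the normalization later.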