Tabular data, normalising vs standardising

i noticed in lesson 4 when jeremy described normalising in tabular he was actually talking about standardising, and a quick look at the code shows it is standardising, and normalising isn’t an option.

obviously in puny human ml libraries like skl if you’ve got highly skewed data then you’re normally better off min/max scaling instead of standardising.
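For anyone following along, the two rescalings being compared can be sketched in plain NumPy. This is only an illustration on made-up log-normal data; the helper names `standardise` and `min_max` are hypothetical and none of this is fastai code.

```python
import numpy as np

def standardise(x):
    # Subtract the mean and divide by the standard deviation
    return (x - x.mean()) / x.std()

def min_max(x):
    # Rescale linearly so the smallest value maps to 0 and the largest to 1
    return (x - x.min()) / (x.max() - x.min())

# Synthetic right-skewed data (assumption: log-normal stands in for "highly skewed")
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)

z = standardise(skewed)
m = min_max(skewed)

# Print a few summaries to compare how each method treats the long tail
print("standardised: median", np.median(z), "max", z.max())
print("min/max:      median", np.median(m), "max", m.max())
```

Neither method changes the shape of the distribution; both are linear rescalings, so the skew itself is still there afterwards.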

is fastai just better at dealing with this so i don’t have to worry about it, or should i manually scale anything too skewed before i hand it to tabular?

I haven’t noticed an issue myself. However, if you’d like to run some experiments on v2, you’re welcome to put in a PR adding it as a separate normalization method if it works :slight_smile:

See here for the relevant function in v2:


it was more of a theoretical question really, so far i’ve not really dealt with data that bad, and when i have i’ve often dealt with it by binning it anyway.

i had a look at the v2 code and got another headache. my python is not yet ninja enough to work out why they’re decorating functions which do normalisation with a class which does normalisation.

i’ve penciled myself in for a proper session on what you can do with decorators and decided to not worry about std vs norm until i come across some badly skewed data. then i’ll manually test it both ways and see if tabular cares either way.

That is the v2 codebase entirely, I’d recommend Jeremy’s v2 walkthrough videos (not the course) to understand what all that means. It stems from the fastcore library.
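To give a rough idea of how a class can be used as a decorator on functions, here is a toy sketch in plain Python. It is not fastcore's actual implementation: the metaclass `_RegisterMeta` and the class `MinMax` are hypothetical names, and as far as I understand it the real library also dispatches on the type annotation, which this sketch leaves out.

```python
class _RegisterMeta(type):
    """Toy metaclass: calling the class on a specially named function registers it."""
    def __call__(cls, *args, **kwargs):
        f = args[0] if args else None
        name = getattr(f, "__name__", None)
        if callable(f) and name in ("setups", "encodes", "decodes"):
            # Used as a decorator: attach the function to the class as a method
            setattr(cls, name, f)
            return f
        # Otherwise behave like a normal class and build an instance
        return super().__call__(*args, **kwargs)

class MinMax(metaclass=_RegisterMeta):
    pass

@MinMax
def setups(self, xs):
    # Remember the range of the data
    self.lo, self.hi = min(xs), max(xs)

@MinMax
def encodes(self, xs):
    return [(x - self.lo) / (self.hi - self.lo) for x in xs]

@MinMax
def decodes(self, xs):
    return [x * (self.hi - self.lo) + self.lo for x in xs]

tfm = MinMax()                      # no function passed, so this is a real instance
tfm.setups([2.0, 4.0, 10.0])
scaled = tfm.encodes([2.0, 4.0, 10.0])
print(scaled, tfm.decodes(scaled))
```

So the decorator does not wrap the function at all; it just adds it to the class as a method, which is why the decorated functions take `self`.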

(also NormalizeTab actually is the wrong one, sorry, it’s the @Normalize functions)

So for instance, for you it would be something like:

def setups(self, to:Tabular):
    # Take the statistics from the training split when one exists
    conts = getattr(to, 'train', to).conts
    self.mins,self.maxs = conts.min(),conts.max()
    return self(to)

def encodes(self, to:Tabular):
    to.conts = (to.conts-self.mins) / (self.maxs - self.mins)
    return to

# Undo min/max scaling: x = x_scaled * (max - min) + min
def decodes(self, to:Tabular):
    to.conts = to.conts * (self.maxs - self.mins) + self.mins
    return to
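A quick way to check the min/max maths outside fastai is a round trip on a small pandas DataFrame. This is a sketch with made-up column names and values, using only standard pandas calls; it mirrors the `encodes`/`decodes` arithmetic rather than the Tabular object itself.

```python
import pandas as pd

# Made-up continuous columns standing in for to.conts (assumption)
train = pd.DataFrame({"age": [20.0, 30.0, 60.0],
                      "income": [10.0, 50.0, 90.0]})

# Per-column statistics, as setups would compute them
mins, maxs = train.min(), train.max()

# Forward: scale each column into [0, 1]
scaled = (train - mins) / (maxs - mins)

# Backward: multiply by the range, then add the minimum back
restored = scaled * (maxs - mins) + mins

print(scaled)
print(restored)
```

One thing to watch for: a column where every value is the same has `maxs - mins` equal to zero, so the division produces NaN unless that case is handled separately.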

ok, that confuses me less. my assumption at this point is still that it probably doesn’t matter as much to fastai (rather like unbalanced datasets) but i’ll prod it at some point and see whether it helps overall.

thanks. :+1:


I may look as well, when I have some time. If you feel like getting into v2 tabular, I have a series of notebooks available here: (there’s lectures too, look up Walk with fastai2 Tabular on the forums)

i was under the impression fastai2 is under dev & invite only. any idea how far from “finished” it is?

While it is under dev, the initial version is almost finished. What’s invite-only is course-v4. Just to show how ready it is, I did my own course/walkthrough back in January (the walk with fastai2 lectures, pinned on #fastai-users:fastai-v2 ), and the actual code changes since then have been relatively minimal. (and I’ve used it myself since about October)

TL;DR: feature complete, minimal bugs.
