Tabular data, normalising vs standardising

I noticed in lesson 4 that when Jeremy described normalising in tabular, he was actually talking about standardising. A quick look at the code confirms it standardises, and normalising isn't an option.

Obviously, in puny human ML libraries like scikit-learn, if you've got highly skewed data you're normally better off min/max scaling instead of standardising.
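
E.g., a rough sketch of what I mean (illustrative only):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

x = np.array([[1.], [2.], [3.], [200.]])          # one badly skewed column

standardised = StandardScaler().fit_transform(x)  # z = (x - mean) / std
normalised = MinMaxScaler().fit_transform(x)      # x' = (x - min) / (max - min), into [0, 1]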

Is fastai just better at dealing with this so I don't have to worry about it, or should I manually scale anything too skewed before handing it to tabular?

I haven't noticed an issue myself. However, if you'd like to run some experiments on v2, and if it works, put in a PR as a separate normalization method, you're welcome to 🙂

See here for the relevant function in v2: https://github.com/fastai/fastai2/blob/master/fastai2/tabular/core.py#L261
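
From memory, the setups/encodes there are roughly this shape (standardising each continuous column; check the link for the exact lines):

@Normalize
def setups(self, to:Tabular):
    # per-column mean/std, computed from the training split if there is one
    self.means,self.stds = getattr(to, 'train', to).conts.mean(),getattr(to, 'train', to).conts.std()
    return self(to)

@Normalize
def encodes(self, to:Tabular):
    # standardise: zero mean, unit variance
    to.conts = (to.conts-self.means) / self.stds
    return to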

It was more of a theoretical question really; so far I've not dealt with data that bad, and when I have, I've often handled it by binning anyway.
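
By binning I mean something like this with pandas (purely illustrative):

import pandas as pd

df = pd.DataFrame({'income': [20_000, 25_000, 30_000, 45_000, 2_000_000]})
# quantile binning turns a badly skewed column into ordinal categories
df['income_bin'] = pd.qcut(df['income'], q=4, labels=False, duplicates='drop')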

I had a look at the v2 code and got another headache. My Python is not yet ninja enough to work out why they're decorating functions which do normalisation with a class which does normalisation.

I've pencilled myself in for a proper session on what you can do with decorators, and decided not to worry about std vs norm until I come across some badly skewed data. Then I'll manually test it both ways and see if tabular cares either way.

That's the v2 codebase for you. I'd recommend Jeremy's v2 walkthrough videos (not the course) to understand what all of that means; the pattern stems from the fastcore library.
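
Very roughly, the trick is that calling the Transform class on a bare function registers that function as a method. Here's a simplified imitation of the idea (not fastcore's actual code, which also dispatches on type annotations):

class _TfmMeta(type):
    def __call__(cls, *args, **kwargs):
        # Using the class as a decorator (@Normalize on a function) attaches
        # that function to the class instead of building an instance
        if len(args) == 1 and callable(args[0]) and not kwargs:
            setattr(cls, args[0].__name__, args[0])
            return args[0]
        return super().__call__(*args, **kwargs)

class Normalize(metaclass=_TfmMeta):
    pass

@Normalize                     # adds `encodes` as a method on Normalize
def encodes(self, x):
    return (x - self.mean) / self.std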

(Also, NormalizeTab is actually the wrong one, sorry; it's the @Normalize functions.)

So for instance, for you it would be something like:

@Normalize
def setups(self, to:Tabular):
    # store per-column min/max, computed from the training split if there is one
    self.mins,self.maxs = getattr(to, 'train', to).conts.min(),getattr(to, 'train', to).conts.max()
    return self(to)

@Normalize
def encodes(self, to:Tabular):
    # min/max scale each continuous column into [0, 1]
    to.conts = (to.conts-self.mins) / (self.maxs - self.mins)
    return to

@Normalize
def decodes(self, to:Tabular):
    # invert the scaling: x = x' * (max - min) + min
    to.conts = to.conts * (self.maxs - self.mins) + self.mins
    return to
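
And a quick round-trip check of that decode maths, with a plain DataFrame standing in for to.conts (untested against an actual Tabular object):

import pandas as pd

df = pd.DataFrame({'a': [1., 5., 9.], 'b': [10., 20., 40.]})
mins, maxs = df.min(), df.max()

enc = (df - mins) / (maxs - mins)            # encode into [0, 1]
dec = enc * (maxs - mins) + mins             # decode back
assert (dec - df).abs().max().max() < 1e-9   # recovers the original values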

OK, that confuses me less. My assumption at this point is still that it probably doesn't matter as much to fastai (rather like unbalanced datasets), but I'll prod it at some point and see whether it helps overall.

Thanks. 👍

I may have a look as well when I have some time. If you feel like getting into v2 tabular, I have a series of notebooks available here: https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0/tree/master/Tabular%20Notebooks (there are lectures too; look up Walk with fastai2 Tabular on the forums).

I was under the impression fastai2 is under development and invite-only. Any idea how far from "finished" it is?

While it is under development, the initial version is almost finished; it's course-v4 that is invite-only. Just to show how ready it is, I did my own course/walkthrough back in January (the Walk with fastai2 lectures, pinned on #fastai-users:fastai-v2), and the actual code changes since then have been relatively minimal. (And I've used it myself since about October.)

TL;DR: feature-complete, minimal bugs.
