Ordinality of 'ProductSize' in Chapter 9 (Tabular)

Mark_F · October 7, 2020, 9:00pm

Using the category data type (with the code provided in the chapter), I get the following ordinality when I check the data:

I believe this order is backwards: “Compact” is largest and “Large” is smallest. Perhaps this doesn’t matter, because the order is the same but reversed, so the NN will learn correctly anyway. But am I missing something? Is there a reason to order the data like this?

P.S. There is a single line of code you can use in pandas to do the same thing (after importing CategoricalDtype from pandas.api.types):

df['ProductSize'] = df['ProductSize'].astype(CategoricalDtype(sizes, ordered=True))
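
For anyone who wants to try that one-liner in isolation, here is a minimal sketch. The `sizes` list and the rows in the toy dataframe are assumptions made up for illustration (the real `sizes` is defined earlier in the chapter notebook); only the `astype(CategoricalDtype(...))` call comes from the post above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed category order and made-up rows, for illustration only.
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
df = pd.DataFrame({'ProductSize': ['Small', 'Large', 'Compact', None, 'Medium']})

# The one-liner from the post: convert the column to an ordered categorical.
df['ProductSize'] = df['ProductSize'].astype(CategoricalDtype(sizes, ordered=True))

# Each category gets an integer code equal to its position in `sizes`;
# missing values get the code -1.
size_categories = list(df['ProductSize'].cat.categories)
size_codes = df['ProductSize'].cat.codes.tolist()
print(size_categories)
print(size_codes)
```

Because the codes follow list position, whichever category comes first in `sizes` is treated as the "smallest" by pandas, which is why 'Large' shows up on the low end of the ordering.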

joedockrill · October 7, 2020, 11:12pm

What you’re doing in that cell is printing ProductSize for every single row. Pandas truncates the output and gives you the first 5 and the last 5 rows.

Setting the category order on your column doesn’t change the order of the rows, but you don’t need the dataframe sorted by anything anyway, so it’s irrelevant.

Mark_F · October 8, 2020, 12:37am

It’s possible that I’m misunderstanding you, but I’m pointing to the last line of the cell output, the one that starts with “Categories”. It shows the ordinality of the data as ‘Large’ < ‘Large/Medium’ …

My point is that it should be ‘Large’ > ‘Large/Medium’ …

But I suspect it doesn’t matter because the NN just figures that out since relative positions are unchanged.

We did explicitly create an order for the categories earlier in the Jupyter notebook.

joedockrill · October 8, 2020, 7:49pm

oh, i see…

if you create an ordered category with [‘a’, ‘b’, ‘c’, ‘d’] then the underlying values will be
a=0
b=1
c=2
d=3
so a < b < c < d, but you’re right in that it doesn’t matter. a correlation between ProductSize and your target, or between ProductSize and another variable can be positive or negative but it’s the amount of correlation which is important, not which direction it goes in.
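
a rough sketch of that point, using made-up numbers: the `values` and `target` arrays below are hypothetical, not from the bulldozers dataset. it builds the same column with the categories listed forwards and backwards, then compares the Pearson correlation of the underlying codes with the target.

```python
import numpy as np
import pandas as pd

letters = ['a', 'b', 'c', 'd']
values = ['a', 'c', 'b', 'd', 'a', 'd', 'b', 'c']
# Hypothetical target values, invented for this example.
target = np.array([1.0, 3.5, 2.0, 4.5, 0.5, 4.0, 2.5, 3.0])

# Same data, categories declared in opposite orders.
forward = pd.Categorical(values, categories=letters, ordered=True)
backward = pd.Categorical(values, categories=letters[::-1], ordered=True)

# Codes follow position in the category list: a=0..d=3 forwards, d=0..a=3 backwards.
print(forward.codes.tolist())
print(backward.codes.tolist())

# Reversing the order flips the sign of the correlation but keeps its magnitude.
r_forward = np.corrcoef(forward.codes, target)[0, 1]
r_backward = np.corrcoef(backward.codes, target)[0, 1]
print(r_forward, r_backward)
```

the two coefficients come out equal in size with opposite signs, because the reversed codes are just 3 minus the forward codes.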

eg: correlation between height and weight is positive, when one goes up the other goes up. correlation between ice cream sales and umbrella sales is negative, when (the weather changes and) one goes up the other goes down but there is a strong relationship between them. apologies if i’m over-simplifying here.

what matters is that even if the categories look ‘back to front’ to you, the strength of the correlation doesn’t change, and your network will still figure things out.

does that make sense?

Mark_F · October 8, 2020, 8:22pm

Yes, that’s very helpful, thanks. It’s essentially as I thought but appreciate having someone confirm it for me.