TabularPandas: Inconsistent row order after Categorify with random split

Hi, I’m new to fastai and may be missing something, but the following behavior seems rather odd:
A small dataframe with four columns, two of which are categorical; one of the categorical columns contains missing values. Categorify is supposed to map the categories to ordinal numbers, mapping missing values to 0. When I do this in the context below, I seem to get

  • rows in the categorical columns in an order different from the continuous columns,
  • in the column with missing values, an unexpected additional 0 value

Has anyone seen similar behavior? Is this expected? What would I have to change to get to the expected output (indicated below)?

import numpy as np
import pandas as pd
from fastai.tabular.all import *

df = pd.DataFrame({
    'nacol': [np.nan, np.nan, 2, 3, 4],
    'col2':  [0, 1, 2, 3, 4],
    'col3':  [90, 70, 50, 30, 10],
    'y':     [10, 20, 30, 40, 50],
})
print(df)

splits = RandomSplitter(valid_pct=0.2, seed=475)(range_of(df))
print(splits)

to = TabularPandas(
    df, 
    procs=[Categorify],
    cat_names=['nacol','col2'],
    cont_names='col3',
    y_names=['y'],
    splits=splits,
)
print(to.items)

output:

(df)
   nacol  col2  col3   y
0    NaN     0    90  10
1    NaN     1    70  20
2    2.0     2    50  30
3    3.0     3    30  40
4    4.0     4    10  50

(splits)
([3, 4, 0, 1], [2])

(to.items)
   nacol  col2  col3   y
3      1     3    30  40
4      2     4    10  50
0      0     1    90  10
1      0     2    70  20
2      0     0    50  30

I would have expected for to.items:

   nacol  col2  col3   y
3      2     3    30  40
4      3     4    10  50
0      0     0    90  10
1      0     1    70  20
2      1     2    50  30

I observed that the split seems to play a role, possibly only with RandomSplitter. If I set

splits=(range_of(df),)

I would get

   nacol  col2  col3   y
0      0     0    90  10
1      0     1    70  20
2      1     2    50  30
3      2     3    30  40
4      3     4    10  50

as expected.

I’m using fastai 2.7.12, fastcore 1.5.29, pandas 1.3.4.

Thanks for your suggestions.


Two observations:

  • In the example, the raw nacol and col2 data has a numerical type but is meant to be treated as categorical. In my original problem, such data represented room numbers. If I change the data in the two columns to strings, everything seems to work as expected; I will use this as a workaround.
  • The col2 information in my earlier post is off by 1 (copy-paste error, sorry).
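
For anyone wanting to apply the string workaround, here is a pandas-only sketch (the column names mirror the example above). Note that a plain `.astype(str)` would turn NaN into the literal string 'nan', so the nullable 'string' dtype is the safer cast:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nacol': [np.nan, np.nan, 2, 3, 4], 'col2': [0, 1, 2, 3, 4]})

# Cast the numeric-but-categorical columns to the nullable 'string' dtype;
# unlike .astype(str), this keeps missing values missing (<NA>).
for col in ['nacol', 'col2']:
    df[col] = df[col].astype('string')

print(df.dtypes)
print(df['nacol'].isna().sum())  # the two NaNs survive the cast
```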

Anyone stumbling over this thread hopefully reads this first. I made a grand mistake to begin with:

  • The whole point of Categorify is to translate object/string data into numerical data so that the data can be consumed in the model building process.
  • I started out with a finite set of values that, in the original problem space, may be considered categories. However, the values were numerical right from the beginning - no need to use Categorify at all.

As humbling as this conclusion is, maybe it proves useful to someone as new to ML-based data science as me.

Hi tombucky, I’m not sure that’s entirely the case. Categorify does turn your values into ints, but there’s more to it. Crucially, columns treated as categorical will get embeddings, while continuous columns will not. The point of using

cat_names=['nacol','col2'],

is precisely that some columns look numerical but really aren’t (e.g., day of the week from 1 to 7).
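
To illustrate with a toy sketch (hypothetical sizes, not fastai’s actual embedding-size rule): a categorical column with vocab size n gets its own n × d embedding matrix, and the ordinal codes are just row indices into it, while a continuous column is fed in as-is:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 8, 4          # 7 weekdays + one slot for #na#/unknown
emb = rng.normal(size=(vocab_size, emb_dim))

day_codes = np.array([1, 5, 7])     # ordinal codes as Categorify would emit
vectors = emb[day_codes]            # one learned vector per category
print(vectors.shape)
```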

Also, looking closely things seem to be working as intended. Note your splits:

(splits)
([3, 4, 0, 1], [2])

Row 2 (2.0, 2, 50, 30) is in the validation set, which means the nacol value 2.0 is not found in any training examples, so it is treated as unknown (same as NaN, i.e., index 0 for the embedding). Try adding more rows with repeated values to test this 🙂
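
This train-only vocab behavior can be mimicked with plain pandas (a rough sketch of the behavior, not fastai’s actual implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nacol': [np.nan, np.nan, 2, 3, 4]})
train_idx = [3, 4, 0, 1]  # same split as above; row 2 is validation-only

# Build the vocab from the TRAIN rows only, reserving index 0 for #na#/unknown
train_vals = df.loc[train_idx, 'nacol'].dropna().unique()
vocab = {v: i for i, v in enumerate(['#na#'] + sorted(train_vals.tolist()))}

# Anything absent from the train vocab (NaN, or the unseen 2.0) falls back to 0
codes = df['nacol'].map(vocab).fillna(0).astype(int)
print(codes.tolist())
```

This reproduces the observed mapping: rows 0, 1 (NaN) and row 2 (unseen 2.0) all become 0, while 3.0 and 4.0 become 1 and 2.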

However, I would expect you to get a new column like “nacol_isna”, but I can’t say off the top of my head why you didn’t.

Thanks, @Pablo, for your helpful reply. So categorical values that appear in the validation set but not in the training set are mapped to unknown/NaN/0. I tried your suggestion: for col2 I repeated the value 2 at row index 3, and indeed the value 2 is no longer mapped to 0 (see below).
Interestingly, this is the behavior for categorical numerical data. Here is what happens if I make nacol a string type:

df = pd.DataFrame({
    'nacol': [None, None, '2', '3', '4'],
    'col2':  [0, 1, 2, 2, 4],
    'col3':  [90, 70, 50, 30, 10],
    'y':     [10, 20, 30, 40, 50],
})
print(f'df=\n{df}')
print(f'df.dtypes=\n{df.dtypes}')

# creating splits omitted

to = TabularPandas(
    df, 
    procs=[Categorify],
    cat_names=['nacol','col2'],
    cont_names='col3',
    y_names=['y'],
    splits=splits,
)
print(f'to.items=\n{to.items}')

gives

df=
  nacol  col2  col3   y
0  None     0    90  10
1  None     1    70  20
2     2     2    50  30
3     3     2    30  40
4     4     4    10  50
df.dtypes=
nacol    object
col2      int64
col3      int64
y         int64
dtype: object
splits = ([3, 4, 0, 1], [2])
valid_idx = [2]
to.items=
   nacol  col2  col3   y
3      2     3    30  40
4      3     4    10  50
0      0     1    90  10
1      0     2    70  20
2      1     3    50  30
  • For col2 this looks as hoped for.
  • For nacol this looks as I expected originally. However, it seems the value “2” at row index 2 is not mapped to 0 (missing value), even though it does not occur in the training set.

Regarding not getting a new column “nacol_na”: if I understand the fastai code correctly, this is done only by FillMissing, which is applied to numerical (i.e., not categorical) data.
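
For reference, here is a pandas-only sketch of what FillMissing does for a continuous column (an approximation; fastai computes the fill value on the training split, whereas this sketch uses the whole column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col3': [90.0, np.nan, 50.0, 30.0, 10.0]})

# Record where values were missing, then fill the gaps with the median
df['col3_na'] = df['col3'].isna()
df['col3'] = df['col3'].fillna(df['col3'].median())
print(df)
```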


That’s a bit weird about the value for nacol in row 2! Can you try giving another row the same value, to see what happens? I see you now have None instead of NaN; I’m not 100% sure how fastai treats these either. (There’s a chance it interprets them exactly as missing values, or maybe it treats None like any valid string.) One way to look into this would be to make another value truly empty and see what happens.

As for the na column, you may be right. This makes the most sense if missing values are left in place in categorical columns (and assigned their own embedding). The “na” column is needed only when you impute new values for the missing entries (like filling in with the median, etc.).