Hi, I’m new to fastai and may be missing something, but the following behavior seems rather odd.
I have a small DataFrame with 4 columns, 2 of which are categorical. One of the categorical columns contains missing values. Categorify is supposed to map the categories to ordinal numbers, with missing values mapped to 0. When I run the code below, I seem to get:
- rows in the categorical columns in an order different from the continuous columns,
- in the column with missing values, an unexpected additional 0 value
Has anyone seen similar behavior? Is this expected? What would I have to change to get to the expected output (indicated below)?
import pandas as pd
from fastai.tabular.all import *
df = pd.DataFrame({'nacol':[np.nan,np.nan,2,3,4],'col2':[0,1,2,3,4],'col3':[90,70,50,30,10],'y':[10,20,30,40,50]})
print(df)
splits = RandomSplitter(valid_pct=0.2,seed=475)(range_of(df))
print(splits)
to = TabularPandas(
    df,
    procs=[Categorify],
    cat_names=['nacol','col2'],
    cont_names='col3',
    y_names=['y'],
    splits=splits,
)
print(to.items)
Output:
(df)
   nacol  col2  col3   y
0    NaN     0    90  10
1    NaN     1    70  20
2    2.0     2    50  30
3    3.0     3    30  40
4    4.0     4    10  50
(splits)
([3, 4, 0, 1], [2])
(to.items)
   nacol  col2  col3   y
3      1     3    30  40
4      2     4    10  50
0      0     1    90  10
1      0     2    70  20
2      0     0    50  30
I would have expected to.items to be:
   nacol  col2  col3   y
3      2     3    30  40
4      3     4    10  50
0      0     0    90  10
1      0     1    70  20
2      1     2    50  30
I noticed that using a split (perhaps only one from RandomSplitter) may play a role. If I set
splits=(range_of(df),)
I get
   nacol  col2  col3   y
0      0     0    90  10
1      0     1    70  20
2      1     2    50  30
3      2     3    30  40
4      3     4    10  50
as expected.
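One possible reading of the split case, sketched in plain Python rather than with fastai itself: assume the category-to-code mapping is built from the training rows only (indices 3, 4, 0, 1), with code 0 reserved for missing values and for values not seen in training. The helper names (`fit_vocab`, `encode`, `categorify_sketch`) are made up for illustration and are not fastai functions. Under that assumption the rows keep their order, and the numbers shown by to.items for the random split come out the same:

```python
import math

def is_missing(value):
    """True for float NaN, the missing-value marker in the example DataFrame."""
    return isinstance(value, float) and math.isnan(value)

def fit_vocab(train_values):
    """Map each distinct non-missing training value to 1, 2, ... in sorted order.

    Code 0 is reserved for missing or unseen values (an assumption of this sketch).
    """
    categories = sorted({v for v in train_values if not is_missing(v)})
    return {cat: code for code, cat in enumerate(categories, start=1)}

def encode(values, vocab):
    """Encode every value; missing or unknown values fall back to 0."""
    return [0 if is_missing(v) else vocab.get(v, 0) for v in values]

def categorify_sketch(column, train_idx, row_order):
    """Fit on the training rows only, encode all rows, list them in row_order."""
    vocab = fit_vocab([column[i] for i in train_idx])
    codes = encode(column, vocab)
    return [codes[i] for i in row_order]

nan = float("nan")
nacol = [nan, nan, 2.0, 3.0, 4.0]
col2 = [0, 1, 2, 3, 4]

# Split from the post: training rows first, then the validation row.
train_idx, valid_idx = [3, 4, 0, 1], [2]
row_order = train_idx + valid_idx

nacol_codes = categorify_sketch(nacol, train_idx, row_order)
col2_codes = categorify_sketch(col2, train_idx, row_order)

print(row_order)    # [3, 4, 0, 1, 2]
print(nacol_codes)  # [1, 2, 0, 0, 0]
print(col2_codes)   # [3, 4, 1, 2, 0]
```

If this is what happens, the extra 0 in nacol for row 2 is the value 2.0, which occurs only in the validation row, and the value 2 in col2 maps to 0 for the same reason.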
I’m using fastai 2.7.12, fastcore 1.5.29, pandas 1.3.4.
Thanks for your suggestions.