TabularDataBunch.from_df Categorical cannot perform the operation median

Elfayoumi · October 21, 2018, 1:51pm

Hello,
I am trying to use TabularDataBunch.from_df on the London energy usage Kaggle problem. The dataframe has the following fields:
print(df.head().T)

LCLid MAC000002
energy_sum 7.098
stdorToU Std
Acorn ACORN-A
Acorn_grouped Affluent
temperatureMax 11.53
windBearing 252
icon partly-cloudy-day
dewPoint 6.15
cloudCover 0.29
windSpeed 2.18
pressure 1004.92
apparentTemperatureHigh 11.53
precipType rain
visibility 12.94
humidity 0.84
apparentTemperatureLow 1.64
apparentTemperatureMax 11.53
uvIndex 2
temperatureLow 2.81
temperatureMin 6.41
temperatureHigh 11.53
summary Partly cloudy until evening.
apparentTemperatureMin 4.01
moonPhase 0.92
temperature_skewness -0.131944
temperature_kurtosis 1.82943
day_length 0.451424
day.of.week Friday
Type Normal
before_holiday 5
after_holiday 5
month 10
year 2012
temperatureMaxHour 14
temperatureMinHour 22
apparentTemperatureMinHour 22
apparentTemperatureHighHour 14
sunsetHour 17
uvIndexHour 12
sunriseHour 6
temperatureHighHour 14
temperatureLowHour 7
apparentTemperatureMaxHour 14
apparentTemperatureLowHour 6

I created cat_names:
cat_names = ['LCLid', 'Acorn', 'Acorn_grouped', "icon", "stdorToU", "Type", "day.of.week", 'precipType', 'summary',
             'before_holiday', 'after_holiday', 'month', 'year']

dep_var = 'energy_sum'
cont_names = list(filter(lambda x: x not in cat_names, df.columns))

and when I build the data bunch:
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var,
tfms=[FillMissing, Categorify], cat_names=cat_names)

I obtain the following error:

TypeError Traceback (most recent call last)
in ()
1 data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var,
----> 2 tfms=[FillMissing, Categorify], cat_names=cat_names)

~/anaconda3/envs/aind2/lib/python3.6/site-packages/fastai/tabular/data.py in from_df(cls, path, train_df, valid_df, dep_var, test_df, tfms, cat_names, cont_names, stats, log_output, **kwargs)
77 cat_names = ifnone(cat_names, [])
78 cont_names = ifnone(cont_names, list(set(train_df)-set(cat_names)-{dep_var}))
---> 79 train_ds = TabularDataset.from_dataframe(train_df, dep_var, tfms, cat_names, cont_names, stats, log_output)
80 valid_ds = TabularDataset.from_dataframe(valid_df, dep_var, train_ds.tfms, train_ds.cat_names,
81 train_ds.cont_names, train_ds.stats, log_output)

~/anaconda3/envs/aind2/lib/python3.6/site-packages/fastai/tabular/data.py in from_dataframe(cls, df, dep_var, tfms, cat_names, cont_names, stats, log_output)
61 else:
62 tfm = tfm(cat_names, cont_names)
---> 63 tfm(df)
64 tfms[i] = tfm
65 cat_names, cont_names = tfm.cat_names, tfm.cont_names

~/anaconda3/envs/aind2/lib/python3.6/site-packages/fastai/tabular/transform.py in __call__(self, df, test)
13 "Apply the correct function to df depending on test."
14 func = self.apply_test if test else self.apply_train
---> 15 func(df)
16
17 def apply_train(self, df:DataFrame):

~/anaconda3/envs/aind2/lib/python3.6/site-packages/fastai/tabular/transform.py in apply_train(self, df)
51 df[name+'_na'] = pd.isnull(df[name])
52 if name+'_na' not in self.cat_names: self.cat_names.append(name+'_na')
---> 53 if self.fill_strategy == FillStrategy.MEDIAN: filler = df[name].median()
54 elif self.fill_strategy == FillStrategy.CONSTANT: filler = self.fill_val
55 else: filler = df[name].dropna().value_counts().idxmax()

~/anaconda3/envs/aind2/lib/python3.6/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
7313 skipna=skipna)
7314 return self._reduce(f, name, axis=axis, skipna=skipna,
-> 7315 numeric_only=numeric_only)
7316
7317 return set_function_name(stat_func, name, cls)

~/anaconda3/envs/aind2/lib/python3.6/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
2579 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2580 numeric_only=numeric_only,
-> 2581 filter_type=filter_type, **kwds)
2582
2583 def _reindex_indexer(self, new_index, indexer, copy):

~/anaconda3/envs/aind2/lib/python3.6/site-packages/pandas/core/categorical.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
1963 if func is None:
1964 msg = 'Categorical cannot perform the operation {op}'
-> 1965 raise TypeError(msg.format(op=name))
1966 return func(numeric_only=numeric_only, **kwds)
1967

TypeError: Categorical cannot perform the operation median

I am not sure what causes this error.
Regards
Ibrahim

sgugger · October 21, 2018, 6:13pm

I think you have one categorical variable flagged as continuous, so when FillMissing tries to compute its median value, pandas throws an error.
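
One way to check for this, sketched in plain pandas: the traceback shows that when cont_names is not passed, from_df treats every column that is not in cat_names (and is not dep_var) as continuous. The helper below, non_numeric_cont_columns, is a hypothetical name and not part of fastai, and the toy dataframe is made up for illustration. It lists the columns that would end up continuous without having a numeric dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_cont_columns(df, cat_names, dep_var):
    """Hypothetical helper: columns that would be treated as continuous
    (not in cat_names, not dep_var) but do not have a numeric dtype."""
    cont_names = [c for c in df.columns if c not in cat_names and c != dep_var]
    return [c for c in cont_names if not is_numeric_dtype(df[c])]


# Made-up miniature of the dataframe in the question.
toy = pd.DataFrame({
    "energy_sum": [7.098, 5.2, 6.1],
    "icon": pd.Categorical(["partly-cloudy-day", "rain", "rain"]),
    "precipType": pd.Categorical(["rain", "rain", "snow"]),
    "humidity": [0.84, 0.80, 0.91],
})

# 'precipType' is deliberately left out of cat_names here,
# so it falls through to the continuous columns.
suspects = non_numeric_cont_columns(toy, cat_names=["icon"], dep_var="energy_sum")
print(suspects)

# Taking the median of such a column is what fails inside FillMissing.
median_failed = False
try:
    toy["precipType"].median()
except TypeError as err:
    median_failed = True
    print("TypeError:", err)
```

If the list comes back non-empty for train_df, either add those columns to cat_names or convert them to a numeric dtype before calling from_df.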

Elfayoumi · October 22, 2018, 12:30pm

Thanks. It was strange, but the next time I tried it, it worked.