Getting an error when trying to use TabularDataBunch.from_df

DerekHsieh · November 14, 2018, 3:56am

I tried running the code from lesson4-tabular and got an error on this line:
data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var,
tfms=[FillMissing, Categorify], cat_names=cat_names)

Error is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
2 cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
3 data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var,
----> 4 tfms=[FillMissing, Categorify], cat_names=cat_names)

~/miniconda3/envs/myenv/lib/python3.6/site-packages/fastai/tabular/data.py in from_df(cls, path, df, dep_var, valid_idx, procs, cat_names, cont_names, classes, **kwargs)
111 "Create a DataBunch from train/valid/test dataframes."
112 cat_names = ifnone(cat_names, [])
--> 113 cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-{dep_var}))
114 procs = listify(procs)
115 return (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)

~/miniconda3/envs/myenv/lib/python3.6/site-packages/pandas/core/generic.py in __hash__(self)
1490 def __hash__(self):
1491 raise TypeError('{0!r} objects are mutable, thus they cannot be'
-> 1492 ' hashed'.format(self.__class__.__name__))
1493
1494 def __iter__(self):

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

Is there a way around this?

I’m using fastai version 1.0.24.

joshfp · November 14, 2018, 2:01pm

I faced the same issue. The lesson4-tabular notebook in the repo is not the same as the one shown by Jeremy in the class, and since TabularDataBunch.from_df has changed in fastai, the notebook seems to be broken. Could you please check, @sgugger?
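
A likely reading of the traceback, sketched with only the standard library: the signature it prints is from_df(cls, path, df, dep_var, valid_idx, procs, ...), so the older positional call shifts its arguments by one slot. The stand-in function below only copies that parameter order; the None defaults and the placeholder strings are assumptions for illustration, not fastai code.

```python
import inspect


def from_df(path, df, dep_var, valid_idx, procs=None, cat_names=None,
            cont_names=None, classes=None, **kwargs):
    """Hypothetical stand-in that mirrors the parameter order in the traceback."""


# Placeholder strings stand in for the real objects passed in the failing call.
bound = inspect.signature(from_df).bind(
    "path", "train_df", "valid_df", "dep_var",
    tfms=["FillMissing", "Categorify"], cat_names=["workclass"],
)
binding = dict(bound.arguments)

# Show which parameter each argument of the old-style call ends up in.
for name, value in binding.items():
    print(f"{name} <- {value!r}")
```

With valid_df bound to dep_var, the expression {dep_var} on line 113 tries to put a DataFrame into a set, which is what the TypeError reports. The tfms keyword is simply swallowed by **kwargs, since the parameter is now called procs.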

For now, you can work around it in either of two ways:

by using the TabularDataBunch factory method:

valid_idx = range(len(df)-2000, len(df))
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=[FillMissing, Categorify], cat_names=cat_names)

or, by using data_block (as Jeremy did in the video):

valid_idx = range(len(df)-2000, len(df))
cont_names = ['age', 'fnlwgt', 'education-num']
data = (TabularList.from_df(df, cat_names, cont_names, procs=[FillMissing, Categorify])
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var)
        .databunch())
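
Both snippets assume a single df whose last 2000 rows are the validation set. If you still have separate train_df and valid_df as in the original question, one way to get there is to append the validation rows after the training rows and take their positions as valid_idx. A minimal sketch of the index arithmetic; tail_valid_idx is a hypothetical helper name, and the row counts are made up for the example:

```python
def tail_valid_idx(n_train: int, n_valid: int) -> range:
    """Positions of the validation rows when they are appended after the training rows."""
    return range(n_train, n_train + n_valid)


# With pandas this would pair with something like:
#   df = pd.concat([train_df, valid_df], ignore_index=True)
#   valid_idx = tail_valid_idx(len(train_df), len(valid_df))
valid_idx = tail_valid_idx(n_train=8, n_valid=2)
print(list(valid_idx))
```

This gives the same range as range(len(df)-n_valid, len(df)) computed on the combined frame.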

DerekHsieh · November 15, 2018, 4:35am

Either method works. Thanks Jose