TL;DR: I’m getting a pandas KeyError from label_from_df when I create several overlapping DataBunch objects from the same pandas dataframe with the FillMissing preprocessor. The KeyError is on one of the _na columns, i.e. the indicator columns fastai adds to flag rows where a value was missing.
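For anyone unfamiliar with the _na columns, here is a minimal pandas-only sketch of the idea. It is not fastai's actual implementation: the helper name add_na_indicators and the toy data are made up for illustration. The assumption it encodes is that an indicator column is only added for a continuous column that actually has missing values in the frame being processed, so two slices of the same dataframe can end up with different sets of _na columns.

```python
import numpy as np
import pandas as pd


def add_na_indicators(df, cont_names):
    """Hypothetical helper: fill NaNs with the median and add a `<col>_na` flag.

    A flag column is only created when the column has at least one NaN.
    """
    out = df.copy()
    added = []
    for name in cont_names:
        missing = out[name].isnull()
        if missing.any():
            out[name + "_na"] = missing
            out[name] = out[name].fillna(out[name].median())
            added.append(name + "_na")
    return out, added


# Toy data (made up): only the second row has a missing value.
toy = pd.DataFrame({
    "age": [25.0, 38.0, 52.0, 41.0],
    "education-num": [9.0, np.nan, 13.0, 10.0],
})

# The first two rows contain a NaN, the last two do not.
_, added_head = add_na_indicators(toy.iloc[:2], ["age", "education-num"])
_, added_tail = add_na_indicators(toy.iloc[2:], ["age", "education-num"])
print(added_head)
print(added_tail)
```

In this sketch the first slice gains an education-num_na column and the second does not, which is the kind of mismatch the KeyError message hints at.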
I’m making several DataBunch objects from a single pandas dataframe as follows (note: this is the dataset Jeremy uses in one of the lectures):
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path / 'adult.csv')
l = len(df)
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
I’m making new dataframes, each holding a subset of the original data:
first = 0
second = l//10
third = (2*l)//10
fourth = (3*l)//10
fifth = (4*l)//10
df_base = df[:][first : third].copy(deep=True)
df_m1 = df[:][third: fourth].copy(deep=True)
df_m2 = df[:][fourth : fifth].copy(deep=True)
df_full = df[:][first : fifth].copy(deep=True)
After this I create a DataBunch for each of them:
data_m1 = (TabularList.from_df(df_m1, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(int(0.8*len(df_m1)), len(df_m1))))
.label_from_df(cols=dep_var)
.databunch(bs=bs))
data_m2 = (TabularList.from_df(df_m2, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(int(0.8*len(df_m2)), len(df_m2))))
.label_from_df(cols=dep_var)
.databunch(bs=bs))
data_base = (TabularList.from_df(df_base, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(int(0.8*len(df_base)), len(df_base))))
.label_from_df(cols=dep_var)
.databunch(bs=bs))
data_full = (TabularList.from_df(df_full, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range( int(0.8*len(df_full)), len(df_full))))
.label_from_df(cols=dep_var)
.databunch(bs=bs))
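Each pipeline above holds out the last 20% of rows as the validation set. As a small sanity check on those split_by_idx arguments, the index computation can be factored into a helper. The name last_fraction_idx is made up for this sketch and is not a fastai function.

```python
def last_fraction_idx(n_rows, train_frac=0.8):
    """Hypothetical helper: positions of the rows after the first `train_frac` of the data."""
    start = int(train_frac * n_rows)
    return list(range(start, n_rows))


# Same expression as list(range(int(0.8*len(df)), len(df))) for a 10-row frame.
valid_idx = last_fraction_idx(10)
print(valid_idx)
```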
Now, when I create the databunches in the sequence data_m1, data_m2, data_base, data_full, creating them does not raise any error, but I get an error when I build a learner on data_m1 and run lr_find (the traceback points at the tabular_learner call). In this case the error is:
learn_m1 = tabular_learner(data_m1, [50, 100], metrics=accuracy)
learn_m1.lr_find()
learn_m1.recorder.plot()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-14-3888ffa4b04e> in <module>()
----> 1 learn_m1 = tabular_learner(data_m1, [50, 100], metrics=accuracy)
2 # learn_m1.load('base')
3 learn_m1.lr_find()
4 learn_m1.recorder.plot()
3 frames
/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in def_emb_sz(classes, n, sz_dict)
16 "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
17 sz_dict = ifnone(sz_dict, {})
---> 18 n_cat = len(classes[n])
19 sz = sz_dict.get(n, int(emb_sz_rule(n_cat))) # rule of thumb
20 return n_cat,sz
KeyError: 'education-num_na'
However, if I create data_m1 after I’ve created data_full, I get the error immediately. The error in this case is:
data_m1 = (TabularList.from_df(df_m1, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(int(0.8*len(df_m1)), len(df_m1))))
.label_from_df(cols=dep_var)
.databunch(bs=bs))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'education-num_na'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
15 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'education-num_na'
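The order-dependence above would be consistent with the cat_names list being shared between the four pipelines and modified in place by a preprocessor. That is an assumption, not something the tracebacks confirm. The toy class below is hypothetical and is not fastai code; it only shows how an in-place append on a shared list could leak an _na name from one dataset into another, and how passing a fresh copy of the list avoids that.

```python
class ToyFillMissing:
    """Hypothetical processor that keeps a reference to the caller's list."""

    def __init__(self, cat_names, cont_names):
        self.cat_names = cat_names      # same list object as the caller's
        self.cont_names = cont_names

    def fit(self, rows):
        # Register an indicator name for every column with a missing value.
        for name in self.cont_names:
            has_missing = any(row[name] is None for row in rows)
            if has_missing and name + "_na" not in self.cat_names:
                self.cat_names.append(name + "_na")


# Toy rows (made up): one dataset has a gap, the other does not.
with_gap = [{"education-num": 9}, {"education-num": None}]
no_gap = [{"education-num": 13}, {"education-num": 10}]

# Shared list: fitting on `with_gap` changes what the next pipeline sees.
shared = ["workclass"]
ToyFillMissing(shared, ["education-num"]).fit(with_gap)
leaked = list(shared)

# Fresh copies: each pipeline gets its own list, so nothing leaks.
original = ["workclass"]
ToyFillMissing(list(original), ["education-num"]).fit(with_gap)
ToyFillMissing(list(original), ["education-num"]).fit(no_gap)

print(leaked)
print(original)
```

If that is what is happening here, passing cat_names=list(cat_names) and cont_names=list(cont_names) to each TabularList.from_df call would be one thing to try.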
Thanks a lot.
Edit 1: I tried using a TabularList object to split the data so that I get the same category names, but the error persists.