Tabular data block error

Narang · April 16, 2020, 10:05pm

TL;DR: I’m getting a pandas KeyError from label_from_df when I create several overlapping DataBunch objects from the same pandas dataframe with the FillMissing preprocessor. The KeyError is on one of the _na columns, i.e. the columns fastai adds to flag rows where a value was missing.

I’m making some DataBunch objects from a single pandas dataframe as follows:
(Note: This is the data set Jeremy uses in one of the lectures)

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path / 'adult.csv')
l = len(df) 

dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

I’m making new dataframes from subsets of the original data:

first = 0
second = l//10
third = (2*l)//10
fourth = (3*l)//10
fifth = (4*l)//10

df_base = df.iloc[first:third].copy()
df_m1 = df.iloc[third:fourth].copy()
df_m2 = df.iloc[fourth:fifth].copy()
df_full = df.iloc[first:fifth].copy()
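Since the KeyError names an _na column, it may be worth checking which of these slices actually contain missing values in the continuous columns. A minimal sketch on a toy frame: the data and the `missing_per_slice` helper are made up for illustration, and with the real data you would pass `df` and `cont_names` from above.

```python
import pandas as pd

def missing_per_slice(df, bounds, cols):
    """Count NaNs in `cols` for each named row slice of `df`.

    `bounds` maps a label to a (start, stop) pair of row positions.
    """
    return {
        name: df.iloc[start:stop][cols].isna().sum().to_dict()
        for name, (start, stop) in bounds.items()
    }

# Toy stand-in for adult.csv: a single NaN in 'education-num', at row 1.
toy = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'education-num': [9.0, float('nan'), 13.0, 10.0, 12.0, 14.0],
})
bounds = {'base': (0, 2), 'm1': (2, 4), 'm2': (4, 6), 'full': (0, 6)}
report = missing_per_slice(toy, bounds, ['age', 'education-num'])
print(report)
```

If only some slices report missing values for a column, only those slices would get the corresponding _na column.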

After this I create a DataBunch for each of them:

data_m1 = (TabularList.from_df(df_m1, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(int(0.8*len(df_m1)), len(df_m1))))
                           .label_from_df(cols=dep_var)
                           .databunch(bs=bs))

data_m2 = (TabularList.from_df(df_m2, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(int(0.8*len(df_m2)), len(df_m2))))
                           .label_from_df(cols=dep_var)
                           .databunch(bs=bs))

data_base = (TabularList.from_df(df_base, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(int(0.8*len(df_base)), len(df_base))))
                           .label_from_df(cols=dep_var)
                           .databunch(bs=bs))

data_full = (TabularList.from_df(df_full, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(int(0.8*len(df_full)), len(df_full))))
                           .label_from_df(cols=dep_var)
                           .databunch(bs=bs))
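Each of the four blocks holds out the last 20% of rows as the validation set. The index computation can be pulled into a small helper; `tail_valid_idx` is a hypothetical name, not part of fastai.

```python
def tail_valid_idx(n_rows, train_frac=0.8):
    """Positions of the rows after the first `train_frac` share, as a list."""
    return list(range(int(train_frac * n_rows), n_rows))

idx = tail_valid_idx(10)
print(idx)
```

With this, each call above would read `.split_by_idx(tail_valid_idx(len(df_m1)))` and so on.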

When I create the databunches in the order data_m1, data_m2, data_base, data_full, none of them raises an error on creation, but I get an error when I define a learner on data_m1 and run lr_find (the traceback points at the tabular_learner call).
In this case the error is:

learn_m1 = tabular_learner(data_m1, [50, 100], metrics=accuracy)
learn_m1.lr_find()
learn_m1.recorder.plot()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-3888ffa4b04e> in <module>()
----> 1 learn_m1 = tabular_learner(data_m1, [50, 100], metrics=accuracy)
      2 # learn_m1.load('base')
      3 learn_m1.lr_find()
      4 learn_m1.recorder.plot()

3 frames
/usr/local/lib/python3.6/dist-packages/fastai/tabular/data.py in def_emb_sz(classes, n, sz_dict)
     16     "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
     17     sz_dict = ifnone(sz_dict, {})
---> 18     n_cat = len(classes[n])
     19     sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
     20     return n_cat,sz

KeyError: 'education-num_na'

However, if I create data_m1 after I’ve created data_full, the error is raised immediately, during databunch creation:

data_m1 = (TabularList.from_df(df_m1, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(list(range(int(0.8*len(df_m1)), len(df_m1))))
                           .label_from_df(cols=dep_var)
                           .databunch(bs=bs))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'education-num_na'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
15 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'education-num_na'
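One possible explanation for the order dependence, stated as an assumption rather than something confirmed here: if FillMissing appends the new _na name to the very cat_names list it is given, then the single cat_names list shared by all four calls would pick up 'education-num_na' as soon as one subset has a missing education-num, and any subset without that gap would then be asked for a column it does not have. The sketch below reproduces that pattern in plain Python. `fake_fill_missing` and `read_categoricals` are hypothetical stand-ins, not fastai code, and the rows are toy data.

```python
def fake_fill_missing(rows, cat_names, cont_names):
    """Hypothetical stand-in for a FillMissing-style step (NOT fastai code).

    For every continuous column with a missing value, add a '<col>_na'
    flag to each row and append that name to `cat_names` IN PLACE.
    """
    for col in cont_names:
        if any(r[col] is None for r in rows):
            for r in rows:
                r[col + '_na'] = r[col] is None
            if col + '_na' not in cat_names:
                cat_names.append(col + '_na')

def read_categoricals(rows, cat_names):
    """Look up every categorical column; raises KeyError if one is absent."""
    return [[r[c] for c in cat_names] for r in rows]

# Toy rows: only `with_gap` has a missing 'education-num'.
with_gap = [{'race': 'A', 'education-num': None},
            {'race': 'B', 'education-num': 9}]
no_gap = [{'race': 'A', 'education-num': 10},
          {'race': 'B', 'education-num': 13}]

shared = ['race']  # one list reused for both frames
fake_fill_missing(with_gap, shared, ['education-num'])
fake_fill_missing(no_gap, shared, ['education-num'])
try:
    read_categoricals(no_gap, shared)
    error = None
except KeyError as e:
    error = e.args[0]  # name of the column that was not found
print(shared, error)

# Passing a fresh copy per frame leaves the caller's list untouched.
original = ['race']
fake_fill_missing(with_gap, list(original), ['education-num'])
print(original)
```

If this is what is happening, it would fit both symptoms: a databunch created earlier would see its category list grow afterwards, and one created later would start with the extra name. Passing `cat_names=list(cat_names)` and `cont_names=list(cont_names)` to each TabularList.from_df call might be worth trying.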

Thanks a lot.
Edit 1: I tried using a TabularList object to split the data so that I get the same category names, but the error persists.