Error in md = ColumnarModelData.from_data_frame()


#1

I'm getting an error with
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128, test_df=df_test)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
----> 2                                        test_df=df_test)
      3 
      4 #md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128)

~/fastai/courses/dl1/fastai/column_data.py in from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg, test_df)
     68     def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg=True, test_df=None):
     69         ((val_df, trn_df), (val_y, trn_y)) = split_by_idx(val_idxs, df, y)
---> 70         return cls.from_data_frames(path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=test_df)
     71 
     72     def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,

~/fastai/courses/dl1/fastai/column_data.py in from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df)
     61     @classmethod
     62     def from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=None):
---> 63         test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg) if test_df is not None else None
     64         return cls(path, ColumnarDataset.from_data_frame(trn_df, cat_flds, trn_y, is_reg),
     65                     ColumnarDataset.from_data_frame(val_df, cat_flds, val_y, is_reg), bs, test_ds=test_ds)

~/fastai/courses/dl1/fastai/column_data.py in from_data_frame(cls, df, cat_flds, y, is_reg)
     43     @classmethod
     44     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True):
---> 45         return cls.from_data_frames(df[cat_flds], df.drop(cat_flds, axis=1), y, is_reg)
     46 
     47 

~/fastai/courses/dl1/fastai/column_data.py in from_data_frames(cls, df_cat, df_cont, y, is_reg)
     39         cat_cols = [c.values for n,c in df_cat.items()]
     40         cont_cols = [c.values for n,c in df_cont.items()]
---> 41         return cls(cat_cols, cont_cols, y, is_reg)
     42 
     43     @classmethod

~/fastai/courses/dl1/fastai/column_data.py in __init__(self, cats, conts, y, is_reg)
     27         self.y = np.zeros((n,1)) if y is None else y
     28         if is_reg:
---> 29             self.y = self.y[:,None]
     30         self.is_reg = is_reg
     31 

TypeError: 'bool' object is not subscriptable

This is from the Rossmann notebook.
The problem does not appear if I don't use the "test_df=df_test" part, and when checking the column types of df_test there seems to be no problem:

df_test.dtypes.value_counts()
int8       21
float64    18
int16       1
dtype: int64

Any ideas? What is the effect of the "test_df=df_test" argument?


#2

I'm not sure what type of data you are using, but the fastai library is not plug-and-play with everything, as I have come to find out the hard way.
I would edit ColumnarDataset in column_data.py (or wherever it lives for you):

You can eliminate the is_reg if clause and just set

self.y = y

I have set is_reg to None and to False and it still has issues. Is there documentation, or even half a sentence, about what this flag is? No.
Again, not knowing the format of your test_df, having one or not did not seem to matter in my case.

You will probably have more issues, but this one should be sorted by that edit.
Perhaps we are looking to achieve the same goal; it would be good to share if so.
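
For reference, a minimal sketch of that edit, based on the __init__ lines visible in the tracebacks above (it lives inside ColumnarDataset in column_data.py, which already imports numpy as np; treat it as a workaround, not a proper fix):

# Sketch of a patched ColumnarDataset.__init__ (fastai 0.7-era column_data.py),
# with the "if is_reg: ..." clause dropped so y is kept exactly as passed in.
def __init__(self, cats, conts, y, is_reg):
    n = len(cats[0]) if cats else len(conts[0])
    self.cats  = np.stack(cats, 1).astype(np.int64)    if cats  else np.zeros((n,1))
    self.conts = np.stack(conts, 1).astype(np.float32) if conts else np.zeros((n,1))
    self.y = np.zeros((n,1)) if y is None else y   # no [:,None] reshaping here
    self.is_reg = is_reg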


#3

I'm also getting an error when using ColumnarModelData, although in my case it's a ValueError. I'm working on this Kaggle competition. On inspecting the data, I believe there are no NaNs. I'm following the Rossmann notebook as a guideline. These are separate cells in a Jupyter notebook, but I've combined them here for brevity.

df = pd.read_csv(training_csv, index_col=['id'])
df_test = pd.read_csv(testing_csv, index_col=['id'])

train_cats(df)
for v in cat_vars: df[v] = df[v].astype('category').cat.as_ordered()
apply_cats(df_test, df)

for v in contin_vars:
    df[v] = df[v].astype('float32')
    df_test[v] = df_test[v].astype('float32')
    
df_samp, y, nas, mapper = proc_df(df, 'loss', do_scale=True)  
samp_size = len(df_samp)
print(df_samp.isnull().values.any(), df_samp.isnull().sum().sum())

And I get the output:
False 0

indicating that df_samp does not contain any null values. However, when I call ColumnarModelData,

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
                                      bs=128, test_df=df_test)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-122effa9459d> in <module>()
      1 md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
----> 2                                       bs=128, test_df=df_test)

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg, test_df)
     68     def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg=True, test_df=None):
     69         ((val_df, trn_df), (val_y, trn_y)) = split_by_idx(val_idxs, df, y)
---> 70         return cls.from_data_frames(path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=test_df)
     71 
     72     def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df)
     61     @classmethod
     62     def from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=None):
---> 63         test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg) if test_df is not None else None
     64         return cls(path, ColumnarDataset.from_data_frame(trn_df, cat_flds, trn_y, is_reg),
     65                     ColumnarDataset.from_data_frame(val_df, cat_flds, val_y, is_reg), bs, test_ds=test_ds)

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frame(cls, df, cat_flds, y, is_reg)
     43     @classmethod
     44     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True):
---> 45         return cls.from_data_frames(df[cat_flds], df.drop(cat_flds, axis=1), y, is_reg)
     46 
     47 

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frames(cls, df_cat, df_cont, y, is_reg)
     39         cat_cols = [c.values for n,c in df_cat.items()]
     40         cont_cols = [c.values for n,c in df_cont.items()]
---> 41         return cls(cat_cols, cont_cols, y, is_reg)
     42 
     43     @classmethod

~/kaggle-comps/allstate/fastai/column_data.py in __init__(self, cats, conts, y, is_reg)
     23     def __init__(self, cats, conts, y, is_reg):
     24         n = len(cats[0]) if cats else len(conts[0])
---> 25         self.cats = np.stack(cats, 1).astype(np.int64) if cats else np.zeros((n,1))
     26         self.conts = np.stack(conts, 1).astype(np.float32) if conts else np.zeros((n,1))
     27         self.y = np.zeros((n,1)) if y is None else y

ValueError: cannot convert float NaN to integer
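
In case it helps diagnose: the null check above was only run on df_samp, but the last frame of the traceback builds its dataset from df_test, so the equivalent check on the test frame (just a quick diagnostic sketch) would be:

# Diagnostic sketch: run the same null check on df_test, which is what the
# failing ColumnarDataset in the last traceback frame is built from.
print(df_test.isnull().values.any(), df_test.isnull().sum().sum())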

Any help is appreciated!


#4

I pre-processed my df so there were no NaNs. I didn't look closely at your trace, but you probably need to set your cat variables to an int type; proc_df might be handling them for you but leaving them as floats.

I am just so invested at this point that I want to use this library for my goal, even though it is quite the time sink.
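
For illustration only, a rough sketch of forcing the categorical columns to integer codes, reusing the df_test and cat_vars names from the posts above (NaN categories become -1 here, so inspect those before training):

# Rough sketch: replace pandas categorical columns with their integer codes so that
# np.stack(...).astype(np.int64) inside ColumnarDataset never sees float NaNs.
for v in cat_vars:
    df_test[v] = df_test[v].astype('category').cat.codes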


#5

I checked the df returned by proc_df, and all the categorical vars are either int8 or int16, depending on how big they are:

df_samp.dtypes

cat1         int8
cat2         int8
cat3         int8
cat4         int8
cat5         int8
cat6         int8
cat7         int8
cat8         int8
cat9         int8
cat10        int8
cat11        int8
cat12        int8
cat13        int8
cat14        int8
cat15        int8
cat16        int8
cat17        int8
cat18        int8
cat19        int8
cat20        int8
cat21        int8
cat22        int8
cat23        int8
cat24        int8
cat25        int8
cat26        int8
cat27        int8
cat28        int8
cat29        int8
cat30        int8
           ...   
cat101       int8
cat102       int8
cat103       int8

Any other suggestions?

I understand your frustration, but I think this library has a lot of potential, and us using it and discovering/fixing bugs will help make it a plug-and-play library.

Thanks for your help!


#6

I'm just following the notebook line by line (I did a git pull just yesterday after discovering the error), not using my own data.
The format of the data, at least in terms of the column data types described above, is:

df_test.dtypes.value_counts()
int8       21
float64    18
int16       1
dtype: int64

I will check the data processing of df_test again.
Btw, in the video there is no "test_df=df_test" argument passed, which is why I tried without it, and it did work.


(Jeremy Howard) #7

Rather than being rude, maybe consider helping?

If not, just avoid being rude please. A kind person submitted a patch that added this flag and functionality about 2 days ago. Being unpleasant about their hard work benefits no-one, and certainly doesn’t seem like the kind of thing likely to make people in this community want to spend their time helping you.


#8

Update 1:
I traced the PR conversation and you can find it here. I believe the is_reg flag indicates whether we want a classifier (is_reg=False) or a regressor (is_reg=True). However, both of these throw the same TypeError about a bool object not being subscriptable.

---------------------------------------------------------------------------------------------------------------------------------------------

I discovered my bug, and it was a silly mistake on my part: I had not run proc_df on my df_test, and that was what was causing the ValueError. Once I fixed it, I got the same error as the OP.

I've submitted the issue on GitHub, and hopefully someone will be able to help.
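
For anyone hitting the same thing, the fix on my side was roughly the following, mirroring how the Rossmann notebook processes its test frame. The exact proc_df keywords (na_dict, mapper) may differ between fastai versions, and the dummy 'loss' column is just so proc_df has a dependent variable to pop, so treat this as a sketch:

# Sketch: run the test frame through proc_df as well, reusing the na_dict/mapper
# returned from the training call so both frames get the same NA columns and scaling.
df_test['loss'] = 0   # dummy dependent variable for proc_df
df_test_proc, _, nas, mapper = proc_df(df_test, 'loss', do_scale=True,
                                       na_dict=nas, mapper=mapper)

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
                                       bs=128, test_df=df_test_proc)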


(Taman) #9

Hi, I had the same issue and traced it to a mismatch in function arguments.
It seems the fast.ai team has already adjusted the necessary files, so after a git pull it should work fine now.
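
For anyone curious, the mismatch is visible in the tracebacks above; as I read it (not an official explanation), it went roughly like this:

# from_data_frames (line 63 in the tracebacks) built the test dataset with a
# positional third argument:
test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg)

# but the signature being called (line 44) is
#     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True): ...
# so the boolean is_reg lands in the y slot, and __init__ later tries
# self.y[:,None] on a plain bool -> TypeError: 'bool' object is not subscriptable.
# The corrected call presumably passes it by keyword, something like:
test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg=is_reg)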


(Sneha Nagpaul) #10

So I ran into the same problem, and I'm not sure if your issue was resolved. Please let me know if it was.
From my tinkering around, I feel the problem lies in the apply_cats() call. This function takes the test_df and train_df and applies the training df's category mappings to the test df. Perhaps your dataset has an 'id' field that was treated as categorical. If the test set is 'categorized' using the training categories, none of the ids would match the training keys, and you get new NaNs: a whole column of them (the id field). Let me know how you resolved this, though. For now I'm going to pull out that functionality (apply_cats()) and skip the 'id' field for the test set.
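
A quick way to check whether that is what happened, just a diagnostic sketch using the df_test and cat_vars names from earlier in the thread:

# After apply_cats(df_test, df), any test value with no matching training
# category becomes NaN; this prints the per-column NaN counts for the cat fields.
nan_counts = df_test[cat_vars].isnull().sum()
print(nan_counts[nan_counts > 0])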