Error in md = ColumnarModelData.from_data_frame()


#1

I'm getting an error with
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128, test_df=df_test)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
----> 2                                        test_df=df_test)
      3 
      4 #md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128)

~/fastai/courses/dl1/fastai/column_data.py in from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg, test_df)
     68     def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg=True, test_df=None):
     69         ((val_df, trn_df), (val_y, trn_y)) = split_by_idx(val_idxs, df, y)
---> 70         return cls.from_data_frames(path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=test_df)
     71 
     72     def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,

~/fastai/courses/dl1/fastai/column_data.py in from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df)
     61     @classmethod
     62     def from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=None):
---> 63         test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg) if test_df is not None else None
     64         return cls(path, ColumnarDataset.from_data_frame(trn_df, cat_flds, trn_y, is_reg),
     65                     ColumnarDataset.from_data_frame(val_df, cat_flds, val_y, is_reg), bs, test_ds=test_ds)

~/fastai/courses/dl1/fastai/column_data.py in from_data_frame(cls, df, cat_flds, y, is_reg)
     43     @classmethod
     44     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True):
---> 45         return cls.from_data_frames(df[cat_flds], df.drop(cat_flds, axis=1), y, is_reg)
     46 
     47 

~/fastai/courses/dl1/fastai/column_data.py in from_data_frames(cls, df_cat, df_cont, y, is_reg)
     39         cat_cols = [c.values for n,c in df_cat.items()]
     40         cont_cols = [c.values for n,c in df_cont.items()]
---> 41         return cls(cat_cols, cont_cols, y, is_reg)
     42 
     43     @classmethod

~/fastai/courses/dl1/fastai/column_data.py in __init__(self, cats, conts, y, is_reg)
     27         self.y = np.zeros((n,1)) if y is None else y
     28         if is_reg:
---> 29             self.y = self.y[:,None]
     30         self.is_reg = is_reg
     31 

TypeError: 'bool' object is not subscriptable

This is from the Rossmann notebook.
The problem does not appear if I don't use the "test_df=df_test" part, and when checking the column types of df_test there seems to be no problem:

df_test.dtypes.value_counts()
int8       21
float64    18
int16       1
dtype: int64

Any ideas? What is the effect of the "test_df=df_test" argument?


#2

I'm not sure what type of data you are using, but the fastai library is not plug-and-play with everything, as I have come to find out the hard way.
I would edit ColumnarDataset in column_data.py (or wherever it lives for you):

You can eliminate the is_reg if clause and just set

self.y = y

I have set is_reg to None and to False and it still has issues. Is there documentation, or even half a sentence, about what this flag is? No.
Again, not knowing the format of your test_df, having one or not did not seem to matter in my case.

You will probably have more issues, but this one should be sorted by that edit.
Perhaps we are looking to achieve the same goal; it would be good to share if so.
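
For reference, a minimal sketch of that edit, based on the __init__ lines visible in the tracebacks above (it lives inside ColumnarDataset in column_data.py, which already imports numpy as np; treat it as a workaround, not a proper fix):

# Sketch of a patched ColumnarDataset.__init__ (fastai 0.7-era column_data.py),
# with the "if is_reg: ..." clause dropped so y is kept exactly as passed in.
def __init__(self, cats, conts, y, is_reg):
    n = len(cats[0]) if cats else len(conts[0])
    self.cats  = np.stack(cats, 1).astype(np.int64)    if cats  else np.zeros((n,1))
    self.conts = np.stack(conts, 1).astype(np.float32) if conts else np.zeros((n,1))
    self.y = np.zeros((n,1)) if y is None else y   # no [:,None] reshaping here
    self.is_reg = is_reg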


#3

I'm also getting an error when using ColumnarModelData, although in my case it's a ValueError. I'm working on this Kaggle competition. On inspecting the data, I believe there are no NaNs. I'm following the Rossmann notebook as a guideline. These are separate cells in a Jupyter notebook, but I've combined them here for brevity.

df = pd.read_csv(training_csv, index_col=['id'])
df_test = pd.read_csv(testing_csv, index_col=['id'])

train_cats(df)
for v in cat_vars: df[v] = df[v].astype('category').cat.as_ordered()
apply_cats(df_test, df)

for v in contin_vars:
    df[v] = df[v].astype('float32')
    df_test[v] = df_test[v].astype('float32')
    
df_samp, y, nas, mapper = proc_df(df, 'loss', do_scale=True)  
samp_size = len(df_samp)
print(df_samp.isnull().values.any(), df_samp.isnull().sum().sum())

And I get the output:
False 0

indicating that df_samp does not contain any null values. However, when I call ColumnarModelData,

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
                                      bs=128, test_df=df_test)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-122effa9459d> in <module>()
      1 md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
----> 2                                       bs=128, test_df=df_test)

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg, test_df)
     68     def from_data_frame(cls, path, val_idxs, df, y, cat_flds, bs, is_reg=True, test_df=None):
     69         ((val_df, trn_df), (val_y, trn_y)) = split_by_idx(val_idxs, df, y)
---> 70         return cls.from_data_frames(path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=test_df)
     71 
     72     def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df)
     61     @classmethod
     62     def from_data_frames(cls, path, trn_df, val_df, trn_y, val_y, cat_flds, bs, is_reg, test_df=None):
---> 63         test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg) if test_df is not None else None
     64         return cls(path, ColumnarDataset.from_data_frame(trn_df, cat_flds, trn_y, is_reg),
     65                     ColumnarDataset.from_data_frame(val_df, cat_flds, val_y, is_reg), bs, test_ds=test_ds)

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frame(cls, df, cat_flds, y, is_reg)
     43     @classmethod
     44     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True):
---> 45         return cls.from_data_frames(df[cat_flds], df.drop(cat_flds, axis=1), y, is_reg)
     46 
     47 

~/kaggle-comps/allstate/fastai/column_data.py in from_data_frames(cls, df_cat, df_cont, y, is_reg)
     39         cat_cols = [c.values for n,c in df_cat.items()]
     40         cont_cols = [c.values for n,c in df_cont.items()]
---> 41         return cls(cat_cols, cont_cols, y, is_reg)
     42 
     43     @classmethod

~/kaggle-comps/allstate/fastai/column_data.py in __init__(self, cats, conts, y, is_reg)
     23     def __init__(self, cats, conts, y, is_reg):
     24         n = len(cats[0]) if cats else len(conts[0])
---> 25         self.cats = np.stack(cats, 1).astype(np.int64) if cats else np.zeros((n,1))
     26         self.conts = np.stack(conts, 1).astype(np.float32) if conts else np.zeros((n,1))
     27         self.y = np.zeros((n,1)) if y is None else y

ValueError: cannot convert float NaN to integer
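
In case it helps diagnose: the null check above was only run on df_samp, but the last frame of the traceback builds its dataset from df_test, so the equivalent check on the test frame (just a quick diagnostic sketch) would be:

# Diagnostic sketch: run the same null check on df_test, which is what the
# failing ColumnarDataset in the last traceback frame is built from.
print(df_test.isnull().values.any(), df_test.isnull().sum().sum())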

Any help is appreciated!


#4

I pre-processed my df so there were no NaNs. I didn't look closely at your trace, but you probably need to set your cat variables to an int type; proc_df might be handling them for you but leaving them as floats.

I am just so invested at this point that I want to use this library for my goal, even though it is quite the time sink.
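
For illustration only, a rough sketch of forcing the categorical columns to integer codes, reusing the df_test and cat_vars names from the posts above (NaN categories become -1 here, so inspect those before training):

# Rough sketch: replace pandas categorical columns with their integer codes so that
# np.stack(...).astype(np.int64) inside ColumnarDataset never sees float NaNs.
for v in cat_vars:
    df_test[v] = df_test[v].astype('category').cat.codes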


#5

I checked the df returned by proc_df, and all the categorical vars are either int8 or int16, depending on how big they are:

df_samp.dtypes

cat1         int8
cat2         int8
cat3         int8
cat4         int8
cat5         int8
cat6         int8
cat7         int8
cat8         int8
cat9         int8
cat10        int8
cat11        int8
cat12        int8
cat13        int8
cat14        int8
cat15        int8
cat16        int8
cat17        int8
cat18        int8
cat19        int8
cat20        int8
cat21        int8
cat22        int8
cat23        int8
cat24        int8
cat25        int8
cat26        int8
cat27        int8
cat28        int8
cat29        int8
cat30        int8
           ...   
cat101       int8
cat102       int8
cat103       int8

Any other suggestions?

I understand your frustration, but I think this library has a lot of potential, and us using it and discovering/fixing bugs will help make it a plug-and-play library.

Thanks for your help!


#6

I'm just following the notebook line by line (I did a git pull just yesterday after discovering the error), not using my own data.
The format of the data, at least in terms of the column data types described above, is:

df_test.dtypes.value_counts()
int8       21
float64    18
int16       1
dtype: int64

I will check the data processing of df_test again.
Btw, in the video there is no "test_df=df_test" argument passed, which is why I tried without it, and it did work.


(Jeremy Howard) #7

Rather than being rude, maybe consider helping?

If not, just avoid being rude please. A kind person submitted a patch that added this flag and functionality about 2 days ago. Being unpleasant about their hard work benefits no-one, and certainly doesn’t seem like the kind of thing likely to make people in this community want to spend their time helping you.


#8

Update 1:
I traced the PR conversation and you can find it here. I believe the is_reg flag indicates whether we want a classifier (is_reg=False) or a regressor (is_reg=True). However, both of these throw the same TypeError about a bool object not being subscriptable.

---------------------------------------------------------------------------------------------------------------------------------------------

I discovered my bug, and it was a silly mistake on my part: I had not run proc_df on my df_test, and that was what was causing the ValueError. Once I fixed it, I got the same error as the OP.

I've submitted the issue on GitHub, and hopefully someone will be able to help.
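
For anyone hitting the same thing, the fix on my side was roughly the following, mirroring how the Rossmann notebook processes its test frame. The exact proc_df keywords (na_dict, mapper) may differ between fastai versions, and the dummy 'loss' column is just so proc_df has a dependent variable to pop, so treat this as a sketch:

# Sketch: run the test frame through proc_df as well, reusing the na_dict/mapper
# returned from the training call so both frames get the same NA columns and scaling.
df_test['loss'] = 0   # dummy dependent variable for proc_df
df_test_proc, _, nas, mapper = proc_df(df_test, 'loss', do_scale=True,
                                       na_dict=nas, mapper=mapper)

md = ColumnarModelData.from_data_frame(PATH, val_idx, df_samp, y, cat_flds=cat_vars,
                                       bs=128, test_df=df_test_proc)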


(Taman) #9

Hi, I had the same issue and traced it to a mismatch in function arguments.
It seems the fast.ai team has already adjusted the necessary files, so after a git pull it should work fine now.
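
For anyone curious, the mismatch is visible in the tracebacks above; as I read it (not an official explanation), it went roughly like this:

# from_data_frames (line 63 in the tracebacks) built the test dataset with a
# positional third argument:
test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg)

# but the signature being called (line 44) is
#     def from_data_frame(cls, df, cat_flds, y=None, is_reg=True): ...
# so the boolean is_reg lands in the y slot, and __init__ later tries
# self.y[:,None] on a plain bool -> TypeError: 'bool' object is not subscriptable.
# The corrected call presumably passes it by keyword, something like:
test_ds = ColumnarDataset.from_data_frame(test_df, cat_flds, is_reg=is_reg)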


(Sneha Nagpaul) #10

So I ran into the same problem, and I'm not sure if your issue was resolved. Please let me know if it was.
From my tinkering around, I feel the problem lies in the apply_cats() call. This function takes the test_df and train_df and applies the training df's category mappings to the test df. Perhaps your dataset has an 'id' field that was treated as categorical. If the test set is 'categorized' using the training categories, none of the ids would match the training keys, and you get new NaNs: a whole column of them (the id field). Let me know how you resolved this, though. For now I'm going to pull out that functionality (apply_cats()) and skip the 'id' field for the test set.
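
A quick way to check whether that is what happened, just a diagnostic sketch using the df_test and cat_vars names from earlier in the thread:

# After apply_cats(df_test, df), any test value with no matching training
# category becomes NaN; this prints the per-column NaN counts for the cat fields.
nan_counts = df_test[cat_vars].isnull().sum()
print(nan_counts[nan_counts > 0])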