Dtype issue with get_tabular_data_from_df


(Aiden) #1

Hi all,

Thanks for the great library. I’ve been using it to look at some tabular data and I’ve run into an issue I can’t wrap my head around.

When I call get_tabular_data_from_df to get the DataBunch, it raises the following error:

TypeError: can’t convert np.ndarray of type numpy.int8. The only supported types are: double, float, float16, int64, int32, and uint8.

I traced the issue back to the dependent variable, which is being converted to an int8 by the function. I’ve tried loading the data as an int32 and as a category. If the dtype is int32, I get an error saying I can’t use .cat unless it’s on a categorical variable. However, I get the TypeError when it’s loaded as a category dtype.

Let me know if I can provide any other info.
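
For readers following along, here is a minimal sketch of where an int8 array can come from, independent of fastai. It assumes only pandas and NumPy; the column name "target" and the toy data are made up for illustration. pandas stores category codes in the smallest signed integer type that fits the number of categories, so a dependent variable with a handful of classes ends up as int8 unless it is cast explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "target" stands in for the dependent variable.
df = pd.DataFrame({"target": [0, 1, 1, 0, 2]})
df["target"] = df["target"].astype("category")

# With only a few categories, pandas stores the codes as int8.
small_codes = df["target"].cat.codes

# With 200 categories the codes no longer fit in int8, so pandas uses int16.
many = pd.Series(range(200)).astype("category")
large_codes = many.cat.codes

# An explicit cast gives a dtype that is in the supported list from the error.
codes64 = small_codes.to_numpy().astype(np.int64)

print(small_codes.dtype, large_codes.dtype, codes64.dtype)
```

The explicit cast to int64 is the same kind of fix that is suggested later in the thread.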


#2

It would be great to have more code and your full error message.


(Aiden) #3

Sure, the link to the notebook is here.

The error message is at the very bottom.


#4

Hmm, I think there might be an astype(np.int64) missing at the end of line 24 of tabular.data. Do you have a developer install of fastai? If so, could you check whether this solves the issue? Thanks.


(Aiden) #5

Thanks @sgugger. I pulled the developer version last night and tried adding it in. I now get the following error whether the dependent variable dtype is set to categorical or numeric before calling the method.

AttributeError: Can only use .cat accessor with a ‘category’ dtype

If the dtype is category before calling the method, it changes it to int64 and raises the error.
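
As a side note, this AttributeError can be reproduced with pandas alone: the .cat accessor only exists on a Series whose dtype is category. The sketch below is not the fastai code; category_codes is a hypothetical helper that converts to category first when needed, so that reading the codes is safe for either input dtype.

```python
import pandas as pd

def category_codes(series):
    """Hypothetical helper: return integer codes, converting to category if needed."""
    if not isinstance(series.dtype, pd.CategoricalDtype):
        series = series.astype("category")
    return series.cat.codes

numeric = pd.Series([3, 7, 7, 3], dtype="int64")

# Using .cat directly on a numeric Series raises AttributeError.
try:
    numeric.cat.codes
    raised = False
except AttributeError:
    raised = True

# The guarded helper works for numeric and categorical input alike.
codes_from_numeric = category_codes(numeric)
codes_from_category = category_codes(numeric.astype("category"))
```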


#6

Raises which error?


(Aiden) #7

The AttributeError. It happens whether the dependent variable is passed in as a numeric or a categorical dtype.


#8

I don’t understand. You didn’t have that error in your notebook before, and the categorical variable was causing no problem. Could you share an updated notebook?


(Aiden) #9

Hi @sgugger,

My apologies that this wasn’t clearer. I pulled the new dev version today and reran the notebook. Currently, the errors are as follows:

If dep_var is int64 on the call to TabularDataBunch.from_df and is left to be converted to category by the Categorify transform, the error is:

AttributeError: ‘CategoricalAccessor’ object has no attribute ‘astype’

If dep_var is set to a category dtype before calling TabularDataBunch.from_df, I get the same error:

AttributeError: ‘CategoricalAccessor’ object has no attribute ‘astype’

I saw in the new data.py you updated line 24 with df[dep_var].cat.astype(np.int64). I tweaked that line to df[dep_var].cat.codes.astype(np.int64). It seems to have resolved that issue, but now I get an error downstream on line 29 of data.py. The notebook with the error message is here:

I went back and checked the dtypes of the columns in cat_names and they’re all properly set as category.
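
To illustrate the difference between the two versions of that line outside of fastai: .cat is an accessor object and has no astype method, whereas .cat.codes is an ordinary integer Series that can be cast. This is a minimal sketch with made-up data, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical dependent variable with two classes.
dep = pd.Series(["no", "yes", "yes", "no"], dtype="category")

# Calling astype on the accessor itself fails.
try:
    dep.cat.astype(np.int64)
    accessor_error = None
except AttributeError as err:
    accessor_error = str(err)

# Going through .codes first yields a Series, which does support astype.
labels = dep.cat.codes.astype(np.int64)
```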


#10

Could you run %debug to find out which column is causing the problem? From your error message and the test you ran, this should work properly.
The missing .codes has been fixed.
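
Alongside %debug, one quick way to narrow down a problem column is to list every entry of cat_names whose dtype is not category. The sketch below assumes a DataFrame df and a list cat_names as in the thread; the column names and data are invented for the example.

```python
import pandas as pd

# Hypothetical data: "education" was deliberately left unconverted.
df = pd.DataFrame({
    "workclass": ["private", "state"],
    "education": ["hs", "ba"],
    "age": [30, 40],
})
df["workclass"] = df["workclass"].astype("category")
cat_names = ["workclass", "education"]

# Map each offending column to its actual dtype.
not_categorical = {
    name: str(df[name].dtype)
    for name in cat_names
    if not isinstance(df[name].dtype, pd.CategoricalDtype)
}
print(not_categorical)
```

An empty result would suggest the dtypes are fine and the problem lies elsewhere.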