Problem with cont_cat_split() in Google Colab

mgreenbe · January 8, 2021, 7:16pm

Hi,

I am using Google Colab.
I am trying to run the notebook 09_tabular.ipynb:

fastai/fastbook/blob/master/clean/09_tabular.ipynb

Starting today, I have a problem with this line:
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
It throws the following error:

TypeError Traceback (most recent call last)
in ()
----> 1 cont,cat = cont_cat_split(df, 1, dep_var=dep_var)

1 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
386 """
387 if not issubclass_(arg1, generic):
--> 388 arg1 = dtype(arg1).type
389 if not issubclass_(arg2, generic):
390 arg2 = dtype(arg2).type

TypeError: Cannot interpret 'UInt32Dtype()' as a data type

I have run this notebook before and it ran fine.
Has anybody else experienced this problem?
I can work around the first error by manually changing the data type of saleWeek to int64, but then a similar error is thrown for ProductSize, which is a "category" column.
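
In case it helps with a reproducer, here is a minimal sketch that seems to show the same failure outside the notebook. It assumes the cause is that np.issubdtype only understands NumPy dtypes, while pandas extension dtypes such as UInt32Dtype and CategoricalDtype are not NumPy dtypes. The helper name numpy_can_interpret and the toy values are made up for illustration; they are not part of fastai.

```python
import numpy as np
import pandas as pd


def numpy_can_interpret(dtype):
    """Return True if np.issubdtype accepts `dtype`, False if it raises TypeError.

    Hypothetical helper for illustration only.
    """
    try:
        np.issubdtype(dtype, np.integer)
        return True
    except TypeError:
        return False


# Toy frame (assumed values) with the two dtypes mentioned in this thread.
df = pd.DataFrame({
    "saleWeek": pd.array([1, 2, 3], dtype="UInt32"),               # pandas nullable integer
    "ProductSize": pd.Categorical(["Small", "Large", "Small"]),   # pandas categorical
    "plain_int": np.array([1, 2, 3], dtype="uint32"),              # ordinary NumPy integer
})

# Report, per column, whether NumPy can interpret the dtype at all.
for name in df.columns:
    print(name, df[name].dtype, numpy_can_interpret(df[name].dtype))
```

If that assumption holds, only plain_int should be reported as interpretable, which would match the two errors I am seeing.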
Tnx!

muellerzr · January 9, 2021, 12:12am

This comes from a PR that someone made recently. Could you open an issue for it on the fastai GitHub? (And, if possible, include a minimal reproducer of what's needed? It would be great if that were a Colab notebook.)

mgreenbe · January 9, 2021, 9:18am

I have opened an issue:

What is a PR? Pull request?
Tnx,
Maxim

sylvaint · January 11, 2021, 12:37am

If you just want to run the notebook for now:

Change saleWeek from UInt32 (the pandas nullable dtype) to uint32 (the plain NumPy dtype):
df = df.astype({'saleWeek' : 'uint32'})

Modify cont_cat_split so that it detects ProductSize, which is a CategoricalDtype, by checking the dtype's name property:

import numpy as np
from fastcore.foundation import L

def cont_cat_split(df, max_card=20, dep_var=None):
    cont_names, cat_names = [], []
    for label in df:
        if label in L(dep_var): continue

        # mod to detect ProductSize type properly: a CategoricalDtype is not
        # a NumPy dtype, so check it by name before np.issubdtype sees it
        if df[label].dtype.name == 'category':
            cat_names.append(label)
            continue

        # continuous: integers with more than max_card unique values, or any float
        if ((np.issubdtype(df[label].dtype, np.integer) and
             df[label].unique().shape[0] > max_card) or
            np.issubdtype(df[label].dtype, np.floating)):
            cont_names.append(label)
        else: cat_names.append(label)
    return cont_names, cat_names
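
If you would rather not cast saleWeek at all, a possible alternative is to let pandas do the dtype checks, since pandas.api.types.is_integer_dtype and is_float_dtype accept both NumPy and pandas extension dtypes. This is only a sketch under that assumption, not necessarily what the upstream fix does; the function name cont_cat_split_sketch and the toy DataFrame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    """Hypothetical variant: return (continuous names, categorical names)."""
    # Normalise dep_var (None, a single name, or a list of names) into a set.
    if dep_var is None:
        skip = set()
    elif isinstance(dep_var, str):
        skip = {dep_var}
    else:
        skip = set(dep_var)

    cont_names, cat_names = [], []
    for label in df.columns:
        if label in skip:
            continue
        col = df[label]
        # Integers count as continuous only when they have more than max_card
        # distinct values (missing values included, like unique().shape[0]).
        high_card_int = (is_integer_dtype(col.dtype)
                         and col.nunique(dropna=False) > max_card)
        if is_float_dtype(col.dtype) or high_card_int:
            cont_names.append(label)
        else:
            # Categorical, object, and low-cardinality integer columns land here.
            cat_names.append(label)
    return cont_names, cat_names


# Toy data (assumed values) using the dtypes that caused trouble in this thread.
df = pd.DataFrame({
    "saleWeek": pd.array([1, 2, 3, 4], dtype="UInt32"),
    "ProductSize": pd.Categorical(["Small", "Large", "Small", "Mini"]),
    "price": [1.5, 2.5, 3.5, 4.5],
    "SalePrice": [10.0, 11.0, 12.0, 13.0],
})
cont, cat = cont_cat_split_sketch(df, max_card=1, dep_var="SalePrice")
print(cont, cat)
```

With this version the nullable UInt32 column is treated like any other integer column, and the categorical column falls through to the categorical list without a special case.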

mgreenbe · January 11, 2021, 6:28am

Great, tnx!

chrismilleruk · January 13, 2021, 11:24pm

New issue and PR here, please review and let me know if you need any changes.

Chijioke · January 16, 2021, 9:08am

Hey @sylvaint, I did as you instructed, but I still get the same error:

TypeError: Cannot interpret 'CategoricalDtype(categories=['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'], ordered=True)' as a data type

sylvaint · January 17, 2021, 1:17am

It sounds like you did not redefine the function cont_cat_split.
Just copy, paste, and run the code above before using the function.

Chijioke · January 17, 2021, 6:17pm

@sylvaint Initially, I had redefined the function cont_cat_split inside the script core.py. This time I copied the function from the script and pasted it into my own code to redefine it there, and it worked. Thanks.

lymenlee · January 25, 2021, 2:21am

Thanks, @sylvaint. This solved my issue.

aberres · February 18, 2021, 12:32pm

Do I understand correctly that the fix (checking the dtype name for categories) is not upstream yet? Is there an open pull request?

aberres · February 18, 2021, 3:12pm

Answering my own question: the issue is fixed; it is just that the current PyPI release is in reality still 2.2.3: https://github.com/fastai/fastai/issues/3220

When installed from git, things work nicely for me. Thanks!