TypeError: an integer is required (got type NoneType) with tabular data

I have the code below, and the problem appears when I try to fit it. It says:
TypeError: an integer is required (got type NoneType)
The data is from the Kaggle House Price Prediction competition.
Fastai version: 1.0.34

from fastai import *
from fastai.tabular import *


df = pd.read_csv('train.csv')

dep_var = 'SalePrice'
cat_names = []
cont_names = []
for label in df:
    if label == 'SalePrice':
        continue
    if len(set(df[label])) > 30 and df[label].dtype != object:
        cont_names.append(label)
    else:
        cat_names.append(label)
procs = [FillMissing, Categorify, Normalize]


data = (TabularList.from_df(df, path='', cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(range(800,1000))
                           .label_from_df(cols=dep_var)
                           .databunch())


data.show_batch(rows=10)

learn = tabular_learner(data, layers=[200,100], metrics=accuracy)


learn.model

learn.fit_one_cycle(1, 1e-2)

Hello @Lankinen,

The problem is that the definition of TabularDataBunch expects the dependent variable to be categorical:

class TabularDataBunch(DataBunch):
    "Create a `DataBunch` suitable for tabular data."

    @classmethod
    def from_df(cls, path, df:DataFrame, dep_var:str, valid_idx:Collection[int], procs:OptTabTfms=None,
                cat_names:OptStrList=None, cont_names:OptStrList=None, classes:Collection=None, 
                test_df=None, **kwargs)->DataBunch:
        "Create a `DataBunch` from `df` and `valid_idx` with `dep_var`."
        cat_names = ifnone(cat_names, []).copy()
        cont_names = ifnone(cont_names, list(set(df)-set(cat_names)-{dep_var}))
        procs = listify(procs)
        src = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx)
                           .label_from_df(cols=dep_var, classes=classes))
        if test_df is not None: src.add_test(TabularList.from_df(test_df, cat_names=cat_names, cont_names=cont_names,
                                                                 processor = src.train.x.processor))
        return src.databunch(**kwargs)

Pay attention to this part:

src = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
                           .split_by_idx(valid_idx)
                           .label_from_df(cols=dep_var, classes=classes))

Possible Solution 1

So you need to pass the dependent variable and the list of its possible values in the classes parameter. To do that, you would have to convert your dependent variable into ranges and pass those as classes (see the sketch below the docs example):

https://docs.fast.ai/tabular.html#tabular

You will see that the dependent variable there is the column '>=50k' and its values are 0 and 1.
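A minimal sketch of what binning SalePrice into ranges could look like (the SalePrice_range column name and the choice of 10 bins are mine, purely for illustration):

import pandas as pd

# bin the continuous target into ranges so it can be treated as categories
df['SalePrice_range'] = pd.cut(df['SalePrice'], bins=10)
classes = list(df['SalePrice_range'].cat.categories)  # ordered list of the bins

# then label with the binned column and pass the bins as classes:
# .label_from_df(cols='SalePrice_range', classes=classes)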

Possible Solution 2

You would need to implement a regression model yourself.

I found a temporary solution for your problem …

pip install fastai==1.0.36

I found a definitive solution; it works even with version 1.0.39:

# after loading the dataset, grab the targets and make a sorted list of unique values
classes = df['SalePrice'].unique()
classes.sort()

# later, pass that list so the targets are treated as categorical values:
.label_from_df(cols=dep_var, classes=classes)
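Putting that together with the data block from the original post, a sketch (reusing the df, dep_var, cat_names, cont_names and procs already defined there):

classes = df['SalePrice'].unique()
classes.sort()

data = (TabularList.from_df(df, path='', cat_names=cat_names, cont_names=cont_names, procs=procs)
                   .split_by_idx(range(800, 1000))
                   .label_from_df(cols=dep_var, classes=classes)  # pass the unique targets as classes
                   .databunch())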

I tried doing the same because I can’t get the Rossman file to work:

classes = df['Sales'].unique()
classes.sort()

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
.random_split_by_pct(valid_pct = 0.2)
.label_from_df(cols=dep_var, classes = classes)
.databunch())

But I get the error:

----> 3 .label_from_df(cols=dep_var, classes = classes)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Have you seen it before?

Hi @Mauro, try this:

.label_from_df(cols=dep_var, label_cls=FloatList)

Based on this conversation, that should be the correct way to handle regression.
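In the full data block that would go roughly here (a sketch, assuming the same df, cat_vars, cont_vars and procs from your notebook):

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .random_split_by_pct(valid_pct=0.2)
                   .label_from_df(cols=dep_var, label_cls=FloatList)  # treat the target as floats for regression
                   .databunch())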

Thanks, but even with .label_from_df(cols=dep_var, label_cls=FloatList)

the same error still appears: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Could you post your code? Otherwise I have no clue about it.

I just uploaded it to gist

Just looking into Rossman.ipynb:

cat_vars = [
            'Store', 
            'DayOfWeek', 
            'Date', 
            'StateHoliday', 
            #'SchoolHoliday', # this is a continuous value and it's already in cont_vars
            'CompetitionOpenSinceMonth',
            'CompetitionOpenSinceYear', 
            'Promo2SinceWeek', 
            'StoreType', 
            'Assortment', 
            'PromoInterval', 
            'Promo2SinceYear'
]

cont_vars = [
    'CompetitionDistance', 
    #'StateHoliday', # looking at its unique() values shows this column actually holds categorical values
    'Promo', 
    'SchoolHoliday'
]

and your split method is causing the problem. I used this one instead:

size = len(df)
idx = range(size-1000, size)
#.random_split_by_pct(valid_pct = 0.2) # splitting training and validation sets
data =(TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(idx)
                   .label_from_df(cols=dep_var, cls=FloatList)
                   .databunch())

Take a look at this:
df.dtypes

Store                          int64
DayOfWeek                      int64
Date                          object
StateHoliday                  object
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2SinceWeek              float64
StoreType                     object
Assortment                    object
PromoInterval                 object
Promo2SinceYear              float64
CompetitionDistance          float64
Promo                          int64
SchoolHoliday                  int64
Sales                          int64
dtype: object

But look how confusing this is...
This one works:

classes = df['Sales'].unique()
classes.sort()
print(classes)

data =(TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .random_split_by_pct(valid_pct=0.2)
        #.split_by_idx(idx)
        #.label_from_df(cols=dep_var, cls=FloatList)
       .label_from_df(cols=dep_var, classes=classes)
        .databunch())

Hi @Mauro,

Try this one:

cat_vars = [
            'Store', 
            'DayOfWeek', 
            'Date', 
            'StateHoliday', 
            #'SchoolHoliday', # this is a continuous value and it's already in cont_vars
            'CompetitionOpenSinceMonth',
            'CompetitionOpenSinceYear', 
            'Promo2SinceWeek', 
            'StoreType', 
            'Assortment', 
            'PromoInterval', 
            'Promo2SinceYear'
]

cont_vars = [
    'CompetitionDistance', 
    #'StateHoliday', # looking at its unique() values shows this column actually holds categorical values
    'Promo', 
    'SchoolHoliday'
]

df['Sales'] = df['Sales'].astype('float64')

data =(TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .random_split_by_pct(valid_pct=0.2)
        .label_from_df(cols=dep_var, cls=FloatList)
        .databunch())

Thanks. This worked. I think the mistake was that I had the same column as both categorical and continuous. Great catch!
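For anyone else hitting this, a quick sanity check before building the data block could be something like:

# the two feature lists should be disjoint, and neither should contain the target
overlap = set(cat_vars) & set(cont_vars)
assert not overlap, f'columns listed as both categorical and continuous: {overlap}'
assert dep_var not in cat_vars and dep_var not in cont_vars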

Hi again :slight_smile:

Yes, one of the problems was the one you mentioned, the duplication of SchoolHoliday.
Another was StateHoliday, which is mostly zeros and ones with some 'a', 'b', 'c' values mixed in that need to be cleaned. A third was that Sales is an integer, so .random_split_by_pct didn't work as-is; first I tested taking the unique values and passing them as classes, which worked, and later I discovered that simply changing the type of Sales from int64 to float64 also worked.

You will probably find another problem depending on the model you use: training will work fine, but the validation step will fail when calculating the metric, with something complaining that it expected a LongTensor instead of a FloatTensor. That is something to discuss with the fastai team. :slight_smile:
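If you hit that, one workaround to sketch is swapping accuracy for a regression metric when creating the learner (exp_rmspe is the one the fastai Rossmann material uses; any float-valued metric should avoid the LongTensor complaint):

# accuracy compares predicted class indices (a LongTensor) against the targets,
# which cannot work for float targets; a regression metric operates on FloatTensors
learn = tabular_learner(data, layers=[200, 100], metrics=exp_rmspe)
learn.fit_one_cycle(1, 1e-2)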


I’ve tried these methods to get regression to work with the tabular learner:

  1. If you set

df[dep_var] = df[dep_var].astype('float64') and then cls=FloatList

your training loss is going to be on the order of 1e7 (around 50,000,000). This happens with or without log=True, and I tested it with both the Rossman and the House Prices data.

  2. If you set classes = list(df['SalePrice'].unique()),
    then classes=classes
    and then metrics=fbeta or metrics=[accuracy_thresh]

you will get this error when running learn.fit:

RuntimeError: The size of tensor a (663) must match the size of tensor b (64) at non-singleton dimension 1

Have you made a regression model that works with the latest version of fastai? If so, could you share the code?

Edit: I think the problem comes from how the cont and cat columns are assigned.


In fact I didn’t try it; I just put together a setup to help you out at the time. Maybe in a few days I’ll have time to test it and let you know.
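In the meantime, the direction I would test first is the lesson-style labelling with log=True, so the model predicts log(Sales) and the loss stays in a smaller range. This is only a sketch (using label_cls as suggested earlier, with the same cat_vars, cont_vars and procs), not something I have verified against the losses you are seeing:

df['Sales'] = df['Sales'].astype('float64')

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(range(len(df) - 1000, len(df)))  # hold out the last 1000 rows
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)  # regression on log(Sales)
                   .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=exp_rmspe)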