Making a databunch out of a TabularList crashing

talkingtoaj · November 30, 2019, 8:26am

I’m having trouble making a databunch from a TabularList.

I get the error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Unfortunately, fastai doesn’t give much useful debugging information about why exactly this is crashing.

My dataset itself is fine: I’ve run it through sklearn without problems, for both decision trees and neural networks.

This is my code (I’ve split the stages of making a databunch up to determine where the error lies):

    dep_var = 'price'
    cat_names = ['isitma','aidat_provided','gayrimenkul','kimden','krediye_uygun','kullanim_durumu']
    cont_names = ['m2_net','oda','salon','banyo','balkon','bina_age','kat_loc','kat_rel','aidat','brut_pc','map_pin_lat','map_pin_lon','number_photographs','timestamp']
    procs = [FillMissing, Categorify, Normalize]
    db1 = TabularList.from_df(path=data_folder, df=data, cat_names=cat_names, cont_names=cont_names, procs=procs)
    db2 = db1.split_from_df(col='is_valid')
    db3 = db2.label_from_df(cols=dep_var)
    db = db3.databunch()

It crashes at the .label_from_df() stage. Digging into the fastai functions, I’ve been able to determine that the problem is occurring here:

(runstack):
    data_block : _inner (line 480)
        self.process()
    LabelLists.process (line 534)
        for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
    LabelList.process (line 700)
        self.y.process(yp)
    ItemList.process (line 84)
        for p in self.processor: p.process(self)
    CategoryProcessor.process (line 351)
        super().process(ds)
    PreProcessor.process (line 53)
        def process(self, ds:Collection):        ds.items = array([self.process_one(item) for item in ds.items])
    core.py array (line 299)
        return np.array(a, dtype=dtype, **kwargs)

    TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Inside PreProcessor.process, ds is a fastai.data_block.CategoryList object, and I can see that ds.items = array([1750000, 970000, 950000, 860000, ..., 840000, 820000, 870000, 930000], dtype=object)

But running self.process_one(item) on each element of ds.items produces [181, 143, 141, None, 182, 162, 186, 135, 153, 168, 166, 135, 130, None, ...], and those None values are what crash the np.array call.

My guess is that it is taking the raw values of the y label column ‘price’ and categorifying them (which is a little odd, since I didn’t list it in cat_names; after all, it is the y label), and that for some reason, perhaps their infrequency, some raw label values have no corresponding category.
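
The guess above can be illustrated with a toy encoder in plain Python. This is only a sketch under an assumption not confirmed in this thread: that the category vocabulary is built from the training labels alone, so a label that appears only in the validation rows has no index. The names train_labels, valid_labels, c2i and encode are made up for the illustration and are not fastai code.

```python
# Toy stand-in for a category encoder; NOT fastai's actual implementation.
train_labels = [1750000, 970000, 950000, 840000]
valid_labels = [970000, 860000, 840000]  # 860000 never appears in training

# Assumption: the label -> index vocabulary comes from training labels only.
classes = sorted(set(train_labels))
c2i = {label: idx for idx, label in enumerate(classes)}

def encode(label):
    """Return the class index, or None when the label is not in the vocabulary."""
    return c2i.get(label)

encoded_valid = [encode(label) for label in valid_labels]
print(encoded_valid)  # [2, None, 0]

# Forcing the encoded labels into integers then fails on the None entry.
conversion_error = None
try:
    as_ints = [int(code) for code in encoded_valid]
except TypeError as err:
    conversion_error = err
    print("TypeError:", err)
```

With prices as labels, almost every value is unique, so under this assumption many validation prices would be missing from the vocabulary and end up as None.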

Any thoughts on what is wrong? Not every attempt at databunching a TabularList fails, so presumably there is something a little unusual about what I am doing.

muellerzr · November 30, 2019, 1:01pm

First, you’ll want to use a label_cls=FloatList like in the Rossmann example, as this tells the databunch to do regression instead of classification. Second, are you missing any y values in your data frame?

talkingtoaj · December 2, 2019, 9:41am

Yep, that did the trick.

I’m not aware of the Rossmann example; perhaps that’s a reference to a later lesson I haven’t covered yet. For the sake of other readers, it is worth reading this page in the fastai documentation: https://docs.fast.ai/data_block.html#Step-3:-Label-the-inputs

If my labels had been floats, fastai’s datablock would automatically have treated this as a regression problem, but because they were integers it defaulted to treating them as categories.
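
As a rough illustration of that default, here is a simplified stand-in for a type-based rule. It is not fastai’s inference code; guess_label_kind is a hypothetical helper that only mirrors the behaviour described above (floats read as regression targets, integers and strings read as categories).

```python
from numbers import Integral

def guess_label_kind(value):
    """Hypothetical helper: pick a label kind from the type of one label value."""
    if isinstance(value, float):
        return "regression"  # floats are read as continuous targets
    if isinstance(value, (str, Integral)):
        return "category"    # strings and integers are read as class labels
    raise TypeError(f"unsupported label type: {type(value).__name__}")

print(guess_label_kind(1750000))    # category
print(guess_label_kind(1750000.0))  # regression
```

Passing label_cls=FloatList sidesteps any such guess by stating the label type explicitly.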

So my final code, which works correctly, reads as follows:

    db1 = TabularList.from_df(path=data_folder, df=data, cat_names=cat_names, cont_names=cont_names, procs=procs)
    db2 = db1.split_from_df(col='is_valid')
    db3 = db2.label_from_df(cols=dep_var, label_cls=FloatList)
    db = db3.databunch()