I’m having trouble making a databunch from a TabularList.
I get the error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Unfortunately, fastai does not give much useful debugging information about why exactly this is crashing.
My dataset is fine, as I’ve run it through sklearn without problems, for both decision trees and neural networks.
This is my code (I’ve split the stages of making a databunch up to determine where the error lies):
from fastai.tabular import *  # fastai v1; data and data_folder are defined earlier
dep_var = 'price'
cat_names = ['isitma','aidat_provided','gayrimenkul','kimden','krediye_uygun','kullanim_durumu']
cont_names = ['m2_net','oda','salon','banyo','balkon','bina_age','kat_loc','kat_rel','aidat','brut_pc','map_pin_lat','map_pin_lon','number_photographs','timestamp']
procs = [FillMissing, Categorify, Normalize]
db1 = TabularList.from_df(path=data_folder, df=data, cat_names=cat_names, cont_names=cont_names, procs=procs)
db2 = db1.split_from_df(col='is_valid')
db3 = db2.label_from_df(cols=dep_var)
db = db3.databunch()
So it crashes at the .label_from_df() stage. Digging into the fastai functions, I’ve been able to determine the problem is occurring here:
(runstack):
data_block: _inner (line 480)
self.process()
LabelLists.process (line 534)
for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
LabelList.process (line 700)
self.y.process(yp)
ItemList.process (line 84)
for p in self.processor: p.process(self)
CategoryProcessor.process (line 351)
super().process(ds)
PreProcessor.process (line 53)
def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
core.py array (line 299)
return np.array(a, dtype=dtype, **kwargs)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
So, I can see from PreProcessor.process, where ds is a fastai.data_block.CategoryList object, that ds.items = array([1750000, 970000, 950000, 860000, ..., 840000, 820000, 870000, 930000], dtype=object)
But running self.process_one(item) on each element of ds.items produces [181, 143, 141, None, 182, 162, 186, 135, 153, 168, 166, 135, 130, None, ...], which introduces the None values that crash the np.array function.
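To show what I think is happening, here is a minimal sketch in plain Python. It is not fastai's actual code: the names c2i, train_prices and valid_prices are made up for illustration, and it assumes the label-to-index mapping is a dictionary lookup that returns None for any value it has not seen.

```python
# Hypothetical stand-in for the label processor (an assumption, not fastai code):
# a dict built from some labels, looked up with .get(), which yields None on a miss.
train_prices = [1750000, 970000, 950000]
valid_prices = [970000, 860000]  # 860000 never appears in train_prices

classes = sorted(set(train_prices))
c2i = {price: idx for idx, price in enumerate(classes)}

# Known labels map to an index, unknown labels map to None.
encoded = [c2i.get(price) for price in valid_prices]
print(encoded)  # [1, None]

# Converting the result to integers then fails on the None entry.
error_message = None
try:
    codes = [int(code) for code in encoded]
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```

If this is roughly right, the crash only needs a single label value that is missing from the mapping.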
My guess is that it is taking the raw values of the y label column ‘price’ and categorifying them (which is a little odd, since I didn’t list it in cat_names; after all, it is the y label), and that for some reason (infrequency?) some raw label values have no corresponding category.
Any thoughts on what is wrong? And why don’t all attempts at databunching a TabularList fail? Presumably there is something a little unique about what I am doing.
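One way I could test this guess is to list the validation prices that never occur among the training rows. The sketch below uses plain Python on a list of dicts; unseen_valid_labels is a hypothetical helper of my own, the sample rows are made up, and it assumes (unconfirmed) that categories are built from the training split only.

```python
def unseen_valid_labels(rows, label_col="price", valid_col="is_valid"):
    """Return validation labels that never appear in the training rows."""
    train_labels = {row[label_col] for row in rows if not row[valid_col]}
    valid_labels = {row[label_col] for row in rows if row[valid_col]}
    return sorted(valid_labels - train_labels)

# Made-up sample rows with the same column names as my DataFrame.
rows = [
    {"price": 1750000, "is_valid": False},
    {"price": 970000, "is_valid": False},
    {"price": 970000, "is_valid": True},
    {"price": 860000, "is_valid": True},
]

missing = unseen_valid_labels(rows)
print(missing)  # [860000]
```

With a real DataFrame the rows could come from data.to_dict('records'). A non-empty result would be consistent with the None values I am seeing, though it would not prove that this is the cause.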