It seems TabularDataLoaders.from_df() fails if one of the categorical variables has actual None values (my data had this for some strange reason).
To recreate:
df = pd.DataFrame({'a':[1,2,None], 'b':[3,4,'tmp']})
df.iloc[2,1] = None # pandas seems to cast None to NaN in the constructor
CategoryMap(df['a'], add_na=True) # works fine
CategoryMap(df['b'], add_na=True) # gives an error
The last line gives:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-d98dc3407d7a> in <module>
----> 1 CategoryMap(df['b'], add_na=True)
~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/transforms.py in __init__(self, col, sort, add_na)
209 # `o==o` is the generalized definition of non-NaN used by Pandas
210 items = L(o for o in col.unique() if o==o)
--> 211 if sort: items = items.sorted()
212 self.items = '#na#' + items if add_na else items
213 self.o2i = defaultdict(int, self.items.val2idx()) if add_na else dict(self.items.val2idx())
~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/foundation.py in sorted(self, key, reverse)
360 elif isinstance(key,int): k=itemgetter(key)
361 else: k=key
--> 362 return self._new(sorted(self.items, key=k, reverse=reverse))
363
364 @classmethod
TypeError: '<' not supported between instances of 'NoneType' and 'int'
This is perhaps best fixed in preprocessing with df.fillna(value=np.nan), which gets rid of the None values?
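For completeness, here is that workaround applied to the repro above (fillna(np.nan) coerces the leftover None to NaN, which CategoryMap then filters out via its o==o check):

    import numpy as np
    import pandas as pd
    from fastai2.data.transforms import CategoryMap

    df = pd.DataFrame({'a':[1,2,None], 'b':[3,4,'tmp']})
    df.iloc[2,1] = None               # leaves a real None in the object column
    df = df.fillna(value=np.nan)      # coerce None -> NaN
    CategoryMap(df['b'], add_na=True) # no longer raises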
I am trying to follow the MultiCategory examples in https://github.com/fastai/fastai2/blob/master/nbs/40_tabular.core.ipynb. My particular dataset is formatted similarly to the “not one hot encoded” section that contains _mock_multi_label (it’s formatted like its output). I managed to get it working by following the one-hot-encoded labels and did something like this:
import numpy as np

# build the full label vocabulary from the space-delimited label column
vals = merged_df[y_names].unique()
c = []
for val in vals:
    c += val.split(' ')
c = list(set(c))

def _mock_multi_label(df, classes):
    # one boolean column per class, built row by row
    targ_dict = {cls: [] for cls in classes}
    for row in df.itertuples():
        labels = row.action.split(' ')
        for cls in classes:
            targ_dict[cls].append(cls in labels)
    for cls in classes:
        df[cls] = np.array(targ_dict[cls])
    return df

df_main = _mock_multi_label(merged_df, c)
@EncodedMultiCategorize
def encodes(self, to:Tabular): return to

@EncodedMultiCategorize
def decodes(self, to:Tabular):
    to.transform(to.y_names, lambda c: c==1)
    return to
to = TabularPandas(merged_df, procs=[], cat_names=[], cont_names=cont_names,
                   y_names=c, y_block=MultiCategoryBlock(encoded=True, vocab=c), splits=splits)
This builds my DataLoaders just fine. From there, to avoid an issue, we need to set dls.c to len(c): dls.c = len(c).
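For reference, a minimal sketch of those last steps (batch size and layer sizes here are arbitrary placeholders, not anything prescribed):

    dls = to.dataloaders(bs=64)
    dls.c = len(c)  # otherwise the model head gets the wrong output size
    learn = tabular_learner(dls, layers=[200,100], metrics=accuracy_multi)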
I feel this process shouldn’t be this tedious; let me know if you have any ideas. (Should the encodes/decodes be in the actual library too?)
I guess my question is: will we ever support non-encoded multi-categorization, instead of having to pre-process the one-hot encoding ourselves? Similar to how in the PLANET example we can pass a delimiter to get_y. (I understand the two are separate from just MultiCategoryBlock, so this may not be straightforward to do.)
For those interested, I’m working on getting fastai2 tabular to support multiple datatypes. This is based on my NumPy tutorial, and as we go along I think we’ll find a good way to integrate it into the ecosystem, towards something we could possibly push to the main repo. If you’re interested in helping, see here:
Currently we’re looking at NumPy, cuDF, and others.
I’m trying to use tabular for a regression problem but am confused about the prediction output of the learner.
When I use learn.show_results(), I see the result in the y_pred column, but when I use learn.get_preds() it outputs 2 tensors and I can’t find documentation on what these are. Are they supposed to be the upper and lower bounds of a confidence interval for the predictions?
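For concreteness, this is the call (no arguments, so as far as I can tell it defaults to the validation set):

    preds, other = learn.get_preds()   # two tensors come back; the second is the one I can't identify
    print(preds.shape, other.shape)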
I think there is a mistake in the code from fastbook/09_tabular.
In the image, I ran the code with the red underline and got an error.
I think it should refer to xs_filt …
I have a random forest model and used TabularPandas() to do the normal preprocessing (Normalize, FillMissing and Categorify). Now I would like to share the model and the preprocessing with a partner without sharing the data. This example is listed in the documentation:
But that requires saving the entire to object, which includes the data. I couldn’t find a way of just grabbing the procs from the TabularPandas object - any thoughts on this?
Yes, that works - thanks a lot! A 4-minute response time is also OK, I guess.
For reference, I ended up doing this:
to = TabularPandas(df_main, procs, ...)
to_export = to.new_empty()  # to_export only has the preprocessing metadata (e.g. means/stds)
to_tst = to_export.new(df_test)
to_tst.process()            # to_tst can now be passed to the model
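And to actually get predictions from the processed frame (assuming a scikit-learn-style random forest m, which is a placeholder here; to_tst.xs holds the processed feature columns):

    preds = m.predict(to_tst.xs)  # m stands in for the shared random forest model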
As seen in the attached pic, the left image shows the layers generated by v1 Tabular, while the right image shows the layers generated by v2. See my annotations in the pic. Are the differences a bug or a new feature?
Other than the above difference, I noticed that the LB score from v1 Tabular is about 0.3 points higher than the v2 version.
Hi. I’m currently working on a multiclass classification problem with tabular data.
I’m using TabularPandas to prepare the data, but I want a specific mapping from target class to integer, so I pass y_block=CategoryBlock(vocab=my_mapping, sort=False) as an argument to TabularPandas. I have noticed that the resulting transformed target does not match the my_mapping vocab.
If I’m correct, the cause is that when reduce_memory=True in TabularPandas, df_shrink is called and transforms the target variable (if object dtype) independently of the y_block specified. I’m not sure if this is intentional, and if not, what the best alternative is, but a note about this in the docs or code might be helpful in the future.
I’m also willing to help where I can (I’m new to v2) if this is something that needs fixing.
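For anyone hitting the same thing, a minimal sketch of the workaround I’m using in the meantime (df, cat_names, cont_names, and splits are placeholders; reduce_memory=False skips the df_shrink call, so the target column reaches the CategoryBlock untouched):

    to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                       cat_names=cat_names, cont_names=cont_names, y_names='target',
                       y_block=CategoryBlock(vocab=my_mapping, sort=False),
                       splits=splits, reduce_memory=False)  # avoid df_shrink touching the target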
I am also working on a similar problem. When I pass y_block=CategoryBlock to either TabularPandas or tabular_learner, the model can no longer be trained and fails with ValueError: Expected input batch_size (64) to match target batch_size (13184). I also can’t find an example of using TabularPandas for classification anywhere.
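Here’s roughly the shape of what I’m running (column names and procs are placeholders for my data):

    to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                       cat_names=cat_names, cont_names=cont_names, y_names='label',
                       y_block=CategoryBlock(), splits=splits)
    dls = to.dataloaders(bs=64)
    learn = tabular_learner(dls, metrics=accuracy)
    learn.fit_one_cycle(1)  # fails here with the batch_size mismatch above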