TabularDataBunch to Pandas Dataframe

shub.chat · November 20, 2018, 7:43pm

For a tabular data problem, how can I convert the TabularDataBunch we create to a pandas DataFrame?

That would make it possible to use the same normalized and cleaned data to check efficacy across other algorithms (RF, XGB…).

apalepu23 · December 26, 2018, 3:48pm

I was wondering the same thing. Did you end up figuring this one out? I noticed there was a function like data.train_ds.to_df(), but it wasn’t quite working for me, so I figured I was doing something wrong… Thanks!

shub.chat · January 9, 2019, 5:54am

Sorry mate, I have not tried it so far. I will let you know if I find out something.

kachun1017 · March 1, 2019, 4:49am

Yeah, I got a similar error. I will check it out and let you know.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-826e21acd938> in <module>()
----> 1 data.train_dl.to_df()

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in to_df(self)
    605     def to_df(self)->None:
    606         "Create `pd.DataFrame` containing `items` from `self.x` and `self.y`."
--> 607         return pd.DataFrame(dict(x=self.x._relative_item_paths(), y=[str(o) for o in self.y]))
    608 
    609     def to_csv(self, dest:str)->None:

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_paths(self)
    129 
    130     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 131     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    132 
    133     def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
    129 
    130     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 131     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    132 
    133     def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_path(self, i)
    128         return cls.from_df(df, path=path, cols=cols, processor=processor, **kwargs)
    129 
--> 130     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
    131     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    132 

AttributeError: 'int' object has no attribute 'relative_to'

kechan · May 3, 2019, 5:59pm

I am also interested in this. I like to use the data block API to quickly grab a fastai dataset and do the default preprocessing. After that, I want to get the data back as either a pandas DataFrame or an np.array, so I can continue training on it with other libraries like scikit-learn, xgboost, keras, etc.

So far, I have found no ready-made method. According to the docs, to_df() doesn’t seem to return anything, so it is probably an internal operation and not what you want.

This isn’t a full answer, but you can get the raw data back like this:

for x in data.train_ds.x:
    print(x.data)

which would be a PyTorch tensor. I think we may just have to convert this to a NumPy array manually and construct the pandas DataFrame from it.

If anyone finds a better way, please post.
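A minimal sketch of that manual route, for illustration only. It assumes each row can be reduced to a pair of array-likes (categorical codes, continuous values); the helper name rows_to_frame and the column names are made up, and the toy rows are stand-ins rather than real fastai output. np.asarray should also accept CPU PyTorch tensors, so no fastai or torch import is needed to try it.

```python
import numpy as np
import pandas as pd

def rows_to_frame(rows, cat_names, cont_names):
    """Stack per-row (cats, conts) pairs into a single DataFrame.

    Each element of a pair only needs to be convertible with `np.asarray`
    (NumPy arrays, lists, and CPU tensors should all qualify).
    """
    rows = list(rows)  # allow generators, since we iterate twice
    cats = np.stack([np.asarray(r[0]) for r in rows])
    conts = np.stack([np.asarray(r[1]) for r in rows])
    cat_df = pd.DataFrame(cats, columns=cat_names)
    cont_df = pd.DataFrame(conts, columns=cont_names)
    # Separate frames keep integer codes and float values in their own dtypes
    return pd.concat([cat_df, cont_df], axis=1)

# Stand-in rows. With fastai this might come from something like
# `[x.data for x in data.train_ds.x]` (an assumption, not verified here).
rows = [
    (np.array([1, 3]), np.array([0.5, -1.2], dtype=np.float32)),
    (np.array([2, 1]), np.array([1.5, 0.3], dtype=np.float32)),
]
df = rows_to_frame(rows, ["workclass", "education"], ["age", "hours"])
print(df)
```

This still touches every row once, so it may be slow on large datasets.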

TomB · May 3, 2019, 7:02pm

I was also looking at this. As noted, DataLoader.to_df gives an error, as it seems to expect the data to come from files. You can access DataBunch.train_ds.inner_df to get a DataFrame with some of the processing applied: categorical variables have been converted to categorical dtypes, but they have not been numericalised yet. I believe missing values should be handled here as well, but I didn’t fully check this. If categorical dtypes are enough, you could just use that.
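If you only have a frame with categorical dtypes and a downstream library wants integers, pandas can numericalise it directly. This is a rough sketch on a toy frame, not fastai's own processing; the helper name numericalise_categoricals and its offset parameter are made up for illustration. Pandas marks missing categories with the code -1. I believe fastai shifts its codes by one so that 0 stands for missing, which offset=1 would imitate, but treat that as an assumption.

```python
import pandas as pd

def numericalise_categoricals(df, offset=0):
    """Return a copy of `df` with every categorical column replaced by integer codes."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # pandas uses -1 for missing values; `offset` shifts all codes
            out[col] = out[col].cat.codes + offset
    return out

# Toy stand-in for a frame such as `inner_df`
toy = pd.DataFrame({
    "workclass": pd.Categorical(["private", "state", None, "private"]),
    "age": [39.0, 50.0, 38.0, 53.0],
})
encoded = numericalise_categoricals(toy)
print(encoded)
```

Codes produced this way depend on the categories seen in the frame, so the same category list has to be reused for validation and test data.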
Or, to get the data once it has been numericalised, I used:

import numpy as np
import pandas as pd

def get_proc_df(tll):
    """Get processed xs and ys from a tabular `LabelList` with a single value per label, such as `FloatList`.
       For example from `TabularDataBunch.train_ds`.
       :param tll: A tabular `LabelList`.
       :returns: A tuple of `(x, y)` where `x` is a pandas `DataFrame` and `y` is a numpy array.
    """
    # Join the numericalised categorical codes and the processed continuous values column-wise
    x_vals = np.concatenate([tll.x.codes, tll.x.conts], axis=1)
    x_cols = tll.x.cat_names + tll.x.cont_names
    # Reorder the columns to match the original DataFrame
    ordered_cols = [c for c in tll.inner_df.columns if c in x_cols]
    x_df = pd.DataFrame(data=x_vals, columns=x_cols)[ordered_cols]
    # Reconstruct ys to apply log if specified
    y_vals = np.array([i.obj for i in tll.y])
    return x_df, y_vals

This avoids having to recreate the data row by row by using the fully processed columns in tll.x.codes and tll.x.conts.
As noted, this likely won’t work if your label isn’t a single float for regression (though any other label type that np.array will happily build an array from should work).
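One side effect worth knowing: concatenating an integer array with a float array gives a single float array, so the category codes come back as floats. If that matters for the downstream model, a variant along these lines keeps each block's dtype. It is only a sketch on stand-in arrays; the function name assemble_frame and the column names are invented for the example, and nothing here calls fastai.

```python
import numpy as np
import pandas as pd

def assemble_frame(codes, conts, cat_names, cont_names, column_order):
    """Join code and continuous blocks column-wise, keeping each block's dtype."""
    cat_df = pd.DataFrame(codes, columns=cat_names)
    cont_df = pd.DataFrame(conts, columns=cont_names)
    joined = pd.concat([cat_df, cont_df], axis=1)
    # Keep the original column order, skipping columns (e.g. the label) not in x
    wanted = set(cat_names) | set(cont_names)
    return joined[[c for c in column_order if c in wanted]]

# Stand-ins for the processed code and continuous arrays
codes = np.array([[1, 2], [3, 1]], dtype=np.int64)
conts = np.array([[0.5], [1.5]], dtype=np.float32)

frame = assemble_frame(
    codes, conts,
    cat_names=["workclass", "education"],
    cont_names=["age"],
    column_order=["age", "workclass", "education", "salary"],
)
print(frame.dtypes)
```

The trade-off is two small frames and a concat instead of one array, which is usually negligible next to the per-row approach.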