For a tabular data problem, how can I convert the TabularDataBunch we create to a pandas DataFrame?
It would help to use the same normalized and cleaned data to check efficacy across other algorithms (RF, XGB…).
I was wondering the same thing. Did you end up figuring this one out? I noticed there is a function like data.train_ds.to_df(), but it wasn't quite working for me. I figured I was doing something wrong… thanks!
Sorry mate, I have not tried it so far. I will let you know if I find out something.
Yeah, I got a similar error. I will check it out and let you know.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-21-826e21acd938> in <module>()
----> 1 data.train_dl.to_df()
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in to_df(self)
605 def to_df(self)->None:
606 "Create `pd.DataFrame` containing `items` from `self.x` and `self.y`."
--> 607 return pd.DataFrame(dict(x=self.x._relative_item_paths(), y=[str(o) for o in self.y]))
608
609 def to_csv(self, dest:str)->None:
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_paths(self)
129
130 def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 131 def _relative_item_paths(self): return [self._relative_item_path(i) for i in range_of(self.items)]
132
133 def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
129
130 def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 131 def _relative_item_paths(self): return [self._relative_item_path(i) for i in range_of(self.items)]
132
133 def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_path(self, i)
128 return cls.from_df(df, path=path, cols=cols, processor=processor, **kwargs)
129
--> 130 def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
131 def _relative_item_paths(self): return [self._relative_item_path(i) for i in range_of(self.items)]
132
AttributeError: 'int' object has no attribute 'relative_to'
I am also interested in this. I like to use the data block API to quickly grab a fastai dataset and do the default preprocessing. After that, I want to get the data back as either a pandas DataFrame or an np.array, so I can continue training with other libraries like scikit-learn, xgboost, keras, etc.
So far, I have found no easy ready-made method. to_df() doesn't seem to return anything (if you read the docs), so it is definitely an internal operation and probably not what you want.
This isn’t a full answer, but you can get the raw data back like this:
for x in data.train_ds.x:
    print(x.data)
Each of these would be a PyTorch tensor. I think we may just have to convert them to NumPy arrays manually and construct the pandas DataFrame ourselves.
If anyone finds a better way, please post.
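For what it's worth, here is a minimal sketch of that manual conversion. It assumes each row can be turned into a flat 1-D numeric array (for a CPU PyTorch tensor, `t.numpy()` does this). The helper name `rows_to_frame` and the column names are made up for illustration, and plain NumPy arrays stand in for the tensors so the snippet runs without fastai.

```python
import numpy as np
import pandas as pd

def rows_to_frame(rows, columns):
    """Stack per-row 1-D arrays into a 2-D array and wrap it in a DataFrame."""
    # np.asarray accepts lists and NumPy arrays; for a CPU torch tensor,
    # pass `t.numpy()` instead.
    stacked = np.stack([np.asarray(r) for r in rows], axis=0)
    return pd.DataFrame(stacked, columns=columns)

# Stand-in data: three rows with two features each (hypothetical values).
rows = [np.array([1.0, 0.5]), np.array([2.0, -0.3]), np.array([0.0, 1.2])]
df = rows_to_frame(rows, columns=["feat_a", "feat_b"])
print(df.shape)
```

This still builds the list row by row, so it may be slow on large datasets.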
I was also looking at this. As noted, DataLoader.to_df gives an error, as it seems to expect the data to come from files. You can access DataBunch.train_ds.inner_df to get a DataFrame with some of the processing applied. Categorical variables have been converted to categorical dtypes, but they have not been numericalised yet. I believe missing values should be handled here as well, but I didn't fully check this. If categorical dtypes are enough, you could just use that.
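If the categorical dtypes are not enough and you need integers, pandas can produce them directly. Here is a rough sketch using a small hand-made DataFrame in place of `inner_df` (the column names and the helper `categoricals_to_codes` are hypothetical). Note that pandas `.cat.codes` uses -1 for missing values, which may not match the encoding fastai uses internally, so treat this as an approximation.

```python
import pandas as pd

def categoricals_to_codes(df):
    """Return a copy of `df` with each categorical column replaced by its integer codes."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.codes  # pandas uses -1 for missing values
    return out

# Stand-in for `inner_df`: one categorical column, one continuous column.
inner_like = pd.DataFrame({
    "workclass": pd.Categorical(["private", "gov", None, "private"]),
    "age": [39.0, 50.0, 38.0, 53.0],
})
numeric = categoricals_to_codes(inner_like)
print(numeric)
```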
Or to get it when it has been numericalised I used:
import numpy as np
import pandas as pd

def get_proc_df(tll):
    """Get processed xs and ys from a tabular `LabelList` with a single value for label such as FloatList.
    For example from `TabularDataBunch.train_ds`.
    :param tll: A tabular `LabelList`.
    :returns: A tuple of `(x,y)` where `x` is a pandas `DataFrame` and `y` is a numpy array.
    """
    # Numericalised categorical codes and processed continuous values, side by side
    x_vals = np.concatenate([tll.x.codes, tll.x.conts], axis=1)
    x_cols = tll.x.cat_names + tll.x.cont_names
    x_df = pd.DataFrame(data=x_vals, columns=x_cols)[
        [c for c in tll.inner_df.columns if c in x_cols]]  # Retain order
    # Reconstruct ys to apply log if specified
    y_vals = np.array([i.obj for i in tll.y])
    return x_df, y_vals
This avoids having to recreate data row by row by using the fully processed columns in DataSet.codes and DataSet.conts.
As noted, this likely won't work if your label isn't a single float for regression (though anything else that np.array will happily take a list of should work).
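To illustrate the point about labels: `np.array` builds a 1-D array from scalar labels and a 2-D array from equal-length lists, so multi-target labels would probably come through as well. A small sketch with made-up values standing in for the `.obj` of each label:

```python
import numpy as np

# Hypothetical label values, as they might come out of `[i.obj for i in tll.y]`.
scalar_labels = [1.5, 2.0, 0.7]
vector_labels = [[1.5, 0.0], [2.0, 1.0], [0.7, 0.0]]

y_scalar = np.array(scalar_labels)  # one value per row
y_vector = np.array(vector_labels)  # one row of targets per sample
print(y_scalar.shape, y_vector.shape)
```

Labels of unequal length per row would not stack into a regular array, so those would need separate handling.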