Saving TextDataBunch to CSV

praman · January 22, 2019, 1:12am

I want to export the processed dataset to a CSV file, so I tried the following:

data_processed = TextClasDataBunch.load(path, 'pred_stage1', bs=20)
data_processed.single_ds.to_csv('processed_data')

but it fails with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-133-60878f87b37f> in <module>
----> 1 data_processed.single_ds.to_csv('processed_data')

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in to_csv(self, dest)
    496     def to_csv(self, dest:str)->None:
    497         "Save `self.to_df()` to a CSV file in `self.path`/`dest`."
--> 498         self.to_df().to_csv(self.path/dest, index=False)
    499 
    500     def export(self, fn:PathOrStr):

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in to_df(self)
    492     def to_df(self)->None:
    493         "Create `pd.DataFrame` containing `items` from `self.x` and `self.y`."
--> 494         return pd.DataFrame(dict(x=self.x._relative_item_paths(), y=[str(o) for o in self.y]))
    495 
    496     def to_csv(self, dest:str)->None:

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_paths(self)
    114 
    115     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 116     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    117 
    118     def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
    114 
    115     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
--> 116     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    117 
    118     def use_partial_data(self, sample_pct:float=1.0, seed:int=None)->'ItemList':

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in _relative_item_path(self, i)
    113         return cls.from_df(df, path=path, cols=cols, **kwargs)
    114 
--> 115     def _relative_item_path(self, i): return self.items[i].relative_to(self.path)
    116     def _relative_item_paths(self):   return [self._relative_item_path(i) for i in range_of(self.items)]
    117 

AttributeError: 'numpy.ndarray' object has no attribute 'relative_to'

Is there a way to write the tokenized dataset to a file?

praman · January 28, 2019, 8:38pm

Any help with this issue?

aftertouch · January 30, 2019, 10:05pm

Having the same issue here.

sgugger · January 30, 2019, 10:26pm

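The to_csv method is only intended to save filenames. It calls relative_to on each item, which is why it fails here: the items of a processed text dataset are arrays of token ids, not paths. There is no built-in way to save your processed dataset to a CSV file, but the save method of TextDataBunch saves your ids, so you can easily access your processed dataset later.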
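If a CSV file is really needed, the ids can be written out by hand. The following is only a minimal sketch using the standard csv module, not a fastai feature: `ids_to_csv` is a hypothetical helper, the sample ids and labels are made up, and the idea that the id arrays can be taken from something like `data.train_ds.x.items` is an assumption based on the traceback above (where `self.items[i]` is a `numpy.ndarray`).

```python
import csv
from pathlib import Path
from typing import Iterable, Sequence


def ids_to_csv(id_seqs: Iterable[Sequence[int]],
               labels: Iterable[object],
               dest: Path) -> None:
    """Write one row per document: the label, then its space-separated token ids."""
    with open(dest, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "ids"])
        for ids, label in zip(id_seqs, labels):
            # int(...) also accepts numpy integer scalars
            writer.writerow([str(label), " ".join(str(int(i)) for i in ids)])


# Made-up stand-in data. With fastai v1 the id arrays would presumably come
# from the processed dataset (an assumption, not verified here).
ids_to_csv([[2, 5, 9], [2, 7]], ["pos", "neg"], Path("processed_data.csv"))
```

Storing the ids as one space-separated field keeps every row at two columns, regardless of document length.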

p9anand · February 1, 2019, 6:39pm

@sgugger: How can we convert a TabularDataBunch to a pandas DataFrame?

sgugger · February 1, 2019, 6:43pm

That is not implemented.

lga · August 20, 2019, 8:32am

I recently needed to do this. Assuming you are reading the following CSV file:

docid,text
id1,foo bar baz
id2,hello world
... , ...
idN,beep bop bap

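then you can write out the numericalized output as CSV lines (one per document, which you can redirect to a file) with:

from fastai.text import *  # fastai v1; provides TextList and Path

# Column 1 (text) is the input; column 0 (docid) is used as the label.
tl = (TextList.from_csv(Path('.'), 'myfile.csv', cols=1, delimiter=',')
        .split_none()
        .label_from_df(cols=0))
for x, y in tl.train:
    terms = str(x).split()  # processed tokens of this document
    docid = str(y)
    # Map each token to its vocab id; unknown tokens fall back to 0.
    tids = ' '.join(str(tl.vocab.stoi.get(term, 0)) for term in terms)
    print(f'{docid},{tids}')  # docid, then space-separated ids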
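To sanity-check such an export, the ids can be mapped back to tokens. This is a rough sketch under stated assumptions: each saved line has the form `docid,ids` with space-separated ids, `decode_rows` is a hypothetical helper, and the toy `itos` list stands in for the vocabulary's id-to-token list (`tl.vocab.itos` in fastai v1).

```python
import csv
import io
from typing import List, Sequence, Tuple


def decode_rows(csv_text: str, itos: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Turn 'docid,ids' lines back into (docid, tokens) pairs."""
    decoded = []
    for docid, ids in csv.reader(io.StringIO(csv_text)):
        # Each id indexes into the id-to-token list.
        decoded.append((docid, [itos[int(i)] for i in ids.split()]))
    return decoded


# Toy vocabulary for illustration only; index 0 plays the role of "unknown".
itos = ["xxunk", "foo", "bar", "baz"]
decoded = decode_rows("id1,1 2 3\nid2,0 1\n", itos)
```

Any token that was missing from the vocabulary at export time comes back as the index-0 entry, so a large share of those in the decoded output suggests a vocabulary mismatch.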