Implementing data augmentation for text


(Sam Shleifer) #1

I am trying to modify fastai so that it can run augmentation transforms on text data, as it does for vision. For example, randomly removing a word from an input sentence.
Ideally I would want to call:

data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab, tfms=[rand_remove(p=.3, max_n=1)],  bs=42)
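To make the intent concrete: `rand_remove` is not an existing fastai function, and the `tfms` argument above is the wished-for interface. A minimal sketch of what such a transform could look like in plain Python, working on a list of tokens, is below. The meaning of the parameters is an assumption here: `p` is the probability that the transform fires at all, and `max_n` is the maximum number of tokens it removes. The `rng` argument is also hypothetical, added only to make the behaviour reproducible.

```python
import random


def rand_remove(p=0.3, max_n=1, rng=None):
    """Return a transform that, with probability p, drops up to max_n random tokens.

    Hypothetical helper, not part of fastai. Parameter semantics are assumed.
    """
    rng = rng or random.Random()

    def _tfm(tokens):
        # Skip very short inputs, a non-positive max_n, and the (1 - p) share of calls.
        if len(tokens) <= 1 or max_n < 1 or rng.random() >= p:
            return list(tokens)
        # Always keep at least one token.
        n = rng.randint(1, min(max_n, len(tokens) - 1))
        drop = set(rng.sample(range(len(tokens)), n))
        return [t for i, t in enumerate(tokens) if i not in drop]

    return _tfm


if __name__ == "__main__":
    tfm = rand_remove(p=1.0, max_n=1, rng=random.Random(0))
    print(tfm("the movie was surprisingly good".split()))
```

Returning a closure keeps the call style from the line above, `rand_remove(p=.3, max_n=1)`, while the actual work happens later on each item.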

Where in the code is the right place to do this? Should I add a preprocessor to TextDataBunch and pass it to TokenizeProcessor?


#2

You should change the apply_tfms method of TextList (or rather, create one). This is what is called behind the scenes when you use .transform in the data block API.
The factory method doesn’t accept a tfms argument, so you should also use the data block API to build your DataBunch.


(Sam Shleifer) #3

I think I have to change it on the single Text element if I want to modify self.text.
Regardless, I’ve been doing everything on disk because it’s faster and I don’t want to retokenize during training.
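For anyone wanting to try the on-disk route, here is a minimal sketch of the idea using only the standard library. It assumes a CSV with a header row and a `text` column; the function names `drop_one_word` and `augment_csv`, the `n_copies` and `seed` parameters, and the whitespace splitting are all illustrative choices, not fastai API.

```python
import csv
import random


def drop_one_word(text, rng):
    """Remove one randomly chosen whitespace-delimited word (no-op for single words)."""
    words = text.split()
    if len(words) <= 1:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def augment_csv(src, dst, text_col="text", n_copies=1, seed=0):
    """Copy src to dst, writing n_copies augmented variants after each original row."""
    rng = random.Random(seed)
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(row)  # keep the original example
            for _ in range(n_copies):
                # Same label and other columns, only the text is perturbed.
                writer.writerow({**row, text_col: drop_one_word(row[text_col], rng)})
```

The augmented file is then tokenized once, like any other dataset, so nothing is retokenized during training. The trade-off is that the augmentations are fixed rather than resampled each epoch, and this should only be run on the training rows, not on validation data.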