Data augmentation for text (Recent paper)

This paper looks like it could be useful for text augmentation in fastai.

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Jason W. Wei, Kai Zou

(Submitted on 31 Jan 2019)

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.
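For anyone who wants to try this, here is a rough sketch of two of the four operations named in the abstract, random swap and random deletion, on a list of tokens. It is not the authors' code. The function names, the `rng` argument, and the fallback that keeps one token when everything would be deleted are assumptions made for illustration. Synonym replacement and random insertion are left out because they need a synonym source such as a thesaurus.

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p."""
    kept = [t for t in tokens if rng.random() >= p]
    # Assumption: never return an empty sentence; keep one random token instead.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept

rng = random.Random(0)  # seeded so runs are repeatable
tokens = "the movie was surprisingly good".split()
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_deletion(tokens, p=0.2, rng=rng)
```

Both functions return new lists and leave the input untouched, so the same sentence can be augmented several times with different random draws.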

Oh, looks interesting! I’ll see what we can take from this.


I’m curious how you’re thinking of doing the replacement and swap augmentations. I’m about to do something similar for tabular data: I’m porting a denoising autoencoder from fastai 0.7 that applies swap noise to the input variables, which I originally implemented as a custom dataloader.

From my initial look at transforms, they seem designed to act on a single input, with no access to the full set of inputs or to the other items in the batch. I’m thinking of integrating the swap noise into the model instead, by swapping values within the batch, which I think will be a lot more efficient than doing it in the dataloader. I’m definitely interested in your perspective, though, as I hope to eventually add it to the fastai codebase.
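To make the in-batch idea concrete, here is a minimal sketch of swap noise in plain Python, assuming a batch is a list of equal-length rows. The name `swap_noise` and its parameters are hypothetical and not part of fastai. Each cell is replaced, with probability `p`, by the value from the same column of a randomly chosen row in the same batch.

```python
import random

def swap_noise(batch, p, rng):
    """Return a noisy copy of batch (a list of equal-length rows).

    Each cell is replaced with probability p by the same column's value
    from a randomly chosen row of the batch.
    """
    n_rows = len(batch)
    noisy = [list(row) for row in batch]  # copy so the input is not modified
    for r in range(n_rows):
        for c in range(len(batch[r])):
            if rng.random() < p:
                # The donor may be the same row, in which case the cell is unchanged.
                donor = rng.randrange(n_rows)
                noisy[r][c] = batch[donor][c]
    return noisy

batch = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
noisy = swap_noise(batch, p=0.15, rng=random.Random(0))
```

Because donors come from the current batch only, no extra pass over the dataset is needed. In a real model the same thing would presumably be done with tensor indexing rather than Python loops.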