Data augmentation for text (Recent paper)

This paper seems like it might be of interest/useful in the context of fastai's text augmentation.

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Jason W. Wei, Kai Zou

(Submitted on 31 Jan 2019)

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.
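For anyone curious what the four operations look like in code, here's a rough sketch (not the authors' implementation; `get_synonyms` is a toy stand-in for the WordNet lookup the paper actually uses):

```python
import random

def get_synonyms(word):
    # Hypothetical lookup table standing in for a WordNet query
    # (the paper uses WordNet, e.g. via nltk.corpus.wordnet).
    toy = {'quick': ['fast', 'speedy'], 'happy': ['glad', 'joyful']}
    return toy.get(word, [])

def synonym_replacement(words, n=1):
    # Replace up to n words that have synonyms with a random synonym.
    words = words.copy()
    candidates = [i for i, w in enumerate(words) if get_synonyms(w)]
    for i in random.sample(candidates, min(n, len(candidates))):
        words[i] = random.choice(get_synonyms(words[i]))
    return words

def random_insertion(words, n=1):
    # Insert a synonym of a random word at a random position, n times.
    words = words.copy()
    for _ in range(n):
        syns, tries = [], 0
        while not syns and tries < 10:
            syns = get_synonyms(random.choice(words)); tries += 1
        if syns:
            words.insert(random.randrange(len(words) + 1), random.choice(syns))
    return words

def random_swap(words, n=1):
    # Swap two randomly chosen positions, n times.
    words = words.copy()
    for _ in range(n):
        i, j = random.randrange(len(words)), random.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Drop each word with probability p; never return an empty sentence.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```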

Oh, looks interesting! I'll see what we can take from this.


I'm curious how you're thinking of doing the replacement and swap augmentations… I'm about to do something similar for tabular data (porting a denoising autoencoder from fastai 0.7), and it uses swap noise on the input variables, which I originally implemented as a custom dataloader.

In my initial look at transforms I got the sense they’re designed to act on a single input and don’t have a sense of the set of inputs or of the other items in the batch. I’m thinking of integrating it into the model instead by swapping from within the batch, which I think will be a lot more efficient than doing it in the dataloader, but I definitely am interested in your perspective as I hope to eventually add it to the fastai codebase.