Data augmentation techniques for text

Just as transfer learning became well established in vision and is now taking off in NLP, will a similar phenomenon occur with data augmentation and NLP? In the last lesson Jeremy mentioned there’s big potential for the development of NLP augmentation techniques.

I’ve been thinking about this and experimenting for a little while now, and I’m looking to learn as much as I can about current practices, plus suggestions on what might be the most fruitful things to try. I know there are challenges, and augmentation can be more task-specific than in vision, but there’s clearly an opportunity.

Eventually I think it would be great if fastai became the first library to support built-in text transformations. If there are augmentation techniques that work well across many common NLP tasks, then perhaps they could be implemented as transforms for use with the text learner.

I would like to ask the community: if you have used data augmentation in your NLP tasks, please share your experience.

Recently, I’ve been trying to use back-translation to create paraphrased augmented training instances: machine translation is used to translate from the source language to a ‘pivot’ language and then back again. I had a lot of AWS credits that were expiring, so as an experiment I used the Amazon Translate API to translate all the IMDb reviews from English to German and then back to English. An example:

Source:

Ghost of Dragstrip Hollow is a typical 1950’s teens in turmoil movie. It is not a horror or science fiction movie. Plot concerns a group of teens who are about to get kicked out of their “hot rod” club because they cannot meet the rent. Once kicked out, they decide to try an old Haunted House. The only saving grace for the film is that the “ghost” (Paul Blaisdell in the She Creature suit) turns out to be an out of work movie monster played by Blaisdell.

Augmented:

Ghost of Dragstrip Hollow is a typical 50s teenager in turbulence film. It’s not a horror or science fiction movie. Action concerns a group of teenagers who are about to throw out of their “Hot Rod” club because they cannot fulfill the rent. Once they’re kicked out, they decide to try an old Haunted House. The only saving grace for the film is that the “ghost” (Paul Blaisdell in the She Creature suit) turns out to be a movie monster played by Blaisdell.

The meaning is preserved and there is some paraphrasing, although it stays very close to the original. I read a comment where somebody felt that modern machine translation is now too good for paraphrasing, and they recommended using an older pre-deep-learning Moses model to generate instances further from the originals.
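For anyone who wants to try back-translation without paid API credits, here’s a minimal sketch using open-source MarianMT models from Hugging Face instead of the Amazon Translate API, so treat it as an illustrative stand-in rather than my exact setup:

```python
# A minimal back-translation sketch using open-source MarianMT models
# (Helsinki-NLP/opus-mt-en-de and opus-mt-de-en). This is an illustrative
# stand-in for the Amazon Translate API used above, not the exact setup.
from transformers import MarianMTModel, MarianTokenizer

def make_translator(model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    def translate(texts):
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        out = model.generate(**batch)
        return [tokenizer.decode(t, skip_special_tokens=True) for t in out]
    return translate

en_to_de = make_translator("Helsinki-NLP/opus-mt-en-de")
de_to_en = make_translator("Helsinki-NLP/opus-mt-de-en")

def back_translate(texts):
    # English -> German -> English round trip produces paraphrases
    return de_to_en(en_to_de(texts))

print(back_translate(["The only saving grace for the film is the ghost."]))
```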

I have not yet tested the augmented IMDb set to see if it offers an improvement, but I will update when I do. If anybody else is interested in trying it, I have put up the full augmented archive here:

https://s3-us-west-1.amazonaws.com/nlpdatasets/aclImdb.zip

Beyond paraphrasing, I’ve seen a few other methods mentioned as well.

Would love to hear about anybody’s experience, ideas, or intuitions regarding text data augmentation!


Here’s a list of interesting relevant articles/resources. Looking to add more to the list…

http://aclweb.org/anthology/N18-2072

https://openreview.net/pdf?id=B14TlG-RW

http://aclweb.org/anthology/D18-1100

“What data augmentation techniques are available for deep learning…” (Quora)


Here’s a paper about adding noise to word embeddings.
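For intuition, the general idea looks something like this sketch (the noise scale and where it’s applied are my assumptions, not necessarily the paper’s):

```python
# A minimal sketch of the general idea: perturb embedded token vectors with
# additive Gaussian noise before they enter the model. The noise scale and
# placement are my own assumptions, not necessarily the paper's exact method.
import torch

def noisy_embed(embedding_layer, token_ids, sigma=0.1):
    vecs = embedding_layer(token_ids)             # (batch, seq_len, emb_dim)
    return vecs + sigma * torch.randn_like(vecs)  # add N(0, sigma^2) noise

emb = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)
ids = torch.tensor([[1, 5, 7]])
print(noisy_embed(emb, ids).shape)  # torch.Size([1, 3, 8])
```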


Hello, in this work we use two data augmentation techniques on tweets:

  • two-way translation
  • instance crossover: generating new instances by combining halves of tweets (see the sketch below)
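
Here’s a rough sketch of what instance crossover could look like; the 50/50 token split and function names are my own assumptions rather than the authors’ code:

```python
# A rough sketch of instance crossover: splice the first half of one tweet's
# tokens onto the second half of another tweet with the same label. Function
# names and the 50/50 split point are my assumptions, not the authors' code.
import random

def instance_crossover(tweets_same_label, n_new=100, seed=42):
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        a, b = rng.sample(tweets_same_label, 2)
        a_toks, b_toks = a.split(), b.split()
        new = a_toks[: len(a_toks) // 2] + b_toks[len(b_toks) // 2 :]
        augmented.append(" ".join(new))
    return augmented

positive_tweets = [
    "loving this new phone so much",
    "what a great day at the beach",
    "best concert I have ever been to",
]
print(instance_crossover(positive_tweets, n_new=2))
```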

As MultiFiT uses BPE encoding, it seems that this technique is going to be relevant:


Basically, what it says is to use multiple BPE segmentations for the same words, which is a very elegant and simple idea.
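
For example, SentencePiece can sample a different segmentation each time you encode, which is one way to get this effect (a hedged sketch; the model filename is hypothetical):

```python
# A hedged sketch of drawing multiple subword segmentations for the same text
# with SentencePiece's sampling mode (subword regularization / BPE-dropout).
# It assumes a trained model file exists; "spm.model" is a hypothetical name.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
text = "unbelievably entertaining"
for _ in range(3):
    # enable_sampling draws a different segmentation on each call;
    # alpha controls how far samples stray from the best segmentation.
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```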

A really great data augmentation system using label-conditioned language model generation to supplement data, with a pretrained classifier filtering the generated sentences to keep them close to the distribution of the original dataset.
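
As a rough illustration (not the paper’s actual pipeline), label conditioning can be approximated with a label-specific prompt, followed by classifier filtering; the models, prompt, and threshold here are assumptions:

```python
# A rough illustration of the idea, not the paper's actual pipeline: condition
# generation on a label via a label-specific prompt, then keep only sentences
# that an off-the-shelf classifier confidently assigns to that label. The
# model choices, prompt, and threshold below are all assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("sentiment-analysis")

def generate_filtered(label="POSITIVE", prompt="This movie was",
                      n=10, threshold=0.9):
    kept = []
    outputs = generator(prompt, num_return_sequences=n,
                        max_length=40, do_sample=True)
    for out in outputs:
        text = out["generated_text"]
        pred = classifier(text)[0]
        if pred["label"] == label and pred["score"] >= threshold:
            kept.append(text)
    return kept

print(generate_filtered(n=5))
```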
