Data augmentation for NLP

ben.bowles · November 22, 2016, 7:36pm

What types of data augmentation are there for NLP? I have read about the idea of replacing words with synonyms, at random, to generate new data. I have tried this and personally had no success. Are there any other good ways to use data augmentation for NLP models?

jeremy · November 22, 2016, 10:40pm

The thesaurus thing is all I’ve seen. Some positive results were shown here https://arxiv.org/pdf/1502.01710.pdf . However the improvements were very small, since the datasets were so big.

@anamariapopescug are you aware of other approaches?

anamariapopescug · November 23, 2016, 1:15am

Thesaurus-based approaches are all I’ve come across so far, but I’ll look for others and post if I find anything interesting. The problem w/ thesaurus-based approaches is that (as @ben.bowles saw first hand) you usually can’t just use an off-the-shelf thesaurus for most tasks …
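
For reference, the synonym-replacement idea discussed above can be sketched in a few lines. This is only a minimal illustration: the thesaurus is a tiny hand-written dictionary made up for the example (not an off-the-shelf resource), and `replace_prob` is a hypothetical parameter name.

```python
import random

# Hypothetical, hand-made thesaurus for illustration only; a real task would
# likely need a task-specific one.
THESAURUS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def synonym_augment(text, thesaurus, replace_prob=0.3, rng=None):
    """Return a copy of `text` where each word found in `thesaurus` is swapped
    for a randomly chosen synonym with probability `replace_prob`."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        synonyms = thesaurus.get(word.lower())
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_augment("a good movie with a bad ending", THESAURUS, rng=random.Random(0)))
```

Calling it several times with different seeds gives several variants of the same labelled example.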

yad.faeq · November 26, 2016, 5:04pm

Not exactly data augmentation, though in recent research things like training on “monomodal data” have been tried and have been effective.

For example: in Semantic Parsing with Semi-Supervised Sequential Autoencoders, the authors present a “semi-supervised approach for sequence transduction and apply it to semantic parsing.”

Section 3.4, Data Generation, explains how existing data can be used to generate valid train/test data, valid as in proper database queries or map directions (since the task here is semantic parsing).

jeremy · November 26, 2016, 7:18pm

Can you explain so that a simpleton like me can understand (I haven’t read this paper yet)? Based on their introduction it sounds like they have an auto-encoder running side-by-side with their main task. Is that right? Is there more to the basic technique? How does an auto-encoder work in NLP?

yad.faeq · November 26, 2016, 7:58pm

Short answer:

In addition to the regular encoding and decoding parts, the decoder gets trained separately on a large text corpus, which is a new, separate dataset. This dataset is usually generated from the original set using different techniques, for example by making new valid SQL queries such as select * from a table. This essentially improves the decoder’s loss.
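
The data-generation idea in the short answer can be illustrated with a toy sketch. Everything here is an assumption made for the example: the schema, the function name `generate_queries`, and the query shapes are invented and are not taken from the paper.

```python
import random

# Hypothetical toy schema: table name -> column names (not from the paper).
SCHEMA = {
    "users": ["id", "name", "age"],
    "orders": ["id", "user_id", "total"],
}

def generate_queries(schema, n, rng=None):
    """Sample `n` syntactically valid SELECT statements from `schema`.
    Target-side-only strings like these could serve as extra decoder data."""
    rng = rng or random.Random()
    queries = []
    for _ in range(n):
        table = rng.choice(sorted(schema))
        columns = schema[table]
        if rng.random() < 0.5:
            selected = "*"
        else:
            k = rng.randint(1, len(columns))
            selected = ", ".join(rng.sample(columns, k))
        queries.append(f"SELECT {selected} FROM {table}")
    return queries

for query in generate_queries(SCHEMA, 3, rng=random.Random(0)):
    print(query)
```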

A little longer answer:
To make sure I don’t miss out any information: a basic sequence-to-sequence model that generates an output has two parts:

encoder: encodes the input into some representation.
decoder: takes in the encoder’s output while keeping the previous states in memory.

This is good for natural-language models in terms of dimensionality reduction and capturing the actually important bits; however, it struggles with constructing proper structure.

The part it struggles with is the decoder, where the mapping from the input (to the encoder) to the final output happens, because the decoder does not have enough ground truth to compute the loss function on.

Among all the other techniques, such as adding attention, RL agents and combined loss functions for the decoder, being able to train the decoder on more valid data has done a lot better in terms of performance.

janardhanp22 · December 12, 2016, 7:21pm

@ben.bowles Have you tried any techniques other than “synonym replacement” for NLP tasks, and if so can you share your results?
Have you tried hyponym and hypernym techniques?

renatocron · December 30, 2016, 5:53am

Hello there!
I have no background in ML, I just watched all 7 videos this week, so I could be totally off with this guess:
Considering that replacing words with synonyms is applied to the corresponding word (e.g. good => well), it (?may?) not help because they “reside” in the same place (have similar weights).

Now if you are talking about randomly adding some adjectives (instead of synonyms), you may train your network to “resist” over-fitting and “comprehend” noisier texts, which could help more than random words (as random words tend to have less chance of being used in a real situation than adjectives, IMHO).
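
If anyone wants to try the random-adjective idea, a minimal sketch could look like the following. The adjective list and the function name `insert_random_adjectives` are made up for illustration, and positions are chosen blindly, so the output is noisy rather than grammatical.

```python
import random

# Hypothetical adjective list for illustration.
ADJECTIVES = ["nice", "big", "old", "strange"]

def insert_random_adjectives(text, adjectives, n_insert=1, rng=None):
    """Insert `n_insert` adjectives at random positions in `text`.
    Positions are random, so the result is noise, not fluent language."""
    rng = rng or random.Random()
    tokens = text.split()
    for _ in range(n_insert):
        pos = rng.randint(0, len(tokens))  # inclusive bounds: may append at the end
        tokens.insert(pos, rng.choice(adjectives))
    return " ".join(tokens)

print(insert_random_adjectives("the cat sat on the mat", ADJECTIVES, n_insert=2,
                               rng=random.Random(0)))
```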

jeremy · December 30, 2016, 11:28pm

You should try it and see!

mrdrozdov · August 11, 2017, 3:04am

An interesting method is interpolating between two text embeddings. This technique was used to improve performance in the Generative Adversarial Text to Image Synthesis paper by Reed et al.
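
As a rough sketch of what interpolating two embeddings means: take a weighted average of two vectors and treat the result as a new, synthetic embedding. The vectors below are toy values, and the name `interpolate` and the default weight `beta=0.5` are illustrative choices, not a claim about the exact setup in the paper.

```python
def interpolate(emb_a, emb_b, beta=0.5):
    """Return beta * emb_a + (1 - beta) * emb_b, element-wise."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    return [beta * a + (1.0 - beta) * b for a, b in zip(emb_a, emb_b)]

# Toy 4-dimensional vectors standing in for real text embeddings.
e1 = [1.0, 0.0, 2.0, -1.0]
e2 = [0.0, 1.0, 0.0, 1.0]
print(interpolate(e1, e2))  # → [0.5, 0.5, 1.0, 0.0]
```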

jasonpmorrison · August 11, 2017, 6:12pm

An interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” by Xie et al (ICLR 2017) (arXiv):

In this work, we consider noising primitives as a form of data augmentation
for recurrent neural network-based language models.
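
As far as I understand the paper, two of the noising primitives it describes are blank noising (replace a token with a placeholder) and unigram noising (replace a token with a word drawn from the unigram distribution). Below is a simplified sketch with a fixed noising probability `gamma`; the function names are my own, and the paper also discusses more refined schemes that are not shown here.

```python
import random
from collections import Counter

def blank_noise(tokens, gamma, rng, blank="_"):
    """Replace each token with a placeholder with probability `gamma`."""
    return [blank if rng.random() < gamma else t for t in tokens]

def unigram_noise(tokens, gamma, unigram_counts, rng):
    """Replace each token, with probability `gamma`, by a word sampled
    from the corpus unigram distribution."""
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    return [
        rng.choices(vocab, weights=weights, k=1)[0] if rng.random() < gamma else t
        for t in tokens
    ]

corpus = "the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
rng = random.Random(0)
print(blank_noise(corpus, 0.2, rng))
print(unigram_noise(corpus, 0.2, counts, rng))
```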

markovbling · August 14, 2017, 7:08pm

I’m doing my MSc thesis on this topic.

Specifically, I’m looking at various ways of using external data derived from Wikipedia. It’s still early days but essentially I came up with a simple way of linking wikipedia articles to arbitrary input text. The idea is that if the input text were on Wikipedia, it would have links to other Wikipedia articles (that are semantically related and provide additional info).

The basic procedure is:

break the input text into n-grams
check whether each n-gram exists as a wikipedia article to create a set of ‘candidate links’
prune the candidate links by computing the similarity of the input text and the abstract of each candidate
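
The three steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions: `ABSTRACTS` is a small hypothetical dictionary standing in for a real Wikipedia title-to-abstract lookup, and Jaccard overlap with a made-up `threshold` stands in for whatever similarity measure is actually used in the thesis.

```python
def ngrams(tokens, max_n=3):
    """Yield all contiguous n-grams (as strings) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def jaccard(a, b):
    """Jaccard similarity between two collections of tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def wiki_links(text, abstracts, threshold=0.2, max_n=3):
    """Return article titles that occur as n-grams in `text` and whose
    abstract is similar enough to `text` (the pruning step)."""
    tokens = text.lower().split()
    candidates = {g for g in ngrams(tokens, max_n) if g in abstracts}
    return sorted(
        title for title in candidates
        if jaccard(tokens, abstracts[title].lower().split()) >= threshold
    )

# Hypothetical stand-in for a Wikipedia title -> abstract lookup.
ABSTRACTS = {
    "neural network": "a neural network is a model used in machine learning",
    "python": "python genus of snakes found across africa and asia",
    "machine learning": "machine learning is a field in which a model is trained for a task",
}

text = "I trained a neural network in Python for a machine learning course"
print(wiki_links(text, ABSTRACTS))  # → ['machine learning', 'neural network']
```

In this toy example "python" is found as a candidate link but pruned, because the snake abstract shares almost no words with the input text.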

Once you’ve got ‘wiki-links’ for an article, you can use those as additional data in a variety of ways. For example, you could just throw the abstracts of the linked wiki pages into a bag together with your input document for classification. Or you could run a recursive neural net on the sentences in the abstracts and then average the sentence representations to get a vector representation for each wiki article and a bag of those vectors to represent your input document. I’m also playing around with computing the eigen-centrality of the link graph of the linked documents (up to some link-degree) and using that as a feature representation for the input document.

There’s so much info in wikipedia!

vyassaurabh411 · March 26, 2018, 1:45pm

Here is an interesting idea that was used in the recently completed Kaggle competition ‘toxic comments’. A few people used [English - ‘intermediate language’ - English] translation to augment the data. It changes a few words in translation while keeping the meaning intact. I think this is similar to the synonym replacement strategy. Here is a link to a quick script that does this - toxic/tools/extend_dataset.py at master · PavelOstyakov/toxic · GitHub
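
The general shape of back-translation augmentation can be sketched as below. This does not reproduce the linked script: the translation functions are passed in by the caller, and the word-level dictionaries here are toy stand-ins for a real machine translation system.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate `text` into a pivot language and back again.
    `to_pivot` and `from_pivot` are caller-supplied translation callables."""
    return from_pivot(to_pivot(text))

def augment_with_back_translation(texts, translators):
    """Return the original texts plus every back-translation that differs
    from its source. `translators` is a list of (to_pivot, from_pivot) pairs."""
    augmented = list(texts)
    for text in texts:
        for to_pivot, from_pivot in translators:
            candidate = back_translate(text, to_pivot, from_pivot)
            if candidate != text:
                augmented.append(candidate)
    return augmented

# Toy word-level "translators" standing in for a real MT system.
EN_TO_FR = {"the": "le", "film": "film", "was": "était", "great": "super"}
FR_TO_EN = {"le": "the", "film": "movie", "était": "was", "super": "great"}

def en_to_fr(sentence):
    return " ".join(EN_TO_FR.get(w, w) for w in sentence.split())

def fr_to_en(sentence):
    return " ".join(FR_TO_EN.get(w, w) for w in sentence.split())

print(augment_with_back_translation(["the film was great"], [(en_to_fr, fr_to_en)]))
# → ['the film was great', 'the movie was great']
```

Using several pivot languages just means passing several translator pairs.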

matsair · May 31, 2018, 7:28am

Very interesting approach! Do you have any updates on this? Hope the thesis went well

kaustubh · July 5, 2018, 5:49am

Yes, since machine translation has shown impressive results, English->Intermediate Language->English works well. A very good paper along the same lines is by John Wieting: Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext

sermakarevich · September 10, 2018, 6:57pm

As always it is really problem dependent. We have faced the need for data augmentation for:

text search model: map manually typed search tokens into a set of tags
text classification model: make the text model more robust to the source of the product’s textual description

Case: User types search tokens and you need to return correct tags
Train set: some limited set of search tokens vs correct tags
Rationale vs augmentation:

User can make a mistake in token - randomly change one letter in a word (white blouse - white blosse)
User can miss a character in a token - randomly delete one letter in a word (white blouse - white blose)
User can use different order of search tokens - randomise tokens position (white blouse - blouse white)
User can use different number of search tokens (2, 3, 4 etc) - subsample number of tokens
User can use other tokens - enrich tokens with synonyms (red - pink - cardinal - cerise etc.)
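
The augmentations listed above could be sketched along these lines. This is an illustrative reconstruction, not the code that was actually used: the function names, the uniform choice between operations, and the `SYNONYMS` table are all assumptions, and the token list is assumed to be non-empty.

```python
import random
import string

def swap_letter(token, rng):
    """Typo: replace one random character with a random lowercase letter
    (which may occasionally be the same letter)."""
    i = rng.randrange(len(token))
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i + 1:]

def drop_letter(token, rng):
    """Omission: delete one random character (single-character tokens are kept)."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    return token[:i] + token[i + 1:]

def augment_query(tokens, synonyms, rng):
    """Apply one randomly chosen augmentation to a non-empty list of tokens."""
    tokens = list(tokens)
    op = rng.choice(["typo", "drop", "shuffle", "subsample", "synonym"])
    i = rng.randrange(len(tokens))
    if op == "typo":
        tokens[i] = swap_letter(tokens[i], rng)
    elif op == "drop":
        tokens[i] = drop_letter(tokens[i], rng)
    elif op == "shuffle":
        rng.shuffle(tokens)
    elif op == "subsample":
        # Keep a random subset of the tokens, preserving their order.
        k = rng.randint(1, len(tokens))
        keep = sorted(rng.sample(range(len(tokens)), k))
        tokens = [tokens[j] for j in keep]
    elif op == "synonym" and tokens[i] in synonyms:
        tokens[i] = rng.choice(synonyms[tokens[i]])
    return tokens

# Hypothetical synonym table for illustration.
SYNONYMS = {"red": ["pink", "cardinal", "cerise"]}
rng = random.Random(0)
for _ in range(5):
    print(augment_query(["red", "blouse"], SYNONYMS, rng))
```

Each augmented token list keeps the tags of the query it was derived from, which is how a few hundred pairs can be expanded into a much larger training set.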

We augmented 500 tag/search-token pairs into a 10M-row training dataset. After training for around 50 epochs the model was absolutely robust to every case we had anticipated. Needless to say, it failed every time for cases we had not augmented for.

jungiew · October 25, 2018, 9:44am

Thanks for sharing. Do you have the whole thesis or a paper to share?

jungiew · October 25, 2018, 10:07am

Thanks for sharing. Do you have any kind of longer description of your solution, results etc. that you could share? I’m especially interested in the text classification case, because I’m doing research on that.

WiraDKP · February 13, 2019, 10:53am

Hi everyone,
Another text data augmentation that is not yet mentioned here is sentence shuffling.
It is used in topic modelling though, not translation. The idea is to shuffle the sentences in a paragraph, and the effect is:

the topic remains the same
the word order within each sentence is preserved
we get a different data sample

Hope it helps
Thanks.
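
A minimal sketch of sentence shuffling, assuming a naive sentence splitter (split on ., ! or ? followed by whitespace); the function name `shuffle_sentences` is made up for the example, and a proper sentence tokenizer would be a better choice for real text.

```python
import random
import re

def shuffle_sentences(paragraph, rng=None):
    """Return `paragraph` with its sentences in random order.
    Word order inside each sentence is left untouched."""
    rng = rng or random.Random()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)

paragraph = "Cats sleep a lot. Dogs bark! Do birds sing? Yes."
print(shuffle_sentences(paragraph, rng=random.Random(0)))
```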

Pablo · March 1, 2019, 9:41am

Has anyone thought of using a language model to substitute some words in the example text? This would be especially easy in the transfer-learning framework, since creating a language model is already a requirement.

I guess this would be faster than traditional word substitution (finding the closest embedding), as well as producing richer results.

I have googled this idea for a bit, but found no mentions of anyone who tried it!
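
To make the idea concrete, here is a toy sketch in which a count-based bigram model proposes the replacement word. This is only a stand-in: the pretrained neural language model from the transfer-learning setup would take the place of `train_bigram_lm`, and all names here are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def lm_substitute(tokens, position, following):
    """Replace tokens[position] with the most frequent *different* word the
    model has seen after the preceding token; return a new list."""
    if position < 1 or position >= len(tokens):
        return list(tokens)
    candidates = following.get(tokens[position - 1], Counter())
    for word, _ in candidates.most_common():
        if word != tokens[position]:
            return tokens[:position] + [word] + tokens[position + 1:]
    return list(tokens)

corpus = [
    "the movie was great",
    "the movie was great fun",
    "the movie was awful",
    "the food was great",
]
lm = train_bigram_lm(corpus)
print(lm_substitute("the food was great".split(), 1, lm))
# → ['the', 'movie', 'was', 'great']
```

Unlike a thesaurus lookup, the replacement depends on the context, so it is not restricted to synonyms of the original word.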