What types of data augmentation are there for NLP? I have read about the idea of replacing words with synonyms, at random, to generate new data. I have tried this and personally had no success. Are there any other good ways to use data augmentation for NLP models?
The thesaurus thing is all I’ve seen. Some positive results were shown here, although they were very small since the datasets were so big: https://arxiv.org/pdf/1502.01710.pdf
@anamariapopescug are you aware of other approaches?
Thesaurus-based approaches are all I’ve come across so far, but I’ll look for others and post if I find anything interesting. The problem w/ thesaurus-based approaches is that (as @ben.bowles saw first hand), you usually can’t just use an off-the-shelf thesaurus for most tasks …
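For anyone who wants to try it, here is a minimal sketch of the synonym replacement being discussed. It is only an illustration: `TOY_THESAURUS` is a made-up, hand-built dictionary standing in for a task-specific thesaurus, and the function name and the replacement probability `p` are my own assumptions, not something taken from the paper above.

```python
import random

# Hypothetical, hand-built thesaurus for illustration only.
# In practice this would need to be tailored to the task.
TOY_THESAURUS = {
    "good": ["great", "fine"],
    "movie": ["film"],
}

def synonym_replace(tokens, thesaurus, p=0.5, rng=None):
    """Replace each token that has synonyms with probability p."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        candidates = thesaurus.get(tok.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

# Usage: with p=1.0 every token that has an entry gets replaced.
augmented = synonym_replace("a good movie".split(), TOY_THESAURUS,
                            p=1.0, rng=random.Random(0))
```

Tokens without an entry are passed through unchanged, so the quality of the result depends entirely on how well the thesaurus fits the task.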
Not exactly data augmentation, but in recent research things like training on “monomodal data” have been tried out and have been effective.
For example, Semantic Parsing with Semi-Supervised Sequential Autoencoders presents a “semi-supervised approach for sequence transduction and apply it to semantic parsing.”
In Section 3.4, Data Generation, it explains how existing data can be used to generate valid test/train data, valid as in proper database queries or map directions (since the task here is semantic parsing).
Can you explain so that a simpleton like me can understand (I haven’t read this paper yet)? Based on their introduction it sounds like they have an auto-encoder running side-by-side with their main task. Is that right? Is there more to the basic technique? How does an auto-encoder work in NLP?
In addition to the regular encoding and decoding parts, the decoder gets trained separately on a large text corpus, which is a new, separate dataset. This dataset is usually obtained (generated) from the original set using different techniques, for example: in a dataset, making new valid sql queries with select * table. This essentially gives the decoder's loss function more valid examples to work with.
A little longer answer:
To make sure I don’t miss anything: a basic sequence-to-sequence model that generates an output consists of:
- encoder, encodes the input into some representation.
- decoder, takes in the encoder’s output while keeping the previous states in memory.
This is good for natural language models in terms of dimensionality reduction and capturing the bits that actually matter; however, it struggles with constructing proper structure.
The part that struggles is the decoder, where the mapping from the encoder's input to the final output happens, because the decoder does not have enough ground truth to compute the loss against.
Compared with the other techniques, such as adding attention, RL agents, and combined loss functions for the decoder, being able to train the decoder on more valid data has done a lot better in terms of performance.
@ben.bowles Have you tried any other techniques apart from “synonym replacement” for NLP tasks, and can you share your results if so?
Have you tried hyponym and hypernym techniques?
I have no background in ML, I just watched all 7 videos this week, so I could be totally off with this guess:
Considering that replacing words with synonyms is applied to the corresponding word (e.g. good => well), it may not help, because the two “reside” in the same place (have similar weights).
Now, if you are talking about randomly adding some adjectives (instead of synonyms), you may train your network to “resist” over-fitting and “comprehend” noisier texts, which can help more than random words (as random words are less likely to be used in a real situation than adjectives, IMHO).
You should try it and see!
An interesting method is interpolating between two text embeddings. This technique was used to improve performance in the Generative Adversarial Text to Image Synthesis paper by Reed et al.
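As a rough sketch of what interpolating between two text embeddings means: take a weighted average of two embedding vectors and treat the result as an extra, synthetic training embedding. The function name and the default weight `beta=0.5` are my own choices for illustration, not values confirmed from the paper, and plain Python lists stand in for real embedding tensors.

```python
def interpolate(emb_a, emb_b, beta=0.5):
    """Return beta * emb_a + (1 - beta) * emb_b, element by element."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    return [beta * a + (1.0 - beta) * b for a, b in zip(emb_a, emb_b)]

# Usage with two toy 2-dimensional "embeddings".
mixed = interpolate([0.0, 2.0], [2.0, 0.0], beta=0.5)
```

Note that the interpolated vector has no real text attached to it, so this only works in setups where the downstream model consumes the embedding directly.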
An interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” by Xie et al (ICLR 2017) (arXiv):
“In this work, we consider noising primitives as a form of data augmentation for recurrent neural network-based language models.”
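To give a feel for what such noising primitives can look like, here is a minimal sketch of two of them as I understand them: replacing a token with a blank placeholder, and replacing a token with a word drawn from the unigram distribution. The function names, the `"_"` placeholder and the fixed noising probability `gamma` are assumptions for illustration; the paper's actual schemes may differ in detail.

```python
import random
from collections import Counter

def blank_noise(tokens, gamma, rng, blank="_"):
    """Replace each token with a blank placeholder with probability gamma."""
    return [blank if rng.random() < gamma else t for t in tokens]

def unigram_noise(tokens, gamma, rng, unigram_counts):
    """Replace each token, with probability gamma, by a word sampled
    in proportion to its corpus frequency."""
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    return [
        rng.choices(vocab, weights=weights)[0] if rng.random() < gamma else t
        for t in tokens
    ]

# Usage on a toy corpus.
corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
rng = random.Random(0)
noised_blank = blank_noise(corpus, 0.3, rng)
noised_unigram = unigram_noise(corpus, 0.3, rng, counts)
```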
I’m doing my MSc thesis on this topic.
Specifically, I’m looking at various ways of using external data derived from Wikipedia. It’s still early days but essentially I came up with a simple way of linking Wikipedia articles to arbitrary input text. The idea is that if the input text were on Wikipedia, it would have links to other Wikipedia articles (that are semantically related and provide additional info).
The basic procedure is:
- break the input text into n-grams
- check whether each n-gram exists as a Wikipedia article to create a set of ‘candidate links’
- prune the candidate links by computing the similarity of the input text and the abstract of each candidate
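The three steps above can be sketched roughly as follows. This is not the thesis code: `TOY_ABSTRACTS` is a hypothetical title-to-abstract dictionary standing in for a real Wikipedia lookup, and Jaccard overlap of word sets is just one simple stand-in for whatever similarity measure is actually used.

```python
def ngrams(tokens, max_n=3):
    """Yield all n-grams of length 1..max_n as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def wiki_links(text, abstracts, max_n=3, threshold=0.25):
    tokens = text.lower().split()
    # Steps 1 and 2: n-grams that exist as article titles are candidates.
    candidates = {g for g in ngrams(tokens, max_n) if g in abstracts}
    # Step 3: prune candidates whose abstract is not similar to the input.
    return sorted(
        title for title in candidates
        if jaccard(tokens, abstracts[title].lower().split()) >= threshold
    )

# Hypothetical stand-in for a Wikipedia title -> abstract lookup.
TOY_ABSTRACTS = {
    "neural network": "a neural network is a model used in machine learning",
    "apple": "the apple is a fruit that grows on trees",
}
links = wiki_links(
    "we train a neural network for machine learning on apple reviews",
    TOY_ABSTRACTS,
)
```

In this toy example both titles are candidates, but the fruit article is pruned because its abstract shares too few words with the input.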
Once you’ve got ‘wiki-links’ for an article, you can use those as additional data in a variety of ways. For example, you could just throw the abstracts of the linked wiki pages into a bag together with your input document for classification. Or you could run a recursive neural net on the sentences in the abstracts and then average the sentence representations to get a vector representation for each wiki article and a bag of those vectors to represent your input document. I’m also playing around with computing the eigen-centrality of the link graph of the linked documents (up to some link-degree) and using that as a feature representation for the input document.
There’s so much info in Wikipedia!
Here is an interesting idea that was used in the recently completed Kaggle competition ‘toxic comments’. A few people used [English - ‘intermediate language’ - English] translation to augment the data. It changes a few words in translation while keeping the meaning intact. I think this is similar to the synonym replacement strategy. Here is a link to a quick script that does this - https://github.com/PavelOstyakov/toxic/blob/master/tools/extend_dataset.py
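The general shape of this round-trip translation can be sketched as below. This is not the linked script: `translate` is a hypothetical callable `(text, source_lang, target_lang) -> text` that you would have to back with a real translation service or model, and `toy_translate` is only a stub so that the example runs.

```python
def back_translate(text, translate, pivot, source="en"):
    """Translate source -> pivot -> source using the supplied callable."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)

def augment_with_back_translation(texts, translate, pivots=("de", "fr", "es")):
    """Return new paraphrases, skipping round trips that change nothing."""
    augmented = []
    for text in texts:
        seen = {text}
        for pivot in pivots:
            paraphrase = back_translate(text, translate, pivot)
            if paraphrase not in seen:
                seen.add(paraphrase)
                augmented.append(paraphrase)
    return augmented

def toy_translate(text, src, tgt):
    """Stub translator (assumption): word-by-word lookup, illustration only."""
    table = {
        ("en", "de"): {"great": "toll"},
        ("de", "en"): {"toll": "awesome"},
    }
    mapping = table.get((src, tgt), {})
    return " ".join(mapping.get(w, w) for w in text.split())

new_rows = augment_with_back_translation(
    ["a great day"], toy_translate, pivots=("de", "fr")
)
```

Each paraphrase keeps the label of the original example, which is the whole point: the wording shifts a little while the meaning (hopefully) stays the same.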
Very interesting approach! Do you have any updates on this? Hope the thesis went well.
Yes, since machine translation has shown impressive results, English->Intermediate Language->English works well. A very good paper along the same lines is by John Wieting: Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext
As always, it is really problem dependent. We have faced the need for data augmentation for:
- text search model: map manually typed search tokens into a set of tags
- text classification model: make the text model more robust to where the product text description comes from
Case: User types search tokens and you need to return correct tags
Train set: some limited set of search tokens vs correct tags
Rationale vs augmentation:
- User can make a mistake in token - randomly change one letter in a word (white blouse - white blosse)
- User can miss a character in a token - randomly delete one letter in a word (white blouse - white blose)
- User can use different order of search tokens - randomise tokens position (white blouse - blouse white)
- User can use different number of search tokens (2, 3, 4 etc) - subsample number of tokens
- User can use other tokens - enrich tokens with synonyms (red - pink - cardinal - cerise etc.)
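A minimal sketch of the five augmentations in the list above might look like this. The function names and the `TOY_SYNONYMS` dictionary are made up for illustration and are not the code we actually used; words are assumed to be non-empty, and the typo helper may occasionally pick the same letter it replaces.

```python
import random
import string

def replace_letter(word, rng):
    """Typo: overwrite one random position with a random lowercase letter."""
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]

def delete_letter(word, rng):
    """Missed character: drop one random letter."""
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def shuffle_tokens(tokens, rng):
    """Different order: randomise token positions."""
    out = list(tokens)
    rng.shuffle(out)
    return out

def subsample_tokens(tokens, rng):
    """Different number of tokens: keep a random non-empty subset, in order."""
    k = rng.randint(1, len(tokens))
    keep = sorted(rng.sample(range(len(tokens)), k))
    return [tokens[i] for i in keep]

def swap_synonym(tokens, rng, synonyms):
    """Other tokens: replace one token that has synonyms, if any."""
    out = list(tokens)
    positions = [i for i, t in enumerate(out) if t in synonyms]
    if positions:
        i = rng.choice(positions)
        out[i] = rng.choice(synonyms[out[i]])
    return out

def augment(tokens, rng, synonyms):
    """Apply one randomly chosen augmentation to a non-empty token list."""
    tokens = list(tokens)
    op = rng.choice(["typo", "delete", "shuffle", "subsample", "synonym"])
    if op in ("typo", "delete"):
        i = rng.randrange(len(tokens))
        fn = replace_letter if op == "typo" else delete_letter
        tokens[i] = fn(tokens[i], rng)
        return tokens
    if op == "shuffle":
        return shuffle_tokens(tokens, rng)
    if op == "subsample":
        return subsample_tokens(tokens, rng)
    return swap_synonym(tokens, rng, synonyms)

# Usage: generate a few augmented rows for one (tokens -> tags) example.
rng = random.Random(0)
TOY_SYNONYMS = {"red": ["pink", "cardinal", "cerise"]}
rows = [augment(["red", "blouse"], rng, TOY_SYNONYMS) for _ in range(5)]
```

Every augmented row keeps the tags of the example it was generated from, which is how a few hundred labelled pairs can be blown up into a much larger training set.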
We augmented 500 tag/search-token pairs into a 10M-row training dataset. After training for around 50 epochs, the model was absolutely robust to every case we had anticipated. Needless to say, it failed every time on cases we had not augmented for.