Data augmentation for NLP

Can you explain so that a simpleton like me can understand (I haven’t read this paper yet)? Based on their introduction it sounds like they have an auto-encoder running side-by-side with their main task. Is that right? Is there more to the basic technique? How does an auto-encoder work in NLP?

Short answer:

In addition to the regular encoder and decoder, the decoder also gets trained separately on large text corpora, i.e. a new, separate dataset. This dataset is usually obtained (generated) from the original set using different techniques, for example by generating new valid SQL queries such as `SELECT * FROM table`. This essentially gives the decoder more signal for its loss function.
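As a toy illustration of the "generate more valid queries" idea (the schema and column names below are invented, not from the paper), you could enumerate simple, well-formed SELECT statements so the decoder sees many additional valid targets:

```python
# Toy sketch: enumerate simple, well-formed SELECT statements over a known
# schema to give the decoder extra valid targets. Table/column names invented.
tables = {"users": ["id", "name"], "orders": ["id", "total"]}

synthetic_queries = [
    f"SELECT {col} FROM {table};"
    for table, cols in tables.items()
    for col in cols + ["*"]
]

for query in synthetic_queries:
    print(query)
```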

A little longer answer:
To make sure I’m not missing any information: in a basic sequence-to-sequence model that generates an output, we have:

  • an encoder, which encodes the input into some representation;
  • a decoder, which takes the encoder’s output while keeping the previous states in memory.

This works well for natural-language models, in terms of dimensionality reduction and capturing the actually important bits; however, it struggles with constructing proper structure.

The part it struggles with is the decoder, where the mapping from the (encoded) input to the final output happens, because the decoder does not have enough ground truth to make the loss function useful.

Among the other techniques, such as adding attention, RL agents, and combined loss functions for the decoder, being able to train the decoder on more valid data has done much better in terms of performance.

@ben.bowles Have you tried any other techniques apart from “synonym replacement” for NLP tasks, and can you share your results if so?
Have you tried hyponym and hypernym techniques?

Hello there!
I have no background in ML, I just watched all 7 videos this week, so I may be totally off with this guess:
Considering that replacing a word with a synonym swaps in a corresponding word (e.g. good => well), it may not help much, because the two words “reside” in the same place (have similar weights).

Now, if you are talking about randomly adding some adjectives (instead of synonyms), you may train your network to “resist” over-fitting and to “comprehend” noisier texts, which can help more than random words (random words have less chance of being used in a real situation than adjectives, IMHO).

You should try it and see! :slight_smile:

An interesting method is interpolating between two text embeddings. This technique was used to improve performance in the Generative Adversarial Text to Image Synthesis paper by Reed et al.
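The core operation is just a convex combination of two embedding vectors. A minimal sketch (the vectors below are random placeholders standing in for real sentence embeddings, not anything from the paper):

```python
import numpy as np

# Placeholder embeddings for two sentences; in practice these would come from
# your text encoder.
emb_a = np.random.randn(256)
emb_b = np.random.randn(256)

t = np.random.uniform(0.0, 1.0)           # interpolation coefficient
emb_interp = (1 - t) * emb_a + t * emb_b  # synthetic "in-between" embedding
```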

An interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” by Xie et al (ICLR 2017) (arXiv):

In this work, we consider noising primitives as a form of data augmentation
for recurrent neural network-based language models.

I’m doing my MSc thesis on this topic :blush:

Specifically, I’m looking at various ways of using external data derived from Wikipedia. It’s still early days but essentially I came up with a simple way of linking wikipedia articles to arbitrary input text. The idea is that if the input text were on Wikipedia, it would have links to other Wikipedia articles (that are semantically related and provide additional info).

The basic procedure is:

  1. break the input text into n-grams
  2. check whether each n-gram exists as a wikipedia article to create a set of ‘candidate links’
  3. prune the candidate links by computing the similarity of the input text and the abstract of each candidate
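In case it helps to see the shape of it, here is a rough Python sketch of those three steps. The helpers `article_exists` and `get_abstract` are hypothetical stand-ins for lookups against a Wikipedia dump or API, and TF-IDF cosine similarity is just one possible choice for the pruning step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def wiki_links(text, article_exists, get_abstract, max_n=3, threshold=0.1):
    tokens = text.lower().split()

    # 1. break the input text into n-grams
    grams = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]

    # 2. keep the n-grams that exist as Wikipedia article titles
    candidates = [g for g in set(grams) if article_exists(g)]

    # 3. prune candidates by similarity between the input text and each abstract
    links = []
    for title in candidates:
        abstract = get_abstract(title)
        tfidf = TfidfVectorizer().fit_transform([text, abstract])
        similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
        if similarity >= threshold:
            links.append((title, similarity))
    return sorted(links, key=lambda x: -x[1])
```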

Once you’ve got ‘wiki-links’ for an article, you can use those as additional data in a variety of ways. For example, you could just throw the abstracts of the linked wiki pages into a bag together with your input document for classification. Or you could run a recursive neural net on the sentences in the abstracts and then average the sentence representations to get a vector representation for each wiki article and a bag of those vectors to represent your input document. I’m also playing around with computing the eigen-centrality of the link graph of the linked documents (up to some link-degree) and using that as a feature representation for the input document.

There’s so much info in wikipedia! :stuck_out_tongue:

Here is an interesting idea that was used in the recently completed Kaggle competition ‘Toxic Comments’. A few people used [English → ‘intermediate language’ → English] translation to augment the data. It changes a few words in translation while keeping the meaning intact. I think this is similar to the synonym-replacement strategy. Here is a link to a quick script that does the same - https://github.com/PavelOstyakov/toxic/blob/master/tools/extend_dataset.py
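If you don’t want to depend on a translation service, the same back-translation idea can be sketched with the MarianMT models in Hugging Face transformers (this is not the linked script, just a minimal illustration; English↔French is an arbitrary choice of pivot language):

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

def back_translate(texts):
    # English -> French -> English; the round trip paraphrases the input
    french = translate(texts, en_fr_tok, en_fr)
    return translate(french, fr_en_tok, fr_en)

print(back_translate(["The quick brown fox jumps over the lazy dog."]))
```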

Very interesting approach! Do you have any updates on this? Hope the thesis went well :slight_smile:

Yes, since machine translation has shown impressive results, English → Intermediate Language → English works well. A very good paper along the same lines is by John Wieting: Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext

As always, it is really problem dependent. We have faced the need for data augmentation for:

  • a text search model: map manually typed search tokens to a set of tags
  • a text classification model: make the text model more robust to the source of the product’s textual description

Case: the user types search tokens and you need to return the correct tags
Train set: a limited set of search tokens vs. correct tags
Rationale and the corresponding augmentation:

  1. The user can make a mistake in a token - randomly change one letter in a word (white blouse - white blosse)
  2. The user can miss a character in a token - randomly delete one letter in a word (white blouse - white blose)
  3. The user can use a different order of search tokens - randomise the token positions (white blouse - blouse white)
  4. The user can use a different number of search tokens (2, 3, 4, etc.) - subsample the number of tokens
  5. The user can use other tokens - enrich tokens with synonyms (red - pink - cardinal - cerise, etc.)

We augmented 500 tag/search-token pairs into a 10M-row training dataset. After training for around 50 epochs, the model was absolutely robust to every case we anticipated. Needless to say, it failed every time on the cases we did not augment for :slight_smile:
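For concreteness, here is a rough Python sketch of the five augmentations above. The synonym table, typo probabilities, and example tokens are invented for illustration; in practice they would come from your real tag and synonym vocabulary:

```python
import random
import string

SYNONYMS = {"red": ["pink", "cardinal", "cerise"]}  # hypothetical lookup table

def change_letter(word):
    """1. Typo: replace one random letter (e.g. blouse -> blosse)."""
    i = random.randrange(len(word))
    return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]

def drop_letter(word):
    """2. Missed character: delete one random letter (e.g. blouse -> blose)."""
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def augment(tokens, typo_p=0.1):
    # 5. enrich tokens with synonyms
    tokens = [random.choice([t] + SYNONYMS.get(t, [])) for t in tokens]
    # 3. + 4. randomise the token order and subsample the number of tokens
    tokens = random.sample(tokens, k=random.randint(1, len(tokens)))
    # 1. + 2. inject typos with a small probability
    noisy = []
    for t in tokens:
        r = random.random()
        if r < typo_p:
            t = change_letter(t)
        elif r < 2 * typo_p:
            t = drop_letter(t)
        noisy.append(t)
    return noisy

print(augment(["white", "blouse"]))
print(augment(["red", "dress"]))
```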

Thanks for sharing. Do you have the whole thesis or a paper to share?

Thanks for sharing. Do you have any kind of longer description of your solution, results etc. that you could share? I’m especially interested in the text classification case, because I’m doing research on that.

Hi everyone,
Another text data augmentation technique that has not yet been mentioned here is sentence shuffling.
It is used in topic modelling though, not translation. The idea is to shuffle the sentences in a paragraph, and what it does is:

  • the topic remains the same
  • the word order within each sentence is preserved
  • we get different data
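A minimal sketch of the idea (splitting on ‘.’ is a simplification; a real pipeline would use a proper sentence tokenizer such as nltk’s sent_tokenize):

```python
import random

def shuffle_sentences(paragraph):
    # naive sentence split; replace with a real sentence tokenizer if needed
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

doc = "The cat sat on the mat. It was a sunny day. The dog barked outside."
print(shuffle_sentences(doc))
```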

Hope it helps :smiley:
Thanks.

Has anyone thought of using a language model to substitute some words in the example text? This would be especially easy in the transfer-learning framework, since creating a language model is already a requirement.

I guess this would be faster than traditional word substitution (finding the closest embedding), as well as producing richer results.

I have googled this idea for a bit, but found no mention of anyone who has tried it!
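One way to try it would be with a masked language model, e.g. via the Hugging Face fill-mask pipeline (BERT here is an arbitrary choice, not the course’s language model; the sentence and masking strategy are made up):

```python
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def lm_substitute(sentence):
    words = sentence.split()
    i = random.randrange(len(words))
    words[i] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    predictions = fill_mask(" ".join(words))
    # keep the most likely filled-in sentence as the augmented example
    return predictions[0]["sequence"]

print(lm_substitute("the movie was surprisingly good"))
```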

I reviewed the literature and have written a survey article on this recently. Please check it out.

A Visual Survey of Data Augmentation in NLP

Hi, and good to see you around here!

In fact, a colleague shared your article with my work team yesterday. Thanks a lot for your contribution; it’s the best review I’ve come across.

While I have you here, could you please expand a bit on Unigram Noising? You say:

The idea is to perform replacement with words sampled from the unigram frequency distribution. This frequency is basically how many times each word occurs in the training corpus.

So do you swap words for others of similar frequency? I don’t really get this one.

Again, thanks and good job!

Hi,

Sorry that it was not clear in the article. The idea is basically to randomly select words in the original text and replace them with a random word sampled from the unigram distribution. So, frequent words have a higher chance of being selected than infrequent ones. The paper uses it only as a very simple noising technique. The resulting sentence might not sound coherent when read by a human.
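A minimal sketch with a toy corpus (the corpus and replacement probability below are made up; each word is replaced, with probability p, by a word sampled from the corpus’s unigram frequency distribution):

```python
import random
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = Counter(word for line in corpus for word in line.split())
vocab, weights = zip(*counts.items())

def unigram_noise(sentence, p=0.2):
    # replace each word with prob. p by a sample from the unigram distribution
    return " ".join(
        random.choices(vocab, weights=weights)[0] if random.random() < p else w
        for w in sentence.split()
    )

print(unigram_noise("the cat sat on the mat"))
```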

But what you described could also be an interesting thing to try out: swapping words for others of similar frequency.

I see! Interesting, thanks for the clarification :slight_smile: