NLP Data Augmentation Experiments



The aim of my project is to experiment with a few published NLP data augmentation ideas (more details below), in an attempt to improve language model performance, reduce training time, or push accuracy on NLP tasks such as sentiment classification, as covered in Lesson 4 (Part 1, 2019).


Initial ideas and experiments:

  • EDA (Easy Data Augmentation) paper
  • Back Translation:
    Translating from one language to another and then back to the original, so that the “noise” introduced by the round trip can be used as augmented text.
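
To make the back-translation idea concrete, here is a minimal sketch. It assumes you already have two translation functions (source to pivot language, and pivot back to source); the `Translator` type, `back_translate`, and the toy translators below are hypothetical names made up for illustration, not part of any library or of the experiments in this thread.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a sentence to its translation.
Translator = Callable[[str], str]


def back_translate(
    texts: List[str], to_pivot: Translator, from_pivot: Translator
) -> List[str]:
    """Round-trip each text through a pivot language and keep the changed ones."""
    augmented = []
    for text in texts:
        round_trip = from_pivot(to_pivot(text))
        # A round trip identical to the source adds no new training signal.
        if round_trip.strip().lower() != text.strip().lower():
            augmented.append(round_trip)
    return augmented


# Toy stand-ins for real translation models (assumption: demo only).
def toy_en_to_fr(s: str) -> str:
    return s.replace("great", "formidable").replace("movie", "film")


def toy_fr_to_en(s: str) -> str:
    return s.replace("formidable", "terrific").replace("film", "movie")


if __name__ == "__main__":
    print(back_translate(["a great movie", "hello"], toy_en_to_fr, toy_fr_to_en))
```

In a real run the two toy functions would be replaced by calls to an actual translation model or service; the filtering step is a design choice, since unchanged round trips would only duplicate the original data.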

More ideas have been discussed in this thread. I’ll try to go over these and add them here with time.

Initial Experiments:

Based on the EDA paper, we’re trying to perform noun and verb replacement on the IMDB dataset. There is a demo in a Kaggle kernel.
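
As a rough illustration of word replacement (not the code from the kernel), here is a minimal sketch. It assumes a small hand-written synonym table; `SYNONYMS` and `replace_words` are hypothetical names, and a real run would more likely use a POS tagger plus a thesaurus such as WordNet to find replaceable nouns and verbs.

```python
import random
from typing import Dict, List

# Hypothetical, hand-written synonym table covering a few nouns and verbs.
# Assumption: a real pipeline would build this from a thesaurus instead.
SYNONYMS: Dict[str, List[str]] = {
    "movie": ["film", "picture"],
    "actor": ["performer"],
    "love": ["adore", "enjoy"],
    "watch": ["view", "see"],
}


def replace_words(text: str, n: int = 2, seed: int = 0) -> str:
    """Replace up to n words that have an entry in SYNONYMS."""
    rng = random.Random(seed)  # seeded so the augmented copy is reproducible
    tokens = text.split()
    # Positions of tokens we know how to replace.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)


if __name__ == "__main__":
    print(replace_words("I love this movie", n=2))
```

The `n` parameter is one way to explore aggressive versus relaxed replacement: a larger `n` changes more of the review, at a higher risk of drifting from the original label.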

An image to show the TL;DR augmentation approach:

We’re trying to create an “Augmented copy” of the IMDB dataset and then train on the original and augmented data in cycles as an experiment.

Further ideas:

  • Back-Translation.
  • Checking aggressive/relaxed replacement techniques.
  • Implementing more ideas from the original EDA paper on other datasets.

If you’re interested, please tag @init_27 or DM me.

Best Regards,


I always wondered if language models could in some way help with data augmentation, or just with delivering more data to train on. If you have a model that can write like Nietzsche, that is additional language data. Or you train a model on some scientific style, compare its output to the original text you want augmented using some kind of loss function, and then ‘tweak’ the original (replace some sentences or phrases) towards that scientific style. There you have your augmented text in a ‘scientific style’ version … just like augmenting a picture, e.g. with perspective changes, flipping and the like.

Just throwing out some ideas, I haven’t even read up on the literature :wink: But in one way or another, language models should be able to help in augmenting NLP data?


Thanks! @Benudek
Great ideas. I’ll not pretend to act smart and make a comment but my intuition agrees with yours, it’ll be “artificial NLP data”(?)
Only one way to find out would be experimenting and checking. I’ve added your suggestion on my to-do and will share any failures/successes/ideas here.

On second thought, since the model generates language from the data it was trained on, I’m not sure how much variance it would add to the data. Again, only one way to find out :slight_smile:

Here is a recent paper summary that claims some tricks helped the authors perform better than ULMFiT in some cases. I want to try these too sometime soon.

In an ideal world where everything works, a collection of all these ideas might give some great results. But again, I’ll not pretend to act as if I’m an expert :slight_smile:

So let’s say your starting text is a medical text, and you take a language model that was trained on common language, say Wikipedia. Then you compare the two, and maybe you can transfer the style from the common text to the medical text … and you have already augmented your source text. A bit like style transfer. The advantage might be that with language models you can generate whatever NLP data you want, and as much of it as you want.

this is just thinking out loud here really :wink:


Isn’t it similar to what is taught in Lesson 4?

WikiText model -> Fine-tuned to target corpus.

Could you kindly elaborate on what you’re suggesting?

Not sure, maybe I got it from there.

My rough idea would be to compare two texts and then tweak the original text towards the style of the other one, where this other one might have been generated with a language model (hence you have as much data in whatever style you want, since you generate it). You measure the distance between the texts, and then in the original text you try to take over the style of that other text, i.e. replace some words or sentence structures.

So you transfer style: different tone or vocabulary, same content. It’s really like augmenting a picture, where you get a different viewer’s perspective on the same object; in language it would be a different style describing the same thing.

No clue on the details: how to measure the distance between text styles, or whether replacing parts of texts can work (especially given grammar rules, though language models kind of seem to understand those?) … just throwing it out here :wink:
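
For the open question of measuring a distance between texts, one very crude starting point is cosine distance between word-frequency vectors. This is only a sketch under that assumption: `word_counts` and `style_distance` are hypothetical names, and the measure mixes vocabulary and content, so it is not a true style metric.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def style_distance(a: str, b: str) -> float:
    """Cosine distance between word-frequency vectors: 0 = same, 1 = disjoint."""
    ca, cb = word_counts(a), word_counts(b)
    # Dot product over the shared vocabulary only.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    if norm == 0:
        return 1.0  # an empty text shares nothing with the other one
    return 1.0 - dot / norm


if __name__ == "__main__":
    medical = "the patient presented with acute chest pain"
    casual = "the guy showed up with really bad chest pain"
    print(round(style_distance(medical, casual), 3))
```

A tweak to the original text could then be accepted only if it lowers this distance to the target-style text, although a learned representation would probably separate style from content better than raw word counts.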


That’s actually worth a try.
I’ll first complete my data aug runs in the OT.

Next, I’ll try to figure out your approach.

This paper talks about an “auxiliary loss”; I think that might be a good way to do it too.


Cool, happy to help. Maybe we should ask someone like Sylvain whether what I suggest could even make sense (or is more of a lame NLP joke ;-))

Yes, definitely.
I want to wait until the Part 2 lessons that cover the NLP part are taught. It would be unwise to bother Sylvain before doing my homework :slight_smile:


Yes, very much agree. I also want to start my pet projects only after that, to see what the state of the art is!


@init_27 I’m interested in implementing the experiments.

Hello everyone, I’m working on NLP with data augmentation too.
One source of ideas may be this paper presented at TASS 2018. The authors explore some innovative ideas in data augmentation, including the “Back Translation” idea that you mention.

Another idea I’m exploring is using a tweets dataset to create the language model and then fine-tuning it into a sentiment analysis classifier. For the TASS competition, of course.

Hello there, I am working on a graduation project (a CNN model to classify text), but I need text augmentation because my dataset is very small. Where can I find a working implementation of this idea?

@init_27 shared this paper in one of the earlier study group meetings. There is an associated paper as well.


Thank you so much, this helped me a lot.