The aim of my project is to experiment with a few published NLP data augmentation ideas (more details below), attempting to further improve language model performance, reduce training time, or push accuracy on NLP tasks such as sentiment classification, as covered in Lesson 4 (Part 1, 2019).
I always wondered if language models could in some way help with data augmentation, or simply deliver more data to train on. If you have a model that can write like Nietzsche, that is additional language data. Or you train a model on some scientific style, then compare it against the original text you need augmented and, with some kind of loss function towards that original text, ‘tweak it’ (replace some sentences or phrases) towards the scientific style. There you have your augmented text version in ‘scientific style’ … just like augmenting a picture, e.g. with perspective changes, flipping and the like.
Just throwing out some ideas here, I haven’t even read up on the literature. But in one way or another, language models should be able to help with augmenting NLP data?
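To make the ‘just generate more data’ part a bit more concrete, here is a minimal sketch with fastai v1 (the `texts.csv` file, its `text` column and all hyperparameters are made up for illustration, not a recipe):

```python
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Hypothetical in-domain corpus: 'data/texts.csv' with a 'text' column.
data_lm = TextLMDataBunch.from_csv('data', 'texts.csv', text_cols='text')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the pretrained LM on the new domain

# Sample from the fine-tuned LM to get "artificial" in-domain sentences.
seeds = ['The patient was', 'Treatment with']
extra_data = [learn.predict(s, n_words=40, temperature=0.8) for s in seeds]
```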
Thanks! @Benudek
Great ideas. I won’t pretend to be smart and make an expert comment, but my intuition agrees with yours; it would be “artificial NLP data”(?)
The only way to find out is by experimenting and checking. I’ve added your suggestion to my to-do list and will share any failures/successes/ideas here.
On second thought, since the model generates language from the data it was trained on, I’m not sure how much variance it would add to the data. Again, only one way to find out.
Here is a recent paper summary claiming some tricks that helped them outperform ULMFiT in some cases; I want to try these too sometime soon.
In an ideal world where everything works, a collection of all these ideas might give some great results. But again, I won’t pretend to be an expert.
So let’s say your starting text is a medical text. Then you take a language model that was trained on common language, say Wikipedia. Then you compare the two and maybe you can transfer the style from common text to medical text … and you have already augmented your source text. A bit like style transfer; the advantage might be that with language models you can generate whatever NLP data you want, in whatever quantity.
My rough idea would be: you compare two texts and try to tweak the original text towards the style of the other one, where this other one you might have generated with a language model (hence you have as much data, in whatever style you want, as you care to generate). Then measure the distance between the texts, and in the original text try to take over the style from the other one, i.e. replace some words or sentence structures.
So you transfer style: different tone or vocabulary, same content. It’s really like augmenting a picture: a different viewer’s perspective on an object; in language, it would be a different style describing the same thing.
No clue on the details, e.g. how to measure the distance between text styles, or whether replacing parts of texts can work (especially with grammar rules, though language models kind of seem to understand those?) … just throwing it out here, but a crude sketch of one possible distance measure is below.
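I have no idea what the right distance measure is, but as a crude, self-contained proxy one could compare character n-gram TF-IDF vectors, since character n-grams tend to reflect stylistic cues (function words, morphology) more than topic words; the example sentences are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def style_distance(text_a, text_b):
    # Character n-grams lean towards style rather than topic,
    # so cosine distance over them is a rough style proxy.
    vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
    tfidf = vec.fit_transform([text_a, text_b])
    return 1 - cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(style_distance('The patient presented with acute symptoms.',
                     'The guy showed up feeling really sick.'))
```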
Yes, definitely.
I want to wait until the Part 2 lessons covering NLP are taught. It would be unwise to bother Sylvain before doing my homework.
Hello everyone, I’m working on NLP with data augmentation too.
One of the ideas may be this paper presented at TASS 2018. They explore some innovative ideas in data augmentation, including the “Back Translation” idea that you mention (a rough sketch of it is below).
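For anyone curious, here is a minimal back-translation sketch assuming the Helsinki-NLP MarianMT models from the Hugging Face transformers library; the model names and the Spanish/English pivot are my own choices for illustration, not what the paper uses:

```python
from transformers import MarianMTModel, MarianTokenizer

def load_mt(name):
    # Load a MarianMT translation model and its tokenizer.
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Spanish -> English -> Spanish round trip yields paraphrased tweets.
tok_fwd, mod_fwd = load_mt('Helsinki-NLP/opus-mt-es-en')
tok_bwd, mod_bwd = load_mt('Helsinki-NLP/opus-mt-en-es')

tweets = ['La película me pareció sorprendentemente buena.']
pivot = translate(tweets, tok_fwd, mod_fwd)
augmented = translate(pivot, tok_bwd, mod_bwd)
print(augmented)
```

The round trip through a pivot language tends to preserve meaning while varying wording, which is exactly the “same content, different style” effect discussed above.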
Another idea I’m exploring is using a tweets dataset to create the language model and then fine-tuning it into a sentiment analysis classifier, roughly as sketched below. For the TASS competition, of course.
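Roughly, the ULMFiT-style pipeline I have in mind looks like this in fastai v1 (a hypothetical `tweets.csv` with `text`/`label` columns; the hyperparameters are placeholders, not the actual competition setup):

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Hypothetical 'data/tweets.csv' with 'text' and 'label' columns.
data_lm = TextLMDataBunch.from_csv('data', 'tweets.csv', text_cols='text')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(2, 1e-2)          # fine-tune the LM on tweets
lm.save_encoder('tweets_enc')      # keep the fine-tuned encoder

data_clas = TextClasDataBunch.from_csv('data', 'tweets.csv',
                                       text_cols='text', label_cols='label',
                                       vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('tweets_enc')     # transfer the LM weights
clf.fit_one_cycle(2, 1e-2)         # train the sentiment classifier
```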
Hello there, I am working on a graduation project (a CNN model to classify text), but I need text augmentation because my dataset is very small. Where can I find a working implementation of this idea?
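One simple starting point, a sketch rather than a battle-tested library, is synonym replacement via NLTK’s WordNet; everything here is illustrative:

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def synonym_replace(sentence, n=2):
    """Replace up to n words with a random WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
            replaced += 1
        if replaced >= n:
            break
    return ' '.join(words)

print(synonym_replace('the small boat moved quickly across the lake'))
```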