Data Augmentation using a Thesaurus: thesaurus-based approaches are all I’ve come across so far, but we’ll look for others and post if anything interesting turns up. The problem with thesaurus-based approaches is that you usually can’t just use an off-the-shelf thesaurus for most tasks. Some results were shown in this paper: https://arxiv.org/abs/1502.01710
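To make the idea concrete, here is a minimal sketch of thesaurus-based synonym replacement. The tiny hand-rolled `THESAURUS` dict is invented for illustration; as noted above, curating a task-appropriate thesaurus is the hard part in practice.

```python
import random

# Toy thesaurus, invented for illustration. In practice an off-the-shelf
# thesaurus rarely matches the register of your task's corpus.
THESAURUS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def augment(sentence, p=0.5, rng=random):
    """Replace each word that has synonyms with one of them, with probability p."""
    out = []
    for word in sentence.split():
        syns = THESAURUS.get(word.lower())
        if syns and rng.random() < p:
            out.append(rng.choice(syns))
        else:
            out.append(word)
    return " ".join(out)

print(augment("a good movie with bad pacing", p=1.0, rng=random.Random(0)))
```

Each call with `p > 0` yields a different augmented variant of the same labeled sentence, which is where the extra training data comes from.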
Updated:
There’s another interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” (ICLR 2017): https://arxiv.org/abs/1703.02573
> In this work, we consider noising primitives as a form of data augmentation for recurrent neural network-based language models.
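One of the noising primitives from that paper is unigram noising: with some probability γ, replace an input token with a token sampled from the corpus unigram distribution. A minimal sketch (not the paper’s code):

```python
import random
from collections import Counter

# Unigram noising (Xie et al., ICLR 2017 sketch): with probability gamma,
# replace an input token with a draw from the corpus unigram distribution.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
vocab = list(counts)
weights = [counts[w] for w in vocab]

def noise(tokens, gamma=0.25, rng=random):
    """Return a noised copy of tokens; gamma=0 leaves the input unchanged."""
    return [rng.choices(vocab, weights)[0] if rng.random() < gamma else t
            for t in tokens]

print(noise("the cat sat on the mat".split(), gamma=0.25, rng=random.Random(42)))
```

The paper’s analysis connects this kind of noising to smoothing in classical n-gram language models.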
A follow-up question on Jeremy’s last answer. He says it makes more sense to use xxunk… but in that case, wouldn’t it help the model to know that it was a name? The same with a location, or a special kind of number… that kind of thing?
After tokenization, words are just numbers. The network does not care about fonts or style, since none of that information is given to it: only numbers, otherwise known as token IDs.
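A minimal sketch of what numericalization looks like (the toy vocab here is made up; only the `xxunk` convention matches fastai’s):

```python
# After numericalization the model only ever sees integer IDs.
# Unknown words all collapse to the same xxunk ID.
vocab = {"xxunk": 0, "the": 1, "cat": 2, "sat": 3}

def numericalize(tokens):
    return [vocab.get(t, vocab["xxunk"]) for t in tokens]

print(numericalize(["the", "cat", "sat", "purring"]))  # → [1, 2, 3, 0]
```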
My guess is that it might end up harming more than helping, since synonyms are not always easily interchangeable in a sentence. It might also cause a combinatorial explosion of augmented sentences, based on the fan-out of synonyms at each replacement. Maybe if we take a very small subset of strictly interchangeable synonyms, it could be useful.
I think Jeremy would like to get a subset of the data such that you can run on your local machine and get results in seconds. So, subsetting? Maybe the backend does something like the vision side, where it reads in documents for each batch run. It is pretty common to store documents in the folder format the vision module expects: Train/Class/doc1.txt… But I wonder: how does preprocessing like normalization and scaling work if you cannot fit the whole dataset in memory?
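On the last question: statistics like mean and standard deviation can be accumulated batch by batch, so the whole dataset never needs to be in memory. A sketch using a running count/mean/M2 update (Welford-style); this illustrates the idea, not what fastai’s backend actually does:

```python
import math

def update(state, batch):
    """Fold one batch of values into a (count, mean, M2) running state."""
    count, mean, m2 = state
    for x in batch:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2

# Stream batches through without ever holding all values at once.
state = (0, 0.0, 0.0)
for batch in ([1.0, 2.0], [3.0, 4.0], [5.0]):
    state = update(state, batch)

count, mean, m2 = state
std = math.sqrt(m2 / count)   # population std of the full stream
print(mean, std)
```

The final mean and std match what you would get by computing them over the full dataset in one pass, so the resulting normalization is exact, not approximate.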
Could you mention a specific, systematic way to add noise to text? An image with noise is still a meaningful image, but I am afraid that noise can easily make a sentence grammatically incorrect or even meaningless.
That does help, actually. When making a classifier for Spanish tweets, I added a token for laughs (‘jajajaja’), a token for numbers (‘34’), a token for users (‘@user’) and another for hashtags (‘#fastai’). It helped performance.
I pre-tokenized: I iterated twice over the dataset, once to pre-tokenize by replacing those tokens with their corresponding placeholders, and then ran the official fastai tokenizer.
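A minimal sketch of that first pre-tokenization pass, using regex substitution. The placeholder names (`xxlaugh`, `xxuser`, `xxhashtag`, `xxnum`) are invented for illustration, not fastai built-ins:

```python
import re

# First pass: replace pattern-matched spans with placeholder tokens before
# running the main tokenizer. Placeholder names are invented for illustration.
RULES = [
    (re.compile(r"\b(?:ja){2,}\b", re.I), " xxlaugh "),   # 'jajaja' laughs
    (re.compile(r"@\w+"), " xxuser "),                     # @mentions
    (re.compile(r"#\w+"), " xxhashtag "),                  # hashtags
    (re.compile(r"\b\d+\b"), " xxnum "),                   # bare numbers
]

def pre_tokenize(text):
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return " ".join(text.split())  # normalize whitespace

print(pre_tokenize("jajaja @user vi 34 goles #fastai"))
# → xxlaugh xxuser vi xxnum goles xxhashtag
```

The second pass is then just running the regular tokenizer over the placeholder-substituted text.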
If we have only a few features (say 5) and a few observations (on the order of thousands), can deep learning still work? Asking because there seems to be a general opinion out there that deep learning needs a lot of data. Transfer learning can mitigate this in image and text classification, but is there a solution for structured/tabular deep learning?