Lesson 4 In-Class Discussion ✅

Data Augmentation using a Thesaurus: thesaurus-based approaches are all I’ve come across so far, but we’ll look for others and post if we find anything interesting. The problem with thesaurus-based approaches is that you usually can’t just use an off-the-shelf thesaurus for most tasks. Some results were shown in this paper: https://arxiv.org/abs/1502.01710
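For anyone curious, here is a rough sketch of what thesaurus-based replacement could look like, using NLTK’s WordNet as a stand-in for an off-the-shelf thesaurus (the paper uses its own thesaurus; the probability and helper name below are just illustrative):

```python
import random
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def synonym_replace(tokens, p=0.1):
    """Randomly replace some tokens with a WordNet synonym."""
    out = []
    for tok in tokens:
        if random.random() < p:
            # collect candidate lemmas from all synsets of this word
            lemmas = {l.name().replace('_', ' ')
                      for syn in wn.synsets(tok) for l in syn.lemmas()}
            lemmas.discard(tok)
            if lemmas:
                tok = random.choice(sorted(lemmas))
        out.append(tok)
    return out

print(synonym_replace("the movie was really good".split()))
```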

Updated:

There’s another interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” (ICLR 2017): https://arxiv.org/abs/1703.02573

> In this work, we consider noising primitives as a form of data augmentation for recurrent neural network-based language models.
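If I’m reading the paper right, one of its simplest primitives is “blank noising”, where input tokens are replaced with a placeholder with some probability gamma (it also discusses noising with samples from the unigram distribution). A rough sketch, with gamma and the blank token chosen arbitrarily here:

```python
import random

def blank_noise(tokens, gamma=0.2, blank='_'):
    """With probability gamma, replace an input token with a blank placeholder."""
    return [blank if random.random() < gamma else t for t in tokens]

print(blank_noise("the cat sat on the mat".split()))
# e.g. ['the', '_', 'sat', 'on', 'the', '_']
```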

5 Likes

A follow-up question on Jeremy’s last answer. He says it makes more sense to use xxunk… but in that case, wouldn’t it help the model to know that it was a name? The same with a location, or a special kind of number… that kind of thing?

2 Likes

Thank you! This is helpful. I guess this is not built into fastai currently.

Yes, but it’s the old version, I think fastai 0.7 with PyTorch 0.4.

1 Like

After tokenization, words are just numbers. The network does not care about fonts or style, since none of that information is given to it: only numbers, otherwise called tokens.
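As a toy illustration of that mapping (vocabulary and sentences made up):

```python
# build a tiny vocabulary: each token maps to an integer id
texts = ["the movie was great", "the movie was terrible"]
vocab = {tok: i for i, tok in enumerate(sorted({t for s in texts for t in s.split()}))}

# the model only ever sees these ids, never the original characters or styling
ids = [[vocab[t] for t in s.split()] for s in texts]
print(vocab)  # {'great': 0, 'movie': 1, 'terrible': 2, 'the': 3, 'was': 4}
print(ids)    # [[3, 1, 4, 0], [3, 1, 4, 2]]
```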

1 Like

My guess is that it might end up harming more than helping, since synonyms are not always easily interchangeable in a sentence. It might also cause a combinatorial explosion of every sentence, based on the fan-out of synonyms for each replacement. Maybe if we take a very small subset of strictly interchangeable ones, it might be useful.

1 Like

Is collaborative filtering the same as tabular, but where the columns are interdependent?

1 Like

How? What are the criteria?

I think Jeremy likes to get a subset of the data such that you can run it on your local machine and get results in seconds. So, subsetting? Maybe the backend does something like it does with images, where it reads in documents for each batch run. It is pretty common to store documents in the folder format that the vision module expects: Train/Class/doc1.txt… But I wonder, how does preprocessing like normalization and scaling work if you cannot fit the whole dataset in memory?
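A rough sketch of reading that kind of layout lazily, one document at a time, so nothing has to sit in memory all at once (paths and class folders below are hypothetical):

```python
from pathlib import Path

def read_folder_dataset(root):
    """Yield (text, label) pairs from a Train/Class/doc.txt style layout."""
    for class_dir in Path(root).iterdir():
        if not class_dir.is_dir():
            continue
        for doc in class_dir.glob('*.txt'):
            yield doc.read_text(encoding='utf-8'), class_dir.name

# hypothetical usage: data/train/pos/doc1.txt, data/train/neg/doc2.txt, ...
# for text, label in read_folder_dataset('data/train'):
#     ...
```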

Could you mention a specific, systematic way to add noise to text? An image with noise is still a meaningful image, but I am afraid that noise can easily make a sentence grammatically incorrect or even meaningless.

1 Like

That does help, actually. When making a classifier for Spanish tweets, I added a token for laughs (‘jajajaja’), a token for numbers (‘34’), a token for users (‘@user’), and another for hashtags (‘#fastai’). It helped performance.
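For illustration, a minimal sketch of this kind of pre-tokenization; the regexes and the xx* token names here are made up, not the exact ones used:

```python
import re

def pre_tokenize(text):
    """Replace tweet-specific patterns with placeholder tokens before tokenization."""
    text = re.sub(r'@\w+', ' xxuser ', text)           # user mentions
    text = re.sub(r'#\w+', ' xxhashtag ', text)        # hashtags
    text = re.sub(r'\b\d+\b', ' xxnumber ', text)      # bare numbers
    text = re.sub(r'\b(?:ja){2,}\b', ' xxlaugh ', text, flags=re.IGNORECASE)  # jajaja...
    return re.sub(r'\s+', ' ', text).strip()

print(pre_tokenize("jajaja @maria gano 34 veces #fastai"))
# 'xxlaugh xxuser gano xxnumber veces xxhashtag'
```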

4 Likes

A hand-curated set would be the easiest way.
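For example, a minimal sketch with a tiny hand-curated map of strictly interchangeable words, along the lines suggested above (the entries and probability are just examples):

```python
import random

# small, hand-curated set of near-perfectly interchangeable synonyms (example entries)
SWAPS = {
    'movie': ['film'],
    'big':   ['large'],
    'quick': ['fast'],
}

def augment(tokens, p=0.3):
    """Swap a token for a curated synonym with probability p."""
    return [random.choice(SWAPS[t]) if t in SWAPS and random.random() < p else t
            for t in tokens]

print(augment("the big movie was quick".split()))
```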

How did you do that? Did you customize the tokenizer? How?

  1. For example, I want to see if my NLP model is robust, so would this be a sensitivity test?
  2. I also have heavily imbalanced cases where the 1s can be really rare; would augmenting the rare class be a solution to imbalanced data?

@rachel this question gets 6 likes

2 Likes

I pre-tokenized :sunglasses:. I iterated over the dataset twice: once to pre-tokenize by replacing those patterns with their corresponding placeholders, and then I ran the official fastai tokenizer.

3 Likes

If we have only a few features (say 5) and a few observations (on the order of thousands), can deep learning still work? I’m asking because there seems to be a general opinion out there that deep learning needs a lot of data. Using transfer learning can mitigate such problems in image and text classification, but for structured deep learning, is there a solution?

Please try it and let me know your results. If it does help the model generalize better, then that’s great!

What are we trying to “predict” with this collaborative filtering example?

1 Like

How a person will rate a movie (some of the movie ratings are held out for the validation and test sets).
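For context, this is roughly the embedding dot-product idea: a minimal PyTorch sketch with arbitrary sizes and names, not the exact fastai model:

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    """Predict a rating as the dot product of a user embedding and a movie embedding."""
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)

    def forward(self, user_ids, movie_ids):
        return (self.u(user_ids) * self.m(movie_ids)).sum(dim=1)

model = DotProduct(n_users=1000, n_movies=2000)
# predicted ratings for two (user, movie) pairs
pred = model(torch.tensor([3, 7]), torch.tensor([42, 99]))
```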

1 Like