Data Augmentation using a Thesaurus: thesaurus-based approaches are all I’ve come across so far, but we’ll look for others and post if anything interesting turns up. The problem with thesaurus-based approaches is that you usually can’t just use an off-the-shelf thesaurus for most tasks. Some results were shown in this paper: https://arxiv.org/abs/1502.01710
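To make the idea concrete, here is a minimal sketch of thesaurus-based synonym replacement. The tiny hand-rolled `THESAURUS` dict is invented for illustration; as noted above, curating a task-appropriate thesaurus is the hard part in practice.

```python
import random

# Toy thesaurus, invented for illustration. In practice an off-the-shelf
# thesaurus rarely matches the register of your task's corpus.
THESAURUS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def augment(sentence, p=0.5, rng=random):
    """Replace each word that has synonyms with one of them, with probability p."""
    out = []
    for word in sentence.split():
        syns = THESAURUS.get(word.lower())
        if syns and rng.random() < p:
            out.append(rng.choice(syns))
        else:
            out.append(word)
    return " ".join(out)

print(augment("a good movie with bad pacing", p=1.0, rng=random.Random(0)))
```

Each call with `p > 0` yields a different augmented variant of the same labeled sentence, which is where the extra training data comes from.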
Updated:
There’s another interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” (ICLR 2017): https://arxiv.org/abs/1703.02573
> In this work, we consider noising primitives as a form of data augmentation for recurrent neural network-based language models.
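One of the noising primitives from that paper is unigram noising: with some probability γ, replace an input token with a token sampled from the corpus unigram distribution. A minimal sketch (not the paper’s code):

```python
import random
from collections import Counter

# Unigram noising (Xie et al., ICLR 2017 sketch): with probability gamma,
# replace an input token with a draw from the corpus unigram distribution.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
vocab = list(counts)
weights = [counts[w] for w in vocab]

def noise(tokens, gamma=0.25, rng=random):
    """Return a noised copy of tokens; gamma=0 leaves the input unchanged."""
    return [rng.choices(vocab, weights)[0] if rng.random() < gamma else t
            for t in tokens]

print(noise("the cat sat on the mat".split(), gamma=0.25, rng=random.Random(42)))
```

The paper’s analysis connects this kind of noising to smoothing in classical n-gram language models.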
A follow-up question on Jeremy’s last answer. He says it makes more sense to use xxunk… but in that case, wouldn’t it help the model to know that it was a name? The same with a location, or a special kind of number… that kind of thing?
After tokenization, words are just numbers. The network does not care about fonts or style, since none of that information is given to it: only numbers, otherwise known as token IDs.
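A minimal sketch of what numericalization looks like (the toy vocab here is made up; only the `xxunk` convention matches fastai’s):

```python
# After numericalization the model only ever sees integer IDs.
# Unknown words all collapse to the same xxunk ID.
vocab = {"xxunk": 0, "the": 1, "cat": 2, "sat": 3}

def numericalize(tokens):
    return [vocab.get(t, vocab["xxunk"]) for t in tokens]

print(numericalize(["the", "cat", "sat", "purring"]))  # → [1, 2, 3, 0]
```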
My guess is that it might end up harming more than helping, since synonyms are not always easily interchangeable in a sentence. It might also cause a combinatorial explosion of augmented sentences, based on the fan-out of synonyms at each replacement. Maybe if we take a very small subset of strictly interchangeable synonyms, it could be useful.
I think Jeremy would like to get a subset of the data such that you can run on your local machine and get results in seconds. So, subsetting? Maybe the backend does something like the vision side, where it reads in documents for each batch run. It is pretty common to store documents in the folder format the vision module expects: Train/Class/doc1.txt… But I wonder: how does preprocessing like normalization and scaling work if you cannot fit the whole dataset in memory?
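On the last question: statistics like mean and standard deviation can be accumulated batch by batch, so the whole dataset never needs to be in memory. A sketch using a running count/mean/M2 update (Welford-style); this illustrates the idea, not what fastai’s backend actually does:

```python
import math

def update(state, batch):
    """Fold one batch of values into a (count, mean, M2) running state."""
    count, mean, m2 = state
    for x in batch:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2

# Stream batches through without ever holding all values at once.
state = (0, 0.0, 0.0)
for batch in ([1.0, 2.0], [3.0, 4.0], [5.0]):
    state = update(state, batch)

count, mean, m2 = state
std = math.sqrt(m2 / count)   # population std of the full stream
print(mean, std)
```

The final mean and std match what you would get by computing them over the full dataset in one pass, so the resulting normalization is exact, not approximate.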
Could you mention a specific, systematic way to add noise to text? An image with noise is still a meaningful image, but I am afraid that noise can easily make a sentence grammatically incorrect or even meaningless.
That does help, actually. When making a classifier for Spanish tweets, I added a token for laughs (‘jajajaja’), a token for numbers (‘34’), a token for users (‘@user’) and another for hashtags (‘#fastai’). It helped performance.
I pre-tokenized: I iterated twice over the dataset, once to pre-tokenize by replacing those tokens with their corresponding placeholders, and then ran the official fastai tokenizer.
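A minimal sketch of that first pre-tokenization pass, using regex substitution. The placeholder names (`xxlaugh`, `xxuser`, `xxhashtag`, `xxnum`) are invented for illustration, not fastai built-ins:

```python
import re

# First pass: replace pattern-matched spans with placeholder tokens before
# running the main tokenizer. Placeholder names are invented for illustration.
RULES = [
    (re.compile(r"\b(?:ja){2,}\b", re.I), " xxlaugh "),   # 'jajaja' laughs
    (re.compile(r"@\w+"), " xxuser "),                     # @mentions
    (re.compile(r"#\w+"), " xxhashtag "),                  # hashtags
    (re.compile(r"\b\d+\b"), " xxnum "),                   # bare numbers
]

def pre_tokenize(text):
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return " ".join(text.split())  # normalize whitespace

print(pre_tokenize("jajaja @user vi 34 goles #fastai"))
# → xxlaugh xxuser vi xxnum goles xxhashtag
```

The second pass is then just running the regular tokenizer over the placeholder-substituted text.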
If we have only a few features (say 5) and a few observations (on the order of thousands), can deep learning still work? Asking because there seems to be a general opinion out there that deep learning needs a lot of data. Transfer learning can mitigate this in image and text classification, but is there a solution for structured/tabular deep learning?