Lesson 4 In-Class Discussion ✅

How do we decide the number of layers, and the number of units in those layers, for structured (tabular) deep learning?

9 Likes

@sgugger does normalize also handle the issues traditionally caused by skewed distributions?

untar_data() automatically adds .tgz to the URL for downloading.
It actually fetches: http://files.fast.ai/data/examples/adult_sample.tgz
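
For example (assuming fastai v1, where untar_data and URLs live in fastai.datasets):

```python
from fastai.datasets import untar_data, URLs

# URLs.ADULT_SAMPLE is the base URL without the extension;
# untar_data appends .tgz, downloads, and extracts (to ~/.fastai/data by default)
path = untar_data(URLs.ADULT_SAMPLE)
print(path)  # e.g. ~/.fastai/data/adult_sample
```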

1 Like

@rachel Thank you so much for keeping track of the discussion! Is it possible to mention the post number of the question, so we can jump to it and read? It really helps people who are listening in a noisy environment.

1 Like

Thanks, I was thinking more in terms of the format of the text, like font, style, etc.

Data Augmentation using a Thesaurus: thesaurus-based approaches are all I’ve come across so far, but we’ll look for others and post if we find anything interesting. The problem with thesaurus-based approaches is that you usually can’t just use an off-the-shelf thesaurus for most tasks. Some results were shown in this paper: https://arxiv.org/abs/1502.01710
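
As a rough illustration of the thesaurus idea, here is a minimal sketch using NLTK’s WordNet; the replacement probability and the token handling are my own choices, not from the paper:

```python
import random
from nltk.corpus import wordnet  # assumes nltk is installed and the 'wordnet' corpus downloaded

def synonym_replace(tokens, p=0.1):
    """Replace each token with a random WordNet synonym with probability p."""
    out = []
    for tok in tokens:
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(tok) for l in s.lemmas()} - {tok}
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(tok)
    return out

print(synonym_replace("a quick review of the movie".split(), p=0.3))
```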

Updated:

There’s another interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” (ICLR 2017): https://arxiv.org/abs/1703.02573

In this work, we consider noising primitives as a form of data augmentation
for recurrent neural network-based language models.
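
The paper’s two simplest primitives (“blank” and “unigram” noising) are easy to sketch; the placeholder token name and the toy unigram distribution below are my own illustration choices:

```python
import random

def blank_noise(tokens, gamma=0.1, blank='xxblank'):
    # "Blank noising": replace each token with a placeholder with probability gamma
    return [blank if random.random() < gamma else t for t in tokens]

def unigram_noise(tokens, unigram, gamma=0.1):
    # "Unigram noising": with probability gamma, replace each token with a draw
    # from the corpus unigram distribution (unigram: {word: probability})
    words, probs = zip(*unigram.items())
    return [random.choices(words, weights=probs)[0]
            if random.random() < gamma else t for t in tokens]

sent = "the cat sat on the mat".split()
print(blank_noise(sent, gamma=0.3))
print(unigram_noise(sent, {'the': 0.5, 'cat': 0.25, 'dog': 0.25}, gamma=0.3))
```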

5 Likes

A follow-up question on Jeremy’s last answer. He says it makes more sense to use xxunk … but in that case, wouldn’t it help the model to know that it was a name? The same with a location, or a special kind of number… that kind of thing?

2 Likes

Thank you! This is helpful. I guess this is not built into fastai currently.

Yes, but that is the old version, I think fastai 0.7 with PyTorch 0.4.

1 Like

After tokenization, words are just numbers. The network does not care about fonts or style, since none of that information is given to it; it only sees numbers, i.e. the IDs of the tokens.
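
A toy illustration (crude whitespace tokenizer and a made-up vocab, just to show what the model actually receives):

```python
# After tokenization and numericalization, the model only ever sees
# integer IDs, not fonts, styles, or even the raw characters.
text = "The network only sees numbers"
tokens = text.lower().split()                        # crude whitespace tokenizer
vocab = {t: i for i, t in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]
print(tokens)   # ['the', 'network', 'only', 'sees', 'numbers']
print(ids)      # [0, 1, 2, 3, 4]
```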

1 Like

My guess is that it might end up harming more than helping, since synonyms are not always easily interchangeable within a sentence. It might also cause a combinatorial explosion of every sentence, based on the fan-out of synonyms for each replacement. Maybe if we take a very small subset of strictly interchangeable ones, it might be useful.

1 Like

Is collaborative filtering the same as tabular, but with interdependent columns?

1 Like

How? What are the criteria?

I think Jeremy would like to get a subset of the data such that you can run it on your local machine and get results in seconds. So, subsetting? Maybe the backend does something like it does with images, where it reads in documents for each batch run. It is pretty common to store documents in the folder format that the vision module expects: Train/Class/doc1.txt… But I wonder how preprocessing like normalization and scaling works if you cannot get the whole dataset into memory.
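
One common answer (not necessarily what fastai does internally) is to accumulate the statistics batch by batch in a single pass; a minimal sketch:

```python
import numpy as np

# Running mean/std computed one batch at a time, so normalization
# statistics never require the whole dataset in memory.
count, total, total_sq = 0, 0.0, 0.0
batches = (np.random.randn(64) for _ in range(100))   # stand-in for a data loader
for batch in batches:
    count += batch.size
    total += batch.sum()
    total_sq += (batch ** 2).sum()
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
print(mean, std)   # use these to normalize each batch as it streams through
```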

Could you mention a specific, systematic way to add noise to text? An image with noise is still a meaningful image, but I am afraid that noise can easily make a sentence grammatically incorrect or even meaningless.

1 Like

That does help, actually. When making a classifier for Spanish tweets, I added a token for laughs (‘jajajaja’), a token for numbers (‘34’), a token for users (‘@user’), and another for hashtags (‘#fastai’). It helped performance.
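
Something like this can be done with simple regex rules applied before tokenization. A sketch; the token names are my own, and the fastai v1 Tokenizer usage in the comment is an assumption, so check the API for your version:

```python
import re

# Pre-processing rules (str -> str) applied to raw text before tokenizing.
def sub_laughs(t):   return re.sub(r'\b(?:ja)+ja\b', ' xxlaugh ', t, flags=re.I)
def sub_numbers(t):  return re.sub(r'\b\d+\b', ' xxnum ', t)
def sub_users(t):    return re.sub(r'@\w+', ' xxuser ', t)
def sub_hashtags(t): return re.sub(r'#\w+', ' xxhash ', t)

text = "jajaja @user dice 34 cosas #fastai"
for rule in (sub_laughs, sub_numbers, sub_users, sub_hashtags):
    text = rule(text)
print(text)   # ' xxlaugh   xxuser  dice  xxnum  cosas  xxhash '

# With fastai v1 this might plug in as (assumed usage):
# from fastai.text import Tokenizer, defaults
# tok = Tokenizer(pre_rules=defaults.text_pre_rules +
#                 [sub_laughs, sub_numbers, sub_users, sub_hashtags])
```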

4 Likes

A hand-curated set would be the easiest way.

How did you do that? Did you customize the tokenizer? How?

  1. For example, I want to see if my NLP model is robust, so would this be a sensitivity test?
  2. I also have super-imbalanced cases where 1s can be really rare; would augmenting the rare class be a solution to imbalanced data?

@rachel this question gets 6 likes

2 Likes