Lesson 4 In-Class Discussion ✅

How do we decide the number of layers, and the number of units in those layers, for structured (tabular) deep learning?

9 Likes

@sgugger does normalize also handle the issues traditionally caused by skewed distributions?

untar_data() automatically adds .tgz to the URL for downloading.
It actually fetches: http://files.fast.ai/data/examples/adult_sample.tgz
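
For example (assuming fastai v1, where untar_data and URLs live in fastai.datasets):

```python
from fastai.datasets import untar_data, URLs

# URLs.ADULT_SAMPLE is the base URL without the extension;
# untar_data appends .tgz, downloads, and extracts (to ~/.fastai/data by default)
path = untar_data(URLs.ADULT_SAMPLE)
print(path)  # e.g. ~/.fastai/data/adult_sample
```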

1 Like

@rachel Thank you so much for keeping track of the discussion! Is it possible to mention the post number of the question, so we can jump to it and read? It really helps people who are listening in a noisy environment.

1 Like

Thanks, I was thinking more in terms of the format of the text, like font, style, etc.

Data Augmentation using a Thesaurus: thesaurus-based approaches are all I’ve come across so far, but we’ll look for others and post if we find anything interesting. The problem with thesaurus-based approaches is that you usually can’t just use an off-the-shelf thesaurus for most tasks. Some results were shown in this paper: https://arxiv.org/abs/1502.01710
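
As a rough illustration of the thesaurus idea, here is a minimal sketch using NLTK’s WordNet; the replacement probability and the token handling are my own choices, not from the paper:

```python
import random
from nltk.corpus import wordnet  # assumes nltk is installed and the 'wordnet' corpus downloaded

def synonym_replace(tokens, p=0.1):
    """Replace each token with a random WordNet synonym with probability p."""
    out = []
    for tok in tokens:
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(tok) for l in s.lemmas()} - {tok}
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(tok)
    return out

print(synonym_replace("a quick review of the movie".split(), p=0.3))
```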

Updated:

There’s another interesting technique for data augmentation specific to RNNs from “Data Noising as Smoothing in Neural Network Language Models” (ICLR 2017): https://arxiv.org/abs/1703.02573

In this work, we consider noising primitives as a form of data augmentation
for recurrent neural network-based language models.
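
The paper’s two simplest primitives (“blank” and “unigram” noising) are easy to sketch; the placeholder token name and the toy unigram distribution below are my own illustration choices:

```python
import random

def blank_noise(tokens, gamma=0.1, blank='xxblank'):
    # "Blank noising": replace each token with a placeholder with probability gamma
    return [blank if random.random() < gamma else t for t in tokens]

def unigram_noise(tokens, unigram, gamma=0.1):
    # "Unigram noising": with probability gamma, replace each token with a draw
    # from the corpus unigram distribution (unigram: {word: probability})
    words, probs = zip(*unigram.items())
    return [random.choices(words, weights=probs)[0]
            if random.random() < gamma else t for t in tokens]

sent = "the cat sat on the mat".split()
print(blank_noise(sent, gamma=0.3))
print(unigram_noise(sent, {'the': 0.5, 'cat': 0.25, 'dog': 0.25}, gamma=0.3))
```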

5 Likes

A follow-up question on Jeremy’s last answer. He says it makes more sense to use xxunk … but in that case, wouldn’t it help the model to know that it was a name? The same with a location, or a special kind of number… that kind of thing?

2 Likes

Thank you! This is helpful. I guess this is not built into fastai currently.

Yes, but that is the old version, I think fastai 0.7 with PyTorch 0.4.

1 Like

After tokenization, words are just numbers. The network does not care about fonts or style, since none of that information is given to it; it only sees numbers, i.e. the IDs of the tokens.
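
A toy illustration (crude whitespace tokenizer and a made-up vocab, just to show what the model actually receives):

```python
# After tokenization and numericalization, the model only ever sees
# integer IDs, not fonts, styles, or even the raw characters.
text = "The network only sees numbers"
tokens = text.lower().split()                        # crude whitespace tokenizer
vocab = {t: i for i, t in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]
print(tokens)   # ['the', 'network', 'only', 'sees', 'numbers']
print(ids)      # [0, 1, 2, 3, 4]
```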

1 Like

My guess is that it might end up harming more than helping, since synonyms are not always easily interchangeable within a sentence. It might also cause a combinatorial explosion of every sentence, based on the fan-out of synonyms for each replacement. Maybe if we take a very small subset of strictly interchangeable ones, it might be useful.

1 Like

Is collaborative filtering the same as tabular, but with interdependent columns?

1 Like

How? What are the criteria?

I think Jeremy would like to get a subset of the data such that you can run it on your local machine and get results in seconds. So, subsetting? Maybe the backend does something like it does with images, where it reads in documents for each batch run. It is pretty common to store documents in the folder format that the vision module expects: Train/Class/doc1.txt… But I wonder how preprocessing like normalization and scaling works if you cannot get the whole dataset into memory.
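
One common answer (not necessarily what fastai does internally) is to accumulate the statistics batch by batch in a single pass; a minimal sketch:

```python
import numpy as np

# Running mean/std computed one batch at a time, so normalization
# statistics never require the whole dataset in memory.
count, total, total_sq = 0, 0.0, 0.0
batches = (np.random.randn(64) for _ in range(100))   # stand-in for a data loader
for batch in batches:
    count += batch.size
    total += batch.sum()
    total_sq += (batch ** 2).sum()
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
print(mean, std)   # use these to normalize each batch as it streams through
```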

Could you mention a specific, systematic way to add noise to text? An image with noise is still a meaningful image, but I am afraid that noise can easily make a sentence grammatically incorrect or even meaningless.

1 Like

That does help, actually. When making a classifier for Spanish tweets, I added a token for laughs (‘jajajaja’), a token for numbers (‘34’), a token for users (‘@user’), and another for hashtags (‘#fastai’). It helped performance.
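
Something like this can be done with simple regex rules applied before tokenization. A sketch; the token names are my own, and the fastai v1 Tokenizer usage in the comment is an assumption, so check the API for your version:

```python
import re

# Pre-processing rules (str -> str) applied to raw text before tokenizing.
def sub_laughs(t):   return re.sub(r'\b(?:ja)+ja\b', ' xxlaugh ', t, flags=re.I)
def sub_numbers(t):  return re.sub(r'\b\d+\b', ' xxnum ', t)
def sub_users(t):    return re.sub(r'@\w+', ' xxuser ', t)
def sub_hashtags(t): return re.sub(r'#\w+', ' xxhash ', t)

text = "jajaja @user dice 34 cosas #fastai"
for rule in (sub_laughs, sub_numbers, sub_users, sub_hashtags):
    text = rule(text)
print(text)   # ' xxlaugh   xxuser  dice  xxnum  cosas  xxhash '

# With fastai v1 this might plug in as (assumed usage):
# from fastai.text import Tokenizer, defaults
# tok = Tokenizer(pre_rules=defaults.text_pre_rules +
#                 [sub_laughs, sub_numbers, sub_users, sub_hashtags])
```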

4 Likes

A hand-curated set would be the easiest way.

How did you do that? Did you customize the tokenizer? How?

  1. For example, I want to see if my NLP model is robust, so would this be a sensitivity test?
  2. I also have super-imbalanced cases where 1s can be really rare; would augmenting the rare class be a solution to imbalanced data?

@rachel this question gets 6 likes

2 Likes