Deep Learning Brasília - Lesson 4

Check your understanding of lesson 4

<<< Check your understanding of lesson 3 | Check your understanding of lesson 5 >>>

Hi everyone,

I watched the lesson 4 (part 1) video again to improve my understanding of it, and I took notes of the vocabulary used by @jeremy.

Let's play a little! Agreed? :wink:
Can you give a definition / a URL / an explanation for all of the following terms and expressions?

If so, you understood the fourth lesson perfectly! :sunglasses::sunglasses::sunglasses:

PS: if you don't want to test yourself, or if you want to check your answers, go to the post “Deep Learning 2: Part 1 Lesson 4” on @hiromi's blog : " super travail !!! " :slight_smile:

  • blog posts from fastai students mainly about how to get the best learning rate
    ** learning rate
    ** Stochastic Gradient Descent with Restart (SGDR)
    ** Differential Learning Rate
    ** (from @bushaev) the difficulty in minimizing the loss arises from saddle points rather than poor local minima
  • computer vision
  • lesson 4 is about 3 concepts :
    ** structured (data) learning (financial, …)
    ** NLP : Natural Language Processing
    ** Collaborative filtering
  • jupyter notebook : we used dropout in DogBreeds Kaggle competition
  • an activation is a number that has been calculated (the output of a layer for a given input)
  • kernel or filter : same thing
  • a fully connected layer (Linear layer) does a simple matrix multiplication
  • type ‘learn’ in your Jupyter notebook to see the layers added at the end of your learn object (your new model built from the pretrained model, which is here a ConvNet) : by default, the fastai library adds 2 linear layers (fully connected layers), each with the following structure : BatchNormalization, Dropout, Linear, ReLU (and the last layer outputs LogSoftmax)
  • dropout
    ** ps = 0.5 means delete each activation with a probability of 50% (at random)
    ** therefore, at each mini batch, we throw away different activations
    ** what does dropout do ? it forces the network to generalize better (it forces the network to learn the main characteristics) and thus avoids overfitting (over-specialization of your network)
    ** dropout was created around 2013 by Geoffrey Hinton and his team (inspired by how the brain works) : it addresses the problem of generalization, but why was dropout inspired by the brain ? (answer : https://www.quora.com/Machine-Learning-What-does-dropout-in-the-brain-works-with-respect-to-all-inputs-while-dropout-in-a-convolutional-network-works-with-respect-to-each-single-unit-mean)
    ** a high ps value helps generalize well but decreases your training accuracy
    ** a low ps value generalizes less well but gives you a better training accuracy
    ** why, in early training, is my validation loss better (lower) than my training loss ? The reason is that dropout is turned off when we evaluate the validation set.
    ** when we set ps=0.5, pytorch deletes each activation with a probability of 50% and doubles all the others : the average activation therefore does not change (see the PyTorch sketch just after this dropout list)
    ** ps=0.5 adds dropout to all the added last layers, but not to the first layers (the pretrained layers)
    ** ps=0 removes dropout ; then, in the first epoch, my validation accuracy is better (higher) than with a non-zero ps, but by the third epoch it is the contrary : dropout led to better generalization
    ** ps=0 : massive overfitting from the third epoch !
    ** if you don’t pass a ps value to the learn object, there is a default ps value for each added layer !
    ** ps=[p1,p2] lets you set a different dropout for each added layer (idea : the last dropout - the one before the final Linear layer - can be higher than the others)
    ** in fact, dropout adds noise
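To see the “delete each activation with probability 50% and double the others” behaviour concretely, here is a minimal PyTorch sketch (plain torch, not fastai) showing that dropout only acts in training mode and rescales the surviving activations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # ps = 0.5 : drop each activation with probability 50%
x = torch.ones(8)          # pretend these are 8 activations, all equal to 1.0

drop.train()               # training mode : dropout is active
print(drop(x))             # about half the values become 0, the survivors are doubled to 2.0

drop.eval()                # evaluation mode (validation set) : dropout is turned off
print(drop(x))             # all values stay 1.0, which is why early validation loss can look better
```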
  • add layers
    ** by default, the fastai learn object adds 2 Linear layers at the end of the network, but you can change the number of final Linear layers and the number of activations of each one by using xtra_fc=[x1,x2,...]
    ** therefore, an empty list xtra_fc=[] creates just one final Linear layer (see the sketch below)
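A minimal sketch of how ps and xtra_fc fit together, assuming the fastai 0.7 ConvLearner API used in the course (PATH, sz and the values below are just placeholders):

```python
from fastai.conv_learner import *   # fastai 0.7 imports, as in the course notebooks

PATH = 'data/dogscats/'             # placeholder dataset path
arch, sz = resnet34, 224
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))

# xtra_fc=[512] : one extra fully connected layer of 512 activations before the output layer
# ps=[0.25, 0.5] : a different dropout for each of the 2 added linear layers
learn = ConvLearner.pretrained(arch, data, xtra_fc=[512], ps=[0.25, 0.5], precompute=True)
learn   # printing the learner shows the added BatchNorm / Dropout / Linear / ReLU blocks
```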
  • overfitting
    ** bigger models (deeper), bigger dropouts, as deeper models tend to overfit more (specialization)
    ** overfitting ? the training loss is much lower than the validation loss
    ** a small amount of overfitting is not a problem, but you need to keep an eye on your validation loss and get the smallest number for it
  • Jupyter notebook : Rossman Store
    ** 2 kinds of columns : categorical (a number of levels) or continuous (a number attached)
    ** no feature engineering
    ** a categorical variable tells the neural network that each level (cardinality) means something totally different
    ** a continuous variable comes with a kind of mathematical function
    ** a continuous variable with a low number of levels (low cardinality) : better to treat it as a categorical variable
    ** DayOfWeek is a number, but it is better to treat it as a categorical variable since each day has a different meaning in relation to the context
    ** so, categorical data stays categorical, but for continuous data you have to decide
    ** a continuous variable holds floating-point numbers
    ** one remark : you can group the values of a continuous variable into intervals (bins) and then turn it into a categorical variable (better or not ? there is a paper on that) - see the pandas sketch below
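A small pandas sketch of that categorical / continuous decision (the column names are examples taken from the Rossmann data):

```python
import pandas as pd

df = pd.DataFrame({
    'DayOfWeek': [1, 2, 3, 7],                                 # low-cardinality number -> categorical
    'CompetitionDistance': [570.0, 14130.0, 620.0, 29910.0],   # real measurement -> continuous
})

# categorical column : each level will get its own row in an embedding matrix later on
df['DayOfWeek'] = df['DayOfWeek'].astype('category')

# continuous column : kept as float32 and fed (after normalization) straight into the network
df['CompetitionDistance'] = df['CompetitionDistance'].astype('float32')

print(df.dtypes)
```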
  • if pytorch finds a new category in the dataset that was not defined, it automatically adds it as an unknown category
  • pandas + Machine Learning course
  • clean the data and do some feature engineering
  • then, convert columns to categorical or continuous variables (float32)
  • idxs = get_cv_idxs() to get a sample (validation set)
  • df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True) : splits off the dependent variable ('Sales') and normalizes the continuous inputs (do_scale=True)
    ** Neural Net likes to have input data with mean = 0 and standard deviation = 1
    ** mapper keeps track of the normalization (and category mappings) so that exactly the same transformation can be applied to the test set later (for example it stored mappings such as 2014 → 2, a → 1, etc.) - see the sketch below
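The “mean 0, standard deviation 1” idea and the role of the mapper can be sketched with plain scikit-learn (this is not the proc_df internals, just the same principle):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_x = np.array([[100.], [200.], [300.], [400.]], dtype=np.float32)
test_x  = np.array([[150.], [350.]], dtype=np.float32)

scaler = StandardScaler().fit(train_x)    # learn the mean/std on the training set only
train_scaled = scaler.transform(train_x)  # now roughly mean 0, standard deviation 1
test_scaled  = scaler.transform(test_x)   # the SAME statistics are reused for the test set

print(train_scaled.mean(), train_scaled.std())
```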
  • validation set : read the fastai blog post about that
  • what is the metric ? here, RMSPE (Root Mean Square Percentage Error)
  • check the log rules : log(a/b) = log(a) - log(b) (see the sketch below)
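A quick numpy sketch of the metric, and of why predicting log(sales) is convenient (using log(a/b) = log(a) - log(b), an RMSE on log(sales) behaves approximately like the percentage error):

```python
import numpy as np

def rmspe(y_pred, y_true):
    """Root Mean Square Percentage Error, the Rossmann competition metric."""
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

y_true = np.array([200., 300., 500.])
y_pred = np.array([210., 290., 520.])
print(rmspe(y_pred, y_true))

# in log space : log(y_pred / y_true) = log(y_pred) - log(y_true),
# so a plain RMSE on log(sales) approximates the percentage error above
print(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))
```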
  • md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128) : it is important to pass cat_flds to tell our model which variables are categorical
  • m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), ...) : the arguments are the embedding sizes, the number of continuous variables (len(df.columns)-len(cat_vars)), the dropout of the embedding matrix, the output size (we predict a single number : sales), the number of activations in each added layer, the dropout of those added layers, and y_range=y_range (see the sketch below)
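Putting the two calls together, a hedged sketch of the Rossmann learner as shown in the lesson notebook (fastai 0.7 API ; the dropout values and layer sizes are the ones quoted in class and may differ in your version, and the variables come from the notebook's earlier cells):

```python
# fastai 0.7 structured-data API, as used in the Rossmann notebook
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128)

m = md.get_learner(emb_szs,                         # one (cardinality, embedding size) pair per categorical variable
                   len(df.columns) - len(cat_vars), # number of continuous variables
                   0.04,                            # dropout on the embedding matrix
                   1,                               # output size : we predict a single number (sales)
                   [1000, 500],                     # activations of the two added linear layers
                   [0.001, 0.01],                   # dropout of those added layers
                   y_range=y_range)
```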
  • a rank 1 tensor = a vector
  • embedding matrix
    ** Jeremy draws a NN with a vector of continuous data as input, but what is the situation with categorical variables ? in fact, we create an embedding matrix for each categorical variable
    ** It means that these embedding matrices will be the first layer of our NN : they are another bunch of weights of our NN
    ** the embedding weights mean something, but it is the training that will reveal what
    ** size of each embedding vector ? take the cardinality, divide it by 2, and cap it at 50 (see the sketch after this list)
    ** 1-hot encoding : the problem is that a categorical variable is not a linear concept
    ** an embedding vector is a distributed representation (a rich, high-dimensional concept)
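The “cardinality divided by 2, capped at 50” rule of thumb fits in one line ; this is roughly what the Rossmann notebook does (cat_vars and joined_samp are assumed to exist already):

```python
# one (cardinality, embedding size) pair per categorical variable
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]   # +1 for the unknown category
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# e.g. DayOfWeek has 7 levels (+1 for unknown) -> an embedding vector of size 4
```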
  • when a categorical variable has a high cardinality, you should package it
  • all modern DL libraries take an index and give back the corresponding embedding vector ; this is equivalent to a matrix product between a 1-hot encoded vector and the embedding matrix, just much cheaper (see the sketch below)
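A tiny PyTorch check of that equivalence : looking up row i of an embedding matrix gives the same vector as multiplying a one-hot vector by that matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=7, embedding_dim=4)   # e.g. DayOfWeek -> 4-dim vectors

idx = torch.tensor([2])                                  # "Wednesday" as an index
by_lookup = emb(idx)                                     # what DL libraries actually do

one_hot = F.one_hot(idx, num_classes=7).float()          # [0, 0, 1, 0, 0, 0, 0]
by_matmul = one_hot @ emb.weight                         # one-hot vector times the embedding matrix

print(torch.allclose(by_lookup, by_matmul))              # True
```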
  • add_datepart() : creates time-series columns (Year, Month, Day, DayOfWeek, ...) from the name of a date column (see the sketch below)
  • afterwards, you can run your NN just like for the DogsCats one
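A hedged example of the call, assuming fastai 0.7's structured module (the DataFrame is a toy one):

```python
import pandas as pd
from fastai.structured import add_datepart   # fastai 0.7 helper

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01']), 'Sales': [5263, 0]})
add_datepart(df, 'Date')   # adds Year, Month, Week, Day, Dayofweek, Is_month_end, ... in place
print(df.columns)
```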
  • Pinterest : when they switched from gradient boosting machines to Deep Learning, they did much less feature engineering : it is one of the biggest benefits of using DL
  • data augmentation for structured data : Jeremy does not know of a technique
  • what is the downside of using Deep Learning on structured data (a structured Deep Learning model) ? few people work on it and, before fastai, there was no easy way to code it (this class is the first time it is taught)
    ** Pinterest has a O’Reilly video on that
    ** 2 academic papers : one from Yoshua Bengio's team (taxi destination forecasting) and one about the Kaggle Rossmann competition
  • Natural Language Processing
    ** NLP (Natural Language Processing) is the most up-and-coming area
    ** it is 2 or 3 years behind computer vision in Deep Learning
    ** it is kind of the second area that DL started to become popular in
    ** DL in computer vision reached state-of-the-art in 2014/2015
    ** NLP is still in a state where the concepts and the use of DL are less mature than in computer vision
    ** in the last few months, models and ideas used in computer vision have started to be tested in NLP. Therefore, it is quite new
  • Language modeling (or language model)
    ** We are going to talk about a particular problem in NLP : language modeling
    ** language modeling : build a model that, given a few words of a sentence, can predict what the next word is going to be (SwiftKey on mobile, for example)
    ** arxiv.org : very popular site on DL papers
    ** Jeremy downloaded 18 months of academic papers from this site (topics, titles and summaries)
    ** Then, he built an algorithm to predict the rest of a phrase from its first words
    ** the algorithm did not know English : it starts with an embedding matrix for each word
    ** csn = computer science networking
    ** creating a language model has many consequences : the model learns to write correct English, it learns, for example, how to put things into parentheses
    ** we are creating a Language Model because we want a pretrained model to use on the IMDB movie reviews to detect positive and negative reviews (directly creating a model to detect a positive or negative review does not work well)
    ** Why ? for several reasons :
    ** 1) fine-tuning a pretrained network works very well : the network has detected many characteristics of a “world” (it has “understood” that world) and then we can reuse it for another related task
    ** 2) IMDB movie reviews are big (up to 1000 words long) ; after reading 1000 words, knowing nothing about how English is structured or even what the concept of a word or punctuation is, trying to output a positive or negative sentiment (1 or 0) is just too much to expect
  • Usually a Language Model works at the word level
    ** you can create your own American Novel generator :slight_smile:
    ** we focus here on text classification, which is very powerful (a hedge fund classifying news that has an impact, customer service detecting people who will cancel their contracts in the next month)
  • lesson4-imdb.ipynb
    ** new things we import : torchtext (NLP library)
    ** after the library imports and the PATH setup, we tokenize the training dataset (words, punctuation) using spacy (the best tokenizer)
    ** ' '.join(spacy_tok(review[0])) : puts a space between 2 tokens
    ** with torchtext, we are going to create a torchtext field : it describes how to pre-process a piece of text, like making everything lowercase and tokenizing it : TEXT = data.Field(lower=True, tokenize=spacy_tok) (see the sketch below)
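A sketch of the tokenizer and the field, assuming the legacy torchtext data.Field API and the spacy English model names in use at the time of the course:

```python
import spacy
from torchtext import data        # legacy torchtext API used in the course

spacy_en = spacy.load('en')       # English model name used back then ('en_core_web_sm' nowadays)

def spacy_tok(x):
    """Split a raw review into a list of tokens (words and punctuation)."""
    return [tok.text for tok in spacy_en.tokenizer(x)]

TEXT = data.Field(lower=True, tokenize=spacy_tok)   # lowercase everything, tokenize with spacy
```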
  • We can now create our model :
    ** FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
    ** md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
  • min_freq : we will replace all words by indexes (each word has a unique index), but if a word occurs less than 10 times, we just call it unknown
    ** bptt = backprop through time : it defines how long a piece of a sentence will stick on the GPU at once (we break sentences into chunks of bptt tokens or less)
    ** Then, we add a very important attribute to our torchtext field : TEXT.vocab
    ** this is the vocabulary : the list of unique words in the corpus ordered by frequency, together with the mapping to their indexes (see the sketch below)
    ** context is very important in NLP !
    ** note : bag of words is no longer useful because it does not keep the order between words
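Once md has been built, TEXT.vocab exposes both directions of that mapping ; a short sketch using the legacy torchtext vocab attributes:

```python
# TEXT.vocab is filled in while LanguageModelData scans the corpus
print(TEXT.vocab.itos[:12])     # "int to string" : unique tokens, most frequent first
print(TEXT.vocab.stoi['the'])   # "string to int" : the index that will feed the embedding matrix
print(len(TEXT.vocab))          # vocabulary size (words seen fewer than min_freq times become '<unk>')
```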
  • bptt
    ** we concatenated all the reviews into one long text. Then, we split it into 64 segments (for a batch size of 64) and lay them out as the 64 columns of a matrix. Then, we grab chunks of about bptt tokens (70 tokens = 70 rows) of this matrix and give them to our GPU (see the numpy sketch after this list).
    ** a batch in NLP is therefore 64 segments, each a sequence of length bptt (70 here)
    ** next(iter(md.trn_dl)) gives you a batch like the ones used by your GPU
    ** warning : torchtext randomly changes the bptt size around 70 for each batch
    ** the first column represents the first 75 words of the first of the 64 segments (here bptt happened to be 75)
    ** The second column represents the first 75 words of the second of the 64 segments (you would have to go something like 10 million words into the corpus to find where that one starts)
    ** The last column is the first 75 words of the last of those 64 segments
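The batching described above, sketched with plain numpy on a fake token stream (bs=64, bptt=70 as in the lesson):

```python
import numpy as np

tokens = np.arange(64 * 7000)                # pretend this is the whole concatenated corpus, as token indexes
bs, bptt = 64, 70

n = len(tokens) // bs                        # length of each of the 64 segments
batched = tokens[:n * bs].reshape(bs, n).T   # shape (n, 64) : one segment per column

first_minibatch = batched[:bptt]             # about bptt rows x 64 columns go to the GPU at once
second_minibatch = batched[bptt:2 * bptt]    # the next chunk continues each segment where it left off
print(batched.shape, first_minibatch.shape)
```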
  • Now, we have to create an embedding matrix for the words of our corpus (each word is treated like the level of a categorical variable)
    ** len(md.trn_dl) : number of blocks
    ** md.nt : number of tokens
    ** len(md.trn_ds) : length of the dataset (1 here, because we put all the reviews in 1 file, which is our corpus)
    ** len(md.trn_ds[0].text) : number of words in our corpus
    ** we define em_sz (the size of our word embeddings : 200), and 3 layers with 500 activations per layer
  • use the Adam optimizer with opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
  • recent language models : we use the AWD LSTM from Stephen Merity (important work about dropout in RNNs)
  • another way to avoid overfitting : learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
  • Gradient clipping : learner.clip=0.3 (it clips the gradients so that an update cannot be too large ; it does not cap the learning rate itself) - see the sketch below
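A hedged sketch of how these pieces were combined in the notebook (fastai 0.7 ; the hyper-parameter names and values are quoted from memory and the variables come from the notebook's earlier cells, so treat them as illustrative):

```python
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))   # Adam with a lower beta1, as recommended here

# AWD-LSTM style language model : em_sz=200, nh=500 activations, nl=3 layers,
# with the several kinds of dropout described in Stephen Merity's paper
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)

learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)   # activation regularization (another guard against overfitting)
learner.clip = 0.3                                       # clip the gradients, not the learning rate
```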
  • a technique to use a pretrained model for sentiment analysis that is more powerful than word2vec
  • Here, we use a Recurrent Neural Network with LSTM (Long Short-Term Memory)
  • NLP researchers :
    ** James Bradbury
    ** Sebastian Ruder
  • collaborative filtering (see lesson 5)