Deep Learning Brasília - Lesson 4

<<< Post: Lesson 3 | Post: Review >>>

This lesson presents the application of DL to time series, how to generate embeddings, and how to use dropout for regularization and to avoid overfitting. The second part focuses on introducing the fastai library for NLP; more details will be studied in lesson 5. As a practical exercise we will use PubMed data to classify scientific articles based on their abstracts.

Agenda:
09:00-10:40 Lesson 4
10:40-10:50 Break
10:50-12:00 Hands-on activity

Video outline: based on the post (Wiki: Lesson 4)

Intro
*Part 1: 00:00:04 - 00:23:45
General discussion, presentation of papers of interest and questions from the previous lesson. For the paper links, see @rachel's post.
In this part we will only discuss the questions about dropout: 00:18:04 “What kind of ‘p’ to use for Dropout as default”, overfitting, underfitting, ‘xtra_fc’

*Part 2: 00:25:04 - 01:07:10
Structured data and time series in the fastai library are presented, along with the library's helper functions for dealing with recurring issues in these kinds of data. The main part is the discussion of what embeddings are for categorical data.

*Part 3: 01:23:30 - 01:43:45
The initial concepts of NLP using fastai and torchtext, without going deep into the details. More will be explained in lesson 5.

Challenge:
Despite all the attention given to Kaggle competitions, in this lesson we are going to burn a few neurons to understand the basic concepts of vocabulary and to try a multilabel classification with fastai. The challenge is based on the 2017 Spark Summit talk: https://databricks.com/session/natural-language-understanding-at-scale-with-spark-native-nlp-spark-ml-and-tensorflow. The idea is to take 1,000 PubMed documents from 6 different categories and try to build a network to predict their category.
Our baseline is 99% accuracy.
Part 1: Use the pre-trained IMDB vectors to try to classify the documents. What is the impact of using a pre-trained vector?
Part 2: Build the model/vocabulary on top of the downloaded data. What size should the new vocabulary have?
Part 3: Run the 6 fastai steps to classify the texts.

The notebook with the data download will be posted tomorrow as an update to this post.

Important
The IMDB dataset generates very large sparse vectors and the training part is quite slow. @vikbehal saved the model-creation step and the post-fine-tuning step for sentiment classification, so during class, instead of running each step, we will load the provided models.


Good morning, everyone.

The week has been busy, but let's go. One of the questions from Tuesday's class was about the linear layer that PyTorch uses. I couldn't explain it at the time, so I looked at the PyTorch documentation.
Here is what it says:
“Applies a linear transformation to the incoming data: y=Ax+b”

The documentation has a small example to examine the result of this operation:
import torch
import torch.nn as nn

m = nn.Linear(20, 30)          # linear layer mapping 20 input features to 30 outputs
input = torch.randn(128, 20)   # a batch of 128 samples with 20 features each
output = m(input)
print(output.size())           # torch.Size([128, 30])

To understand it better, I looked at a few posts and found one that makes it all clear:
http://blog.mmast.net/linear-transformations-numpy
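
To make that concrete, here is a minimal sketch (my own, not from the lesson) checking that nn.Linear really is just a matrix multiplication plus a bias:

import torch
import torch.nn as nn

m = nn.Linear(20, 30)                 # weight A has shape (30, 20), bias b has shape (30,)
x = torch.randn(128, 20)

manual = x @ m.weight.t() + m.bias    # the "linear transformation" done by hand
print(torch.allclose(m(x), manual))   # True
print(manual.size())                  # torch.Size([128, 30])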

Some links :

To understand dropout “visually” : Dropout amount in the classifier (see also the sketch below)

The relevant part of the video : https://youtu.be/gbceqO8PpBg?t=5m4s
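
To see dropout in code as well, here is a tiny sketch (my own, not from the lesson) showing what nn.Dropout(p=0.5) does in training versus evaluation mode:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # p = probability of zeroing each activation
x = torch.ones(10)

drop.train()               # training mode: dropout is active
print(drop(x))             # roughly half the values become 0, the survivors become 2.0 (scaled by 1/(1-p))

drop.eval()                # evaluation/validation mode: dropout is turned off
print(drop(x))             # all ones again, which is why early validation loss can look better than training loss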

Presentation by Fernando Melo :
A test of the Fast.Ai lesson 4 NLP language model, using twitter/blog/newspaper texts from the company SwiftKey, to predict the next word.

Here are the links:
Exploratory data analysis: https://rpubs.com/flbmelo/347476
Jupyter notebook fastai: https://github.com/Nandobr/Deep-Learning-Brasilia/blob/master/lesson4-swiftkey.ipynb

@jeremy runs the jupyter notebook lesson4-imdb.ipynb during lesson 4.

This notebook is about “language modeling” with the IMDB Large Movie Review dataset.
Download the dataset into the data folder from this link using wget link, but how do you unzip the downloaded aclImdb.tgz file?

In a terminal, go to the data folder and type the following command : tar -xvzf aclImdb.tgz

If you have a problem running spacy_tok = spacy.load('en'), you probably do not have the spacy en model installed (or you have a symlink problem). You can check by running the following code :

import spacy
spacy.load('en')

If this fails, then open a terminal, run python -m spacy download en and, after a few minutes, you will have the model and it will work.

Note : I did this in an Anaconda Prompt terminal (I use my Nvidia GPU with Windows 10) and there were 2 problems. Here are the solutions :

  1. I ran python -m spacy download en and got an error message : python.exe: No module named spacy. The reason was that my fastai “virtual environment” was not activated. So I ran activate fastai and it worked.

  2. At the end of the installation, the following message appeared : Error: Couldn't link model to 'en'. Creating a symlink in spacy/data failed. So I created the en symlink as follows in an Anaconda Prompt terminal :

    1. I went into the spacy folder of my fastai “virtual environment” :
      cd Anaconda3/envs/fastai/Lib/site-packages/spacy/data
    2. I created the symlink : mklink /d en ..\..\en_core_web_sm
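
Once spacy.load('en') works, a quick sanity check of the tokenization looks like this (the sample sentence below is made up):

import spacy

nlp = spacy.load('en')
doc = nlp("This movie wasn't bad at all!")
print([t.text for t in doc])   # ['This', 'movie', 'was', "n't", 'bad', 'at', 'all', '!']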

:slight_smile:

Where to download data for these lessons 4 & 5 (Arxiv, Wikipedia, etc)?

Read this thread :slight_smile:

Check your understanding of lesson 4

<<< Check your understanding of lesson 3 | Check your understanding of lesson 5 >>>

Hi everyone,

I watched the lesson 4 video (part 1) again to improve my understanding of it and took notes on the vocabulary used by @jeremy.

Let's play a little ! Agreed ? :wink:
Can you give a definition / a URL / an explanation for all the terms and expressions below?

If so, you have perfectly understood the fourth lesson! :sunglasses::sunglasses::sunglasses:

PS: if you don't want to test yourself, or if you want to check your answers, go to the post “Deep Learning 2: Part 1 Lesson 4” on @hiromi's blog : great work !!! :slight_smile:

  • blog posts from fastai students mainly about how to get the best learning rate
    ** learning rate
    ** Stochastic Gradient Descent with Restart (SGDR)
    ** Differential Learning Rate
    ** (from @bushaev) the difficulty in minimizing the loss arises from saddle points rather than poor local minima
  • computer vision
  • lesson 4 is about 3 concepts :
    ** structured (data) learning (financial, …)
    ** NLP : Natural Language Processing
    ** Collaborative filtering
  • jupyter notebook : we used dropout in DogBreeds Kaggle competition
  • an activation is a number which has been calculated
  • kernel or filter : same thing
  • a fully connected layer (Linear layer) does a simple matrix multiplication
  • type ‘learn’ in your jupyter notebook to get the last added layers of your learn object (your new model built from the pretrained model, which is here a ConvNet) : the fastai library adds by default 2 linear layers (fully connected layers) with the following structure for each one : BatchNormalization, Dropout, Linear, ReLU (and the last layer gives logSoftMax) (see the sketch after this list)
  • dropout
    ** ps = 0.5 means delete each activation with a probability of 50% (at random)
    ** therefore, at each mini batch, we throw away different activations
    ** what does dropout do ? it forces the network to generalize better (it forces the network to learn the main characteristics) and thus avoids overfitting (specialization of your network)
    ** dropout was created around 2013 by Geoffrey Hinton and his team (inspired by how the brain works) : it solves the problem of generalization, but why was dropout inspired by the brain ? (answer : https://www.quora.com/Machine-Learning-What-does-dropout-in-the-brain-works-with-respect-to-all-inputs-while-dropout-in-a-convolutional-network-works-with-respect-to-each-single-unit-mean)
    ** a high ps value helps generalize well but decreases your training accuracy
    ** low ps value will generalize less well but will give you a better training accuracy
    ** why, in early training, is my validation loss better (lower) than my training loss ? The reason is that we turn off dropout when we look at the validation set.
    ** when we set ps=0.5, pytorch deletes an activation with a probability of 50% and doubles all the others : then, the average activation does not change
    ** ps=0.5 adds dropout to all the added last layers, but not in the first layers (the pretrained layers)
    ** ps=0 removes dropout and then, in the first epoch, my validation accuracy is better (higher) than with a non-zero ps, but by the third epoch it is the contrary : dropout helped generalization
    ** ps=0 : massive overfitting from the third epoch !
    ** if you don’t pass a ps value to the learn object, there is a default ps value for each added last layer !
    ** ps=[p1,p2] lets you set a different dropout for each added layer (idea : the last dropout, the one before the last Linear layer, can be higher than the others)
    ** the dropout adds noise in fact
  • add layers
    ** by default, the fastai learn object adds 2 Linear layers at the end of the network, but you can change the number of last Linear layers and the number of activations of each one by using xtra_fc=[x1,x2,...]
    ** therefore, an empty list xtra_fc=[] creates just one Linear last layer
  • overfitting
    ** bigger (deeper) models need bigger dropouts, as deeper models tend to overfit more (specialization)
    ** overfitting ? the training loss is much lower than the validation loss
    ** a small amount of overfitting is not a problem, but you need to look at your validation loss and aim for the smallest value of it
  • Jupyter notebook : Rossman Store
    ** 2 kinds of column : categorical (number of levels) or continuous (number attached)
    ** no feature engineering
    ** a categorical variable tells the neural network that each level (cardinality) means something totally different
    ** a continuous variable comes with a kind of mathematical function
    ** a continuous variable with a low number of levels (cardinality) : better to treat it as a categorical variable
    ** DayOfWeek is a number, but it is better to treat it as a categorical variable, as each day has a different meaning depending on the context
    ** so, a categorical variable stays categorical, but for a continuous variable you have to decide
    ** a continuous variable has floating-point values
    ** one remark : you can bin the values of a continuous variable into intervals and then turn it into a categorical variable (better or not ? there is a paper on that)
  • if pytorch finds in the dataset a new category (level) that was not defined, it automatically adds it as an unknown category
  • pandas + Machine Learning course
  • clean data and make type of feature engineering
  • then, convert columns to categorical or continuous variables (float32)
  • idxs = get_cv_idxs() to get a sample (validation set)
  • df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True) : creates the dependent variable with normalization (do_scale=True)
    ** Neural Net likes to have input data with mean = 0 and standard deviation = 1
    ** mapper keeps the normalization process so that the same one can be applied to the test set later (it stores the mappings, for example from 2014 to 2, from 'a' to 1, etc.)
  • validation set : read the fastai blog post about that
  • what is the metric ? here, RMSPE (Root Mean Square Percentage Error)
  • check the log rules : log (a/b) = log(a) - log(b)
  • md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128) : it is important to give cat_flds to tell our model which variables are categorical
  • m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars) (number of continuous variables), dropout of the embedding matrix, prediction of a single number (sales), how many activations in each last layer, dropout at last layers, y_range=y_range)
  • a rank-1 tensor = a vector
  • embedding matrix
    ** Jeremy draws a NN with a vector of continuous data in input but what is the situation with categorical variables ? in fact we create an embedding matrix for each categorical variable
    ** It means that this embedding matrix will be the first layer of our NN : this is another bunch of weights of our NN
    ** the embedding weights mean something but this is the training that will show that
    ** size of each embedding vector ? take the cardinality, divide it by 2, but not bigger than 50 (see the sketch after this list)
    ** one-hot encoding : the problem is that a categorical variable is not a linear concept
    ** an embedding vector is a distributed representation (a rich, high-dimensional concept)
  • when a categorical variable has a high cardinality, you should package it
  • all modern DL libraries take an index and give back the embedding vector, but this is equivalent to a matrix product between a one-hot encoded vector and the embedding matrix
  • add_datepart() : creates time-series columns (such as DayOfWeek) from a date column
  • after, you can run your NN like for the DogsCats one
  • Pinterest : when they switched from gradient boosting machines to Deep Learning, they did much less feature engineering : it is one of the biggest benefits of using DL
  • data augmentation in structured data : Jeremy does not know of a technique
  • what is the downside of using Deep Learning on structured data (structured Deep Learning model) ? few people work on it and before fastai, there was no easy way to code it (this class is the first time)
    ** Pinterest has a O’Reilly video on that
    ** 2 academic papers : one from Yoshua Bengio (taxi destination forecasting) and one about the Kaggle Rossmann competition
  • Natural Language Processing
    ** NLP (Natural Language Processing) is the most up-and-coming area
    ** it is 2 or 3 years behind computer vision in Deep Learning
    ** it is kind of the second area that DL started to be popular in
    ** DL in computer vision reached state-of-the-art in 2014/2015
    ** NLP is still in a state where the concepts and the use of DL are less mature than computer vision
    ** in the last few months, models and ideas used in computer vision started to be tested in NLP. Therefore, it is quite new
  • Language modeling (or language model)
    ** We are going to talk about a particular problem in NLP : language modeling
    ** language modeling : build a model where, given a few words of a sentence, you can predict what the next word is going to be (SwiftKey on mobile, for example)
    ** arxiv.org : very popular site on DL papers
    ** Jeremy downloaded 18 months of academic papers from this site (topics, titles and summaries)
    ** Then, he built an algorithm to predict the rest of a phrase from the first words
    ** the algorithm did not know English : it starts with an embedding matrix for each word
    ** csn = computer science networking
    ** creating a language model has many consequences : the model learns to write correct English, learns how to use parentheses and quotes correctly
    ** we are creating a language model because we want to create a pretrained model and use it on the IMDB movie reviews to detect positive and negative sentiment (it does not work to directly create a model to detect a positive or negative review)
    ** Why ? because of many reasons :
    ** 1) fine tuning of a pretrained network works very well : the network has detected many characteristics of a “world” (it “understood” that world) and then we can use it for another related task
    ** 2) IMDB movie reviews are big (1000 words long), so after reading 1000 words, knowing nothing about how English is structured or even what the concept of a word or punctuation is, trying to output a positive or negative sentiment (1 or 0) is just too much to expect
  • Usually a language model works at the word level
    ** you can create your American Novel generator :slight_smile:
    ** we focus here on text classification which is very powerful (hedge fund to classify news that have impacts, customer service to detect people who will cancel their contracts in the next month)
  • lesson4-imdb.ipynb
    ** new things we import : torchtext (NLP library)
    ** after library importation and PATH setup, we tokenize the training dataset (words, punctuations) using spacy (the best tokenizer)
    ** ' '.join(spacy_tok(review[0])) : puts a space between 2 tokens
    ** with torchtext, we are going to create a torchtext field : how to pre-process a piece of text, like making everything lowercase and tokenizing it : TEXT = data.Field(lower=True, tokenize=spacy_tok)
  • We can now create our model :
    ** FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
    ** md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
  • min_freq : we will replace all words by indexes (each word has a unique index), but if a word occurs fewer than 10 times, just call it unknown
    ** bptt = backprop through time : it is where we define how much of a sentence goes onto the GPU at once (we will break sentences into chunks of bptt tokens or less)
    ** Then, we add a very important attribute to our torchtext field : TEXT.vocab
    ** this is the vocabulary : the list of unique words in the corpus ranked by frequency, or the list of their indexes
    ** context is very important in NLP !
    ** note : bag of words is no longer useful because it does not keep the order of the words
  • bptt
    ** we concatenate all reviews into one file. Then, we split it into 64 segments (for example). Then, we move these 64 segments into columns. Then, we grab a chunk of bptt rows (70 tokens = 70 rows) of this matrix and give it to our GPU.
    ** a batch in NLP is made of 64 segments and each segment is a sequence of length bptt (70 here)
    ** next(iter(md.trn_dl)) gives you a batch like the ones used by your GPU
    ** warning : randomly, torchtext changes the bptt size around 70 for each batch
    ** the first column represents the first 75 words of the first review (bptt = 75)
    ** The second column represents the first 75 words of the second of the 64 segments (you have to go something like 10 million words into the corpus to find that one)
    ** The last column is the first 75 words of the last of those 64 segments
  • Now, we have to create an embedding matrix for each word of our corpus (a word is a categorical variable)
    ** len(md.trn_dl) : number of blocks
    ** md.nt : number of tokens
    ** len(md.trn_ds) : length of the dataset (1 here, because we put all reviews in 1 file, which is our corpus)
    ** len(md.trn_ds[0].text) : number of words in our corpus
    ** we define em_sz (size of our embeddings : 200), 3 layers with 500 activations per layer
  • put Adam optimizer with opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
  • recent language model : we use the AWD LSTM from Stephen Merity (important work on dropout)
  • another way to avoid overfitting : learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
  • Gradient clipping : learner.clip=0.3 (prevents the gradient update step from being larger than 0.3)
  • a technique to use a pretrained model for sentiment analysis that is more powerful than word2vec
  • Here, we use a Recurrent Neural Network using LSTM (long short term memory)
  • NLP researchers :
    ** James Bradbury
    ** Sebastian Ruder
  • collaborative filtering (see lesson 5)
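
As mentioned in the note about the layers fastai adds by default, here is a rough, hypothetical stand-in for that head written in plain PyTorch (the sizes 1024/512 and the ps values are made up for illustration; this is not the actual fastai code):

import torch
import torch.nn as nn

num_classes = 10
head = nn.Sequential(
    nn.BatchNorm1d(1024),
    nn.Dropout(p=0.25),            # ps=[0.25, 0.5] style: a different dropout per added layer
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.BatchNorm1d(512),
    nn.Dropout(p=0.5),
    nn.Linear(512, num_classes),
    nn.LogSoftmax(dim=1),          # the last layer gives logSoftMax
)

x = torch.randn(8, 1024)           # a fake batch of 8 feature vectors from the conv backbone
print(head(x).shape)               # torch.Size([8, 10])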
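
And here is the embedding-size rule of thumb from the notes above as a self-contained sketch (the toy DataFrame, its column names and the +1 for an unknown category are my own illustration, not data from the lesson):

import pandas as pd

df = pd.DataFrame({
    'Store': [1, 2, 3, 1, 2],
    'DayOfWeek': [1, 2, 3, 4, 5],
    'StateHoliday': ['0', 'a', '0', 'b', '0'],
})
cat_vars = ['Store', 'DayOfWeek', 'StateHoliday']

for c in cat_vars:
    df[c] = df[c].astype('category').cat.as_ordered()

# cardinality + 1 leaves room for an "unknown" category, then divide by 2 and cap at 50
cat_sz = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)   # [('Store', 2), ('DayOfWeek', 3), ('StateHoliday', 2)]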

64 batches, and not one batch of size 64

An interesting thread to understand why the sentences of the IMDB dataset (notebook lesson4-imdb.ipynb) are vertical and not horizontal in the language model data matrix : Batch and bptt in language model data matrix
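
A small numeric sketch of that layout (the token ids are fake; bs=64 and bptt=70 as in the lesson notes):

import torch

bs, bptt = 64, 70
corpus = torch.arange(64 * 700)              # pretend this is the whole concatenated corpus as token ids

n = corpus.size(0) // bs                     # tokens per segment/column
matrix = corpus[: n * bs].view(bs, -1).t()   # shape (n, bs): each column is one contiguous chunk of text
first_batch = matrix[:bptt]                  # about bptt rows x 64 columns are handed to the GPU at a time

print(matrix.shape, first_batch.shape)       # torch.Size([700, 64]) torch.Size([70, 64])
print(matrix[:5, 0])                         # the first tokens of column 0
print(matrix[:5, 1])                         # column 1 starts n (= 700) tokens further into the stream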

Image 1 : https://cdn-images-1.medium.com/max/2000/1*O-Kq1qtgZmrShbKhaN3fTg.png
Image 2 : https://raw.githubusercontent.com/pekoto/fast.ai/master/images/nlp-batch2.jpg