Deep Learning Brasília - Lesson 3


(Saul Campos Berardo) #1

<<< Post: Lesson 2 | Post: Lesson 4 >>>

In this lesson, besides new details about the libraries used in the course, three main topics are covered: visualizing the convolutions in CNNs; multi-label classification; and deep learning on structured data (a topic explored further in the next lessons). Three Kaggle competitions are mentioned during the lesson, and they set up the hands-on part after the break.

Agenda:

09:00-10:40 Lesson 3
10:40-10:50 Break
10:50-12:00 Hands-on activity

Hands-on part:

The proposed activity is to take part in a Kaggle competition using the FastAI library on PaperSpace. The Jupyter Notebook linked below provides a skeleton that can be used as a starting point for entering the Dog Breed Identification competition:
Link to the supporting Jupyter Notebook:

Proposed roadmap:

  1. Create a Kaggle account

  2. Install and get familiar with one of the Kaggle clients (instructions on the respective GitHub pages)
    a. Official client:
    https://github.com/Kaggle/kaggle-api
    b. Unofficial client, used by JH:
    https://github.com/floydwch/kaggle-cli

  3. Choose one of the competitions mentioned in the lesson:
    a. Image classification (dog breeds)
    https://www.kaggle.com/c/dog-breed-identification
    b. Structured data (sales forecasting):
    https://www.kaggle.com/c/favorita-grocery-sales-forecasting
    c. Multi-label classification (satellite images):
    https://www.kaggle.com/c/planet-understanding-the-amazon-from-space
    Note: the training data for this competition weighs about

  4. Download the competition data to PaperSpace or Crestle using one of the command-line clients

  5. Train a model with FastAI

  6. Submit baseline results to the competition

Video summary:

1) Where we are, where we go

2) Review of dog x cat notebook

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=905&end=1202
  • Duration: 04:57
  • Key points:
    • JH shows how to put the training code in compact form (and reviews the code).
    • A quick discussion addresses the confusion around precompute=True: “it only makes things faster” and “you can always skip it”.
    • When using precompute=True, data augmentation doesn't work, because the cached, non-augmented activations are used.
    • Calling learn.bn_freeze(True) causes the “batch normalization moving averages to not be updated” (in the second half of the course we are going to learn why we want to do that; JH says it is something not supported by any other library, but very important). It should be used if the network has more than 49 layers (resnet50 and above) and the dataset is similar to ImageNet.

3) CNN theory

4) Multi-Label classification

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=4824&end=5117
  • Duration: 04:53
  • Key points:
    • Softmax: “wants to pick a thing”. Use softmax only when you want to assign a single class to an image. You can’t use it for multi-label classification.
    • FastAI can do “multi-label classification automatically”. If a CSV associates more than one label with an image, it switches automatically to multi-label mode.
    • In multi-label classification, you can’t use Keras-style data loading (i.e. from folders); you need to use the ImageClassifierData.from_csv approach.
    • In PyTorch there are all 8 possible transforms we can apply to images (“dihedral group”).
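
To make the softmax point above concrete, here is a minimal plain-Python sketch (not fastai code) comparing the two activations on the same raw scores. The scores are made up for illustration: two labels are clearly present and one is not.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (single-label)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Squash each score independently into (0, 1) (multi-label)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical raw outputs for three labels; the first two are both "present".
scores = [3.0, 3.0, -2.0]
single = softmax(scores)   # the two strong labels must share the probability mass
multi = sigmoid(scores)    # each strong label can be close to 1 on its own

print([round(p, 3) for p in single])
print([round(p, 3) for p in multi])
```

With softmax the two strong labels end up just under 0.5 each, because the outputs must sum to 1; with sigmoid both can be above 0.9 at the same time, which is what a multi-label problem needs.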

5) Practical tips for multi-label classification

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=5690&end=6077
  • Duration: 06:27
  • Key points:
    • Planet images are not like ImageNet images.
    • Most ImageNet models are trained with 224x224 images. If we resize to 64x64, we are destroying the pre-trained weights.
    • Start with a small sz parameter, train quickly to reasonable weights, then increase the sz parameter (in powers of 2) up to the original dimensions of the image.
    • Speed-up convenience function: data.resize(int(sz*1.3), 'tmp').
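
The "start small, then grow" tip above can be sketched as a simple size schedule. This is only an illustration in plain Python: size_schedule is a hypothetical helper, and the comment marks where the learn.set_data(get_data(sz)) call from the lesson notebook would go (get_data is assumed to be the helper defined in that notebook).

```python
def size_schedule(start, full):
    """Return image sizes doubling from `start` up to at most `full`."""
    sizes = []
    sz = start
    while sz <= full:
        sizes.append(sz)
        sz *= 2
    return sizes

for sz in size_schedule(64, 256):
    # With fastai, this is where one would switch the data to the new size,
    # e.g. learn.set_data(get_data(sz)), and then fit for a few epochs.
    print(f"train at {sz}x{sz}")
```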

6) Metrics (beyond accuracy)

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=6077&end=6287
  • Duration: 03:30
  • Key points:
    • JH shows an example of the use of a different metric (F2 instead of Accuracy).
    • The confusion matrix can be turned into a score in a lot of different ways.
    • “F beta”, where beta is the weight used to balance false negatives against false positives.
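
As a worked example of the metric above, here is a minimal sketch of F-beta computed directly from confusion-matrix counts, using the standard formula (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The counts are invented for illustration, and this is not the fastai f2 function itself.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
f1 = fbeta(8, 2, 4, beta=1.0)
f2 = fbeta(8, 2, 4, beta=2.0)
print(round(f1, 4), round(f2, 4))
```

Here precision (0.8) is higher than recall (about 0.67), so F2, which penalizes false negatives more heavily, comes out lower than F1.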

7) Unstructured data vs. Structured data

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=7193&end=7561
  • Duration: 06:08
  • Key points:
    • There are two types of data (there is no agreed-upon terminology):
      • Unstructured: audio, images, natural language (all dimensions mean the same thing).
      • Structured: data in tables (each dimension means something different).
    • Structured data is what most of you analyse most of the time.
    • Structured data is largely ignored (but we won’t ignore it, because we are practical people!)
    • There is a Kaggle structured data competition, “Grocery Sales Forecasting” (nobody knows what they are doing).
    • FastAI has a special package for structured data: fastai.structured.

8) Lesson 3 Notebook

  • Link: https://www.youtube.com/embed/9C06ZPF8Uuc?start=7561&end=8197
  • Duration: 10:36
  • Key points:
    • JH used the code of the third-place entry in the competition.
    • JH doesn’t care too much about looking at the data dictionary. Let the data speak.
    • The function pd.read_feather() is good for large datasets.
    • Variables must be split into categorical (one-hot encoding) and numerical.
    • Dear students, try to enter as many Kaggle competitions as possible!!!
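
For the categorical/numerical split mentioned above, here is a minimal plain-Python sketch of one-hot encoding a categorical column. The column values and the one_hot helper are hypothetical, and this is not how fastai.structured handles categories internally; it only shows what the encoding means.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one slot is "hot" per row
        encoded.append(row)
    return categories, encoded

# Hypothetical categorical column (a store type); numerical columns stay as-is.
store_type = ["a", "c", "a", "b"]
categories, encoded = one_hot(store_type)
print(categories)
print(encoded)
```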

(Eric Hans Silva) #2

Hi all,

here is the Kaggle competition notebook integrated with Colab, for anyone interested: https://drive.google.com/file/d/1T8f-PB3jBxthverN3F0x6-u-ipHyaild/view?usp=sharing
If you have any trouble opening it, let me know.
Cheers


(Eric Hans Silva) #3

If the GPU is not enabled by default, just go to the menu Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save


(Lucas O. Souza) #4

Hi all

A correction to the dog breeds challenge file. In the fastai library, the test data is not sorted alphabetically, as the submission file expects.

In the cell that generates the submission file, change it to:

# Create the submission data frame with the probabilities computed by the model

df=pd.DataFrame(
    data=probs,
    columns=d.columns[1:], # Exclude the first column, which is the ID
    index=[f[5:-4] for f in data.test_dl.dataset.fnames] 
)
df.index.name = 'id'

To explain further: I only changed the index line. Instead of taking the index from the sample submission, I took the test file names from the data object, where all the data is loaded (data.test_dl.dataset.fnames). The [5:-4] strips the start of the file path (test/) and the end (.jpg), leaving only the image id, which is the format Kaggle expects.
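
A small illustration of what the [5:-4] slice does, using made-up file names in the same shape as data.test_dl.dataset.fnames. The os.path variant is an alternative that does not depend on the exact prefix length; it is an addition for comparison, not part of the original fix.

```python
import os

# Hypothetical file names, shaped like the entries of data.test_dl.dataset.fnames.
fnames = ["test/abc123.jpg", "test/def456.jpg"]

# The slice from the fix: drop the 5-character "test/" prefix and the 4-character ".jpg" suffix.
ids_slice = [f[5:-4] for f in fnames]

# Alternative that works for any folder name or extension length.
ids_path = [os.path.splitext(os.path.basename(f))[0] for f in fnames]

print(ids_slice)
print(ids_path)
```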


(Lucas O. Souza) #5

Hi all

If you want to use TTA, change the prediction cell to:

log_preds = learn.TTA(is_test=True)
probs = np.mean(np.exp(log_preds[0]), axis=0)

With TTA, log_preds comes back as a tuple whose first entry is the array of log predictions. And since there are several predictions for the same image, you need to take an np.mean.
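
To show what the exp-then-mean step computes, here is a plain-Python sketch with nested lists standing in for the array (the layout [augmentation][image][class] is assumed from the axis=0 mean above). The numbers are invented; average_tta is a hypothetical helper, not a fastai function.

```python
import math

def average_tta(log_preds):
    """Average probabilities over augmentations.

    log_preds[a][i][c] is the log-probability of class c for image i
    under augmentation a; the result is indexed [image][class].
    """
    n_aug = len(log_preds)
    n_img = len(log_preds[0])
    n_cls = len(log_preds[0][0])
    return [
        [sum(math.exp(log_preds[a][i][c]) for a in range(n_aug)) / n_aug
         for c in range(n_cls)]
        for i in range(n_img)
    ]

# Hypothetical log-probabilities: 2 augmentations, 1 image, 2 classes.
log_preds = [
    [[math.log(0.8), math.log(0.2)]],
    [[math.log(0.6), math.log(0.4)]],
]
probs = average_tta(log_preds)
print(probs)
```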


(Lucas O. Souza) #6

Post your result on our leaderboard! https://bit.ly/2IrcRpC


(Hissashi Rocha) #7

@saulberardo @lucasosouza, I noticed that the Breed Identification competition has no validation set, only training and test sets, yet when I call learn.fit it reports a val_loss. So does val_idxs = get_cv_idxs(n) randomly pick some images and set them aside as the validation set? If so, are the images assigned to validation also used for training or not?


(Lucas O. Souza) #8

Hey Hissashi

So does val_idxs = get_cv_idxs(n) randomly pick some images and set them aside as the validation set?

Exactly!

If so, are the images assigned to validation also used for training or not?

They are not. But before submitting you can retrain the model on the full dataset to improve generalization. To do that, create a new data object. For validation, pass a small number of indices (I don't know whether validation can be turned off in fastai, I think not; I saw another thread suggesting creating a ‘validation’ folder with a single image in it). Then change the dataset in the learn object with learn.set_data(new_data).
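
A rough plain-Python sketch of what a random validation split does. This is not fastai's get_cv_idxs implementation: the name random_val_idxs and the val_pct and seed parameters are made up for illustration. The point is only that the validation indices are drawn at random and are disjoint from the training indices.

```python
import random

def random_val_idxs(n, val_pct=0.2, seed=42):
    """Pick int(val_pct * n) distinct indices at random for validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs[: int(val_pct * n)]

n = 100
val_idxs = random_val_idxs(n)
val_set = set(val_idxs)
# Everything not held out for validation is used for training.
train_idxs = [i for i in range(n) if i not in val_set]

print(len(train_idxs), len(val_idxs))
```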

Cheers


#9

Hey there,

Non-Brazilian here, so forgive me for the English :grinning:. I just wanted to share a funding opportunity for applying data science and DL tools to public health in Brazil. You must be Brazilian and based in Brazil to apply.

https://gcgh.grandchallenges.org/challenge/grand-challenges-explorations-brazil-data-science-approaches-improve-maternal-and-child


(Pierre Guillou) #10

Thanks @afrocraft for this information!


#11

You’re welcome!


(Gustavo) #12

Hi all,

To complement Saul's initial post, here are some instructions on how to download the data for the Kaggle “Dog Breed Identification” competition:

1- Install the official Kaggle client in the paperspace terminal
pip install kaggle

2- Create an account on the Kaggle site with its own username and password, that is,
without using credentials from other sites

3- Go to the “My Account” section and choose “Create New API Token”

4- Copy the token text into a new file named “kaggle.json”
in the “/home/paperspace/.kaggle” directory

5- Open the competition page on the Kaggle site (e.g. Dog Breed
Identification)

6- Start downloading one of the data files, just to
accept the competition rules

7- Download the competition data in the terminal
kaggle competitions download -c dog-breed-identification

8- Go to the directory where the files were downloaded
cd /home/paperspace/.kaggle/competitions/dog-breed-identification

9- Unzip the downloaded files
unzip labels.csv.zip
unzip train.zip
unzip test.zip
unzip sample_submission.csv.zip

10- Open Jupyter Notebook and get to work!
jupyter notebook
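
Step 4 can also be done from Python. A minimal sketch, assuming the token file holds the two fields username and key (as in the file Kaggle generates) and that the file should be readable only by its owner; write_kaggle_token is a hypothetical helper, not part of the Kaggle client.

```python
import json
import os
import stat

def write_kaggle_token(directory, username, key):
    """Write kaggle.json into `directory`, readable only by the owner."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as fh:
        json.dump({"username": username, "key": key}, fh)
    # Same effect as `chmod 600 kaggle.json` on Linux/macOS.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

For example, write_kaggle_token("/home/paperspace/.kaggle", "your_user", "your_key") would create the file in the directory used in step 4 (the username and key here are placeholders).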

Cheers,
Gustavo


(Pierre Guillou) #13

The lesson 3 thread: Wiki: Lesson 3

Some important links :slight_smile:


(Pierre Guillou) #14

Photos of the participants studying lesson 3 :slight_smile:


(Pierre Guillou) #15

Check your understanding of lesson 3

<<< Check your understanding of lesson 2 | Check your understanding of lesson 4 >>>

Hi everyone,

I watched the lesson 3 (part 1) video again to improve my understanding of it and took notes on the vocabulary used by @jeremy.

Let's play a little! Are you in? :wink:
Can you give a definition / a URL / an explanation for all the following terms and expressions?

If so, you have understood the third lesson perfectly! :sunglasses::sunglasses::sunglasses:

PS: if you don't want to test yourself, or if you want to check your answers, go to the post “Deep Learning 2: Part 1 Lesson 3” on @hiromi's blog: “great work!!! :slight_smile:”

  • try to teach what you learned by posting in a blog
  • wiki thread in the Fastai forum
  • AWS fastai AMI
  • GitHub
  • Tmux (Ubuntu, macOS)
  • understand why some validation images are not well classified
  • learning rate
  • why is a low learning rate safer but slower for training a NN?
  • why can a high learning rate increase the value of the loss function?
  • learn.lr_find(); learn.sched.plot()
  • batch size
  • SGDR
  • fastai vs pytorch
  • CNN or Convolutional Neural Network
  • Resnet
  • Beginner Fastai forum
  • Kaggle site
  • How to download data from Kaggle: the kaggle-cli script
  • pip install kaggle-cli
  • accept the competition rules on the Kaggle site
  • kg download -u user -p 'password' -c competition
  • How to download images from any site
  • CurlWget as Google Chrome extension
  • symlinks
  • ls -l in a terminal
  • Quick DogsCats
  • fastai.conv_learner
  • tfms, data transformation
  • data object
  • shift + tab
  • test_name="test"
  • learn object
  • precompute=True
  • learn.unfreeze()
  • learn.bn_freeze(True) for deeper NNs (resnet50 and above) on a dataset similar to ImageNet (if you are using a deep network on a dataset very similar to the one it was pretrained on, as ours is with dogs and cats; it causes the batch normalization moving averages not to be updated)
  • batch normalization
  • use TTA to get validation predictions
  • tensorflow, keras // pytorch, fastai
  • mobile applications
  • create a submission file
  • individual prediction
  • http://setosa.io/ev/image-kernels/
  • difference between element-wise product and matrix product?
  • Otavio Good's video: “A visual and intuitive understanding of deep learning”
  • convolutional kernel / filter with a shape of 3 x 3
  • search for edges (left and top)
  • feature maps
  • non linearity, relu
  • max pooling
  • fastai/courses/dl1/excel
  • MNIST database
  • filter to detect top edges
  • we get an activation after the element-wise product with the convolutional filter (summed over the window)
  • an activation is calculated
  • Relu means max(0, value)
  • pytorch stores convolutional filters as a tensor
  • a tensor is an array with more dimensions (additional axis)
  • the size of each hidden layer in a CNN is the number of convolutional filters used to get the feature maps
  • a convolutional kernel has 3 dimensions, and the third one is the number of feature maps in the input hidden layer
  • max pooling: reduces the dimensions by sub-sampling (keeping the max) without overlapping
  • fully connected layer (linear matrix product)
  • but a big CNN gives a big number of weights in the fully connected layers: risk of overfitting!
  • VGG (16 layers): 138 million weights
  • VGG (19 layers): more than 143 million weights
  • in these CNNs, the number of weights in the convolutional filters is about 20 million: the majority of the weights comes from the fully connected layers
  • Resnet and ResNext do not use large fully connected layers
  • the 50-layer ResNet network has about 26 million weight parameters and computes ~16 million activations in the forward pass (https://www.graphcore.ai/posts/why-is-so-much-memory-needed-for-deep-neural-networks)
  • the fully connected layers do a classic matrix product
  • last layer: there is no Relu (so we can have negative values)
  • softmax is an activation function that yields probabilities
  • softmax tends to single out one thing from the others (i.e. with a probability clearly higher than the other ones): its “personality” is to pick a thing (so it is perfect for a single-label classifier)
  • sigmoid is an activation function used for multi-label classifiers because it gives a number between 0 and 1 (which looks like a probability) for each label
  • Relu is an activation function too but it does not get probabilities
  • an activation function is a function applied on activations
  • in Deep Learning, an activation function adds a non-linearity
  • we must know log, exp
  • activation functions have a personality
  • we cannot use softmax for multi-label classification
  • if your objective is to classify multi-label images, you cannot use ImageClassifierData.from_paths because an image cannot be in more than one folder. So you need to use ImageClassifierData.from_csv
  • Good news: the Fastai library will recognize from your csv file when there is more than one label per image (multi-label classification)
  • data.val_ds (ds as in data set in pytorch): gives you a single image (or object) back
  • data.val_dl (dl as in data loader in pytorch): gives you a transformed mini batch
  • in pytorch, to get the next mini batch, we use a generator (iterator): next(iter(data.val_dl))
  • if you know python, you learn pytorch naturally
  • zip takes 2 lists and combines them: list(zip(data.classes,y[0]))
  • one-hot encoded vector
  • CatsDogs and DogsBreed were single-label classification
  • images from the Planet competition are not like the ones used in the Imagenet competition
  • you can change the input image size during training for NNs that have an adaptive pooling layer before the first fully connected layer, like Resnet (but not VGG): learn.set_data(get_data(sz))
  • resize the data (images) before passing them to the data object with data.resize(int(sz*1.3), 'tmp'): speed-up! (faster than resizing directly in the tfms)
  • after dogsbreed, try to run the Planet jupyter notebook
  • metrics other than accuracy: metrics = [f2] (f2 uses fbeta_score), passed to the learn object: learn = ConvLearner.pretrained(arch, data, metrics=metrics)
  • in the Fastai library, everything can be changed
  • sigmoid function is used for logistic regression
  • fastai automatically chooses the softmax or sigmoid activation function
  • when you use a pretrained CNN, the weights of the first layers of your new model are not random, but the ones of the last fully connected layers you added are random. So you need to train these last layers first, before unfreezing and training the whole network. If not, the random weights of the last layers will destroy the weights of the first layers (from the pretrained model)
  • the GPU takes a center crop of size sz from each input image. That’s why it is important to do data augmentation on the input dataset beforehand
  • in the fastai library, there is a concept of layer groups
  • learn.summary()
  • tables of data: structured data
  • audio, images, natural language: unstructured
  • Grocery Sales Forecasting competition in Kaggle
  • Rossmann data
  • from fastai.structured import *
  • from fastai.column_data import *
  • pandas (book: Python for Data Analysis)
  • test = pd.read_csv(f'{PATH}test.csv', parse_dates=['Date'])
  • there is a difference with the DogsCats dataset: we do a lot of preprocessing on this structured data
  • enter Kaggle and do competitions!
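
Several items in the list above (3 x 3 kernel, element-wise product, Relu, max pooling) fit together in a few lines. The following is a minimal plain-Python sketch, not the Excel spreadsheet from the course or PyTorch code: the 6 x 6 image and the kernel values are made up, and the helper names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output is the sum of an element-wise product."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

def relu(fmap):
    """Relu means max(0, value), applied to every activation."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """Keep the max of each non-overlapping 2x2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]

# Hypothetical 6x6 image: a bright horizontal bar (1) on a dark background (0).
image = [[0] * 6, [0] * 6, [1] * 6, [1] * 6, [0] * 6, [0] * 6]
# Kernel that responds positively where the rows below are brighter than the rows above.
edge_kernel = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

activations = relu(conv2d_valid(image, edge_kernel))  # 4x4 feature map
pooled = max_pool_2x2(activations)                    # 2x2 after pooling
print(activations)
print(pooled)
```

The upper edge of the bar produces positive activations, while the lower edge produces negative values that Relu sets to zero; max pooling then halves each dimension of the feature map.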
