Lesson 5: Backpropagation and Hand-Written Neural Networks
Lesson 5 outline
0:00-3:30
* moving downhill into the details behind the scenes
* why do we start with machine vision?
* why do we finish with tabular data and collaborative filtering?
* how this lesson is structured, starting from the most recent notebook
* regularization is the key theme of this lesson and will help you improve your models
Review of the deep learning workflow and backprop
3:20-8:30
* How to understand layers of parameters and activations
    * parameters are what we update; activations are what we calculate from them
    * inputs are a special kind of activation
        * the original inputs to the network
        * activations produced by applying an element-wise function such as ReLU to the previous layer
        * ReLU works remarkably well in practice
* What is the Universal Approximation theorem?
    * multiply a parameter matrix by the input, apply ReLU, and you get the next layer's features
    * stack enough sufficiently large weight matrices (with nonlinearities between them) and you can approximate any function to any level of accuracy
    * this is essentially all the "trick" you need to understand about deep learning
* What is backprop?
    * the name sounds impressive, but
    * in practice it is just: prediction + target -> loss -> gradients -> update the parameters by -lr * gradient (see the sketch after this list)
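A minimal sketch of that loop in plain PyTorch; the linear model, tensor names and learning rate are illustrative, not from the lesson:

```python
import torch

# toy data: 100 samples, 3 features, with a made-up linear target
x = torch.randn(100, 3)
y = x @ torch.tensor([2., -1., 0.5]) + 1.

w = torch.randn(3, requires_grad=True)   # parameters
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for epoch in range(100):
    pred = x @ w + b                     # activations from parameters and inputs
    loss = ((pred - y) ** 2).mean()      # prediction + target -> loss
    loss.backward()                      # loss -> gradients
    with torch.no_grad():
        w -= lr * w.grad                 # update parameters by -lr * gradient
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
```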
How to understand fine-tuning an ImageNet-pretrained model on new classes
8:30-19:51
What exactly does the ResNet classifier do behind the scenes?
How do we replace ResNet’s final weight matrix, which outputs 1,000 ImageNet categories, with new layers suited to our own classes?
What are the remaining hidden layers good at?
- as we go up the layers, the features become more complex
- we want the earlier layers’ weights to stay roughly where they are
- so we freeze them
What does freezing do to the model?
- we don’t backpropagate into the frozen layers
- the model trains faster
- the earlier layers’ weights stay the same
After a while we want to train the rest of the network; how do we do that?
- unfreeze the layers
- earlier layers need almost no updating, so they get a very small learning rate
- middle layers need a slightly higher learning rate for a little more updating
- later layers need a larger learning rate to update even more
- this is called “discriminative learning rates”
How do we use discriminative learning rates in fastai? (see the sketch after this list)
- fit(1, 1e-3): the same learning rate for every layer group
- fit(1, slice(1e-3)): the final layer group gets 1e-3, the earlier groups get 1e-3/3
- fit(1, slice(1e-5, 1e-3)): learning rates spread evenly from 1e-5 for the first group to 1e-3 for the last
- in general, a different learning rate for each layer group
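A sketch of those three calls, assuming a fastai v1 learner named `learn` that was created from an ImageNet-pretrained model; treat it as illustrative rather than the lesson's exact code:

```python
# same learning rate for every layer group
learn.fit(1, 1e-3)

# final layer group gets 1e-3, earlier groups get 1e-3/3
learn.fit(1, slice(1e-3))

# after unfreezing: learning rates spread across layer groups from 1e-5 to 1e-3
learn.unfreeze()
learn.fit(1, slice(1e-5, 1e-3))
```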
Why are embeddings better than one-hot encoding
19:50-36:00
What is an affine function?
- essentially a matrix multiplication followed by an addition
- in a CNN the weights are tied, so “affine function” is a more accurate term than “matrix multiplication”
- affine functions are the most common operation in deep learning
How to use one-hot encodings as input
- a one-hot encoding over 15 users as the user input
- a one-hot encoding over 15 movies as the movie input
How to understand one-hot encoding vs an embedding (see the sketch after this list)
- conventionally, the user weight matrix (embedding) is matrix-multiplied by the user input (a one-hot encoding)
- the output activation is exactly the corresponding row of the user weight matrix (the embedding)
- so the same result can be obtained by an array lookup into the embedding matrix, which is computationally much cheaper
What does it mean that each user embedding corresponds to a user index (and likewise for movies)?
- when the dot product of a user embedding and a movie embedding (the activation) is high, it means
- the user’s embedding features match up with the movie’s embedding features
- the two embeddings describe the same underlying features from their own sides
- the user’s features represent personal taste, which matches against what the movie’s features offer
- these underlying features are called latent factors or latent features
How to deal with a movie that is just bad even though it has appealing latent features? (see the model sketch after this list)
- the solution is to add a bias term for both users and movies
- the user bias captures the user’s overall rating behaviour
- the movie bias captures the movie’s overall quality
- bias matters, which is why neural networks add a bias term to every layer by default
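A small sketch of the one-hot vs. lookup point above (sizes and names are illustrative):

```python
import torch

n_users, n_factors = 15, 5
user_emb = torch.randn(n_users, n_factors)     # user weight (embedding) matrix

user_idx = 3
one_hot = torch.zeros(n_users)
one_hot[user_idx] = 1.

via_matmul = one_hot @ user_emb                # one-hot input times the weight matrix
via_lookup = user_emb[user_idx]                # plain array lookup into the same matrix
assert torch.allclose(via_matmul, via_lookup)  # identical activations; the lookup is far cheaper
```

And a sketch of the dot-product-plus-bias model this section describes, written as a hypothetical PyTorch module (not the fastai source):

```python
import torch
from torch import nn

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.u_w = nn.Embedding(n_users, n_factors)   # user latent factors (tastes)
        self.m_w = nn.Embedding(n_movies, n_factors)  # movie latent factors
        self.u_b = nn.Embedding(n_users, 1)           # user rating behaviour
        self.m_b = nn.Embedding(n_movies, 1)          # overall movie quality

    def forward(self, users, movies):
        dot = (self.u_w(users) * self.m_w(movies)).sum(dim=1, keepdim=True)
        return dot + self.u_b(users) + self.m_b(movies)
```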
Q&A: epochs and affine functions
36:06-38:21
* When we load a pretrained model, can we inspect its activations to see what the layers are good at?
    * yes
* What is an affine function?
    * roughly speaking, a linear function
    * multiply things together and add them up = an affine function
    * an affine function of an affine function is still just an affine function, which is why we need nonlinearities in between
    * affine + ReLU + affine + ReLU + … = a deep neural net (see the sketch below)
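A tiny illustration of that last point: stacking affine layers with ReLUs between them (the layer sizes are arbitrary):

```python
from torch import nn

# two affine layers with no nonlinearity would collapse into a single affine layer;
# the ReLU in between is what makes the stack a (small) deep neural net
net = nn.Sequential(
    nn.Linear(784, 50),  # affine
    nn.ReLU(),           # element-wise nonlinearity
    nn.Linear(50, 10),   # affine
)
```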
Running the full MovieLens dataset with collaborative filtering
38:21-48:02
- Introduction to the MovieLens dataset and which version of it to pick
- How to open and inspect the dataset with pandas
- What does encoding="latin-1" do?
    - the modern standard is Unicode
    - this older dataset uses latin-1
- what about the movie genres in the dataset?
- how to use the merge function to bring the titles into the ratings table?
- how to create a CollabDataBunch and tell it which column is the movie/item column? (see the sketch at the end of this section)
- what about the trick of setting y_range to improve performance?
    - a sigmoid first squashes the output into [0, 1]
    - better still, rescale [0, 1] to [0, 5.5] so that the highest rating of 5 can actually be reached
- What are n_factors (latent factors)?
    - this is matrix factorization
    - n_factors is the width of the embedding matrix
    - Jeremy tried a number of values; 40 has worked best so far
- How to pick the learning rate from the lr_find plot?
    - the paper author’s method
        - find the lowest point
        - then go back by 10x
    - Jeremy’s method
        - find the steepest decline
        - try about 10x either side of it
        - see which works better
- go to LibRec to compare your result with the published benchmarks
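A sketch of the steps above using the fastai v1 collab API; `ratings` is assumed to be the merged DataFrame with user, title and rating columns, and the argument values are just the ones mentioned in these notes:

```python
from fastai.collab import CollabDataBunch, collab_learner

data = CollabDataBunch.from_df(ratings, seed=42, valid_pct=0.1, item_name='title')
learn = collab_learner(data, n_factors=40, y_range=[0, 5.5])

# pick a learning rate: look for the steepest downward slope, not the minimum
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(5, 5e-3)
```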
How to interpret the weights (parameters) of users and movies
48:00-61:00
- What is interesting inside the latent factors (embeddings)?
- How to handle anime fans who love anime so much and rate it so highly that many anime episodes sit in the top 100 of the top 1,000 movies?
    - this kind of effect is captured by the bias
    - it is interesting to look at the bias vector across all movies
- how to use pandas to find the most-rated movies?
    - so we get movies we have hopefully seen ourselves
- How to access the model’s item/movie bias? (see the sketch at the end of this section)
    - it is just a vector, of course
    - how to group mean_ratings, bias and movie title together?
    - then, how to sort them by bias value
    - so we can compare bias values against raw mean ratings
- How to squish the 40 latent factors down to 3 factors?
    - how to use PCA?
    - Rachel teaches PCA in a different course
    - How to use PCA to compare image similarities?
    - How to pair the PCA factors with the movie titles
    - and sort by each factor
    - and how to interpret them
- How to plot the movies by their factors
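A sketch of pulling out the movie biases and squishing the factors; it assumes the `learn` and merged `rating_movie` DataFrame from this section, uses fastai v1's `learn.bias` / `learn.weight` helpers, and swaps in scikit-learn's PCA for the reduction (the lesson uses a fastai tensor helper instead):

```python
from sklearn.decomposition import PCA

# the 1,000 most-rated movies, so we have hopefully seen some of them
g = rating_movie.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]

# per-movie bias, grouped with mean rating and title, sorted by bias
movie_bias = learn.bias(top_movies, is_item=True)
mean_ratings = rating_movie.groupby('title')['rating'].mean()
movie_ratings = [(b.item(), t, mean_ratings.loc[t]) for t, b in zip(top_movies, movie_bias)]
sorted(movie_ratings, key=lambda x: x[0])[:10]          # lowest-bias (worst) movies

# squish the 40 latent factors down to 3 and look at the extremes of each factor
movie_w = learn.weight(top_movies, is_item=True)        # shape (1000, 40)
fac0, fac1, fac2 = PCA(n_components=3).fit_transform(movie_w.detach().cpu().numpy()).T
sorted(zip(fac0, top_movies), reverse=True)[:10]        # movies scoring highest on factor 0
```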
How to read the source code of the collaborative learner
61:00-66:57
- just watch Jeremy explore it, as many times as it takes
- dive deeper with vim
- dive deeper with ipdb
Interpreting embeddings
66:57-72:27
What is the big deal about entity embeddings?
- a Kaggle (Rossmann) dataset and a 2016 paper on entity embeddings of categorical variables
- how do the learned embeddings perform when fed into different models?
What are the interesting findings from plotting the embeddings?
- projecting the embeddings recovers geography
- the embeddings for day of week and month of year trace out clear, ordered paths
- embeddings are under-researched
- it would be interesting to look at (and reuse) pre-trained models’ embeddings
What is weight decay
72:20-79:35
- weight decay is a kind of regularization
- how to understand regularization with Andrew Ng’s graph of underfitting and overfitting?
- what is the lie taught in traditional statistics courses?
    - that too many parameters cause overfitting
    - that complexity depends only on the number of parameters
- How to balance complexity against the number of parameters?
    - real life is full of complexity: curved boundaries, lots of nonlinearity, many parameters
    - but we want functions no more curvy than necessary
    - so how do we avoid overfitting at the same time?
- how to penalize complexity?
    - use lots of parameters, but penalize complexity
    - one way: sum up the values of the parameters (not quite; really, sum up the squares of the parameter values)
    - add that sum to the loss
    - what problem does that create?
        - a good loss would then require the sum of squared parameters to be near zero, pushing every weight towards zero
    - solution: multiply the sum of squared parameters by a small number, wd, before adding it to the loss
    - generally wd = 0.01 is a good default, but here we use 0.1
    - if wd is too small the model overfits easily, so we can’t train for long
- what is the difference between the kwargs of collab_learner and Learner? (see the sketch after this list)
- how to pass on or add additional args, such as wd
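A sketch of passing wd through as a kwarg; the constructor shown is fastai v1's collab_learner and the 0.1 value is the one used in these notes:

```python
# collab_learner forwards extra kwargs (such as wd) to the underlying Learner
learn = collab_learner(data, n_factors=40, y_range=[0, 5.5], wd=1e-1)
learn.fit_one_cycle(5, 5e-3)
```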
How to write SGD with weight decay from scratch
79:35-102:50
- how to implement SGD from scratch? (review)
- what does a loss such as MSE look like written from scratch?
- how to move from a plain loss to a loss with weight decay?
- How to use the MNIST pickle file as the experimental dataset
- How to use PyTorch to create the DataLoader, the loss and the affine function?
- How to grab a batch of data? (and why we always use a DataBunch)
- Make sure you can write your own nn.Module subclass
- What does nn.Linear do? write Mnist_Logistic to do the same thing (see the sketch at the end of this section)
- how to create a model out of the module you wrote?
- what does model.parameters() do?
- why use cross-entropy rather than MSE for classification?
- How to implement update with weight_decay?
    - how to write w2, the sum of squared parameters?
    - how to access all the parameters in the update?
    - how to write the loss with weight decay?
    - what does loss.item() mean?
- why do we need to reduce the learning rate as training goes on?
    - see the plot of the loss
- why do we call the w2*wd term weight decay?
    - differentiating it shows that the gradient step simply subtracts wd*w from the weights
    - when the penalty is added to the loss instead, we call it L2 regularization
    - with plain SGD the two are equivalent, but with more sophisticated optimizers (such as Adam) they become different
- How to refactor the code further into Mnist_NN
- How to do all of the above with less code using PyTorch’s built-in functions?
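A sketch in the spirit of the notebook described above: a hand-written logistic model for MNIST plus an update function that adds the wd * sum-of-squares penalty to the loss. The names Mnist_Logistic, w2 and update follow these notes; the batch handling, lr and wd values are illustrative:

```python
import torch
from torch import nn

class Mnist_Logistic(nn.Module):
    """Does the same job as a single nn.Linear(784, 10): one affine layer."""
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

model = Mnist_Logistic()
loss_func = nn.CrossEntropyLoss()    # cross-entropy, not MSE, for classification
lr, wd = 2e-2, 1e-5

def update(x, y):
    y_hat = model(x)
    # w2: sum of squared parameters, scaled by wd and added to the loss
    w2 = sum((p ** 2).sum() for p in model.parameters())
    loss = loss_func(y_hat, y) + wd * w2
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)      # update each parameter by -lr * gradient
            p.grad.zero_()
    return loss.item()               # .item() turns the 0-d loss tensor into a Python number

# usage: losses = [update(x, y) for x, y in train_dl], then plot the losses
```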
What is Adam optimization
102:50-120:00
* what is SGD in Excel?
* what is momentum in Excel?
    * v = 0.9*v_{t-1} + 0.1*g
    * w = w_{t-1} - lr*v
    * so the momentum is momentum of the gradients, not of the weights
    * what is the intuition of momentum on the graph?
        * an exponentially weighted average of the gradients
        * ::take one more step by inertia, and maybe we see the world better::
    * how to do SGD with momentum in PyTorch
* what is RMSProp in Excel?
    * where was this method first cited? (a Coursera course lecture, not a paper)
    * v = 0.9*v_{t-1} + 0.1*g^2, which means:
        * if the gradient is consistently small, v will be small
        * if the gradient is volatile, v will be large
        * if the gradient is consistently large, v will be large
    * w = w_{t-1} - lr*g/sqrt(v), which means:
        * if the gradient has consistently been tiny, take bigger steps
        * and vice versa
        * ::shake up a state that has lasted for a while, and maybe we see the world better::
    * a learning rate is still necessary
* what is Adam in Excel? (see the sketch after this list)
    * momentum and RMSProp added together
    * w = w_{t-1} - lr*v_momentum/sqrt(v_rmsprop)
* Deep dive into the Excel sheet
* Deep Dive: “An overview of gradient descent optimization algorithms” (Sebastian Ruder)
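A spreadsheet-style sketch of the three update rules for a single weight, using the 0.9 / 0.1 coefficients from these notes (bias correction and other refinements are omitted):

```python
import math

def momentum_step(w, g, v, lr=0.01):
    v = 0.9 * v + 0.1 * g                    # exponentially weighted average of gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=0.01, eps=1e-8):
    s = 0.9 * s + 0.1 * g ** 2               # exponentially weighted average of squared gradients
    return w - lr * g / (math.sqrt(s) + eps), s

def adam_step(w, g, v, s, lr=0.01, eps=1e-8):
    v = 0.9 * v + 0.1 * g                    # momentum piece
    s = 0.9 * s + 0.1 * g ** 2               # RMSProp piece
    return w - lr * v / (math.sqrt(s) + eps), v, s
```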
What is fit-one-cycle
120:00-123:30
- fastai takes care of the optimization details for us
- what does fit_one_cycle do? (see the sketch after this list)
    - the learning rate starts low
        - because at first we know very little about the loss landscape
    - it goes up for roughly the first half of training
        - once we know the landscape better and the direction is right, we can move faster
    - then it goes down for the second half
        - fine-tuning the weights as we get closer to convergence
    - on the right: the momentum graph
        - when the learning-rate steps are small and the gradients keep pointing the same way, momentum accumulates, suggesting we can simply take bigger steps
        - when the steps are big, momentum is kept small, so we can change direction flexibly
        - when the steps shrink again near the end, momentum rises again and we can take bigger steps once more
    - the result is super fast convergence
- an inspiring story
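A minimal sketch assuming a fastai v1 learner named `learn`; the recorder plot draws the rising-then-falling learning-rate curve next to the momentum curve described above:

```python
learn.fit_one_cycle(5, max_lr=1e-3)
learn.recorder.plot_lr(show_moms=True)   # left: learning rate schedule, right: momentum schedule
```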
What is cross-entropy loss
123:30-end
Introduction to the toy dataset
what is the intuition of cross-entropy loss in Excel?
what is the intuition of softmax in Excel?
- used for single-label, multi-class classification
How does PyTorch do both of them for us?
- nn.CrossEntropyLoss combines both functions above (see the sketch below)
what does a PyTorch multi-class classification model return to us?
- PyTorch’s default output (the raw scores fed into the loss) is different from fastai’s default
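A sketch of the two pieces and how nn.CrossEntropyLoss fuses them; the logits and target values here are made up:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw model outputs for one sample, 3 classes
target = torch.tensor([0])                  # the correct class index

# by hand: softmax turns the scores into probabilities, then cross-entropy takes
# -log of the probability assigned to the correct class
probs = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
by_hand = -torch.log(probs[0, target])

# PyTorch: nn.CrossEntropyLoss / F.cross_entropy does both steps on the raw logits
built_in = F.cross_entropy(logits, target)
assert torch.allclose(by_hand, built_in)
```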