Fastai DL1v3 review notes – Lesson 8

Here are some of my notes for Lesson 8, in case anyone finds them useful. I looked at the list of subjects we covered in part-1 and wrote down some of the ones I hear about a lot and have a ‘sort-of’ idea of, but would have trouble explaining exactly if asked. It was actually a very good exercise, and clarified a lot of topics. I may add to this if I do more; the notes are ordered newest topic first.

I had difficulty searching for specific topics in the part-1v3 course; Hiromi’s notes were very useful for finding relevant lecture sections.


Broadcasting
[Computation on Arrays: Broadcasting]
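
A minimal illustration of the broadcasting rules covered in that chapter (NumPy; the array shapes and the normalization example are my own, purely illustrative):

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)
print((col + row).shape)           # (3, 4): the size-1 axes are "stretched", without copying data

# typical use: normalize each column of a matrix by per-column statistics
x = np.random.randn(5, 4)
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)   # (5, 4) op (4,) broadcasts over the rows
```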

Adam [ DL1v3L5@1:54:00 ][ Notes ]

  • Adam builds upon RMSProp which builds upon Momentum.
    • RMSProp takes the expo-weighted moving average of the squared gradient
      • the result: this average will be large if the gradients are volatile or consistently large, and small if they are consistently small.
  • Adam divides the step by the square root of this average of the squared gradients.
    • so if the gradients are consistently small and non-volatile, the update taken will be larger.
  • Adam is an adaptive optimizer
  • Adam keeps track of the exponentially-weighted moving average (EWMA) of the squared gradients (the RMSProp part) and the EWMA of the steps (the momentum part); the step it takes is the EWMA of the previous steps divided by the square root of the EWMA of the squared gradients, using the same ratio-style combination as momentum (a minimal sketch follows this list).
  • Adam is RMSProp & Momentum
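
A minimal sketch of one Adam step as described above (NumPy; the function name, beta values, and learning rate are illustrative, and the bias correction used in full Adam is omitted for clarity):

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One simplified Adam update. `state` holds the two EWMAs:
    'avg_grad'    is the EWMA of the gradients         (the momentum part)
    'avg_sq_grad' is the EWMA of the squared gradients (the RMSProp part)"""
    state['avg_grad']    = beta1 * state['avg_grad']    + (1 - beta1) * grad
    state['avg_sq_grad'] = beta2 * state['avg_sq_grad'] + (1 - beta2) * grad ** 2
    # step in the momentum direction, scaled down where the squared gradients are large
    return param - lr * state['avg_grad'] / (np.sqrt(state['avg_sq_grad']) + eps)

# usage: carry `state` between steps
state = {'avg_grad': 0.0, 'avg_sq_grad': 0.0}
w = 1.0
w = adam_step(w, grad=0.3, state=state)
```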

Momentum [ Notes ]

  • momentum performs a gradient update by combining a fraction of the new gradient with a fraction of the previous update (see the sketch after this list).
    • eg: update = 0.1 * new_gradient + 0.9 * prev_update
  • momentum adds stability to NN training by keeping the updates moving in roughly the same direction.
  • this effectively creates an exponentially-weighted moving average, because every earlier update is contained inside the previous update, scaled down further at each step.
  • momentum is a weighted average of previous updates.
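
A minimal sketch of the momentum update above (the function name and learning rate are illustrative; the 0.1 / 0.9 split mirrors the example in the list):

```python
def momentum_step(param, grad, prev_update, lr=0.1, beta=0.9):
    # update = 0.9 * prev_update + 0.1 * new_gradient, as in the example above
    update = beta * prev_update + (1 - beta) * grad
    return param - lr * update, update

# usage: carry the previous update between steps
w, prev = 1.0, 0.0
for g in [0.5, 0.4, 0.45]:   # pretend gradients from successive minibatches
    w, prev = momentum_step(w, g, prev)
```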

Affine Functions & Nonlinearities [ dl1v3L5@20:24 ]

  • a superset of matrix multiplications
  • convolutions are matmuls where some of the weights are tied
    • so it’s more accurate to call them affine functions
  • an affine fn is a linear fn (plus a constant / bias term); a matmul is one kind of affine fn.

[DL1v3L3 Notes]

  • a nonlinearity, a.k.a. an activation function, is applied to the result of a matmul.
  • sigmoids used to be used; now ReLU (max-zero: negative values clamped to zero) is used.
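
A minimal sketch of an affine function followed by a ReLU nonlinearity (PyTorch; the shapes and values are arbitrary):

```python
import torch

x = torch.randn(16, 10)      # a minibatch of 16 inputs with 10 features
W = torch.randn(10, 5)       # weight matrix
b = torch.randn(5)           # bias

affine = x @ W + b           # affine fn: a matmul plus a bias
out = affine.clamp_min(0.)   # ReLU: "max-zero", negative values clamped to zero
```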

Universal approximation theorem [DL1v3L3@1:52:27][Notes]

  • any function can be approximated arbitrarily closely by a series of matmuls (affine fns) followed by nonlinearities.
    • this is the entire foundation & framework of deep learning: if you have a big enough matrix to multiply by, or enough of them, then a function that’s just a sequence of matmuls and nonlinearities can approximate anything. You then just need a way to find the particular values of the weight matrices in your matmuls that solve your problem, and we already know how to find the values of parameters: gradient descent. So that’s actually it.

parameters = SGD ( affine-fn → nonlinear-fn )

  • parameters are the values of the matrices (NN) that solve your problem via their sequence of affine-fn→nonlinearity.
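
A minimal sketch of that recipe: a stack of affine fns and nonlinearities whose parameters are found by SGD (PyTorch; the target function, layer sizes, learning rate, and iteration count are arbitrary choices for illustration):

```python
import torch
from torch import nn

# approximate y = sin(x) with a sequence of (affine -> ReLU) layers, fit by gradient descent
x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(x)

model = nn.Sequential(
    nn.Linear(1, 50), nn.ReLU(),    # affine fn followed by a nonlinearity
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 1),               # final affine fn
)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for _ in range(2000):
    loss = ((model(x) - y) ** 2).mean()   # MSE loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```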

Collaborative Filtering & Embeddings [ dl1v3L4 ]:

  • creating a matrix of one variable versus another, and doing a dot-product of embedding vectors to calculate a relational value.
  • the embeddings are a set of vectors for each set of variables.
  • training is done by normal SGD updating the embeddings, after calculating the loss of the predicted relational value vs. actual.
    • the independent variables can be seen as the two sets of variables; the dependent variable is the relational value being computed.
  • “an embedding matrix is designed to be something you can index into as an array and grab one vector out of it” [dl1v3l4]
  • there is also one tweak: a bias term is added; it isn’t multiplied in the dot-product, but added afterwards.
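
A minimal sketch of the dot-product-plus-bias model described above (PyTorch; the class name, embedding sizes, and user/item counts are illustrative, not the fastai implementation):

```python
import torch
from torch import nn

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.u_emb  = nn.Embedding(n_users, n_factors)   # one embedding vector per user
        self.i_emb  = nn.Embedding(n_items, n_factors)   # one embedding vector per item
        self.u_bias = nn.Embedding(n_users, 1)
        self.i_bias = nn.Embedding(n_items, 1)

    def forward(self, users, items):
        # dot product of the two embedding vectors ...
        dot = (self.u_emb(users) * self.i_emb(items)).sum(dim=1, keepdim=True)
        # ... with the bias terms added afterwards, not multiplied in
        return dot + self.u_bias(users) + self.i_bias(items)

# usage: predicted relational values (e.g. ratings) for a batch of (user, item) pairs;
# train with a loss against the actual values and ordinary SGD updating the embeddings
model = DotProductBias(n_users=100, n_items=200)
preds = model(torch.tensor([0, 1, 2]), torch.tensor([5, 6, 7]))
```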

[DL1v3L5@27:57]

  • an embedding just means to look something up in an array.
  • multiplying by a one-hot encoded matrix is identical to an array lookup — an embedding is using an array lookup to do a matmul by a one-hot encoded matrix without ever actually creating it.
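
A minimal demonstration of that equivalence (PyTorch; the sizes are arbitrary):

```python
import torch

emb = torch.randn(5, 3)        # an embedding matrix: 5 items, 3 factors each
idx = 2

one_hot = torch.zeros(5)
one_hot[idx] = 1.0

via_matmul = one_hot @ emb     # matmul by a one-hot encoded vector
via_lookup = emb[idx]          # plain array lookup

assert torch.allclose(via_matmul, via_lookup)
```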

Weight Decay [ dl1v3L5 ]:

  • a multiplier on the sum of squares of the parameters; the scaled sum is added to the loss function.
  • purpose: penalize complexity. This is done by adding the sum of squared parameter values to the loss function, and tuning this added term by multiplying it by a number: the wd hyperparameter.
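
A minimal sketch of weight decay as an added loss term (PyTorch; the function name and wd value are illustrative; libraries often apply the equivalent gradient term directly instead):

```python
import torch

wd = 1e-2   # the weight-decay hyperparameter

def loss_with_wd(pred, target, params):
    base = ((pred - target) ** 2).mean()              # ordinary loss (MSE here)
    penalty = sum((p ** 2).sum() for p in params)     # sum of squared parameter values
    return base + wd * penalty                        # penalize complexity
```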