Here are some of my notes for Lesson 8, in case anyone finds them useful. I looked at the list of subjects we covered in Part 1 and wrote down some of the ones I hear about a lot and have a ‘sort-of’ idea of, but would have trouble explaining exactly if asked. It was actually a very good exercise, and clarified a lot of topics. I may add to this if I do more; note-taking order is newest topic on top.
I had difficulty searching for specific topics in the part-1v3 course, Hiromi’s notes were very useful for finding relevant lecture sections.
[Computation on Arrays: Broadcasting]
Adam [ DL1v3L5@1:54:00 ][ Notes ]
- Adam builds upon RMSProp which builds upon Momentum.
- RMSProp takes the expo-weighted moving average of the squared gradient
- the result: updates will be large if the gradients are volatile or consistently large; small if they are consistently small.
- Adam takes the momentum step and divides it by the square root of the RMSProp term (the EWMA of the squared gradients).
- if the gradients are consistently small and non-volatile, the update will be larger.
- Adam is an adaptive optimizer
- Adam keeps track of the expo-weighted moving avg (EWMA) of the squared gradients (RMSProp) and the EWMA of the steps (momentum); it divides the momentum step by the square root of the EWMA of the squared gradients, and also uses ratios as in momentum.
- Adam is RMSProp & Momentum
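The two EWMAs above can be sketched as one update step. This is a minimal sketch in NumPy (names and hyperparameter values are mine, and it omits the bias-correction terms that full Adam adds):

```python
import numpy as np

# One Adam-style step: beta1/beta2 are the EWMA decay rates for the step
# (momentum) and for the squared gradients (RMSProp); eps avoids div-by-zero.
def adam_step(param, grad, m, v, lr=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum: EWMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSProp: EWMA of squared gradients
    param = param - lr * m / (np.sqrt(v) + eps)  # divide step by sqrt of the squared-EWMA
    return param, m, v

p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
p, m, v = adam_step(p, np.array([0.5]), m, v)    # toy single-parameter step
```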
Momentum [ Notes ]
- momentum performs a gradient update by mixing the current gradient with the previous update in a fixed ratio:
update = 0.1 * new_gradient + 0.9 * prev_update
- momentum adds stability to NNs, by having them keep updating in more of the same direction.
- this effectively creates an exponentially-weighted moving average, because all previous updates are still inside the previous update, but each gets multiplied by a smaller factor at every step.
- momentum is an exponentially weighted average of previous updates.
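The update rule above can be written directly in plain Python; the 0.1 / 0.9 ratio is from the note, the stream of gradients is a made-up toy:

```python
# Mix the new gradient with the previous update, per the formula above.
def momentum_update(new_gradient, prev_update):
    return 0.1 * new_gradient + 0.9 * prev_update

update = 0.0
for grad in [1.0, 1.0, 1.0]:     # a run of consistent gradients
    update = momentum_update(grad, update)
# older gradients are still inside `update`, shrunk by 0.9 each step (an EWMA)
```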
Affine Functions & Nonlinearities [ dl1v3L5@20:24 ]
- a superset of matrix multiplications
- convolutions are matmuls where some of the weights are tied
- so it’s more accurate to call them affine functions
an affine fn is a linear fn plus a constant; a matmul is a kind of affine fn.
- a nonlinearity, a.k.a. an activation function, is applied on the result of a matmul.
- sigmoids used to be used; now ReLUs (max(x, 0), i.e. negatives floored to zero) are used.
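Put together, one layer is an affine function followed by a nonlinearity. A minimal sketch in NumPy (the shapes and the toy weight values are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)        # "max-zero": negatives become 0

x = np.array([1.0, -2.0])          # input
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])         # weight matrix (the matmul part)
b = np.array([-0.5, 0.5])          # the bias makes it affine, not just linear
out = relu(x @ W + b)              # affine fn -> nonlinearity
```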
Universal approximation theorem [DL1v3L3@1:52:27][Notes]
any function can be approximated arbitrarily closely by a series of matmuls (affine fns) followed by nonlinearities.
this is the entire foundation & framework of deep learning: if you have a big enough matrix to multiply by, or enough of them, then a function that’s just a sequence of matmuls and nonlinearities can approximate anything; you just need a way to find the particular values of the weight matrices in your matmuls that solve your problem. We know how to find the values of parameters: gradient descent. So that’s actually it.
parameters = SGD ( affine-fn → nonlinear-fn )
- parameters are the values of the matrices (NN) that solve your problem via their sequence of affine-fn→nonlinearity.
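The “find the parameter values by gradient descent” part can be sketched in a few lines. A toy of my own: fit a single weight `a` in `y = a*x` to data generated with `a = 2`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                          # data generated with the "true" a = 2

a = 0.0                              # initial guess for the parameter
for _ in range(100):
    grad = ((a * x - y) * x).mean()  # gradient of (half the) mean squared error wrt a
    a -= 0.1 * grad                  # gradient descent step
# a has converged toward 2.0
```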
Collaborative Filtering & Embeddings [ dl1v3L4 ]:
- crossing one variable against another (e.g. users × movies) in a matrix, and doing a dot-product of embedding vectors to calculate a relational value (e.g. a rating).
- the embeddings are a set of vectors for each set of variables.
- training is done by normal SGD updating the embeddings, after calculating the loss of the predicted relational value vs. actual.
- the independent variables can be seen as the sets of variables, the dependent is the relational value being computed.
- “an embedding matrix is designed to be something you can index into as an array and grab one vector out of it” [dl1v3l4]
- there is also one tweak: a bias term is added; it isn’t multiplied in the dot-product, but added afterwards.
- an embedding just means to look something up in an array.
- multiplying by a one-hot encoded matrix is identical to an array lookup — an embedding is using an array lookup to do a matmul by a one-hot encoded matrix without ever actually creating it.
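The one-hot equivalence in the last bullet is easy to check in NumPy (the toy embedding values are mine):

```python
import numpy as np

emb = np.array([[0.1, 0.2],
                [0.3, 0.4],
                [0.5, 0.6]])         # 3 items, 2-dim embedding vectors

one_hot = np.array([0.0, 1.0, 0.0])  # one-hot row selecting item 1
via_matmul = one_hot @ emb           # matmul by the one-hot encoding
via_lookup = emb[1]                  # array lookup: the "embedding" shortcut
# both give the same vector, without ever materializing the one-hot matrix
```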
Weight Decay [ dl1v3L5 ]:
- a value scaling the sum-of-squares of parameters, which is then added to the loss function.
- purpose: penalize complexity. This is done by adding the sum of squared parameter values to the loss function, and tuning this added term by multiplying it by a number: the weight-decay coefficient (wd).
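As a sketch, the penalized loss described above looks like this (the `wd` value, the MSE base loss, and the toy numbers are my own choices):

```python
import numpy as np

# Loss = base loss + wd * sum of squared parameter values.
def loss_with_wd(pred, target, params, wd=0.01):
    mse = ((pred - target) ** 2).mean()              # base loss
    penalty = wd * sum((p ** 2).sum() for p in params)  # weight-decay term
    return mse + penalty

params = [np.array([1.0, -2.0])]                     # toy parameter tensor
loss = loss_with_wd(np.array([1.0]), np.array([0.0]), params)
# mse = 1.0, penalty = 0.01 * (1 + 4) = 0.05
```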