Here are some of my notes for Lesson 8, in case anyone finds them useful. I looked at the list of subjects we covered in part 1, and wrote down some of the ones I hear about a lot and have a "sort-of" idea about, but would have trouble explaining exactly if asked. It was actually a very good exercise, and clarified a lot of topics. I may add to this if I do more; note-taking order is newest topic on top.
I had difficulty searching for specific topics in the part 1 v3 course; Hiromi's notes were very useful for finding relevant lecture sections.
Broadcasting
[Computation on Arrays: Broadcasting]
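The note above is just a pointer, so here's a minimal sketch of what broadcasting does, using NumPy (the array values are made up for illustration):

```python
import numpy as np

# Broadcasting: NumPy "stretches" the smaller array across the larger
# one so elementwise operations work without explicitly copying data.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])  # shape (3,)

# b is broadcast across each row of a:
print(a + b)   # [[11. 22. 33.], [14. 25. 36.]]

# a column vector of shape (2, 1) broadcasts across columns instead:
c = np.array([[100.0], [200.0]])
print(a + c)   # [[101. 102. 103.], [204. 205. 206.]]
```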
Adam [ DL1v3L5@1:54:00 ][ Notes ]
 Adam builds upon RMSProp, which builds upon Momentum.
 RMSProp takes the exponentially weighted moving average of the squared gradients.
 the result (that moving average) will be large if the gradients are volatile or consistently large; small if consistently small.
 the update is then divided by the square root of that moving average (the squared terms).
 so if the gradients are consistently small and non-volatile, the update will be larger.
 Adam is an adaptive optimizer.
 Adam keeps track of the exponentially weighted moving average (EWMA) of the squared gradients (the RMSProp part) and the EWMA of the steps (the momentum part); the update divides the EWMA of the previous steps by the square root of the EWMA of the squared gradients.
 Adam is RMSProp & Momentum.
Momentum [ Notes ]
 momentum performs a gradient update by adding ratios of the current gradient step and the previous update.
 eg:
update = 0.1 * new_gradient + 0.9 * prev_update
 momentum adds stability to NNs, by having them keep updating in roughly the same direction.
 this effectively creates an exponentially weighted moving average, because all previous updates are contained inside the previous update, but they're multiplied smaller each step.
 momentum is a weighted average of previous updates.
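The update rule above can be sketched as a tiny loop (grad_fn and the 0.1/0.9 ratios are illustrative, matching the example):

```python
# Momentum sketch following the note's update rule: each step blends
# the current gradient step with the previous update.
def sgd_momentum(grad_fn, p, lr=0.1, beta=0.9, steps=100):
    prev_update = 0.0
    for _ in range(steps):
        update = lr * grad_fn(p) + beta * prev_update
        p -= update
        prev_update = update
    return p

# minimizing f(p) = (p - 3)^2, whose gradient is 2 * (p - 3):
p_min = sgd_momentum(lambda p: 2 * (p - 3), p=0.0)
```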
Affine Functions & Nonlinearities [ dl1v3L5@20:24 ]
 a superset of matrix multiplications
 convolutions are matmuls where some of the weights are tied
 so it's more accurate to call them affine functions.
 an affine fn is a linear fn plus a constant (bias); a matmul is a kind of affine fn.
[DL1v3L3 Notes]
 a nonlinearity, a.k.a. an activation function, is applied to the result of a matmul.
 sigmoids used to be used; now ReLU (max(0, x), i.e. clamp negatives to zero) is used.
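A minimal sketch of one layer, an affine function followed by ReLU (the numbers and shapes here are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)   # max(0, x): clamp negatives to zero

def layer(x, W, b):
    return relu(x @ W + b)    # affine fn (matmul + bias), then nonlinearity

x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.5, 0.5])
out = layer(x, W, b)   # the negative pre-activation is clamped to 0
```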
Universal approximation theorem [DL1v3L3@1:52:27][Notes]

any function can be approximated arbitrarily closely by a series of matmuls (affine fns) followed by nonlinearities.

this is the entire foundation & framework of deep learning: if you have a big enough matrix to multiply by, or enough of them (a function that's just a sequence of matmuls and nonlinearities, able to approximate anything), then you just need a way to find the particular values of the weight matrices in your matmuls that solve your problem. We know how to find the values of parameters: gradient descent. So that's actually it.
parameters = SGD ( affine fn → nonlinear fn )
 parameters are the values of the matrices (the NN) that solve your problem via their sequence of affine fn → nonlinearity.
Collaborative Filtering & Embeddings [ dl1v3L4 ]:
 creating a matrix of one variable versus another, and doing a dot-product of embedding vectors to calculate a relational value.
 the embeddings are a set of vectors, one set per variable.
 training is done by normal SGD updating the embeddings, after calculating the loss of the predicted relational value vs. the actual.
 the independent variables can be seen as the sets of variables; the dependent is the relational value being computed.
 "an embedding matrix is designed to be something you can index into as an array and grab one vector out of it" [dl1v3l4]
 there is also one tweak: a bias term is added; it isn't multiplied in the dot-product, but added afterwards.
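The dot-product-plus-bias prediction described above can be sketched like this (all names and sizes are illustrative, not the lecture's code):

```python
import numpy as np

# Collaborative-filtering prediction sketch: dot-product of a user
# embedding and an item embedding, plus per-user and per-item biases.
n_users, n_items, n_factors = 5, 4, 3
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(n_users, n_factors))
item_emb = rng.normal(size=(n_items, n_factors))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

def predict(u, i):
    # the bias isn't multiplied in the dot-product; it's added afterwards
    return user_emb[u] @ item_emb[i] + user_bias[u] + item_bias[i]

pred = predict(0, 2)
```

In training, SGD would update the embeddings and biases from the loss between `pred` and the actual value.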
[DL1v3L5@27:57]
 an embedding just means to look something up in an array.
 multiplying by a one-hot encoded matrix is identical to an array lookup; an embedding is using an array lookup to do a matmul by a one-hot encoded matrix without ever actually creating it.
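The equivalence claimed above is easy to demonstrate (the values are made up):

```python
import numpy as np

# Multiplying a one-hot row by an embedding matrix returns the same
# vector as simply indexing into the matrix.
emb = np.array([[0.1, 0.2],
                [0.3, 0.4],
                [0.5, 0.6]])   # 3 items, 2-dim embeddings

idx = 1
one_hot = np.zeros(3)
one_hot[idx] = 1.0

via_matmul = one_hot @ emb   # matmul by a one-hot row
via_lookup = emb[idx]        # array lookup: same vector, no matmul needed
```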
Weight Decay [ dl1v3L5 ]:
 a value scaling the sum of squares of the parameters, which is then added to the loss function.
 purpose: penalize complexity. This is done by adding the sum of squared parameter values to the loss function, and tuning this added term by multiplying it by a number: the wd hyperparameter.
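A sketch of the penalty term described above (the MSE loss and parameter values are stand-ins for illustration; only the `wd` hyperparameter comes from the lecture):

```python
import numpy as np

# Weight decay: add wd * (sum of squared parameters) to the loss,
# penalizing complexity.
def loss_with_wd(preds, targets, params, wd=0.01):
    mse = ((preds - targets) ** 2).mean()                # stand-in base loss
    penalty = wd * sum((p ** 2).sum() for p in params)   # sum of squared params
    return mse + penalty

preds = np.array([1.0, 2.0])
targets = np.array([1.0, 3.0])
params = [np.array([2.0, -1.0])]  # sum of squares = 5
# mse = 0.5, penalty = 0.01 * 5 = 0.05, total = 0.55
total = loss_with_wd(preds, targets, params)
```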