Lesson 8 notes

For me, reading notes is an important part of the learning experience. In previous courses, students have made great summaries of the lectures, which made it easier to recall things. I wanted to continue writing these notes in part 2 as well, and I thought it might be a good idea to gather all of them in one place. For that reason, I suggest that people share their notes here instead of in the discussion thread.

I'll start by sharing my notes:


Hello guys!

You did a wonderful job Lankinen! I will definitely save it!

Here are my class notes: https://github.com/WittmannF/fastai-dlpt2-notes/blob/master/lesson-8.md

I love the freedom that writing in markdown provides. It is clean and at the same time allows complex formatting, including LaTeX equations and images. It will be my default tool for writing my notes during the class. I hope it will be useful for those who want to quickly recap a topic we've seen. I did a review this morning, but it is still a little chaotic, especially the notes from the Jupyter parts. It is still a work in progress :slight_smile:


Thanks so much to both of you!

@Lankinen: sorry to trouble you, especially after all that hard work, but could you please move your notes somewhere that's not Medium? We don't need lesson notes for the unreleased course to be actually hidden, but I don't want them to be promoted/shared until after the MOOC is out in June. Medium does a lot of cross-promoting, which can cause problems if folks not in the course start seeing them and asking about stuff they don't yet have access to.

Github or Google docs or threads on the forum are all good options.


There is already a thread for collaborative lesson notes.
It's a wiki as well. It would be great if we can add all the different versions of notes there, and collaborate on a really good one as Jeremy suggested.


@Lankinen FYI you could share notes as an ‘unlisted’ draft on Medium too. This isn’t promoted, and is accessible only to those who have the link to the post :slight_smile: You can later publish this publicly.

  • Unlisted stories will not appear in the home feed, profile page, tag page, or search on Medium, nor will they appear in notifications or email digests.
  • To share an unlisted story, simply share the URL of the post. Unlisted stories are not password protected, and anyone who has the link will be able to view the post.

Sorry about that. I will change it to unlisted right away, like @Taka suggested.


Thank you for pointing that out, because I hadn't seen the thread. Collaborative notes are a great idea, and I'll try to help with those again in this part. From now on I will publish my notes there.

I am looking for help with the markdown used on this forum. This is so I can write in vim and use pandoc to convert from '.md', but also post parts of what I create here. If I wanted to have a footnote in my post, is there a way to do that here? We have the icons in this reply box, but is there more that can be used? In markdown, [^1] is a footnote, but it does not seem to work here.

In short, is there a cheatsheet for the markdown features used on this forum, as opposed to markdown used elsewhere?

[^1]: A foot note

From googling, my impression is that Discourse (the forum software) uses CommonMark to implement markdown, so maybe try here? https://commonmark.org/help/

Also, it looks like their spec is here: https://spec.commonmark.org/

Here are some more notes:


Here’s a quite literal transcription of Lesson 8 in a Jupyter Notebook, including slides and code:

I made this transcription as an experiment to see how much I would learn doing it, but at least to me it is a pretty inefficient way to learn - especially when transcribing everything. Next week, I’ll just make a summary with the core concepts, so I have more time to experiment and focus on the most important things :).
Hope this is useful to at least some people.


Here are some of my notes for Lesson 8, in case anyone finds them useful. I looked at the list of subjects we covered in part 1 and wrote down some of the ones I hear about a lot and have a 'sort-of' idea about, but would have trouble explaining exactly if asked. It was actually a very good exercise, and clarified a lot of topics. I may add to this if I do more; note-taking order is newest topic on top.

I had difficulty searching for specific topics in the part-1 v3 course; Hiromi's notes were very useful for finding relevant lecture sections.

[Computation on Arrays: Broadcasting]

Adam [ DL1v3L5@1:54:00 ][ Notes ]

  • Adam builds upon RMSProp, which builds upon momentum.
    • RMSProp keeps the expo-weighted moving average of the squared gradients.
      • this average is large if the gradients are volatile or consistently large; small if consistently small.
  • RMSProp divides the update by the square root of this average (the squared terms).
    • so if the gradients are consistently small and non-volatile, the update will be larger.
  • Adam is an adaptive optimizer.
  • Adam keeps track of the expo-weighted moving avg (EWMA) of the squared gradients (RMSProp) and the EWMA of the steps (momentum), and divides the EWMA of the steps by the square root of the EWMA of the squared gradients.
  • Adam is RMSProp + momentum.
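The notes above can be sketched as a toy Adam-style update in Python. This is illustrative only, not the course code; the name `adam_step` and the default betas are my own assumptions, and the bias correction used by the real Adam early in training is omitted for brevity.

```python
import math

# Toy sketch of an Adam-style update.
# state holds two EWMAs: m (of gradients, i.e. momentum) and
# v (of squared gradients, i.e. RMSProp).
def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    m, v = state
    m = beta1 * m + (1 - beta1) * grad       # EWMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # EWMA of squared gradients (RMSProp)
    param = param - lr * m / (math.sqrt(v) + eps)  # step by their ratio
    return param, (m, v)
```

Note how the division by `sqrt(v)` makes the step larger when gradients have been consistently small and non-volatile, as described above.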

Momentum [ Notes ]

  • momentum is performing a gradient update by adding ratios of the current update and the previous.
    • eg: update = 0.1 * new_gradient + 0.9 * prev_update
  • momentum adds stability to NNs by having them update in more of the same direction.
  • this effectively creates an exponentially-weighted moving average, because all previous updates are folded into the previous update, but each gets multiplied by a smaller factor at every step.
  • momentum is a weighted average of previous updates.
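The ratio update in the example above (0.1 of the new gradient, 0.9 of the previous update) can be written as a one-liner; the function name and parameter `beta` are my own, not from the lecture:

```python
# Minimal momentum sketch: mix the current gradient with the previous update.
# With beta = 0.9 this is: update = 0.1 * new_gradient + 0.9 * prev_update
def momentum_step(grad, prev_update, beta=0.9):
    return beta * prev_update + (1 - beta) * grad
```

Unrolling this a few steps shows the exponentially-weighted moving average: every older gradient is still inside `prev_update`, just multiplied by an extra factor of 0.9 per step.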

Affine Functions & Nonlinearities [ dl1v3L5@20:24 ]

  • a superset of matrix multiplications
  • convolutions are matmuls where some of the weights are tied
    • so it’s more accurate to call them affine functions
  • an affine fn is a linear fn; a matmul is a kind of affine fn.

[DL1v3L3 Notes]

  • a nonlinearity, a.k.a. an activation function, is applied on the result of a matmul.
  • sigmoids used to be used; now ReLUs (max(0, x)) are used.

Universal approximation theorem [DL1v3L3@1:52:27][Notes]

  • any function can be approximated arbitrarily closely by a series of matmuls (affine fns) followed by nonlinearities.
    • this is the entire foundation & framework of deep learning: if you have a big enough matrix to multiply by, or enough of them — a function that's just a sequence of matmuls and nonlinearities that can approximate anything — then you just need a way to find the particular values of the weight matrices in your matmuls that solve your problem. We know how to find the values of parameters: gradient descent. So that's actually it.

parameters = SGD ( affine-fn → nonlinear-fn )

  • parameters are the values of the matrices (NN) that solve your problem via their sequence of affine-fn→nonlinearity.
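As a tiny concrete instance of the affine-fn → nonlinearity framework, here is a hand-built one-hidden-layer net that represents abs(x) exactly, using the identity abs(x) = relu(x) + relu(-x). The weights are hand-picked for illustration rather than found by SGD, and the names are my own:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

W1 = np.array([[1.0, -1.0]])   # first affine fn: x -> (x, -x)
W2 = np.array([[1.0], [1.0]])  # second affine fn: sum the two channels

def net(x):
    # affine -> ReLU -> affine; computes abs(x) exactly with these weights
    x = np.atleast_2d(x)
    return relu(x @ W1) @ W2
```

In practice, of course, SGD is what finds weights like these from data; the point here is just that a short stack of matmuls and nonlinearities can represent a non-trivial function.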

Collaborative Filtering & Embeddings [ dl1v3L4 ]:

  • creating a matrix of one variable versus another, and doing a dot-product of embedding vectors to calculate a relational value.
  • the embeddings are a set of vectors for each set of variables.
  • training is done by normal SGD updating the embeddings, after calculating the loss of the predicted relational value vs. actual.
    • the independent variables can be seen as the sets of variables; the dependent is the relational value being computed.
  • “an embedding matrix is designed to be something you can index into as an array and grab one vector out of it” [dl1v3l4]
  • there is also one tweak: a bias term is added; it isn’t multiplied in the dot-product, but added afterwards.
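The prediction step described above might be sketched like this; the function name, argument layout, and bias arrangement are illustrative assumptions, not the course's exact code:

```python
import numpy as np

# Dot product of two embedding vectors plus bias terms gives the
# predicted relational value (e.g. a user's rating of a movie).
def predict(user_vec, item_vec, user_bias, item_bias):
    return user_vec @ item_vec + user_bias + item_bias
```

Training would then compute a loss between this prediction and the actual value, and let SGD update the embedding vectors and biases.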


  • an embedding just means to look something up in an array.
  • multiplying by a one-hot encoded matrix is identical to an array lookup — an embedding is using an array lookup to do a matmul by a one-hot encoded matrix without ever actually creating it.
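That equivalence is easy to check with a toy example (the array shapes and names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 3))  # embedding matrix: 5 items, 3-dim vectors

idx = 2
one_hot = np.zeros(5)
one_hot[idx] = 1.0

lookup = emb[idx]        # array lookup ("embedding")
matmul = one_hot @ emb   # matmul by a one-hot vector

assert np.allclose(lookup, matmul)  # identical results
```

The lookup just skips materializing the one-hot matrix, which is what makes embeddings fast.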

Weight Decay [ dl1v3L5 ]:

  • a value scaling the sum-of-squares of parameters, which is then added to the loss function.
  • purpose: penalize complexity. This is done by adding the sum of squared parameter values to the loss function, and tuning this added term by multiplying it by a number: the wd hyperparameter.
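A minimal sketch of that penalty (the function name and the default wd value are illustrative, not from the lecture):

```python
# Weight decay (L2) sketch: add wd * sum of squared parameters to the loss.
def loss_with_wd(base_loss, params, wd=0.01):
    penalty = wd * sum(p ** 2 for p in params)
    return base_loss + penalty
```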

Thanks @Borz! I’ve moved this to the main notes thread - hope that’s OK with you.

I wanted to write about lessons on my personal blog (my domain). Is this cool with the fast.ai team?
I won’t do full lessons but rather interesting/ challenging parts from the lectures (I.e. matmul section, einsum etc. from lesson 8). With proper referencing fast.ai of course…

In case Jeremy doesn’t notice this, I would say it is OK. At least previously, people have published this kind of independent tutorial without problems, as long as you don’t rely too much on the material. I’m very interested in reading these blogs, so I hope you add the link as soon as you publish something.

Thanks, it’ll probably serve its purpose better here :slight_smile:

Please do - but don’t link to unreleased materials of course, and try to focus on individual bits that are still useful to people that don’t have access to the full course yet.

Thanks for checking!


Would it be fine, though, if I want to write about the PyTorch nn.Module that was explained in the lesson? I found the refactoring part quite neat, and it may be useful for people who do not have a strong software background (like myself).

That largely exists already - I contributed a tutorial to PyTorch :slight_smile:



ah, that’s awesome.