Lesson 4 - Official Topic

This is lesson 4, which is week 5 since last week was out of order.

Note: This is a wiki post - feel free to edit to add links from the lesson or other useful info.


Links from lesson

Other useful links

Notes by @Lankinen


Questionnaire entry 13 may be slightly misleading (Why does SGD use mini batches?).

Perhaps you could say a few words about the differences between (1) Stochastic Gradient Descent, (2) Mini Batch Gradient Descent, and (3) Batch Gradient Descent.


Scratch that. I did not notice your text in ‘SGD and mini-batches’.

You do explain it, it is just the question’s phrasing that is a little confusing.


A minor thing: in Lesson 4 of the book, the pandas background gradient is scaled along one axis at a time rather than over the whole dataframe. For example, df.iloc[10,16], which is 22, is shown dark.

Seaborn considers all the numbers in the dataframe when applying the gradient, so the same number 22 is now light.
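
A rough way to see why the same value can look dark in one plot and light in another: the sketch below does not call pandas or seaborn at all. It only mimics the assumed underlying step, min-max scaling computed per column versus over the whole array. The tiny `pixels` array and both helper functions are made up for illustration, and the real libraries may map values to colours somewhat differently.

```python
import numpy as np

# Hypothetical 3x3 stand-in for the pixel dataframe; the values are made up.
pixels = np.array([[0.0, 22.0, 0.0],
                   [0.0, 11.0, 255.0],
                   [0.0, 0.0, 128.0]])

def scale_per_column(a):
    """Min-max scale each column independently (one axis at a time)."""
    lo = a.min(axis=0, keepdims=True)
    hi = a.max(axis=0, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero for constant columns
    return (a - lo) / span

def scale_global(a):
    """Min-max scale using the min and max of the whole array."""
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo) if hi > lo else np.zeros_like(a)

per_axis = scale_per_column(pixels)
overall = scale_global(pixels)

# The value 22 is the largest in its column, so it sits at the dark end there,
# but it is small compared with 255, so it sits near the light end globally.
print(per_axis[0, 1], overall[0, 1])
```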


Is this post going to be a wiki for the community to add resources?


I’ve wiki-fied it :+1:


Is there a plan to work in smaller groups?

@SMEissa there are a few smaller groups that are active. Please check the study groups section: a book reading group and a Mid-Level API group are both active.

There are also a few open collaborations created by Radek. If you find something interesting and want to start a group, please do so! :smiley:


@rachel, Jeremy’s mic isn’t working


Yes, Text is good


Are you also meeting this weekend? It would be great.

I’m not sure yet, but I’ll update the respective wikis.

What's the difference between Gradient Descent and Stochastic Gradient Descent? Is there anything in particular that I should remember?


Gradient descent is when you use all your data to compute the gradients, then update your weights. Stochastic gradient descent is when you use mini-batches with random samples of your training set to compute the gradients, then update your weights.


To add, I remember it as: the “stochastic-ness” comes from the batches.


Gradient Descent ==> gradient is calculated using the whole dataset
Stochastic Gradient Descent ==> gradient is calculated using a single sample
Mini-Batch Gradient Descent ==> gradient is calculated using a batch of samples (its size is generally given by the batch size)

Traditionally, we refer to mini-batch gradient descent as SGD.


I don’t really like this terminology, which is pretty old (and absolutely no one does true stochastic gradient descent anyway). When we say stochastic gradient descent, we mean mini-batch gradient descent (and when people want to refer to stochastic gradient descent as defined above, they usually say true stochastic gradient descent).
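
To make the three definitions above concrete, here is a minimal NumPy sketch (not the course's code) in which the only thing that changes between the variants is the batch size. The toy linear-regression data, the function name `gradient_descent`, and the learning rate and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(x, y, batch_size, lr=0.1, epochs=300, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error.

    batch_size == len(x)     -> (full-batch) gradient descent
    1 < batch_size < len(x)  -> mini-batch gradient descent (what people call SGD)
    batch_size == 1          -> "true" stochastic gradient descent
    """
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]
            # Gradients of the mean squared error over this batch only.
            grad_w = 2 * np.mean(err * x[idx])
            grad_b = 2 * np.mean(err)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Made-up, noise-free data generated from w=3.0, b=0.5.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1, 1, size=64)
y = 3.0 * x + 0.5

full_batch = gradient_descent(x, y, batch_size=len(x))  # one update per epoch
mini_batch = gradient_descent(x, y, batch_size=16)      # four updates per epoch
single = gradient_descent(x, y, batch_size=1)           # 64 updates per epoch
print(full_batch, mini_batch, single)
```

On this noise-free toy problem all three variants should end up near the same parameters; they differ in how many weight updates they make per pass over the data and in how noisy each update is.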


Why did we reshape the images from a matrix into a vector?

To be able to do a matrix multiplication of the whole set of images by our weights: you can’t multiply a tensor of size N x 28 x 28 by a weight matrix that has one row per pixel (784 rows), but you can multiply a tensor of size N x 784 by it.
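
A small shape check illustrates this. The sketch uses NumPy as a stand-in for PyTorch tensors, with random arrays in place of real images and weights, and assumes one output per image (a 784 x 1 weight matrix); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((5, 28, 28))   # N=5 fake 28x28 "images"
weights = rng.random((784, 1))     # one weight per pixel, one output per image

# Multiplying the unflattened stack fails: the inner dimensions (28 vs 784) do not match.
try:
    images @ weights
    unflattened_ok = True
except ValueError:
    unflattened_ok = False

# Flatten each image into a row vector of 784 pixels, then multiply.
flat = images.reshape(len(images), -1)   # shape (5, 784)
preds = flat @ weights                   # shape (5, 1)

print(unflattened_ok, flat.shape, preds.shape)
```

Each row of `preds` is then the weighted sum of one image's 784 pixels.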