Lesson 4 - Official Topic

With some clarification on notation from the first part of my answer, let me go back to the distinction between the most elementary gradient-based optimization methods:

  • SGD
  • mini-batch GD
  • GD (=Gradient Descent, aka “vanilla gradient descent”).

tl;dr

The upshot is that the three GD-based methods optimize three distinct (though related) functions. The three functions have the same number of parameters (weights), but not the same number of data values:

  • SGD: the function uses only one sample of data (and at every iteration the function is fed a different sample);
  • mini-batch GD: the function uses as many samples as are in a batch (and at every iteration the function is fed a different batch);
  • GD: the function uses all samples of data (and at every iteration the function is fed the full dataset).

Nomenclature

SGD nowadays usually refers to mini-batch gradient descent, as pointed out by others, but for the purpose of exposition, here I will strictly follow the above nomenclature.

What follows is perhaps geared towards the mathematically inclined.

A bit of statistics
The gradients computed in SGD and mini-batch GD are approximations to the gradients computed in GD. At each iteration, SGD and mini-batch GD are fed randomly selected samples (while GD is always fed the same full set of samples), and the statistical averages of the gradients computed in SGD and mini-batch GD are equal to the gradients computed in GD.
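
To make that claim concrete, here is a minimal numpy sketch. The linear model, the squared loss, the 42 features per sample, and the random data are all assumptions made purely for illustration (the post does not prescribe any particular loss); the point is only that the batch gradients average out to the full-dataset gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data matching the example below: 10 samples, 42 weights, a linear model
# with a squared loss -- a stand-in chosen only for illustration.
xs = rng.normal(size=(10, 42))   # 10 samples x0, ..., x9, each with 42 features
ys = rng.normal(size=10)         # made-up targets
w = rng.normal(size=42)          # 42 weights w0, ..., w41

# Gradient of the per-sample loss (xs[i] @ w - ys[i])**2 with respect to w, for every i.
per_sample_grads = 2 * (xs @ w - ys)[:, None] * xs      # shape (10, 42)

full_grad = per_sample_grads.mean(axis=0)               # the gradient GD uses
first_batch_grad = per_sample_grads[:5].mean(axis=0)    # one mini-batch estimate (bs=5)
second_batch_grad = per_sample_grads[5:].mean(axis=0)   # the other mini-batch estimate

# Each batch gradient is only an approximation, but their average is exactly the GD gradient.
assert np.allclose((first_batch_grad + second_batch_grad) / 2, full_grad)
```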

Factors in choosing one or the other
There are many reasons to use mini-batch GD over the other methods. Two of them are:

  • computational cost: if the dataset is very large, computing the gradients over all of it takes your CPU or GPU a long time; but if the batch size is too small, the (CPU or, more likely, the) GPU will be underused: you might as well load the GPU to full capacity;
  • regularization: very loosely speaking and without going into any detail (the course will cover this on several occasions), regularization refers to methods that avoid overfitting and improve the model’s ability to generalize. This touches on the dichotomy “optimization vs generalization”. Optimizing the loss function over the full dataset, the learner may fit that specific dataset too well (overfitting), while mini-batch GD introduces some fluctuation in the gradients and tends to generalize better, in the sense that it makes more reliable predictions on new data it is fed (for the purpose of prediction, not learning; a subtle point).

A bit of mathematical formalism

Let’s say that our data consists of 10 samples:
x0, x1, ..., x9

Gradient Descent

Let’s say that the function to optimize has 42 parameters (weights):

w0, w1, ..., w41

Let’s denote the function to optimize with

y=f_GD(w0, ..., w41, x0, ..., x9)

and for the moment let’s not worry too much about the precise formula for this function f_GD, i.e. how it depends on the ws and xs.
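
For concreteness, here is a minimal sketch of a full-batch GD loop in numpy. The linear model, squared loss, and random data are stand-ins I am assuming, not something specified above; the only thing that matters here is the shape of the loop: every iteration is fed the same 10 samples.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 42))   # the 10 samples (each given 42 features here, an assumption)
ys = rng.normal(size=10)         # made-up targets
w = np.zeros(42)                 # the 42 weights w0, ..., w41
lr = 0.01                        # learning rate

def f_GD(w, xs, ys):
    # loss over the *full* dataset: mean squared error of a linear model (a stand-in)
    return ((xs @ w - ys) ** 2).mean()

def grad_f_GD(w, xs, ys):
    # gradient of f_GD with respect to w, written out by hand for this stand-in loss
    return 2 * xs.T @ (xs @ w - ys) / len(xs)

for step in range(100):          # every iteration is fed the same full dataset
    w -= lr * grad_f_GD(w, xs, ys)
```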

Stochastic gradient descent

The function to optimize also has 42 parameters (weights) w0, …, w41, but at each iteration, it will be fed a different sample:

  • 1st epoch:
    • f_SGD(w0, ..., w41, x0); then
    • f_SGD(w0, ..., w41, x1); then
    • ...
    • f_SGD(w0, ..., w41, x9); then
  • 2nd epoch:
    • f_SGD(w0, ..., w41, x0) again; then
    • ...
    • f_SGD(w0, ..., w41, x9) again; then
  • 3rd epoch:
    • f_SGD(w0, ..., w41, x0) yet again;
    • you get the idea.

As before, for the moment, let’s not worry too much about the precise dependence of the function f_SGD on the ws and xs.
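
A minimal sketch of the corresponding SGD loop, under the same stand-in assumptions as the GD sketch (linear model, squared loss, made-up data): each iteration updates the weights using a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 42))   # same made-up data as in the GD sketch
ys = rng.normal(size=10)
w = np.zeros(42)
lr = 0.01

def grad_f_SGD(w, x, y):
    # gradient of the single-sample loss (x @ w - y)**2 with respect to w
    return 2 * (x @ w - y) * x

for epoch in range(3):
    for i in range(len(xs)):     # x0, then x1, ..., then x9, as listed above
        w -= lr * grad_f_SGD(w, xs[i], ys[i])
        # (in practice the sample order is usually reshuffled at every epoch)
```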

Mini-batch GD

The function to optimize also has 42 parameters (weights) w0, …, w41, but at each iteration it will be fed a different (mini-)batch of data samples. Say for concreteness the batch size is bs=5. Then, as the iterations proceed, gradients are computed for

  • 1st epoch:
    • f_MB(w0, ..., w41, x0, ..., x4); then
    • f_MB(w0, ..., w41, x5, ..., x9); then
  • 2nd epoch:
    • f_MB(w0, ..., w41, x0, ..., x4) again; then
    • f_MB(w0, ..., w41, x5, ..., x9) again; then
  • 3rd epoch:
    • f_MB(w0, ..., w41, x0, ..., x4) yet again;
    • you get the idea.
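
And the corresponding mini-batch loop with bs=5, again under the same stand-in assumptions: each iteration updates the weights using one batch of 5 samples, so there are two updates per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 42))   # same made-up data as in the sketches above
ys = rng.normal(size=10)
w = np.zeros(42)
lr, bs = 0.01, 5                 # learning rate and batch size

def grad_f_MB(w, xb, yb):
    # gradient of the batch loss, i.e. of the mean squared loss over one mini-batch
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

for epoch in range(3):
    for start in range(0, len(xs), bs):          # two batches per epoch: x0..x4 and x5..x9
        xb, yb = xs[start:start + bs], ys[start:start + bs]
        w -= lr * grad_f_MB(w, xb, yb)
```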

How are f_GD, f_SGD, f_MB related?

There is a loss function associated with a single sample of data. Let’s denote it

y=f(w0, ..., w41, x)

where x is any one sample of data.

GD
f_GD(w0, ..., w41, x0, ..., x9) is the average of the 10 single losses:

f(w0, ..., w41, x0), ..., f(w0, ..., w41, x9)

Mini-batch GD
f_MB(w0, ..., w41, x0, ..., x4) is the average of 5 single losses:
f(w0, ..., w41, x0), ..., f(w0, ..., w41, x4)

and similarly f_MB(w0, ..., w41, x5, ..., x9) is the average of f(w0, ..., w41, x5), ..., f(w0, ..., w41, x9).

SGD

f_SGD(w0, ..., w41, x0) is simply f(w0, ..., w41, x0), and likewise with x0 replaced with x1, …, x9.
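
These three relations can be checked numerically with the same stand-in single-sample loss used in the earlier sketches (a squared loss on a linear model, which is an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 42))   # same made-up data as above
ys = rng.normal(size=10)
w = rng.normal(size=42)

def f(w, x, y):
    # the single-sample loss (again a stand-in squared loss, just for illustration)
    return (x @ w - y) ** 2

f_SGD_value = f(w, xs[0], ys[0])                                 # just one single loss
f_MB_value = np.mean([f(w, xs[i], ys[i]) for i in range(5)])     # average of 5 single losses
f_GD_value = np.mean([f(w, xs[i], ys[i]) for i in range(10)])    # average of all 10 single losses

# The full loss is the average of the two mini-batch losses, which in turn average the single losses.
f_MB_value_2 = np.mean([f(w, xs[i], ys[i]) for i in range(5, 10)])
assert np.isclose(f_GD_value, (f_MB_value + f_MB_value_2) / 2)
```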
