Lesson 4 - Official Topic

I’m still a bit shocked at the global approach of trying to convince people to believe (or understand) science rather than working with behavioral experts to get folks wearing masks much sooner. Then, support with science. Behavioral economics FTW.

Can Jeremy share the last section on Masks publicly again? Would love to update my network with the latest. Thanks again for another great lecture!

PLUS I hope your daughter is doing OK and feels better soon!!

1 Like

I’m not sure why you assume that. Plenty of behavioral experts involved. It turned out that we kept hitting a brick wall when policy makers checked with their science advisors, so that was the obstacle we had to remove.

4 Likes

I live in an area where they neither believe the science nor is there any behavioral incentive. Not surprisingly, few are wearing masks. I’ve not heard many behavioral science voices throughout the pandemic, other than the group of psychiatrists who put their careers on the line speaking out about a different aspect of the situation.

1 Like

Hi Jess hope you are having a wonderful day!

I just saw yours and Jeremy’s post. I am based in London and was stunned when I heard @jeremy say he will be on TV on the BBC and ITV. I don’t own a TV but will be trying to watch it online if I can.

I had just replied to a post on ethics describing some thoughts I have about ethics.

The last two paragraphs in this link https://forums.fast.ai/t/lesson-5-official-topic/68039/271?u= :smiley: :smiley: say it perfectly for me; it’s people like Jeremy and Rachel that I refer to in those last two paragraphs.

And the majority of the people who are my heroes, Benjamin Franklin, Gandhi and Martin Luther King, were not experts in politics or change management. Having read all their autobiographies, I’d say they were all just individuals who were striving to be and do better; none of them set out to become who they became.

Have a wonderful day, cheers mrfabulous1 :grinning: :smiley:

It would make more sense to see a plot of the weight values: how do they look in comparison to the activations generated using them?

In the example in the lesson, the images are presized to 460, then resized to 224. How does one in general choose the relative sizes of these, including the original image size? Or does one just have to play with them to see what’s good?

After rewatching the presizing part of the lesson and reading the chapter’s explanation, I believe that in general we need to choose a size equal to either the height or the width of the original images.

Since all the images in the pets dataset have a width of 500, I believe 460, which is just a bit smaller than that, is the size chosen.
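
For reference, here is (roughly, from memory) how the notebook sets this up: Resize(460) is the generous per-item crop done on the CPU, and aug_transforms(size=224, min_scale=0.75) does the augmentation and final resize on the GPU for the whole batch. The main rule of thumb is just that the first size should be comfortably larger than the second, so the augmented crop still has real pixels to work with:

from fastai.vision.all import *

path = untar_data(URLs.PETS)  # Oxford-IIIT Pet dataset, as in the lesson

pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    item_tfms=Resize(460),                                # presize: large crop per item (CPU)
    batch_tfms=aug_transforms(size=224, min_scale=0.75))  # augment + final resize per batch (GPU)

dls = pets.dataloaders(path/"images")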

I am going to add to the other answers (e.g. @ram_cse), catering perhaps to the more mathematically inclined, but I hope this will be helpful to others.

I am splitting my answer into two parts. This first part is on notation.

Let’s forget about neural nets for a moment. The conventional notation in math textbooks is to write a function as

y=f(x)

where x is the variable with respect to which the typical math exercise asks you to run the optimization. See for example the cell in the fastbook notebook 04_mnist_basics.ipynb with:

def f(x): return x**2
plot_function(f, 'x', 'x**2')
plt.scatter(-1.5, f(-1.5), color='red');

(the function plot_function is defined in fastbook/utils.py).

Note that diagrams in subsequent cells show a graph with parameter along the horizontal axis, rather than x.

In training a neural net, on the other hand, the (loss) function to optimize involves two very different types of variables:

  • weights (and biases): let’s collectively denote them with w;
  • data samples: let’s collectively denote them with x.

Thus, writing explicitly the dependence of the loss function on its weights and the data gives

y=f(w,x)

But in the training phase, the optimization is with respect to w (the weights), not x (the data). That is, x remains fixed. (Alternatively, we could decide to incorporate the dependence on x in the symbol f, so that the function that we are interested in optimizing would simply be written y=f(w), but it is helpful to keep track of the data that we are using, so we shall keep the explicit dependence on x in the notation as well.)
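
To make this concrete, here is a tiny PyTorch sketch (the numbers are made up) in which the gradient is taken with respect to w while the data sample x stays fixed:

import torch

# A fixed data sample x and target (made-up values, just for illustration)
x = torch.tensor(2.0)
target = torch.tensor(3.0)

# The weight w is the variable we optimize, so it is the one that requires gradients
w = torch.tensor(0.5, requires_grad=True)

def f(w, x):
    pred = w * x                  # a one-parameter "model"
    return (pred - target)**2     # squared-error loss for this single sample

loss = f(w, x)
loss.backward()   # the gradient is computed with respect to w, not x
print(w.grad)     # df/dw = 2*(w*x - target)*x = 2*(1.0 - 3.0)*2.0 = -8.0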

Revisiting the earlier example in the fastbook notebook 04_mnist_basics.ipynb, we now obtain:

def f(w): return w**2
plot_function(f, 'w', 'y')
plt.scatter(-1.5, f(-1.5), color='red');
1 Like

With some clarification on notation from the first part of my answer, let me go back to the distinction between the most elementary gradient-based optimization methods:

  • SGD
  • mini-batch GD
  • GD (=Gradient Descent, aka “vanilla gradient descent”).

tl;dr

The upshot in the distinction between the three GD-based methods is that they optimize three distinct (though related) functions. The three functions have the same number of parameters (weights), but not the same number of data values:

  • SGD: the function uses only one sample of data (and at every iteration the function is fed a different sample);
  • mini-batch GD: the function uses as many samples as are in a batch (and at every iteration the function is fed a different batch);
  • GD: the function uses all samples of data (and at every iteration the function is fed the full dataset).

Nomenclature

SGD nowadays usually refers to mini-batch gradient descent, as pointed out by others, but for the purpose of exposition, here I will strictly follow the above nomenclature.

What follows is perhaps geared towards the mathematically inclined.

A bit of statistics
The gradients computed in SGD and mini-batch GD are approximations to the gradients computed in GD. At each iteration, SGD and mini-batch GD are fed randomly selected samples (while GD is always fed the same full set of samples), and the statistical averages of the gradients computed in SGD and mini-batch GD are equal to the gradients computed in GD.
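
Here is a small numerical sketch of that statement (with made-up data, not from the notebook): the gradient on the full dataset equals the average of the per-sample gradients, and the average over many randomly drawn mini-batch gradients converges to it.

import torch

torch.manual_seed(0)
xs = torch.randn(10)    # 10 made-up data samples
ys = torch.randn(10)    # their made-up targets

def grad_wrt_w(w, xb, yb):
    # Gradient with respect to w of the mean squared error on the batch (xb, yb)
    w = w.clone().requires_grad_(True)
    loss = ((w * xb - yb)**2).mean()
    loss.backward()
    return w.grad

w = torch.tensor(0.3)

full_grad = grad_wrt_w(w, xs, ys)   # the GD gradient, computed on all 10 samples

# Average the mini-batch gradient (batch size 5) over many random batches:
# it converges to the full-dataset gradient (unbiasedness).
estimates = []
for _ in range(10_000):
    idx = torch.randperm(10)[:5]    # a random mini-batch of size 5
    estimates.append(grad_wrt_w(w, xs[idx], ys[idx]))
print(full_grad, torch.stack(estimates).mean())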

Factors in choosing one or the other
There are many reasons to use mini-batch GD over the other methods. Two of them are:

  • computational cost: if the dataset is very large, it will take a long time for your CPU or GPU to compute the gradients over all of it at once, but if the batch size is too small, then the (CPU or, more likely, the) GPU will be under-used: you might as well load the GPU to full capacity;
  • regularization: very loosely speaking and without going into any detail (the course will cover this on several occasions), regularization refers to methods for avoiding overfitting and increasing the model’s generalizability. This touches on the dichotomy “optimization vs generalization”. Optimizing the loss function on the full dataset, the learner may learn that specific dataset too well (overfitting), while mini-batch GD allows for some fluctuation and should be able to generalize better, in the sense that it will make more reliable predictions no matter what new data it is fed (for the purpose of prediction, not learning; a subtle point).

A bit of mathematical formalism

Let’s say that our data consists of 10 samples:
x0, x1, ..., x9

Gradient Descent

Let’s say that the function to optimize has 42 parameters (weights):

w0, w1, ..., w41

Let’s denote the function to optimize with

y=f_GD(w0, ..., w41, x0, ..., x9)

and for the moment let’s not worry too much about the precise formula for this function f_GD, i.e. how it depends on the ws and xs.

Stochastic gradient descent

The function to optimize also has 42 parameters (weights) w0, …, w41, but at each iteration, it will be fed a different sample:

  • 1st epoch:
    • f_SGD(w0, ..., w41, x0); then
    • f_SGD(w0, ..., w41, x1); then
    • ...; then
    • f_SGD(w0, ..., w41, x9); then
  • 2nd epoch:
    • f_SGD(w0, ..., w41, x0) again; then
    • ...; then
    • f_SGD(w0, ..., w41, x9) again; then
  • 3rd epoch:
    • f_SGD(w0, ..., w41, x0) yet again;
    • you get the idea.

As before, for the moment, let’s not worry too much about the precise dependence of the function f_SGD on the ws and xs.

Mini-batch GD

The function to optimize also has 42 parameters (weights) w0, …, w41, but at each iteration it will be fed a different (mini-)batch of data samples. Say, for concreteness, that the batch size is bs=5. Then, as the iterations proceed, gradients are computed for

  • 1st epoch:
    • f_MB(w0, ..., w41, x0, ..., x4); then
    • f_MB(w0, ..., w41, x5, ..., x9); then
  • 2nd epoch:
    • f_MB(w0, ..., w41, x0, ..., x4) again; then
    • f_MB(w0, ..., w41, x5, ..., x9) again; then
  • 3rd epoch:
    • f_MB(w0, ..., w41, x0, ..., x4) yet again;
    • you get the idea.

How are f_GD, f_SGD, f_MB related?

There is a loss function associated with a single sample of data. Let’s denote it

y=f(w0, ..., w41, x)

where x is any one sample of data.

GD
f_GD(w0, ..., w41, x0, ..., x9) is the average of the 10 single losses:

f(w0, ..., w41, x0), ..., f(w0, ..., w41, x9)

Mini-batch GD
f_MB(w0, ..., w41, x0, ..., x4) is the average of 5 single losses:
f(w0, ..., w41, x0), ..., f(w0, ..., w41, x4)

and similarly f_MB(w0, ..., w41, x5, ..., x9) is the average of f(w0, ..., w41, x5), ..., f(w0, ..., w41, x9)

SGD

f_SGD(w0, ..., w41, x0) is simply f(w0, ..., w41, x0), and likewise with x0 replaced with x1, …, x9.
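
In code, the relationship looks something like this (a sketch with a made-up per-sample loss f; only the averaging structure matters here):

import torch

w = torch.randn(42)     # 42 weights, as in the text
xs = torch.randn(10)    # 10 data samples x0, ..., x9

def f(w, x):
    # The per-sample loss; the formula is made up, only its role matters
    return (w.sum() * x - 1.0)**2

def f_GD(w, xs):        # average of the 10 single losses
    return torch.stack([f(w, x) for x in xs]).mean()

def f_MB(w, xb):        # average of the single losses in one mini-batch
    return torch.stack([f(w, x) for x in xb]).mean()

def f_SGD(w, x):        # simply the single-sample loss
    return f(w, x)

print(f_GD(w, xs), f_MB(w, xs[:5]), f_SGD(w, xs[0]))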

4 Likes

When we do augmentation in batch_tfms, the augmentations are applied with some probability. Is it the same or different for each item in the batch?

Why don’t we use a “path” attribute in the DataBlock? I understand that we can later use it with another source etc. But why not use it when creating the DataBlock, and optionally leave it empty? Then we could call pets.summary() without a path or create dataloaders without a path (or with a path if we want it)?

Can you replace x1 ... x10 with x0 ... x9 so that the notation is consistent?

1 Like

Thanks for spotting that one - and for reading that far!

1 Like

Thanks for the reply! Jeremy mentioned that it’s possible to have only one block, or even three blocks. Can you give examples of what problems those would solve?

If you have 0 targets or 2 targets for instance.

A person in my study group asked this question, and I didn’t know where to put it (addressed to Jeremy and @sgugger) :

Has your opinion about Reinforcement Learning changed since the past year?

1 Like

One instance is where you could have multiple inputs to the model, like maybe an image and some metadata.
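
As a rough sketch of what the data side could look like (the DataFrame, the column names fname/age/label and the path are all hypothetical; n_inp=2 is what tells the DataBlock that the first two blocks are inputs and the last one is the target):

from fastai.vision.all import *

path = Path('data')   # hypothetical location of the images and labels CSV
# df = pd.read_csv(path/'labels.csv')   # one row per sample: fname, age, label

dblock = DataBlock(
    blocks=(ImageBlock, RegressionBlock, CategoryBlock),
    n_inp=2,   # first two blocks are inputs, the last one is the target
    getters=[ColReader('fname', pref=path/'images'),   # the image input
             ColReader('age'),                         # the metadata input
             ColReader('label')],                      # the category target
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(224))

# dls = dblock.dataloaders(df)

You’d still need a model that accepts both inputs (for example a custom module that concatenates the image features with the metadata before the final layers); the sketch above only handles the data side.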

1 Like

Thanks–this sounds exactly like functionality I’ve been looking to use! Do you happen to have any kernel or notebook showing basic implementation of image+metadata -> class?