Understanding what's going on (Part 1 Lesson 2)

Hi everyone,

I’m really enjoying the course so far, and my understanding is slowly increasing with time, but there’s a (huge?) conceptual ‘thing’ that I’m not quite getting, and I don’t even know how to Google the issue, so wanted to ask here.

If this is the wrong place to ask this question, please feel free to delete. Apologies in advance.


In lesson 2, Jeremy demonstrates a simple linear function (y = ax + b), and it’s easy to understand that with one value (x temperature), the other value (y ice cream sales) is predicted by tweaking the coefficients.

But, I can’t really conceptualise how it applies to more complex things like classifying images.

The neural network

What is literally happening? Is the neural network just conceptual, rather than there actually being some digitised form of a neural network? Or does it ‘exist’?

Conceptual vs. Real

The relationship between what is conceptually occurring (i.e., for the animal breed categoriser: “The network is learning to spot angles, corners, then shapes, then things like eyes and finally breeds”) and what is really/literally occurring is confusing me. For example, what do we mean when we say that it’s “finding corners”?

Is the process written down anywhere in a digestible form?

By this I mean: I’m not clear about how it begins, ends and what’s really happening in between. I’m not looking for anybody to spoon-feed this to me, rather just a pointer like “Go here, read this”.

My fudged understanding / guesstimate(?), which is almost certainly very wrong, is that:

  • The image is fed into the neural network’s inputs, pixel by pixel (so every pixel is a separate input)
  • The pixels are surely(?) grouped, so that they form a context(?), otherwise they’re just individual pixels from which no meaning can be derived
  • ???
  • Probabilities are calculated for the various labels applying to the image, and argmax is used for the most likely outcome(s)

I think what I’m not ‘getting’ is what is really saved/known at each layer in the network, and what is even ‘passed’ to each layer of the network.

If one of the earlier layers is ‘finding diagonal colours’ and another is ‘spotting corners’, what does that really mean? Is there a hidden internal process where ‘labels’ are created (like ‘corner’) and a group of pixels are read-in together and a pattern is spotted and ‘corner’ is ‘activated’ as a match, and then this is passed on to another layer?

If I’m able to understand what’s happening in more depth, I think the relationship between the math(s) and how I’m supposed to envisage what’s conceptually going on will make more sense.

Thanks, and apologies for such an elementary question.

Starting with the linear function, y = wx + b: why don’t we use some other function, such as a quadratic? My best guess is that the linear form is already well tested and that other choices may cause stability or convergence issues, but I am not sure.

Now imagine a hypothetical problem where you are given 2 red points and 2 yellow points. Place the red points on two opposite corners of a napkin and the yellow points on the other two corners. So now you have 1 point at each corner, with opposite corners having the same color. The napkin is your 2D surface and you want to train some network to classify these points.

With the basic y = wx + b alone we cannot accomplish this task, because no single straight line separates the two colors. And we don’t want to increase the complexity by using more than 1 line. So another way to approach this problem is to fold the napkin along the diagonal. Now you can draw a straight line that accomplishes our task. When we folded the napkin we actually introduced a non-linearity (which is what an activation function like ReLU does). If you open the napkin again you can see that the line you just drew is no longer of the form y = wx + b.

Give this thing a try to convince yourself it is possible.
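
If you would rather try it on a computer than with a real napkin, here is a rough sketch. It is only an illustration and makes a few assumptions that are not in the lesson: the corners sit at (±1, ±1), the fold runs along the diagonal through the two red corners, and the fold is written as |x - y| built from two ReLUs. The names relu, fold and classify are made up for this example.

```python
def relu(z):
    # ReLU keeps positive values and replaces negative values with 0.
    return max(0.0, z)

def fold(x, y):
    # Assumed fold: along the diagonal y = x (through the red corners).
    # relu(x - y) + relu(y - x) equals |x - y|, so points on either side
    # of the diagonal land on top of each other after the fold.
    return relu(x - y) + relu(y - x)

# Four corners of the napkin; opposite corners share a color.
points = {
    (1.0, 1.0): "red",
    (-1.0, -1.0): "red",
    (1.0, -1.0): "yellow",
    (-1.0, 1.0): "yellow",
}

def classify(x, y):
    # After folding, a single threshold (one straight cut in the folded
    # space) is enough to separate the colors.
    return "yellow" if fold(x, y) - 1.0 > 0 else "red"

predictions = {p: classify(*p) for p in points}
for p, label in points.items():
    print(p, "folded value:", fold(*p), "predicted:", predictions[p], "actual:", label)
```

Both red points fold to 0 and both yellow points fold to 2, so the threshold at 1 splits them. Here the fold and the threshold are chosen by hand; in training, the network has to find numbers like these on its own.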

Now coming to the real world. Let’s say we are given a dog image that we want to classify. Our basic building blocks are

  1. y = wx + b
  2. Some non-linearity like ReLU

So we want to draw some lines and fold some things, but we do not want to do these things manually. We want the computer to learn them on its own. And this is what we mean by neural network training.

Pass the image to the network as input. Now instead of telling the computer that we want to fold the napkin in half, we let the computer decide what it wants to do with the napkin. We use more than 1 filter, as there may be many ways to achieve what we want. Each filter transforms the input image into a new image. If a filter identifies horizontal lines, then the image you get after running that filter has its horizontal parts more pronounced than the rest. Similarly, each filter gives you a different output image. Then it is just a matter of stacking these images and repeating the process as many times as you want.
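
To make "a filter transforms the image into a new image" a bit more concrete, here is a small sketch in plain Python. The 5x5 image, the two 3x3 filters and the helper name apply_filter are all made up for illustration. The filter values are written by hand here, whereas in a trained network they are numbers the network learns.

```python
def apply_filter(image, kernel):
    # Slide the kernel over the image; at each position multiply the
    # overlapping numbers, add them up, and apply ReLU to the sum.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(max(0, total))  # ReLU
        out.append(row)
    return out

# Toy grayscale image: dark on top (0), bright below (1),
# so it contains a single horizontal edge.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

# Hand-written filters (assumed values, chosen for illustration).
horizontal_edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

horizontal_map = apply_filter(image, horizontal_edge)
vertical_map = apply_filter(image, vertical_edge)
print("horizontal filter output:", horizontal_map)
print("vertical filter output:  ", vertical_map)
```

The horizontal filter gives large values where the dark-to-bright edge is and zeros elsewhere, while the vertical filter gives zeros everywhere because this image has no vertical edge. Each output is just a new grid of numbers that the next layer takes as its input.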

Why did the network learn corners and edges at the start? Maybe it found them easy. But you can think of it as incremental learning, where you first learn simple things and then combine these simple things to create much more complex things.

Hi @kushaj – Thanks for taking the time to reply to me. I appreciate it.

Napkin analogy

This still doesn’t quite make sense to me (or how it relates to a linear problem like “Ice cream sales vs. Temperature”).

We start with this:

Then we create this:

But that’s still not an equivalent linear problem to ‘ice cream sales’, because we have one full dot, and two 0.5 dots. It’s still a 2D area with a total of two dots (1 × 1 and 0.5 × 2).

I feel like I’m on the brink of understanding this, but I’m still not quite getting what ‘happens’ when we say the network “learns” something.

There is a layer/filter (synonymous?) that transforms the image, and another layer that receives this transformed image and does something else to it (etc).

Then eventually this leads to the output of “X% probability of being Y label”?

Make the fold below the point, so that you end up with 3 points on one side and the last point on the other side. Now if you draw the line, the two red points will be on top and the yellow points will be below (one in front and one at the back). You can fold more to bring them onto the same side.

Also, points are 0 dimensional so you cannot split them in half.

The CNN part is only concerned with learning filters that transform the image into useful features. After the CNN part we use linear layers, which are responsible for using the features learned by the CNN to classify the image into class labels.
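
As a rough sketch of that last step (features in, "X% probability of being Y label" out), here is a linear layer followed by softmax and argmax in plain Python. Everything specific is an assumption for illustration: the three labels, the feature values and the weights are invented, and a real network would learn the weights and have far more features.

```python
import math

def linear(features, weights, biases):
    # One score per label: a weighted sum of the features plus a bias,
    # i.e. y = wx + b with several inputs and several outputs.
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    # Turn raw scores into positive numbers that sum to 1.
    # Subtracting the max first avoids overflow in exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["beagle", "pug", "tabby"]   # hypothetical class labels
features = [0.9, 0.1, 0.4]            # made-up features from the CNN part
weights = [                           # made-up weights, one row per label
    [2.0, -1.0, 0.5],
    [-1.0, 2.0, 0.0],
    [0.0, 0.5, 1.0],
]
biases = [0.0, 0.1, -0.2]             # made-up biases

scores = linear(features, weights, biases)
probs = softmax(scores)
best = max(range(len(labels)), key=lambda i: probs[i])  # argmax

for label, p in zip(labels, probs):
    print(f"{label}: {p:.1%}")
print("prediction:", labels[best])
```

Softmax turns the raw scores into probabilities, and argmax just picks the label with the highest one.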

If you plot your ice cream sales on the y-axis and temperature on the x-axis, then the neural network is essentially learning a function that can predict the sales given the temperature.
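
Here is a minimal sketch of what "learning a function" can look like for the ice cream example, using plain gradient descent on y = wx + b. The data is made up (generated from sales = 2 * temp + 1, with temperature in tens of degrees so the numbers stay small), and the learning rate and step count are arbitrary choices that happen to work for this toy data.

```python
# Made-up data: temperature in tens of degrees C, sales in arbitrary units.
temps = [1.5, 2.0, 2.5, 3.0, 3.5]
sales = [4.0, 5.0, 6.0, 7.0, 8.0]   # generated from sales = 2 * temp + 1

w, b = 0.0, 0.0      # start with a guess that is clearly wrong
lr = 0.05            # learning rate (assumed value)
n = len(temps)

for step in range(5000):
    # How far off is each prediction right now?
    errors = [(w * x + b) - y for x, y in zip(temps, sales)]
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, temps)) / n
    grad_b = 2 * sum(errors) / n
    # Nudge w and b a little in the direction that reduces the error.
    w -= lr * grad_w
    b -= lr * grad_b

print("learned w:", round(w, 3), "learned b:", round(b, 3))
print("predicted sales at temp 4.0:", round(w * 4.0 + b, 3))
```

w and b should end up close to 2 and 1, the values used to generate the data. "Learning" here is nothing more than repeating that small nudge many times; a neural network does the same thing with many more weights.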

Sorry for the slow response – I’ve had a busy few weeks.

Also thank you for your example :)

It’s all beginning to make more sense to me. I’ve paused Fast AI for now, and I’m reading Machine Learning is Fun! first, to give me a solid foundation.

I’ll then come back to the course.

Thanks for your help!
