Lesson 8 (2019) discussion & wiki

In the notebook 01_matmul we define near and test_near as:

def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)

From the NumPy docs, np.allclose is defined as:
absolute(a - b) <= (atol + rtol * absolute(b))
and there is a note:

The above equation is not symmetric in a and b, so that allclose(a, b) might be different from allclose(b, a) in some rare cases.

I think torch defines it the same way (the torch.allclose docs give the same formula).
Any insight on why it is not just defined as
absolute(a - b) <= atol?
Is rtol (relative tolerance) there to deal with numerical precision issues, or is there something more to it?
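One way to see why rtol is needed: floating-point round-off error scales with the magnitude of the values being compared, so a fixed atol alone would be far too strict for large numbers and says nothing useful about tiny ones. A small sketch using np.allclose (whose formula is quoted above):

```python
import numpy as np

a = np.array([1e6])
b = a + 0.1          # absolute difference 0.1, relative difference ~1e-7

# A fixed absolute tolerance alone rejects values that agree to 7 digits:
print(np.allclose(a, b, rtol=0.0, atol=1e-5))        # False
# rtol scales the tolerance with |b|, so the comparison passes:
print(np.allclose(a, b, rtol=1e-3, atol=1e-5))       # True
# For values near zero, rtol * |b| vanishes and atol takes over:
print(np.allclose(0.0, 1e-9, rtol=1e-3, atol=1e-5))  # True
```

So rtol handles "same up to relative precision" for values of any magnitude, while atol handles comparisons against values at or near zero, where a relative tolerance would be meaningless.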

Hi. I just started lesson 8. In the first notebook (00_exports.ipynb) I’ve got the following error: python3: can’t open file ‘notebook2script.py’: [Errno 2] No such file or directory
for this cell: !python notebook2script.py 00_exports.ipynb

I should mention that I use Colab to run the codes.

Are you able to see that file ‘notebook2script.py’ in your current directory?
More likely, you forgot these installations:

conda install nbconvert
conda install nbformat
conda install fire -c conda-forge

If you are running these from within Jupyter, then I do:
!conda install nbconvert --yes
and similarly for the other two.


@SirweSaeedi, I built this Jupyter notebook that will help you set up properly to run the course notebooks in Google Colab


Thank you. That’s great

Hi @lfrachon

Suppose x is an m \times 1 column vector. Let A be an n \times m matrix, and let y = Ax, which we see is an n \times 1 column vector.

The gradient of the vector y with respect to the vector x is a two dimensional matrix, and as @Jeremy points out in The Matrix Calculus You Need for Deep Learning, there are two possible definitions of the gradient:

In the numerator layout, the row number in the gradient matrix is the index of y and the column number is the index of x in the derivative. In this case \nabla y is an n \times m matrix, \nabla y = \nabla (Ax)=A.

Alternatively, in the denominator layout, the row number in the gradient matrix is the index of x in the derivative, and the column number is the index of y. In this case \nabla y is an m \times n matrix, and \nabla y = \nabla (Ax) = A^{T}.
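As a concrete check of the two conventions, here's a small sketch (NumPy, finite differences) confirming that the Jacobian of y = Ax is A in numerator layout and hence A^T in denominator layout:

```python
import numpy as np

n, m = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Build the numerator-layout Jacobian J[i, j] = dy_i / dx_j one column at a time
eps = 1e-6
J = np.zeros((n, m))
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    J[:, j] = (A @ (x + dx) - A @ x) / eps

print(np.allclose(J, A, atol=1e-4))   # numerator layout:   J   = A    (n x m)
print(np.allclose(J.T, A.T))          # denominator layout: J^T = A^T  (m x n)
```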


Thanks, it’s clear now. I think I was getting confused about which variable the gradients were being computed with respect to.

Can we use the absolute error caused by a weight, instead of the derivative of the error, to update the weights?
The weight update is as follows:

updated_weight := weight - learning_rate * derivative_of_error
W := W - lr * dC/dW

Since we know the error by the chain rule and also know the absolute cost, can we find how much error was propagated from a given neuron by

delta_error = cost / derivative_of_error
dW = C / (dC/dW)
and subtract it from the actual weight ?

For calculating the gradients of the linear layer

def lin_grad(inp, out, w, b):
   inp.g = out.g @ w.t()
   w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
   b.g = out.g.sum(0)

For calculating the gradients of the weights and biases why are we summing up in dimension (0)?

Hi @ahteniv
Not sure what you are trying to do, but your formula dW = C/(dC/dW) is incorrect, as it reduces to C = dC.

Hi,
I have just finished working on my article describing fixup initialization from a practitioner’s point of view. It may help you to understand this method. Here’s a link to the article. Enjoy!


Just started on lesson 8 (after a long break). Can someone who has seen ahead point me to the recommended way to write unit tests for this? Is it doctest or something else? I know that the video starts to talk about one custom function.

My bad. When I wrote the query, my understanding of backpropagation was very poor. Thanks for the reply anyway.


Hi @ashwin93961, welcome to the fast.ai forums!

The short answer is that in order to update the layer weights after a forward pass, for each one of a layer’s weights we need to add up the gradients that were computed with respect to each of the model inputs. We sum up in dimension 0 because this is the dimension that holds all the inputs.

More conceptually, in the simple case where we only pass one input through our model and then calculate the weight gradients of a layer with respect to that input, there’d be no need to sum along dim=0. We’d already have a single gradient that corresponds to each one of our layer’s weights (in the notebook’s example, these weights reside in a 784x50 weight matrix).

However, what if we want to calculate the cumulative gradient, for each of the weights in the layer’s weight matrix, after passing several inputs through our model? Indeed, this is something we’ll need to do when we use mini-batches to train a neural network.

In order to do this, we’ll need to keep track of the weight gradients with respect to each model input, for all inputs that we pass through our model. Recall that the MNIST dataset used in the notebook’s example has 50,000 inputs. If we pass all 50,000 inputs through our layer, we’ll eventually have 50,000 separate sets of 784x50 weight gradients. Each set contains the gradients of the layer’s weights with respect to a different input.

But given that we only have one set of 784x50 weights at that layer, how can we update these weights using the gradients found in all 50,000 different sets of weight gradients?

The way to do this is to sum up the weight gradients across all the 50,000 inputs. Since dim=0 is the dimension that stores model inputs, this is why we sum the weight gradients (or bias gradients) across dim=0.

Looking more deeply at the notebook’s example, let’s zoom in to the calculation of weight gradients for the first layer, l1 = inp @ w1 + b1, as defined in the forward_and_backward() function in the notebook’s next cell.

Recall that the layer’s weights, w1, have a shape of torch.Size([784, 50]), where 50 is the hidden layer size and 784 is the length of one MNIST image’s flattened vector.

Recall also that the MNIST inputs (inp) to the linear layer have a shape of torch.Size([50000, 784]). That’s one row of length 784 for each of the 50,000 MNIST images.

Now, the operation inp.unsqueeze(-1) adds an extra final dimension to the inputs, changing their shape from torch.Size([50000, 784]) to torch.Size([50000, 784, 1]).

Additionally, the operation out.g.unsqueeze(1) adds an extra dimension to out.g at the dim=1 axis. This changes the shape of the matrix containing the layer’s output gradients from torch.Size([50000, 50]) to torch.Size([50000, 1, 50]).

Multiplying these two tensors together (an elementwise product that broadcasts across the added dimensions) then results in a product of shape torch.Size([50000, 784, 50]). Indeed, we added the extra dimensions so that broadcasting would produce one outer product per input. The product contains all 50,000 sets of weight gradients – and each set is with respect to a different MNIST input.

By summing up along dim=0, we aggregate the weight gradients across all 50,000 inputs, which’ll give us a matrix of gradients of shape torch.Size([784, 50]) that our optimizer can then use to update the layer’s weights.
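To make the shapes concrete with a toy batch (tiny sizes standing in for 50,000 x 784 x 50), here's a sketch showing the broadcast-and-sum, and that it's equivalent to a single matrix multiply – the efficient form hinted at by the notebook's "giant outer product" comment:

```python
import torch

bs, n_in, n_out = 5, 7, 3          # tiny stand-ins for 50000, 784, 50
inp   = torch.randn(bs, n_in)      # layer inputs
out_g = torch.randn(bs, n_out)     # gradient of the loss w.r.t. the layer output

# (bs, n_in, 1) * (bs, 1, n_out) broadcasts to one outer product per input:
per_input = inp.unsqueeze(-1) * out_g.unsqueeze(1)   # (bs, n_in, n_out)
w_g = per_input.sum(0)                               # sum over inputs -> (n_in, n_out)

# The same aggregation as one matrix multiply, with no giant intermediate:
print(torch.allclose(w_g, inp.t() @ out_g, atol=1e-5))  # True
```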


@jamesd Wow! Thank you for the reply. Understand it now.
Also, great article on Weight Initialization.


About The fully connected Notebook:
Why is it so much better to refactor everything into classes rather than functions?
I feel that the refactoring is a lot of coding compared to the simplicity of the functions we first implemented.

There are a bunch of reasons for this, as you’ll see as you progress through the course:

  1. Sometimes you need to preserve the state of different variables when you’re performing certain operations. Having classes instead of functions makes doing so a lot less painful.
  2. Having one general class inherited by all related layers/modules enforces a consistent API. This, in turn, makes your code much more readable, understandable and less error-prone.
  3. Debugging becomes much easier.

There are a few more reasons which I’ve failed to mention, but I believe this is the crux of it.
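For a flavour of point 2, here's a rough sketch of the kind of base class the course notebooks build (the Module/Relu names mirror the notebook's, but treat this as illustrative): one class fixes the __call__/backward API and stashes state on self, and every layer then follows the same pattern:

```python
import torch

class Module():
    "Minimal base class: uniform __call__/backward API, state saved on self."
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)   # subclasses only implement forward/bwd
        return self.out
    def forward(self, *args): raise NotImplementedError
    def backward(self): self.bwd(self.out, *self.args)

class Relu(Module):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp > 0).float() * out.g

relu = Relu()
x = torch.tensor([-1., 2.])
out = relu(x)                  # tensor([0., 2.])
out.g = torch.ones_like(out)   # pretend gradient from the next layer
relu.backward()
print(x.g)                     # tensor([0., 1.])
```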

It’s been a while since @Brainkite asked this question, so I’m sure you’ve already discovered all of this by now.
Anyway, happy to help!


Indeed, I kept on to lessons 9 and 10, and I now fully understand the amount of state, parameters, and sometimes history of states that we have to handle and store in class objects in order to launch a simple training run without providing 27 parameters. It also makes some class types much more flexible, such as callbacks, hooks, …
Thanks anyway for the answer!


Hi, this is more of a Python question than it is deep learning, but it is from the lesson, so I hope I’m asking this in the right place:
I’m trying to understand class structures through the fully_connected notebook.
The example is:

class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out
    def backward(self):
        self.inp.g = 2. * (self.inp.squeeze() - self.targ).unsqueeze(-1) / self.targ.shape[0]

How does this code get away with self.inp.g? To understand this I wrote a simple class, but it returns an error. Here is the class I wrote:

class Foo():
    def __call__(self, a):
        self.a = a
        return self.a
    def example(self):
        self.a.g = 2*self.a
        return self.a.g

b = Foo()
b(5)
b.example()

When I run b.example(), I get:

AttributeError: 'int' object has no attribute 'g'

How do the Mse, Lin and Relu classes get away with this? They all have a similar situation.
Especially here:

class Lin():
    def __init__(self, w, b): self.w,self.b = w,b
        
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        # Creating a giant outer product, just to sum it, is inefficient!
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)

In the line before the last one, self.out.g is never assigned anywhere in the class.
How does this work?

My second question is: how can we call w1.g or w2.g like this? We are able to do:

model = Model(w1, b1, w2, b2)
model.backward()
test_near(w2g, w2.g)
test_near(b2g, b2.g)
test_near(w1g, w1.g)
test_near(b1g, b1.g)
test_near(ig, x_train.g)

How can we write w1.g like this? We used w1 as a variable to initialize our class with our instance (model).
If I implement this in my sample code, it would be like:

c=5
class Foo():
    def __init__(self, c):
        self.c = c
    def __call__(self, a):
        self.a = a
        return self.a
    def example(self):
        self.c.g = 3*self.c
        self.a.g = 2*self.a
        return self.a.g

b = Foo(c)
b(3)
b.example()
c.g

This also returns:

AttributeError: 'int' object has no attribute 'g'

Isn’t calling .g on w1 like calling .g on c here? How can we call an attribute on a variable that we used to initialize the class, like we did with w1?
Sorry for the long post.
Thanks for the answers…

The thing is, an int can’t have attributes assigned to it: Python’s built-in numeric types have no __dict__, so you can’t set arbitrary attributes on them. You should run the same code passing it a tensor instead, since tensors (like instances of most user-defined classes) do allow attributes to be assigned dynamically. I tried below and this worked well:

import torch

x = torch.randn([2,3])

class Foo():
    def __call__(self, a):
        self.a = a
        return self.a
    
    def example(self):
        self.a.g = 2*self.a
        return self.a.g

b = Foo()
b(x) #outputs correct tensor

b.example() #outputs correct tensor
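On the second question (why w1.g is visible outside the model): Python assignment binds names to objects rather than copying them, so self.w inside the class and w1 outside are the very same tensor object, and an attribute set through either name is visible through both. A small sketch (the Lin-like class here is just for illustration):

```python
import torch

w1 = torch.randn(3, 2)

class Lin():
    def __init__(self, w): self.w = w   # stores a reference, not a copy

lin = Lin(w1)
lin.w.g = torch.zeros(3, 2)   # assign an attribute through the reference

print(lin.w is w1)            # True: both names point at one tensor object
print(hasattr(w1, 'g'))       # True: so w1.g works outside the class too
```

This is also why your c example fails twice over: even if int allowed attributes, you'd still need self.c and c to refer to the same object, which they do here only because assignment never copies.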