Lesson 8 (2019) discussion & wiki

I solved this bug in a slightly different way (which was more readable, to me at least).
Rather than:

for i in range(ar):
    c[i] = (a[i,:,None]*b).sum(dim=0)

I made it explicit that we pick the ith row and then unsqueeze it:

for i in range(ar):
    c[i] = (a[i][:,None]*b).sum(0)

It just fits my intuition of broadcasting better; it might help someone who shares the same mental model.
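
A quick sanity check that the two spellings give the same result (just a sketch, with small random matrices standing in for a and b from the lesson’s matmul):

import torch

a, b = torch.randn(4, 5), torch.randn(5, 3)
ar = a.shape[0]
c1, c2 = torch.zeros(ar, b.shape[1]), torch.zeros(ar, b.shape[1])

for i in range(ar):
    c1[i] = (a[i,:,None]*b).sum(dim=0)   # index and unsqueeze in one step
    c2[i] = (a[i][:,None]*b).sum(0)      # pick the ith row first, then unsqueeze

print(torch.allclose(c1, c2), torch.allclose(c1, a@b))  # True True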

2 Likes

I finally understood how we went from the code in mse to the code in mse_grad, and gained some insights from it.

First, I replaced output with inp to match:

mse:           (inp.squeeze(-1) - targ).pow(2).mean()
mse_grad: 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

Then, since I didn’t know how to take the derivative of mean, I rewrote mean as sum/size, so now it’s easier to compare:

mse:           (inp.squeeze(-1) - targ).pow(2).sum()  / inp.shape[0]
mse_grad: 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

So once I take the derivative of mse, everything in the above code matches, except I’m missing the sum().

After some thinking I realized that since this function computes partial derivatives, the sum disappears: the derivative of every term except the one involving the specific input we’re differentiating with respect to is 0.
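
A quick way to convince yourself of this is to check the hand-derived gradient against autograd (just a sketch; I’m assuming inp has shape (batch, 1) and targ has shape (batch,), as in the notebook):

import torch

def mse(inp, targ): return (inp.squeeze(-1) - targ).pow(2).mean()
def mse_grad(inp, targ): return 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

inp = torch.randn(16, 1, requires_grad=True)
targ = torch.randn(16)
mse(inp, targ).backward()
print(torch.allclose(inp.grad, mse_grad(inp.detach(), targ)))  # True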

Two insights I gained from this:

  1. The batch size has a huge impact on the size of the gradient (in MSE): the larger the batch size, the smaller the gradient. (inp.shape[0] is the batch size above.)
  2. The network doesn’t care about the loss value at all; it only cares about the gradient of the loss function. The loss value is just a useful indicator for the person training the network.
12 Likes

Tunneling into a remote system behind a router requires opening ssh port(s) on the router. The serveo.net way doesn’t require opening ports on the router or firewall. However, there’s the security issue of relying on the protection of the jupyter login screen. The tunnel is encrypted so not even serveo.net can see communications. However, I’m not an expert on security. Another issue with using serveo, or any other cloud provider, is reliance on their service being robust. You can run your own private serveo server instead of using theirs.

Assuming a use case of a non-network savvy Jupyter Notebook user who just wants a simple way to access their system when away from the local network, serveo’s benefits are:

  1. Notebook users don’t have to install any software. They only need a browser and the login key.
  2. The server admin doesn’t have to open ports on the router, firewall or servers.
  3. If the ISP mucks with the router’s settings, such as doing a factory reset, the service continues to work. My ISP has pushed through updates that have hosed my router. I prefer using the default settings.
  4. The server “admin” doesn’t need to know how to port forward.
  5. If your infrastructure changes, such as adding or removing systems, you don’t have to change your network setup (no router port or port-forwarding changes).
  6. You can support many servers without thinking.
  7. You can use your own domain.
  8. ngrok is a great alternative but serveo has more free features.
  9. Works if you don’t have access to the router such as at a hotel or cafe.
2 Likes

I would strongly recommend forwarding your router’s ssh port to your pc’s ssh port rather than using serveo to make your jupyter port available directly. ssh’s security, whilst imperfect, is much better studied than jupyter’s, and the protocol is designed to be secure.

It’s not a really big problem to open a port to jupyter that’s just on an aws instance that has nothing private on it, and isn’t on a vpc with anything else - i.e. where you don’t mind too much if someone hacks that machine. But it’s a really big problem if someone gets shell access (which jupyter provides) to a machine on your home network.

I would very highly recommend not using serveo or anything similar to access a jupyter port. You could, however, use serveo to access an ssh port, if you wish.

(I founded and ran a fairly large email provider including handling much of the security for it for many years - so my advice here is based on a lot of research and experience.)

5 Likes

Right, perhaps the safest option is to use serveo to access an ssh port. I don’t quite know how that’s done or what the downsides are. Is there some software, considered safe, that provides a level of protection ahead of access to a jupyter login screen?

Another option is to use remote software such as Remote Desktop, VNC, Teamviewer.

A larger batch size leads to more accumulated gradients.
So I guess there should be some kind of “equilibrium” between more data points and a higher loss?
Please correct me if I am wrong. :slight_smile:

1 Like

Thank you for the suggestions everyone!

I spoke to my ISP but they require me to get a static/public IP (sorry if I’m mixing up two terms here) so that I can allow incoming traffic, so I’m forced to purchase the add-on :frowning:

Thank you! I ended up setting up noip; it was a hassle-free setup, just a few adjustments were required on my router (most routers, it turns out, have a DDNS option built in).

Sorry to steer the discussion off Lesson 8, but is it suggested to run just scripts on the machine instead?

Based on the suggestions above, I’m doing the following:

  • Set up SSH on my “box”.
  • Put my laptop’s public key in the ssh folder on the “box”
  • Enabled ssh services in Ubuntu
  • Enabled port forwarding of port 22 (internal) to port 22 (external) on my router
  • Set up noip as the DDNS

Now I ssh as follows:

ssh user@customdomainon.no.ip -L 8888:localhost:8888

I created a certificate on Jupyter and put a password on the notebooks.

Are there any further cautionary recommended steps?

I couldn’t understand Jeremy’s suggestion (I had never set up ssh before, please excuse the noob questions):

Is that referring to launching a nb locally?

No - you should use ssh to create an ssh tunnel for jupyter. That way only your laptop can access jupyter (or someone with your ssh key).

1 Like

Thanks for the clarification. :slight_smile:
Don’t know why '__call__' came as a '__custom__' def in my head, my bad.

Notes:

  • __call__ makes an instance of a class callable as a function.
class model:
    def __init__(self, arg1, arg2, arg3):
        # set up whatever state the instance needs
        self.args = (arg1, arg2, arg3)

    def __call__(self, arg1):
        # runs whenever the instance itself is called like a function
        return arg1

resnet = model(foo1, foo2, foo3)  # instance of the class

resnet(image)           # calling the instance like a function...
resnet.__call__(image)  # ...is the same as the line above

  • Here __init__ is basically a constructor: it is invoked once, right after __new__ creates the object. __call__, on the other hand, can be invoked on the object any number of times, which gives us the flexibility to change the instance’s state between calls.

Also, in a deep learning setting the state of an entity (for example, a layer) changes after each iteration. Some of the framework design decisions in PyTorch make a lot of sense now.
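
For example (a rough sketch; the real nn.Module.__call__ does more work than this, e.g. running hooks), this is why calling a PyTorch module and calling its forward give the same result:

import torch
from torch import nn

class Dummy(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 2)
    def forward(self, x):
        return self.lin(x)

m = Dummy()
x = torch.randn(3, 4)
# m(x) goes through nn.Module.__call__, which in turn calls m.forward(x)
print(torch.equal(m(x), m.forward(x)))  # True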

Some core concepts in python.

  1. Understanding ‘_’ in python.
  2. Is python a functional programming language ?
  3. Python Data Model.
3 Likes

Wait, now I’m a bit confused – does it really matter whether you use the training set or the validation set? It does seem like in practice the distributions should be very similar, but I thought the point you were trying to make in class was that both the training set and validation sets should be normalized in a consistent manner.

Perhaps it’s slightly suboptimal for training to normalize the training set using the validation set (you will no longer be guaranteed to be training on a set that has mean 0 and stdev 1), but it seems minor compared to normalizing the training and validation sets independently.
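
For reference, the consistent version looks something like this (a sketch with toy tensors standing in for the real data): compute the statistics once, from the training set, and apply those same statistics to both sets.

import torch

# toy stand-ins for x_train / x_valid
x_train, x_valid = torch.randn(50000, 784), torch.randn(10000, 784) + 0.1

def normalize(x, m, s): return (x - m) / s

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)  # same stats, not x_valid's own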

1 Like

It is minor - but a bug nonetheless :slight_smile:

Yeah, I get these mixed up too—I always have to go back and rederive things with explicit indices in the chain rule and then see if I can spot the matrix multiplications afterwards. Is there an easier way? I guess just memorize some matrix derivative identities? Probably time to work through the matrix calculus post :slight_smile:

As an example, for calculating the derivative of the loss with respect to weight W_{ij} from a linear layer y_{nj} = \sum_i x_{ni} W_{ij} + b_j (here n is the batch index), I’d have to do the following: W_{ij} affects the loss via y_{nj} for every n in the batch (but not via any of the other activations y_{nk} with k \neq j), so if we already know all those upstream derivatives \partial \mathcal{L}/\partial y_{nj} (aka out.g), then the chain rule gives

\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_n \frac{\partial \mathcal{L}}{\partial y_{nj}} \frac{\partial y_{nj}}{\partial W_{ij}}

Here \partial y_{nj}/ \partial W_{ij} = x_{ni}, so

\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_n \frac{\partial \mathcal{L}}{\partial y_{nj}} x_{ni}

and yeah, if I squint I can write that as the matrix multiplication x.t() @ out.g.

I actually kind of like all the explicit indices (otherwise I tend to forget things like the batch index), and I think it makes the chain rule a little easier to feel than with the matrix notation, but it’s definitely a little tedious…
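
If it helps, the index derivation above is easy to check numerically (a sketch; shapes assumed as n = batch, ni = input features, nj = output features, with an arbitrary scalar loss):

import torch

n, ni, nj = 8, 5, 3
x = torch.randn(n, ni)
W = torch.randn(ni, nj, requires_grad=True)
b = torch.randn(nj, requires_grad=True)

y = x @ W + b
loss = y.pow(2).mean()              # any scalar loss will do for the check
loss.backward()

out_g = 2 * y.detach() / y.numel()  # dL/dy for this particular loss
print(torch.allclose(W.grad, x.t() @ out_g))  # True: W.g == x.t() @ out.g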

5 Likes

Here is a short answer I can come up with: torch.randn gives you random numbers of the specified shape, drawn from a distribution with mean 0 and standard deviation 1. We want our inputs and activations to have mean 0 and standard deviation 1, not the weight matrix.
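
Something like this shows the point (a sketch; m and nh are just example layer sizes): after scaling, the weights themselves end up with a tiny standard deviation, but the activations come out with standard deviation roughly 1.

import math, torch

m, nh = 784, 50
x = torch.randn(10000, m)              # input: mean 0, std 1
w = torch.randn(m, nh) / math.sqrt(m)  # scaled weights: std is only ~1/sqrt(m)
a = x @ w
print(w.std(), a.mean(), a.std())      # weight std ~0.036; activation std ~1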

Hope that helps.

4 Likes

Edit: woops, didn’t search hard enough in the thread, @mkardas was way ahead of me :slight_smile:

Random thought about the shifted ReLU idea: the average value of a standard normal passed through a ReLU, E[\max(0, z)], is 1/\sqrt{2\pi} (about 0.4), so what about doing ReLU - 1/\sqrt{2 \pi} rather than ReLU - 0.5? Presumably that doesn’t make much difference, but it looks cool :sunglasses:

4 Likes

Thanks for clarifying Jeremy – I was trying to make sure I understood the issue, as opposed to criticizing whether the bug was worth fixing!

For whatever it’s worth, the shifted/renormalized ReLU interacts with Kaiming initialization in a strange way. Eq. 7 from the paper

Var[y_l] = n_l Var[W_l] E[x_l^2]

uses the fact that E[x_l^2] = \frac{1}{2}Var[y_{l-1}], when the activation is ReLU, to obtain their initialization scheme. But if we shift ReLU by c, then

\begin{align*} E[x_l^2] & = E[(y_{l-1}^+ + c)^2] \\ & = E[y_{l-1}^{+2}] + 2cE[y_{l-1}^+] + c^2 \\ & = \frac{1}{2}Var[y_{l-1}] + 2cE[y_{l-1}^+] + c^2. \end{align*}

When c = -0.5, for example, the extra terms come out to be 0.25 - E[y_{l-1}^+].

So it probably depends on how your input data is distributed, but if e.g. E[y_{l-1}^+] > 0.25, then the variances of your later layers are probably going to be less than those of your earlier layers, compared to if you had used plain ReLU. For example, as has been pointed out a few times in this thread, if we assume the y_l are N(0,1), then E[y_l^+] = \frac{1}{\sqrt{2\pi}} \approx 0.4.
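
A quick empirical check of those extra terms (a sketch, assuming the pre-activations really are N(0,1)):

import torch

y = torch.randn(1_000_000)
c = -0.5
x_plain   = torch.relu(y)      # plain ReLU: E[x^2] is about 0.5 = Var[y]/2
x_shifted = torch.relu(y) + c  # shifted ReLU

print(x_plain.pow(2).mean())    # ~0.50
print(x_shifted.pow(2).mean())  # ~0.5 + 2*c*E[y^+] + c^2, i.e. about 0.35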

4 Likes

When doing backpropagation, if you have a batch size of n, do you add the gradients across the batch or average them?

The intuition behind doing things in batches is that 1 input is not enough to know the right “direction”, so you take multiple inputs and assume that the (averaged) gradient will take the step in the right direction.

Over 1 epoch, you take lots of steps ( ~[number of inputs / batch size] steps), each in the direction of the averaged gradient of the current batch.

1 Like

Thanks @PierreO. I do get the logic, but if I’m not mistaken, in the notebook the sum is used, hence my question. I always thought we used the average of the gradients over each batch for the backprop.

I don’t think there’s any need to explicitly average gradients—you just calculate the loss, which is usually implicitly averaged across the batch, e.g. the “mean” in MSE. That happens in some of the notebooks when we divide by some shape[0] (the batch dimension).
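
Here’s a small check of that (a sketch): with a sum loss versus a mean loss, the gradients differ by exactly the batch size, which is where the division by shape[0] in the hand-written backward pass comes from.

import torch

pred = torch.randn(8, 1, requires_grad=True)
targ = torch.randn(8)

loss_sum = (pred.squeeze(-1) - targ).pow(2).sum()
g_sum, = torch.autograd.grad(loss_sum, pred)

loss_mean = (pred.squeeze(-1) - targ).pow(2).mean()
g_mean, = torch.autograd.grad(loss_mean, pred)

print(torch.allclose(g_sum / pred.shape[0], g_mean))  # True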

1 Like