So once I take the derivative of MSE, everything in the above code matches except that I'm missing the sum().
After some thinking I realized that since this function computes partial derivatives, the sum disappears: the derivative with respect to a specific input is 0 for every entry except that input, so only one term of the sum survives.
Two insights I gained from this:
The batch size has a huge impact on the size of the gradient (in MSE): the larger the batch size, the smaller the gradient. (inp.shape[0] is the batch size above.)
The network doesn't care about the loss value at all; it is only a useful indicator for the network operator. The network only cares about the gradient of the loss function.
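Here's a minimal sketch of what that looks like in code, in the style of the lesson notebook (gradients stored on .g); treat the exact names as illustrative rather than the notebook's exact code:

def mse(output, targ):
    # mean over the batch: note the implicit division by the batch size
    return (output.squeeze(-1) - targ).pow(2).mean()

def mse_grad(inp, targ):
    # d/d inp_i of (1/N) * sum_j (inp_j - targ_j)^2 = 2 * (inp_i - targ_i) / N
    # the sum drops out because only the j == i term depends on inp_i,
    # and N = inp.shape[0] (the batch size) scales the gradient down
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]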
Tunneling into a remote system behind a router normally requires opening ssh port(s) on the router. The serveo.net approach doesn't require opening ports on the router or firewall. However, there's the security issue of relying on the protection of the Jupyter login screen. The tunnel is encrypted, so not even serveo.net can see the communications; that said, I'm not an expert on security. Another issue with using serveo, or any other cloud provider, is reliance on their service being robust. You can run your own private serveo server instead of using theirs.
Assuming a use case of a non-network-savvy Jupyter Notebook user who just wants a simple way to access their system when away from the local network, serveo's benefits are:
Notebook users don't have to install any software. They only need a browser and the login key.
The server admin doesn't have to open ports on the router, firewall, or servers.
If the ISP mucks with the router's settings, such as doing a factory reset, the service continues to work. My ISP has pushed through updates that have hosed my router, so I prefer sticking with the default settings.
The server "admin" doesn't need to know how to port forward.
If your infrastructure changes, such as adding or removing systems, you don't have to change your network setup (no new router ports to open or port-forwarding rules to change).
You can support many servers without thinking.
You can use your own domain.
ngrok is a great alternative but serveo has more free features.
It works even if you don't have access to the router, such as at a hotel or cafe.
I would strongly recommend forwarding your router's ssh port to your pc's ssh port over using serveo to make your jupyter port available directly. ssh's security, whilst imperfect, is much better studied than jupyter's, and the protocol is designed to be secure.
It's not a really big problem to open a port to jupyter on, say, an aws instance that has nothing private on it and isn't on a vpc with anything else - i.e. where you don't mind too much if someone hacks that machine. But it's a really big problem if someone gets shell access (which jupyter provides) to a machine on your home network.
I would very strongly recommend not using serveo or anything similar to access a jupyter port. You could, however, use serveo to access an ssh port, if you wish.
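In case it helps, serveo's documented TCP forwarding looks roughly like the sketch below; myalias, user, and the port numbers are placeholders, so double-check against serveo's own docs:

# on the home machine: forward its ssh port out through serveo
ssh -R myalias:22:localhost:22 serveo.net

# on the laptop: jump through serveo to the home machine's ssh,
# then tunnel jupyter's port over that ssh connection
ssh -J serveo.net -L 8888:localhost:8888 user@myalias

With that, jupyter is only reachable through the ssh tunnel rather than being exposed directly.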
(I founded and ran a fairly large email provider including handling much of the security for it for many years - so my advice here is based on a lot of research and experience.)
Right, perhaps the safest option is to use serveo to access an ssh port. I don't quite know how that's done or what the downsides are. Is there some software, considered safe, that provides a level of protection ahead of access to a jupyter login screen?
Another option is to use remote-access software such as Remote Desktop, VNC, or TeamViewer.
A larger batch size leads to more accumulated gradients.
So I guess there should be some kind of "equilibrium" between more data points and a higher loss?
Please correct me if I am wrong.
I spoke to my ISP, but they require me to get a static/public IP (sorry if I'm mixing two terms here) so that I can allow incoming traffic, so I'm forced to purchase the add-on.
Thank you! I ended up setting up noip. It was a hassle-free setup; only a few adjustments were required on my router (most routers, it turns out, have a built-in DDNS option).
Sorry to steer the discussion off Lesson 8, but is it suggested to run just scripts on the machine instead?
Based on the suggestions above, I'm doing the following (with a sketch of the laptop-side command after the list):
Set up SSH on my "box".
Put my laptop's public key in the ssh folder on the "box".
Enabled the ssh service in Ubuntu.
Enabled port forwarding of port 22 (internal) to port 22 (external) on my router.
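With that in place, the laptop side is roughly the sketch below (the hostname and user are placeholders; substitute your router's public IP or a DDNS name):

ssh -L 8888:localhost:8888 user@my-home.example.com
# then browse to http://localhost:8888 on the laptop;
# only ssh is exposed to the internet, never jupyter itself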
Thanks for the clarification.
Don't know why '__call__' came across as some custom def in my head, my bad.
Notes:
__call__ makes an instance of a class callable as a function.
class model:
    def __init__(self, arg1, arg2, arg3):
        # store whatever configuration the model needs
        self.args = (arg1, arg2, arg3)
    def __call__(self, arg1):
        # the "forward" work happens here
        return arg1

resnet = model(foo1, foo2, foo3)  # instance of the class
resnet(image)                     # calling the instance as a function
resnet.__call__(image)            # same as the line above
Here __init__ is basically the initializer: it is invoked once, right after __new__ creates the object. __call__, on the other hand, can be invoked on the instance any number of times, which gives us the flexibility to change its state between calls.
Also, in a deep learning setting the state of an entity (for example, a layer) changes after each iteration. Some of the framework design decisions in PyTorch make a lot of sense now.
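That design is visible directly in PyTorch: nn.Module implements __call__, which runs any registered hooks and then dispatches to the forward you define, so every layer instance is callable. A tiny sketch:

import torch
from torch import nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 2)   # state that persists across calls
    def forward(self, x):
        return self.lin(x)

m = Tiny()
out = m(torch.randn(3, 4))   # goes through nn.Module.__call__, which calls forward()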
Wait, now I'm a bit confused: does it really matter whether you use the training set or the validation set? In practice the distributions should be very similar, but I thought the point you were trying to make in class was that the training and validation sets should be normalized in a consistent manner.
Perhaps it's slightly suboptimal to normalize the training set using the validation set's statistics (you're no longer guaranteed to be training on a set with mean 0 and stdev 1), but that seems minor compared to normalizing the training and validation sets independently.
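In code, the "consistent manner" just means computing the statistics once, from the training set, and reusing them for both sets. A minimal sketch (the tensors here are stand-ins for the notebook's data):

import torch

def normalize(x, m, s):
    return (x - m) / s

# stand-ins for the notebook's training and validation tensors
x_train = torch.randn(5000, 784) * 0.3 + 0.1
x_valid = torch.randn(1000, 784) * 0.3 + 0.1

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)   # same (training) stats for both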
Yeah, I get these mixed up too. I always have to go back and rederive things with explicit indices in the chain rule and then see if I can spot the matrix multiplications afterwards. Is there an easier way? I guess just memorize some matrix derivative identities? Probably time to work through the matrix calculus post.
As an example, for calculating the derivative of the loss with respect to weight W_{ij} from a linear layer y_{nj} = \sum_i x_{ni} W_{ij} + b_j (here n is the batch index), I'd have to do the following: W_{ij} affects the loss via y_{nj} for every n in the batch (but not via any of the other activations with an index other than j), so if we already know all those upstream derivatives \partial \mathcal{L}/\partial y_{nj} (aka out.g), then the chain rule gives
\partial \mathcal{L}/\partial W_{ij} = \sum_n (\partial \mathcal{L}/\partial y_{nj}) \, x_{ni}
and yeah, if I squint I can write that as a matrix multiplication as x.t() @ out.g.
I actually kind of like all the explicit indices (otherwise I tend to forget things like the batch index), and I think it makes the chain rule a little easier to feel than with the matrix notation, but it's definitely a little tedious…
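For what it's worth, that index derivation maps directly onto the backward pass of a linear layer; a sketch in the notebook's .g style (the exact names are illustrative):

def lin_grad(inp, out, w, b):
    # dL/dx_{ni} = sum_j out.g_{nj} * W_{ij}
    inp.g = out.g @ w.t()
    # dL/dW_{ij} = sum_n x_{ni} * out.g_{nj}  -- the x.t() @ out.g above
    w.g = inp.t() @ out.g
    # dL/db_j = sum_n out.g_{nj}
    b.g = out.g.sum(0)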
Here is a short answer I can come up with: torch.randn gives you random numbers, in the specified shape, whose mean is 0 and standard deviation is 1. We want our inputs and activations to have mean 0 and standard deviation 1, not the weight matrix.
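A quick way to see why the scaling goes on the weights rather than the data (a rough check, not from the notebook):

import torch

x = torch.randn(10000, 784)   # inputs already have mean ~0, std ~1
w = torch.randn(784, 50)      # unit-std weights
print((x @ w).std())          # ~sqrt(784) ~ 28: the activations blow up,
                              # which is why the init scales w, not x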
Edit: whoops, didn't search hard enough in the thread, @mkardas was way ahead of me.
Random thought about the shifted ReLU idea: the average value of a standard normal truncated to be positive is 1/\sqrt{2\pi} (about 0.4), so what about doing ReLU - 1/\sqrt{2\pi} rather than ReLU - 0.5? Lol, presumably that doesn't do anything much different, but it looks cool.
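A quick numerical sanity check of that constant (not from the lesson, just an illustration):

import torch, math

y = torch.randn(1_000_000)
print(y.clamp_min(0.).mean())                              # ~0.3989 = 1/sqrt(2*pi)
print((y.clamp_min(0.) - 1/math.sqrt(2*math.pi)).mean())   # ~0: the shifted ReLU is centered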
For whatever it's worth, the shifted/renormalized ReLU interacts with Kaiming initialization in a strange way. Eq. 7 from the paper
Var[y_l] = n_l \, Var[W_l] \, E[x_l^2]
uses the fact that E[x_l^2] = \frac{1}{2}Var[y_{l-1}] when the activation is ReLU to obtain their initialization scheme. But if we shift ReLU by c, i.e. x_l = \max(y_{l-1}, 0) + c = y_{l-1}^+ + c, then
E[x_l^2] = \frac{1}{2}Var[y_{l-1}] + 2c \, E[y_{l-1}^+] + c^2.
When c = -0.5, for example, the extra terms come out to be 0.25 - E[y_{l-1}^+].
So it probably depends on how your input data is distributed, but if e.g. E[y_l^+] > 0.25, then the variances of your later layers are probably going to be smaller than those of your earlier layers, compared to if you had used plain ReLU. For example, as it's been pointed out a few times in this thread, if we assume the y_l are N(0,1), then E[y_l^+] = \frac{1}{\sqrt{2\pi}} \approx 0.4.
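A rough empirical check of that expansion, assuming unit-variance pre-activations:

import torch, math

y = torch.randn(1_000_000)                 # stand-in pre-activations ~ N(0, 1)
for c in (0.0, -0.5, -1/math.sqrt(2*math.pi)):
    x = y.clamp_min(0.) + c                # shifted ReLU
    # compare against 0.5*Var[y] + 2*c*E[y^+] + c^2 from the expansion above
    print(c, (x * x).mean().item())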
The intuition behind doing things in batches is that 1 input is not enough to know the right "direction", so you take multiple inputs and assume that the (averaged) gradient will take a step in the right direction.
Over 1 epoch, you take lots of steps ( ~[number of inputs / batch size] steps), each in the direction of the averaged gradient of the current batch.
Thanks @PierreO. I do get the logic, but if I am not mistaken, the sum is used in the notebook, hence my question. I always thought we use the average of the gradients for each batch in the backprop.
I don't think there's any need to explicitly average the gradients: you just calculate the loss, which is usually implicitly averaged across the batch, e.g. the "mean" in MSE. That happens in some of the notebooks when we divide by some shape[0] (the batch dimension).
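A small check that the .mean() in the loss already does the averaging (an illustration, not notebook code):

import torch

pred = torch.randn(64, requires_grad=True)
targ = torch.randn(64)

loss = (pred - targ).pow(2).mean()   # mean over the batch of 64
loss.backward()
# pred.grad equals 2*(pred - targ)/64: each per-example gradient is already
# scaled by 1/batch_size, so no separate averaging step is needed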