Lesson 8 (2019) discussion & wiki

ELU flattens out at -1 instead of zero for arguments x < 0, in part to address the issue with the mean.
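For reference, a minimal sketch of ELU (with the usual alpha = 1):

```python
import torch

def elu(x, alpha=1.0):
    # identity for x > 0; alpha * (exp(x) - 1) for x <= 0,
    # so it saturates at -alpha (i.e. -1) instead of clamping to 0 like ReLU
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

print(elu(torch.tensor([-5., -1., 0., 2.])))
# tensor([-0.9933, -0.6321,  0.0000,  2.0000])
```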

2 Likes

We need a non-linearity because composing two linear functions just gives another linear function, and linear functions can’t tell cats from dogs. ReLU is a fast non-linearity.
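A quick way to see the first point (shapes here are arbitrary): two matmuls with no non-linearity in between collapse into a single matmul.

```python
import torch

x = torch.randn(5, 10)
w1, w2 = torch.randn(10, 20), torch.randn(20, 3)

two_layers = (x @ w1) @ w2   # two "linear layers" back to back...
one_layer  = x @ (w1 @ w2)   # ...equal one linear layer with weight w1 @ w2
print(torch.allclose(two_layers, one_layer, atol=1e-4))   # True
```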

5 Likes

I understand why the weights might vanish (the standard deviation decreases with each layer due to the non-linearity), but why might they explode?

1 Like

Loving the class… learned about two topics that I had been ignoring in code (quick sketch below):

  1. what “[..., ]” is
  2. what c[:,None] is
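In case it helps anyone else, here is roughly what those two do (shapes are just examples):

```python
import torch

t = torch.randn(2, 3, 4)
# `...` (Ellipsis) means "all the remaining dimensions", so t[..., 0]
# is the same as t[:, :, 0] for this 3-d tensor
print(t[..., 0].shape)    # torch.Size([2, 3])

c = torch.randn(3)
# indexing with None inserts a new axis of length 1 (like unsqueeze)
print(c[:, None].shape)   # torch.Size([3, 1])
```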
1 Like

If the scale of your activations is 1.5 after the first layer and its activation function, it’ll be 1.5**2 after the second, and quickly gets out of hand.
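A quick way to see it (the sizes and the 1.5 factor are just illustrative):

```python
import math, torch

x = torch.randn(1024, 512)
for i in range(1, 51):
    # each weight matrix is scaled so one matmul multiplies the
    # activation std by roughly 1.5 -> the scale compounds per layer
    w = torch.randn(512, 512) * (1.5 / math.sqrt(512))
    x = x @ w
    if i % 10 == 0:
        print(i, x.std().item())   # grows roughly like 1.5**i
```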

6 Likes

@PierreO, technically, for the Einstein summation convention such situations are simply poorly defined, and are likely errors.
In any case, no summation is implied if the index appears in the same dimension.
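For anyone following the einsum discussion, the standard well-defined case looks like this (a plain matmul written in einsum notation; shapes are arbitrary):

```python
import torch

a, b = torch.randn(4, 5), torch.randn(5, 3)

# k appears in both operands but not in the output, so it is summed over;
# i and j each appear once and index the output
c = torch.einsum('ik,kj->ij', a, b)
print(torch.allclose(c, a @ b, atol=1e-6))   # True
```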

Why does PyTorch transpose and we don’t? Is that just a stylistic choice for how we want to treat rows vs columns?

3 Likes

It might be linked to some kind of optimization, or maybe it just made more sense to them when they wrote that layer.
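For what it’s worth, nn.Linear stores its weight as (out_features, in_features) and applies y = x @ W.T + b, while the lesson’s plain-matmul layer keeps the weight as (in, out) and skips the transpose. A tiny check (sizes arbitrary):

```python
import torch
from torch import nn

lin = nn.Linear(10, 3)
x = torch.randn(5, 10)
print(lin.weight.shape)   # torch.Size([3, 10]) -- stored as (out_features, in_features)

# nn.Linear computes x @ weight.t() + bias; the lesson-style layer would
# instead store a (10, 3) weight and compute x @ w + b directly
manual = x @ lin.weight.t() + lin.bias
print(torch.allclose(lin(x), manual, atol=1e-6))   # True
```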

1 Like

Why couldn’t you just set a cutoff for the initialization instead of implementing all this math? I.e., all of w1 must be between 0.001 and 0.01 (and then randomly draw from that range with a generic Gaussian distribution).

Because if you use too tiny a std, your weights will vanish across the network and all become zeros. It’s a tricky business: they have to be neither too big nor too small.
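For contrast, a rough sketch of the scale the lesson builds up to (Kaiming/He init for ReLU layers, std = sqrt(2 / fan_in)); the sizes are arbitrary:

```python
import math, torch

def relu(x): return x.clamp_min(0.)

nin = 512
x = torch.randn(1024, nin)
for _ in range(50):
    # std = sqrt(2 / fan_in) roughly preserves the activation scale
    # through matmul + ReLU, so it neither vanishes nor explodes
    w = torch.randn(nin, nin) * math.sqrt(2 / nin)
    x = relu(x @ w)

print(x.std().item())   # stays on the order of 1 even after 50 layers
```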

6 Likes

Does Terence know LaTeX, or is all that LaTeX copy-pasted from Wikipedia? :joy:

5 Likes

I think you just gave me the solution to a problem I’ve been having with one of my networks…!

I am using a single embedding layer, but for 16 inputs. I’m guessing the scales aren’t working out well because of it.

1 Like

@rachel to help people get started…

I can confirm that the Docker image works well. https://forums.fast.ai/t/docker-image-of-fastai-jupyter-cuda10-py3-7/40081

To update, log in via bash, then git pull, reinstall, and git clone the part 3 code. I have provided the commands there.

3 Likes

He made bookish, a tool to convert Markdown to LaTeX.

2 Likes

Would be cleaner if Jeremy used y=f(x) and y=g(x), just to avoid any confusion about composing functions.

He didn’t use f and f afterwards though :wink:

2 Likes

I love the course! I really hope that you will expand it into some advanced topics like randomized linear algebra and its linkage to convex/non-convex optimization.

2 Likes

Sorry, why don’t his function defs have a “return ____” at the end?
Is it a Python default that functions return the result of the last operation?

No: a function without a return statement just returns None in Python. These backward functions store the gradients as a side effect, and that’s all we need.

1 Like

it’s assigned to a layer via a mutation (class.g = whatever)
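Right, something in the spirit of the lesson’s manual backward pass (my sketch, not the exact notebook code): the gradient of a linear layer is written onto the tensors as a .g attribute instead of being returned.

```python
import torch

def lin_grad(inp, out, w, b):
    # backward of out = inp @ w + b: nothing is returned; the gradients
    # are stashed on the tensors themselves as a `.g` attribute (a mutation)
    inp.g = out.g @ w.t()
    w.g   = inp.t() @ out.g
    b.g   = out.g.sum(0)
```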

2 Likes