ELU flattens out at -1 instead of zero for x < 0, in part to address the issue with the mean (the negative outputs pull the mean of the activations back toward zero).
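For reference, the standard ELU formula is identity for positive inputs and alpha*(exp(x) - 1) otherwise, which saturates at -alpha (so -1 with the default alpha=1). A minimal NumPy sketch:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha*(exp(x) - 1) for x <= 0,
    # which flattens out at -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-100.0, -1.0, 0.0, 2.0])))
# the first entry is essentially -1: the curve has saturated
```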
We need a non-linearity because two linear functions in a row is just another linear function. And linear functions don't recognize cats from dogs. ReLU is a fast non-linearity.
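You can see the collapse directly: two matrix multiplies compose into a single matrix multiply, while inserting a ReLU between them breaks that collapse. A small NumPy demo:

```python
import numpy as np

# Two linear maps f(x) = A x and g(x) = B x compose into one: g(f(x)) = (B A) x.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, -1.0], [2.0, 0.0]])
x = np.array([1.0, -2.0])

composed = B @ (A @ x)   # two "layers"
single = (B @ A) @ x     # one equivalent layer
print(np.allclose(composed, single))  # True: the stack collapsed

# A ReLU in between makes the composition genuinely non-linear:
relu = lambda v: np.maximum(v, 0.0)
with_relu = B @ relu(A @ x)
print(np.allclose(with_relu, single))  # False
```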
I understand why the weights might vanish (the standard deviation decreases layer by layer due to the non-linearity), but why might they explode?
loving class… learned about two topics that I was ignoring in code:
- what is [..., ]
- what is c[:,None]
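Both of these are standard NumPy indexing tricks (PyTorch tensors accept the same syntax). A quick sketch of what each one does:

```python
import numpy as np

c = np.arange(6).reshape(2, 3)

# `...` (Ellipsis) means "all remaining axes": for a 2-D array,
# c[..., 0] is the same as c[:, 0] (the first column).
print(c[..., 0])

# `None` (np.newaxis) inserts a new axis of length 1:
# c[:, None] turns shape (2, 3) into (2, 1, 3), handy for broadcasting.
print(c[:, None].shape)
```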
If the scale of your weights is 1.5 after the first layer and activation, it'll be 1.5**2 at the second, and quickly get out of hand.
@PierreO, technically for Einstein summation convention, such situations are simply poorly defined, and likely are errors.
In any case, no summation is implied if the index appears in the same dimension.
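For anyone following along, the usual convention is that an index repeated across operands and absent from the output is summed over; a repeated index within a single operand picks out the diagonal. A NumPy illustration (the exact case discussed above may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3))
b = rng.random((3, 4))

# In 'ik,kj->ij', the repeated index k does not appear in the output,
# so it is summed over: ordinary matrix multiplication.
out = np.einsum('ik,kj->ij', a, b)
print(np.allclose(out, a @ b))  # True

# A repeated index within one operand means the diagonal:
m = np.arange(9).reshape(3, 3)
print(np.einsum('ii->', m))  # trace of m
```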
Why does PyTorch transpose and we donāt? Is that just a stylistic choice for how we want to treat rows vs columns?
It might be linked to some kind of optimization, or maybe it just made more sense to them when they wrote that layer.
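For context: PyTorch's nn.Linear stores its weight with shape (out_features, in_features) and computes x @ W.T + b, whereas the course code stores the weight as (in_features, out_features) and computes x @ W directly. A NumPy sketch showing the two layouts are mathematically identical:

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f = 3, 2

W_pt = rng.standard_normal((out_f, in_f))  # PyTorch-style layout: (out, in)
W_course = W_pt.T                          # course-style layout: (in, out)
x = rng.standard_normal((5, in_f))         # a batch of 5 inputs

y_pytorch_style = x @ W_pt.T   # what nn.Linear effectively computes (no bias here)
y_course_style = x @ W_course  # same result, no transpose needed
print(np.allclose(y_pytorch_style, y_course_style))  # True
```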
Why couldn't you just set a cutoff for the initialization instead of implementing all this math? I.e., all w1 must be between 0.001 and 0.01 (and then randomly draw from that range with a generic Gaussian distribution).
Because if you use too tiny a std, then your weights will vanish across the network and all become zeros. It's a tricky business: you have to be neither too big nor too small.
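Here is a small NumPy experiment (layer count and width are arbitrary choices) showing both failure modes: a fixed tiny std makes activations vanish, a large one makes them explode, and something like 1/sqrt(n) keeps them roughly stable across depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def depth_scale(std, layers=50, n=512):
    # Push a random vector through `layers` linear layers whose weights
    # are drawn from N(0, std^2); return the final mean activation size.
    a = rng.standard_normal(n)
    for _ in range(layers):
        a = (rng.standard_normal((n, n)) * std) @ a
    return np.abs(a).mean()

print(depth_scale(0.001))             # far too small: activations vanish toward 0
print(depth_scale(0.2))               # too big: activations explode
print(depth_scale(1 / np.sqrt(512)))  # ~1/sqrt(n): stays roughly order 1
```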
Does Terence know LaTeX, or is all that LaTeX copy-pasted from Wikipedia?
I think you just gave me the solution to a problem I've been having with one of my networks…!
I am using a single embedding layer, but for 16 inputs. I'm guessing the scales aren't working out well because of it.
@rachel to help people get started…
I can confirm that the Docker image works well. Docker image of fastai + Jupyter: CUDA10, py3.7.
To update, log in via bash, then git pull, reinstall, and git clone the part 3 code. I have provided the commands there.
He made Bookish, a tool to convert Markdown to LaTeX.
Would be cleaner if Jeremy used y=f(x) and y=g(x), just to avoid any confusion about composing functions.
He didn't use f and f afterward, though.
I love the course! I really hope that you will expand this course into some advanced topics like randomized linear algebra and its linkage to convex/non-convex optimization.
Sorry, why do his function defs not have "return ____" at the end?
Is it a Python default that functions return the result of the last operation?
It's storing the gradients, that's all we need.
it's assigned to a layer via a mutation (class.g = whatever)
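To be explicit: Python functions without a return statement return None (not the last expression), and these backward functions work purely by side effect. A toy sketch in that spirit (the Tensor class and relu_backward here are illustrative, not the course's exact code):

```python
class Tensor:
    def __init__(self, data):
        self.data = data
        self.g = None  # gradient slot, filled in later by a backward pass

def relu_backward(inp, out_grad):
    # No return statement: the result is written onto inp.g (a mutation),
    # so the function itself returns None.
    inp.g = [g if d > 0 else 0.0 for d, g in zip(inp.data, out_grad)]

t = Tensor([-1.0, 2.0])
result = relu_backward(t, [0.5, 0.5])
print(result)  # None -- nothing was returned
print(t.g)     # the gradient was stored on the tensor instead
```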