Lesson 8 - Official topic

giacomov · May 6, 2020, 3:15am

So the truncated backprop is truncated every batch size?

sgugger · May 6, 2020, 3:15am

It’s a one-layer recurrent model.

erlapi · May 6, 2020, 3:18am

How many parameters does an RNN end up having if we really only have one layer repeated multiple times?
Are we changing the parameters on the same layer at each loop, or creating a layer for each loop?

bibsian · May 6, 2020, 3:21am

So does self.h represent the one layer? Or is it self.h_h?

jwuphysics · May 6, 2020, 3:22am

self.h represents the hidden state of the RNN. self.h_h is the one (linear) layer.

sgugger · May 6, 2020, 3:22am

self.h is the hidden state, it’s not a layer. self.h_h is the layer (h_h stands for hidden to hidden).
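
A rough sketch of why looping over the same layer adds no parameters. It assumes the model wraps h_h with an input embedding and an output layer; the names i_h and h_o and the sizes used here are assumptions for illustration, not something stated in this thread.

```python
def count_rnn_params(vocab_sz, n_hidden):
    """Count weights in a one-layer RNN sketch.

    The number of loop iterations is deliberately not an argument:
    the same h_h weights are reused at every step.
    """
    i_h = vocab_sz * n_hidden              # assumed input embedding table
    h_h = n_hidden * n_hidden + n_hidden   # hidden-to-hidden linear layer: weights + bias
    h_o = n_hidden * vocab_sz + vocab_sz   # assumed output linear layer: weights + bias
    return {"i_h": i_h, "h_h": h_h, "h_o": h_o, "total": i_h + h_h + h_o}


# Hypothetical sizes, chosen only for the example.
counts = count_rnn_params(vocab_sz=30, n_hidden=64)
print(counts)
```

The hidden state self.h is not counted at all, since it holds activations rather than learned weights.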

radikubwa · May 6, 2020, 3:24am

You could run learn.summary() to figure this out. https://docs.fast.ai/basic_train.html#model_summary -> check this out too.

Nonnormalizable · May 6, 2020, 3:25am

Would you determine good values for things like n_hidden and n_layers through a standard hyperparameter grid search?

bostonsparky · May 6, 2020, 3:25am

I also got this

victor.vargas · May 6, 2020, 3:28am

The floating point reference to Rachel’s course on linear algebra was lots of fun too. Would love an updated version of that as well.

jwuphysics · May 6, 2020, 3:28am

Could we somehow use regularization to try to make the RNN parameters close to the identity matrix? Or would that cause bad results because the hidden layers want to deviate from the identity during training (and thus tend to explode/vanish)?

pinaki · May 6, 2020, 3:29am

Is there a way to quickly check whether the activations are disappearing/exploding?

rachel · May 6, 2020, 3:29am

Floating point is discussed starting around minute 54 of this video from the computational linear algebra course:

victor.vargas · May 6, 2020, 3:31am

Thanks Rachel!!!

FraPochetti · May 6, 2020, 3:31am

Check this out: The colorful dimension

radikubwa · May 6, 2020, 3:31am

Yes, but that could take quite some time. If you have the compute, go for it. Even random search works. I find Bayesian optimization a bit better. Or you could look into this method, adaptive resampling, and this notebook.

jcatanza · May 6, 2020, 3:35am

How exploding/vanishing gradients work: [Image: improving by 1% every day for a year, 1.01^365 = 37.8 versus 0.99^365 = 0.03]
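
The same compounding happens to any value that is multiplied by a roughly fixed factor at every step. A minimal sketch, with the factors 1.01 and 0.99 and the 365 steps chosen purely as an illustration:

```python
def compound(factor, steps):
    """Multiply a starting value of 1.0 by `factor` once per step."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value


grow = compound(1.01, 365)    # slightly above 1: blows up over many steps
shrink = compound(0.99, 365)  # slightly below 1: decays toward zero
print(f"grow={grow:.2f} shrink={shrink:.4f}")
```

In an RNN the repeated factor is a matrix rather than a scalar, but the intuition carries over: small deviations from 1 compound over many steps.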

Albertotono · May 6, 2020, 3:36am

ActivationStats
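
As a rough idea of what to look for, a per-layer summary of the mean, the standard deviation, and the share of near-zero values is often enough to spot trouble. This is a plain-Python sketch and not fastai's ActivationStats API; the function name and the eps threshold are made up for illustration.

```python
from statistics import fmean, pstdev


def summarize_activations(acts, eps=1e-3):
    """Summarize one layer's activations (a flat list of floats)."""
    near_zero = sum(abs(a) < eps for a in acts) / len(acts)
    return {
        "mean": fmean(acts),      # average activation
        "std": pstdev(acts),      # population standard deviation
        "near_zero": near_zero,   # fraction of activations with |a| < eps
    }


# Toy activations, not taken from a real model.
stats = summarize_activations([0.0, 0.0, 1.0, -1.0])
print(stats)
```

A standard deviation that shrinks toward zero from layer to layer (or step to step) hints at vanishing activations; one that keeps growing hints at exploding ones.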

ilovescience · May 6, 2020, 3:36am

Original dropout paper here

matdmiller · May 6, 2020, 3:37am

Does dropout somehow skip the computation or just set the activation to zero?