Lesson 9 Discussion & Wiki (2019)

Yes, I guess it is somehow related to legacy image formats and libraries that used BGR conventions. So probably OpenCV had it for better compatibility with the rest of the tools. However, now the situation is the opposite :smile:

1 Like

Good init. Gradient clipping will only save you from exploding gradients. Good init will save you from exploding activations, vanishing activations, exploding gradients and vanishing gradients.

14 Likes
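A quick toy sketch of that point (my own illustration, not the lesson notebook): push data through a deep stack of linear + ReLU layers and watch the final activation std with a small fixed scaling, a large fixed scaling, and Kaiming init.

```python
import torch

x = torch.randn(512, 100)

def final_act_std(scale=None):
    a = x
    for _ in range(50):                         # 50 linear + ReLU layers
        # Kaiming init uses std = sqrt(2 / fan_in); otherwise a fixed scale
        w = torch.randn(100, 100) * ((2 / 100) ** 0.5 if scale is None else scale)
        a = torch.relu(a @ w)
    return a.std().item()

print(final_act_std(scale=0.01))   # activations collapse towards 0
print(final_act_std(scale=1.0))    # activations blow up to inf/nan
print(final_act_std())             # Kaiming keeps the std roughly stable
```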

That makes sense to me. So why is anyone using sqrt(5) anywhere? Seems like a mistake.

That’s the whole point of this first part. It was a mistake :wink:

2 Likes

It seems like PyTorch uses a 1./math.sqrt(self.hidden_size) initialization. Why not Xavier initialization for the sigmoid activations?

Not sure if this was their motivation, but I imagine that an advantage of the uniform distribution is that it is bounded; if you use a Gaussian, you have a small probability of very large values for a few of your initial parameters… and in really big models with millions of parameters, at least a few of them will be disproportionately large in magnitude.

14 Likes
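A toy illustration of that bounded-vs-long-tails point (the specific numbers are just for the sketch): with the same std, a few million Gaussian draws will contain values several standard deviations out, while a uniform draw is hard-bounded.

```python
import torch

n, std = 10_000_000, 0.1
gauss = torch.randn(n) * std
bound = std * 3 ** 0.5                     # uniform(-b, b) has std b / sqrt(3)
unif  = torch.empty(n).uniform_(-bound, bound)

print(gauss.abs().max())   # typically ~0.5, i.e. 5+ standard deviations
print(unif.abs().max())    # never exceeds ~0.173
```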

So would this method work for multiclass predictions?

Truncated Gaussian?

1 Like

How did Jeremy figure out that indexing thing? I’m just curious if there was a process or if that’s something he was just born with.

5 Likes

Sigmoid or softmax is only the final activation of your neural net. Nowadays, it’s only ReLUs inside.

2 Likes
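For concreteness, a minimal sketch of what "only ReLUs inside" looks like in PyTorch (layer sizes here are arbitrary): ReLU between the hidden layers, and the softmax only at the very end, folded into the loss.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),              # raw logits out, no activation here
)
loss_func = nn.CrossEntropyLoss()    # applies log-softmax + NLL internally
```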

I remember how I used a Gaussian distribution for one of the parameters when running a random search for my classifier. And, every once in a while, I was getting values outside of the parameter's domain (like, greater than 1 for a probability) due to these long tails :smile: So some truncation definitely makes sense.
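A rough sketch of what such a truncation could look like (my own simple redraw-based version, nothing official): sample from a normal distribution and redraw anything beyond a couple of standard deviations, so no single value starts out disproportionately large.

```python
import torch

def trunc_normal(size, std=0.02, limit=2.0):
    t = torch.randn(size) * std
    bad = t.abs() > limit * std
    while bad.any():                              # redraw out-of-range values
        t[bad] = torch.randn(int(bad.sum())) * std
        bad = t.abs() > limit * std
    return t

w = trunc_normal((256, 256))
print(w.abs().max())   # never more than 2 std from zero
```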

Maybe it's a bias issue? Would we expect our values to be Gaussian?

Has anyone used the swish activation?

I did get some good results, but I ran very few tests.

Some more tests here: https://aclweb.org/anthology/D18-1472
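For reference, swish is just x * sigmoid(x); a minimal PyTorch sketch (the paper's beta-scaled variant is omitted here):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

# drop-in replacement for nn.ReLU()
layer = nn.Sequential(nn.Linear(128, 128), Swish())
out = layer(torch.randn(32, 128))
```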

But with a GRU or LSTM, each gate has a tanh or sigmoid activation, doesn't it?

2 Likes
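For reference, here is a rough sketch of a single LSTM step showing where those activations sit (parameter names are my own and biases are omitted): the three gates use sigmoid, while the candidate cell state and the output use tanh.

```python
import torch

def lstm_step(x, h, c, W_i, W_f, W_o, W_g):
    z = torch.cat([x, h], dim=1)
    i = torch.sigmoid(z @ W_i)   # input gate
    f = torch.sigmoid(z @ W_f)   # forget gate
    o = torch.sigmoid(z @ W_o)   # output gate
    g = torch.tanh(z @ W_g)      # candidate cell state
    c = f * c + i * g            # new cell state
    h = o * torch.tanh(c)        # new hidden state
    return h, c
```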

Neat trick!

Oh sorry, I hadn’t followed. This would probably need someone to compute the exact math, yes.

Haha I can only imagine that someone’s model definitely kept giving them garbage and they spent a long time figuring out what happened before they decided that they needed this trick. :laughing:

1 Like

In most papers I've seen using GRUs, a default 0.01 scaling is applied to each of the randomly initialized weights, but I am wondering if it may make sense to try out Xavier initialization…

4 Likes
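If you want to try it, a quick sketch of applying Xavier init to a GRU's weight matrices instead of a flat 0.01 scaling (note that PyTorch stores each layer's three gate matrices concatenated, so this treats them as one):

```python
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2)

for name, p in gru.named_parameters():
    if 'weight' in name:
        nn.init.xavier_uniform_(p)   # replaces the default scaling
    elif 'bias' in name:
        nn.init.zeros_(p)
```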

Hi…
Can anyone re-explain logsumexp?
Is it used on top of log-softmax?

About the max trick for the NLL: what about using the max trick to avoid exploding values without adding the max back? What is the benefit of adding it back?
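For reference, a small sketch of the logsumexp / max trick being asked about here (my own illustration): log-softmax is x - logsumexp(x), and the max is subtracted only for numerical stability and then added back so the value is unchanged. Dropping the add-back gives a different number, not just a more stable one.

```python
import torch

x = torch.tensor([1000., 1001., 1002.])   # exp() of these overflows in float32

naive  = x.exp().sum().log()               # inf
m      = x.max()
stable = m + (x - m).exp().sum().log()     # ~1002.41, the true logsumexp
no_add = (x - m).exp().sum().log()         # ~0.41, a different quantity

log_softmax = x - stable                   # matches F.log_softmax(x, dim=0)
print(naive, stable, no_add, log_softmax)
```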