Lesson 9 Discussion & Wiki (2019)

Yes, I guess it is somehow related to legacy image formats and libraries that used BGR conventions. So probably OpenCV had it for better compatibility with the rest of the tools. However, now the situation is the opposite :smile:

1 Like

Good init. Gradient clipping will only save you from exploding gradients. Good init will save you from exploding activations, vanishing activations, exploding gradients and vanishing gradients.

14 Likes
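A quick toy sketch of that point (my own illustration, not the lesson notebook): push data through a deep stack of linear + ReLU layers and watch the final activation std with a small fixed scaling, a large fixed scaling, and Kaiming init.

```python
import torch

x = torch.randn(512, 100)

def final_act_std(scale=None):
    a = x
    for _ in range(50):                         # 50 linear + ReLU layers
        # Kaiming init uses std = sqrt(2 / fan_in); otherwise a fixed scale
        w = torch.randn(100, 100) * ((2 / 100) ** 0.5 if scale is None else scale)
        a = torch.relu(a @ w)
    return a.std().item()

print(final_act_std(scale=0.01))   # activations collapse towards 0
print(final_act_std(scale=1.0))    # activations blow up to inf/nan
print(final_act_std())             # Kaiming keeps the std roughly stable
```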

That makes sense to me. So why is anyone using sqrt(5) anywhere? Seems like a mistake.

That’s the whole point of this first part. It was a mistake :wink:

2 Likes

It seems like PyTorch uses a 1./math.sqrt(self.hidden_size) initialization. Why not Xavier initialization for the sigmoid activations?

Not sure if this was their motivation, but I imagine that an advantage of the uniform distribution is that it is bounded; if you use a Gaussian, you have a small probability of very large values for a few of your initial parameters… and in really big models with millions of parameters, at least a few of them will be disproportionately large in magnitude.

14 Likes
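A toy illustration of that bounded-vs-long-tails point (the specific numbers are just for the sketch): with the same std, a few million Gaussian draws will contain values several standard deviations out, while a uniform draw is hard-bounded.

```python
import torch

n, std = 10_000_000, 0.1
gauss = torch.randn(n) * std
bound = std * 3 ** 0.5                     # uniform(-b, b) has std b / sqrt(3)
unif  = torch.empty(n).uniform_(-bound, bound)

print(gauss.abs().max())   # typically ~0.5, i.e. 5+ standard deviations
print(unif.abs().max())    # never exceeds ~0.173
```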

So would this method work for multiclass predictions?

Truncated Gaussian?

1 Like

How did Jeremy figure out that indexing thing? I’m just curious if there was a process or if that’s something he was just born with.

5 Likes

Sigmoid or softmax is only the final activation of your neural net. Nowadays, it’s only ReLUs inside.

2 Likes
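For concreteness, a minimal sketch of what "only ReLUs inside" looks like in PyTorch (layer sizes here are arbitrary): ReLU between the hidden layers, and the softmax only at the very end, folded into the loss.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),              # raw logits out, no activation here
)
loss_func = nn.CrossEntropyLoss()    # applies log-softmax + NLL internally
```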

I remember how I used a Gaussian distribution for one of the parameters when running a random search for my classifier. And, every once in a while, I was getting values outside of the parameter's domain (like, greater than 1 for a probability) due to these long tails :smile: So some truncation definitely makes sense.
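A rough sketch of what such a truncation could look like (my own simple redraw-based version, nothing official): sample from a normal distribution and redraw anything beyond a couple of standard deviations, so no single value starts out disproportionately large.

```python
import torch

def trunc_normal(size, std=0.02, limit=2.0):
    t = torch.randn(size) * std
    bad = t.abs() > limit * std
    while bad.any():                              # redraw out-of-range values
        t[bad] = torch.randn(int(bad.sum())) * std
        bad = t.abs() > limit * std
    return t

w = trunc_normal((256, 256))
print(w.abs().max())   # never more than 2 std from zero
```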

Maybe it's a bias issue? Would we expect our values to be Gaussian?

Has anyone used the swish activation?

I did get some good results, but I ran very few tests.

Some more tests here: https://aclweb.org/anthology/D18-1472
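For reference, swish is just x * sigmoid(x); a minimal PyTorch sketch (the paper's beta-scaled variant is omitted here):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

# drop-in replacement for nn.ReLU()
layer = nn.Sequential(nn.Linear(128, 128), Swish())
out = layer(torch.randn(32, 128))
```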

But with a GRU or LSTM, each gate has a tanh or sigmoid activation, doesn't it?

2 Likes
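For reference, here is a rough sketch of a single LSTM step showing where those activations sit (parameter names are my own and biases are omitted): the three gates use sigmoid, while the candidate cell state and the output use tanh.

```python
import torch

def lstm_step(x, h, c, W_i, W_f, W_o, W_g):
    z = torch.cat([x, h], dim=1)
    i = torch.sigmoid(z @ W_i)   # input gate
    f = torch.sigmoid(z @ W_f)   # forget gate
    o = torch.sigmoid(z @ W_o)   # output gate
    g = torch.tanh(z @ W_g)      # candidate cell state
    c = f * c + i * g            # new cell state
    h = o * torch.tanh(c)        # new hidden state
    return h, c
```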

Neat trick!

Oh sorry, I hadn’t followed. This would probably need someone to compute the exact math, yes.

Haha I can only imagine that someone’s model definitely kept giving them garbage and they spent a long time figuring out what happened before they decided that they needed this trick. :laughing:

1 Like

In most papers I've seen using GRUs, a default 0.01 scaling is applied to each of the randomly initialized weights, but I am wondering if it may make sense to try out Xavier initialization…

4 Likes
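If you want to try it, a quick sketch of applying Xavier init to a GRU's weight matrices instead of a flat 0.01 scaling (note that PyTorch stores each layer's three gate matrices concatenated, so this treats them as one):

```python
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2)

for name, p in gru.named_parameters():
    if 'weight' in name:
        nn.init.xavier_uniform_(p)   # replaces the default scaling
    elif 'bias' in name:
        nn.init.zeros_(p)
```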

Hi…
Can anyone re-explain logsumexp?
Is it used on top of log-softmax?

About the max trick for the NLL: what about using the max trick to avoid exploding values without adding the max back? What is the benefit of adding it back?
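For reference, a small sketch of the logsumexp / max trick being asked about here (my own illustration): log-softmax is x - logsumexp(x), and the max is subtracted only for numerical stability and then added back so the value is unchanged. Dropping the add-back gives a different number, not just a more stable one.

```python
import torch

x = torch.tensor([1000., 1001., 1002.])   # exp() of these overflows in float32

naive  = x.exp().sum().log()               # inf
m      = x.max()
stable = m + (x - m).exp().sum().log()     # ~1002.41, the true logsumexp
no_add = (x - m).exp().sum().log()         # ~0.41, a different quantity

log_softmax = x - stable                   # matches F.log_softmax(x, dim=0)
print(naive, stable, no_add, log_softmax)
```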