Self-Normalizing Neural Networks

Posting this in Part 1 since I’m still only at lesson 2. But that seems fitting, as the lesson mentions the ReLU activation function. Anyway, it seems like this might be a better deep learning approach to some problems:

Self-Normalizing Neural Networks
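For anyone who hasn’t opened the paper yet, the activation it proposes is a scaled ELU, with two fixed constants derived so that activations are pushed toward zero mean and unit variance:

```latex
\mathrm{selu}(x) = \lambda
\begin{cases}
x & \text{if } x > 0 \\
\alpha \left( e^{x} - 1 \right) & \text{if } x \le 0
\end{cases},
\qquad \lambda \approx 1.0507, \quad \alpha \approx 1.6733
```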


Code at:-

If things pan out, this is gonna be huge. =)

Yes, I saw @jeremy had retweeted the paper. Interested to see what he’s got to say about it.

It is truly amazing how fast the field of neural networks is advancing… what a great time to be alive, and how lucky we are to be able to participate in this A.I. renaissance! I can’t wait to try out SELUs in my architectures!

I tried it out in a UNet and didn’t notice any difference at all. I simply got rid of BN+ReLU and followed every convolution with a SELU unit, except the final layer.
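For anyone wanting to try the same swap, here’s a minimal sketch of what I mean (the block names are made up; the only real change is dropping BatchNorm and using nn.SELU in place of nn.ReLU):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # the original pattern: conv -> BN -> ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv_selu(in_ch, out_ch):
    # the SNN-style replacement: conv -> SELU, no BatchNorm
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.SELU(inplace=True),
    )
```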

TBH - I didn’t read the entire 2893092390-page paper, but I think I recall there being something about weight initialization? I’m currently just using PyTorch’s default conv weight initialization. Has anyone else had a different experience?

PS - I tried it out in another architecture, and it caused my gradients, and consequently my loss, to explode to +inf.

EDIT: Hey guys, take a look at this: They keep BN in there, after the SELU. I’m about to add it back into my model to see what happens.

EDIT2: Adding BN did not help. In fact, it made things worse (the loss graph is a lot more jumpy between batches). I went back and re-read the first few pages of the paper. SNNs do not use BN at all, so I’m a bit confused by the repo above. It also looks like they present a very simple weight initialization scheme at the bottom of page 3, though I’m not sure about its weight of importance (pun intended), as the authors themselves say, “Of course, during learning these assumptions on the weight vector will be violated.” But they immediately counter with, “However, we can prove the self-normalizing property even for weight vectors that are not normalized, therefore, the self-normalizing property can be kept during learning and weight changes,” so maybe it’s important after all. More experimentation needed…

The paper also uses a different form of dropout. @haresenpai are you using that?

UNet only uses 5% dropout and in my dataset I only have ~1300 images, so I removed dropout entirely because I wasn’t afraid of overfitting.

I went ahead and fixed my convolutional weight initialization so that all the biases are set to 0, just like in the paper, and all the weights are drawn from a normal distribution with mean = 0 and stdev = sqrt(1 / (kernel_x_size * kernel_y_size * in_channels)). This unequivocally caused my network output to explode. I tried setting stdev to 0.5 / sqrt(…), which completely fixed the explosion problem. What I’ve learned so far:

  1. Use selu, lol.
  2. Your initial network input must have near-0 mean, and unit variance (divide by std).
  3. Biases=0, and Conv+FC weights (both of them) should have the above mentioned initialization.
  4. If using dropout, you have to use their version of it. I don’t have that working in PyTorch yet, so I opted not to use it in the interim.
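In code, the init from point 3 looks roughly like this (a sketch; `snn_init_` is my own helper name, not a PyTorch function):

```python
import math
import torch
import torch.nn as nn

def snn_init_(module):
    # Paper-style init: biases = 0, weights ~ N(0, 1/fan_in).
    # For a conv layer, fan_in = kernel_x * kernel_y * in_channels.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        fan_in = module.weight[0].numel()
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(1.0 / fan_in))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# apply to a whole model with: model.apply(snn_init_)
```

(Swap `math.sqrt(1.0 / fan_in)` for `0.5 * math.sqrt(1.0 / fan_in)` to reproduce the 0.5 / sqrt(…) variant that stopped my explosions.)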

I’m uncertain why my gradients explode when I use 1.0 / instead of 0.5 /. I think it has something to do with the fact that UNet has skip connections rather than being a vanilla FFN. To test this, I added residual connections into the UNet for those layers with the same filter / channel size. Even with 0.5 instead of 1.0, the gradients still exploded.
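Concretely, the residual connections I added looked something like this (a sketch, only valid for blocks where the channel count and spatial size are unchanged):

```python
import torch
import torch.nn as nn

class ResidualSELUBlock(nn.Module):
    # conv -> SELU with an identity skip; requires that the conv
    # preserves both channel count and spatial dimensions
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SELU()

    def forward(self, x):
        return self.act(self.conv(x)) + x
```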

To me, it seems that self normalizing neural networks are very particular about the network input being standardized, and that the subsequent input to each layer must remain that way (or very close to it) as well. If for some weird reason that gets violated, things get messy. I’ll be interested to see how people with more advanced mathematical + theoretical understanding than me take the ideas from the original paper and provided source code, and then apply them to the various architectures that we’ve seen developed over the past year.

Have you had any more luck with this? Would be very interesting if it works well.

Unfortunately, no.

I have only a few days left in the current Kaggle competition, so I haven’t pursued it wholeheartedly. If anyone else is giving it a shot, my recommendation is to start simple with a small, basic feed-forward net with a small kernel and small images, and go forward from there. Trying it out on a UNet variant with inception blocks right off the bat probably wasn’t the brightest idea on my part.
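Something like this is what I have in mind as a starting point (a sketch; the helper name is mine). A plain feed-forward stack is the setting where the paper’s self-normalizing guarantees actually apply:

```python
import torch
import torch.nn as nn

def snn_mlp(d_in, d_hidden, d_out, depth=3):
    # small, vanilla feed-forward net: Linear -> SELU blocks,
    # with a plain Linear output layer (no activation) at the end
    layers = []
    for i in range(depth):
        layers += [nn.Linear(d_in if i == 0 else d_hidden, d_hidden), nn.SELU()]
    layers.append(nn.Linear(d_hidden, d_out))
    return nn.Sequential(*layers)
```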

After the competition is over, I plan on spending more time revisiting SELU.

I wonder if this would be a good solution to the sensitive init problem:

I think this has already been implemented in Keras.

I’ve spent a day or so testing SELU as a replacement for ELU in a relatively simple feed-forward network based around separable convolutions (similar to MobileNets), and I could not get the architecture to converge at all. I encountered similar issues with PELU in the past, so this may say more about the topology than the activation function, but at least it seems that SELU isn’t a silver bullet for helping all CNNs train faster & cheaper (which, in fairness, I don’t believe the authors ever claimed; the community certainly did/does have its hopes up about SELU, though). I’ll keep experimenting to see if I made any implementation mistakes, but there seem to be enough people trying & failing with SELU on various non-trivial CNN architectures that it may be worthwhile to consider SNNs as a holistic, alternative architecture, as opposed to treating SELU as a function you can drop into any arbitrary architecture. That’s definitely how I’m starting to think about it now…

If you’re on the master branch of PyTorch, it’s available there too: both SELU and Alpha Dropout (1D only), though no weight inits.
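Usage is straightforward, something like this (assuming a build recent enough to have both modules):

```python
import torch
import torch.nn as nn

# SELU + Alpha Dropout as drop-in modules from PyTorch master
net = nn.Sequential(
    nn.Linear(32, 64),
    nn.SELU(),
    nn.AlphaDropout(p=0.05),  # the SNN paper's dropout variant
    nn.Linear(64, 10),
)

x = torch.randn(8, 32)
out = net(x)  # shape (8, 10)
```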

I would also be interested to know if anyone had significant good results.


In my understanding, the self-normalizing property is preserved for feed-forward neural networks (this is what the paper claims). For CNNs it doesn’t work, because the input is normalized globally but not locally. To benefit from SNNs you need locally normalized inputs (try whitening). Hope this helps.
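As a rough illustration of “locally normalized” (and only a crude stand-in for real whitening, which would also decorrelate the channels), per-channel standardization looks like:

```python
import torch

def standardize_per_channel(x, eps=1e-5):
    # x: (N, C, H, W). Zero mean / unit variance per channel across
    # the batch; full ZCA whitening would additionally remove
    # cross-channel correlations, which this does not.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    std = x.std(dim=(0, 2, 3), keepdim=True)
    return (x - mean) / (std + eps)
```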