I tried it out in a UNet and didn't really notice any difference. I simply got rid of BN+ReLU and followed every convolution with a SELU unit, except for the final layer.
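For anyone who wants to try the same swap, here is a minimal sketch of what I mean, not my actual UNet. The helper name `conv_selu`, the channel counts, and the 3x3 kernels are made up for illustration; the only real point is that each conv is followed by `nn.SELU()` with no BN, and the final conv has no activation.

```python
import torch
import torch.nn as nn


def conv_selu(in_ch, out_ch):
    # Hypothetical helper: a 3x3 conv followed directly by SELU.
    # No BatchNorm and no ReLU, which is the swap described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.SELU(),
    )


# Toy stack standing in for a UNet stage (channel sizes are arbitrary).
model = nn.Sequential(
    conv_selu(3, 16),
    conv_selu(16, 16),
    nn.Conv2d(16, 1, kernel_size=1),  # final layer: left without SELU
)

x = torch.randn(2, 3, 32, 32)
y = model(x)
print(y.shape)
```

The padding keeps the spatial size unchanged, so this drops into an encoder/decoder stage the same way a conv+BN+ReLU block would.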
TBH - I didn't read the entire 2893092390 pg paper, but I think I recall there being something about weight initialization? I'm currently just using the default PyTorch conv weight initialization mechanism. Has anyone else had any other experiences?
PS - I tried it out in another architecture, and it caused my gradients, and consequently my loss, to explode to +inf.
EDIT: Hey guys, take a look at this: https://github.com/shaohua0116/Activation-Visualization-Histogram/blob/master/ops.py They keep BN in there, after the SELU. I'm about to add it back into my model to see what happens.
EDIT2: Adding BN did not help. In fact, it made things worse (the loss curve is a lot jumpier between batches). I went back and re-read the first few pages of the paper. SNNs do not use BN at all, so I'm a bit confused by the repo above. It also looks like they present a very simple weight initialization scheme at the bottom of page 3, though I'm not sure how much weight (pun intended) to give it, as the authors even say, "Of course, during learning these assumptions on the weight vector will be violated." But they immediately counter with, "However, we can prove the self-normalizing property even for weight vectors that are not normalized, therefore, the self-normalizing property can be kept during learning and weight changes," so maybe it's important after all. More experimentation needed...
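As far as I can tell, the initialization in the paper amounts to drawing weights from a zero-mean Gaussian with variance 1/n, where n is the fan-in (often called LeCun normal). Here is a rough sketch of how that could be applied in PyTorch instead of the default conv init. The function name `lecun_normal_` is my own, and zeroing the biases is my assumption rather than something I checked in the paper. I'm using `kaiming_normal_` with `nonlinearity="linear"` because that sets the gain to 1, which gives std = 1/sqrt(fan_in).

```python
import math

import torch
import torch.nn as nn


def lecun_normal_(module):
    # Hypothetical helper: zero-mean Gaussian weights with variance 1/fan_in.
    # With mode="fan_in" and nonlinearity="linear" the gain is 1, so
    # kaiming_normal_ samples with std = 1 / sqrt(fan_in).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="linear")
        if module.bias is not None:
            # Assumption: biases start at zero.
            nn.init.zeros_(module.bias)


torch.manual_seed(0)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv.apply(lecun_normal_)

fan_in = 64 * 3 * 3  # in_channels * kernel_height * kernel_width
expected_std = 1.0 / math.sqrt(fan_in)
print(conv.weight.std().item(), expected_std)
```

On a full model, `model.apply(lecun_normal_)` would hit every conv and linear layer. No idea yet whether this fixes the exploding loss, but it's a cheap thing to rule out.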