Shifted ReLU (-0.5)

As Jeremy suggested in last night’s lecture, I tried subclassing nn.ReLU and adding a constant -0.5 shift to it. I plugged it into a modified single-channel, shrunken-down ResNet I’ve been using for my Magic Sudoku app (the input is basically identical in format to MNIST).

import torch.nn as nn
import torch.nn.functional as F

class shiftedReLU(nn.ReLU):
    def forward(self, input):
        # equivalent to F.relu(input) - 0.5 (nn.ReLU supplies threshold=0 and value=0 via nn.Threshold)
        return F.threshold(input, self.threshold, self.value, self.inplace) - 0.5

I pulled the ResNet and BasicBlock classes from torchvision's ResNet implementation and replaced nn.ReLU with shiftedReLU as defined above.
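For reference, here’s roughly what that swap looks like if you reuse torchvision’s classes instead of copying them (a minimal sketch, not my exact code; ShiftedBasicBlock and shifted_resnet18 are just illustrative names, and the single-channel conv1 is for the MNIST-style input):

import torch.nn as nn
from torchvision.models.resnet import BasicBlock, ResNet

class ShiftedBasicBlock(BasicBlock):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # swap the block's stock ReLU for the shifted version defined above
        self.relu = shiftedReLU(inplace=True)

def shifted_resnet18(num_classes=10):
    model = ResNet(ShiftedBasicBlock, [2, 2, 2, 2], num_classes=num_classes)
    # single-channel input (MNIST-like) instead of RGB
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return model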

It seems to have dramatically decreased the training loss but didn’t do much for validation loss or accuracy. The biggest difference is in the first epoch, where the shifted ReLU did much worse.

Here’s the output of fit_one_cycle from a few weeks ago with the stock ReLU:

epoch     train_loss     valid_loss     error_rate     accuracy
    1       0.186894       0.175872       0.058128     0.941872
    2       0.086348       0.087768       0.028518     0.971482
    3       0.053646       0.106015       0.031677     0.968323
    4       0.046601       0.049357       0.017113     0.982887
    5       0.037949       0.027546       0.009331     0.990669
    6       0.033589       0.018919       0.006171     0.993828
    7       0.018287       0.020058       0.006578     0.993422
    8       0.013359       0.014329       0.004357     0.995643
    9       0.010926       0.015629       0.004546     0.995454
   10       0.009849       0.012810       0.003776     0.996224
   11       0.003071       0.010585       0.002900     0.997100
   12       0.002972       0.011666       0.002914     0.997086
   13       0.000561       0.011291       0.002368     0.997632
   14       0.000085       0.012187       0.002249     0.997751
   15       0.000177       0.012452       0.002235     0.997765

And here’s the output from this afternoon after modifying ReLU:

epoch     train_loss     valid_loss     error_rate     accuracy     time
0         0.314576         0.243694     0.079613       0.920387     02:53
1         0.080145         0.075431     0.024889       0.975111     02:51
2         0.072920         0.065042     0.020266       0.979734     02:55
3         0.036023         0.034078     0.011040       0.988960     02:55
4         0.032858         0.027199     0.008889       0.991111     02:55
5         0.022661         0.020598     0.006760       0.993240     02:58
6         0.019699         0.021478     0.006164       0.993836     02:56
7         0.014048         0.021239     0.006207       0.993793     02:57
8         0.012207         0.017040     0.005212       0.994788     02:56
9         0.007839         0.012296     0.003622       0.996378     02:57
10        0.004507         0.012291     0.003376       0.996624     02:57
11        0.000943         0.011359     0.002802       0.997198     02:58
12        0.000763         0.012123     0.002578       0.997422     02:55
13        0.000805         0.011969     0.002340       0.997660     02:57
14        0.000020         0.011902     0.002277       0.997723     02:59

Unfortunately, my initial output was generated with an older version of fastai, so it’s not an exact comparison.

Has anyone else given it a try yet? Did you see similar differences?

2 Likes

I discovered shifted ReLUs discussed in the ELU paper by Clevert et al. (“Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”), BTW.

5 Likes

Thanks, I’m going to play around with it some more tomorrow.

This bit of that paper is interesting:

On CIFAR-100 ELUs networks significantly outperform ReLU networks with batch normalization while batch normalization does not improve ELU networks.

If I recall correctly, all of the ReLUs I replaced in BasicBlock had a BatchNorm immediately prior. I wonder what would happen if I removed those.

1 Like

Given the degree to which the model is able to fit the data (0.23% error), I’m not sure this is the best evaluation, but it’s good to see that the performance was comparable.

I’m definitely curious as to what other things have been tried. I’m hoping to run it on some tabular datasets tonight or tomorrow and will report back.

3 Likes

Commenting out all the BatchNorm2d calls that immediately precede a shiftedReLU sped up training by about 10% and doesn’t appear to have negatively affected the results (a rough sketch of the change follows the table).

epoch 	train_loss 	valid_loss 	error_rate 	accuracy 	time
    0   0.450781   	0.372300 	  0.120684 	0.879316 	02:37
    1   0.104154 	0.083668 	  0.028090 	0.971910 	02:32
    2   0.066984   	0.100175 	  0.029786 	0.970214 	02:35
    3   0.043659   	0.028763 	  0.009408 	0.990592 	02:34
    4   0.033553   	0.023877 	  0.007853 	0.992147 	02:34
    5   0.032346   	0.023091 	  0.006949 	0.993051 	02:33
    6   0.013689   	0.019441 	  0.005681 	0.994319 	02:35
    7   0.013407   	0.018044 	  0.005408 	0.994592 	02:35
    8   0.009179   	0.016570 	  0.004827 	0.995173 	02:33
    9   0.005654   	0.012886 	  0.003916 	0.996084 	02:31
   10   0.002885   	0.012881 	  0.003152 	0.996848 	02:31
   11   0.001693   	0.013370 	  0.002977 	0.997023 	02:32
   12   0.000686   	0.013164   	  0.002543 	0.997457 	02:40
   13   0.000332   	0.013628   	  0.002389 	0.997611 	02:35
   14   0.000159   	0.013449 	  0.002284 	0.997716 	02:33
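For concreteness, here’s roughly what that edit looks like in the block’s forward (a sketch only, assuming torchvision’s BasicBlock layout and the ShiftedBasicBlock sketch above; I’m simply skipping the bn1/bn2 calls that feed an activation):

class ShiftedBasicBlockNoBN(ShiftedBasicBlock):
    def forward(self, x):
        identity = x
        out = self.conv1(x)      # bn1 skipped
        out = self.relu(out)
        out = self.conv2(out)    # bn2 skipped
        if self.downsample is not None:
            identity = self.downsample(x)
        out = out + identity
        out = self.relu(out)
        return out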

To make it a fair test, I also tried removing BatchNorm2d and using regular ReLUs – those results were actually very similar as well. So maybe BatchNorm2d just isn’t as important for this simple ResNet as I expected it to be.

epoch 	train_loss 	valid_loss 	error_rate 	accuracy 	time
    0     0.184621 	0.167899 	0.056818 	0.943182 	02:27
    1 	  0.092659 	0.139100 	0.039684 	0.960316 	02:31
    2     0.065546 	0.057735 	0.018325 	0.981675 	02:28
    3     0.047646 	0.055818 	0.017233 	0.982767 	02:27
    4     0.041871 	0.034755 	0.011096 	0.988904 	02:30
    5     0.031094 	0.026970 	0.009149 	0.990851 	02:28
    6     0.020846 	0.016686 	0.005268 	0.994732 	02:26
    7     0.017267 	0.019431 	0.005485 	0.994515 	02:28
    8     0.009905 	0.016638 	0.005072 	0.994928 	02:29
    9     0.006423 	0.012799 	0.003762 	0.996238 	02:29
   10     0.006129 	0.011062 	0.003103 	0.996897 	02:31
   11     0.001631 	0.011843 	0.002634 	0.997366 	02:29
   12     0.001006 	0.011802 	0.002361 	0.997639 	02:27
   13     0.000392 	0.012514 	0.002312 	0.997688 	02:28
   14     0.000037 	0.012661 	0.002207 	0.997793 	02:26

For fun, I got rid of all the BatchNorms (including the ones not immediately preceding a ReLU), expecting that to completely blow up the model… and it actually didn’t. It ran another 10% faster and worked fine without BatchNorm and without shiftedReLU. (Maybe a tiny bit worse: 99.74% accuracy vs 99.77% previously, but it was still improving at the 15th epoch, so training longer probably would have “fixed” it.)

epoch     train_loss     valid_loss     error_rate     accuracy     time
    0       0.215644     0.260280     0.087515     0.912485     02:11
    1       0.105287     0.112263     0.036931     0.963069     02:17
    2       0.072714     0.057838     0.017989     0.982011     02:16
    3       0.057783     0.047268     0.014942     0.985058     02:14
    4       0.044393     0.059535     0.016091     0.983909     02:11
    5       0.048049     0.035181     0.010508     0.989492     02:11
    6       0.036510     0.031670     0.009954     0.990046     02:11
    7       0.023029     0.027181     0.007790     0.992210     02:11
    8       0.015449     0.019131     0.005723     0.994277     02:11
    9       0.017070     0.016566     0.004714     0.995286     02:11
   10       0.006749     0.015771     0.004252     0.995748     02:13
   11       0.006951     0.012698     0.003180     0.996820     02:11
   12       0.002223     0.013309     0.002942     0.997058     02:11
   13       0.001267     0.014047     0.002697     0.997303     02:15
   14       0.000310     0.014796     0.002599     0.997401     02:12

The paper’s ELU is actually x for x > 0 and α(exp(x) − 1) for x ≤ 0 (and they set α = 1). I want to try that next to see if throwing that exponential in there makes any difference… but, as mentioned above, I think my dataset may just be too easy for any of this to make a difference.
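In code, the paper’s definition is simply this (a minimal sketch, not anything from my notebook):

import torch

def elu(x, alpha=1.0):
    # ELU from the paper: x for x > 0, alpha * (exp(x) - 1) for x <= 0
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))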

Anyone have suggestions on a better model to test this on that might show the differences more clearly?

Edit: I think I’ve been massively overkilling this problem… I bumped down from a ResNet34 to a ResNet18 (with no BatchNorm) and now I’m fully 2x faster than where I was yesterday with the same 99.7x% accuracy after 15 epochs.

Edit2: Swapped out ReLU with ELU as defined in the paper (turns out PyTorch already defines nn.ELU) and it didn’t seem to make a difference on this model either.

1 Like

I agree with what Even said above: testing on a problem where you’re already achieving 99.7% accuracy isn’t going to give you a good point of comparison between different approaches.

2 Likes

I experimented with this on Imagenette:

Model Name    Activation     Accuracy
ResNet18      ShiftedReLU    0.56
ResNet18      ReLU           0.6
ResNet101     ShiftedReLU    0.2
ResNet101     ReLU           0.27

Key observation: ShiftedReLU has worse validation accuracy in early training compared to ReLU.

Experiment Notes:

  • No pretrained models
  • The numbers above are validation accuracy.
  • All models were trained with an identical LR of 1e-2 for 5 epochs.

I was playing around with this and came here to share my results, and then saw your post. Thanks for sharing these!

I am using fastai version 1.0.47.post1 on conda release.

I called this FastReLU and my implementation was quite similar to yours:

import torch.nn as nn
import torch.nn.functional as F

class FastReLU(nn.Threshold):
    def __init__(self, threshold=0.0, value=0.0, inplace=False):
        # nn.Threshold stores threshold, value and inplace for us
        super(FastReLU, self).__init__(threshold, value, inplace)

    def forward(self, input):
        # standard ReLU (threshold at 0, replace with 0), shifted down by 0.5
        return F.threshold(input, self.threshold, self.value, self.inplace) - 0.5

    def extra_repr(self):
        inplace_str = 'inplace' if self.inplace else ''
        return inplace_str

This is the link to the notebook on GitHub.

Thanks to @yeldarb for pointing out the typo in the notebook!

@jeremy - quite curious to understand these somewhat counter-intuitive results. Any pointers on where I should go looking?

2 Likes

Thanks for posting that, @nirantk.

I tweaked your notebook a bit and did some more experiments.

One thing I noticed was that changing your batch size down to 32 from 128 got me a 0.661 baseline on the standard ResNet18. I then changed it to use ELU as cited in the paper Jeremy posted above, and it did improve the fast_rn18 results: averaged over 5 training runs I got 0.699 (about a 6% improvement after 5 epochs).

(I also tried removing BatchNorm, as the paper suggested it was unnecessary with ELU. I tried both dropping it from the network completely and dropping it only where it was called immediately prior to an ELU. Dropping it everywhere was horrible. Dropping it only before ELUs got a 0.638 average which, while worse than the baseline or BatchNorm+ELU, wasn’t as dramatically horrible as no BatchNorm at all.)
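If you want to reproduce the activation swap without editing torchvision’s source, something like this works (a rough sketch, not the code from my notebook; replace_activations is just an illustrative helper):

import torch.nn as nn
from torchvision.models import resnet18

def replace_activations(module, old=nn.ReLU, new=lambda: nn.ELU(alpha=1.0, inplace=True)):
    # recursively swap every `old` activation module in the model for a fresh `new` one
    for name, child in module.named_children():
        if isinstance(child, old):
            setattr(module, name, new())
        else:
            replace_activations(child, old, new)
    return module

model = replace_activations(resnet18())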

I’m going to re-run on ResNet101 now and I’ll update with the results.

Edit: The baseline ResNet101 average was 0.3580 (I think it’s lower than @nirantk’s because ResNet101 should use a Bottleneck rather than a BasicBlock, so I swapped that out) and with ELU instead of ReLU it’s an average of 0.7176 (?!) after 5 epochs. That’s an absolutely enormous improvement… to the point where I’m skeptical that I didn’t somehow screw things up.

Edit 2: Here’s my notebook adapted from @nirantk’s above that swaps shiftedReLU for ELU and has the results.

Edit 3: And of course Jeremy had us all beat already :smile: the “bag of tricks” xresnet18 gets a 0.846 average after 5 epochs and xresnet101 gets 0.836. Swapping ReLU for ELU in xresnet doesn’t seem to help. I don’t have an intuition for why it’d work so well in resnet but not xresnet though; they are very similar. I’m going to keep poking at it.

2 Likes

Removing batchnorm would make this a better test.

Results with and without BatchNorm:

Model   BN    Activation   Accuracy
RN18    Yes   ReLU         0.64
RN18    No    ReLU         0.36
RN18    Yes   ELU          0.67
RN18    No    ELU          0.56
RN18    Yes   FastReLU     0.57
RN18    No    FastReLU     0.5
RN101   Yes   ReLU         0.37
RN101   No    ReLU         0.33
RN101   Yes   ELU          0.70
RN101   No    ELU          0.34
RN101   Yes   FastReLU     0.15
RN101   No    FastReLU     0.37

This is interesting and, I think, acts as a proxy ablation study for BatchNorm itself.

2 Likes

That’s an absolutely enormous improvement… to the point where I’m skeptical that I didn’t somehow screw things up.

Removing batchnorm would make this a better test.

Does anyone know of any resources, or can anyone explain, how to run these kinds of tests effectively? What are some of the things to look out for?
I mean, I know how to do the same thing twice, changing one small thing the second time, and then compare the accuracy, loss plots, etc.
But I don’t know at what point I’ve proven that a specific tweak actually works better. I’d love to hear about the approach of someone with more experience.

1 Like

Thanks for sharing! Note that shifted ReLU will only be useful if you’re using the correct kaiming init - make sure you’re not using the default conv2d init, since the default init doesn’t give unit-variance outputs anyway! :open_mouth:
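For anyone unsure what that looks like in practice, here is a minimal sketch of overriding the default init with kaiming normal (apply_kaiming_init is just an illustrative name, not code from fastai):

import torch.nn as nn

def apply_kaiming_init(model):
    # replace the default conv/linear init with kaiming normal, suited to ReLU-family activations
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            nn.init.zeros_(m.bias)
    return model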

2 Likes

The choice of -0.5 seems fairly arbitrary and only works if the mean of the output is actually 0.5. But that depends on the magnitude of the input data and the magnitude of the weights.

It may have worked for the demo example, but in practice you might want to adjust this -0.5 to whatever the true mean is of the data.

And that’s what batch norm does, of course (if you put it after the ReLU). It figures out what the mean of the data is, so it learns the amount to shift everything by, rather than using a hardcoded value that might not be appropriate.

Likewise, the bias of the next (conv or fully-connected) layer serves a similar purpose and is also learnable. (If that bias is -0.5 then it achieves the exact same effect as the “shifted ReLU”.)

So I’d be surprised if this works better than batch norm, especially as that also “fixes” the standard deviation of the data.

(Perhaps an alternative way to achieve this effect is to put a constraint on the weights during training, so that the output of the layer is always “guaranteed” to have a 0 mean and 1 stddev.)

1 Like

Your input data should have mean zero and var one, and weights selected to keep the activations that way! :slight_smile:

But this almost never happens with the (computer vision) models I see in practice. Even for models where the pixel inputs are normalized to be between -1 and +1 (which means they have a stddev < 1), the activations are often very high.

In some of these models, a ReLU6 is used – which does min(max(x, 0), 6) – instead of “plain” ReLU, as an additional way to force activations to not get out of hand. (You can think of this as a poor person’s sigmoid, as it has roughly the same S-shape.)
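For reference, PyTorch ships this as nn.ReLU6; the clamp below is just to spell out the definition (a quick sketch, not from any of the models mentioned):

import torch
import torch.nn as nn

def relu6(x):
    # min(max(x, 0), 6): clips activations into [0, 6]
    return torch.clamp(x, min=0.0, max=6.0)

x = torch.randn(8) * 10
assert torch.allclose(relu6(x), nn.ReLU6()(x))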

So I’m curious to see what happens to the sizes of the activations and the weights in next week’s lesson when you add the training loop. :smiley:

They should be normalized to zero mean and unit variance, not scaled to between -1 and +1. Hopefully after the next class you’ll agree that this is both possible and useful. If you want to see how to really do it properly, see the SELU and Fixup papers…

2 Likes

So I tried my hand at reproducing some of the results from the Clevert ELU paper that Jeremy linked above. I figured I should share my results here. Here’s the notebook if you’re interested: https://github.com/regrettable-username/pytorch_experiments/blob/master/Activation%20Fun.ipynb

The setup is an 8-layer fully connected network with 128 units per layer. The dataset was MNIST, normalized to a mean of ~0 and stdev of ~1; the validation set was normalized using the training set mean/stdev. The weights were initialized with the Kaiming normal (fan_in) initialization scheme. I trained with relu, leaky_relu with an alpha of 0.1, elu with an alpha of 1.0, and the shifted relu that was mentioned in the lecture. The alpha values were chosen based on the ELU paper. All tests were done with a learning rate of 1e-2.
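Roughly, the network construction looks like this (a sketch based on the description above, not the exact notebook code; make_net is just an illustrative name):

import torch.nn as nn

def make_net(act=nn.ReLU, n_layers=8, width=128, n_in=28 * 28, n_out=10):
    layers, in_features = [], n_in
    for _ in range(n_layers):
        linear = nn.Linear(in_features, width)
        # Kaiming normal (fan_in), as described above
        nn.init.kaiming_normal_(linear.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(linear.bias)
        layers += [linear, act()]
        in_features = width
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)

model = make_net(act=lambda: nn.ELU(alpha=1.0))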

[Plot 1] This first plot shows the median, out of 5 separate runs with fresh initialization, of the average unit-activation for each non-linearity over 125 epochs. The error bars represent the std over those 5 runs.

[Plot 2] The second plot shows the mean training cross-entropy loss (dashed line) and the mean validation cross-entropy loss (solid line) over 5 runs for 25 epochs.

Disclaimer: I’m still fairly new to deep learning and PyTorch (as you’d expect), so I’m not sure I’ve calculated everything correctly.

I’ve captured the mean and std of the weights in each layer before and after applying Kaiming initialization:

Before - mean: 4.421262929099612e-05, std: 0.020606495440006256
Before - mean: -0.00017039463273249567, std: 0.05113699287176132
Before - mean: -0.0006515454151667655, std: 0.05086797848343849
Before - mean: 0.0003819070116151124, std: 0.051259368658065796
Before - mean: -0.0003702500252984464, std: 0.05098085105419159
Before - mean: -0.00011002244718838483, std: 0.05091703683137894
Before - mean: -0.00017100712284445763, std: 0.05093710124492645
Before - mean: -4.990950765204616e-05, std: 0.0516427718102932
-----------------------------------------------------------------------------
After - mean: -3.820666461251676e-05, std: 0.050436556339263916
After - mean: -0.0017552259378135204, std: 0.12501567602157593
After - mean: 0.00023908735602162778, std: 0.12458109855651855
After - mean: 0.0003903487231582403, std: 0.12537115812301636
After - mean: -0.0005305547965690494, std: 0.12493725121021271
After - mean: 0.0015680694486945868, std: 0.12606129050254822
After - mean: -0.0021257216576486826, std: 0.1244535818696022
After - mean: 0.00431952066719532, std: 0.12619906663894653

I believe the std is correct: with 128 units per layer, sqrt(2/128) = 0.125 (and the first layer’s ~0.05 matches sqrt(2/784) for the 784 input pixels). And each mean is approximately 0.

I wasn’t confident in how I computed the average unit-activation. Here’s how I went about it (a rough sketch of the hook mechanics follows the list):

  1. After each epoch I added a forward hook to each linear layer.
  2. I then evaluated the network on a fixed subset of the training set. This is data I actually trained on, not a holdout set.
  3. After every model evaluation I grabbed the tensors for each layer. It seemed these were the results of the affine function without the non-linearity applied, so I applied the correct one to each tensor based on the current test’s activation function.
  4. I took the mean of the result from each tensor from step 3, grabbed the item(), and summed them together, scaling by 1./8. to account for the number of layers.
  5. After summing this over every batch, I scaled the sum by 1. / batch_count to get the final mean.
  6. I cleaned up the hooks as I’m not sure of the performance implications of leaving them during training.
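Here’s roughly what those steps look like in code (a sketch of the hook mechanics only, not my actual notebook code; mean_activation and its arguments are illustrative):

import torch
import torch.nn as nn

def mean_activation(model, batches, act_fn, n_layers=8):
    # register a forward hook on each Linear layer to capture its pre-activation output
    captured, handles = [], []
    for m in model.modules():
        if isinstance(m, nn.Linear):
            handles.append(m.register_forward_hook(
                lambda mod, inp, out: captured.append(out.detach())))
    total, batch_count = 0.0, 0
    model.eval()
    with torch.no_grad():
        for xb in batches:
            captured.clear()
            model(xb)
            # apply the non-linearity, average over units, then over layers
            total += sum(act_fn(t).mean().item() for t in captured) / n_layers
            batch_count += 1
    for h in handles:
        h.remove()  # clean up the hooks so they don't linger during training
    return total / batch_count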

Any advice is greatly appreciated.

All that aside, it’s interesting to see the relative differences between the various non-linearities here. I still want to try a few different hyperparameters and perhaps some other activation functions. Next up I think I’ll try the conv-net architecture used in the CIFAR-100 tests from the paper and compare it to various ResNets. I’m sure the lesson tomorrow will shed more light on all of this!

Cheers,

-James

5 Likes

I have also started trying the shifted ReLU idea on an NLP problem.

I have not tested proper kaiming initialization yet (already pushed!) and I don’t know how the concept of normalizing input data applies to NLP, but so far I can say that results seem to be about the same quality (I have quite a bit of variance, but no changes are obvious after a few runs).

However, training is dramatically faster. Where I would normally need around 90 epochs, now I need about 40. That’s great!

3 Likes

Subtracting 0.5 after ReLU feels like a hard shift toward the negative axis. Has anyone already tried subtracting a random value between 0 and 0.5 instead?

But wouldn’t that result in unpredictable activations?