Meet Mish: New Activation function, possible successor to ReLU?

I have not looked at it yet myself; I just briefly looked at the optimizer notebook to see what's changed. @LessW2020?

I tried it out briefly but was getting a couple of errors, I’d need to spend a while looking at it to figure out what was going wrong. Maybe after the RSNA kaggle comp :slight_smile:

@morgan take a look at the optimizer notebook. It seemed to me like the new optimizers need to be ported over to the new PyTorch to use them directly (or redone the way fastai expects them).

1 Like

Hey all,

Here is a rough first draft of a working RangerQH port from @LessW2020 for fastai_v2.

The main difference I could see between the original code and what the fastai v2 Optimizer wanted was the need to split the logic into updates to the State and then the actual step function.

stats
Updates to elements in the State happen by passing functions to stats in the main RangerQH optimizer function. Updates to the State happen sequentially in order of the functions passed.

steppers
Updates to the parameters themselves happen via steppers, which again is a list of functions, executed sequentially on your parameters p.
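
To make the pattern concrete, here's a toy sketch of the shape of it (this is not the actual fastai v2 Optimizer or my RangerQH code; the stat/stepper here are just a made-up momentum-SGD example):

```python
import torch

# Toy sketch of the stats/steppers split (not the actual fastai v2 Optimizer code).
# `avg_grad_stat` and `sgd_with_mom_step` are made-up names for illustration.

def avg_grad_stat(state, p, mom=0.9, **kwargs):
    "stat: update the running average of gradients kept in `state`"
    if 'avg_grad' not in state: state['avg_grad'] = torch.zeros_like(p.grad)
    state['avg_grad'].mul_(mom).add_(p.grad)
    return state

def sgd_with_mom_step(p, state, lr=1e-2, **kwargs):
    "stepper: update the parameter `p` in place using the state built by the stats"
    p.data -= lr * state['avg_grad']
    return p

class ToyOptimizer:
    "Run all stats (state updates) in order, then all steppers (parameter updates) in order."
    def __init__(self, params, steppers, stats, **defaults):
        self.params, self.steppers, self.stats, self.defaults = list(params), steppers, stats, defaults
        self.state = {p: {} for p in self.params}

    def step(self):
        for p in self.params:
            if p.grad is None: continue
            for stat in self.stats: self.state[p] = stat(self.state[p], p, **self.defaults)
            for stepper in self.steppers: stepper(p, self.state[p], **self.defaults)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None: p.grad.detach_(); p.grad.zero_()

# usage: ToyOptimizer(model.parameters(), steppers=[sgd_with_mom_step], stats=[avg_grad_stat], lr=0.1)
```

The real Optimizer handles hyperparameter groups and scheduling on top of this, but the split is the same: stats only touch the State, steppers only touch the parameters.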

This was my first go at playing around with the innards of an optimizer so there is plenty to improve in my code I’m sure, happy to hear any suggestions around logic and naming conventions in particular :smiley:

One thing that's probably worth doing is splitting the Lookahead step out from rangerqh_step into its own stepper, but I haven't had time today.

Also, I have just tested on the MNIST logistic classifier net from the "What is torch.nn Really" tutorial, where it consistently beat SGD. I haven't used it on more advanced architectures yet, so there might be a little more work to do.

4 Likes

Trialling it out with EfficientNet-b2, but the lr_finder is giving a much higher loss than usual; I'm used to getting closer to 0.07 at the minimum…

@morgan see sgugger's comment here. Turns out we already have a full Ranger! This should help with adapting QH :slight_smile:

Meet Ranger - RAdam + Lookahead optimizer

Perhaps you could implement the Quasi-Hyperbolic Momentum (the only part missing) and make it modular like RAdam and LookAhead? (Or see if your current version will stack together!)
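
For reference, the QHM update itself is tiny. In the stats/steppers style from the post above it could look something like this (a rough, untested sketch; names are illustrative and the defaults are roughly the ones suggested in the QH paper):

```python
import torch

def qhm_stat(state, p, beta=0.999, **kwargs):
    "stat: exponential moving average of the gradients (the QHM momentum buffer)"
    if 'qh_buf' not in state: state['qh_buf'] = torch.zeros_like(p.grad)
    state['qh_buf'].mul_(beta).add_(p.grad, alpha=1 - beta)
    return state

def qhm_step(p, state, lr=1e-3, nu=0.7, **kwargs):
    "stepper: move along an interpolation of the raw gradient and the momentum buffer"
    p.data -= lr * ((1 - nu) * p.grad + nu * state['qh_buf'])
    return p
```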

2 Likes

Haha nice, let's see what a proper implementation looks like :smiley: I'll see if I can figure out a good way to add QH.

1 Like

Fit_fc is now in there too as fit_flat_cos (thanks sgugger!)

I believe (if I read it right) you can do: Learner.fit_flat_cos
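
i.e. something like this (untested, and the exact arguments may vary between fastai v2 versions):

```python
# assuming `learn` is an existing fastai v2 Learner
learn.fit_flat_cos(5, lr=1e-3)   # flat LR for most of training, then a cosine anneal at the end
```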

I’ll port over a notebook to run them all on my v2 repo here in the next few days for ImageWoof :slight_smile:

Also just saw that the Simple Self Attention layer got added too :slight_smile:

@Diganta and mish! (Swish too) :wink:

4 Likes

I was trying to study the use of Mish in the case of transfer learning, i.e. only using Mish in the last FC layers. I tested a pretrained ResNet50 against ReLU on CIFAR10 and CIFAR100. Although the results are quite similar, there were marginal improvements when using Mish. The runs were only for 10 epochs and only one Mish activation function was used in the entire network. But the results did show promise, and I observed we can get some improvements just by changing from ReLU to Mish in the head of the model in the case of pretrained models.

Also, all the parameters were kept the same, so just by replacing ReLU I got some improvements. There was one problem though: Mish overfitted quickly when using a higher learning rate, which may be due to not having found the best parameters.
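
For anyone who wants to try the same setup, this is roughly the idea (a sketch rather than my exact code; the head layout and sizes here are just illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))   # softplus rather than log(1 + exp(x)) for stability

# pretrained body keeps its ReLUs; only the new head uses Mish
model = models.resnet50(pretrained=True)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 512),
    Mish(),
    nn.Linear(512, 10),   # 10 classes for CIFAR10, 100 for CIFAR100
)
```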

I also wrote a Medium post about it.

5 Likes

Can someone help me with this issue over here - https://github.com/AlexeyAB/darknet/issues/3994
Commit - https://github.com/AlexeyAB/darknet/commit/bf8ea4183dc265ac17f7c9d939dc815269f0a213
Mish was added to Darknet; however, in practical usage it is giving NaN. I would appreciate it if someone could point out whether there is any mistake in the implementation.
Thanks!

@Diganta and others, I have a question. I have seen this being discussed around and wanted to know which would be better: to use a pretrained ResNet with ReLU activations and only have Mish in our head, or to replace all activations with Mish and then load the pretrained model in? Thoughts?

2 Likes

That implementation is not at all numerically stable. All the exps quickly lead to overflow and hence NaN. It should be possible to adapt either the Eigen-based implementation from tensorflow contrib or my mostly pure C++ implementation (mostly, as it's using the PyTorch dispatch/templating but is otherwise standard C++). The TF one is probably slightly more stable given its handling of both underflow and overflow, but it will require more adaptation to remove the Eigen dependency.
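
You can see the failure mode in PyTorch with the naive formula (just a small illustration; the Darknet code is C/CUDA but the maths is the same):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([100.0], requires_grad=True)

# naive formula: exp(100) overflows to inf and the backward pass turns that into nan
y = (x * torch.tanh(torch.log(1 + torch.exp(x)))).sum()
y.backward()
print(x.grad)   # tensor([nan])

x.grad = None
# F.softplus switches to the identity above a threshold, so both passes stay finite
y = (x * torch.tanh(F.softplus(x))).sum()
y.backward()
print(x.grad)   # tensor([1.])
```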

3 Likes

Not sure which would be better in that case. I did try replacing Swish with Mish using pre-trained weights and that seemed to work well from minimal tests. I didn’t do anything special in terms of freezing or LRs.
It may not work so well coming from ReLU, but I don't think it's that dissimilar in terms of general output range at least.
I gather the idea of the various initialisation procedures is to choose random initial weights that give similar activation ranges to those you’d find in a pretrained model. In which case presumably pretrained weights shouldn’t be any worse, only potentially better (given there is no special init handling for Mish).
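
For the full swap, something like this is all that's needed to replace the activations while keeping the pretrained weights (a quick sketch, only tried against standard torchvision-style models):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Mish(nn.Module):
    def forward(self, x): return x * torch.tanh(F.softplus(x))

def replace_relu_with_mish(module):
    "Recursively swap every nn.ReLU for Mish, leaving the pretrained weights untouched."
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU): setattr(module, name, Mish())
        else: replace_relu_with_mish(child)

model = models.resnet50(pretrained=True)
replace_relu_with_mish(model)
```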

Perhaps a bit of extra care around initial training would also help, trying to ensure the previous learning is preserved rather than having a poor step early on change too much. Maybe training for a few iterations with everything but the batchnorms frozen (the default if you freeze the entire model in fastai). I wouldn't think you'd need much to update the BNs, probably not even a full epoch assuming a reasonable dataset size. Then maybe a lower LR for a little bit, again maybe even less than an epoch. You could do both of those by using a warmup schedule from 0 over the first epoch (at least if testing suggested it was likely to be worthwhile).
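
In fastai terms that might look something like this (the epoch counts and LRs are just placeholders to show the shape of it, assuming `learn` wraps the Mish-swapped model):

```python
# assuming `learn` is a fastai Learner wrapping the Mish-swapped model
learn.freeze()                                # by default this still leaves the batchnorm layers trainable
learn.fit_one_cycle(1, 1e-3)                  # short run mainly to let the BN stats/params adapt
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5, 1e-4))     # then a little more with lower, discriminative LRs
```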

2 Likes

UPDATE:
Lost the results of my initial tests due to some JupyterLab issues (lost connection, then it went a bit feral). But I got some information from the partial results and have re-run a more selective set of tests, just testing various settings with 1-cycle rather than the schedulers. There is some promise with the schedulers, but they need more work as I think you really want differential learning rates across layer groups, and I'm not sure that will work with the schedulers.
I’ve updated the notebook (note it won’t run and produce those results but all the code for them is there). Tables and graphs at the end show main results.
Not shown in these results is the very poor performance if you don't first train the classifier with the body frozen. I initially missed this, and while ReLU also suffered quite a lot, it really affected the cases with Mish.

I tried both replacing the activations with Mish across all layers from the start (mish_all in the results) and a staged approach (mish_stg) of first replacing Mish in the classifier and just training that, then replacing Mish in the body when you unfreeze and fine-tune. Results here are a bit mixed, with probably a slight overall advantage to the non-staged approach. But I think this is largely because I was comparing everything across just 10 epochs, unfreezing at epoch 5, so the staged approach gives a lot less time for the body to adapt.
Differential learning rates seemed to help quite a bit, so I compare a couple of settings there. Quite low rates on the initial layers do seem to help with the adaptation, but there's a bit of a trade-off as this gives less overall learning. Again, this is probably slanted given the short testing time; with more epochs I think these might lead to better final performance.
Tests with a lower learning rate seem to suggest a lower rate can help avoid big drops when you switch, but, again likely due to the very limited time, they gave worse final results.

My overall view from testing would be that there does seem to be the possibility of getting pretty good results with Mish across the whole model, but it is a little sensitive to parameters, and it may not be possible to adapt it optimally with basic parameter settings.
I’d tend to think based on this that trying to replace Mish in the body while training a new task may not be worthwhile. Especially if you are only training for a limited time and/or don’t have a lot of training data so overfit is a concern.
But it seems like it may be worth trying to adapt existing weights to Mish. It seems like you may be able to come up with a fairly minimal training regime to create weights quite well adapted to Mish with much less work than full ImageNet training. Particularly promising here is that the best-performing all-Mish model outperformed the worst ReLU model, even though all the settings were fairly reasonable, so the gap isn't that big.

One thing that might be interesting to try is progressively replacing the activations in the body. So replacing the final activation and just unfreezing and training layers after it. Then the next to last activation and so on. This would need a bit of training time but I don’t think that much. I’d guess you might get good results with just a few hundred batches per step (the epochs there being ~200 batches). Not really appropriate for normal use, but may produce some nice weights with a lot less work than training from scratch.
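
Roughly the shape of what I mean, as an untested sketch (`train_a_little` is a placeholder for whatever short training loop you'd use, and the ordering assumes module registration order matches forward order, as it does for the torchvision ResNets):

```python
import torch.nn as nn

def relu_locations(model):
    "Collect (parent, attribute_name) for every nn.ReLU, in registration (roughly forward) order."
    locs = []
    for _, module in model.named_modules():
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU): locs.append((module, name))
    return locs

def progressively_mishify(model, mish_cls, train_a_little):
    "Swap ReLUs one at a time from the end of the network, training briefly after each swap."
    for parent, name in reversed(relu_locations(model)):
        setattr(parent, name, mish_cls())
        train_a_little(model)   # placeholder: e.g. a few hundred batches with earlier layers frozen
```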

5 Likes

There’s an issue in your Repository. Someone from the YoloV3 team needed some clarification with how to integrate Mish CUDA in their script. Letting you know in case if you haven’t checked. Thanks!

@TomB also can you please take a look at this comment:

Also, as you can see, there are 3 different Mish implementations, even the forward Mish functions are different, so we can't convert models between TF (2 thresholds) <-> PyTorch (1 threshold) <-> MXNet (0 thresholds).

Link to discussion - https://github.com/AlexeyAB/darknet/issues/3994

Couldn’t actually find that comment from a scan of that quite long thread. So not sure if/how they resolved it.
The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don't think the differences are big enough that there's any strong reason to use the same implementation, so I think you could just as well use the TF logic for Mish in PyTorch. They both just come from borrowing the relevant softplus implementation.
I'm not sure the differences make a real impact, and I don't think they would prevent converting models, at least not between TF and PyTorch. As noted, this would also potentially apply to any model using softplus.
If there is indeed no threshold in MXNet then that may cause issues. But this also depends on other details: there may be other handling of non-finite values that would mitigate issues, and it also depends on the datatypes used. In general this is mostly an issue for 16-bit floats. Though I think I did see some issues with 32-bit floats, that was with the quite unstable calculation involving multiple exponentials rather than the symbolically derived gradient calculation.
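
To show what I mean by the threshold differences, roughly (going from memory of the two implementations, so treat the exact cutoffs as illustrative):

```python
import math
import torch

def softplus_pytorch_style(x, threshold=20.0):
    # one threshold: above it, softplus(x) is just x (this is what F.softplus does internally)
    return torch.where(x > threshold, x, torch.log1p(torch.exp(x.clamp(max=threshold))))

def softplus_tf_style(x):
    # two thresholds: large x -> x, very negative x -> exp(x), otherwise log1p(exp(x))
    t = -(math.log(torch.finfo(x.dtype).eps) + 2.0)   # ~13.9 for float32
    xe = torch.exp(x.clamp(max=t))                    # clamp just to keep the unused branch finite
    return torch.where(x > t, x, torch.where(x < -t, xe, torch.log1p(xe)))
```

As far as I can tell the two only differ at around floating-point precision; the thresholds are there to keep the intermediate exp finite, not to change the function.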

Oh and I’ve responded to that post.
I’d also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.
The one issue is support in older PyTorch versions. It should be fine in PyTorch 1.2 and 1.3 (though I’ve mostly tested in 1.3). I think it should probably also work in 1.1 and maybe even 1.0 in which case it should always be fine as I can’t imagine you’d want to support pre-1.0 anymore.
But the JIT version should probably be preferred unless older support is key. I’d also note that I don’t think my CUDA version will work pre-1.2 so the JIT version should offer equivalent performance and version support. I just need to run a few extra tests on the JIT version and then will likely update the repo to indicate the JIT version should be preferred.
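
For anyone curious, the JIT version is basically this pattern (a simplified sketch rather than the exact code in the repo): only the input is saved for backward and the intermediates are recomputed, with the scripted functions letting PyTorch fuse what it can.

```python
import torch
import torch.nn.functional as F

@torch.jit.script
def mish_fwd(x):
    return x.mul(torch.tanh(F.softplus(x)))

@torch.jit.script
def mish_bwd(x, grad_out):
    sp = F.softplus(x)
    tsp = torch.tanh(sp)
    # d/dx [x * tanh(softplus(x))] = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
    return grad_out * (tsp + x * torch.sigmoid(x) * (1 - tsp * tsp))

class MishJit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # only the input is kept, not the intermediates
        return mish_fwd(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return mish_bwd(x, grad_out)    # intermediates recomputed here

mish = MishJit.apply
```

The trade-off is that backward recomputes softplus/tanh rather than saving them, which is where the memory saving comes from.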

1 Like

Can you explain in more detail why the Autograd version has this shortcoming?

Additionally, I noticed your implementation takes the Mish function to be x * tanh(ln(exp(x))) instead of x * tanh(ln(exp(x) + 1)), which is the original definition. The two are considerably different.

The Autograd version performs the same as the non-autograd version for the forward pass, as it still requires multiple CUDA kernel launches, while the backward pass is likely a bit slower due to the less efficient backward having to recalculate values. The JIT script reduces this somewhat by fusing multiple operations into a single kernel, thus increasing performance.
Actually, while Swish fully fuses into a single kernel launch and so performs about the same as my CUDA implementations of Mish/Swish (which are close to ReLU), the Mish JIT version does not fully fuse, as fusing is not supported for Softplus, so only the mul and tanh fuse. I did implement a fully fused version by manually implementing Softplus; however, this was slower than the partially fused version, at least in my initial tests. I'm not quite sure why, but my first guess would be that the where op I used to implement thresholding does not fuse well.
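
The manually fused variant was along these lines (a simplified sketch, not the exact code):

```python
import torch

@torch.jit.script
def mish_fused_softplus(x):
    # softplus inlined by hand so the whole expression is a single fusion candidate;
    # the where handles the overflow threshold, and may be what stops it fusing well
    sp = torch.where(x > 20.0, x, torch.log1p(torch.exp(x.clamp(max=20.0))))
    return x * torch.tanh(sp)
```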

Which implementation? I'm pretty sure my CUDA version is implementing the latter, and it's tested against x.mul(torch.tanh(F.softplus(x))).
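
If you want to check for yourself, comparing against that reference expression is straightforward (naive_mish here is just a stand-in for whichever implementation you want to test):

```python
import torch
import torch.nn.functional as F

def reference_mish(x):
    return x.mul(torch.tanh(F.softplus(x)))            # x * tanh(ln(1 + exp(x)))

def naive_mish(x):                                     # stand-in for the implementation under test
    return x * torch.tanh(torch.log1p(torch.exp(x)))

x = torch.randn(10000, dtype=torch.float64)
print(torch.allclose(naive_mish(x), reference_mish(x)))   # True for moderate inputs
```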