@LessW2020 I am trying to run Running Batch Norm as described in this post: Running batch norm tweaks. When it runs update_stats I get the following:
--> 24 s = x.sum(dims, keepdim=True)
25 ss = (x*x).sum(dims, keepdim=True)
26 c = self.count.new_tensor(x.numel()/nc)
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)
I believe this has to do with how tabular tensors are passed in as two separate entities (cont and cat vars). How should I go about fixing this?
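For context, the reduction dims in that notebook assume 4-D conv activations (batch, channels, height, width); a 2-D tabular tensor has no spatial axes, which would produce exactly this IndexError. A minimal sketch of making the reduction rank-aware (my guess at a workaround, not the notebook's actual code):

```python
import torch

def update_sums(x):
    # Reduce over batch + spatial dims for 4-D conv activations,
    # but over the batch dim only for 2-D tabular activations.
    dims = (0, 2, 3) if x.dim() == 4 else (0,)
    s = x.sum(dims, keepdim=True)
    ss = (x * x).sum(dims, keepdim=True)
    return s, ss
```

With this, a `(batch, features)` tabular tensor yields per-feature sums of shape `(1, features)` instead of raising.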
Assuming I got everything right from the previous posts, this was the best model setup so far:
Mish activation function
change the order to Conv-Mish-BN (this order already seems to be included in the ConvLayer class of fastai v2 dev)
hyperparameters, optimizer, and training schedule as in the notebook
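For anyone who hasn't seen it yet, Mish is simply x · tanh(softplus(x)); a minimal PyTorch module matching the published formula (not necessarily the exact class used in the notebook):

```python
import torch
import torch.nn.functional as F

class Mish(torch.nn.Module):
    """Mish(x) = x * tanh(softplus(x))"""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```

It is smooth and non-monotonic, passes through the origin, and behaves like the identity for large positive inputs.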
I tried my Squeeze-Excite-XResNet implementation with AdaptiveConcatPool2d with Mish based on the fastai XResNet and got the following results after 5 runs:
0.638, 0.624, 0.668, 0.700, 0.686 --> average: 0.663
… so this did not really improve on the SimpleSelfAttention approach from the notebook.
Then I combined the MXResNet from LessW2020 with the SE block to get the SEMXResNet and got the following results (+ Ranger + SimpleSelfAttention + Flatten Anneal):
0.748, 0.748, 0.746, 0.772, 0.718 --> average: 0.746
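For reference, the SE block here is the standard Squeeze-and-Excitation gate: global-pool the channels, squeeze them through a bottleneck MLP, and rescale. A minimal sketch of the common form (reduction=16 is the usual default, not necessarily what was used above):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global pool -> bottleneck MLP -> channel-wise gate.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # rescale each channel by its learned gate
```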
With the increased parameter count (e.g., the SE block and/or the doubled FC-head input after the AdaptiveConcatPool2d), the 5 epochs are very likely not enough to compare it fairly against the models with fewer parameters, as it will need more time to train (as mentioned above).
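The doubling mentioned above comes from AdaptiveConcatPool2d concatenating average- and max-pooled features, so the FC head sees twice the channel count. A sketch of the fastai-style layer:

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    # Concatenates adaptive max and average pooling along the channel dim,
    # doubling the features fed into the FC head (as in fastai).
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)
```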
I have also seen the thread from LessW2020 about Res2Net. Has anybody already tried it in a similar fashion and gotten some (preliminary) results?
Observation Notes - Loss with Mish is extremely stable compared to the others, especially E-Swish and ReLU. Mish is also faster than GELU. Mish is the only activation function that crossed 91% in the 2 runs, while the other activation functions reached at most 90.7%.
Mish's highest Test Top-1 Accuracy was 91.248%. SELU performed the worst throughout.
Not yet – I was going to wait until after I am finished with the project since this was just a small part of something much bigger. But I’ll definitely share it here when I do!
I just want to note here, as I reran my tabular models, one thing I noticed: the increase in accuracy appeared specifically when I included information via time-based feature engineering. When the paper is published I will be able to share more details, but I definitely notice the change specifically there. In all other instances there was no radical difference. The moment that was introduced, accuracy shot from 92% to 97% with negligible error. This, I believe, is what was missing. I will note that this time-based feature engineering is not like Rossmann's. I will try plugging in a Mish activation, as I believe it could shine here.
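As a generic illustration of what time-based feature engineering can look like (a hypothetical helper, not the actual features used above, which aren't public yet): expand a datetime column into calendar parts plus cyclical encodings so the model can see periodicity.

```python
import numpy as np
import pandas as pd

def add_time_features(df, col):
    # Hypothetical sketch: derive calendar parts and cyclical encodings
    # from a datetime column (not the exact features referenced in the post).
    dt = pd.to_datetime(df[col])
    out = df.copy()
    out[f'{col}_dayofweek'] = dt.dt.dayofweek
    out[f'{col}_month'] = dt.dt.month
    out[f'{col}_hour'] = dt.dt.hour
    # cyclical encoding so hour 23 sits next to hour 0
    out[f'{col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
    out[f'{col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
    return out
```

The derived columns can then be fed in as additional continuous/categorical variables in a tabular learner.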
The reason for hoping for Mish is that I have been thoroughly testing the new optimizer + Mish on a variety of datasets with no changes whatsoever. But with this dataset I saw direct changes using the new optimizer + scheduler. Sadly, the only bit that did not change anything in the end was Mish. My model was set up with two hidden layers of 1000 and 500. I ran it in three separate areas where I had seen various improvements (with the optimizer and in general), but to no avail. Sorry @Diganta! The tabular mystery is still a thing! (This whole experiment has gotten me re-thinking tabular models as a whole.)
I’ve been out of the loop here for a bit as I have a lot of consulting work in progress, but @muellerzr happy to see you might have a paper out soon!
Re: tabular data - here’s a recent paper and apparently open source code that may be of great interest:
I want to test it out but too swamped atm.
Also, I wanted to add that I’m having really good results with the Res2Net Plus architecture and the Ranger beta optimizer. I’ll have an article out soon on Res2Net Plus, and then on the Ranger beta and the paper it’s based on. It trains really well… not fast enough for leaderboards, but excellent for production work.
@Diganta Congratulations for your work!
It now seems clear that the Mish activation boosts both accuracy and model stability on SqueezeNet.
I have been following this thread and the Imagenette/Imagewoof one, and was super excited about MXResNet, where Mish also boosted the model.
Do you know if there is any benchmark of EfficientNet with the Swish activation replaced by Mish?
Exciting times we are living in; you never know if the current SOTA will survive next week.
Thanks again for continuing to push.
@LessW2020 I’ve been running some trials with Ranger913A over the past few days with EfficientNet-b3 + Mish on Stanford-Cars. Overall I found it matched the accuracy of “vanilla” Ranger but didn’t beat it. It also needed a higher lr (10x) than Ranger and was about 15% slower to run.
If there are 1 or 2 other parameter configurations you feel are worth exploring then I can probably make time to trial them, however in general with Ranger913A and Ranger I’ve found 93.8% to be a hard ceiling with b3, although I have seen 93.9% after adding Mixup.
It’s not on Imagenette/Imagewoof, but I’ve found Mish + Ranger to be super helpful. Also, @ttsantos and I are working on a PR for the efficientnet_pytorch library to allow you to toggle the activation function between Mish, Swish, and ReLU…
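In the meantime, one generic way to toggle activations in any PyTorch model (not the actual PR, which may expose a constructor flag instead) is to recursively swap out the activation modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_activations(model, old_cls, new_act_factory):
    # Recursively replace every module of type old_cls with a fresh
    # activation from new_act_factory, in place.
    for name, child in model.named_children():
        if isinstance(child, old_cls):
            setattr(model, name, new_act_factory())
        else:
            swap_activations(child, old_cls, new_act_factory)
    return model
```

For example, `swap_activations(model, nn.ReLU, Mish)` converts a ReLU network to Mish without touching its weights.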
Hey Marc. Thanks for the appreciation. Yes, it demonstrates the consistency and superiority of Mish. @morgan here used Mish with EfficientNet and beat the accuracy of Google’s paper on the Stanford Cars dataset with the B3 model.
Also for more comparison, I used the B0 model to get scores for various activations on the CIFAR-10 dataset.
It’s very nice you are doing all those benchmarks and also including statistical significance.
I have two suggestions you could explore:
Does Mish beat ReLU when constrained by run time? For example, if 50 epochs with Mish take as long as 80 epochs with ReLU (I just made up numbers here), you could compare those results.
Does Mish increase the capacity of a network when replacing ReLU? I’m not sure exactly how you would test that, but I assume that there’s a certain number of epochs after which networks won’t improve their accuracy by much. I assume it’s higher than 50 epochs.
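The first suggestion above amounts to fixing a wall-clock budget rather than an epoch count; a tiny helper that converts one into the other (using the made-up 50-vs-80 numbers from the suggestion, purely for illustration):

```python
def equal_time_epochs(baseline_epochs, baseline_sec_per_epoch,
                      candidate_sec_per_epoch):
    # How many candidate epochs fit in the baseline's wall-clock budget.
    budget = baseline_epochs * baseline_sec_per_epoch
    return int(budget // candidate_sec_per_epoch)

# e.g. if ReLU runs 80 epochs at 100 s/epoch and Mish costs 160 s/epoch,
# the time-matched Mish run gets 50 epochs.
print(equal_time_epochs(80, 100, 160))
```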