Super interesting thread and discussion!
If I understood the previous posts correctly, this was the best model setup so far:
- Mish activation function
- change the order to Conv-Mish-BN (this order already seems to be included in the ConvLayer class of fastai v2 dev; see the sketch below)
- hyperparameters, optimizer, and training schedule as in the notebook
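For clarity, this is roughly what the Conv-Mish-BN order looks like as a plain PyTorch sketch (fastai v2's ConvLayer can be configured to produce the same order, if I read the source correctly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish: x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def conv_mish_bn(ni, nf, ks=3, stride=1):
    # Conv -> Mish -> BN, i.e. activation *before* the norm layer
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
        Mish(),
        nn.BatchNorm2d(nf),
    )
```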
I tried my Squeeze-Excite-XResNet implementation (based on the fastai XResNet) with AdaptiveConcatPool2d and Mish, and got the following results after 5 runs:
0.638, 0.624, 0.668, 0.700, 0.686 → average: 0.663
… so this did not really improve on the SimpleSelfAttention approach from the notebook.
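For reference, my SE block is the standard squeeze-and-excitation gate, roughly like this sketch (the reduction factor of 16 is just the usual default; pass Mish as the activation to stay consistent with the rest of the net):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global avg pool -> FC bottleneck -> sigmoid channel gate
    def __init__(self, nf, reduction=16, act_cls=nn.ReLU):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(nf, nf // reduction, bias=False),
            act_cls(),                        # e.g. Mish instead of the usual ReLU
            nn.Linear(nf // reduction, nf, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # rescale the channels
```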
Then I combined the MXResNet from LessW2020 with the SE block to get the SEMXResNet, and got the following results (+ Ranger + SimpleSelfAttention + Flatten Anneal):
0.748, 0.748, 0.746, 0.772, 0.718 → average: 0.746
And with BN after the activation:
0.728, 0.768, 0.774, 0.738, 0.752 → average: 0.752
And with AdaptiveConcatPool2d:
0.620, 0.628, 0.684, 0.714, 0.700 → average: 0.669
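Roughly, the SE gate sits on the residual branch right before the skip connection is added (simplified sketch reusing the SEBlock from above; the real MXResNet block of course also has the downsample/identity-path details):

```python
import torch.nn as nn

class SEResBlock(nn.Module):
    # Simplified: convpath is the block's existing Conv-Mish-BN stack,
    # and the SE gate rescales its output channels before the skip connection.
    def __init__(self, convpath, nf, reduction=16):
        super().__init__()
        self.convpath = convpath
        self.se = SEBlock(nf, reduction)   # SEBlock from the sketch above

    def forward(self, x):
        return x + self.se(self.convpath(x))
```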
With the increased parameter count (e.g., from the SE block and/or the doubled input to the FC head after AdaptiveConcatPool2d), 5 epochs are very likely not enough to fairly compare it to the models with fewer parameters, since it needs more time to train (as mentioned above).
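For anyone wondering where the doubled head input comes from: AdaptiveConcatPool2d concatenates max and average pooling, so the first Linear layer of the head sees 2*nf features instead of nf. Roughly (this mirrors the fastai implementation):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    # Concatenate adaptive max pool and adaptive avg pool along the channel dim,
    # which is why the following FC head gets twice as many input features.
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)
```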
I also saw the thread from LessW2020 about Res2Net. Has anybody already tried it in a similar fashion and gotten some (preliminary) results?