@daniel-j-h The MNasNet paper was an important step in the jump from MobileNet-V2 to MobileNet-V3 and EfficientNet: it introduced the SE module to the Inverted Residual block, and both the MobileNet-V3 and EfficientNet papers were very much written with the assumption that you've read it first: https://arxiv.org/abs/1807.11626
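For anyone following along, here's a rough PyTorch sketch (not the paper's exact code, the channel and ratio choices are just illustrative) of how an SE module slots into an inverted residual block:

```python
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Channel attention: squeeze (global pool) -> reduce -> expand -> gate."""

    def __init__(self, channels: int, se_ratio: float = 0.25):
        super().__init__()
        reduced = max(1, int(channels * se_ratio))
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        scale = x.mean(dim=(2, 3), keepdim=True)  # squeeze: B x C x 1 x 1
        scale = self.gate(self.expand(self.act(self.reduce(scale))))
        return x * scale                          # excite: reweight channels


class InvertedResidualSE(nn.Module):
    """Expand -> depthwise -> SE -> project, with a residual when shapes match."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1,
                 expand_ratio: int = 6, se_ratio: float = 0.25):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # 3x3 depthwise
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # SE attention on the expanded features
            SqueezeExcite(mid, se_ratio),
            # 1x1 linear projection
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```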
MNasNet-A1 vs B1 is worth looking at for an SE vs no-SE comparison. The A1 is SE-block based, with some reductions in dimensions vs the B1, but it can be trained to 75.2% top-1 per the paper (I've managed 75.45), while the B1 reaches 74.5 (74.66 in my attempt). The A1 has 3.9M params and the B1 4.4M.
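If you want to sanity-check those param counts yourself, something like this works (assuming timm's `semnasnet_100` / `mnasnet_100` correspond to the A1 / B1 variants):

```python
import timm

# Count parameters for the SE (A1) and non-SE (B1) MNasNet variants.
for name in ("semnasnet_100", "mnasnet_100"):
    model = timm.create_model(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M params")
```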
The attention from SE blocks typically improves parameter efficiency. Just as with SE variants of ResNets or ResNeXts, IR-block based networks with SE present give you a higher ratio of performance (accuracy metrics) to parameters. Unfortunately they also seem to make the networks harder to train, needing more epochs (maybe more variety is needed, so bigger datasets or better augmentation helps?), and a little slower to push images through.
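As a back-of-envelope check on the overhead, the SE squeeze/expand layers only add roughly 2 * C^2 * se_ratio weights per block on C channels, which is small next to the expand/project convs (illustrative numbers, not from the paper):

```python
def se_params(channels: int, se_ratio: float = 0.25) -> int:
    # Two 1x1 convs / FC layers: C -> C*ratio -> C, weights plus biases.
    reduced = max(1, int(channels * se_ratio))
    return channels * reduced + reduced + reduced * channels + channels

print(se_params(240))  # ~29k extra params for a 240-channel expanded block
```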
All that said, for PyTorch running on an NVIDIA GPU, I don't think EfficientNets make sense as a go-to network for most applications. While they are parameter- and FLOP-efficient, when it comes to GPU memory usage and image throughput they are no faster, if not worse, than ResNe(X)ts. Well-trained ResNe(X)ts can achieve similar accuracy metrics with better GPU memory characteristics, higher throughput, and half the training epochs (possibly up to a 3-4x difference in wallclock time).
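To get a feel for the throughput/memory gap on your own GPU, a quick inference-only sketch like the one below is enough; model names here are timm's, and batch size, resolution, and AMP all change the picture (training adds the backward pass and activation memory on top):

```python
import time
import torch
import timm


def benchmark(name: str, batch_size: int = 64, img_size: int = 224, iters: int = 50):
    model = timm.create_model(name).cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        for _ in range(10):          # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    imgs_per_sec = batch_size * iters / (time.time() - start)
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"{name}: {imgs_per_sec:.0f} img/s, peak {peak_mb:.0f} MiB")


for name in ("efficientnet_b0", "resnext50_32x4d"):
    benchmark(name)
```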