MixNets! Google Brain’s newest NN architecture outdoes ResNet-153 with 8x fewer params

Hi all,
As I’ve referenced a few times before, I’m excited about the new MixNet architecture that Google Brain released. MixNet is a new state-of-the-art mobile AI architecture (record top-1 ImageNet accuracy among mobile models).

By blending a range of kernels from 3x3 to 9x9 in the same block, MixNet sets a new accuracy and efficiency record. (MixNet-L outperforms ResNet-153 with 8x fewer params, and MixNet-M matches it exactly but with 12x fewer params and 31x fewer FLOPS.)
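For intuition, the MixConv idea can be sketched in a few lines of PyTorch: split the channels into groups and run a depthwise conv with a different kernel size on each group. (The even channel split and the 3/5/7/9 kernel set below are illustrative choices, not the paper’s exact per-block configuration.)

```python
import torch
import torch.nn as nn

class MixConv(nn.Module):
    """Sketch of a MixConv layer: channels are split into groups, and each
    group gets a depthwise conv with its own kernel size."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        # split the channels as evenly as possible; first group absorbs the remainder
        n = len(kernel_sizes)
        splits = [channels // n] * n
        splits[0] += channels - sum(splits)
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise, 'same' padding
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x):
        xs = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(t) for conv, t in zip(self.convs, xs)], dim=1)
```

Since every kernel size is odd and padded by k // 2, the spatial size is preserved, so the layer is a drop-in replacement for a plain depthwise conv.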

I wrote a summary article explaining the MixConv architecture and the MixNet layout. TF and PyTorch code is linked. I’m hoping to make some adjustments to the PyTorch code and leverage it inside of the FastAI framework.

The full paper is here:

I’m hoping to put MixNet to use on a new contract I’m working on and will post the updated code once I’ve tested it out.
Best regards,


Thank you very much @LessW2020 for sharing a very clear and concise summary of the article. Looking forward to your fastai implementation.


Thanks @farid!
I’m running MixNet-M as we speak on ImageWoof and getting a feel for it :slight_smile: Should have some updates soon.


So on a quick run with RangerQH and ImageWoof-256 for 20 epochs, MixNet-L got pretty close to our leaderboard record (note that I did swap in the Mish activation, which seemed to help).
Res2Net-50+ did not do as well under the same ‘default’ settings.

On lower resolution (128) the difference was not as great and Res2Net had a slight edge.

These are only single runs, not the desired 5 runs x 20 epochs, but MixNet looks pretty strong considering it got close to the leaderboard (2.4% off) with no tuning at all - just plug in RangerQH with defaults and go.

Here are the runs for comparison:

More testing is needed, plus some learning rate calibration, etc., but at least I can say MixNet looks pretty good and is certainly worth further investigation. It seems to train pretty steadily.

Best regards,


Great summary post!

I wonder how this compares to the EfficientNet network?
(They cite it but don’t show a direct comparison with it.)


I don’t think that is a coincidence :’). Looking at both papers:


MixNet-M:

  • 5.0M params
  • 360M FLOPS
  • top-1: 77.0%

EfficientNet-B0:

  • 5.3M params
  • 390M FLOPS
  • top-1: 77.3%

They are extremely close to each other. Comparing MixNet-L vs EfficientNet-B1 (closest one) shows similar results. In both cases, EfficientNet has slightly more parameters but also slightly higher accuracy, both in top-1 and top-5 accuracies.


Thanks for putting this together!

On some kaggle discussion it was pointed out that EfficientNet was not so fast in PyTorch due to some unoptimized operations. Maybe this is fixed now, but if not, MixNet could avoid the problem since it relies on multiple standard conv2d ops (unless there are drawbacks to running multiple conv2ds with fewer filters each vs. all filters in one conv2d).

If somebody has (preliminary) results I would be very interested to hear about them! :slight_smile:

This is the kaggle discussion thread, but it was about the memory efficiency of the swish activation function: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/111292
So it doesn’t seem really connected to the conv2d operations.
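For anyone curious, the memory issue with swish-style activations comes from autograd keeping intermediate tensors around; the usual workaround is a custom autograd function that saves only the input and recomputes the sigmoid in the backward pass. A rough sketch of that trick (not EfficientNet’s exact implementation):

```python
import torch

class MemoryEfficientSwish(torch.autograd.Function):
    """Swish (x * sigmoid(x)) that saves only the input tensor for backward,
    recomputing sigmoid there instead of storing extra activations."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        s = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * (s * (1 + x * (1 - s)))
```

You call it as MemoryEfficientSwish.apply(x); the trade-off is one extra sigmoid evaluation in the backward pass in exchange for a smaller activation footprint.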


I’m going to try to run MixNet and EfficientNet-B5 today against ImageWoof and see.
Offhand I felt MixNet ran faster, but let me prove that with a direct test :slight_smile:


Here are the results of a very fast and basic test - MXResNet and MixNet were the two best performers.

Note - MXResNet likely uses a much larger number of params.

Note that swapping in the Mish activation made a big improvement for Scarlet. Scarlet-A is certainly the smallest model of them all - roughly 5x smaller, so it’s not a fair comparison - but the scenario being tested here assumes a server GPU, not a mobile device.

To put Scarlet on a bit more equal footing, I ran it for 24 epochs to match the run time EfficientNet-B5 was given, at which point Scarlet beats EfficientNet for 12 minutes of training (71.2 vs 61.4).


Very nice! :smiley:

Is the MXResNet a MixConv-ResNet (with the fastai bag of tricks)?

Do you have a link to the Scarlet-A architecture, or did you design it yourself?


Hi @MicPie -
The MXResNet is the fastai2 XResNet, but with the Mish activation, Seb’s self-attention layer, and a change to the initial receptive fields.
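For reference, Mish is just x * tanh(softplus(x)), so a minimal version plus a generic helper for swapping it into an existing model looks roughly like this (the swap_activations helper is an illustrative sketch, not the actual MXResNet code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_activations(model, old=nn.ReLU, new=Mish):
    """Recursively replace `old` activation modules with `new`, in-place."""
    for name, child in model.named_children():
        if isinstance(child, old):
            setattr(model, name, new())
        else:
            swap_activations(child, old, new)
```

So swap_activations(model) turns every nn.ReLU in a model into Mish - the same kind of drop-in swap as replacing ReLU6 with Mish in Scarlet.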

Scarlet is a NAS-designed architecture from Xiaomi - they claim it is more efficient than EfficientNet.
I just swapped in the Mish activation instead of their ReLU6. Here’s their GitHub, and I can post my modified version as well:

Hope that helps!


Looking forward to your repository.

Do you have a post on the modifications you made for MXResNet? I would like to see how it is done.