Meet Mish: New Activation function, possible successor to ReLU?

fgfm · August 29, 2019, 8:08pm

Congrats on those results @LessW2020 @muellerzr !

Hopefully I have something that gets you slightly better results. If you are using Lookahead (even the combined version), there is a decision to make right before evaluation:

At the end of an epoch, most likely nb_batches % k != 0 (k being the synchronization period), which means you are evaluating your model on the fast weights (before the next synchronization).
The gain might be slim, but there are two choices right before evaluation: copy the slow weights to the fast weights (walking a few steps back), or perform a synchronization even though you haven’t yet performed k fast steps since the last sync.
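
To make the two options concrete, here is a minimal sketch on toy numbers in plain Python. The `sync` helper, the weight values, and the epoch length are all made up for illustration and are not the code from my commit; the sketch only assumes the usual Lookahead update, slow <- slow + rate * (fast - slow), followed by fast <- slow.

```python
def sync(slow, fast, sync_rate=0.0):
    """Hypothetical helper: update slow weights, then reset fast weights to them.

    sync_rate=0.0 leaves the slow weights untouched (fast <- slow, walking back);
    sync_rate>0.0 interpolates towards the fast weights (early synchronization).
    """
    new_slow = [s + sync_rate * (f - s) for s, f in zip(slow, fast)]
    # Return the updated slow weights and a copy of them as the new fast weights
    return new_slow, list(new_slow)


nb_batches, k = 100, 6      # hypothetical epoch length and sync period
pending = nb_batches % k    # fast steps taken since the last synchronization

slow = [1.0, -2.0]          # toy slow weights at the last synchronization
fast = [1.5, -1.0]          # toy fast weights after the pending steps

# Choice 1: discard the pending fast steps and evaluate on the slow weights
walk_back = sync(slow, fast)
# Choice 2: synchronize early with an interpolation rate of 0.5
early = sync(slow, fast, sync_rate=0.5)

print(pending, walk_back, early)
```

With these numbers, 4 fast steps are pending at the end of the epoch. Choice 1 evaluates on [1.0, -2.0], while choice 2 evaluates on [1.25, -1.5], halfway between the slow and fast weights.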

I’m still investigating which option gives the best results, but at least it’s better to have the choice. The method I implemented in the commit can be used as follows:

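from torch.optim import Adam
# Lookahead is the wrapper class from the commit (its import is not shown here)

# model_params stands for your model's parameters, e.g. model.parameters()
optimizer = Adam(model_params)
optimizer = Lookahead(optimizer, sync_rate=0.5, sync_period=6)

for _ in range(nb_epochs):
    # Train here
    # Without a sync_rate argument, sync_params() sets model params <- slow params;
    # optimizer.sync_params(0.5) would force an early synchronization instead
    optimizer.sync_params()
    # Evaluate here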

Hope this helps!
Cheers