Meet Mish: New Activation function, possible successor to ReLU?

Congrats on those results @LessW2020 @muellerzr!

I may have something that could get you slightly better results. If you are using Lookahead (even the combined version), there is a decision to be made right before evaluation:

  • At the end of an epoch, most likely nb_batches % k != 0, which means you are evaluating your model on the fast weights (before the next synchronization).

  • The difference might be slim but positive, as there are two choices right before evaluation: copy the slow weights over the fast weights (walking a few steps back), or perform the synchronization even though you have not yet taken k fast steps since the last sync (see the sketch just below).
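
To make the two options concrete, here is a rough sketch with plain tensors (names like slow_params, alpha and pre_eval_sync are just illustrative, not the actual API from the commit):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model
alpha = 0.5  # interpolation factor used at each synchronization
# Lookahead keeps a second, "slow" copy of the weights,
# updated every k fast steps via slow <- slow + alpha * (fast - slow)
slow_params = [p.detach().clone() for p in model.parameters()]

def pre_eval_sync(model, slow_params, alpha, rollback=True):
    # Handle the leftover fast steps right before evaluation
    with torch.no_grad():
        for fast, slow in zip(model.parameters(), slow_params):
            if rollback:
                # Option 1: walk back and evaluate on the slow weights
                fast.copy_(slow)
            else:
                # Option 2: synchronize early, even though fewer than
                # k fast steps were taken since the last sync
                slow.add_(fast - slow, alpha=alpha)
                fast.copy_(slow)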

I’m still investigating which option gives the best results, but at least it’s better to have the choice. You can find the method I implemented in this commit; it can be used as follows:

from torch.optim import Adam
# Lookahead here is the wrapper from the commit mentioned above
optimizer = Adam(model_params)
optimizer = Lookahead(optimizer, sync_rate=0.5, sync_period=6)

for _ in range(nb_epochs):
    # Train here
    # Calling sync_params() without a sync_rate copies the slow params
    # into the model params; optimizer.sync_params(0.5) would instead
    # force an early synchronization
    optimizer.sync_params()
    # Evaluate here

Hope this helps!
Cheers
