Lesson 18 official topic

Two new optimisers were recently published: Lion (Chen et al. 2023) and D-Adaptation (Defazio 2023). Both need a few more epochs to get good results, but they are very competitive with AdamW.

I had a deeper look at Lion; it is simpler, faster, and keeps less state than Adam or DAdaptAdam.

It exposes a somewhat hidden fact: when things go well (gradients keep pointing the same way), Adam updates each parameter by roughly the learning rate, ignoring the gradient's scale; Lion just makes this explicit by using only the sign of its update direction.
Have a look at how simple it is (the code updates only one parameter, for simplicity):

import numpy as np

def sgd(lr): # plain SGD, for comparison with lion
    def sgd_step(w, g):
        return w - lr * g
    return sgd_step

def lion(lr=0.1, b1=0.9, b2=0.99):
    exp_avg = 0.0 # state shared between multiple calls to lion_step
    def lion_step(w, g):
        nonlocal exp_avg
        sign = np.sign(exp_avg * b1 + g * (1 - b1)) # +1 or -1 (0 only if the argument is exactly zero)
        exp_avg = exp_avg * b2 + (1 - b2) * g # update the exponential moving average of gradients
        return w - lr * sign # the step size is just lr, independent of the gradient's magnitude
    return lion_step
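
To sanity-check the claim about Adam's step size, here is a small sketch of my own (not from the lesson or the notebook): a bare-bones Adam step written in the same closure style, then a loop that feeds each optimiser a constant gradient and prints the average step size. Whether the gradient is 50 or 0.5, Adam and Lion both move the parameter by roughly lr per update, while SGD's step scales with the gradient.

# Toy check of the claim above (my own sketch): Adam and Lion both take ~lr-sized steps.
import numpy as np

def adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = 0.0, 0.0, 0 # EMA of gradients, EMA of squared gradients, step counter
    def adam_step(w, g):
        nonlocal m, v, t
        t += 1
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t) # bias correction
        return w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return adam_step

for g in (50.0, 0.5): # one huge gradient, one tiny gradient
    for name, make in (("sgd", sgd), ("adam", adam), ("lion", lion)):
        step, w = make(lr=0.1), 0.0
        for _ in range(10): # ten updates with the same constant gradient
            w = step(w, g)
        print(f"{name:4s} g={g:<5} avg step = {-w / 10:.4f}")
# sgd's step scales with the gradient (5.0 vs 0.05),
# while adam and lion both move by ~lr = 0.1 in every case.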

@Mkardas made a nice notebook exploring how those optimisers behave with a single variable; I will share it here once we get it polished.
