Any ideas how to translate the optimizer schedule in “Attention Is All You Need” to fastai?
Here is the definition from the paper:
5.3 Optimizer
We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning rate over the course of training, according to the formula:
lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
Not sure how to take that formula and convert it into the appropriate call to schedule_hp.
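For what it's worth, the formula itself is easy enough to write as a plain Python function (the names noam_lr, d_model, and warmup_steps are just my own; d_model=512 is the base-model value from the paper). What I'm still unsure about is how to hook it into schedule_hp — the only fallback I can think of is a custom callback that sets the optimizer's lr from this function each batch, but maybe there's a cleaner way:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """LR schedule from 'Attention Is All You Need', section 5.3.

    Rises linearly for the first `warmup_steps` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # guard against step 0, which would divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peak value is reached at step == warmup_steps (~7e-4 for d_model=512):
peak_lr = noam_lr(4000)
```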