Any ideas how to translate the optimizer schedule in “Attention Is All You Need” to fastai?
Here is the definition from the paper:
5.3 Optimizer
We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning rate over the course of training, according to the formula:
lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
Not sure how to take that formula and convert it into the appropriate call to schedule_hp.
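For what it's worth, the formula itself is easy enough to write as a plain Python function (the names noam_lr, d_model, and warmup_steps are just my own; d_model=512 is the base-model value from the paper). What I'm still unsure about is how to hook it into schedule_hp — the only fallback I can think of is a custom callback that sets the optimizer's lr from this function each batch, but maybe there's a cleaner way:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """LR schedule from 'Attention Is All You Need', section 5.3.

    Rises linearly for the first `warmup_steps` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # guard against step 0, which would divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peak value is reached at step == warmup_steps (~7e-4 for d_model=512):
peak_lr = noam_lr(4000)
```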