I am confused about which optimizer does fastai uses actually by default. In the learner default arguments it does say Adam, but I am confused then how fastai uses cyclic momentum. Also, if fastai does use Adam, then can anyone comment on the performance of Nadam.
Can you also discuss a subtle point. In the paper of Adam, when we update the weights we use normalized values of exp_avg and exp_avg_sq (i.e. bias_corrected). But in the pytorch implementation of Adam, they did not use the bias_corrected version when updating the weights.
In the paper, they did say, the bias correction was necessary as exp_avg and exp_avg_sq are initialized as 0, which is True in case of PyTorch also.
The code I am referring to
bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
p.data.addcdiv_(-step_size, exp_avg, denom)
# PyTorch does not use exp_avg = exp_avg/bias_correction1 in p.data.addcdiv_
I am running some benchmarks to see how much difference that makes.
In all the cases, I got much worse scores when using bias_correction. Why is this happening?
The only reason that I can come up with is that it actually makes the learning very slow. As in the weight update equation, the step update size is being divided by 10 when the training begins
(beta1=0.9, beta2=0.999, so
1-beta1 and 1-beta2 = 0.1, 0.0001 and
when we can take them to numerator we get 10, 10000 times exp_grad and exp_grad_sq.
Finally in the update step we take sqrt of 10000 and divide 10/sqrt(10000) = 1/10 of the original gradient update)
As the number of time steps increase the 1/10 factor get’s closer to the original factor, but the cumulative 1/x of the gradient update’s adds on and we have very slow training.
I haven’t tried with cyclic learning for this as of now.
When we simplify the equations for the param update we do get alpha*srqt(beta2)/beta1 which is the step_size and we do not have to do the m_t_hat and y_t_hat steps.
Spent complete evening, for this and now I understand why my experiments were failing. But now I can write any optimizer function. Thanks, @jeremy for helping along the way.